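【Question title】: Fill consecutive NaNs with cumsum, to increment by one on each consecutive NaN
【Posted】: 2018-12-27 10:46:42
【Question description】:
Given a dataframe with many missing values in one column, I want an output dataframe in which every run of consecutive NaNs is filled with a cumulative sum that starts from the preceding valid value and adds 1 for each NaN.
Given: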
shop_id calendar_date quantity
0 2018-12-12 1
1 2018-12-13 NaN
2 2018-12-14 NaN
3 2018-12-15 NaN
4 2018-12-16 1
5 2018-12-17 NaN
Expected output:
shop_id calendar_date quantity
0 2018-12-12 1
1 2018-12-13 2
2 2018-12-14 3
3 2018-12-15 4
4 2018-12-16 1
5 2018-12-17 2
【Discussion】:
Tags:
pandas
dataframe
missing-data
cumulative-sum
【Solution 1】:
Use:
g = (~df.quantity.isnull()).cumsum()
df['quantity'] = df.fillna(1).groupby(g).quantity.cumsum()
shop_id calendar_date quantity
0 0 2018-12-12 1.0
1 1 2018-12-13 2.0
2 2 2018-12-14 3.0
3 3 2018-12-15 4.0
4 4 2018-12-16 1.0
5 5 2018-12-17 2.0
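For anyone who wants to run this end to end, here is a minimal self-contained sketch. The frame is rebuilt by hand from the sample in the question, so the construction code is an assumption rather than the asker's actual data source. It also fills only the quantity column instead of the whole frame, which is a small variation on the two lines above.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question (construction is an assumption)
df = pd.DataFrame({
    "shop_id": range(6),
    "calendar_date": pd.date_range("2018-12-12", periods=6),
    "quantity": [1, np.nan, np.nan, np.nan, 1, np.nan],
})

# Each valid value opens a new group; the NaNs that follow share its group id
g = df["quantity"].notna().cumsum()

# Fill only the quantity column with 1, then accumulate within each group
df["quantity"] = df["quantity"].fillna(1).groupby(g).cumsum()

print(df)
```

Filling just the one column avoids putting 1 into NaNs that may exist in other columns.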
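Details
Use .isnull() to find where quantity has valid values, and take the cumsum of the negated boolean series. Each valid value then starts a new group: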
g = (~df.quantity.isnull()).cumsum()
0 1
1 1
2 1
3 1
4 2
5 2
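Use fillna(1) so that, when you group by g and take the cumsum, the values in each group increase by one starting from whatever the valid value is: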
df.fillna(1).groupby(g).quantity.cumsum()
0 1.0
1 2.0
2 3.0
3 4.0
4 1.0
5 2.0
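To see the "starting from whatever the value is" behaviour, here is a small sketch on hypothetical data where the first valid value is 3 rather than 1. It also compares the result with an equivalent formulation (forward-fill plus the position within each group), which is offered only as a cross-check, not as part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical series: the first run starts at 3, the second at 1
s = pd.Series([3, np.nan, np.nan, 1, np.nan], name="quantity")

# Group id: increments at every valid value
g = s.notna().cumsum()

# The answer's approach: NaN -> 1, then cumulative sum per group
ref = s.fillna(1).groupby(g).cumsum()

# Cross-check: last valid value plus the 0-based position within the group
alt = s.ffill() + s.groupby(g).cumcount()

print(pd.DataFrame({"group": g, "cumsum": ref, "ffill+cumcount": alt}))
```

The two differ only when the series begins with NaN: the cumsum version counts those rows from 1, while the forward-fill version leaves them as NaN.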
【Solution 2】:
Another approach?
Data
shop_id calender_date quantity
0 0 2018-12-12 1.0
1 1 2018-12-13 NaN
2 2 2018-12-14 NaN
3 3 2018-12-15 NaN
4 4 2018-12-16 1.0
5 5 2018-12-17 NaN
6 6 2018-12-18 NaN
7 7 2018-12-17 NaN
Using np.where
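import numpy as np

# Row positions holding a valid quantity (the sample data uses 1 for these)
where = np.where(data['quantity'] >= 1)[0]
r = []
for i in range(len(where)):
    try:
        # Count 1..n up to the row before the next valid value
        r.extend(np.arange(1, where[i + 1] - where[i] + 1))
    except IndexError:
        # Last valid value: count through to the end of the frame
        r.extend(np.arange(1, len(data) - where[i] + 1))
data['quantity'] = r
print(data)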
shop_id calender_date quantity
0 0 2018-12-12 1
1 1 2018-12-13 2
2 2 2018-12-14 3
3 3 2018-12-15 4
4 4 2018-12-16 1
5 5 2018-12-17 2
6 6 2018-12-18 3
7 7 2018-12-17 4
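The loop above can also be written without Python-level iteration. The following is a rough sketch, not the answerer's code: it uses np.maximum.accumulate to find the position of the most recent valid row. Like the loop, it restarts the count at 1 on every valid row, so it matches the expected output only when the valid values are themselves 1. The frame is rebuilt by hand and the date column is left out for brevity.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt by hand (date column omitted)
data = pd.DataFrame({
    "shop_id": range(8),
    "quantity": [1, np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan],
})

valid = data["quantity"].notna().to_numpy()
pos = np.arange(len(data))

# Position of the most recent valid row at or before each row
# (rows before the first valid value fall back to position 0)
last_valid = np.maximum.accumulate(np.where(valid, pos, 0))

# Distance from that row, counted from 1
data["quantity"] = pos - last_valid + 1

print(data)
```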