Find the cumulative sum of certain values in Python pandas (answers)

【Question title】: Find the cumulative sum of certain values in Python pandas
【Posted】: 2017-12-15 16:33:37
【Question】:

I have a DataFrame like this:

timestamp             variance
2017-07-10 20:42:42   0
2017-07-10 20:42:42   1
2017-07-10 20:42:42   2
2017-07-10 20:42:43   6
2017-07-10 20:42:43   7
2017-07-10 20:42:43   9
2017-07-10 20:42:43   3
2017-07-10 20:42:43   4
2017-07-10 20:42:43   5
2017-07-10 20:42:43   1
2017-07-10 20:42:43   4
2017-07-10 20:42:43   1
2017-07-10 20:42:43   3
2017-07-10 20:42:43   7
2017-07-10 20:42:43   9

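I want to add a new column that goes up by 1 for every row where variance is greater than or equal to 5. When the value is below 5, the count should go down by 1. If the count reaches 0, it should stay at 0.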

It should look like this:

timestamp             variance  cumvar
2017-07-10 20:42:42   0         0
2017-07-10 20:42:42   1         0
2017-07-10 20:42:42   2         0
2017-07-10 20:42:43   6         1
2017-07-10 20:42:43   7         2
2017-07-10 20:42:43   9         3
2017-07-10 20:42:43   3         2
2017-07-10 20:42:43   4         1
2017-07-10 20:42:43   5         2
2017-07-10 20:42:43   1         1
2017-07-10 20:42:43   4         0
2017-07-10 20:42:43   1         0
2017-07-10 20:42:43   3         0
2017-07-10 20:42:43   7         1
2017-07-10 20:42:43   9         2

The closest I have gotten is:

df['cumvar'] = np.where((df['variance'] > 5), 1, -1).cumsum()

Of course, this does not enforce a minimum of 0 on the cumulative sum. How can I adjust it to achieve the result above?
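
To make the problem concrete, here is a small reproduction of the attempt. It is only a sketch: the DataFrame is rebuilt from the variance column of the table above (the timestamps are left out because they play no role), and the name `naive` is introduced purely for illustration.

```python
import numpy as np
import pandas as pd

# Variance values copied from the sample table; timestamps omitted.
df = pd.DataFrame({"variance": [0, 1, 2, 6, 7, 9, 3, 4, 5, 1, 4, 1, 3, 7, 9]})

# Plain running sum of +1 / -1 steps, with nothing holding it at 0.
naive = np.where(df["variance"] > 5, 1, -1).cumsum()
print(naive.tolist())
```

The printed sequence starts at -1 and keeps sinking whenever low values dominate, which is exactly the behaviour the floor at 0 is meant to prevent.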

【Question comments】:

Perhaps scipy.signal.lfilter could be used recursively; see the posts here and here.

Tags: python pandas

【Solution 1】:

I would try a different approach. I would iterate over df['variance'].values, build a list, and then add it to the DataFrame as a new column:

x = 0
cumvar = []
for val in df['variance'].values:
    # step up when variance >= 5, otherwise step down, never going below 0
    x = max(x + 1 if val >= 5 else x - 1, 0)
    cumvar.append(x)
# assign the list as a new column; it lines up with the rows by position
df['cumvar'] = cumvar
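
The loop can also be wrapped in a small function so the threshold and the floor are explicit. This is a sketch rather than part of the original answer: `clamped_count`, `threshold`, and `floor` are hypothetical names, and the DataFrame is rebuilt from the question's variance column.

```python
import pandas as pd

def clamped_count(values, threshold=5, floor=0):
    """Step +1 for values >= threshold, else -1, never dropping below floor."""
    count = floor
    out = []
    for val in values:
        count = max(count + (1 if val >= threshold else -1), floor)
        out.append(count)
    return out

# Variance values copied from the question; timestamps omitted.
df = pd.DataFrame({"variance": [0, 1, 2, 6, 7, 9, 3, 4, 5, 1, 4, 1, 3, 7, 9]})
df["cumvar"] = clamped_count(df["variance"].to_numpy())
print(df["cumvar"].tolist())
# [0, 0, 0, 1, 2, 3, 2, 1, 2, 1, 0, 0, 0, 1, 2]
```

The result matches the cumvar column requested in the question.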

【Comments】:

【Solution 2】:

This may not be the most elegant way, but it works:

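def cum_sum_limited(val, threshold=5, min_sum=0):
    # the running total lives in a module-level variable
    global tot
    # step up when val >= threshold, otherwise step down
    tot += 1 if val >= threshold else -1
    # clamp the running total at min_sum
    tot = max(tot, min_sum)
    return tot

# reset the total before every apply call
tot = 0
df['cumvar'] = df.variance.apply(cum_sum_limited)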

Let me know what you think.
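
If the global variable is a concern, the same update rule can be fed to `itertools.accumulate`, which carries the running total for you. This is a sketch and not part of the original answer: `step` is a hypothetical helper, and the `initial` argument assumes Python 3.8 or later.

```python
from itertools import accumulate

import pandas as pd

def step(total, val, threshold=5, min_sum=0):
    """Return the next running total: +1 or -1, clamped at min_sum."""
    return max(total + (1 if val >= threshold else -1), min_sum)

# Variance values copied from the question; timestamps omitted.
df = pd.DataFrame({"variance": [0, 1, 2, 6, 7, 9, 3, 4, 5, 1, 4, 1, 3, 7, 9]})

# initial=0 seeds the running total; the seed itself is dropped with [1:].
running = list(accumulate(df["variance"].tolist(), step, initial=0))[1:]
df["cumvar"] = running
```

Because no state lives outside the call, running it twice on the same data gives the same column without resetting anything.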

【Comments】:

【Solution 3】:

One-liner:

# pd.expanding_apply is the pre-0.18 pandas API; reduce is a builtin only on Python 2
pd.expanding_apply(df['variance'],
                   lambda s: reduce(lambda x, y: max(x + (1 if y >= 5 else -1), 0), s, 0))

Of course, readability is poor =)

You can also do it the way you started:

pd.expanding_apply(np.where(df['variance'] >= 5, 1, -1), lambda s: reduce(lambda x, y: max(x + y, 0), s, 0))
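
Starting from the same series of +1/-1 steps, the floor at 0 can also be applied without any Python-level loop. The sketch below is not part of the original answer; it swaps in a running-minimum correction: take the plain cumulative sum, then subtract the lowest value it has reached so far (capped at 0). It assumes the count starts at 0 and the floor is 0, and the names `steps`, `raw`, and `lowest` are illustrative.

```python
import numpy as np
import pandas as pd

# Variance values copied from the question; timestamps omitted.
df = pd.DataFrame({"variance": [0, 1, 2, 6, 7, 9, 3, 4, 5, 1, 4, 1, 3, 7, 9]})

steps = np.where(df["variance"] >= 5, 1, -1)
raw = steps.cumsum()

# Lowest point the unclamped sum has reached so far, never above 0.
lowest = np.minimum.accumulate(np.minimum(raw, 0))

# Shifting by that amount is equivalent to clamping at 0 after every step.
df["cumvar"] = raw - lowest
print(df["cumvar"].tolist())
# [0, 0, 0, 1, 2, 3, 2, 1, 2, 1, 0, 0, 0, 1, 2]
```

Every operation here is a single pass over the array, so the cost grows linearly with the number of rows.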

Extracting the reduce function improves readability:

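def tricky_func(acc, y):
    # step up when y >= 5, otherwise step down
    next_value = 1 if y >= 5 else -1
    # never let the running count drop below 0
    return max(acc + next_value, 0)

# the initial value 0 keeps the first element from being used as the starting count
pd.expanding_apply(df['variance'], lambda s: reduce(tricky_func, s, 0))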

Edit:

If you are using Python 3, you need to import reduce from functools first.

If you are using pandas 0.18 or later, you should use the .expanding().apply notation instead (thanks to Brad Solomon):

df['variance'].expanding().apply(lambda s: reduce(tricky_func, s, 0))
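
Put together for Python 3 and a recent pandas, the expanding approach might look like the sketch below. Some details are assumptions rather than part of the answer: the DataFrame is rebuilt from the question's variance column, `raw=True` is passed so each window arrives as a NumPy array, and the result is cast back to int because `apply` returns floats.

```python
from functools import reduce

import pandas as pd

def tricky_func(acc, y):
    # Step up when y >= 5, otherwise step down, never going below 0.
    next_value = 1 if y >= 5 else -1
    return max(acc + next_value, 0)

# Variance values copied from the question; timestamps omitted.
df = pd.DataFrame({"variance": [0, 1, 2, 6, 7, 9, 3, 4, 5, 1, 4, 1, 3, 7, 9]})

df["cumvar"] = (
    df["variance"]
    .expanding()
    # Each window holds all rows up to the current one; 0 seeds the count.
    .apply(lambda s: reduce(tricky_func, s, 0), raw=True)
    .astype(int)
)
```

Each window replays the count from the first row, so the total work grows quadratically with the number of rows; that is fine for small frames but slow for large ones.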

【Comments】:

Nice answer, thanks (PS: I was also thinking of the reduce function). +1
Nice answer. You may want to specify from functools import reduce for 3.x, and expanding_apply is deprecated in favor of .expanding().apply. (New API.)