Partial sums over series in pandas: answers

【Question title】: Partial sums over series in pandas
【Posted】: 2018-11-08 14:00:53
【Question】:

I have a DataFrame that looks like this:

       A      B
0     1.2     1
1     1.2     6
2     1.2     4
3     2.3     2
4     2.3     5
5     1.2     7

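I want the partial sums of B over groups that share the same A value, but only when those rows are adjacent to each other. For this example, I expect another DataFrame like this: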

0    1.2    11
3    2.3    7
5    1.2    7

I have a feeling I can use .groupby, but I can only get it to group regardless of whether the rows with the same A are adjacent.
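To make the problem concrete, here is a minimal sketch of what a plain groupby on A does with the sample data. That this matches the asker's actual attempt is an assumption; the name `naive` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.2, 1.2, 1.2, 2.3, 2.3, 1.2],
    "B": [1, 6, 4, 2, 5, 7],
})

# Grouping on A alone pools every row with the same value,
# so the trailing 1.2 row (index 5) is merged into the first run.
naive = df.groupby("A", as_index=False)["B"].sum()
print(naive)
```

The 1.2 group sums to 18 instead of producing separate totals of 11 and 7, which is why a key that changes at every run boundary is needed.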

【Question comments】:

Tags: python pandas group-by sum

【Solution 1】:

Use groupby with a helper Series, aggregating with first and sum:

# Label each run of consecutive equal A values, then aggregate per run
df = df.groupby(df.A.ne(df.A.shift()).cumsum(), as_index=False).agg({'A':'first','B':'sum'})
print(df)
     A   B
0  1.2  11
1  2.3   7
2  1.2   7
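The result above is renumbered 0, 1, 2, while the question's expected output keeps the original row labels 0, 3, 5. If those labels matter, one possible variation (a sketch, not part of the original answer; `run_id` and `out` are illustrative names) is to carry the index through the aggregation:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.2, 1.2, 1.2, 2.3, 2.3, 1.2],
    "B": [1, 6, 4, 2, 5, 7],
})

# Helper Series: increases by one each time A changes from the previous row
run_id = df["A"].ne(df["A"].shift()).cumsum().rename("run_id")

out = (
    df.reset_index()                      # expose the row labels as a column named "index"
      .groupby(run_id)
      .agg({"index": "first", "A": "first", "B": "sum"})
      .set_index("index")                 # restore the first label of each run
      .rename_axis(None)
)
print(out)
```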

Details:

Compare column A with its shifted version using ne (!=), then apply cumsum to get a Series that labels each consecutive group:

print (df.A.ne(df.A.shift()).cumsum())
0    1
1    1
2    1
3    2
4    2
5    3
Name: A, dtype: int32
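The chained expression can be unpacked into its intermediate steps. The following sketch rebuilds the sample data and uses illustrative variable names (`shifted`, `is_new_run`, `run_id`) that do not appear in the original answer:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1.2, 1.2, 1.2, 2.3, 2.3, 1.2],
    "B": [1, 6, 4, 2, 5, 7],
})

shifted = df["A"].shift()         # previous row's A; NaN for the first row
is_new_run = df["A"].ne(shifted)  # True where A differs from the previous row
run_id = is_new_run.cumsum()      # counting the True values gives one label per run

steps = pd.DataFrame({
    "A": df["A"],
    "shifted": shifted,
    "is_new_run": is_new_run,
    "run_id": run_id,
})
print(steps)
```

The first row is always flagged as a new run because comparing any value with NaN via ne yields True, so the labels start at 1.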

Thanks to @user2285236 for the comment:

Checking for equality when the dtype is float can lead to unwanted results. np.isclose might be a better choice here.

import numpy as np
# Start a new group wherever A is not close to the previous row's A
df = df.groupby(np.cumsum(~np.isclose(df.A, df.A.shift())), as_index=False).agg({'A':'first','B':'sum'})
print(df)
     A   B
0  1.2  11
1  2.3   7
2  1.2   7

print (np.cumsum(~np.isclose(df.A, df.A.shift())))
[1 1 1 2 2 3]
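With the sample data both variants give the same labels, so the difference is easy to miss. A small sketch with made-up values (hypothetical data, not from the question) shows where exact comparison and np.isclose diverge:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 0.1 + 0.2 is not exactly 0.3 in binary floating point
df = pd.DataFrame({"A": [0.3, 0.1 + 0.2, 0.5], "B": [1, 2, 3]})

a = df["A"].to_numpy()
prev = df["A"].shift().to_numpy()

exact_id = df["A"].ne(df["A"].shift()).cumsum()  # exact comparison splits rows 0 and 1
close_id = np.cumsum(~np.isclose(a, prev))       # tolerance-based comparison keeps them together

print(exact_id.tolist())
print(close_id.tolist())

result = df.groupby(close_id, as_index=False).agg({"A": "first", "B": "sum"})
print(result)
```

np.isclose uses its default tolerances here (rtol=1e-05, atol=1e-08); whether those suit a given dataset depends on the scale of the values in A.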

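【Comments】:

Checking for equality when the dtype is float can lead to unwanted results. np.isclose might be a better choice here.
Couldn't you avoid np.isclose by converting df.A to categorical?
@jpp - In my opinion, it should work well if column A has a small number of unique values.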

【Solution 2】:

`itertools.groupby`

This runs into the same exact-float-comparison issue that @user2285236 pointed out.

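from itertools import groupby
import pandas as pd

# itertools.groupby only merges consecutive rows whose key (A) compares equal
g = groupby(df.itertuples(index=False), key=lambda x: x.A)
pd.DataFrame(
    [[a, sum(t.B for t in b)] for a, b in g],
    columns=df.columns
)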

     A   B
0  1.2  11
1  2.3   7
2  1.2   7
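The reason this works without a helper Series is that itertools.groupby starts a new group every time the key changes, rather than pooling equal keys across the whole input. A minimal sketch on a plain list (the names `values` and `runs` are illustrative):

```python
from itertools import groupby

values = [1.2, 1.2, 1.2, 2.3, 2.3, 1.2]

# Each (key, group) pair covers one run of consecutive equal items
runs = [(key, len(list(group))) for key, group in groupby(values)]
print(runs)  # [(1.2, 3), (2.3, 2), (1.2, 1)]
```

The final 1.2 forms its own run of length 1, which matches the adjacency requirement in the question.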

【Comments】: