Sum duplicate subset rows conditional on a column of a DataFrame: answers

【Question title】: Sum duplicate subset rows conditional on a column of a DataFrame
【Posted】: 2021-09-12 15:53:37
【Question description】:

We can sum the rows that are duplicated on a subset of columns:

import pandas as pd

df = pd.DataFrame({"source": [1, 1, 3, 1, 1],
                   "target": [2, 2, 5, 3, 3],
                   "value": [0.5, 1.0, 1.51, 0.2, 0.5]})
print(df)
print(df.groupby(['source', 'target'], as_index=False)["value"].sum())

   source  target  value
0       1       2   0.50
1       1       2   1.00
2       3       5   1.51
3       1       3   0.20
4       1       3   0.50

   source  target  value
0       1       2   1.50
1       1       3   0.70
2       3       5   1.51

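How can the summing be made conditional, for example so that only the duplicated rows with target 2 are summed?

The output should look like this: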

   source  target  value
0       1       2   1.50
2       3       5   1.51
3       1       3   0.20
4       1       3   0.50

Edit: the other duplicated rows can be dropped later (df.drop_duplicates(subset=["source","target"])).

【Question discussion】:

Tags: python pandas

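【Solution 1】:

Since you say the other duplicated rows can be dropped later, here is one approach: first drop the duplicated rows whose target is not 2, then group and aggregate the rows that remain.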

c = ['source', 'target']
df[~df.duplicated(c) | df['target'].eq(2)].groupby(c, as_index=False).sum()

   source  target  value
0       1       2   1.50
1       1       3   0.20
2       3       5   1.51

【Discussion】:

This is not the output the OP asked for: the second (1, 3) row is dropped instead of being kept.
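To make the gap between this result and the requested output concrete, here is a minimal sketch that types in the frame from the question by hand (the names `wanted`, `deduped` and `same_rows` are illustrative, not from the original post), applies the `drop_duplicates` step the OP mentions in the edit, and compares the rows with the Solution 1 result after sorting.

```python
import pandas as pd

df = pd.DataFrame({"source": [1, 1, 3, 1, 1],
                   "target": [2, 2, 5, 3, 3],
                   "value": [0.5, 1.0, 1.51, 0.2, 0.5]})
c = ["source", "target"]

# Solution 1 as posted: keep first occurrences plus every target == 2 row, then sum
solution_1 = df[~df.duplicated(c) | df["target"].eq(2)].groupby(c, as_index=False).sum()

# The output requested in the question, entered by hand (assumption: values as shown there)
wanted = pd.DataFrame({"source": [1, 3, 1, 1],
                       "target": [2, 5, 3, 3],
                       "value": [1.5, 1.51, 0.2, 0.5]},
                      index=[0, 2, 3, 4])

# The later clean-up step mentioned in the OP's edit
deduped = wanted.drop_duplicates(subset=c)

# Compare row contents only, ignoring row order and index labels
same_rows = (deduped.sort_values(c).to_dict("records")
             == solution_1.to_dict("records"))
print(same_rows)
```

Under these assumptions the two frames hold the same rows, so Solution 1 appears to match the state after the OP's later clean-up rather than the intermediate output shown in the question.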

【Solution 2】:

One way is to build a boolean mask, group and sum only the rows that match it, and then concat the summed rows with the rows that were left untouched:

# Condition to filter DataFrame with (target == 2)
m = df['target'].eq(2)
new_df = pd.concat([
    # DataFrame rows that meet condition `m`
    df[m].groupby(['source', 'target'], as_index=False)["value"].sum(),
    # DataFrame rows which do not (~) meet the condition
    df[~m]
])

new_df:

   source  target  value
0       1       2   1.50
2       3       5   1.51
3       1       3   0.20
4       1       3   0.50
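The same idea can be wrapped in a small helper so that the condition is passed in as a mask. This is only a sketch: the function name `sum_rows_where` and its parameters are hypothetical, and `ignore_index=True` is an added choice that renumbers the rows instead of keeping the mixed index shown above.

```python
import pandas as pd

def sum_rows_where(df, keys, value_col, mask):
    """Sum `value_col` per `keys` group for rows where `mask` is True;
    rows where `mask` is False pass through unchanged."""
    summed = df[mask].groupby(keys, as_index=False)[value_col].sum()
    # ignore_index=True gives a fresh 0..n-1 index (an assumption, not in the answer)
    return pd.concat([summed, df[~mask]], ignore_index=True)

df = pd.DataFrame({"source": [1, 1, 3, 1, 1],
                   "target": [2, 2, 5, 3, 3],
                   "value": [0.5, 1.0, 1.51, 0.2, 0.5]})

result = sum_rows_where(df, ["source", "target"], "value", df["target"].eq(2))
print(result)
```

If the frame has columns other than the keys and the value column, the summed rows will not carry them and `concat` fills those cells with NaN, so the helper as written fits the three-column frame from the question best.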

【Discussion】:

【Solution 3】:

Alternatively, use groupby and duplicated together with agg, combining sum and first:

df.groupby((~(df.duplicated(['source', 'target'], keep=False) & df['target'].eq(2))).cumsum()).agg({'source': 'first', 'target': 'first', 'value': 'sum'})

Or keep the column names in a list:

cols = ['source', 'target']
df.groupby((~(df.duplicated(cols, keep=False) & df['target'].eq(2))).cumsum()).agg({**dict.fromkeys(cols, 'first'), 'value': 'sum'})

   source  target  value
0       1       2   1.50
1       3       5   1.51
2       1       3   0.20
3       1       3   0.50
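One caveat worth checking before reusing the cumsum key: it depends on row order. The rows to be collapsed do not increment the counter, so they join whichever group the preceding row opened. The sketch below reorders the question's data so that the (3, 5) row comes first, which shows the effect, and then tries an order-independent grouping key. The helper column `_slot` and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Same rows as in the question, but with the (3, 5) row moved to the top
df = pd.DataFrame({"source": [3, 1, 1, 1, 1],
                   "target": [5, 2, 2, 3, 3],
                   "value": [1.51, 0.5, 1.0, 0.2, 0.5]})
cols = ["source", "target"]
collapse = df.duplicated(cols, keep=False) & df["target"].eq(2)

# cumsum key from the answer: here it evaluates to [1, 1, 1, 2, 3],
# so the (3, 5) row is merged with the two (1, 2) rows
fragile = df.groupby((~collapse).cumsum()).agg(
    {**dict.fromkeys(cols, "first"), "value": "sum"})
print(fragile)

# Order-independent variant: rows to collapse share slot -1,
# every other row gets its own position as a unique slot
slot = np.where(collapse, -1, np.arange(len(df)))
robust = (df.assign(_slot=slot)
            .groupby([*cols, "_slot"], sort=False, as_index=False)["value"].sum()
            .drop(columns="_slot"))
print(robust)
```

With the original row order from the question both keys give the same groups, so the difference only matters when the collapsed rows are not at the top of the frame or when two different duplicated groups sit next to each other.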

【Discussion】: