Pandas groupby multiple fields then diff

【Question title】: Pandas groupby multiple fields then diff
【Posted】: 2018-06-29 01:44:03
【Question description】:

My DataFrame looks like this:

         date    site country  score
0  2018-01-01  google      us    100
1  2018-01-01  google      ch     50
2  2018-01-02  google      us     70
3  2018-01-03  google      us     60
4  2018-01-02  google      ch     10
5  2018-01-01      fb      us     50
6  2018-01-02      fb      us     55
7  2018-01-03      fb      us    100
8  2018-01-01      fb      es    100
9  2018-01-02      fb      gb    100

Each site has a different score depending on the country. I am trying to find the 1/3/5-day difference in score for each site/country combination.

The output should be:

          date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

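I first tried sorting by site/country/date and then grouping by site and country, but I could not work out how to get the difference from the grouped object.

【Comments】:

How do I get StringIO for python3? I'm trying to reproduce your problem
@JulianRachman Use io
OK, hold on, I'm trying to reproduce your problem
@Alex @ayhan I have edited the expected output. Essentially, es and gb come before us.
@Craig You can combine df.sort_values(by=['site', 'country', 'date'], ascending=[False, True, True]) with @ayhan's answer

Tags: python pandas dataframe group-by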

【Solution 1】:

First sort the DataFrame; after that, all you need is groupby.diff():

df = df.sort_values(by=['site', 'country', 'date'])

df['diff'] = df.groupby(['site', 'country'])['score'].diff().fillna(0)

df
Out: 
         date    site country  score  diff
8  2018-01-01      fb      es    100   0.0
9  2018-01-02      fb      gb    100   0.0
5  2018-01-01      fb      us     50   0.0
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   0.0
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   0.0
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0
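
For anyone who wants to reproduce this end to end, here is a minimal self-contained sketch. The DataFrame is typed in by hand from the question, and the `diff_1`/`diff_3`/`diff_5` column names are made up for illustration. Note that `periods=n` counts rows within each group, so it only equals an n-day difference if every site/country pair has exactly one row per consecutive day; that is an assumption about the data, not something pandas checks.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-02",
             "2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02"],
    "site": ["google"] * 5 + ["fb"] * 5,
    "country": ["us", "ch", "us", "us", "ch", "us", "us", "us", "es", "gb"],
    "score": [100, 50, 70, 60, 10, 50, 55, 100, 100, 100],
})

keys = ["site", "country"]

# Sort so that rows inside each site/country group are in date order.
df = df.sort_values(by=keys + ["date"])

# Difference to the previous row of the same group; the first row of each
# group has no predecessor, so its NaN is replaced by 0.
df["diff"] = df.groupby(keys)["score"].diff().fillna(0)

# Row-based 1/3/5-step differences (hypothetical column names). Groups with
# fewer than n + 1 rows yield only NaN for that n.
for n in (1, 3, 5):
    df[f"diff_{n}"] = df.groupby(keys)["score"].diff(periods=n)

print(df)
```

With this sample no group has more than three rows, so `diff_3` and `diff_5` come out entirely NaN.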

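sort_values does not support arbitrary orderings. If you need a custom order (for example, google before fb), store the desired order in a collection and convert the column to a categorical with that order. sort_values will then respect the order you provided.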
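
A minimal sketch of the categorical approach, assuming the desired order is google before fb. The small frame and the `site_order` name are made up for illustration.

```python
import pandas as pd

# Tiny illustrative frame (not the full data from the question).
df = pd.DataFrame({
    "site": ["fb", "google", "fb", "google"],
    "score": [50, 100, 55, 70],
})

# Desired custom order: google first, then fb.
site_order = ["google", "fb"]

# An ordered categorical makes sort_values follow the category order
# instead of alphabetical order.
df["site"] = pd.Categorical(df["site"], categories=site_order, ordered=True)

df = df.sort_values(by=["site", "score"])
print(df)
```

If you later group on the categorical column, consider passing observed=True to groupby so that only the category combinations actually present in the data are used.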

【Comments】:

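For whatever reason, the line above kept throwing errors like TypeError: diff() got an unexpected keyword argument 'axis'. This worked, however: df.groupby(['site', 'country'])['score'].transform(pd.Series.diff).fillna(0).
@JohanDettmar You get that exception because you are calling diff() on a Series, which has only one column, rather than on a DataFrame. Series diff() has no axis parameter because there is only one axis.
Why don't we group by date? Does diff recognize dates? I didn't find anything about automatic date detection at pandas.pydata.org/pandas-docs/stable/reference/api/…
@Auss Because we are trying to find the difference between values on different dates. If we also grouped by date, every group would contain a single observation. Instead, each group needs several observations (one per date) so that we can take the difference between the values on those dates.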

【Solution 2】:

You can shift the grouped values and subtract them:

df.sort_values(['site', 'country', 'date'], inplace=True)

df['diff'] = df['score'] - df.groupby(['site', 'country'])['score'].shift()

Result:

         date    site country  score  diff
8  2018-01-01      fb      es    100   NaN
9  2018-01-02      fb      gb    100   NaN
5  2018-01-01      fb      us     50   NaN
6  2018-01-02      fb      us     55   5.0
7  2018-01-03      fb      us    100  45.0
1  2018-01-01  google      ch     50   NaN
4  2018-01-02  google      ch     10 -40.0
0  2018-01-01  google      us    100   NaN
2  2018-01-02  google      us     70 -30.0
3  2018-01-03  google      us     60 -10.0

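To fill the NaN values with 0, use df['diff'] = df['diff'].fillna(0). (Calling fillna(0, inplace=True) on the selected column is unreliable in recent pandas versions, so plain assignment is safer.)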
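
The shift-and-subtract approach can be sketched in a self-contained form as follows. The reduced sample frame and the `prev`, `via_diff`, and `two_back` names are assumptions for illustration. As with diff, shift(n) moves by rows inside each group, not by calendar days.

```python
import pandas as pd

# Reduced sample: two site/country groups taken from the question's data.
df = pd.DataFrame({
    "date": ["2018-01-01", "2018-01-02", "2018-01-03", "2018-01-01", "2018-01-02"],
    "site": ["fb", "fb", "fb", "google", "google"],
    "country": ["us", "us", "us", "ch", "ch"],
    "score": [50, 55, 100, 50, 10],
})

keys = ["site", "country"]
df = df.sort_values(keys + ["date"])

# Previous score within the same group (NaN for each group's first row).
prev = df.groupby(keys)["score"].shift()

# Subtract and fill the missing first-row values with 0.
df["diff"] = (df["score"] - prev).fillna(0)

# The same numbers come out of groupby(...).diff().
via_diff = df.groupby(keys)["score"].diff().fillna(0)

# shift(2) compares against the row two steps back in the group.
two_back = df["score"] - df.groupby(keys)["score"].shift(2)

print(df)
```

Shifting explicitly is handy when you need the previous value itself as well as the difference, for example to compute a percentage change.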

【Comments】: