Get counts by group using pandas [duplicate]: answers

【Question title】: Get counts by group using pandas [duplicate]
【Posted】: 2018-06-03 00:36:00
【Question】:

I have a pandas DataFrame with data that looks like this:

ID  year_month_id   Class
1   201612          A
2   201612          D
3   201612          B
4   201612          Other
5   201612          Other
6   201612          Other
7   201612          A
8   201612          Other
9   201612          A
1   201701          B

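So an ID can belong to any class in a given month, and its class may change the following month. What I want is, for each ID, the number of months it spent under each class, plus the latest class it belongs to. Something like this: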

ID  Class_A Class_B Class_D Other Latest_Class
1   2        3       4         0    B
2   12       0       0         0    D

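How can I achieve this in Python? Can someone help me? Also, since the real dataset is large and cannot be verified by hand, how can I get a list of the IDs that fall under more than one class?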

【Question discussion】:

Tags: python pandas dataframe group-by pandas-groupby

【Solution 1】:

You can get the per-class counts for each ID with groupby + value_counts + unstack:

g = df.groupby('ID')
i = g.Class.value_counts().unstack(fill_value=0)

To get the last class, use groupby + last (this relies on the rows already being in chronological order within each ID):

j = g.Class.last()

Concatenate to get your result:

pd.concat([i, j], axis=1).rename(columns={'Class': 'LastClass'})

    A  B  D  Other LastClass
ID                          
1   1  1  0      0         B
2   0  0  1      0         D
3   0  1  0      0         B
4   0  0  0      1     Other
5   0  0  0      1     Other
6   0  0  0      1     Other
7   1  0  0      0         A
8   0  0  0      1     Other
9   1  0  0      0         A
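
For anyone who wants to run the steps above end to end, here is a minimal sketch. It assumes the sample frame from the question, and it adds one thing the answer does not do: sorting by `year_month_id` first, so that `last()` really picks the most recent month even if the rows arrive unordered. The names `df_sorted`, `counts`, `latest` and `result` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "year_month_id": [201612] * 9 + [201701],
    "Class": ["A", "D", "B", "Other", "Other", "Other", "A", "Other", "A", "B"],
})

# Assumption: "latest" means the highest year_month_id, so sort before grouping.
df_sorted = df.sort_values(["ID", "year_month_id"])
g = df_sorted.groupby("ID")

# One row per ID, one column per class, zero where an ID never had that class.
counts = g["Class"].value_counts().unstack(fill_value=0)

# Last class per ID, renamed up front so no rename is needed after concat.
latest = g["Class"].last().rename("LastClass")

result = pd.concat([counts, latest], axis=1)
print(result)
```

Passing `axis=1` by keyword matters on recent pandas versions, which no longer accept the axis as a positional argument to `pd.concat`.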

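To get the list of IDs that belong to more than one class, count the non-zero columns in each row and apply a boolean mask:

k = i.gt(0).sum(axis=1)   # number of distinct classes per ID
k[k > 1]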

ID
1    2
dtype: int64
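
Another way to answer the "more than one class" part, working from the original frame rather than the count table, is to count distinct classes per ID with `nunique`. This is a sketch on the question's sample data; `n_classes` and `multi_class_ids` are names made up for the example.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "year_month_id": [201612] * 9 + [201701],
    "Class": ["A", "D", "B", "Other", "Other", "Other", "A", "Other", "A", "B"],
})

# Number of distinct classes each ID has ever belonged to.
n_classes = df.groupby("ID")["Class"].nunique()

# Keep only the IDs that appear under two or more classes.
multi_class_ids = n_classes[n_classes > 1].index.tolist()
print(multi_class_ids)  # [1]
```

Unlike a plain row sum, this does not flag an ID that simply stayed in the same class for many months.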

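【Discussion】:

To whoever downvoted: if something is wrong with the answer, please let me know so I can correct it. Thanks.
@jezrael Someone mistook Christmas for April Fools' Day.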

【Solution 2】:

When you pivot only 2 columns, use count as the aggfunc, and fill missing entries with zeros (exactly this case), it is worth considering pd.crosstab:

>>> new_df = pd.crosstab(df.ID, df.Class)
>>> new_df
Class  A  B  D  Other
ID
1      1  1  0      0
2      0  0  1      0
3      0  1  0      0
4      0  0  0      1
5      0  0  0      1
6      0  0  0      1
7      1  0  0      0
8      0  0  0      1
9      1  0  0      0

You get the last value of Class from the initial DataFrame by grouping on ID and selecting the last entry:

>>> df.groupby('ID').Class.last()
ID
1        B
2        D
3        B
4    Other
5    Other
6    Other
7        A
8    Other
9        A

Then you can concatenate them:

>>> new_df = pd.concat([new_df, df.groupby('ID').Class.last()], axis=1)
    A  B  D  Other  Class
ID
1   1  1  0      0      B
2   0  0  1      0      D
3   0  1  0      0      B
4   0  0  0      1  Other
5   0  0  0      1  Other
6   0  0  0      1  Other
7   1  0  0      0      A
8   0  0  0      1  Other
9   1  0  0      0      A

And to get the output exactly as you asked:

>>> new_df = new_df.rename(columns={'Class': 'LastClass'})
    A  B  D  Other LastClass
ID
1   1  1  0      0         B
2   0  0  1      0         D
3   0  1  0      0         B
4   0  0  0      1     Other
5   0  0  0      1     Other
6   0  0  0      1     Other
7   1  0  0      0         A
8   0  0  0      1     Other
9   1  0  0      0         A

Putting it all together as a one-liner:

>>> new_df = pd.concat([pd.crosstab(df.ID, df.Class), df.groupby('ID').Class.last()], axis=1).rename(columns={'Class': 'LastClass'})

>>> new_df
    A  B  D  Other LastClass
ID
1   1  1  0      0         B
2   0  0  1      0         D
3   0  1  0      0         B
4   0  0  0      1     Other
5   0  0  0      1     Other
6   0  0  0      1     Other
7   1  0  0      0         A
8   0  0  0      1     Other
9   1  0  0      0         A
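
If the column names should match the layout in the question (Class_A, Class_B, Class_D, Other, Latest_Class), the crosstab result can be relabelled before joining. The following is only a sketch under that assumption; the renaming rule (prefix everything except "Other") is inferred from the asker's example table, and sorting by `year_month_id` is an extra step to make "latest" independent of row order.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "year_month_id": [201612] * 9 + [201701],
    "Class": ["A", "D", "B", "Other", "Other", "Other", "A", "Other", "A", "B"],
})

# Count rows per (ID, Class); combinations that never occur become 0.
counts = pd.crosstab(df["ID"], df["Class"])

# Assumed naming rule: prefix every class except "Other" with "Class_".
counts.columns = [c if c == "Other" else f"Class_{c}" for c in counts.columns]

# Latest class per ID, taken after sorting chronologically.
latest = (
    df.sort_values(["ID", "year_month_id"])
      .groupby("ID")["Class"]
      .last()
      .rename("Latest_Class")
)

# Both pieces are indexed by ID, so a plain index join lines them up.
result = counts.join(latest)
print(result)
```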

【Discussion】:

【Solution 3】:

We can use a pivot table and then attach the latest class as a new column, i.e.

ndf = df.pivot_table(index=['ID'],columns=['Class'],aggfunc='count',fill_value=0)\
    .xs('year_month_id', axis=1, drop_level=True)

# sort by month within each ID so that tail(1) picks the most recent class
ndf['latest'] = df.sort_values(['ID', 'year_month_id']).groupby('ID')['Class'].tail(1).values

Class  A  B  D  Other latest
ID                          
1      1  1  0      0      B
2      0  0  1      0      D
3      0  1  0      0      B
4      0  0  0      1  Other
5      0  0  0      1  Other
6      0  0  0      1  Other
7      1  0  0      0      A
8      0  0  0      1  Other
9      1  0  0      0      A
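
A slightly more defensive variant of the same idea, as a sketch: passing `values="year_month_id"` to `pivot_table` avoids the extra `.xs(...)` step, and assigning a Series indexed by ID (instead of a bare `.values` array) lets pandas align the latest class by label rather than by position. The sample frame is the one from the question.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "year_month_id": [201612] * 9 + [201701],
    "Class": ["A", "D", "B", "Other", "Other", "Other", "A", "Other", "A", "B"],
})

# Count months per (ID, Class); naming the value column keeps the columns flat.
ndf = df.pivot_table(
    index="ID",
    columns="Class",
    values="year_month_id",
    aggfunc="count",
    fill_value=0,
)

# Last row per ID after a chronological sort, re-indexed by ID.
latest = (
    df.sort_values(["ID", "year_month_id"])
      .groupby("ID")
      .tail(1)
      .set_index("ID")["Class"]
)

# Assignment aligns on the ID index, so row order cannot silently mismatch.
ndf["latest"] = latest
print(ndf)
```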

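【Discussion】:

Using pivot here is a nice choice; I would guess it is the fastest.
When you pivot only 2 columns, use count as the aggfunc, and fill with zeros (exactly this case), it is worth considering pd.crosstab.
Thanks a lot @Dark. Since the data is huge, I cannot manually check whether the output is correct for every ID. How can I get a list of IDs that have an entry of 1 in more than one column?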

【Solution 4】:

You can get the counts with groupby and the count aggregation, then reshape with unstack. Finally, add the new column using drop_duplicates:

df1 = df.groupby(['ID','Class'])['year_month_id'].count().unstack(fill_value=0)
df1['Latest_Class'] = df.drop_duplicates('ID', keep='last').set_index('ID')['Class']
print(df1)
Class  A  B  D  Other Latest_Class
ID                                
1      1  1  0      0            B
2      0  0  1      0            D
3      0  1  0      0            B
4      0  0  0      1        Other
5      0  0  0      1        Other
6      0  0  0      1        Other
7      1  0  0      0            A
8      0  0  0      1        Other
9      1  0  0      0            A
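
Since the asker cannot check a large result by hand, one cheap sanity check is that the class counts in each row add up to the number of rows that ID has in the original frame. The sketch below rebuilds `df1` on the question's sample data (with a chronological sort added before `drop_duplicates`, which is an assumption about what "latest" means) and then runs that check; `counts_only`, `rows_per_id` and `mismatch` are illustrative names.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "year_month_id": [201612] * 9 + [201701],
    "Class": ["A", "D", "B", "Other", "Other", "Other", "A", "Other", "A", "B"],
})

# Counts per (ID, Class), reshaped so that classes become columns.
df1 = df.groupby(["ID", "Class"])["year_month_id"].count().unstack(fill_value=0)

# Keep the chronologically last row per ID and use its class.
df1["Latest_Class"] = (
    df.sort_values(["ID", "year_month_id"])
      .drop_duplicates("ID", keep="last")
      .set_index("ID")["Class"]
)

# Sanity check: per-ID class counts must sum to that ID's total row count.
counts_only = df1.drop(columns="Latest_Class")
rows_per_id = df.groupby("ID").size()
mismatch = counts_only.sum(axis=1) != rows_per_id

# An empty list means every ID adds up.
print(mismatch[mismatch].index.tolist())
```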

【Discussion】:

To whoever downvoted: if something is wrong with my answer, please let me know so I can correct it. Thanks.