Groupby and count the number of unique values (Pandas): answers

【Question title】: Groupby and count the number of unique values (Pandas)
【Posted】: 2017-08-03 21:55:24
【Question description】:

I have a dataframe with 2 variables: ID and outcome. I'm trying to groupby ID first, and then count the number of unique values of outcome within that ID.

df
ID    outcome
1      yes
1      yes
1      yes
2      no
2      yes
2      no

Expected output:

ID    yes    no
1      3     0
2      1     2

My code df[['PID', 'outcome']].groupby('PID')['outcome'].nunique() gives the number of unique values itself, like this:

ID
1   2
2   2

But I need the counts of yes and no. How can I achieve that? Thanks!
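To make the distinction concrete, here is a minimal sketch that rebuilds the sample rows from the question (using the column name `ID` as in the table; the variable names are illustrative, not from the post). `nunique` reports how many distinct outcomes each ID has, whereas the desired table needs how many times each outcome occurs.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# Number of DISTINCT outcome values per ID (not the count of each value).
distinct_per_id = df.groupby("ID")["outcome"].nunique()

# Number of rows per (ID, outcome) pair; pairs that never occur are simply absent.
rows_per_pair = df.groupby(["ID", "outcome"]).size()

print(distinct_per_id)
print(rows_per_pair)
```

The second result already holds the wanted counts, but in long form and without a zero for pairs that never occur, which is what the answers below address.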

【Question discussion】:

Tags: python pandas dataframe count unique

【Solution 1】:

How about pd.crosstab?

In [1217]: pd.crosstab(df.ID, df.outcome)
Out[1217]: 
outcome  no  yes
ID              
1         0    3
2         2    1
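For a self-contained check, the sketch below rebuilds the question's sample data and runs `pd.crosstab`. The final column selection is an addition, not part of the answer: it only reorders the columns to match the `yes`, `no` order of the expected output.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# One row per ID, one column per outcome; each cell is a row count.
counts = pd.crosstab(df["ID"], df["outcome"])

# Optional: put the columns in the order shown in the question.
counts = counts[["yes", "no"]]

print(counts)
```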

【Discussion】:

I have to find a new option 2
@piRSquared It was only a matter of time before someone else figured it out :p
"How about"? This is awesome!
@Kay Cheers. :)

【Solution 2】:

Option 2
pd.factorize + np.bincount
This is convoluted and painful... but very fast.

fi, ui = pd.factorize(df.ID.values)
fo, uo = pd.factorize(df.outcome.values)

n, m = ui.size, uo.size
pd.DataFrame(
    np.bincount(fi * m + fo, minlength=n * m).reshape(n, m),
    pd.Index(ui, name='ID'), pd.Index(uo, name='outcome')
)

outcome  yes  no
ID              
1          3   0
2          1   2
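The idea behind this option is that each row is mapped to a single integer `fi * m + fo`, the position of its (ID, outcome) cell in a row-major `n` by `m` grid, so `np.bincount` can tally every cell in one pass. A minimal sketch of the intermediate values on the question's sample data (the names `flat`, `cell_counts` and `table` are introduced here for illustration only):

```python
import numpy as np
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# Integer codes in order of first appearance, plus the unique values.
fi, ui = pd.factorize(df["ID"].values)       # codes 0,0,0,1,1,1 ; uniques 1, 2
fo, uo = pd.factorize(df["outcome"].values)  # codes 0,0,0,1,0,1 ; uniques 'yes', 'no'
n, m = ui.size, uo.size

# Flattened cell index: row fi, column fo in an n-by-m grid.
flat = fi * m + fo

# Count how often each cell index occurs; minlength keeps empty cells as 0.
cell_counts = np.bincount(flat, minlength=n * m)

table = pd.DataFrame(
    cell_counts.reshape(n, m),
    index=pd.Index(ui, name="ID"),
    columns=pd.Index(uo, name="outcome"),
)
print(table)
```

Because `pd.factorize` numbers values in order of first appearance, the columns come out as `yes`, `no` here rather than alphabetically.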

Option C

pd.get_dummies(df.ID, dtype=int).T.dot(pd.get_dummies(df.outcome, dtype=int))

   no  yes
1   0    3
2   2    1
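Why this works: `get_dummies` turns each column into a 0/1 indicator matrix with one row per observation, so multiplying the transposed ID indicators by the outcome indicators sums, for each (ID, outcome) pair, the rows where both indicators are 1. A small sketch on the question's data; passing `dtype=int` explicitly is an assumption about the installed pandas version, since recent releases return boolean dummies by default.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# Indicator matrices: one row per observation, one column per distinct value.
id_ind = pd.get_dummies(df["ID"], dtype=int)        # columns: 1, 2
out_ind = pd.get_dummies(df["outcome"], dtype=int)  # columns: 'no', 'yes'

# (IDs x rows) dot (rows x outcomes) -> co-occurrence counts per (ID, outcome).
counts = id_ind.T.dot(out_ind)

print(counts)
```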

Option IV

df.groupby(['ID', 'outcome']).size().unstack(fill_value=0)
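The answer does not show the result of this one-liner. A minimal sketch on the question's sample data: `size()` counts rows per (ID, outcome) pair, and `unstack(fill_value=0)` pivots the outcome level into columns, writing 0 for pairs that never occur (here ID 1 with `no`).

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 2],
    "outcome": ["yes", "yes", "yes", "no", "yes", "no"],
})

# Rows per (ID, outcome) pair, then outcome moved from the index into columns.
wide = df.groupby(["ID", "outcome"]).size().unstack(fill_value=0)

print(wide)
```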

【Discussion】:

Meaning "terrible", hahaha
@Kay DataFrame transpose

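【Solution 3】:

Group on the ID column, then aggregate using value_counts on the outcome column. This yields a Series, so you need to convert it back to a DataFrame with .to_frame() so that you can unstack the yes/no values (i.e. have them as columns). Then fill the null values with zero.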

df_total = df.groupby('ID')['outcome'].value_counts().to_frame().unstack(fill_value=0)
df_total.columns = df_total.columns.droplevel()
>>> df_total
outcome  no  yes
ID              
1         0    3
2         2    1

【Discussion】:

@piRSquared Yes. Thanks.

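【Solution 4】:

Using set_index and pd.concat

df1 = df.set_index('ID')
# Series.sum(level=0) was removed in pandas 2.0; groupby(level=0).sum() is the equivalent.
pd.concat([df1.outcome.eq('yes').groupby(level=0).sum(),
           df1.outcome.ne('yes').groupby(level=0).sum()], keys=['yes','no'], axis=1).reset_index()

Output:

   ID  yes  no
0   1    3   0
1   2    1   2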

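【Discussion】:

【Solution 5】:

The most efficient setup, which will prevent any past, present and future bugs and take advantage of FAST vectorized functions, is to do the following (very simple) thing: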

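df['dummy_yes'] = df.outcome == 'yes'
df['dummy_no'] = df.outcome == 'no'

# Sum only the boolean indicator columns; summing the string column 'outcome' would concatenate its values.
df.groupby('ID')[['dummy_yes', 'dummy_no']].sum()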

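【Discussion】:

Why do you think the other solutions would lead to past, present or future bugs?
This holds for any language. In the past I ran into some serious and subtle bugs while trying to do funkier things
@coldspeed Here it is: stackoverflow.com/questions/36337012/…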