Manipulating data frames based on different columns

【Question title】: Manipulating data frames based on different columns
【Posted】: 2017-06-28 01:57:38
【Question】:

I have a data frame df with two columns, Rule_ID and Location. Its data looks like this:

Rule_ID                         Location
[u'2c78g',u'df567',u'5ty78']    US
[u'2c78g',u'd67gh',u'df890o']   India
[u'd67gh',u'df890o',u'5ty78']   Japan
[u'2c78g',u'5ty78',u'df890o']   US

I want two results. The first is the count of unique rule IDs for each location, which should look like this:

Location    Count_of_unique_rule_ids
US          4
India       3
Japan       3

Second, I want the count of each rule ID by location, which should look like this:

Rule_ID    Count   Location
u'2c78g'   2       US
u'df567'   1       US 
u'5ty78'   2       US

and so on.

This is an extension of an earlier question, "Manipulating data frames".

【Comments】:

@piRSquared I would love to see your approach.
Psidom's answer is what I would have done.

Tags: python pandas dataframe

【Solution 1】:

Here is one way to do it.

Using apply:

In [235]: df.groupby('Location')['Rule_ID'].apply(lambda x: len(set(x.sum())))
Out[235]:
Location
India    3
Japan    3
US       4
Name: Rule_ID, dtype: int64

In [236]: (df.groupby('Location')
             .apply(lambda x: pd.Series(x['Rule_ID'].sum()))
             .reset_index()
             .groupby(['Location', 0]).size())
Out[236]:
Location  0
India     2c78g     1
          d67gh     1
          df890o    1
Japan     5ty78     1
          d67gh     1
          df890o    1
US        2c78g     2
          5ty78     2
          df567     1
          df890o    1
dtype: int64

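Details

Calling x.sum() on a Series of lists concatenates them, so you can get the unique count by taking the length of the set of the combined list.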

In [237]: df.groupby('Location')['Rule_ID'].apply(lambda x: x.sum())
Out[237]:
Location
India                         [2c78g, d67gh, df890o]
Japan                         [d67gh, df890o, 5ty78]
US       [2c78g, df567, 5ty78, 2c78g, 5ty78, df890o]
Name: Rule_ID, dtype: object

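Applying pd.Series to the combined list creates one row per element; after reset_index, you can group by Location and the value column (named 0) and take the size of each group.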

In [240]: df.groupby('Location').apply(lambda x: pd.Series(x['Rule_ID'].sum()))
Out[240]:
Location
India     0     2c78g
          1     d67gh
          2    df890o
Japan     0     d67gh
          1    df890o
          2     5ty78
US        0     2c78g
          1     df567
          2     5ty78
          3     2c78g
          4     5ty78
          5    df890o
dtype: object
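
For readers who want to reproduce this end to end, here is a minimal, self-contained sketch of the same idea. It is a variant rather than the exact code above: it flattens each group's lists with itertools.chain.from_iterable instead of x.sum(), and it counts with value_counts instead of a second groupby. The sample frame is copied from the question; the helper name flatten and the variable names are only illustrative.

```python
from itertools import chain

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})


def flatten(lists):
    """Join an iterable of lists into one flat list (illustrative helper)."""
    return list(chain.from_iterable(lists))


# Result 1: number of distinct rule IDs per location.
unique_counts = df.groupby("Location")["Rule_ID"].apply(
    lambda x: len(set(flatten(x)))
)

# Result 2: occurrences of each rule ID within each location.
# The result is a Series indexed by (Location, rule ID).
id_counts = df.groupby("Location")["Rule_ID"].apply(
    lambda x: pd.Series(flatten(x)).value_counts()
)

print(unique_counts)
print(id_counts)
```

With this data, unique_counts["US"] is 4 and id_counts.loc[("US", "2c78g")] is 2, which agrees with the outputs shown above.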

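【Comments】:

This is awesome. Thanks!

【Solution 2】:

You need to convert the data frame to long format (unnesting the Rule_ID column); after that, the summaries are straightforward: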

df_long = pd.DataFrame({
        "Rule_ID": [e for s in df.Rule_ID for e in s],
        "Location": df.Location.repeat(df.Rule_ID.str.len())
    })

df_long.groupby('Location').Rule_ID.nunique()

#Location
#India    3
#Japan    3
#US       4
#Name: Rule_ID, dtype: int64

df_long.groupby(['Rule_ID', 'Location']).size()

#Rule_ID    Location
#u'2c78g'   India       1
#           US          2
#u'5ty78'   Japan       1
#           US          2
#u'd67gh'   India       1
#           Japan       1
#u'df567'   US          1
#u'df890o'  India       1
#           Japan       1
#           US          1
#dtype: int64
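
As a hedged aside, the same long format can be sketched with DataFrame.explode. This assumes pandas 0.25 or later, which postdates this answer, so it is an alternative and not the method shown above. The sample frame is copied from the question, and the column name Count is taken from the layout the question asks for.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Rule_ID": [
        ["2c78g", "df567", "5ty78"],
        ["2c78g", "d67gh", "df890o"],
        ["d67gh", "df890o", "5ty78"],
        ["2c78g", "5ty78", "df890o"],
    ],
    "Location": ["US", "India", "Japan", "US"],
})

# explode() emits one row per list element and repeats Location for each.
df_long = df.explode("Rule_ID")

# Result 1: number of distinct rule IDs per location.
unique_counts = df_long.groupby("Location")["Rule_ID"].nunique()

# Result 2: occurrences of each rule ID per location, as a flat table
# with the columns Location, Rule_ID, and Count.
id_counts = (
    df_long.groupby(["Location", "Rule_ID"])
           .size()
           .reset_index(name="Count")
)

print(unique_counts)
print(id_counts)
```

Calling reset_index(name="Count") turns the grouped sizes into an ordinary three-column frame, which is close to the second table requested in the question.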
