[Question title]: Pandas: expand value counts after groupby as columns
[Posted]: 2021-12-29 16:48:18
[Question]:

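As part of feature engineering, I want to use the per-group value counts of each column (after a groupby) as features for a model. This is what I tried: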

>>> import pandas as pd
>>> from collections import Counter
>>> df = pd.DataFrame({'col1':['a','b','a','c','a','b'],'col2':['val1','val2','val2','val1','val2','val2'],'col3':['val3','val4','val3','val4','val3','val4']})
>>> df
   col1  col2  col3
0    a  val1  val3
1    b  val2  val4
2    a  val2  val3
3    c  val1  val4
4    a  val2  val3
5    b  val2  val4
>>> test = df.groupby('col1').agg(list)
>>> test
                    col2                col3
col1
a     [val1, val2, val2]  [val3, val3, val3]
b           [val2, val2]        [val4, val4]
c                 [val1]              [val4]
>>> test['col2'] = test['col2'].apply(lambda x: Counter(x))
>>> test['col3'] = test['col3'].apply(lambda x: Counter(x))
>>> test
                        col2         col3
col1
a     {'val1': 1, 'val2': 2}  {'val3': 3}
b                {'val2': 2}  {'val4': 2}
c                {'val1': 1}  {'val4': 1}

I can then expand the dictionaries into separate columns, so the final output is:

>>> final = pd.concat([test.drop(['col2'], axis=1), test['col2'].apply(pd.Series)], axis=1)
>>> final = pd.concat([final.drop(['col3'], axis=1), final['col3'].apply(pd.Series)], axis=1)
>>> final
   val1 val2 val3 val4
a  1.0  2.0  3.0  NaN
b  NaN  2.0  NaN  2.0
c  1.0  NaN  NaN  1.0

I feel there should be a simpler solution. Any help is appreciated.

[Comments]:

    Tags: python-3.x pandas dataframe pandas-groupby


    [Solution 1]:

    Yes: melt + crosstab.

    df2 = df.melt(id_vars='col1', value_name='count')
    pd.crosstab(df2['col1'], df2['count'])
    

    Output:

    count  val1  val2  val3  val4
    col1                         
    a         1     2     3     0
    b         0     2     0     2
    c         1     0     0     1
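
A self-contained version of the melt + crosstab approach, as a minimal sketch that rebuilds the question's `df` so it can be run on its own. The name `long_df` is chosen here for illustration and stands in for `df2` above. `melt` reshapes the frame to long format (one row per `col1` and value pair), and `crosstab` then counts each value within each `col1` group.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c", "a", "b"],
    "col2": ["val1", "val2", "val2", "val1", "val2", "val2"],
    "col3": ["val3", "val4", "val3", "val4", "val3", "val4"],
})

# Long format: columns are col1, variable (the source column name) and count
# (the cell value, named "count" to match the answer above).
long_df = df.melt(id_vars="col1", value_name="count")

# Count how often each value occurs within each col1 group.
counts = pd.crosstab(long_df["col1"], long_df["count"])
print(counts)
```

Because `crosstab` fills missing combinations with 0, the result keeps an integer dtype, which is often convenient when the counts are used directly as model features.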
    

    If you want NaN instead of 0:

    df3 = pd.crosstab(df2['col1'], df2['count'])
    df3.mask(df3.eq(0))
    

    Output:

    count  val1  val2  val3  val4
    col1                         
    a       1.0   2.0   3.0   NaN
    b       NaN   2.0   NaN   2.0
    c       1.0   NaN   NaN   1.0
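
The effect of `mask` on the dtype can be seen in a small sketch. It rebuilds the crosstab result so it runs on its own, and the names `with_nan` and `back_to_int` are illustrative only. Replacing zeros with NaN forces the columns to float, and the conversion can be undone if integer features are preferred.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c", "a", "b"],
    "col2": ["val1", "val2", "val2", "val1", "val2", "val2"],
    "col3": ["val3", "val4", "val3", "val4", "val3", "val4"],
})
long_df = df.melt(id_vars="col1", value_name="count")
counts = pd.crosstab(long_df["col1"], long_df["count"])

# mask() replaces cells where the condition is True (here: count == 0) with NaN.
# NaN cannot be stored in an integer column, so the columns become float64.
with_nan = counts.mask(counts.eq(0))
print(with_nan.dtypes)

# Going the other way: fill NaN with 0 and cast back to integers.
back_to_int = with_nan.fillna(0).astype(int)
print(back_to_int)
```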
    

    [Comments]:

      [Solution 2]:
      df = pd.concat([df[['col1','col2']], df[['col1','col3']].rename(columns={"col3": "col2"})])
      df = df.pivot_table(index = 'col1', columns = 'col2',aggfunc=len)
      print(df)
      

      Output:

      col2  val1  val2  val3  val4
      col1
      a      1.0   2.0   3.0   NaN
      b      NaN   2.0   NaN   2.0
      c      1.0   NaN   NaN   1.0
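
A variant of the same idea that leaves the original `df` untouched, as a sketch under two assumptions: the names `stacked` and `counts` are introduced here for illustration, and `aggfunc="size"` is swapped in for the answer's `aggfunc=len` (it counts rows per group without needing a separate values column). The two column pairs are stacked under a shared column name and then pivoted.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c", "a", "b"],
    "col2": ["val1", "val2", "val2", "val1", "val2", "val2"],
    "col3": ["val3", "val4", "val3", "val4", "val3", "val4"],
})

# Stack (col1, col2) on top of (col1, col3), renaming col3 so both parts
# share the column name "col2". The original df is not overwritten.
stacked = pd.concat([
    df[["col1", "col2"]],
    df[["col1", "col3"]].rename(columns={"col3": "col2"}),
])

# Count rows per (col1, value) pair; pairs that never occur become NaN.
counts = stacked.pivot_table(index="col1", columns="col2", aggfunc="size")
print(counts)
```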
      

      [Comments]:

      • This is equivalent to melt + crosstab, which I think would be faster than this solution.
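
The speed claim in this comment is easy to check rather than guess. The sketch below is not part of the thread: it assumes a randomly generated 10,000-row frame (`big`), hypothetical helper names `melt_crosstab` and `concat_pivot`, and `aggfunc="size"` in place of `aggfunc=len`. It first confirms both approaches give the same counts, then times them with `timeit`. Timings depend on the pandas version and the data, so no result is quoted here.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data with the same shape as the question's frame (an assumption
# made only for benchmarking; the values are random).
rng = np.random.default_rng(0)
n = 10_000
big = pd.DataFrame({
    "col1": rng.choice(["a", "b", "c"], size=n),
    "col2": rng.choice(["val1", "val2"], size=n),
    "col3": rng.choice(["val3", "val4"], size=n),
})


def melt_crosstab(frame):
    """Solution 1: melt to long format, then crosstab."""
    long_df = frame.melt(id_vars="col1", value_name="count")
    return pd.crosstab(long_df["col1"], long_df["count"])


def concat_pivot(frame):
    """Solution 2 (with aggfunc='size'): stack the columns, then pivot."""
    stacked = pd.concat([
        frame[["col1", "col2"]],
        frame[["col1", "col3"]].rename(columns={"col3": "col2"}),
    ])
    return stacked.pivot_table(index="col1", columns="col2", aggfunc="size")


# Both approaches should produce the same counts (NaN filled with 0 to compare).
same = np.array_equal(
    melt_crosstab(big).to_numpy(),
    concat_pivot(big).fillna(0).to_numpy(),
)
print("same counts:", same)

# Time each approach over a few repetitions.
for func in (melt_crosstab, concat_pivot):
    seconds = timeit.timeit(lambda: func(big), number=20)
    print(f"{func.__name__}: {seconds / 20:.4f} s per call")
```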
      [Solution 3]:

      Another option that combines melt, groupby, and unstack:

      (df.melt('col1')
         .groupby(['col1', 'value'])
         .size()
         .unstack()
         .rename_axis(index=None, columns=None)
      )
             val1  val2  val3  val4
      a       1.0   2.0   3.0   NaN
      b       NaN   2.0   NaN   2.0
      c       1.0   NaN   NaN   1.0
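
If zeros are preferred over NaN, `unstack` accepts a `fill_value` argument. The sketch below is a small variation on the chain above rather than the answerer's code. It rebuilds the question's `df` so it runs standalone, and the name `counts` is illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "col1": ["a", "b", "a", "c", "a", "b"],
    "col2": ["val1", "val2", "val2", "val1", "val2", "val2"],
    "col3": ["val3", "val4", "val3", "val4", "val3", "val4"],
})

counts = (
    df.melt("col1")                          # long format: col1, variable, value
      .groupby(["col1", "value"])
      .size()                                # rows per (col1, value) pair
      .unstack(fill_value=0)                 # values become columns; missing pairs are 0
      .rename_axis(index=None, columns=None) # drop the axis labels
)
print(counts)
```

With `fill_value=0` the counts stay integers instead of being upcast to float by the NaN entries.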
      

      [Comments]:
