Answers: Python Pandas NLTK Frequency Distribution for Tokenized Words in Dataframe Column with a Groupby

【Question title】: Python Pandas NLTK Frequency Distribution for Tokenized Words in Dataframe Column with a Groupby
【Posted】: 2018-11-28 21:43:04
【Question description】:

I have the following sample dataframe:

No  category    problem_definition
175 2521       ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420']
211 1438       ['galley', 'work', 'table', 'stuck']
912 2698       ['cloth', 'stuck']
572 2521       ['stuck', 'coffee']

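The problem_definition field has already been tokenized, and stop words have been removed.

I want to create a frequency distribution that outputs another Pandas dataframe:

1) with the frequency of each word in problem_definition
2) with the frequency of each word in problem_definition, broken down by the category field

The desired output for case 1) is below: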

text       count
coffee     2
maker      1
brewing    1
properly   1
2          1
420        3
stuck      3
galley     1
work       1
table      1
cloth      1

The desired output for case 2) is below:

category    text       count
2521        coffee     2
2521        maker      1
2521        brewing    1
2521        properly   1
2521        2          1
2521        420        3
2521        stuck      1
1438        galley     1
1438        work       1
1438        table      1
1438        stuck      1
2698        cloth      1
2698        stuck      1

I tried the following code to accomplish 1):

from nltk.probability import FreqDist
import pandas as pd

fdist = FreqDist(df['problem_definition_stopwords'])

TypeError: unhashable type: 'list'

I do not know how to do 2).

【Comments】:

Are your expected counts grouped by category?
Yes, the counts of the distinct words grouped by category.

Tags: python pandas nltk counter word

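【Solution 1】:

Use unnesting. I have walked through several ways to solve this type of problem step by step elsewhere, and for fun I am linking that question here.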

unnesting(df,['problem_definition'])
Out[288]: 
  problem_definition   No  category
0             coffee  175      2521
0              maker  175      2521
0            brewing  175      2521
0           properly  175      2521
0                  2  175      2521
0                420  175      2521
0                420  175      2521
0                420  175      2521
1             galley  211      1438
1               work  211      1438
1              table  211      1438
1              stuck  211      1438
2              cloth  912      2698
2              stuck  912      2698
3              stuck  572      2521
3             coffee  572      2521

Then, for case 2, just do a regular groupby + size:

unnesting(df,['problem_definition']).groupby(['category','problem_definition']).size()
Out[290]: 
category  problem_definition
1438      galley                1
          stuck                 1
          table                 1
          work                  1
2521      2                     1
          420                   3
          brewing               1
          coffee                2
          maker                 1
          properly              1
          stuck                 1
2698      cloth                 1
          stuck                 1
dtype: int64

For case 1, use value_counts:

unnesting(df,['problem_definition'])['problem_definition'].value_counts()
Out[291]: 
stuck       3
420         3
coffee      2
table       1
maker       1
2           1
brewing     1
galley      1
work        1
cloth       1
properly    1
Name: problem_definition, dtype: int64
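
Case 1 can also be reached without reshaping the dataframe. The TypeError in the question arises because FreqDist, like the collections.Counter class it subclasses, needs hashable items, and iterating over the column yields whole lists rather than words. The sketch below is not part of the original answer: it flattens the lists first and counts with the standard-library Counter so that it runs without NLTK. The names token_counts and freq_df are illustrative.

```python
from collections import Counter
from itertools import chain

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# Flatten the per-row token lists into one stream of tokens, then count them.
token_counts = Counter(chain.from_iterable(df['problem_definition']))

# Turn the counts into a two-column dataframe, most frequent first.
freq_df = pd.DataFrame(token_counts.most_common(), columns=['text', 'count'])
print(freq_df)
```

Passing the same flattened iterable to FreqDist instead of Counter should behave the same way, since FreqDist inherits its counting from Counter.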

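The unnesting function is self-defined:

import numpy as np
import pandas as pd

def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())  # one index label per list element
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # keyword form: the positional axis argument of drop() was removed in pandas 2.0
    return df1.join(df.drop(columns=explode), how='left')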
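
On pandas 0.25 or later, the built-in DataFrame.explode does the same reshaping as the unnesting helper for a single list column. The following is a minimal sketch under that version assumption, not the answer's own method. The variable names exploded, overall and by_category are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

# One row per token; the other columns are repeated alongside each token.
exploded = df.explode('problem_definition')

# Case 1: overall frequency of each token.
overall = (exploded['problem_definition']
           .value_counts()
           .rename_axis('text')
           .reset_index(name='count'))

# Case 2: frequency of each token within each category.
by_category = (exploded
               .groupby(['category', 'problem_definition'])
               .size()
               .reset_index(name='count')
               .rename(columns={'problem_definition': 'text'}))

print(overall)
print(by_category)
```

Both results are ordinary dataframes with text and count columns, which matches the shape of the desired output in the question.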

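【Comments】:

Thanks, 'unnesting' is not defined ... do I need to import a package to use unnesting?
@PineNuts0 Go to my linked page and run my self-defined function at the end: stackoverflow.com/a/53218939/7964527
Try it now.
@PineNuts0 See above. ... See the accepted answer at the bottom of the other page.
So for case 1 I get the error: ValueError: zero-dimensional arrays cannot be concatenated, but case 2 works!

【Solution 2】:

You can also flatten the lists by category and then do a groupby and size.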

import pandas as pd
import numpy as np

df = pd.DataFrame( {'No':[175,572],
                    'category':[2521,2521],
                    'problem_definition': [['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
                                          ['stuck', 'coffee']]} )

c = df.groupby('category')['problem_definition'].agg('sum').reset_index()

lst_col = 'problem_definition'

c = pd.DataFrame({
      col:np.repeat(c[col].values, c[lst_col].str.len())
      for col in c.columns.drop(lst_col)}
    ).assign(**{lst_col:np.concatenate(c[lst_col].values)})[c.columns]

c.groupby(['category','problem_definition']).size()
>>
category  problem_definition
2521      2                     1
          420                   3
          brewing               1
          coffee                2
          maker                 1
          properly              1
          stuck                 1
dtype: int64

Alternatively, you can use a Counter to store the counts grouped by category:

import pandas as pd
import numpy as np
from collections import Counter

df = pd.DataFrame( {'No':[175,572],
                    'category':[2521,2521],
                    'problem_definition': [['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
                                          ['stuck', 'coffee']]} )

c = df.groupby('category')['problem_definition'].agg('sum').reset_index()
c['problem_definition'] = c['problem_definition'].apply(lambda x: Counter(x).items())

lst_col = 'problem_definition'

s = pd.DataFrame({
      col:np.repeat(c[col].values, c[lst_col].str.len())
      for col in c.columns.drop(lst_col)}
    ).assign(**{'text':np.concatenate(c[lst_col].apply(lambda x: [k for (k,v) in x]))}
    ).assign(**{'count':np.concatenate(c[lst_col].apply(lambda x: [v for (k,v) in x]))} )

s
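
The same Counter idea can be written without np.repeat and np.concatenate by building one record per (category, word) pair. This is a sketch of an alternative formulation, not the code from the answer above. It uses the full sample from the question, and the names records and result are illustrative.

```python
from collections import Counter

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'No': [175, 211, 912, 572],
    'category': [2521, 1438, 2698, 2521],
    'problem_definition': [
        ['coffee', 'maker', 'brewing', 'properly', '2', '420', '420', '420'],
        ['galley', 'work', 'table', 'stuck'],
        ['cloth', 'stuck'],
        ['stuck', 'coffee'],
    ],
})

records = []
# Each group is the Series of token lists belonging to one category.
for category, token_lists in df.groupby('category')['problem_definition']:
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)  # add this row's tokens to the category total
    # Emit one record per distinct word in the category.
    records.extend(
        {'category': category, 'text': text, 'count': n}
        for text, n in counts.items()
    )

result = pd.DataFrame(records, columns=['category', 'text', 'count'])
print(result)
```

Summing the count column over all categories for a given word gives the case 1 totals, so this one frame covers both cases.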
