【Question title】: Most Frequent Words from Sentences grouped by category
【Posted】: 2018-10-03 23:56:31
【Question】:

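I'm trying to get the 10 most frequent words grouped by category. I have already looked at this answer, but I can't quite adapt it to produce the output I want.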

category | sentence
  A           cat runs over big dog
  A           dog runs over big cat
  B           random sentences include words
  C           including this one

Desired output:

category | word/frequency
   A           runs: 2
               cat: 2
               dog: 2
               over: 2
               big: 2
   B           random: 1
   C           including: 1

Since my dataframe is very large, I only want the top 10 most frequently occurring words. I have also looked at this answer:

df.groupby('subreddit').agg(lambda x: nltk.FreqDist([w for wordlist in x for w in wordlist]))

but this approach also returns counts of individual letters.
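
A likely cause of the letter counts, assuming the column still holds plain strings rather than lists of tokens: iterating over a string yields its characters, so the inner `for w in wordlist` loop walks over letters. The sketch below illustrates this with `collections.Counter` standing in for `nltk.FreqDist` (which is a `Counter` subclass) so it runs without NLTK; the `sentences` list is a made-up stand-in for one group's values.

```python
from collections import Counter

# Hypothetical stand-in for the "sentence" values of one group.
sentences = ["cat runs over big dog", "dog runs over big cat"]

# Iterating over a str yields characters, so this counts letters and spaces.
char_counts = Counter(w for wordlist in sentences for w in wordlist)

# Splitting each sentence first makes the inner loop iterate over words.
word_counts = Counter(w for sentence in sentences for w in sentence.split())

print(char_counts.most_common(3))
print(word_counts.most_common(3))
```
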

【Comments】:

    Tags: python pandas


    【Solution 1】:

    If you want to keep only the most frequent words, the following line will do it (in this case, the 2 most frequent words per category):

    from collections import Counter
    
    df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
    
    category
    A            [(cat, 2), (runs, 2)]
    B    [(random, 1), (sentences, 1)]
    C      [(including, 1), (this, 1)]
    Name: sentence, dtype: object
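
To go from these lists of tuples to one row per word, which is closer to the layout the question asks for, one option is to unpack the `most_common` pairs into a new frame. This is only a minimal sketch: it rebuilds the sample frame from the question, and `TOP_N` and `top_words` are names introduced here for illustration.

```python
from collections import Counter

import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})

TOP_N = 10  # the question asks for the 10 most frequent words

# One (category, word, count) row per word, at most TOP_N rows per category.
rows = [
    (category, word, count)
    for category, sentences in df.groupby("category")["sentence"]
    for word, count in Counter(" ".join(sentences).split()).most_common(TOP_N)
]
top_words = pd.DataFrame(rows, columns=["category", "word", "count"])
print(top_words)
```
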
    

    Performance-wise:

    %timeit df.groupby("category")["sentence"].apply(lambda x: Counter(" ".join(x).split()).most_common(2))
    2.07 ms ± 87.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    
    %timeit df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))
    4.96 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
    

    【Comments】:

    • Thanks! I'll play around with it some more to get the format I want, but this is a great start
    【Solution 2】:

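    You can join the rows of each category, tokenize the result, and apply FreqDist:

    import nltk  # word_tokenize also needs NLTK's Punkt tokenizer data to be installed
    df.groupby('category')['sentence'].apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))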
    

    Output (the category labels print as a, c, d here rather than the question's A, B, C):

    category           
    a         big          2.0
              cat          2.0
              dog          2.0
              over         2.0
              runs         2.0
    c         include      1.0
              random       1.0
              sentences    1.0
              words        1.0
    d         including    1.0
              one          1.0
              this         1.0
    Name: sentence, dtype: float64
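
The output above lists every word. If only the top few per category are wanted and NLTK is not required, a pandas-only variant of the same split-then-count idea might look like the sketch below. It swaps `word_tokenize` for a plain whitespace split, so punctuation handling differs; `TOP_N`, `words`, `counts` and `top` are illustrative names, and `Series.explode` assumes pandas 0.25 or newer.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})

TOP_N = 2  # illustrative; the question asks for 10

# Split on whitespace and give each word its own row.
words = df.assign(word=df["sentence"].str.split()).explode("word")

# Count words within each category; counts come back sorted per category.
counts = words.groupby("category")["word"].value_counts()

# Keep the first TOP_N entries of each category.
top = counts.groupby(level="category").head(TOP_N)
print(top)
```
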
    

    【Comments】:

      【Solution 3】:
      import pandas as pd

      # Split each sentence into one column per word
      df1 = pd.DataFrame(df.sentence.str.split(' ').tolist())
      
      # Add the category column, which the split did not carry over
      df1['category']  = df['category']
      
      # Melt the word columns into a single 'value' column (one row per word)
      df1 = pd.melt(df1, id_vars='category', value_vars=df1.columns[:-1].tolist())
      
      # Group by category and word, then count (reset_index creates a column named 0)
      df1 = df1.groupby(['category', 'value']).size().reset_index()
      
      # Keep the 10 largest counts across the whole frame (not per category)
      df1 = df1.nlargest(10, 0)
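
Note that `nlargest(10, 0)` keeps the ten largest counts overall. If ten words per category are wanted instead, one possible adjustment is to sort within each category and take the head of each group. This is a sketch on the question's sample data, not a tuned solution; the `count` column name, `TOP_N` and `top_per_category` are introduced here for illustration.

```python
import pandas as pd

# Sample frame rebuilt from the table in the question.
df = pd.DataFrame({
    "category": ["A", "A", "B", "C"],
    "sentence": [
        "cat runs over big dog",
        "dog runs over big cat",
        "random sentences include words",
        "including this one",
    ],
})

# Same split / melt / count steps as above, with the count column named explicitly.
df1 = pd.DataFrame(df["sentence"].str.split(" ").tolist())
df1["category"] = df["category"]
df1 = pd.melt(df1, id_vars="category", value_vars=df1.columns[:-1].tolist())
df1 = df1.groupby(["category", "value"]).size().reset_index(name="count")

TOP_N = 10

# Sort by count within each category, then keep the first TOP_N rows per category.
top_per_category = (
    df1.sort_values(["category", "count"], ascending=[True, False])
       .groupby("category")
       .head(TOP_N)
)
print(top_per_category)
```
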
      

      【Comments】:

      • This is close - for my final output I'm trying to achieve something more like df1.groupby('category')[0].value_counts()