【Question title】: How do I get the number of occurrences of a list of words (substrings) in a pandas dataframe?
【Posted】: 2018-05-05 09:27:49
【Question description】:

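I have a pandas DataFrame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words (all known in advance) in a certain column. This works for a single word: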

d = df["Content"].str.contains("word").value_counts()

But I want to find the occurrences of multiple known words from a list, such as "word1" and "word2". Also, word2 may appear as either word2 or wordtwo, like so:

word1           40
word2/wordtwo   120

How can I do this?

【Question discussion】:

  • Do you have the list of words?
  • Yes, I have the whole list.

Tags: python pandas dataframe find-occurrences


【Solution 1】:

IMO one of the most efficient approaches is to use sklearn.feature_extraction.text.CountVectorizer, passing it a vocabulary (the list of words you want to count).

Demo:

In [21]: text = """
    ...: I have a pandas data frame with approximately 1.5 million rows. I want to find the number of occurrences of specific, selected words in a certain colu
    ...: mn. This works for a single word. But I want to find out the occurrences of multiple, known words like "word1", "word2" from a list. Also word2 could
    ...: be word2 or wordtwo, like so"""

In [22]: df = pd.DataFrame(text.split('. '), columns=['Content'])

In [23]: df
Out[23]:
                                             Content
0  \nI have a pandas data frame with approximatel...
1  I want to find the number of occurrences of sp...
2                       This works for a single word
3  But I want to find out the occurrences of mult...
4      Also word2 could be word2 or wordtwo, like so

In [24]: from sklearn.feature_extraction.text import CountVectorizer

In [25]: vocab = ['word', 'words', 'word1', 'word2', 'wordtwo']

In [26]: vect = CountVectorizer(vocabulary=vocab)

In [27]: res = pd.Series(np.ravel((vect.fit_transform(df['Content']).sum(axis=0))),
                         index=vect.get_feature_names_out())

In [28]: res
Out[28]:
word       1
words      2
word1      1
word2      3
wordtwo    1
dtype: int64
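
The question asks for word2 and wordtwo to be reported as a single entry, while the demo keeps them as separate rows. Below is a minimal sketch of one way to merge them. It assumes the per-word totals shown in Out[28] and uses a hypothetical alias map (`aliases` is not part of the original answer):

```python
import pandas as pd

# Per-word totals, copied from Out[28] in the demo above.
res = pd.Series({"word": 1, "words": 2, "word1": 1, "word2": 3, "wordtwo": 1})

# Hypothetical alias map: each spelling -> the label it should be reported under.
aliases = {"word2": "word2/wordtwo", "wordtwo": "word2/wordtwo"}

# Relabel the index (unmapped words keep their name), then sum rows sharing a label.
grouped = res.rename(index=aliases).groupby(level=0, sort=False).sum()
print(grouped)  # word2/wordtwo -> 4
```

Note that CountVectorizer lowercases the text and splits it into whole-word tokens by default, so these totals are whole-word occurrences across all rows, not substring matches and not the number of rows that contain the word.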

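【Discussion】:

  • Nice use of CountVectorizer - I hadn't thought of that!
  • Accepting this as the answer, even though the other one is simpler. Great approach; since I'm dealing with millions of rows, this will come in handy. Thanks.
  • @rayanisran, I've tweaked my solution slightly by removing .A (which converts the sparse matrix to a dense matrix), so it should now be more memory-efficient... :-)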
【Solution 2】:

You can build a dictionary like this:

{w: df["Content"].str.contains(w).sum() for w in words}

Here words is assumed to be the list of words.
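
Two details of the one-liner above are worth spelling out: str.contains treats its argument as a regular expression and matches substrings (so "word" also matches "word1"), and summing the result counts rows, not total occurrences. The sketch below is one possible extension, not the answerer's code: it groups spellings such as word2/wordtwo under one label and reports both figures. The sample data and the `groups` mapping are hypothetical.

```python
import re
import pandas as pd

# Hypothetical sample data standing in for the real 1.5-million-row column.
df = pd.DataFrame({"Content": [
    "word1 appears here",
    "word2 and wordtwo in the same row",
    "wordtwo only",
    "nothing relevant",
    None,
]})

# Hypothetical grouping: report label -> spellings counted together.
groups = {"word1": ["word1"], "word2/wordtwo": ["word2", "wordtwo"]}

text = df["Content"].fillna("")  # treat missing values as empty strings
rows, total = {}, {}
for label, variants in groups.items():
    # \b restricts matches to whole words; re.escape guards regex metacharacters.
    pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
    rows[label] = int(text.str.contains(pattern).sum())   # rows with at least one match
    total[label] = int(text.str.count(pattern).sum())     # every match in every row

print(rows)   # {'word1': 1, 'word2/wordtwo': 2}
print(total)  # {'word1': 1, 'word2/wordtwo': 3}
```

Drop the `\b` anchors if substring matches are what you actually want, as in the original one-liner.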

【Discussion】:

  • I hadn't thought of using a dictionary. Thanks.