Fastest way to count a list of words in an article using Python: answers

【Question title】: Fastest way to count a list of words in an article using Python
【Posted】: 2016-11-27 01:48:35
【Question】:

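I am trying to find how many times the words in a bag of words occur in an article. I am not interested in the frequency of each word, only in the total number of times all of them occur in the article. I have to analyse hundreds of articles as I retrieve them from the internet. My algorithm takes a long time, since each article is about 800 words.

Here is what I do (where amount is the number of times the words were found in a single article, article is a string containing all the words that make up the article content, and I use NLTK to tokenize):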

bag_of_words = tokenize(bag_of_words)
tokenized_article = tokenize(article)

occurrences = [word for word in tokenized_article
                    if word in bag_of_words]

amount = len(occurrences)

tokenized_article looks like this:

[u'sarajevo', u'bosnia', u'herzegovi', u'war', ...]

bag_of_words has the same form.

I was wondering whether there is a more efficient/faster way of doing this, using NLTK or lambda functions, for instance.

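【Question comments】:

I'm not sure how NLTK would help you here: you are comparing strings, nothing more. There are ways to do this more efficiently, though. Make bag_of_words a set, because a set has constant-time membership checks (rather than time linear in the size of the list). You can then count the occurrences of N article words against a set of any size in O(N), which you cannot beat (as far as I know).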

Tags: python text count set

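【Solution 1】:

I suggest using a set for the words you are counting: a set has constant-time membership tests and is therefore faster than a list (which has linear-time membership tests).

For example: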

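# Build the set once; calling set(...) inside the comprehension
# would rebuild it for every word in the article.
bag_of_words_set = set(bag_of_words)

occurrences = [word for word in tokenized_article
                    if word in bag_of_words_set]

amount = len(occurrences)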

Some timing tests (using an artificially created list, repeated ten times):

In [4]: words = s.split(' ') * 10

In [5]: len(words)
Out[5]: 1060

In [6]: to_match = ['NTLK', 'all', 'long', 'I']

In [9]: def f():
   ...:     return len([word for word in words if word in to_match])

In [13]: timeit(f, number = 10000)
Out[13]: 1.0613768100738525

In [14]: set_match = set(to_match)

In [15]: def g():
    ...:     return len([word for word in words if word in set_match])

In [18]: timeit(g, number = 10000)
Out[18]: 0.6921310424804688
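
The session above depends on a string s that is not shown. As a self-contained sketch of the same comparison, the script below uses a made-up sample sentence (an assumption, not the original data) and the standard timeit module. Absolute timings will vary by machine and Python version, so none are quoted here.

```python
import timeit

# Hypothetical sample text; the original session's string `s` is not shown.
sample = "I am not sure how all of this can take so long for a short article"
words = sample.split(" ") * 10

to_match = ["NTLK", "all", "long", "I"]
set_match = set(to_match)  # built once, reused for every lookup


def count_with_list():
    # Each `in` test scans the list: linear in len(to_match).
    return len([word for word in words if word in to_match])


def count_with_set():
    # Each `in` test is a hash lookup: constant time on average.
    return len([word for word in words if word in set_match])


if __name__ == "__main__":
    print("list:", timeit.timeit(count_with_list, number=10000))
    print("set: ", timeit.timeit(count_with_set, number=10000))
```

With only four words to match, the gap between the two is modest; it should widen as the bag of words grows, because the list scan is linear in its length.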

Some other tests:

In [22]: p = re.compile('|'.join(set_match))

In [23]: p
Out[23]: re.compile(r'I|all|NTLK|long')

In [24]: p = re.compile('|'.join(set_match))

In [28]: def h():
    ...:     return len(filter(p.match, words))

In [29]: timeit(h, number = 10000)
Out[29]: 2.2606470584869385
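
The regex test above was run on Python 2, where filter returns a list. On Python 3, filter returns an iterator, so len(filter(...)) raises a TypeError. In addition, p.match only anchors at the start of a token, so an alternative such as I also accepts longer tokens that begin with it. A minimal Python 3 sketch, with made-up sample tokens, that escapes the words and matches whole tokens only:

```python
import re

# Hypothetical sample data for illustration.
words = ["I", "like", "all", "long", "Icelandic", "allowances"]
set_match = {"NTLK", "all", "long", "I"}

# re.escape guards against words that contain regex metacharacters.
pattern = re.compile("|".join(re.escape(w) for w in set_match))

# fullmatch requires the whole token to match, unlike match, which only
# anchors at the start and accepts "Icelandic" via the "I" alternative.
amount = sum(1 for word in words if pattern.fullmatch(word))
prefix_hits = sum(1 for word in words if pattern.match(word))

print(amount, prefix_hits)  # 3 5
```

This only makes the regex variant correct; the timings above suggest it is still slower than a plain set lookup.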

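【Comments】:

【Solution 2】:

Use a set for membership testing.

Another approach is to count the occurrences of each distinct word first and then, if the word is in the set, add its count to the total. This helps when the article contains repeated words and is not very short. Say an article contains "the" 10 times: we now test membership once instead of 10 times.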

from collections import Counter
def f():
    return sum(c for word, c in Counter(check).items() if word in words)
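
The snippet above does not define check or words. As a sketch, assuming check is the tokenized article and words is the set of words to match, the version below uses the question's names and made-up data:

```python
from collections import Counter

# Hypothetical data for illustration.
tokenized_article = ["the", "war", "in", "bosnia", "and", "the", "war", "years"]
bag_of_words_set = {"war", "bosnia", "sarajevo"}


def count_matches(tokens, wanted):
    # Count each distinct token once, then test membership once per
    # distinct token instead of once per occurrence.
    return sum(c for word, c in Counter(tokens).items() if word in wanted)


amount = count_matches(tokenized_article, bag_of_words_set)
print(amount)  # 3: "war" twice plus "bosnia" once
```

Whether this beats the plain comprehension depends on how repetitive the article is, since building the Counter is itself a pass over all tokens. It is worth timing on real data.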

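【Comments】:

【Solution 3】:

If you don't want the counts, it is no longer a "bag of words" but a set of words. So convert your document to a set if that is really the case.

Avoid for loops and lambda functions, especially nested ones. They require a lot of interpreter work and are slow. Instead, try to use optimized calls such as intersection (for performance, libraries such as numpy are also very good, because they do the work in low-level C/Fortran/Cython code).

i.e.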

count = len(bag_of_words_set.intersection( set(tokenized_article) ))

where bag_of_words_set contains the words you are interested in, as a set.
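
Note that the intersection counts each distinct matching word once, not every occurrence. A small sketch with made-up data contrasts the two results:

```python
# Hypothetical data for illustration.
tokenized_article = ["war", "bosnia", "war", "peace", "war"]
bag_of_words_set = {"war", "bosnia", "sarajevo"}

# Distinct matching words: "war" and "bosnia".
distinct = len(bag_of_words_set.intersection(tokenized_article))

# Total matching occurrences: "war" three times plus "bosnia" once.
total = sum(1 for word in tokenized_article if word in bag_of_words_set)

print(distinct, total)  # 2 4
```

set.intersection accepts any iterable, so the explicit set(tokenized_article) conversion is optional.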

If you want a classic word count, use collections.Counter:

from collections import Counter
counter = Counter()
...
counter.update(tokenized_article)

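This counts all words, including those not in your list. You can try the following instead, but it may turn out to be slower because of the loop: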

bag_of_words_set = set(bag_of_words)
...
for w in tokenized_article:
    if w in bag_of_words_set: # use a set, not a list!
        counter[w] += 1

Using two Counters is somewhat more complex, but potentially faster: one for the overall total and one for the current document.

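doc_counter.clear()
doc_counter.update( tokenized_article )
# Iterate over a copy of the keys: deleting from a dict while
# iterating over it raises RuntimeError on Python 3.
for w in list(doc_counter.keys()):
  if w not in bag_of_words_set: del doc_counter[w]
counter.update(doc_counter) # untested.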

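Using a counter per document pays off if you have many repeated unwanted words, because it saves lookups. It is also better suited to multithreaded operation (easier to synchronize).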
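
A runnable sketch of the two-counter idea over several documents, using made-up articles. It builds a filtered dict instead of deleting keys in place, which is a small variation on the snippet above rather than the answer author's exact code.

```python
from collections import Counter

# Hypothetical data for illustration.
bag_of_words_set = {"war", "bosnia", "sarajevo"}
articles = [
    ["war", "in", "bosnia", "war"],
    ["sarajevo", "after", "the", "war"],
]

total_counter = Counter()
for tokenized_article in articles:
    doc_counter = Counter(tokenized_article)
    # Keep only the wanted words; one membership test per distinct word.
    wanted = {w: c for w, c in doc_counter.items() if w in bag_of_words_set}
    total_counter.update(wanted)  # adds the counts to the running total

print(total_counter)
print(sum(total_counter.values()))  # 5
```

The per-word counts in total_counter are a by-product; if only the grand total matters, summing the filtered counts for each document directly avoids the second Counter.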

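【Comments】:

Thanks. It clearly executes faster, but the number of words counted is wrong. Also, I used len(word_set.intersection( set(document) )) because, unfortunately, there is no attribute like 'intersection_size'.
This version counts each word only once per document, because that seemed to be what you described. Use from collections import Counter, rather than sets, to get the number of occurrences.
Sorry, I thought Python had an optimized intersection_size, but apparently it doesn't. Then len is a workaround, but slower. But apparently you don't want the set-level count anyway.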