NLP in Python3 - count up occurrences of specific terms in a large string: answers

【Question title】: NLP in Python3 - count up occurrences of specific terms in a large string
【Posted】: 2019-08-23 22:08:07
【Question】:

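I have a lot of files, each containing multiple pages of text. While iterating over each file, I want to extract counts of the terms I'm particularly interested in.

For example, I have something like the following (a simplified example; the real input is 2-5 pages of text):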

to_process = 'soccer football soccer asdlkj assdasda asdsasad  football soccer'
print(to_process)

I want to count how many times 'soccer' and 'football' appear in the text:

dict_of_counts = {'soccer':0,'football':0}
print(dict_of_counts)

The expected output is:

expected_output = {'soccer':3,'football':2}

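Can anyone give me some pointers on the most efficient way to solve this? (I have thousands of papers and hundreds of terms to look for.)

【Comments on the question】:

Tags: python-3.x pandas numpy nlp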

【Solution 1】:

You can use a dict comprehension together with collections.Counter and re.sub:

import re
from collections import Counter

to_process = '>>SocceR... !football! soccer *asdlkj assdasda? asdsasad ; FOOtball;  soCCer'

words = ['soccer', 'football']

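# Replace each run of non-word characters with a space, lowercase, split, then count every token
all_counts = Counter(re.sub(r'\W+', ' ', to_process).lower().split())

# Counter returns 0 for missing keys, so terms that never occur still appear in the result
dict_of_counts = {w: all_counts[w] for w in words}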

print(dict_of_counts)

Output:

{'soccer': 3, 'football': 2}
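
Since the question mentions thousands of documents and hundreds of terms, the same idea can be wrapped in a function that tokenizes each document once and then looks every term up in the resulting Counter. The following is only a minimal sketch under assumptions: the helper name `count_terms`, the `documents` dict, and its sample texts are hypothetical stand-ins (real code would read the texts from files), and it only handles single-word terms.

```python
import re
from collections import Counter

# Compile the tokenizer once so it can be reused for every document.
TOKEN_RE = re.compile(r"\w+")


def count_terms(text, terms):
    """Return {term: count} for single-word terms, ignoring case and punctuation.

    Hypothetical helper for illustration; not part of the original answer.
    """
    # Tokenize the document once; each term lookup afterwards is a dict access.
    tokens = Counter(TOKEN_RE.findall(text.lower()))
    return {term: tokens[term.lower()] for term in terms}


# Stand-in documents (assumption); in practice these would be read from files.
documents = {
    "doc_a": ">>SocceR... !football! soccer",
    "doc_b": "football.. FOOtball; soCCer",
}
terms = ["soccer", "football"]

per_document = {name: count_terms(text, terms) for name, text in documents.items()}
print(per_document)
```

Because each document is tokenized once, the cost per document grows with its length plus the number of terms, rather than with their product. Multi-word terms would need a different approach, since the tokenizer splits on whitespace.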

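【Comments】:

Thanks for the reply! This approach works well as long as no punctuation is involved.
@FlyingZebra1 To handle that case, use NLTK (I just edited the answer to show how).
Thanks. NLTK is definitely a package I've heard of before, but I need to look into why it still isn't picking up messy punctuation like football.. (note the two periods)
@FlyingZebra1 In that case you don't need any third-party library; just use Python's re and collections modules (check the edit).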

【Solution 2】:

To make your code handle case and punctuation, I suggest using the flashtext package:

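from flashtext import KeywordProcessor  # third-party package: pip install flashtext

to_process = 'Soccer, football soccer, asdlkj assdasda asdsasad  football; soccer.'
words_to_look_for = ['soccer', 'football']

# KeywordProcessor matches case-insensitively by default
kp = KeywordProcessor()
for a in words_to_look_for:
    kp.add_keyword(a)

# extract_keywords returns one list entry per match found in the text
foundList = kp.extract_keywords(to_process)

# Tally the matches; terms that never occur will be absent from the dict
dict_of_counts = {}
for a in foundList:
    dict_of_counts[a] = dict_of_counts.get(a, 0) + 1
print(dict_of_counts)
# {'soccer': 3, 'football': 2}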
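
If adding a third-party dependency is not an option, a rough standard-library approximation is a single compiled regex that alternates over all the terms. This is a sketch under assumptions, not a reimplementation of flashtext: the helper names `build_matcher` and `count_phrases`, the extra terms 'world cup' and 'rugby', and the sample text are hypothetical, and multi-word terms only match when the whitespace in the text is exactly as written in the term.

```python
import re


def build_matcher(terms):
    """Compile one case-insensitive pattern matching any term as a whole word/phrase."""
    # Try longer terms first so a phrase wins over a shorter term it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


def count_phrases(text, terms, matcher=None):
    """Return {term: count}; terms that never occur are reported as 0."""
    matcher = matcher or build_matcher(terms)
    counts = dict.fromkeys(terms, 0)
    # Map the lowercased match text back to the term as the caller spelled it.
    canonical = {t.lower(): t for t in terms}
    for match in matcher.finditer(text):
        counts[canonical[match.group(0).lower()]] += 1
    return counts


terms = ["soccer", "football", "world cup", "rugby"]
text = "Soccer, football soccer, asdlkj  football; soccer. World Cup!"
print(count_phrases(text, terms))
# {'soccer': 3, 'football': 2, 'world cup': 1, 'rugby': 0}
```

Matches are non-overlapping, so a term embedded in a longer matched phrase is not counted separately. When processing many documents, the matcher can be built once and passed in via the `matcher` argument.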

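【Comments】:

Thanks for thinking ahead about punctuation/capitalization for me. I did go ahead and convert all the text to lowercase, but this package does seem to handle punctuation well.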