Extracting all nouns from a text file using nltk: answers

【Question title】: Extracting all nouns from a text file using nltk
【Posted】: 2020-08-26 16:22:02
【Question description】:

Is there a more efficient way to do this? My code reads a text file and extracts all the nouns.

import nltk

with open(fileName) as f:  # fileName: path to the input text file
    text = f.read()  # read the whole file as one string
sentences = nltk.sent_tokenize(text)  # split the text into sentences
nouns = []  # empty list to hold all nouns

for sentence in sentences:
    for word, pos in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if pos == 'NN' or pos == 'NNP' or pos == 'NNS' or pos == 'NNPS':
            nouns.append(word)

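How can I reduce the time complexity of this code? Is there a way to avoid the nested for loops?

Thanks in advance!

【Discussion】:

Replace the if condition with if pos.startswith('NN'):, and use a set or collections.Counter rather than keeping a list. Also do some map/reduce instead of a list comprehension. Otherwise, try shallow parsing, also known as chunking.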
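
The first two suggestions in that comment could look roughly like the sketch below. It is only an illustration, not the commenter's code: count_nouns is a hypothetical helper name, and the input is assumed to be (word, tag) pairs in the shape that nltk.pos_tag returns, using Penn Treebank tags.

```python
from collections import Counter

def count_nouns(tagged_words):
    """Count noun tokens in an iterable of (word, tag) pairs.

    Assumes Penn Treebank tags, where every noun tag starts with 'NN'.
    """
    return Counter(word for word, tag in tagged_words if tag.startswith("NN"))

# Hand-written sample in the shape nltk.pos_tag returns (not real tagger output).
sample = [("dogs", "NNS"), ("chase", "VBP"), ("cats", "NNS"), ("dogs", "NNS")]
noun_counts = count_nouns(sample)
print(noun_counts)  # Counter({'dogs': 2, 'cats': 1})
```

A Counter keeps one entry per distinct noun plus its frequency; set(noun_counts) gives just the distinct nouns if the counts are not needed.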

Tags: python nltk

【Solution 1】:

If you're open to options other than NLTK, check out TextBlob. It makes it easy to extract all nouns and noun phrases:

>>> from textblob import TextBlob
>>> txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the inter
actions between computers and human (natural) languages."""
>>> blob = TextBlob(txt)
>>> print(blob.noun_phrases)
[u'natural language processing', 'nlp', u'computer science', u'artificial intelligence', u'computational linguistics']

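【Discussion】:

You say "it makes it easy to extract all nouns and noun phrases", but I don't see an option to extract only nouns. In your example, how could I get just the nouns, such as "computer" or "science"?
You can filter blob.tags to keep only NN, with something like [n for n, t in blob.tags if t == 'NN'].
Personally, I found that TextBlob performs worse than nltk.
The code may be simpler, but TextBlob calls NLTK for tokenization and tagging. This cannot reduce the "time complexity" of the OP's code.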

【Solution 2】:

import nltk

lines = 'lines is some string of words'
# function to test if something is a noun
is_noun = lambda pos: pos[:2] == 'NN'
# do the nlp stuff
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if is_noun(pos)] 

print(nouns)
# Output: ['lines', 'string', 'words']

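Useful tip: a list comprehension is usually a faster way to build a list than adding elements with .insert() or .append() inside a for loop.

【Discussion】:

The answer is on the right track. It's cleaner with this: is_noun = lambda pos: True if pos[:2] == 'NN' else False. Note: a list comprehension need not be faster than a for loop. It's just that you don't have to materialize the list, and the nested loop can be processed as a generator instead of a list.
@alvas - I didn't use something like ... pos[:2] == 'NN' ... because it could match unwanted strings. For all I know, pos could take a value like 'NNA', which we wouldn't want to match. Strictly speaking, the True if and else False parts aren't necessary either, but I included them for clarity. Good point about list comprehensions not necessarily being faster than loops (I suppose I was being glib there) - I've edited the post accordingly.
Just out of curiosity, can you give an example of 'NNA'? That way we could run some checks on other things in NLTK that are unrelated to this question =). Technically, there should be no tags outside this tagset: ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
@alvas - The scenario I raised was hypothetical. My point is that I don't know in advance what values the 'pos' variable might take (perhaps I should have said something like 'NNABCDEFG' instead of 'NNA' to make the idea clearer), so to be safe I used the conditional from the original question. That conditional, and any other part of my answer, can be modified as needed; I suspect the performance difference between the 'pos[:2]' variant and the long conditional I proposed is negligible.
@alvas - OK, I've edited the post to include your suggestion and make the answer clearer. Cheers ;)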
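
To make the trade-off in this exchange concrete, here is a small sketch comparing the two checks. The helper names is_noun_exact and is_noun_prefix are made up for illustration, and 'NNA' is the hypothetical tag from the discussion, not a real Penn Treebank tag.

```python
# The four Penn Treebank noun tags listed in the original question.
NOUN_TAGS = frozenset({"NN", "NNS", "NNP", "NNPS"})

def is_noun_exact(tag):
    """True only for the four known noun tags."""
    return tag in NOUN_TAGS

def is_noun_prefix(tag):
    """True for any tag beginning with 'NN', known or not."""
    return tag[:2] == "NN"

# Print both verdicts for two real noun tags, the hypothetical tag, and a verb tag.
for tag in ("NN", "NNPS", "NNA", "VB"):
    print(tag, is_noun_exact(tag), is_noun_prefix(tag))
```

The two checks agree on every tag in the Penn Treebank tagset and differ only on tags outside it, which is the case the answerer was guarding against.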

【Solution 3】:

You can get good results with nltk, TextBlob, spaCy, or any of many other libraries. They can all do the job, but with different efficiency.

import nltk
from textblob import TextBlob
import spacy
nlp = spacy.load('en_core_web_sm')  # the 'en' shortcut no longer works in spaCy 3
nlp1 = spacy.load('en_core_web_lg')

txt = """Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages."""

I ran some comparisons in a Jupyter notebook on my Windows 10 HP laptop (i5, 2 cores, 4 logical processors, 8 GB RAM). The results are below.

For TextBlob:

%%time
print([w for (w, pos) in TextBlob(txt).pos_tags if pos[0] == 'N'])

The output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 8.01 ms #average over 20 iterations

For nltk:

%%time
print([word for (word, pos) in nltk.pos_tag(nltk.word_tokenize(txt)) if pos[0] == 'N'])

The output is

>>> ['language', 'processing', 'NLP', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 7.09 ms #average over 20 iterations

For spaCy:

%%time
print([token.text for token in nlp(txt) if token.pos_ == 'NOUN'])

The output is

>>> ['language', 'processing', 'field', 'computer', 'science', 'intelligence', 'linguistics', 'inter', 'actions', 'computers', 'languages']
    Wall time: 30.19 ms #average over 20 iterations

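It seems nltk and TextBlob are reasonably fast, which is to be expected since they store nothing else about the input text txt. spaCy is much slower. One more thing: spaCy missed the noun NLP, while nltk and TextBlob caught it. I would go with nltk or TextBlob unless I wanted to extract something else from the input txt.

See the spaCy quickstart.
See some TextBlob basics.
See the NLTK HowTos.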
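
The answer does not show how the 20-iteration averages were taken. Outside a notebook, one way to get a comparable figure is sketched below. average_wall_time_ms is a hypothetical helper, and the workload is a stand-in so the snippet runs without any NLP library installed.

```python
import time

def average_wall_time_ms(func, repeats=20):
    """Call func() `repeats` times and return the mean wall time in milliseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        func()
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / repeats

# Stand-in workload; swap in e.g. lambda: nltk.pos_tag(nltk.word_tokenize(txt)).
ms = average_wall_time_ms(lambda: sorted(range(1000)))
print(f"{ms:.3f} ms per call")
```

The first call to a tagger often includes one-time model loading, so a warm-up call before timing may give a fairer comparison.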

【Discussion】:

spaCy missed NLP because it tagged it as a proper noun (PROPN). spaCy is slower because it does more, but you can disable the syntactic parser to speed it up.
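
A rough sketch of what this comment suggests, assuming spaCy and the en_core_web_sm model are installed (the model name and both helper names are assumptions, not taken from the answer). spaCy is imported lazily so the filtering helper can be tried with stand-in tokens.

```python
from types import SimpleNamespace

def load_light_pipeline(model="en_core_web_sm"):
    """Load a spaCy model without the parser and NER components.

    The model name is an assumption; it must be installed separately.
    """
    import spacy  # imported lazily so the helper below works without spaCy
    return spacy.load(model, disable=["parser", "ner"])

def nouns_and_proper_nouns(doc):
    """Return the text of tokens whose coarse tag is NOUN or PROPN."""
    return [token.text for token in doc if token.pos_ in {"NOUN", "PROPN"}]

# Stand-in tokens with hand-assigned tags (not real spaCy output).
fake_doc = [
    SimpleNamespace(text="NLP", pos_="PROPN"),
    SimpleNamespace(text="is", pos_="AUX"),
    SimpleNamespace(text="field", pos_="NOUN"),
]
print(nouns_and_proper_nouns(fake_doc))  # ['NLP', 'field']

# With spaCy installed: nouns_and_proper_nouns(load_light_pipeline()(txt))
```

Accepting PROPN alongside NOUN is what would bring a token like NLP back into the result, provided the model tags it as a proper noun.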

【Solution 4】:

import nltk
lines = 'lines is some string of words'
tokenized = nltk.word_tokenize(lines)
nouns = [word for (word, pos) in nltk.pos_tag(tokenized) if pos[:2] == 'NN']
print(nouns)

Just simplified a bit more.

【Discussion】:

【Solution 5】:

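I'm not an NLP expert, but I think you're already pretty close, and there probably isn't a way to do better than quadratic time complexity in those outer loops.

Recent versions of NLTK have a built-in function that does what you are doing by hand, nltk.tag.pos_tag_sents; it also returns a list of lists of tagged words.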
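
A minimal sketch of how pos_tag_sents might replace the per-sentence tagging call, assuming nltk and its tokenizer and tagger data are installed. The helper names are made up for illustration, and nltk is imported lazily so the flattening step can be tried on hand-written data.

```python
def tag_sentences(text):
    """Tokenize text into sentences and tag them in one batched call.

    Requires nltk plus its tokenizer and tagger data to be downloaded.
    """
    import nltk  # imported lazily so the helper below runs without nltk
    sentences = nltk.sent_tokenize(text)
    return nltk.pos_tag_sents([nltk.word_tokenize(s) for s in sentences])

def nouns_from_tagged_sents(tagged_sents):
    """Flatten tagged sentences, keeping words whose tag starts with 'NN'."""
    return [word for sent in tagged_sents for word, tag in sent if tag.startswith("NN")]

# Hand-written sample in the shape pos_tag_sents returns (not real tagger output).
sample = [[("Dogs", "NNS"), ("bark", "VBP")], [("Paris", "NNP"), ("is", "VBZ"), ("big", "JJ")]]
print(nouns_from_tagged_sents(sample))  # ['Dogs', 'Paris']
```

The double loop is still there inside the comprehension; batching only avoids setting up the tagger once per sentence.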

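【Discussion】:

【Solution 6】:

Your code has no redundancy: you read the file once and visit each sentence and each tagged word exactly once. However you write the code (e.g., with a comprehension), you will only be hiding the nested loops, not skipping any processing.

The only potential for improvement is in its space complexity: instead of reading the whole file at once, you could read it incrementally. But since you need to process a whole sentence at a time, it's not as simple as reading and processing one line at a time, so I wouldn't bother unless your files are whole gigabytes long. For short files it won't make any difference.

In short, your loops are fine. There are one or two things in your code that could be cleaned up (e.g., the if clause that matches the POS tags), but they won't change anything efficiency-wise.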
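
For the incremental reading mentioned above, one possible approach is to buffer lines until a blank line and process one paragraph at a time. This is a sketch under the assumption that no sentence spans a blank line; iter_paragraphs is a hypothetical helper, and io.StringIO stands in for an open file.

```python
import io

def iter_paragraphs(lines):
    """Yield blank-line-separated paragraphs from an iterable of lines.

    Assumes no sentence spans a blank line, so each paragraph can be
    sentence-tokenized on its own.
    """
    buffer = []
    for line in lines:
        if line.strip():
            buffer.append(line.strip())
        elif buffer:
            # A blank line closes the current paragraph.
            yield " ".join(buffer)
            buffer = []
    if buffer:
        # Flush the last paragraph if the input does not end with a blank line.
        yield " ".join(buffer)

# io.StringIO stands in for an open text file here.
fake_file = io.StringIO("First sentence. Second\nsentence.\n\nNew paragraph.\n")
for paragraph in iter_paragraphs(fake_file):
    print(paragraph)
```

Each yielded paragraph can then be passed to nltk.sent_tokenize, so only one paragraph is held in memory at a time.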

【Discussion】: