nltk only processing last string in txt file

【Question title】: nltk only processing last string in txt file
【Posted】: 2018-07-20 12:53:37
【Question】:

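I have a .txt file that contains four strings, all separated by newlines.

When I tokenize the file, it processes every line of data, which is perfect.

However, when I try to remove stop words from the file, it only removes them from the last string.

I want to process everything in the file, not just the last sentence.

My code: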

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

with open('example.txt') as fin:
    for tkn in fin:
        print(word_tokenize(tkn))


# STOP WORDS
stop_words = set(stopwords.words("english"))

words = word_tokenize(tkn)

stpWordsRemoved = []

for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)

print("STOP WORDS REMOVED: ", stpWordsRemoved)

Output:

['this', 'is', 'an', 'example', 'of', 'how', 'stop', 'words', 'are', 'utilized', 'in', 'natural', 'language', 'processing', '.']
[]
['drive', 'driver', 'driving', 'driven']
[]
['smile', 'smiling', 'smiled']
[]
['there', 'are', 'multiple', 'words', 'here', 'that', 'you', 'should', 'be', 'able', 'to', 'use', 'for', 'lemmas/synonyms', '.']
STOP WORDS REMOVED:  ['multiple', 'words', 'able', 'use', 'lemmas/synonyms', '.']

As shown above, it only processes the last line.

Edit: contents of my txt file:

this is an example of how stop words are utilized in natural language processing.
A driver goes on a drive while being driven mad. He is sick of driving.  

smile smiling smiled 

there are multiple words here that you should be able to use for lemmas/synonyms.

【Question comments】:

Tags: python nltk tokenize stop-words

【Solution 1】:

Consider moving your stop-word removal into the loop that reads the file line by line, like this:

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words("english"))
with open("d:/example.txt") as the_file:
    for each_line in the_file:
        # Tokenize each line once, inside the loop
        words = nltk.word_tokenize(each_line)
        print(words)
        # Filter this line's tokens before moving on to the next line
        stp_words_removed = [word for word in words if word not in stop_words]
        print("STOP WORDS REMOVED: ", stp_words_removed)

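From your description, it looks like you are only feeding the last line into the stop-word remover. What I don't understand is why, if that is the case, you are getting all those empty lists.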
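
The behaviour described here can be reproduced without NLTK. The following is a minimal sketch, assuming an in-memory file (`io.StringIO`) in place of `example.txt` and `str.split` as a stand-in for `word_tokenize`. It illustrates that once a `for` loop has finished, the loop variable still refers to the last item, so code placed after the loop only ever sees the last line. With this stand-in tokenizer the blank line also yields an empty list, which may be where the empty lists in the question's output come from.

```python
import io

# Hypothetical stand-in for example.txt; the contents are illustrative only.
fake_file = io.StringIO("first line here\n\nsecond line here\nlast line here\n")

seen_inside_loop = []
for line in fake_file:
    # Inside the loop every line is visited, including the blank one.
    seen_inside_loop.append(line.split())

# After the loop, `line` still refers to the final line that was read.
seen_after_loop = line.split()

print(seen_inside_loop)
print(seen_after_loop)
```

Anything that should run for every line therefore has to sit inside the loop body, or operate on a collection that was filled inside the loop.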

【Comments】:

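This works. I'm going to use more nltk techniques, such as lemmatization, part-of-speech tagging, and so on. Do I have to include those in this loop in order to read the whole file? I imagine it would get very messy. Also, I believe the empty lists in my output are just the newlines.
@Yunter You can also store the results in another variable, for example by appending each list of words (tokens) to another list for later use. By the way, if you like my answer, please mark it as accepted. I'd be happy to discuss this topic further with you.
Accepted, thanks. I definitely prefer storing it in a variable, since that is probably cleaner. I tried doing that, but it still only stores the last string.
@Yunter I would do it like this:
    article = []
    for each_line in the_file:
        print(nltk.word_tokenize(each_line))
        words = nltk.word_tokenize(each_line)
        article.append(words)
Thanks for your help, much appreciated. I now have a problem lemmatizing this list, because it prints the strings from the file multiple times (a different number of times for each string). If I can't debug it, I may ask a separate question, since it is a different issue from this thread. Thanks again!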
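
One way to keep the loop from getting messy, as raised in the comments above, is to collect the tokens per line once and then run each later step as its own function over the stored list. The sketch below is only an illustration under stated assumptions: `tokenize` uses `str.split` as a stand-in for `nltk.word_tokenize`, `STOP_WORDS` is a tiny hand-written set rather than NLTK's English list, and all function names are hypothetical.

```python
import io

# Tiny illustrative stop-word set (an assumption; not NLTK's English list).
STOP_WORDS = {"this", "is", "an", "of", "a", "are", "on"}


def tokenize(line):
    """Stand-in tokenizer; swap in nltk.word_tokenize in real code."""
    return line.lower().split()


def read_tokens(lines):
    """Return one token list per non-blank line."""
    return [tokenize(line) for line in lines if line.strip()]


def remove_stop_words(tokens):
    """Drop tokens that appear in STOP_WORDS."""
    return [tok for tok in tokens if tok not in STOP_WORDS]


# Hypothetical in-memory replacement for the text file.
source = io.StringIO("This is an example\n\nA driver goes on a drive\n")

article = read_tokens(source)                        # step 1: tokenize once
filtered = [remove_stop_words(t) for t in article]   # step 2: filter each line

print(article)
print(filtered)
```

Further steps, such as lemmatization, could be written as additional functions and applied to `filtered` in the same way, so the file is read only once.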

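【Solution 2】:

You need to add the result of word_tokenize to a list for every line, and then process that list. In your example, you only pick up the last line of the file, after the loop over the file has finished.

Try: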

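from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

words = []
with open('example.txt') as fin:
    for tkn in fin:
        if tkn.strip():  # skip blank lines
            # extend() adds the individual tokens; append() would nest a list
            words.extend(word_tokenize(tkn))

# STOP WORDS
stop_words = set(stopwords.words("english"))

stpWordsRemoved = []

for stp in words:
    if stp not in stop_words:
        stpWordsRemoved.append(stp)

print("STOP WORDS REMOVED: ", stpWordsRemoved)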

【Comments】:

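This results in the following error: if stp not in stop_words: TypeError: unhashable type: 'list'
Sorry, I think I added an extra set of brackets around the word_tokenize call. Could you try the updated solution again?
Could you also post the contents of your text file? That way I can test the snippet.
I've posted it. Thanks.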
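
The `TypeError` reported in the comments above is what Python raises when a list is tested for membership in a set, which happens if whole token lists, rather than individual tokens, end up in `words`. A minimal sketch of the difference, using hand-written token lists and a small illustrative stop-word set instead of NLTK (both are assumptions made for the example):

```python
stop_words = {"is", "an", "of"}  # illustrative set, not NLTK's list
token_lists = [["this", "is"], ["an", "example"]]  # stand-in for tokenized lines

nested = []
flat = []
for tokens in token_lists:
    nested.append(tokens)  # adds the whole list as a single element
    flat.extend(tokens)    # adds each token individually

try:
    [item for item in nested if item not in stop_words]
    error_message = None
except TypeError as exc:
    # Lists are unhashable, so the set membership test fails.
    error_message = str(exc)

kept = [tok for tok in flat if tok not in stop_words]
print(error_message)
print(kept)
```

If a per-line structure is wanted, keeping the nested list and filtering each inner list separately is an alternative to flattening.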