【Question title】: Removing Stop words from NLTK
【Posted】: 2013-05-12 21:35:38
【Question】:

I am trying to read a text file (foo1.txt), remove all stop words defined by nltk, and write the result to another file (foo2.txt). The code is below. Required import: from nltk.corpus import stopwords

def stop_words_removal(): 
    with open("foo1.txt") as f:
            reading_file_line = f.readlines() #entire content, return  list 
            #print reading_file_line #list
            reading_file_info = [item.rstrip('\n') for item in reading_file_line]
            #print reading_file_info #List and strip \n
            #print ' '.join(reading_file_info)
            '''-----------------------------------------'''
            #Filtering & converting to lower letter
            for i in reading_file_info:
                words_filtered = [e.lower() for e in i.split() if len(e) >= 4]                
                print words_filtered

            '''-----------------------------------------'''
            '''removing the strop words from the file'''
            word_list = words_filtered[:] 
            #print word_list
            for word in words_filtered:
                        if word in nltk.corpus.stopwords.words('english'): 
                            print word
                            print word_list.remove(word)

            '''-----------------------------------------'''
            '''write the output in a file'''
            z = ' '.join(words_filtered)
            out_file = open("foo2.txt", "w")
            out_file.write(z)
            out_file.close()  

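The problem is that the second part of the code, the block that is supposed to remove the stop words from the file, does not work. Any suggestions would be greatly appreciated. Thanks.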

Example Input File: 
'I a Love this car there', 'positive',
'This a view is amazing there', 'positive',
'He is my best friend there', 'negative'

Example Output:
['love', "car',", "'positive',"]
['view', "amazing',", "'positive',"]
['best', "friend',", "'negative'"]

I tried the suggestions in the link, but none of them worked.
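
For readers following along, here is a reduced sketch of why the stop-word block in the code above seems to do nothing. It is an illustration under assumptions, not the asker's exact program: `STOP_WORDS` is a tiny hypothetical stand-in for `nltk.corpus.stopwords.words('english')`, and `lines` stands in for the contents of foo1.txt. Two things appear to go wrong: `words_filtered` is reassigned on every pass of the loop, so only the last line survives, and the cleaned copy `word_list` is never the list that gets written out.

```python
# Hypothetical stand-ins: a tiny stop-word set and in-memory lines instead of
# nltk.corpus.stopwords.words('english') and foo1.txt.
STOP_WORDS = {"this", "there"}
lines = ["I a Love this car there", "This a view is amazing there"]

# Pattern from the question: the list is reassigned on every pass, so after
# the loop it only holds the words of the last line.
for line in lines:
    words_filtered = [w.lower() for w in line.split() if len(w) >= 4]

word_list = words_filtered[:]
for word in words_filtered:
    if word in STOP_WORDS:
        word_list.remove(word)  # the copy is cleaned ...

buggy_output = " ".join(words_filtered)  # ... but the uncleaned list is written

# One possible repair: filter each line inside the loop and collect the results.
fixed_lines = []
for line in lines:
    kept = [w.lower() for w in line.split()
            if len(w) >= 4 and w.lower() not in STOP_WORDS]
    fixed_lines.append(" ".join(kept))
```

In this sketch `buggy_output` still contains "this" and "there" and covers only the second line, while `fixed_lines` holds one cleaned string per input line.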

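【Comments on the question】:

  • Are you sure this is the output you want? Do you need the punctuation?
  • @elyase Thanks for the reply. I don't actually need the square brackets, but I do need each line to be clearly separated. The code you posted below only works for the last line of the file. I want to remove the stop words from every line of the text.
  • OK, I edited my answer.
  • @elyase, thanks mate. The code you wrote below works like a charm. As you mentioned, I just imported future and string, because I'm using python 2.7. Thanks again :)

Tags: python nltk stop-words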


【Solution 1】:

This is what I would do inside your function:

import string
from nltk.corpus import stopwords

with open('input.txt', 'r') as inFile, open('output.txt', 'w') as outFile:
    for line in inFile:
        # translate(None, ...) is the Python 2 str form; it strips punctuation
        print(' '.join([word for word in line.lower().translate(None, string.punctuation).split()
              if len(word) >= 4 and word not in stopwords.words('english')]), file=outFile)

Don't forget to add, at the top of the file:

from __future__ import print_function

if you are using Python 2.x.
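
On Python 3 the two-argument `translate(None, ...)` call is not available on `str`, so the same idea needs a translation table built with `str.maketrans`. The following is only a sketch of how the approach above might be ported. The names `clean_line`, `clean_file`, `load_nltk_stop_words` and the `min_len` parameter are made up for illustration and are not part of NLTK. The stop words are passed in as a set that is built once, since `stopwords.words('english')` returns a list and calling it for every word repeats the lookup.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def load_nltk_stop_words():
    """Return NLTK's English stop words as a set (needs the 'stopwords' corpus)."""
    from nltk.corpus import stopwords  # imported lazily so the rest runs without NLTK
    return set(stopwords.words("english"))


def clean_line(line, stop_words, min_len=4):
    """Lowercase, strip punctuation, then drop short words and stop words."""
    words = line.lower().translate(_PUNCT_TABLE).split()
    return " ".join(w for w in words if len(w) >= min_len and w not in stop_words)


def clean_file(in_path, out_path, stop_words):
    """Write one cleaned line to out_path for every line of in_path."""
    with open(in_path) as in_file, open(out_path, "w") as out_file:
        for line in in_file:
            print(clean_line(line, stop_words), file=out_file)
```

With the NLTK stopwords corpus installed, a call such as `clean_file("foo1.txt", "foo2.txt", load_nltk_stop_words())` would process the file from the question line by line. Because the stop-word set is a parameter, `clean_line` can also be tried with a small hand-written set.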

【Discussion】:
