【Question title】: Remove lines from a text file if the line consists only of a stopword
【Posted】: 2019-07-27 18:22:25
【Question】:

I want to remove lines from Myfile.txt when a line consists of a stopword and nothing else.

For example, a sample of Myfile.txt:

Adh Dhayd
Abu Dhabi is      # contains the stopword "is", but keep this line because it also contains "Abu Dhabi"
Zaranj
of                # contains only a stopword, so this line should be removed
on                # contains only a stopword, so this line should be removed
Taloqan
Shnan of          # contains the stopword "of", but keep this line because it also contains "Shnan"
is                # contains only a stopword, so this line should be removed
Shibirghn
Shahrak
from              # contains only a stopword, so this line should be removed

I have this code as an example:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize



example_sent = "This is a sample sentence, showing off the stop words filtration."

stop_words = set(stopwords.words('english'))

word_tokens = word_tokenize(example_sent)

filtered_sentence = [w for w in word_tokens if not w in stop_words]

filtered_sentence = []

for w in word_tokens:
    if w not in stop_words:
        filtered_sentence.append(w)

print(word_tokens)
print(filtered_sentence)

So what would the solution code be for the Myfile.txt shown above?

【Discussion】:

    Tags: python python-3.x text nltk stop-words


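    【Solution 1】:

    You can check whether the line matches any stopword and, if it does not, append it to the filtered content. This filters every line that consists of exactly one stopword. If lines made up of several stopwords should also be filtered, tokenize the line and intersect its tokens with stop_words. The single-stopword version: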

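    from nltk.corpus import stopwords

    stop_words = set(stopwords.words('english'))

    # Read every line, keeping only those that are not exactly one stopword.
    with open("test.txt", "r") as f:
        filtered_content = [line for line in f.read().splitlines()
                            if line not in stop_words]

    # "w" overwrites the output, so rerunning does not append duplicate lines.
    with open("test_filter.txt", "w") as g:
        g.write("\n".join(filtered_content))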
    

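    To also remove lines made up of several stopwords, use this if statement instead. It removes lines that contain only stopwords; if any word is not a stopword, the line is kept: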

    # Needs: from nltk.tokenize import word_tokenize
    if not set(word_tokenize(line)).issubset(stop_words):
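
The multi-stopword check above can be sketched end to end on the sample from the question. This is only a rough illustration under stated assumptions: `STOP_WORDS` is a small hand-written stand-in for `set(stopwords.words('english'))`, `str.split()` stands in for `word_tokenize`, and the helper names `is_only_stopwords` and `drop_stopword_lines` are made up for this example.

```python
# Illustrative stand-in for set(stopwords.words('english')); an assumption
# made so the sketch runs without downloading the NLTK corpus.
STOP_WORDS = {"is", "of", "on", "from", "the", "a"}


def is_only_stopwords(line, stop_words=STOP_WORDS):
    """Return True if the line has at least one word and every word is a stopword."""
    words = line.split()  # whitespace split; word_tokenize could be swapped in
    return bool(words) and all(word in stop_words for word in words)


def drop_stopword_lines(lines, stop_words=STOP_WORDS):
    """Keep every line that is not made up solely of stopwords."""
    return [line for line in lines if not is_only_stopwords(line, stop_words)]


sample = ["Adh Dhayd", "Abu Dhabi is", "Zaranj", "of", "on", "Taloqan",
          "Shnan of", "is", "Shibirghn", "Shahrak", "from"]

print(drop_stopword_lines(sample))
# ['Adh Dhayd', 'Abu Dhabi is', 'Zaranj', 'Taloqan', 'Shnan of', 'Shibirghn', 'Shahrak']
```

One design choice to be aware of: this sketch keeps blank lines (the `bool(words)` guard), whereas a pure subset test would drop them. Remove the guard if blank lines should go too.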
    

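    【Comments】:

    • Can you help me do the removal case-insensitively? It should not be case-sensitive.
    • You can use line.lower(). However, please try searching for your question first, since this one (in your comment) has been answered many times in many tutorials :) (stackoverflow.com/questions/6797984/…)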
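
A minimal sketch of the `line.lower()` suggestion, assuming the stopword set itself is all lowercase. `STOP_WORDS` is again a small illustrative subset and `is_stopword_line` is a hypothetical helper name, not something from NLTK.

```python
STOP_WORDS = {"is", "of", "on", "from"}  # illustrative subset, an assumption


def is_stopword_line(line, stop_words=STOP_WORDS):
    """Case-insensitive check: is the whole line a single stopword?"""
    # Lowercase only for the comparison; the line itself is left untouched.
    return line.strip().lower() in stop_words


lines = ["Of", "ON", "Abu Dhabi Is", "from", "Zaranj"]
kept = [line for line in lines if not is_stopword_line(line)]

print(kept)
# ['Abu Dhabi Is', 'Zaranj']
```

Because the lowercasing happens only inside the comparison, the lines that survive keep their original capitalization when written back out.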