Fdist and top 10 function words

【Question title】: Fdist and top 10 function words
【Posted】: 2013-01-07 09:37:49
【Question】:

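I have to write a script that gives me all content words in descending order of frequency. I need the 10 most frequent content words, so I need not only to list the 10 most frequent words in my corpus but also to filter out any function words (and, or, any punctuation, ...). What I have so far is the following: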

fileids = corpus.fileids()
text = corpus.words(fileids)
wlist = []
ftable = nltk.FreqDist(text)
wlist.append(ftable.keys())

This gives me a pretty neat list of all words in descending order of frequency, but how do I filter out the function words?

Thanks.

【Comments】:

Tags: python nltk stop-words

【Solution 1】:

You want to filter out a set of words (stopwords). Taking the core idea from this SO answer:

You need to introduce a few lines into your code. Right after

fileids = corpus.fileids()
text = corpus.words(fileids)

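add the following lines, which build a stopword list and filter those words out of the text:

# get the English stopwords as a set, for fast membership tests
stp = set(nltk.corpus.stopwords.words('english'))

# from your text of words, keep only the ones NOT in stp
# (the stopword list is lowercase, so compare in lowercase)
filtered_text = [w for w in text if w.lower() not in stp]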
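
One caveat with the filter above: the entries in NLTK's English stopword list are lowercase and do not include standalone punctuation marks, so tokens such as "The" or "," can slip through. Below is a minimal sketch of a stricter filter. The names `keep_content_words`, `sample_stopwords` and `sample_tokens` are hypothetical and used only for illustration; in practice you would pass `set(nltk.corpus.stopwords.words('english'))` as the stopword set.

```python
def keep_content_words(tokens, stopwords):
    """Drop stopwords (case-insensitively) and tokens with no letters or digits."""
    kept = []
    for tok in tokens:
        # Skip tokens made up only of punctuation, such as "," or "--"
        if not any(ch.isalnum() for ch in tok):
            continue
        # Skip stopwords regardless of capitalisation
        if tok.lower() in stopwords:
            continue
        kept.append(tok)
    return kept


# Hypothetical stand-in for set(nltk.corpus.stopwords.words('english'))
sample_stopwords = {"the", "and", "or", "a"}
sample_tokens = ["The", "cat", "and", "the", "dog", ",", "or", "a", "bird", "."]
print(keep_content_words(sample_tokens, sample_stopwords))  # ['cat', 'dog', 'bird']
```

Whether numbers should count as content words depends on the corpus; swapping `isalnum` for `isalpha` would drop them as well.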

Now carry on with the rest of your code, using the filtered text:

wlist = []
ftable = nltk.FreqDist(filtered_text)
wlist.append(ftable.keys())
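
To get only the ten most frequent words rather than the full key list, the frequency table's `most_common` method can be used. In NLTK 3, `FreqDist` is a subclass of `collections.Counter` and `keys()` is no longer ordered by frequency, so the sketch below uses a plain `Counter` as a stand-in that runs without any corpus data. `top_words` and `sample` are hypothetical names; the same `most_common(10)` call should work on `nltk.FreqDist(filtered_text)`.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs, ignoring case."""
    counts = Counter(tok.lower() for tok in tokens)
    return counts.most_common(n)


# Hypothetical sample standing in for filtered_text
sample = ["cat", "dog", "Cat", "bird", "cat", "dog"]
print(top_words(sample, 2))  # [('cat', 3), ('dog', 2)]
```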

Hope that helps.

【Discussion】:

Why, I didn't know NLTK had a built-in stopword list. Thanks a million!
Yes, NLTK is a great resource; I keep finding new gems in it.