Most frequent words in a French text

【Question title】: Most frequent words in a French text
【Posted】: 2015-02-01 11:18:21
【Question】:

I am using the Python nltk package to find the most frequent words in a French text, and it is not really working. Here is my code:

#-*- coding: utf-8 -*-

#nltk: package for text analysis
from nltk.probability import FreqDist
from nltk.corpus import stopwords
import nltk
import tokenize
import codecs
import unicodedata


#output French accents correctly
def convert_accents(text):
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')



### MAIN ###

#openfile
text_temp=codecs.open('text.txt','r','utf-8').readlines()

#put content in a list
text=[]
for word in text_temp:
    word=word.strip().lower()
    if word!="":
        text.append(convert_accents(word))

#tokenize the list
text=nltk.tokenize.word_tokenize(str(text))

#use FreqDist to get the most frequents words
fdist = FreqDist()
for word in  text:
    fdist.inc( word )
print "BEFORE removing meaningless words"
print fdist.items()[:10]

#use stopwords to remove articles and other meaningless words
for sw in stopwords.words("french"):
     if fdist.has_key(sw):
          fdist.pop(sw)
print "AFTER removing meaningless words"
print fdist.items()[:10]

Here is the output:

BEFORE removing meaningless words
[(',', 85), ('"', 64), ('de', 59), ('la', 47), ('a', 45), ('et', 40), ('qui', 39), ('que', 33), ('les', 30), ('je', 24)]
AFTER removing meaningless words
[(',', 85), ('"', 64), ('a', 45), ('les', 30), ('parce', 15), ('veut', 14), ('exigence', 12), ('aussi', 11), ('pense', 11), ('france', 10)]

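My problem is that the stopwords list does not discard all the meaningless words. For example, "," is not a word and should be removed, and "les" is an article and should be removed too.

How can I fix this?

The text I am using can be found on this page: http://www.elysee.fr/la-presidence/discours-d-investiture-de-nicolas-sarkozy/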

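【Comments】:

If the stopwords that nltk provides do not work for you, you should build your own list of stop words to remove, or look for another library. As for the comma, you can try newstr = oldstr.replace(",", "") on the full text before doing anything else.
I accept your suggestion. But why does the nltk stopwords feature not do the job it is supposed to do?!
I had a look at nltk's French stopwords and I would say the list is fairly complete (I speak French too). A few more words such as "ils", "elles", "les" and "leurs" (mostly plurals) and it would be fine. My guess is that whoever wrote the Stopwords Corpus used in nltk did not know French very well. But that is not something we can complain about; after all, they gave us a great library for free!
OK, thanks. user823743 also gave another good explanation ;).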

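Tags: python text nltk stop-words

【Answer 1】:

It is usually better to use your own stop-word list. To do that, you can get a list of French stop words from here; the article "les" is in that list too. Save the words to a text file and use that file to remove stop words from your corpus. For punctuation, you have to write your own punctuation-removal function. How you write it depends heavily on your application, but to give you a couple of examples to get started, you could write: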

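import string
t = "hello, eric! how are you?"
# Python 2 only: str.translate(table, deletechars) with an identity table
# deletes every character listed in string.punctuation.
print t.translate(string.maketrans("",""), string.punctuation)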

The output is:

hello eric how are you
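
The translate call above relies on the Python 2 string API. For readers on Python 3, a rough equivalent might look like the sketch below. The function name strip_punctuation and its extra parameter are made up for illustration and are not part of the original answer. Note that string.punctuation only covers ASCII, so French guillemets or typographic apostrophes would have to be passed in explicitly.

```python
import string


def strip_punctuation(text, extra=""):
    """Return text with ASCII punctuation (plus any `extra` characters) removed."""
    # In Python 3, the third argument of str.maketrans lists characters to delete.
    table = str.maketrans("", "", string.punctuation + extra)
    return text.translate(table)


print(strip_punctuation("hello, eric! how are you?"))
# prints: hello eric how are you
```

Calling strip_punctuation(text, extra="«»’") would also drop guillemets and the curly apostrophe, if the text contains them.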

Or, another way is simply to write:

t = t.split()
for w in t:
    w = w.strip('\'"?,.!_+=-')
    print w

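So it really depends on how you need to remove them. In some cases these methods may not produce exactly the result you want, but you can build on them. If you have any further questions, let me know.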
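
Putting the two ideas together (a custom stop-word list plus punctuation removal), a minimal Python 3 sketch could look like the following. It is not the answer's own code: it uses collections.Counter instead of nltk, and the names FRENCH_STOPWORDS, load_stopwords and most_frequent_words are hypothetical, as is the tiny stop-word set, which stands in for a real list loaded from a file. Unlike the code in the question, it keeps accents: if accents are stripped from the text but the stop list still holds accented entries such as "à", the stripped token "a" no longer matches, which may be part of why "a" survived in the output above.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only;
# in practice load a fuller list with load_stopwords() below.
FRENCH_STOPWORDS = {"de", "la", "le", "les", "l", "et", "qui", "que", "je", "à", "a"}

# Roughly "runs of letters": word characters minus digits and underscore.
# Python 3 patterns are Unicode-aware, so accented letters stay inside words.
WORD_RE = re.compile(r"[^\W\d_]+")


def load_stopwords(path):
    """Read one stop word per line from a UTF-8 text file (path is an assumption)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def most_frequent_words(text, stopwords, n=10):
    """Return the n most common words in text, ignoring punctuation and stop words."""
    tokens = WORD_RE.findall(text.lower())  # punctuation never becomes a token
    counts = Counter(tok for tok in tokens if tok not in stopwords)
    return counts.most_common(n)


sample = "La France, je l'aime ; la France veut et exige, et la France pense."
print(most_frequent_words(sample, FRENCH_STOPWORDS, n=1))
# prints: [('france', 3)]
```

Because the tokenizer only ever emits letter runs, commas and quotes cannot show up in the counts, and elided forms such as "l'" come out as "l", which is why "l" is in the example stop set.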

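【Comments】:

I accept your suggestion. I think I will end up using your solution... but why does the nltk stopwords feature not do the job it is supposed to do?!
Because the team that develops nltk probably does not speak every language nltk covers. They most likely took a large text corpus for each language and picked the most frequent words in that corpus as stop words, so the list may have been generated automatically. Besides, stop words vary from one application to another: the stop words for topic classification, for example, differ from those for sentiment classification. That is why, at the end of the day, a rough stop-word list is good enough from the developers' point of view.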
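
The point that stop words depend on the application can be sketched as below. This is only an illustration under assumptions: build_stopwords is a hypothetical helper, the base list is a made-up stand-in for whatever list you start from, and treating "ne" and "pas" as words worth keeping for sentiment work is one plausible choice, not a rule.

```python
def build_stopwords(base, extra=(), keep=()):
    """Return base plus extra words, minus any words the application must keep."""
    return (set(base) | set(extra)) - set(keep)


# Hypothetical base list standing in for whatever list you start from.
base = ["de", "la", "le", "et", "ne", "pas"]

# Frequency counting: add the plural forms suggested in the question's comments.
topic_stopwords = build_stopwords(base, extra=["ils", "elles", "les", "leurs"])

# Sentiment-style use: keep negation words, since they change the meaning.
sentiment_stopwords = build_stopwords(base, keep=["ne", "pas"])

print(sorted(topic_stopwords))
print(sorted(sentiment_stopwords))
```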