How can I add frequency in the nltk naivebayes classifier? Answers

【Question title】: How can I add frequency in nltk naivebayes classifier?
【Posted】: 2017-03-02 16:34:48
【Question】:

I am currently learning the naive Bayes classifier using nltk.

Section 1.3, Document Classification, of the NLTK book (http://www.nltk.org/book/ch06.html) has a featureset example.

featuresets = [(document_features(d), c) for (d,c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words)[:2000]

def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

So an example of the featureset form is [({'contains(waste)': False, 'contains(lot)': False, ...}, 'neg'), ...]

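But I want to change the dictionary form from 'contains(waste)': False to 'contains(waste)': 2. I think that form ('contains(waste)': 2) describes the document better, because it records the frequency of the word. So the featureset would be [({'contains(waste)': 2, 'contains(lot)': 5, ...}, 'neg'), ...]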
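The count-valued features described above could be built roughly as follows. This is a minimal sketch, not code from the NLTK book: the function name `document_count_features` and the toy word list are hypothetical, and it assumes the entries of `word_features` are already lowercase.

```python
from collections import Counter

def document_count_features(document, word_features):
    """Map each tracked word to the number of times it occurs in the document."""
    # Count every (lowercased) token once, then look up only the tracked words.
    counts = Counter(w.lower() for w in document)
    return {'contains({})'.format(word): counts[word] for word in word_features}

# Toy data for illustration only.
doc = ['What', 'a', 'waste', 'a', 'lot', 'of', 'waste']
features = document_count_features(doc, ['waste', 'lot', 'plot'])
print(features)
# {'contains(waste)': 2, 'contains(lot)': 1, 'contains(plot)': 0}
```

Whether a classifier can make sensible use of such values is a separate matter, and it is the point of the question.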

But I am worried that 'contains(waste)': 2 and 'contains(waste)': 1 are completely different things to the NaiveBayesClassifier. In that case it could not capture the similarity between 'contains(waste)': 2 and 'contains(waste)': 1.

To the program, the difference between 'contains(lot)': 1 and 'contains(waste)': 1 could look the same as the difference between 'contains(waste)': 2 and 'contains(waste)': 1.

Can nltk.NaiveBayesClassifier understand word frequency?

Here is the code I use:

import random

import konlpy.tag
from nltk.classify import naivebayes

def split_and_count_word(data):
    #belongs_to : Main
    #Role : make featuresets from korean words using konlpy.
    #Parameter : dictionary data(dict of contents ex.{'politic':{'parliament': [content,content]}..})
    #Return : list featuresets([({'word':True,...},'politic')] == featureset + category)

    featuresets = []
    twitter = konlpy.tag.Twitter()#Korean word splitter

    for big_cat in data:

        for small_cat in data[big_cat]:
            #save category name needed in featuresets 
            category = str(big_cat[0:3])+'/'+str(small_cat)
            count = 0
            print(small_cat)

            for one_news in data[big_cat][small_cat]:
                #print a progress counter every 100 documents
                count += 1
                if count % 100 == 0:
                    print(count, end=' ')
                #one_news is list in list so open it!
                doc = one_news
                #split the text into morphemes using konlpy
                list_of_splited_word = twitter.morphs(doc[:-63])#drop the last 63 characters (useless sentences)
                #keep only words that are at least two characters long
                list_of_up_two_word = [word for word in list_of_splited_word if len(word)>1]
                dict_of_featuresets = make_featuresets(list_of_up_two_word)
                #save 
                featuresets.append((dict_of_featuresets,category))

    return featuresets


def make_featuresets(data):
    #belongs_to : split_and_count_word
    #Role : make featuresets
    #Parameter : list list_of_up_two_word(ex.['비누','떨어','지다'])
    #Return : dictionary {word : True for word in data}

    #PROBLEM :(
    #cannot consider the frequency of each word
    return {word : True for word in data}

def naive_train(featuresets):
    #belongs_to : Main
    #Role : Learning by naive bayes rule
    #Parameter : list featuresets([({'word':True,...},'pol/pal')])
    #Return : object classifier(nltk naivebayesclassifier object),
    #         list test_set(the featuresets that are randomly selected)

    random.shuffle(featuresets)
    train_set, test_set = featuresets[1000:], featuresets[:1000]
    classifier = naivebayes.NaiveBayesClassifier.train(train_set)

    return classifier,test_set

featuresets = split_and_count_word(data)
classifier,test_set = naive_train(featuresets)

【Comments on the question】:

Tags: python nltk naivebayes nl-classifier

【Solution 1】:

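nltk's naive Bayes classifier treats feature values as logically distinct. Values are not limited to True and False, but they are never treated as quantities. If you have features f=2 and f=3, they are counted as distinct values. The only way to add quantities to such a model is to sort them into "buckets", for example f=1, f="few" (2-5), f="several" (6-10), f="many" (11 or more). (Note: if you go this route, there are algorithms for choosing good value ranges for the buckets.) Even then, the model does not "know" that "few" lies between "one" and "several". You need a different machine learning tool to handle quantities directly.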
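The bucketing idea can be sketched as below. The cut-offs simply copy the example ranges from the answer; the label "one" and the names `bucket` and `bucketed_features` are hypothetical, and the sketch assumes only words that actually occur in the document become features (as in the question's make_featuresets).

```python
from collections import Counter

def bucket(count):
    """Map a raw count (>= 1) to a coarse, nominal bucket label."""
    if count == 1:
        return 'one'
    if count <= 5:
        return 'few'
    if count <= 10:
        return 'several'
    return 'many'

def bucketed_features(words):
    """Build a feature dict {word: bucket label} from a list of tokens."""
    counts = Counter(words)
    return {word: bucket(n) for word, n in counts.items()}

# Toy tokens for illustration only.
tokens = ['a'] * 3 + ['b'] + ['c'] * 12 + ['d'] * 6
print(bucketed_features(tokens))
# {'a': 'few', 'b': 'one', 'c': 'many', 'd': 'several'}
```

The resulting dictionaries can be paired with labels and passed to nltk.NaiveBayesClassifier.train in the same way as the True/False featuresets in the question. The bucket labels are still just distinct nominal values to the classifier.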

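【Comments】:

Thank you for the idea. So you mean I cannot add a word that is already contained in the feature dictionary? For example, a dictionary like {"hello":True,"hello":True,"my":True...}. In that case, could you recommend other useful machine learning modules?
As you already noted in your comment to @aberger, no, you can't use the same key twice in a dictionary. I can't point you directly to a solution for quantities, sorry. nltk's MaxentClassifier uses numeric weights, but they are normally created by the API from the "nominal" features you provide, so you would have to dig around for the right way to use it. See also scikit-learn. The best classifier depends on your task, so try a few!
Thanks, I'll give it a try!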
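To illustrate what "handling quantities directly" can look like, here is a small multinomial naive Bayes written from scratch, in which each occurrence of a word contributes to the score. It is only a sketch under stated assumptions: the class `TinyMultinomialNB` is hypothetical and is not the API of nltk or scikit-learn, it uses add-one smoothing, and the training data is a toy example.

```python
import math
from collections import Counter, defaultdict

class TinyMultinomialNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.label_counts = Counter(labels)       # documents per label
        self.word_counts = defaultdict(Counter)   # word occurrences per label
        self.vocab = set()
        for words, label in zip(docs, labels):
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def log_score(self, words, label):
        """Log prior plus count-weighted log likelihood of the words."""
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        score = math.log(self.label_counts[label] / sum(self.label_counts.values()))
        for word, n in Counter(words).items():
            if word not in self.vocab:
                continue  # ignore words never seen in training
            p = (self.word_counts[label][word] + 1) / denom
            score += n * math.log(p)  # a word seen n times counts n times
        return score

    def classify(self, words):
        return max(self.label_counts, key=lambda label: self.log_score(words, label))

# Toy training data for illustration only.
docs = [['waste', 'waste', 'boring'], ['great', 'fun', 'great'],
        ['waste', 'fun'], ['great', 'great', 'fun']]
labels = ['neg', 'pos', 'neg', 'pos']
clf = TinyMultinomialNB().fit(docs, labels)

# Same set of words, different frequencies, different predictions.
print(clf.classify(['waste', 'waste', 'great']))  # neg
print(clf.classify(['waste', 'great', 'great']))  # pos
```

With True/False features the two test documents would be indistinguishable, since both contain exactly the words "waste" and "great"; here the counts decide the outcome.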