Is 0.7 - 0.75 an acceptable accuracy for Naive Bayes sentiment analysis? - Answers

【Question title】: Is 0.7 - 0.75 an acceptable accuracy for Naive Bayes sentiment analysis?
【Posted】: 2021-08-10 04:51:29
【Question description】:

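I apologise in advance for posting so much code.

I'm trying to classify YouTube comments into those that contain an opinion (whether positive or negative) and those that don't, using NLTK's Naive Bayes classifier, but no matter what I do in the preprocessing stage I can't really get the accuracy above 0.75. This seems rather low compared with other examples I've seen - this tutorial, for example, ends up with an accuracy of around 0.98.

Here is my full code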

import nltk, re, json, random

from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.tokenize import TweetTokenizer
from nltk import FreqDist, classify, NaiveBayesClassifier

from contractions import CONTRACTION_MAP
from abbreviations import abbrev_map
from tqdm.notebook import tqdm

def expand_contractions(text, contraction_mapping=CONTRACTION_MAP):
    text = re.sub(r"’", "'", text)
    if text in abbrev_map:
        return(abbrev_map[text])
    text = re.sub(r"\bluv", "lov", text)
    
    contractions_pattern = re.compile('({})'.format('|'.join(contraction_mapping.keys())), 
                                      flags=re.IGNORECASE|re.DOTALL)
    def expand_match(contraction):
        match = contraction.group(0)
        first_char = match[0]
        expanded_contraction = contraction_mapping.get(match)\
                                if contraction_mapping.get(match)\
                                else contraction_mapping.get(match.lower())                       
        expanded_contraction = first_char+expanded_contraction[1:]
        return expanded_contraction
        
    expanded_text = contractions_pattern.sub(expand_match, text)
    return expanded_text

def reduce_lengthening(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1", text)

def lemmatize_sentence(tokens):
    lemmatizer = WordNetLemmatizer()
    lemmatized_sentence = []
    for word, tag in pos_tag(tokens):
        if tag.startswith('NN'):
            pos = 'n'
        elif tag.startswith('VB'):
            pos = 'v'
        else:
            pos = 'a'
        lemmatized_sentence.append(lemmatizer.lemmatize(word, pos))
    return lemmatized_sentence

def processor(comments_list):
    
    new_comments_list = []
    for com in tqdm(comments_list):
        com = com.lower()
        
        #expand out contractions
        tok = com.split(" ")
        z = []
        for w in tok:
            ex_w = expand_contractions(w)
            z.append(ex_w)
        st = " ".join(z)
        
        
        tokenized = tokenizer.tokenize(st)
        reduced = [reduce_lengthening(token) for token in tokenized]
        new_comments_list.append(reduced)
        
    lemmatized = [lemmatize_sentence(new_com) for new_com in new_comments_list]
    
    return(lemmatized)

def get_all_words(cleaned_tokens_list):
    for tokens in cleaned_tokens_list:
        for token in tokens:
            yield token

def get_comments_for_model(cleaned_tokens_list):
    for comment_tokens in cleaned_tokens_list:
        yield dict([token, True] for token in comment_tokens)
        
if __name__ == "__main__":
    #=================================================================================~
    tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)        
    
    with open ("english_lang/samples/training_set.json", "r", encoding="utf8") as f:
        train_data = json.load(f)
        
    pos_processed = processor(train_data['pos'])
    neg_processed = processor(train_data['neg'])
    neu_processed = processor(train_data['neu'])
    
    emotion = pos_processed + neg_processed
    random.shuffle(emotion)
    
    em_tokens_for_model = get_comments_for_model(emotion)
    neu_tokens_for_model = get_comments_for_model(neu_processed)

    em_dataset = [(comment_dict, "Emotion")
                         for comment_dict in em_tokens_for_model]

    neu_dataset = [(comment_dict, "Neutral")
                             for comment_dict in neu_tokens_for_model]

    dataset = em_dataset + neu_dataset


    random.shuffle(dataset)
    x = 700
    tr_data = dataset[:x]
    te_data = dataset[x:]
    classifier = NaiveBayesClassifier.train(tr_data)
    print(classify.accuracy(classifier, te_data))

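I can post my training dataset if needed, but it is probably worth mentioning that the quality of the English in the YouTube comments themselves is very poor and inconsistent (which I imagine is the reason for the low model accuracy). In any case, would this be considered an acceptable level of accuracy? Alternatively, I may well be going about this all wrong and there is a far better model to use, in which case feel free to tell me I'm an idiot! Thanks in advance.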

【Question discussion】:

This question might get better answers on the Data Science Stack Exchange, so keep that in mind if you don't get an answer here.

Tags: python sentiment-analysis naivebayes

【Solution 1】:

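Comparing your results with those of an unrelated tutorial is not statistically valid. Before you panic, please do appropriate research on the factors that can reduce your model's accuracy. First and foremost, your model cannot exhibit accuracy greater than what the information in the data set inherently supports. For instance, no model can do better than 50% accuracy (in the long run) at predicting fair, independent random binary events, regardless of the data set.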
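To make the 50% ceiling concrete, here is a minimal simulation sketch. It is not part of the original answer: the function names, the three example predictors, and the trial count are all assumptions chosen for illustration. Each predictor sees the history of past coin flips and guesses the next one.

```python
import random

def simulate_accuracy(predict, n_trials=100_000, seed=0):
    """Fraction of fair coin flips that `predict` guesses correctly.

    `predict` receives the list of past outcomes and returns a bool guess.
    """
    rng = random.Random(seed)
    history = []
    correct = 0
    for _ in range(n_trials):
        guess = predict(history)          # guess is made before the flip
        outcome = rng.random() < 0.5      # fair, independent flip
        correct += (guess == outcome)
        history.append(outcome)
    return correct / n_trials

# Three hypothetical "models" of increasing cleverness.
def always_true(history):
    return True

def repeat_last(history):
    return history[-1] if history else True

def recent_majority(history):
    window = history[-10:]                # look at the last 10 flips only
    return sum(window) * 2 >= len(window)

for model in (always_true, repeat_last, recent_majority):
    print(model.__name__, round(simulate_accuracy(model), 3))
```

Every predictor should land close to 0.5, because the past flips carry no information about the next one. Real comment data sits somewhere between this extreme and a perfectly separable data set, which is why the achievable accuracy depends on the data rather than on the tutorial being compared against.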

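We have no reasonable way to evaluate the theoretical information content of your data. If you need a sanity check, try applying some other model types to the same data and see what accuracy they produce. Running these experiments is a normal part of data science.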
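One way to run such experiments is a small k-fold harness that scores any model on the same data, starting with a majority-class baseline. The sketch below is an illustration rather than the answerer's method: `train_majority`, `cross_validate`, and the toy data are hypothetical names and values, and the data is assumed to be a list of `(feature_dict, label)` pairs like the `dataset` built in the question.

```python
import random
from collections import Counter
from statistics import mean, stdev

def train_majority(train_set):
    """Baseline 'model': always predict the most common training label."""
    top_label = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda features: top_label

def cross_validate(dataset, train_fn, k=5, seed=0):
    """Return the accuracy on each of k folds.

    `train_fn` takes a training list and returns a function that maps
    a feature dict to a predicted label.
    """
    data = list(dataset)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]   # k roughly equal folds
    scores = []
    for i, test_fold in enumerate(folds):
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        classify = train_fn(train)
        hits = sum(classify(feats) == label for feats, label in test_fold)
        scores.append(hits / len(test_fold))
    return scores

# Toy stand-in for the real data (assumed 60/40 class balance).
toy = [({"love": True}, "Emotion")] * 60 + [({"video": True}, "Neutral")] * 40
scores = cross_validate(toy, train_majority)
print(f"majority baseline: {mean(scores):.2f} +/- {stdev(scores):.2f}")
```

With the question's data, the existing classifier could be plugged in as `cross_validate(dataset, lambda tr: NaiveBayesClassifier.train(tr).classify)`, and other NLTK classifiers can be passed the same way. If the Naive Bayes score is only slightly above the majority baseline, or if several different model types all plateau near 0.75, that points at the data rather than the model. The spread across folds also shows how much a single 700-example split can move the reported number.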

【Discussion】: