【发布时间】:2019-10-05 19:52:09
【问题描述】:
我按照这里的教程:https://towardsdatascience.com/creating-the-twitter-sentiment-analysis-program-in-python-with-naive-bayes-classification-672e5589a7ed 创建了一个 twitter 情绪分析器,它使用 nltk 库中的朴素贝叶斯分类器将推文分类为正面、负面或中性,但它给出的标签只是中性的或无关的。我在下面包含了我的代码,因为我对任何机器学习都不是很有经验,所以我很感激任何帮助。
我尝试过使用不同的推文集进行分类,即使指定了像“快乐”这样的搜索关键字,它仍然会返回“中性”。我不喜欢
import nltk
def buildvocab(processedtrainingdata):
all_words = []
for (words, sentiment) in processedtrainingdata:
all_words.extend(words)
wordlist = nltk.FreqDist(all_words)
word_features = wordlist.keys()
return word_features
def extract_features(tweet):
tweet_words = set(tweet)
features = {}
for word in word_features:
features['contains(%s)' % word] = (word in tweet_words) #creates json key containing word x, its loc.
# Every key has a T/F according - true for present , false for not
return features
# Building the feature vector
word_features = buildvocab(processedtrainingdata)
training_features = nltk.classify.apply_features(extract_features, processedtrainingdata)
# apply features does the actual extraction
Nbayes_result_labels = [Nbayes.classify(extract_features(tweet[0])) for tweet in processedtestset]
# get the majority vote [?]
if Nbayes_result_labels.count('positive') > Nbayes_result_labels.count('negative'):
print('Positive')
print(str(100*Nbayes_result_labels.count('positive')/len(Nbayes_result_labels)))
elif Nbayes_result_labels.count('negative') > Nbayes_result_labels.count('positive'):
print(str(100*Nbayes_result_labels.count('negative')/len(Nbayes_result_labels)))
print('Negative sentiment')
else:
print('Neutral')
#the output is always something like this:
print(Nbayes_result_labels)
['neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'irrelevant', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral']
【问题讨论】:
-
我应该提到用于训练数据的推文的“语料库文件”包含 550 条积极标记的推文,等量的消极但中性和不相关的推文要高得多,大约有 4000 条。跨度>
-
另外,我尝试对训练数据集进行分类,它确实适用,即用 Nbayes_result_labels = [Nbayes] 替换 Nbayes_result_labels = [Nbayes.classify(extract_features(tweet[0])) .classify(extract_features(tweet[0])) 用于处理训练数据中的推文]