TL;DR
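The demo_liu_hu_lexicon function is a demo that shows how to use opinion_lexicon. It is meant for testing and should not be used directly.
In Long
Let's look at the function and see how we can recreate something similar: https://github.com/nltk/nltk/blob/develop/nltk/sentiment/util.py#L616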
def demo_liu_hu_lexicon(sentence, plot=False):
    """
    Basic example of sentiment classification using Liu and Hu opinion lexicon.
    This function simply counts the number of positive, negative and neutral words
    in the sentence and classifies it depending on which polarity is more represented.
    Words that do not appear in the lexicon are considered as neutral.
    :param sentence: a sentence whose polarity has to be classified.
    :param plot: if True, plot a visual representation of the sentence polarity.
    """
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank
    tokenizer = treebank.TreebankWordTokenizer()
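OK, imports inside the function are an odd thing to do, but that's because it is a demo function used for simple testing or documentation.
Also, using treebank.TreebankWordTokenizer() is a little odd; we could simply use nltk.word_tokenize.
Let's move the imports out and rewrite demo_liu_hu_lexicon as a simple_sentiment function.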
from nltk.corpus import opinion_lexicon
from nltk import word_tokenize
def simple_sentiment(text):
    pass
Next, we see
def demo_liu_hu_lexicon(sentence, plot=False):
    """
    Basic example of sentiment classification using Liu and Hu opinion lexicon.
    This function simply counts the number of positive, negative and neutral words
    in the sentence and classifies it depending on which polarity is more represented.
    Words that do not appear in the lexicon are considered as neutral.
    :param sentence: a sentence whose polarity has to be classified.
    :param plot: if True, plot a visual representation of the sentence polarity.
    """
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank
    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]
    x = list(range(len(tokenized_sent)))  # x axis for the plot
    y = []
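The function
- first tokenizes and lowercases the sentence
- initializes the number of positive and negative words
- initializes x and y for some plotting later, so let's ignore them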
If we go further down the function:
def demo_liu_hu_lexicon(sentence, plot=False):
    from nltk.corpus import opinion_lexicon
    from nltk.tokenize import treebank
    tokenizer = treebank.TreebankWordTokenizer()
    pos_words = 0
    neg_words = 0
    tokenized_sent = [word.lower() for word in tokenizer.tokenize(sentence)]
    x = list(range(len(tokenized_sent)))  # x axis for the plot
    y = []
    for word in tokenized_sent:
        if word in opinion_lexicon.positive():
            pos_words += 1
            y.append(1)  # positive
        elif word in opinion_lexicon.negative():
            neg_words += 1
            y.append(-1)  # negative
        else:
            y.append(0)  # neutral
    if pos_words > neg_words:
        print('Positive')
    elif pos_words < neg_words:
        print('Negative')
    elif pos_words == neg_words:
        print('Neutral')
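The loop simply goes through each token and checks whether the word is in the positive or the negative lexicon.
At the end, it compares the number of positive and negative words and prints the label.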
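To make the counting loop concrete without needing the NLTK corpus, here is a minimal sketch that uses two tiny hand-made word sets in place of opinion_lexicon.positive() and opinion_lexicon.negative(). The sets POSITIVE and NEGATIVE and the helper name polarity_trace are hypothetical and only serve the illustration.

```python
# Toy stand-ins for opinion_lexicon.positive() / .negative(); these
# word sets are hypothetical and far smaller than the real lexicon.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "awful", "hate"}

def polarity_trace(tokens):
    """Return (pos_count, neg_count, per_token_scores) for lowercased tokens."""
    pos_words = 0
    neg_words = 0
    scores = []  # plays the role of y in the demo: 1, -1 or 0 per token
    for word in tokens:
        if word in POSITIVE:
            pos_words += 1
            scores.append(1)
        elif word in NEGATIVE:
            neg_words += 1
            scores.append(-1)
        else:
            scores.append(0)
    return pos_words, neg_words, scores

# Whitespace splitting is a simplification of the Treebank tokenizer.
tokens = "i love this great movie but the ending was bad".split()
pos, neg, scores = polarity_trace(tokens)
print(pos, neg, scores)
```

Here two tokens hit the positive set and one hits the negative set, so the demo's rule would label the sentence Positive.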
Now let's see whether we can write a better simple_sentiment function, now that we know what demo_liu_hu_lexicon does.
The tokenization in step 1 can't be avoided, so we have:
from nltk.corpus import opinion_lexicon
from nltk import word_tokenize
def simple_sentiment(text):
    tokens = [word.lower() for word in word_tokenize(text)]
For steps 2-5, the lazy way is to copy+paste and change print() -> return
from nltk.corpus import opinion_lexicon
from nltk import word_tokenize
def simple_sentiment(text):
    tokens = [word.lower() for word in word_tokenize(text)]
    # The counters must be initialized before the loop.
    pos_words = 0
    neg_words = 0
    for word in tokens:
        if word in opinion_lexicon.positive():
            pos_words += 1
        elif word in opinion_lexicon.negative():
            neg_words += 1
    # Ties (including sentences with no lexicon words) are labelled Neutral.
    if pos_words > neg_words:
        return 'Positive'
    elif pos_words < neg_words:
        return 'Negative'
    else:
        return 'Neutral'
Now you have a function that you can use however you like.
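One possible refinement, sketched under assumptions rather than taken from NLTK: testing `word in opinion_lexicon.positive()` for every token most likely scans the word list each time, so building sets once up front should be noticeably faster. The factory name make_simple_sentiment and its tokenize parameter are hypothetical; the toy lexicons below stand in for the real ones so the sketch runs without NLTK.

```python
def make_simple_sentiment(positive, negative, tokenize=str.split):
    """Build a classifier from two iterables of words.

    The lexicons are converted to sets once, so each lookup is O(1).
    """
    pos_set = frozenset(positive)
    # Mirror the if/elif order of the demo: a word found in both
    # lexicons is counted as positive only.
    neg_set = frozenset(negative) - pos_set

    def simple_sentiment(text):
        tokens = [word.lower() for word in tokenize(text)]
        pos_words = sum(word in pos_set for word in tokens)
        neg_words = sum(word in neg_set for word in tokens)
        if pos_words > neg_words:
            return 'Positive'
        if pos_words < neg_words:
            return 'Negative'
        return 'Neutral'

    return simple_sentiment

# With NLTK installed and the opinion_lexicon corpus downloaded, the
# call would presumably look like this (not run here):
#   classify = make_simple_sentiment(opinion_lexicon.positive(),
#                                    opinion_lexicon.negative(),
#                                    tokenize=word_tokenize)
classify = make_simple_sentiment({"good", "great"}, {"bad", "awful"})
print(classify("A great cast and a good script"))
```

Passing the lexicons and tokenizer in as arguments also makes it easy to swap in a different word list later.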
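BTW, the demo is really odd..
When we see a positive word we append 1 to y, and when we see a negative word we append -1.
We say the sentence is positive when pos_words > neg_words.
Since pos_words and neg_words are integer counters, that is a plain integer comparison, and y is only used for plotting. Had they been lists of integers, the comparison would follow Python's sequence comparison, which might have no linguistic or mathematical logic =( (see What happens when we compare list of integers?)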
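A short sketch of the difference between comparing integer counts and comparing lists of per-token scores. The two example lists are made up for illustration; Python compares lists element by element from the left, so the first differing position decides the result.

```python
# Integer counters, as in the demo: an ordinary numeric comparison.
pos_words, neg_words = 2, 1
print(pos_words > neg_words)

# Lists of per-token scores compare lexicographically instead.
more_positive = [0, 1, 1]    # two positive tokens
fewer_positive = [1, 0, 0]   # one positive token
print(fewer_positive > more_positive)            # decided by index 0: 1 > 0
print(sum(fewer_positive) > sum(more_positive))  # compares totals: 1 > 2
```

Summing the scores, or keeping separate counters as the demo does, gives a comparison that reflects how many polar words were seen.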