Python NLTK formatting test set

【Question title】: Python NLTK formatting test set
【Posted】: 2017-05-15 22:25:40
【Question description】:

I've been working on this classifier and it seems to almost work. The only problem I'm running into is the test set.

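import os
import re
from itertools import chain

import nltk
import pandas as pd

# quoting=3 is csv.QUOTE_NONE: quote characters in the reviews are kept as-is
train = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'labeledTrainData.tsv'), header=0,
                    delimiter="\t", quoting=3)

test = pd.read_csv(os.path.join(os.path.dirname(__file__), 'data', 'testData.tsv'), header=0, delimiter="\t",
                   quoting=3)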

documents = []
for review in train.values:
    sentiment = 'pos' if review[1] == 1 else 'neg'
    split = review[2].split(), sentiment
    for word in split[0]:
        # NOTE: this only rebinds the loop variable; the tokens in split[0] keep their punctuation
        word = re.sub(r'[^\w\s]', '', word)
    documents.append(split)

word_features = nltk.FreqDist(chain(*[i for i, j in documents]))
# NOTE: this takes 100 distinct words, not necessarily the 100 most frequent ones
word_features = list(word_features.keys())[:100]

train_set = [({i: (i in tokens) for i in word_features}, tag) for tokens, tag in documents[:1000]]

# train() is a class method that returns the trained classifier, so keep its result
classifier = nltk.NaiveBayesClassifier.train(train_set)

print(nltk.classify.accuracy(classifier, test))
classifier.show_most_informative_features(5)

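The examples I've found use a single dataset that is split 90/10 for training and testing. Here I actually have two different datasets (one labeled, one for testing).

train_set (an abbreviated version is shown below) is a list of tuples. Each tuple holds a dict of boolean values indicating whether each word in word_keys appears in the review, together with whether the review is positive or negative: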

 [({'beautician,': False, 'hubris,': False, '/>BTW:': False, 'nondenominational': False, 'diapered,': False, 'matter).': False, 'fascist\\"': False, 'Russian,gay': False, '/>\\"Ladies': False, 'purport': False, 'locker-room': False, 'Enjoy"': False, 'exposition': False, 'decisions\\"': False, 'N(n***as)': False, 'Duhllywood),': False, 'cataclysmic': False, 'reviews,': False, 'marry;': False, 'Gordon),': False, 'now-nostalgic': False, 'avoid!!!!"': False, 'coin;': False, 'infiltrators': False, 'smalltime': False, "`knows'": False, 'callous': False, 'actors...it': False, 'Fox,': False, "'78": False, 'Givney': False, 'cinematography):': False, 'misconstrued,': False, 'bathing;': False, 'Hepburn,': False, 'noise,': False, 'BG´s.': False, 'ship.In': False, "'60s.)<br": False, 'Odder': False, 'holes,disgustingly': False, '/>contact': False, 'Croasdell': False, 'trips\\"': False, 'acting.Yet': False, 'firearm.': False, 'businesspeople': False, 'Tomilinson': False, 'ways...<br': False, 'cast...ouch.': False, "Alexandra's": False, "lost.'": False, 'anwers,': False, 'dissertation': False, 'Perry': False, 'phenom': False, '\\"Cleopatra\\",': False, '"Revolt': False, 'secured': False, "romance',": False, 'retentively': False, '/>1/2': False, 'photography/\\"You': False, 'did--': False, 'consulate': False, 'ocurred.': False, 'profession': False, 'insane.': False, 'hysterics)': False, 'UPN.<br': False, 'effects--after': False, 'IMAGE,': False, 'recognizable.<br': False, "Kinky'with": False, 'death\x97it': False, 'Wizard\\"': False, 'pemberton,': False, 'Belting': False, 'boast.': False, 'Schlock!!': False, 'filmed)': False, 'overplotted': False, 'wiring,': False, 'comedy)': False, '`SS': False, 'foibles.': False, 'Germna': False, 'Waverly': False, 'Oxford-educated': False, 'reviews.Anyway': False, 'SANE': False, 'expressively': False, 'cr*p.': False, 'ex-priest': False, 'ITC': False, '/>Sara': False, 'exoticism-oriented': False, "'hello'": False, '"......in': False, 'hesitates': 
False}, 'neg')]

The test set, however, still looks like this:

               id                                             review
0      "12311_10"  "Naturally in a film who's main themes are of ...
1        "8348_2"  "This movie is a disaster within a disaster fi...
2        "5828_4"  "All in all, this is a movie for kids. We saw ...
3        "7186_2"  "Afraid of the Dark left me with the impressio...
4       "12128_7"  "A very accurate depiction of small time mob l...
...           ...                                                ...
24997    "2531_1"  "I was so disappointed in this movie. I am ver...
24998    "7772_8"  "From the opening sequence, filled with black ...
24999  "11465_10"  "This is a great horror film for people who do...

[25000 rows x 2 columns]

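Of course, the problem I have now is that I can't simply train on this dataset. The original labeled data looks like the test set above, except that it also has a sentiment column containing a value of 1 or 0. How would I go about training and then using the test set against the classifier? I know there are examples, but they're not quite the same as what I'm doing.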

【Question discussion】:

Tags: python nltk

【Solution 1】:

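The test set must contain labels (the answers). nltk's evaluation methods expect them, and there is really no way to measure performance unless you already have the labels. Split your labeled set 90-10 as in the examples you've seen: train on 90% and hold out the remaining 10% for testing.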
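The 90-10 split described above can be sketched as follows. This is only a minimal illustration: the toy reviews, the tiny word_features list, and the extract_features helper are made up here so the snippet runs on its own, and they stand in for the labeled data and the feature step from the question.

```python
import random

import nltk

# Toy stand-in for the labeled reviews (invented for illustration):
# each entry is (list_of_tokens, tag), the same shape as `documents` in the question.
pos_texts = ["great film", "wonderful acting", "great story", "loved it", "wonderful and great"]
neg_texts = ["terrible film", "awful acting", "terrible story", "hated it", "awful and terrible"]
documents = ([(t.split(), "pos") for t in pos_texts * 2]
             + [(t.split(), "neg") for t in neg_texts * 2])

# Hypothetical feature vocabulary; the question builds its own from a FreqDist.
word_features = ["great", "wonderful", "loved", "terrible", "awful", "hated"]


def extract_features(tokens):
    """Map a token list to the {word: bool} dict the classifier expects."""
    present = set(tokens)
    return {word: (word in present) for word in word_features}


# Shuffle first so both parts of the split contain a mix of labels.
random.seed(0)
random.shuffle(documents)
featuresets = [(extract_features(tokens), tag) for tokens, tag in documents]

# Train on the first 90% and hold out the last 10% (which still has labels).
cutoff = int(len(featuresets) * 0.9)
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]

classifier = nltk.NaiveBayesClassifier.train(train_set)
accuracy = nltk.classify.accuracy(classifier, test_set)
print(accuracy)
```

The important point is that test_set here has the same (features, label) shape as train_set, which is what nltk.classify.accuracy compares the predictions against.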

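【Discussion】:

Hmm, am I perhaps missing something? I have three files to work with: a labeled training set, an unlabeled training set, and a test set. My understanding was that I train on the labeled set, and I thought the point was to run some unlabeled set (in this case the test set) against it. So where does the unlabeled set come in? Ultimately I want to try it on my own reviews, which won't have labels, and label those reviews based on this trained model. Thanks for the answer, by the way, much appreciated! @alexis
You can use the classifier on unlabeled input; that's what it's for. But you won't be able to tell whether the labels it assigns are correct.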
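Labeling unlabeled input might look like the sketch below. The training examples, the two-word word_features list, and the extract_features helper are invented for illustration; the only nltk calls used are NaiveBayesClassifier.train and classify.

```python
import nltk

# Hypothetical two-word vocabulary, kept tiny so the example is easy to follow.
word_features = ["great", "terrible"]


def extract_features(tokens):
    """Map a token list to the {word: bool} dict the classifier expects."""
    present = set(tokens)
    return {word: (word in present) for word in word_features}


# Made-up labeled examples standing in for the labeled training file.
train_set = [
    (extract_features("a great film".split()), "pos"),
    (extract_features("great fun".split()), "pos"),
    (extract_features("a terrible film".split()), "neg"),
    (extract_features("terrible plot".split()), "neg"),
]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Unlabeled reviews: features are extracted exactly as for training, but no tag is attached.
unlabeled_reviews = ["what a great evening", "terrible from start to finish"]
predicted = [classifier.classify(extract_features(r.split())) for r in unlabeled_reviews]

for review, label in zip(unlabeled_reviews, predicted):
    print(label, review)
```

With the question's data, the same loop would presumably run over the review column of the test DataFrame. Because that file has no sentiment column, the predictions can be produced but not scored.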