Document Classification with Naive Bayes

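sklearn provides three commonly used naive Bayes classifiers:

Gaussian naive Bayes (GaussianNB): the features are continuous variables that follow a Gaussian distribution, such as a person's height or an object's length.

Multinomial naive Bayes (MultinomialNB): the features are discrete variables that follow a multinomial distribution. In document classification, a feature is typically the number of times a word occurs, or the word's TF-IDF value.

Bernoulli naive Bayes (BernoulliNB): the features are Boolean variables that follow a 0/1 (Bernoulli) distribution. In document classification, a feature is whether a word occurs at all.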
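To make the difference between the three classifiers concrete, here is a minimal sketch that fits each one on the kind of feature it expects. The toy arrays and labels are invented purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

y = np.array([0, 0, 1, 1])  # two made-up classes

# Continuous features (say, height and length in cm): GaussianNB
X_cont = np.array([[150.0, 10.2], [155.0, 11.0], [180.0, 20.5], [185.0, 21.0]])
gnb = GaussianNB().fit(X_cont, y)

# Word counts per document (3-word vocabulary): MultinomialNB
X_counts = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
mnb = MultinomialNB().fit(X_counts, y)

# Word present / absent (0 or 1): BernoulliNB
X_bin = (X_counts > 0).astype(int)
bnb = BernoulliNB().fit(X_bin, y)

print(gnb.predict([[152.0, 10.5]]))
print(mnb.predict([[4, 0, 1]]))
print(bnb.predict([[1, 0, 1]]))
```

All three share the same fit/predict interface; only the assumed distribution of the features changes.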

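What is a TF-IDF value?

TF-IDF is a statistic used to evaluate how important a word is to one document within a collection or corpus of documents.

TF-IDF combines two terms, Term Frequency and Inverse Document Frequency, abbreviated TF and IDF.

Term frequency (TF) measures how often a word occurs in a document. It assumes that a word's importance is proportional to the number of times it occurs in the document.

Inverse document frequency (IDF) measures how well a word discriminates between documents. The fewer documents a word appears in, the better that word separates those documents from the rest. A larger IDF means the word is more discriminative.

TF-IDF is the product of TF and IDF. We look for words with both a high TF and a high IDF, that is, words that occur many times in one document and rarely in the others. Such words are well suited for classification.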
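The idea can be sketched in a few lines of plain Python. This uses one common textbook form (TF as a share of the document's tokens, IDF as log(N / df)); other variants exist, and the toy corpus is invented for illustration.

```python
import math

# Toy corpus, invented for illustration; each document is already tokenized.
docs = [
    ["bayes", "is", "simple"],
    ["bayes", "bayes", "works"],
    ["this", "is", "a", "test"],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens equal to `term`."""
    return doc.count(term) / len(doc)

def idf(term, docs):
    """Inverse document frequency: log(N / number of documents containing `term`).

    Assumes `term` occurs in at least one document.
    """
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df)

def tf_idf(term, doc, docs):
    """TF-IDF is the product of the two."""
    return tf(term, doc) * idf(term, docs)

print(tf_idf("works", docs[1], docs))  # occurs in only one document
print(tf_idf("bayes", docs[1], docs))  # frequent here, but also in another document
```

Note how "works" can outscore "bayes" in the second document even though "bayes" occurs twice there: its higher IDF outweighs its lower TF.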

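How to compute TF-IDF

In sklearn we can use the TfidfVectorizer class directly; it computes the TF-IDF vector for each document. Note that sklearn uses the natural logarithm (base e), not base 10, when computing IDF.

Creating a TfidfVectorizer, where stop_words is the list of words to ignore and token_pattern is the regular expression that defines what counts as a token: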

TfidfVectorizer(stop_words=stop_words, token_pattern=token_pattern)
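As a rough illustration of what these two parameters control, the sketch below mimics the preprocessing in plain Python. It is not sklearn's implementation; the helper name tokenize is hypothetical, and the regular expression is TfidfVectorizer's documented default token_pattern.

```python
import re

# Default token_pattern of TfidfVectorizer: tokens of two or more word characters.
DEFAULT_TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def tokenize(text, token_pattern=DEFAULT_TOKEN_PATTERN, stop_words=()):
    """Hypothetical helper: lowercase, extract tokens by regex, drop stop words."""
    stop_set = set(stop_words)
    tokens = re.findall(token_pattern, text.lower())
    return [t for t in tokens if t not in stop_set]

print(tokenize("This is a Bayes document"))
print(tokenize("This is a Bayes document", stop_words=["this", "is"]))
print(tokenize("This is a Bayes document", token_pattern=r"(?u)\b\w+\b"))
```

With the default pattern the single-character word "a" is dropped; a pattern such as `(?u)\b\w+\b` keeps it.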

Suppose we want to know which words appear in a set of documents and what TF-IDF value each word has in each document.

First we create a TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()

Then we create a list of 4 documents, documents, and fit tfidf_vec to it to obtain the TF-IDF matrix:

documents = [
    'this is the bayes document',
    'this is the second second document',
    'and the third one',
    'is this the document'
]
tfidf_matrix = tfidf_vec.fit_transform(documents)

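Print all unique words in the documents:

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print('Unique words:', tfidf_vec.get_feature_names_out().tolist())


# Output:
Unique words: ['and', 'bayes', 'document', 'is', 'one', 'second', 'the', 'third', 'this']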

Print the ID assigned to each word:

print('Word IDs:', tfidf_vec.vocabulary_)


# Output:
Word IDs: {'this': 8, 'is': 3, 'the': 6, 'bayes': 1, 'document': 2, 'second': 5, 'and': 0, 'third': 7, 'one': 4}

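Print the TF-IDF value of each word in each document. Each row is one document, and the columns follow the order of the word IDs:

print('TF-IDF values:', tfidf_matrix.toarray())    # convert the sparse matrix to a dense array


# Output:
TF-IDF values: [[0.         0.63314609 0.40412895 0.40412895 0.         0.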
  0.33040189 0.         0.40412895]
 [0.         0.         0.27230147 0.27230147 0.         0.85322574
  0.22262429 0.         0.27230147]
 [0.55280532 0.         0.         0.         0.55280532 0.
  0.28847675 0.55280532 0.        ]
 [0.         0.         0.52210862 0.52210862 0.         0.
  0.42685801 0.         0.52210862]]
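The numbers above can be reproduced by hand, which makes sklearn's defaults explicit. The sketch below assumes the default settings (smooth_idf=True, norm='l2'), under which idf = ln((1 + n) / (1 + df)) + 1 and each row is scaled to unit Euclidean length. Splitting on whitespace stands in for sklearn's tokenizer, which gives the same tokens for these four documents.

```python
import math
from collections import Counter

documents = [
    "this is the bayes document",
    "this is the second second document",
    "and the third one",
    "is this the document",
]
tokenized = [doc.split() for doc in documents]
vocabulary = sorted({word for doc in tokenized for word in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each word.
df = Counter(word for doc in tokenized for word in set(doc))

# Smoothed IDF with the natural logarithm.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocabulary}

def tfidf_row(doc):
    """Raw count times IDF for every vocabulary word, then L2-normalized."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

first_row = tfidf_row(tokenized[0])
print([round(v, 8) for v in first_row])
```

The entry for "bayes" (ID 1) comes out largest in the first document because it appears in only one of the four documents, while "the" (ID 6) appears in all of them and gets the smallest non-zero weight.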

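How to classify documents

Module 1: tokenize the documents

For English documents, the most commonly used package is NLTK. It provides an English stop word list as well as tokenization and part-of-speech tagging.

import nltk
# The tokenizer and tagger data must be fetched once with nltk.download()
word_list = nltk.word_tokenize(text)  # tokenize
nltk.pos_tag(word_list)  # tag each word with its part of speech

For Chinese documents, the most commonly used package is jieba, which provides Chinese word segmentation. A Chinese stop word list usually has to be loaded separately, as Module 2 shows.

import jieba
word_list = jieba.cut(text)  # Chinese word segmentation; returns a generator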

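Module 2: load the stop word list

We need to read the stop word file ourselves. Lists of common Chinese stop words are available online; save one as stop_words.txt, then read it with Python's file functions into the stop_words list:

with open('stop_words.txt', encoding='utf-8') as f:
    stop_words = [line.strip() for line in f]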
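A self-contained version of this step might look like the sketch below. The helper name load_stop_words is hypothetical, and the demo writes a throwaway file with three placeholder words so the example runs without any external data.

```python
import os
import tempfile

def load_stop_words(path, encoding="utf-8"):
    """Hypothetical helper: read one stop word per line, skipping blank lines."""
    with open(path, encoding=encoding) as f:
        return [line.strip() for line in f if line.strip()]

# Demo with a temporary file; the three words are placeholders.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "stop_words.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("的\n了\n\n是\n")
    stop_words = load_stop_words(demo_path)

print(stop_words)
```

Passing the encoding explicitly avoids depending on the platform default, which matters for Chinese text.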

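Module 3: compute the word weights

Create a TfidfVectorizer and call fit_transform to obtain the TF-IDF feature matrix train_features. You can think of the selected tokens as the features; train_features holds each training document's vector over those features.

tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5)
train_features = tf.fit_transform(train_contents)

The max_df parameter is the highest document frequency a word may have. With max_df=0.5, a word that appears in more than 50% of the documents is assumed to carry very little information and is excluded from the vocabulary.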
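The effect of max_df can be sketched in plain Python. This is an illustration of the filtering rule rather than sklearn's code; the helper name filter_by_max_df is hypothetical, and it follows the documented behavior that words with a document frequency strictly above the threshold are ignored.

```python
from collections import Counter

def filter_by_max_df(tokenized_docs, max_df=0.5):
    """Hypothetical helper: keep words found in at most max_df * n_docs documents."""
    n_docs = len(tokenized_docs)
    # Count each word once per document, however often it repeats inside it.
    df = Counter(word for doc in tokenized_docs for word in set(doc))
    return sorted(w for w, count in df.items() if count <= max_df * n_docs)

docs = [
    "this is the bayes document".split(),
    "this is the second second document".split(),
    "and the third one".split(),
    "is this the document".split(),
]
kept = filter_by_max_df(docs, max_df=0.5)
print(kept)
```

On the four example documents, "the" (4 of 4 documents) and "this", "is", "document" (3 of 4) exceed the 50% threshold and are dropped.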

Module 4: build the naive Bayes classifier

# Multinomial naive Bayes classifier; alpha is the additive smoothing parameter
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)
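The alpha argument controls additive smoothing of the per-class word probabilities. The sketch below shows the formula P(word | class) = (count + alpha) / (total + alpha * |V|) on made-up counts; the function name and the toy vocabulary are hypothetical.

```python
def smoothed_word_probs(word_counts, vocabulary, alpha=1.0):
    """P(word | class) = (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: (word_counts.get(w, 0) + alpha) / denom for w in vocabulary}

vocabulary = ["goal", "match", "stock"]
sports_counts = {"goal": 3, "match": 1}  # "stock" was never seen in this class

for alpha in (1.0, 0.001):
    probs = smoothed_word_probs(sports_counts, vocabulary, alpha)
    print(alpha, probs)
```

Without smoothing, the unseen word "stock" would get probability 0 and wipe out the whole product of likelihoods. A small alpha such as 0.001 keeps it non-zero while staying close to the observed frequencies.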

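Module 5: predict with the trained classifier

First we need the feature matrix of the test set.

To get it, create a TfidfVectorizer from the training vocabulary with the same stop_words and max_df, then call fit_transform on the test contents to obtain the test feature matrix test_features. Note that fit_transform recomputes the IDF weights from the test documents; only the vocabulary is shared with the training set.

test_tf = TfidfVectorizer(stop_words=stop_words, max_df=0.5, vocabulary=tf.vocabulary_)
test_features = test_tf.fit_transform(test_contents)

Then we use the trained classifier to predict on new data by calling predict with test_features, which returns predicted_labels. predict computes the posterior probability of every class and picks the class with the largest one.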

predicted_labels=clf.predict(test_features)
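A common alternative, shown here as a sketch rather than as the method used in this article, is to reuse the vectorizer fitted on the training set and call transform on the test documents. That keeps both the vocabulary and the IDF weights learned from the training data. The toy texts and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy data, invented for illustration.
train_contents = [
    "goal match team",
    "team win match",
    "stock market price",
    "price rise market",
]
train_labels = ["sports", "sports", "finance", "finance"]
test_contents = ["match goal", "market stock"]

tf = TfidfVectorizer()
train_features = tf.fit_transform(train_contents)  # learns vocabulary and IDF
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)

test_features = tf.transform(test_contents)        # reuses both, no refitting
predicted_labels = clf.predict(test_features)
print(test_features.shape, predicted_labels)
```

Either way, the test matrix must have the same number of columns as the training matrix, otherwise predict cannot be applied.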

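Module 6: compute the accuracy

Computing the accuracy is a way of evaluating the classification model. sklearn's metrics module provides the accuracy_score function, which compares the true labels with the predicted labels and returns the model's accuracy.

from sklearn import metrics
print(metrics.accuracy_score(test_labels, predicted_labels))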
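Accuracy here is simply the fraction of predictions that match the true labels. A minimal pure-Python equivalent, with a hypothetical function name and made-up labels, could look like this:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels: 3 of 4 predictions are correct.
print(accuracy(["a", "b", "a", "b"], ["a", "b", "b", "b"]))
```

Accuracy is easy to read but can mislead when the classes are very unbalanced, since always predicting the majority class already scores high.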

Exercise:

train_contents=[]
train_labels=[]
test_contents=[]
test_labels=[]
# Load the files; each subdirectory name is used as the class label
import os
import io
start = os.listdir(r'./text_classification-master/text classification/train')
for item in start:
    test_path = './text_classification-master/text classification/test/' + item + '/'
    train_path = './text_classification-master/text classification/train/' + item + '/'
    for file in os.listdir(test_path):
        with open(test_path + file, encoding="GBK") as f:
            test_contents.append(f.readline())  # reads only the first line of each file
            #print(test_contents)
            test_labels.append(item)
    for file in os.listdir(train_path):
        with open(train_path + file, encoding='gb18030', errors='ignore') as f:
            train_contents.append(f.readline())  # reads only the first line of each file
            train_labels.append(item)
print(len(train_contents),len(test_contents))

#######################################################################
# Load the stop words
import jieba
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
stop_words = [line.strip() for line in io.open('./text_classification-master/text classification/stop/stopword.txt', encoding='utf-8').readlines()]

# Tokenize with jieba and compute the word weights
tf = TfidfVectorizer(tokenizer=jieba.cut, stop_words=stop_words, max_df=0.5)
train_features = tf.fit_transform(train_contents)
print(train_features.shape)

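# Module 4: build the naive Bayes classifier
# Multinomial naive Bayes classifier
clf = MultinomialNB(alpha=0.001).fit(train_features, train_labels)

# Module 5: predict with the trained classifier
# Reuses the training vocabulary; the IDF weights are recomputed on the test set
test_tf = TfidfVectorizer(tokenizer=jieba.cut, stop_words=stop_words, max_df=0.5, vocabulary=tf.vocabulary_)
test_features = test_tf.fit_transform(test_contents)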

print(test_features.shape)
predicted_labels=clf.predict(test_features)
print(metrics.accuracy_score(test_labels, predicted_labels))

# Output:
3306 200
(3306, 24581)
(200, 24581)
0.925