How does vectorizer fit_transform work in sklearn? Answers

【Question title】: How does vectorizer fit_transform work in sklearn?
【Posted】: 2018-06-02 13:24:30
【Question description】:

I am trying to understand the following code:

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

When I print X to see what is returned, I get the following result:

(0, 1)  1

(0, 2)  1

(0, 6)  1

(0, 3)  1

(0, 8)  1

(1, 5)  2

(1, 1)  1

(1, 6)  1

(1, 3)  1

(1, 8)  1

(2, 4)  1

(2, 7)  1

(2, 0)  1

(2, 6)  1

(3, 1)  1

(3, 2)  1

(3, 6)  1

(3, 3)  1

(3, 8)  1

However, I don't understand what this result means.

【Comments】:

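That is how a sparse matrix is printed. Use X.toarray() to convert it to a dense array, then print it.
But what do the numbers here mean, for example "(3, 6) 1"? Can you explain in detail?
In a sparse matrix, most entries are zero, so they are not stored, which saves memory. The numbers in parentheses are the (row, column) index of the value in the matrix, and 1 is the value (the number of times a term occurs in the document represented by that row).
If "1" is the number of times a term occurs in the document, then why, in the first document, does "the" appear 2 times while all positions (from (0, 1) to (0, 8)) have the same value 1?
Maybe "the" is a stop word and is not included in the learned vocabulary. Check the actual vocabulary behind the indices by printing vectorizer.get_feature_names() (get_feature_names_out() in scikit-learn 1.0 and later).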

Tags: python machine-learning scikit-learn

【Solution 1】:

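It converts text into numbers. With other functions you can then count how many times each word occurs in the given dataset. I am new to programming, so there may be other areas where it is useful.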
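
As a rough illustration of counting each word across the whole dataset, the sketch below sums the columns of the matrix returned by fit_transform. The names totals and word_totals are assumptions made for this example, not part of the original code.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Summing over rows (axis=0) gives one total per vocabulary word.
# np.asarray(...).ravel() flattens the 1 x n_words result to a 1-D array.
totals = np.asarray(X.sum(axis=0)).ravel()

# vocabulary_ maps each word to its column index.
word_totals = {word: int(totals[idx]) for word, idx in vectorizer.vocabulary_.items()}
print(word_totals)
```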


【Solution 2】:

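You can read each entry as "(sentence_index, feature_index) count".

Because there are 4 sentences, the sentence index starts at 0 and ends at 3.

The feature index is the word index, which you can get from vectorizer.vocabulary_

-> vocabulary_ is a dictionary {word: feature_index, ...}

So for the example (0, 1) 1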

-> 0 : row[the sentence index]

-> 1 : feature index (i.e. the word whose value in vectorizer.vocabulary_ is 1; the dictionary is keyed by word, so vectorizer.vocabulary_[1] itself does not work)

-> 1 : count/tfidf (as you have used a count vectorizer, it will give you count)
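
The lookup above can be sketched as follows. Since vocabulary_ maps word -> index, the example first inverts it; index_to_word and triples are hypothetical names introduced only for this illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)

# Invert word -> index into index -> word so a feature index can be decoded.
index_to_word = {int(idx): word for word, idx in vectorizer.vocabulary_.items()}

# COO format exposes the stored entries as parallel row/col/data arrays.
coo = X.tocoo()
triples = [
    (int(r), index_to_word[int(c)], int(v))
    for r, c, v in zip(coo.row, coo.col, coo.data)
]

for sentence_index, word, count in triples:
    print(sentence_index, word, count)
```

Read this way, (0, 1) 1 says that the word at feature index 1, "document", occurs once in sentence 0.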

If you use a tfidf vectorizer instead of the count vectorizer, it will give you tfidf values. I hope that is clear.
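
For comparison, here is a small sketch that runs both vectorizers on the same corpus with their default settings. The variable names X_count and X_tfidf are assumptions for this example. The positions of the stored entries are the same; only the values change from integer counts to float weights.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]

X_count = CountVectorizer().fit_transform(corpus)
X_tfidf = TfidfVectorizer().fit_transform(corpus)

# Same shape and same non-zero positions, different values.
print(X_count[1].toarray())           # integer counts for the second sentence
print(X_tfidf[1].toarray().round(3))  # float tf-idf weights for the same sentence
```

By default TfidfVectorizer also L2-normalizes each row, so the weights in a row are not directly comparable to raw counts.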


【Solution 3】:

As @Himanshu wrote, each entry is "(sentence_index, feature_index) count".

The count part here is "the number of times a word occurs in the document".

For example,

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 2 For this example only, the count "2" means that the word "second" appears twice in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1

Let's change the corpus in the code. Basically, I added the word "second" two more times to the second sentence of the corpus list.

from sklearn.feature_extraction.text import CountVectorizer 

vectorizer = CountVectorizer() 

corpus = ['This is the first document.','This is the second second second second document.','And the third one.','Is this the first document?'] 

X = vectorizer.fit_transform(corpus)

(0, 1) 1

(0, 2) 1

(0, 6) 1

(0, 3) 1

(0, 8) 1

(1, 5) 4 For the modified corpus, the count "4" means that the word "second" appears four times in this document/sentence

(1, 1) 1

(1, 6) 1

(1, 3) 1

(1, 8) 1

(2, 4) 1

(2, 7) 1

(2, 0) 1

(2, 6) 1

(3, 1) 1

(3, 2) 1

(3, 6) 1

(3, 3) 1

(3, 8) 1
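
The two runs can be compared in one short sketch. The helper second_count is a hypothetical name introduced for this illustration; it looks up the column of "second" in vocabulary_ and reads the count stored for the second sentence (row 1).

```python
from sklearn.feature_extraction.text import CountVectorizer


def second_count(corpus):
    """Return how often 'second' occurs in the second sentence (row 1)."""
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)
    col = vectorizer.vocabulary_['second']  # column index of the word
    return int(X[1, col])


original = [
    'This is the first document.',
    'This is the second second document.',
    'And the third one.',
    'Is this the first document?',
]
modified = [
    'This is the first document.',
    'This is the second second second second document.',
    'And the third one.',
    'Is this the first document?',
]

print(second_count(original))  # 2
print(second_count(modified))  # 4
```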
