【Posted】: 2021-05-26 18:13:45
【Question】:
I am testing TfidfVectorizer on a simple example, but I cannot work out how the results are computed.
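from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["I'd like an apple",
          "An apple a day keeps the doctor away",
          "Never compare an apple to an orange",
          "I prefer scikit-learn to Orange",
          "The scikit-learn docs are Orange and Blue"]
# min_df=1 keeps every term; stop_words="english" drops common English words
vect = TfidfVectorizer(min_df=1, stop_words="english")
tfidf = vect.fit_transform(corpus)  # sparse matrix, one row per document
# get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead
print(vect.get_feature_names())
print(tfidf.shape)
print(tfidf)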
Output:
['apple', 'away', 'blue', 'compare', 'day', 'docs', 'doctor', 'keeps', 'learn', 'like', 'orange', 'prefer', 'scikit']
(5, 13)
(0, 0) 0.5564505207186616
(0, 9) 0.830880748357988
...
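When I compute the tf-idf of the first sentence by hand, I get different results:
- The first document ("I'd like an apple") contains only 2 words after stop-word removal (according to the output of vect.get_feature_names(), we keep "like" and "apple").
- TF("apple", Document_1) = 1/2 = 0.5
- TF("like", Document_1) = 1/2 = 0.5
- The word apple appears in 3 of the 5 documents in the corpus.
- The word like appears in 1 of the 5 documents in the corpus.
- IDF("apple") = ln(5/3) = 0.51082
- IDF("like") = ln(5/1) = 1.60943
So:
- tfidf("apple") in document1 = 0.5 * 0.51082 = 0.255 != 0.5564
- tfidf("like") in document1 = 0.5 * 1.60943 = 0.804 != 0.8308
What am I missing?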
【Discussion】:
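The hand calculation in the question can be reproduced in plain Python and compared with the formula that the scikit-learn documentation describes for the default settings (smooth_idf=True, norm="l2"). This is a minimal sketch, not the library's code. It assumes the document frequencies stated in the question and that the defaults use the raw term count as TF. The variable names (n_docs, df, counts, textbook, sklearn_like) are illustrative only.

```python
import math

# Figures taken from the question: 5 documents,
# "apple" occurs in 3 of them and "like" in 1.
n_docs = 5
df = {"apple": 3, "like": 1}
# Raw term counts in the first document after stop-word removal.
counts = {"apple": 1, "like": 1}

# Textbook variant used in the question:
# tf = count / document length, idf = ln(n / df).
doc_len = sum(counts.values())
textbook = {t: (c / doc_len) * math.log(n_docs / df[t])
            for t, c in counts.items()}

# Variant documented for scikit-learn's defaults (assumption: defaults unchanged):
# tf = raw count, idf = ln((1 + n) / (1 + df)) + 1, then L2-normalise the row.
raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
       for t, c in counts.items()}
norm = math.sqrt(sum(v * v for v in raw.values()))
sklearn_like = {t: v / norm for t, v in raw.items()}

print(textbook)
print(sklearn_like)
```

Under these assumptions the first dictionary gives the 0.255 and 0.804 from the hand calculation, and the second agrees with the printed 0.5564 and 0.8308 to four decimal places. That suggests the gap comes from the smoothed IDF and the row normalisation rather than from the corpus.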
Tags: python scikit-learn nlp tf-idf tfidfvectorizer