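【Posted】: 2021-05-11 10:33:37
【Question】:
Consider the example below. The important words that characterize the documents are "Bob" and "Sara". With max_features, however, the output tends to show the most frequent words instead. This gets worse when the corpus is large. How can we get only the important words?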
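from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'hi, my name is Bob.',
    'hi, my name is Sara.',
]

# max_features=2 keeps only the 2 terms with the highest corpus-wide term frequency
vectorizer = TfidfVectorizer(max_features=2)
# toarray() returns a plain ndarray instead of the numpy.matrix from todense()
X = vectorizer.fit_transform(corpus).toarray()
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names_out())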
Output:
,hi,is
0,0.7071067811865475,0.7071067811865475
1,0.7071067811865475,0.7071067811865475
【Comments】:
Tags: python scikit-learn nlp tf-idf tfidfvectorizer