【Posted】: 2019-12-12 18:08:07
【Question】:
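I am trying to solve a text classification problem using SVC in sklearn. I also want to check which vectorizer works best for my data: Bag of Words with CountVectorizer() or TF-IDF with TfidfVectorizer().
What I have been doing so far is using the two vectorizers separately, one after the other, and then comparing their results.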
# Bag of Words (BoW)
from sklearn.feature_extraction.text import CountVectorizer
count_vectorizer = CountVectorizer()
features_train_cv = count_vectorizer.fit_transform(features_train)
# TF-IDF
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vec = TfidfVectorizer()
features_train_tfidf = tfidf_vec.fit_transform(features_train)
# Instantiate SVC
from sklearn.svm import SVC
classifier_linear = SVC(random_state=1, class_weight='balanced', kernel="linear", C=1000)
# Fit SVC with BoW features
classifier_linear.fit(features_train_cv,target_train)
features_test_cv = count_vectorizer.transform(features_test)
target_test_pred_cv = classifier_linear.predict(features_test_cv)
# Confusion matrix: SVC with BoW features
from sklearn.metrics import confusion_matrix
print(confusion_matrix(target_test, target_test_pred_cv))
[[ 689 517]
[ 697 4890]]
# Fit SVC with TF-IDF features
classifier_linear.fit(features_train_tfidf,target_train)
features_test_tfidf = tfidf_vec.transform(features_test)
target_test_pred_tfidf = classifier_linear.predict(features_test_tfidf)
# Confusion matrix: SVC with TF-IDF features
print(confusion_matrix(target_test, target_test_pred_tfidf))
[[ 701 505]
[ 673 4914]]
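To compare the two runs on a single number, accuracy and per-class recall can be read directly off the two confusion matrices. The snippet below is only a rough sketch: the matrices are copied by hand from the output above, and the helper names `accuracy` and `recall_per_class` are made up for illustration, not sklearn functions.

```python
# Confusion matrices copied from the two runs above
# (rows: true class, columns: predicted class, as sklearn prints them).
cm_bow = [[689, 517], [697, 4890]]
cm_tfidf = [[701, 505], [673, 4914]]


def accuracy(cm):
    """Fraction of correct predictions: diagonal sum over total count."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def recall_per_class(cm):
    """Recall for each true class: diagonal entry over its row sum."""
    return [cm[i][i] / sum(cm[i]) for i in range(len(cm))]


acc_bow = accuracy(cm_bow)
acc_tfidf = accuracy(cm_tfidf)
rec_bow = recall_per_class(cm_bow)
rec_tfidf = recall_per_class(cm_tfidf)

print(f"BoW accuracy:    {acc_bow:.4f}")
print(f"TF-IDF accuracy: {acc_tfidf:.4f}")
print("BoW recall per class:   ", [round(r, 4) for r in rec_bow])
print("TF-IDF recall per class:", [round(r, 4) for r in rec_tfidf])
```

On these numbers TF-IDF comes out slightly ahead (roughly 0.827 versus 0.821 accuracy), but the gap is small.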
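I thought that using a Pipeline might make my code look more organized. But I noticed that the Pipeline code suggested in the sklearn tutorial on the module's official page includes two vectorization steps: both CountVectorizer() (Bag of Words) and a TF-IDF step, TfidfTransformer().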
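# from sklearn official tutorial
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])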
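My impression was that you only need to choose one vectorizer for your features. Does this mean the data is vectorized twice, once with simple term counts and then again with TF-IDF?
How does this code work?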
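One way to probe this is to compare the two-step route (CountVectorizer followed by TfidfTransformer) against a single TfidfVectorizer on the same input. The sketch below assumes a tiny made-up corpus; `docs` and the variable names are illustrative and not taken from my data. The sklearn documentation describes TfidfVectorizer as equivalent to CountVectorizer followed by TfidfTransformer, so with default settings the two feature matrices would be expected to match.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)
from sklearn.pipeline import Pipeline

# Toy corpus, invented purely for this check.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]

# Two-step route: raw term counts first, then TF-IDF reweighting of those counts.
two_step = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
])
X_two_step = two_step.fit_transform(docs)

# One-step route: TfidfVectorizer tokenizes, counts, and reweights in one object.
X_one_step = TfidfVectorizer().fit_transform(docs)

# Compare the dense versions of the two sparse matrices element by element.
same = np.allclose(X_two_step.toarray(), X_one_step.toarray())
print("Same shape:", X_two_step.shape == X_one_step.shape)
print("Same values:", same)
```

If the two matrices match, the text is only tokenized once: the second step rescales the counts produced by the first rather than vectorizing the raw text again.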
【Comments】:
Tags: python scikit-learn nlp pipeline