【Posted】:2020-11-12 14:43:04
【Question】:
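I have a Python script that classifies text as positive or negative. After preprocessing the text in my dataset, I split it into training and test data:
- 91% accuracy on the training data
- 87% accuracy on the test data
When I try the model on real data, it gives only 20% accuracy. Where is the mistake?
Training data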
Accuracy: 91.459%
Best parameters set found on development set:
{'bow__ngram_range': (1, 2), 'tfidf__use_idf': True}
Optimized model achieved an ROC of: 0.9998
Test data
accuracy score: 0.8704919797610077
confusion matrix:
[[3920  699]
 [ 504 4166]]
              precision    recall  f1-score   support

           0       0.89      0.85      0.87      4619
           1       0.86      0.89      0.87      4670

   micro avg       0.87      0.87      0.87      9289
   macro avg       0.87      0.87      0.87      9289
weighted avg       0.87      0.87      0.87      9289
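As a cross-check on the report above, the headline numbers can be recomputed from the confusion matrix alone. This is a minimal sketch, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper names `precision` and `recall` are illustrative, not part of the original script.

```python
# Confusion matrix copied from the test-data output above.
# Assumed layout (scikit-learn convention): rows = true class, columns = predicted class.
cm = [[3920, 699],
      [504, 4166]]

total = sum(sum(row) for row in cm)          # number of test samples
accuracy = (cm[0][0] + cm[1][1]) / total     # correct predictions / all predictions


def precision(cls):
    # Correct predictions of `cls` divided by everything predicted as `cls` (column sum).
    return cm[cls][cls] / (cm[0][cls] + cm[1][cls])


def recall(cls):
    # Correct predictions of `cls` divided by everything that truly is `cls` (row sum).
    return cm[cls][cls] / sum(cm[cls])


print(f"samples: {total}, accuracy: {accuracy:.4f}")
for cls in (0, 1):
    print(f"class {cls}: precision={precision(cls):.2f} recall={recall(cls):.2f}")
```

The recomputed values agree with the reported accuracy and with the per-class precision and recall, so the test-set figures are internally consistent.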
I use Logistic Regression as the ML model, together with TF-IDF and cross-validation.
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression  # missing from the original imports
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

# 3-fold CV; recent scikit-learn versions require shuffle=True when random_state is set
cross_val = KFold(n_splits=3, shuffle=True, random_state=42)

# create pipeline
pipeline = Pipeline([
    ('bow', CountVectorizer(strip_accents='ascii',
                            # built-in list; the original ['english'] removes only the token "english"
                            stop_words='english',
                            lowercase=True)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', LogisticRegression(C=15.075475376884423, penalty="l2")),
])

# grid: unigrams vs. unigrams + bigrams, with and without IDF weighting
parameters = {'bow__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              }

# x_train / y_train: preprocessed training texts and their 0/1 labels (defined elsewhere)
clf = GridSearchCV(pipeline, param_grid=parameters, cv=cross_val, verbose=1, n_jobs=-1, scoring='roc_auc')
clf.fit(x_train, y_train)
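The snippet stops at fitting. As a minimal sketch of the next step, scoring new data might look like the following, assuming the real-world texts are passed as raw strings through the same fitted pipeline so that they get the same vocabulary and TF-IDF weights as the training data. The toy texts and the names `train_texts`, `new_texts` and `new_labels` are hypothetical stand-ins; the actual real data is not shown in the question, and the grid search is left out to keep the example small.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for x_train / y_train (1 = positive, 0 = negative).
train_texts = [
    "great movie, loved it", "wonderful acting and a great plot",
    "loved the wonderful soundtrack", "great fun, wonderful cast",
    "really loved this great film", "a wonderful, great experience",
    "terrible movie, hated it", "awful acting and a boring plot",
    "hated the awful soundtrack", "boring and terrible cast",
    "really hated this awful film", "a boring, terrible experience",
]
train_labels = [1] * 6 + [0] * 6

# Same three steps as in the question; L2 is LogisticRegression's default penalty.
pipeline = Pipeline([
    ('bow', CountVectorizer(strip_accents='ascii', stop_words='english', lowercase=True)),
    ('tfidf', TfidfTransformer()),
    ('classifier', LogisticRegression(C=15.075475376884423)),
])
pipeline.fit(train_texts, train_labels)

# Hypothetical "real" texts with known labels. They are raw strings, so the
# pipeline applies the vectorizer fitted on the training data before predicting.
new_texts = ["great wonderful film, loved it", "terrible awful film, hated it"]
new_labels = [1, 0]

predicted = pipeline.predict(new_texts)
new_accuracy = accuracy_score(new_labels, predicted)
new_cm = confusion_matrix(new_labels, predicted)

print("accuracy on new data:", new_accuracy)
print(new_cm)
```

With a fitted `GridSearchCV` (where `refit=True` is the default), `clf.predict(new_texts)` delegates to the best estimator in the same way, so the real data should go through `clf` as raw text rather than through a separately fitted vectorizer.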
【Discussion】:
Tags: python scikit-learn nltk sentiment-analysis text-classification