【Posted】: 2017-12-26 21:31:38
【Question】:
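I am testing a multi-label classification problem using text features. I have 1503 text documents in total. Each time I run the script manually, my model shows slightly different results. As a beginner, I am not sure whether my model is overfitting or whether this is normal.
I built the model using the exact script from the following blog. The one change is that I use LinearSVC from scikit-learn.
http://zacstewart.com/2015/04/28/document-classification-with-scikit-learn.html
My accuracy score ranges between 89% and 90%, and kappa between 87% and 88%. Should I modify something to make the results stable?
Here is the output of two manual runs. First run: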
Total emails classified: 1503
F1 Score: 0.902158940397
classification accuracy: 0.902158940397
kappa accuracy: 0.883691169128
              precision    recall  f1-score   support
Arts              0.916     0.878     0.897       237
Music             0.932     0.916     0.924       238
News              0.828     0.876     0.851       242
Politics          0.937     0.900     0.918       230
Science           0.932     0.791     0.855        86
Sports            0.929     0.948     0.938       233
Technology        0.874     0.937     0.904       237
avg / total       0.904     0.902     0.902      1503
Second run
Total emails classified: 1503
F1 Score: 0.898181015453
classification accuracy: 0.898181015453
kappa accuracy: 0.879002051427
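To put the two runs in perspective, the gap between them can be compared with the sampling noise expected for 1503 documents. The following is a rough sketch, not a formal test: it treats each prediction as an independent trial (a binomial approximation, which is an assumption made here and not something stated in the post), and the variable names are illustrative.

```python
import math

# Figures copied from the two runs above
n_docs = 1503
acc_run1 = 0.902158940397
acc_run2 = 0.898181015453

# Difference between the runs, also expressed as a number of documents
gap = acc_run1 - acc_run2
docs_equivalent = gap * n_docs

# Binomial standard error of an accuracy estimate on n_docs documents
# (assumption: predictions are independent trials with success rate p)
p = (acc_run1 + acc_run2) / 2
std_error = math.sqrt(p * (1 - p) / n_docs)

print(f"gap between runs: {gap:.4f} (about {docs_equivalent:.1f} documents)")
print(f"binomial standard error: {std_error:.4f}")
```

The gap corresponds to about six documents out of 1503 and is smaller than the standard error under this approximation, so on these figures the two runs do not look meaningfully different.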
Here is the code:
def compute_classification():
    #- 1. Load dataset
    data = DataFrame({'text': [], 'class': []})
    for path, classification in SOURCES:
        data = data.append(build_data_frame(path, classification))
    data = data.reindex(numpy.random.permutation(data.index))

    #- 2. Apply different classification methods
    """
    SVM
    """
    pipeline = Pipeline([
        # SVM using TfidfVectorizer
        ('vectorizer', TfidfVectorizer(max_features=25000, ngram_range=(1, 2), sublinear_tf=True, max_df=0.95, min_df=2, stop_words=stop_words1)),
        ('clf', LinearSVC(loss='squared_hinge', penalty='l2', dual=False, tol=1e-3))
    ])

    #- 3. Perform K Fold Cross Validation
    k_fold = KFold(n=len(data), n_folds=10)
    f_score = []
    c_accuracy = []
    k_score = []
    # 7 x 7 confusion matrix, one row and column per class
    confusion = numpy.zeros((7, 7), dtype=int)
    y_predicted_overall = None
    y_test_overall = None
    for train_indices, test_indices in k_fold:
        train_text = data.iloc[train_indices]['text'].values
        train_y = data.iloc[train_indices]['class'].values.astype(str)
        test_text = data.iloc[test_indices]['text'].values
        test_y = data.iloc[test_indices]['class'].values.astype(str)

        # Train the model
        pipeline.fit(train_text, train_y)

        # Predict test data
        predictions = pipeline.predict(test_text)
        confusion += confusion_matrix(test_y, predictions, binary=False)
        score = f1_score(test_y, predictions, average='micro')
        f_score.append(score)
        caccuracy = metrics.accuracy_score(test_y, predictions)
        c_accuracy.append(caccuracy)
        kappa = cohen_kappa_score(test_y, predictions)
        k_score.append(kappa)

        # collect the y_predicted per fold
        if y_predicted_overall is None:
            y_predicted_overall = predictions
            y_test_overall = test_y
        else:
            y_predicted_overall = numpy.concatenate([y_predicted_overall, predictions])
            y_test_overall = numpy.concatenate([y_test_overall, test_y])

    # Print Metrics
    print_metrics(data, k_score, c_accuracy, y_predicted_overall, y_test_overall, f_score, confusion)
    return pipeline
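One likely source of the run-to-run differences is the unseeded numpy.random.permutation call in step 1: the rows land in a different order on every run, so each of the 10 folds contains different documents. The sketch below illustrates the effect with the standard library only. The helper kfold_indices and its seed parameter are hypothetical names used for illustration and are not part of scikit-learn.

```python
import random


def kfold_indices(n_samples, n_folds, seed=None):
    """Shuffle the indices 0..n_samples-1 and cut them into n_folds test folds.

    Hypothetical helper: with seed=None the shuffle differs between runs,
    with a fixed seed it is reproducible.
    """
    rng = random.Random(seed)
    indices = list(range(n_samples))
    rng.shuffle(indices)

    # The first (n_samples % n_folds) folds get one extra sample
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


# Same seed: identical folds, so the metrics would be identical too
folds_a = kfold_indices(1503, 10, seed=42)
folds_b = kfold_indices(1503, 10, seed=42)

# Different seed: the documents are distributed over the folds differently
folds_c = kfold_indices(1503, 10, seed=7)

print("same seed, same folds:", folds_a == folds_b)
print("different seed, same folds:", folds_a == folds_c)
```

In the script above, the analogous change would be to call numpy.random.seed with a fixed integer before the permutation, so that the shuffled row order, and therefore the folds, stay the same between runs.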
【Discussion】:
Tags: python scikit-learn cross-validation text-classification multilabel-classification