【Question title】: Sklearn text classification: Why is accuracy so low?
【Posted】: 2020-08-25 11:22:18
【Question description】:

Okay, I'm following https://medium.com/@phylypo/text-classification-with-scikit-learn-on-khmer-documents-1a395317d195 and https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html to try to classify text by category. My dataframe is laid out like this and is named result:

target   type    post
1      intj    "hello world shdjd"
2      entp    "hello world fddf"
16     estj   "hello world dsd"
4      esfp    "hello world sfs"
1      intj    "hello world ddfd"

The goal is to classify posts by type; target simply assigns a number from 1 to 16 to each of the 16 types. To classify the text, I do this:

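from sklearn import linear_model, metrics, model_selection, naive_bayes, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

result = result[:1000] #shorten df - was :600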

# split the dataset into training and validation datasets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], result['type'], test_size=0.30, random_state=1)

# label encode the target variable
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

def tokenizersplit(str):
    return str.split()
tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, encoding='utf-8', min_df=2, ngram_range=(1, 2), max_features=25000)

tfidf_vect.fit(result['post'])
tfidf_vect.transform(result['post'])

xtrain_tfidf = tfidf_vect.transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

def train_model(classifier, trains, t_labels, valids, v_labels):
    # fit the training dataset on the classifier
    classifier.fit(trains, t_labels)

    # predict the labels on validation dataset
    predictions = classifier.predict(valids)

    return metrics.accuracy_score(predictions, v_labels)

# Naive Bayes
accuracy = train_model(naive_bayes.MultinomialNB(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print ("NB accuracy: ", accuracy)

# Logistic Regression
accuracy = train_model(linear_model.LogisticRegression(), xtrain_tfidf, train_y, xvalid_tfidf, valid_y)
print ("LR accuracy: ", accuracy)

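Depending on how much I shorten result at the start, the accuracy of every algorithm peaks at around 0.4. It should be 0.8-0.9.

I read "scikit very low accuracy on classifiers(Naive Bayes, DecissionTreeClassifier)", but I can't see how to apply it to my dataframe. My data is simple: there is a category (type) and a text (post).

What is wrong here?

Edit - Naive Bayes, take 2: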

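import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

text_clf = Pipeline([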
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
text_clf.fit(result.post, result.target)

docs_test = result.post
predicted = text_clf.predict(docs_test)
np.mean(predicted == result.target)

print("Naive Bayes: ")
print(np.mean(predicted == result.target))

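【Comments】:

  • At first glance, train_y = encoder.fit_transform(train_y) followed by valid_y = encoder.fit_transform(valid_y) looks suspicious. I suggest encoding the labels before splitting, or calling fit_transform on train_y and only transform on valid_y.
  • @MikeXydas Thanks - I'm new to this and took the above from a tutorial. Could you provide an example/answer?

Tags: python machine-learning scikit-learn text-classification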


【Solution 1】:

What you are doing

I think the error is in the following lines:

encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.fit_transform(valid_y)

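By fitting twice, you reset what the LabelEncoder has learned: the second fit_transform builds a fresh mapping from the validation labels alone.
Here is a simpler example: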

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
y_train = le.fit_transform(["class1", "class2", "class3"])
y_valid = le.fit_transform(["class2", "class3"])
print(y_train)
print(y_valid)

This outputs the following label encodings:

[0 1 2]
[0 1]

This is wrong, because the encoded label 0 stands for class1 in training but for class2 in validation.
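For contrast, here is a minimal sketch of the consistent version, using the same toy class names: the encoder is fitted once, and the validation labels only go through transform, so both sets share one mapping.

```python
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
# Fit once on the training labels; this fixes the class -> integer mapping.
y_train = le.fit_transform(["class1", "class2", "class3"])
# Reuse the fitted mapping for validation; do not call fit again.
y_valid = le.transform(["class2", "class3"])

print(y_train)  # [0 1 2]
print(y_valid)  # [1 2]
```

Note that transform raises a ValueError if the validation set contains a label the encoder never saw during fitting, which is one reason to encode before splitting.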

Fix

I would change your first lines to:

result = result[:1000] #shorten df - was :600

# Encode the labels before splitting
encoder = preprocessing.LabelEncoder()
y_encoded = encoder.fit_transform(result['type'])

# Note: the target passed to train_test_split is now y_encoded, not result['type']
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(result['post'], y_encoded, test_size=0.30, random_state=1)

def tokenizersplit(str):
    return str.split()

.
.
.
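Putting the pieces together, the fixed flow might look like the sketch below. It is only an illustration under stated assumptions: the toy result dataframe and its texts are hypothetical stand-ins for the real data, the default tokenizer is used instead of tokenizersplit, the vectorizer is fitted on the training texts only (the question fits it on all posts), and stratify is added so both classes appear in each split.

```python
import pandas as pd
from sklearn import linear_model, metrics, model_selection, preprocessing
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy data standing in for the asker's `result` dataframe.
result = pd.DataFrame({
    "type": ["intj", "entp"] * 10,
    "post": ["i like quiet planning", "lets debate loud ideas"] * 10,
})

# Encode the labels once, before splitting, so both splits share one mapping.
encoder = preprocessing.LabelEncoder()
y_encoded = encoder.fit_transform(result["type"])

train_x, valid_x, train_y, valid_y = model_selection.train_test_split(
    result["post"], y_encoded, test_size=0.30, random_state=1, stratify=y_encoded
)

# Assumption: fit the vectorizer on the training texts only, then reuse it.
tfidf_vect = TfidfVectorizer(ngram_range=(1, 2))
xtrain_tfidf = tfidf_vect.fit_transform(train_x)
xvalid_tfidf = tfidf_vect.transform(valid_x)

classifier = linear_model.LogisticRegression()
classifier.fit(xtrain_tfidf, train_y)
predictions = classifier.predict(xvalid_tfidf)

# accuracy_score takes the true labels first, then the predictions.
accuracy = metrics.accuracy_score(valid_y, predictions)
print("LR accuracy:", accuracy)
```

The accuracy on this toy data says nothing about the real dataset; the point is only the order of the steps.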

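【Comments】:

  • Mike, thanks, but this does not change the result. My accuracy is still around 0.3. I also get the error "UserWarning: The parameter 'token_pattern' will not be used since 'tokenizer' is not None' warnings.warn("The parameter 'token_pattern' will not be used"
  • OK, see my edit - it seems the accuracy is now low with Naive Bayes. Am I doing something wrong?
  • Never mind - apparently alpha has to be set on Naive Bayes!
  • Any idea how you dealt with the 'token_pattern' error, @skyguy? I get the same.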
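The two points raised in these comments can be sketched as follows. This is a hedged illustration, not the asker's final code: the documents and labels are made up, and alpha=0.1 is an arbitrary example value, since the thread does not say which value was used. As far as I know, recent scikit-learn versions only emit the token_pattern warning when both tokenizer and token_pattern are set, so passing token_pattern=None alongside a custom tokenizer should silence it.

```python
import warnings

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB


def tokenizersplit(text):
    # Whitespace tokenizer, as in the question.
    return text.split()


# Hypothetical toy documents and labels, for illustration only.
docs = ["hello world shdjd", "hello world fddf", "hello there dsd", "hello there sfs"]
labels = [0, 0, 1, 1]

# token_pattern=None states explicitly that the regex pattern is not wanted.
tfidf_vect = TfidfVectorizer(tokenizer=tokenizersplit, token_pattern=None)

# Record any warnings raised while fitting so we can inspect them.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    features = tfidf_vect.fit_transform(docs)
token_warnings = [w for w in caught if "token_pattern" in str(w.message)]
print("token_pattern warnings:", len(token_warnings))

# alpha is the additive smoothing parameter (default 1.0); 0.1 is illustrative.
clf = MultinomialNB(alpha=0.1)
clf.fit(features, labels)
predicted = clf.predict(tfidf_vect.transform(["hello world zzz"]))
print(predicted)
```

A suitable alpha depends on the data, so it is usually chosen by cross-validation rather than fixed by hand.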