[Posted]: 2018-10-18 20:30:09
[Question]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.ensemble import RandomForestClassifier

# ItemSelector is a custom transformer whose definition is not shown here;
# it presumably selects a single column from the input data.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('Comments', Pipeline([
            ('selector', ItemSelector(column="Comments")),
            ('tfidf', TfidfVectorizer(use_idf=False, ngram_range=(1, 2), max_df=0.95, min_df=0, sublinear_tf=True)),
        ])),
        ('Vendor', Pipeline([
            ('selector', ItemSelector(column="Vendor Name")),
            ('tfidf', TfidfVectorizer(use_idf=False)),
        ]))
    ])),
    ('clf', RandomForestClassifier(n_estimators=200, max_features='log2', criterion='entropy', random_state=45))
    # ('clf', LogisticRegression())
])

# X and df are defined elsewhere (not shown in the question).
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    df['code Description'],
                                                    test_size=0.3,
                                                    train_size=0.7,
                                                    random_state=100)
model = pipeline.fit(X_train, y_train)
s = pipeline.score(X_test, y_test)
pred = model.predict(X_test)
predicted = model.predict_proba(X_test)
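For some classes, the output of predict matches the predicted probabilities. In some cases, however, I get
proba_predict = [0.3,0.18,0.155]
but the sample is classified as class B instead of class A.
Predicted class: B
Actual class: A
The right-hand column holds my labels and the left-hand column holds my input text data.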
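[Comments]:
-
Could you provide some sample data for which this happens? With the information given, we cannot reproduce your result or help.
-
@RafaelC No, I am saying that the results of predict_proba() and predict() do not match. The class with the highest predict_proba() value should be my prediction, but the class with the second-highest value is returned as the prediction instead.
-
Might it be worth double-checking whether the order of the classes in model.classes_ is the order you expect?
-
Yes, I am rechecking it from the start; according to the source code, it should simply predict the class with the maximum probability. Thanks @Merlin1896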
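The check suggested in the comments can be sketched as follows. This is only an illustration under stated assumptions: the toy `X` and `y`, the label names, and the smaller forest are hypothetical and are not the asker's data or pipeline. It shows that the columns of `predict_proba()` follow `classes_` (sorted label order, not the order in which labels first appear), and compares `predict()` against the argmax of the probabilities mapped through `classes_`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy data: labels deliberately first appear in the order B, A, C.
rng = np.random.RandomState(0)
X = rng.rand(60, 4)
y = np.array(["B", "A", "C"] * 20)

clf = RandomForestClassifier(n_estimators=20, random_state=45).fit(X, y)

# classes_ holds the sorted unique labels, regardless of their order in y.
print(clf.classes_)

proba = clf.predict_proba(X)
pred = clf.predict(X)

# Column i of proba is the probability of clf.classes_[i], so the argmax
# index has to be mapped through classes_ to obtain a label.
labels_from_proba = clf.classes_[np.argmax(proba, axis=1)]

# Count the rows where predict() disagrees with the mapped argmax.
n_mismatch = int(np.sum(labels_from_proba != pred))
print(n_mismatch)
```

With a fitted Pipeline like the one in the question, the same attribute should be reachable through the final step, for example `pipeline.named_steps['clf'].classes_`. If the probabilities were read against a hand-written label order rather than `classes_`, the highest value could appear to belong to the wrong class.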
Tags: python machine-learning classification random-forest text-classification