Random forest classifier result from predict_proba() does not match predict()? (Answers)

【Question title】: Random forest classifier result from Predict_proba() does not match with predict()?
【Posted】: 2018-10-18 20:30:09
【Question】:

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier

# ItemSelector is the asker's own column-selecting transformer; its definition is not shown in the post.
# X and df are the asker's data and are not shown either.
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('Comments', Pipeline([
            ('selector', ItemSelector(column="Comments")),
            ('tfidf', TfidfVectorizer(use_idf=False, ngram_range=(1, 2), max_df=0.95, min_df=0, sublinear_tf=True)),
        ])),
        ('Vendor', Pipeline([
            ('selector', ItemSelector(column="Vendor Name")),
            ('tfidf', TfidfVectorizer(use_idf=False)),
        ])),
    ])),
    ('clf', RandomForestClassifier(n_estimators=200, max_features='log2', criterion='entropy', random_state=45)),
    # ('clf', LogisticRegression())
])

X_train, X_test, y_train, y_test = train_test_split(X,
                                df['code Description'],
                                test_size=0.3,
                                train_size=0.7,
                                random_state=100)
model = pipeline.fit(X_train, y_train)
s = pipeline.score(X_test, y_test)
pred = model.predict(X_test)              # predicted class labels
predicted = model.predict_proba(X_test)   # per-class probabilities

For some samples, my predict output matches the predicted probabilities. But in some cases, for example

proba_predict = [0.3,0.18,0.155]

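instead of classifying it as class A, it is classified as class B.

Predicted class: B

Actual class: A

The right column is my labels and the left column is my input text data: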

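【Comments】:

Can you provide some sample data where this happens? With the information provided we cannot reproduce your results, so we cannot help.
@RafaelC No, I am saying there is a mismatch between the results of predict_proba() and predict(). The class with the highest predict_proba() value should be my prediction, but the class with the second-highest value is returned as the prediction instead.
Is it worth double-checking that the order of the classes in model.classes_ is the order you expect?
Yeah, I am checking it from scratch. According to the source code it should only predict the maximum. Thanks @Merlin1896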

Tags: python machine-learning classification random-forest text-classification

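【Solution 1】:

I think you are describing the following situation: for a test vector from X_test, predict_proba() returns a probability distribution y = [p1, p2, p3] with p1 > p2 and p1 > p3, but predict() does not output class 0 for that vector.

If you check the source code of the predict function of sklearn's RandomForestClassifier, you will see that it calls the forest's own predict_proba() method: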

proba = self.predict_proba(X)

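Based on these probabilities, argmax is used to output the class.

So the predict step takes the output of predict_proba as its input. To me, it seems impossible for anything to go wrong there.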
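The relationship described above can be checked directly. The following is a minimal sketch on synthetic data (the dataset, the tree count, and the variable names are illustrative assumptions, not the asker's setup): it rebuilds the prediction by hand from predict_proba() and compares it with predict().

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic 3-class data; it only stands in for the asker's text features.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0,
)

clf = RandomForestClassifier(n_estimators=50, random_state=45).fit(X, y)

proba = clf.predict_proba(X)   # shape (n_samples, n_classes)
pred = clf.predict(X)          # predicted class labels

# Rebuild the prediction by hand: take the column index of the largest
# probability in each row and map it to a label through classes_.
manual = clf.classes_[np.argmax(proba, axis=1)]

print(np.array_equal(pred, manual))
```

If this comparison holds on your own fitted model as well, the mismatch is most likely in how the probability columns are being read, not in the classifier.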

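I would assume that you mixed up some class names in your routine and got confused there. But a more detailed answer is not possible based on the information you provided.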
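One way such a mix-up can happen is reading the probability columns in the wrong order. The sketch below uses made-up toy data and labels (assumptions for illustration only) to show that the columns of predict_proba() follow classes_, which is sorted, and not the order in which labels first appear in the training data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data: in y the labels first appear in the order "B", "C", "A".
X = np.array([[0.0], [0.1], [1.0], [1.1], [2.0], [2.1]])
y = np.array(["B", "B", "C", "C", "A", "A"])

clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

# classes_ holds the sorted unique labels; column i of predict_proba()
# belongs to classes_[i].
print(clf.classes_)

# Pair every probability with its label instead of relying on position.
row = clf.predict_proba([[2.05]])[0]
labelled = dict(zip(clf.classes_, row))
print(labelled)

# The label with the highest probability is what predict() returns.
best_label = max(labelled, key=labelled.get)
print(best_label, clf.predict([[2.05]])[0])
```

For a fitted pipeline like the one in the question, the same attribute should be reachable through the final step, for example model.named_steps['clf'].classes_.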

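【Comments】:

Hi @Merlin1896, I am trying to write a wrapper for the random forest regressor. I tried self.predict_proba = super.predict_proba(X), but this raises an error saying that super has no attribute predict_proba. This is for the random forest regressor class, by the way.
@SamedSivaslıoğlu Please ask a new question with a code example!
Could I point you here? @Merlin1896 stackoverflow.com/questions/53697980/…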