sklearn random forests overwriting each other: answers

【Question title】: sklearn random forests overwriting each other
【Posted】: 2020-06-24 12:13:48
【Question】:

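I am using sklearn for random forest classification. I now want to compare different descriptor sets (one with 125 features, one with 154 features). To do that I create two separate random forests, but they seem to overwrite each other, which leads to the error: 'Number of features of the model must match the input. Model n_features is 125 and input n_features is 154'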

rf_std = RandomForestClassifier(n_estimators = 150, max_depth = 200, max_features = 'sqrt')
rf_nostd = RandomForestClassifier(n_estimators = 150, max_depth = 200, max_features = 'sqrt')

rf_std=rf_std.fit(X_train_std,y_train_std)
print('Testing score std:',rf_std.score(X_test_std,y_test_std))

rf_nostd=rf_nostd.fit(X_train_nostd,y_train_nostd)
print('Testing score nostd:',rf_nostd.score(X_test_nostd,y_test_nostd))
# until here it works

fig, (ax1, ax2) = plt.subplots(1, 2)

disp = plot_confusion_matrix(rf_std, X_test_std, y_test_std,
                                 cmap=plt.cm.Blues,
                                 normalize='true',ax=ax1)
disp = plot_confusion_matrix(rf_nostd, X_test_nostd, y_test_nostd,
                                 cmap=plt.cm.Blues,
                                 normalize='true',ax=ax2)
plt.show()
#here i get the error

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-27-eee9fea5dbfb> in <module>
      3 disp = plot_confusion_matrix(rf_std, X_test_std, y_test_std,
      4                                  cmap=plt.cm.Blues,
----> 5                                  normalize='true',ax=ax1)
      6 disp = plot_confusion_matrix(rf_nostd, X_test_nostd, y_test_nostd,
      7                                  cmap=plt.cm.Blues,

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\metrics\_plot\confusion_matrix.py in plot_confusion_matrix(estimator, X, y_true, labels, sample_weight, normalize, display_labels, include_values, xticks_rotation, values_format, cmap, ax)
    183         raise ValueError("plot_confusion_matrix only supports classifiers")
    184 
--> 185     y_pred = estimator.predict(X)
    186     cm = confusion_matrix(y_true, y_pred, sample_weight=sample_weight,
    187                           labels=labels, normalize=normalize)

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict(self, X)
    610             The predicted classes.
    611         """
--> 612         proba = self.predict_proba(X)
    613 
    614         if self.n_outputs_ == 1:

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in predict_proba(self, X)
    654         check_is_fitted(self)
    655         # Check data
--> 656         X = self._validate_X_predict(X)
    657 
    658         # Assign chunk of trees to jobs

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\ensemble\_forest.py in _validate_X_predict(self, X)
    410         check_is_fitted(self)
    411 
--> 412         return self.estimators_[0]._validate_X_predict(X, check_input=True)
    413 
    414     @property

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\tree\_classes.py in _validate_X_predict(self, X, check_input)
    389                              "match the input. Model n_features is %s and "
    390                              "input n_features is %s "
--> 391                              % (self.n_features_, n_features))
    392 
    393         return X

ValueError: Number of features of the model must match the input. Model n_features is 125 and input n_features is 154

Edit: fitting the second random forest somehow overwrites the first one, as shown here:

rf_std=rf_std.fit(X_train_std,y_train_std)
print(rf_std.n_features_)
rf_nostd=rf_nostd.fit(X_train_nostd,y_train_nostd)
print(rf_std.n_features_)
Output:
154
125

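Why are the two models not kept separate? Can anybody help?

【Comments】:

I'm trying to reproduce your problem. What are your input shapes?
What exactly is your error? Could you edit the post to show it?
X_train_std is an np array (40000,154), y_train_std is a list (40000), X_train_nostd is an np array (40000,125), and y_train_nostd is a list (40000). The std and nostd test sets have shapes (10000,154) and (10000,125), respectively.

Tags: python scikit-learn random-forest

【Solution 1】:

I was able to reproduce this error when the train and test inputs have inconsistent shapes.

Try this: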

assert X_train_std.shape[-1] == X_test_std.shape[-1], "Input shapes don't match."
assert X_train_nostd.shape[-1] == X_test_nostd.shape[-1], "Input shapes don't match."
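
The two assertions above can also be wrapped in a small helper that reports which dataset is inconsistent and how. The following is only a sketch: `check_feature_counts` is a hypothetical name, not part of scikit-learn or NumPy, and the arrays are synthetic stand-ins that borrow only the feature counts (154 and 125) from the question.

```python
import numpy as np


def check_feature_counts(X_train, X_test, name="dataset"):
    """Raise ValueError if train and test arrays differ in feature count.

    Hypothetical helper for illustration; returns the shared feature count.
    """
    n_train = np.asarray(X_train).shape[-1]
    n_test = np.asarray(X_test).shape[-1]
    if n_train != n_test:
        raise ValueError(
            f"{name}: train has {n_train} features but test has {n_test}"
        )
    return n_train


# Synthetic stand-ins for the arrays in the question (row counts are arbitrary).
X_train_std = np.random.rand(40, 154)
X_test_std = np.random.rand(10, 154)
X_test_wrong = np.random.rand(10, 125)  # deliberately mismatched

n_ok = check_feature_counts(X_train_std, X_test_std, name="std")

try:
    check_feature_counts(X_train_std, X_test_wrong, name="std")
    mismatch_message = None
except ValueError as exc:
    mismatch_message = str(exc)

print(n_ok)
print(mismatch_message)
```

Raising `ValueError` instead of using `assert` keeps the check active when Python runs with `-O`, which strips assertions.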

This is how I reproduced your error:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

X_train_std = np.random.rand(400, 154)
X_test_std = np.random.rand(100, 125)

y_train_std = np.random.randint(0, 2, 400).tolist()
y_test_std = np.random.randint(0, 2, 100).tolist()

rf_std = RandomForestClassifier(n_estimators = 150, 
    max_depth = 200, max_features = 'sqrt')

rf_std=rf_std.fit(X_train_std,y_train_std)
print('Testing score std:',rf_std.score(X_test_std,y_test_std))

ValueError: Number of features of the model must match the input. Model n_features is 154 and input n_features is 125
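
For contrast with the reproduction above, here is a minimal sketch showing that two separately constructed forests do not share state when each is fitted on its own data. It assumes a recent scikit-learn in which fitted estimators expose `n_features_in_` (the `n_features_` attribute used in the question belongs to older releases). The data is synthetic and the small `n_estimators` value is chosen only to keep the sketch fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic data; only the feature counts (154 and 125) come from the question.
X_train_std = rng.random((200, 154))
X_train_nostd = rng.random((200, 125))
y_train = rng.integers(0, 2, size=200)

# Two independent estimator objects; hyperparameters here are arbitrary.
rf_std = RandomForestClassifier(n_estimators=10, random_state=0)
rf_nostd = RandomForestClassifier(n_estimators=10, random_state=0)

rf_std.fit(X_train_std, y_train)
rf_nostd.fit(X_train_nostd, y_train)

# Each estimator remembers the feature count it was fitted on.
print(rf_std.n_features_in_, rf_nostd.n_features_in_)
```

If the two printed numbers ever come out the same in real code, the likelier cause is that the same array was passed to both `fit` calls, or that a variable was reassigned between notebook cells, rather than one model overwriting the other.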

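【Comments】:

Thanks for your answer, I did indeed mess up the inputs.

【Solution 2】:

This usually happens when the shapes of your training and test sets do not match. Please check whether the following feature counts match: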

X_train_std.shape[1] == X_test_std.shape[1]  
X_train_nostd.shape[1] == X_test_nostd.shape[1]

If they match, you are fine; otherwise you need to track down where the mismatch was introduced.
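
To track down where a mismatch was introduced, it can help to compare what a fitted model expects with what it is about to receive, just before calling `predict` or `score`. The sketch below assumes a recent scikit-learn where fitted estimators expose `n_features_in_`. `describe_feature_mismatch` is a hypothetical helper, and the data is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def describe_feature_mismatch(model, X, label):
    """Return a message if X lacks the feature count the fitted model
    expects, otherwise None. Hypothetical helper, not a scikit-learn API."""
    expected = model.n_features_in_  # set by fit() in recent scikit-learn
    received = np.asarray(X).shape[1]
    if expected == received:
        return None
    return f"{label}: model expects {expected} features, input has {received}"


rng = np.random.default_rng(0)
X_train = rng.random((100, 125))  # synthetic, 125 features
y_train = rng.integers(0, 2, size=100)
model = RandomForestClassifier(n_estimators=5, random_state=0).fit(X_train, y_train)

ok = describe_feature_mismatch(model, rng.random((20, 125)), "nostd")
bad = describe_feature_mismatch(model, rng.random((20, 154)), "nostd")
print(ok)
print(bad)
```

Running such a check for each model and test set pair shows immediately which pairing is wrong, without waiting for the error to surface deep inside a plotting call.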

Regards,
Jordan

【Comments】: