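【Posted】:2020-09-12 17:34:04
【Question】:
When I change the order of the input columns passed to sklearn's DecisionTreeClassifier, the accuracy appears to change. That should not happen. What am I doing wrong?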
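from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import numpy as np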
iris = load_iris()
X = iris['data']
y = iris['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.90, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,1:], X_train[:,:1])), y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,2:], X_train[:,:2])), y_train)
print(clf.score(X_test, y_test))
clf = DecisionTreeClassifier(random_state=0)
clf.fit(np.hstack((X_train[:,3:], X_train[:,:3])), y_train)
print(clf.score(X_test, y_test))
Running this code produces the following output:
0.9407407407407408
0.22962962962962963
0.34074074074074073
0.3333333333333333
The same question was asked 3 years ago, but it was downvoted because no code was provided: Does feature order impact Decision tree algorithm in sklearn?
Edit
In the code above, I forgot to apply the column reordering to the test data.
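For reference, a minimal sketch of the first experiment with the same rotation applied to both the training and the test columns. The helper `rotate_columns` is a hypothetical name introduced here for illustration; everything else follows the code above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.90, random_state=0
)


def rotate_columns(X, k):
    """Hypothetical helper: move the first k columns to the end."""
    return np.hstack((X[:, k:], X[:, :k]))


scores = {}
for k in range(4):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(rotate_columns(X_train, k), y_train)
    # Score on test data that uses the same column order as the training data.
    scores[k] = clf.score(rotate_columns(X_test, k), y_test)
    print(k, scores[k])
```

With the test columns rotated the same way, the model at least sees each feature in the position it was trained on, so any remaining difference between the scores comes from the fitted trees themselves.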
I found that the results still differ when the reordering is applied to the whole dataset.
First I import the data and convert it to a pandas DataFrame.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
iris = load_iris()
y = iris['target']
iris_features = sorted(iris['feature_names'])  # alphabetical order, matching the printed lists below
iris = pd.DataFrame(iris['data'], columns=iris['feature_names'])
Then I select all the data using the feature names in the order given by iris_features, and I train and evaluate the model.
X = iris[iris_features].values
print(X.shape[1], iris_features)
# 4 ['petal length (cm)', 'petal width (cm)', 'sepal length (cm)', 'sepal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.7062937062937062
Then I select the same columns in a different order and train and evaluate the model again. Why do I still get different results?
X = iris[iris_features[2:]+iris_features[:2]].values
print(X.shape[1], iris_features[2:]+iris_features[:2])
# 4 ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.95, random_state=0)
clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(np.mean(y_test == pred))
# 0.8881118881118881
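One way to see where the two runs diverge is to print the rules each fitted tree learned. The sketch below is only a diagnostic, not an explanation: `fit_and_describe` is a hypothetical helper, and it assumes the feature names are sorted alphabetically so that the column orders match the printed lists above.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
y = data['target']
df = pd.DataFrame(data['data'], columns=data['feature_names'])
# Assumption: alphabetical order, as in the printed column lists above.
names = sorted(data['feature_names'])


def fit_and_describe(columns):
    """Hypothetical helper: fit on the given column order, return accuracy and rules."""
    X = df[columns].values
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.95, random_state=0
    )
    clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return clf.score(X_test, y_test), export_text(clf, feature_names=columns)


results = []
for columns in (names, names[2:] + names[:2]):
    acc, rules = fit_and_describe(columns)
    results.append((acc, rules))
    print(columns, acc)
    print(rules)
```

If the two printouts split on different features or thresholds, the accuracy gap comes from the trees being fitted differently, not from the evaluation step.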
【Comments】:
Tags: python pandas scikit-learn