[Posted]: 2018-01-06 15:01:09
[Question]:
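Please help me understand why gradient boosting is not working here. Does GB use decision tree regression internally? [Confused, please clarify.] I am trying ensemble techniques to get the best score on my current dataset. There also seems to be a problem with recursive feature elimination (RFE): my intuition from the correlation matrix and the RFE from scikit-learn should yield similar feature importances, but they do not. Please help me understand why.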
from IPython.display import clear_output
from io import StringIO
import pandas as pd
import requests
import numpy as np
import matplotlib.pyplot as plt
url='https://raw.githubusercontent.com/saqibmujtaba/Machine-Learning/DataFiles/50_Startups.csv'
s=requests.get(url).text
dataset=pd.read_csv(StringIO(s))
The correlation matrix clearly suggests that R&D Spend matters most for predicting Profit [the label], followed by Marketing Spend?
# pandas.tools.plotting was removed from pandas; scatter_matrix now lives in pandas.plotting
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
plt.show()
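The scatter matrix only gives a visual impression. One way to put numbers on it is the Pearson correlation of each feature with the label. Below is a minimal sketch in plain Python; the values are made-up toy numbers (not the 50_Startups data) and `pearson` is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy values for illustration only (assumed, not taken from the dataset).
label = [10.0, 20.0, 30.0, 40.0, 50.0]
strong_feature = [1.0, 2.1, 2.9, 4.2, 5.0]  # moves together with the label
weak_feature = [3.0, 1.0, 4.0, 1.0, 5.0]    # only loosely related to the label

r_strong = pearson(strong_feature, label)
r_weak = pearson(weak_feature, label)
print(r_strong, r_weak)
```

In recent pandas versions, `dataset.corr(numeric_only=True)['Profit']` should give the same kind of numbers for the real data. Keep in mind that a correlation looks at one feature at a time, while RFE ranks features by their role in a joint model, so the two orderings are not guaranteed to agree.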
# Create Independent Variable
X=dataset.iloc[:,:-1].values
# Dependent Variable
Y=dataset.iloc[:,4].values
Applying label encoding
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])
Clearly, the label encoding works correctly.
Output
[[165349.2 136897.8 471784.1 2L]
[162597.7 151377.59 443898.53 0L]
[153441.51 101145.55 407934.54 1L]
[144372.41 118671.85 383199.62 2L]
[142107.34 91391.77 366168.42 1L]]
Trying one-hot encoding:
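from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# OneHotEncoder(categorical_features=[3]) was removed in scikit-learn 0.22; ColumnTransformer replaces it.
# The encoded State columns come first, the remaining columns are passed through unchanged.
ct = ColumnTransformer([('state', OneHotEncoder(), [3])], remainder='passthrough', sparse_threshold=0)
X = np.asarray(ct.fit_transform(X), dtype=float)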
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
print(X[0:5,:])
Output
[[ 0 0 1 165349 136898 471784]
[ 1 0 0 162598 151378 443899]
[ 0 1 0 153442 101146 407935]
[ 0 0 1 144372 118672 383200]
[ 0 1 0 142107 91392 366168]]
Avoiding the dummy variable trap and feature scaling
X = X[:, 1:]
np.set_printoptions(formatter={'float': '{: 0.0f}'.format})
print(X[0:5,:])
Output
[[ 0 1 165349 136898 471784]
[ 0 0 162598 151378 443899]
[ 1 0 153442 101146 407935]
[ 0 1 144372 118672 383200]
[ 1 0 142107 91392 366168]]
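For reference, what the two encoding steps do to the State column can be sketched in plain Python. The state codes are the ones printed in the label-encoding output above; `one_hot` is a hypothetical helper, not a scikit-learn function.

```python
def one_hot(codes, n_categories):
    """Return one indicator row per code, e.g. code 2 of 3 -> [0, 0, 1]."""
    return [[1 if code == k else 0 for k in range(n_categories)] for code in codes]

# State codes of the first five rows, as printed in the label-encoding output above.
state_codes = [2, 0, 1, 2, 1]

full = one_hot(state_codes, 3)
# Drop the first indicator column: it is implied by the other two
# (both zero means category 0), which avoids the dummy variable trap.
reduced = [row[1:] for row in full]

print(full)
print(reduced)
```

Note that `X = X[:, 1:]` only drops one dummy column. No feature scaling has been applied at this point, so the dummies (0 or 1) and the spend columns (hundreds of thousands) are still on very different scales.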
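First, even though R&D Spend is correctly ranked first, shouldn't Marketing Spend come next? Also, why is the Profit feature part of the selection, when I clearly passed Y as the label when fitting the linear regression? Am I missing something?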
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
# feature extraction
lr = LinearRegression()
# Rank all features, i.e. continue the elimination until the last one
rfe = RFE(estimator=lr, n_features_to_select=1)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
# an array with boolean values to indicate whether an attribute was selected using RFE
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
names = dataset.columns.values
print(names)
print("Features sorted by their rank:")
print(sorted(zip(fit.ranking_.tolist(), names.tolist())))
Output
Num Features: 1
Selected Features: [ True False False False False]
Feature Ranking: [1 2 3 4 5]
['R&D Spend' 'Administration' 'Marketing Spend' 'State' 'Profit']
Features sorted by their rank:
[(1, 'R&D Spend'), (2, 'Administration'), (3, 'Marketing Spend'), (4, 'State'), (5, 'Profit')]
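A possible explanation for `Profit` showing up in this list: the ranking has one entry per column of `X`, but `names` is taken from `dataset.columns`, which still holds the original five columns including the label. The sketch below pairs the printed ranking with names that follow the column order of `X`, assuming that order is the one visible in the printed arrays above (two State dummies first, then the numeric columns). `State_1` and `State_2` are hypothetical labels for the encoded categories 1 and 2.

```python
# Original column order, from the printed dataset.columns above.
original_columns = ['R&D Spend', 'Administration', 'Marketing Spend', 'State', 'Profit']

# Assumed column order of X after one-hot encoding and dropping the first dummy:
# two State indicators first, then the untouched numeric columns.
# 'State_1' and 'State_2' are hypothetical labels, not names from the dataset.
x_columns = ['State_1', 'State_2', 'R&D Spend', 'Administration', 'Marketing Spend']

ranking = [1, 2, 3, 4, 5]  # fit.ranking_ as printed above

aligned = sorted(zip(ranking, x_columns))
print(aligned)
```

Under that assumption, rank 1 would belong to a State dummy rather than to R&D Spend, and Profit never takes part in the selection at all. It only appears because of the mismatched name list.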
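I tried this on the Boston data and it seems to work. Is scaling causing the problem here? Could you help me understand what kind of scaling should be applied, and how I would decide that in future tasks?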
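On the scaling question: with a linear model as the estimator, scikit-learn's RFE ranks features by the size of their coefficients, and a coefficient's size depends on the unit of its feature. The sketch below illustrates this with a single feature and made-up toy numbers (not the dataset); `ols_slope` is a hypothetical helper.

```python
def ols_slope(xs, ys):
    """Least-squares slope of y on a single feature x (model with intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x

# Toy values for illustration only (assumed, not taken from the dataset).
spend_dollars = [1000.0, 2000.0, 3000.0, 4000.0]
profit = [2000.0, 4000.0, 6000.0, 8000.0]

slope_dollars = ols_slope(spend_dollars, profit)

# The same information, expressed in thousands of dollars.
spend_thousands = [x / 1000.0 for x in spend_dollars]
slope_thousands = ols_slope(spend_thousands, profit)

print(slope_dollars, slope_thousands)
```

The feature carries the same information in both cases, yet its coefficient differs by a factor of 1000. This suggests that a 0/1 dummy can end up with a far larger coefficient than a column measured in dollars, so standardizing `X` before running RFE is probably worth trying.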
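from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
rescaledX = sc_X.fit_transform(X)
# Scale Y with its own scaler: sc_X was fitted on the columns of X and cannot transform Y.
sc_Y = StandardScaler()
rescaledY = sc_Y.fit_transform(Y.reshape(-1, 1)).ravel()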
# Using KFold
from sklearn.model_selection import KFold
# random_state only has an effect together with shuffle=True, so it is omitted here
kfold = KFold(n_splits=5)
Choosing the boosting model and cross-validating
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor
model = GradientBoostingRegressor(n_estimators=100, random_state=1)
results = cross_val_score(model, rescaledX, rescaledY, cv=kfold)
print(results)
[-5.28213131 -2.73927962 -7.55241606 -2.5951924 -2.51933385]
I don't understand what `results` contains. I thought it should give the average score of my model. Please correct me.
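For context: `cross_val_score` returns one score per fold rather than a single average, and for a regressor with no `scoring` argument each score is the R² from the estimator's `score` method. R² turns negative when the predictions are worse than always predicting the mean of the test fold. A minimal sketch in plain Python, with toy numbers for the R² part (assumed, not from the dataset) and the fold scores printed above for the average; `r2` is a hypothetical helper.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only (assumed, not taken from the dataset).
y_true = [1.0, 2.0, 3.0]
score_mean_guess = r2(y_true, [2.0, 2.0, 2.0])  # always predict the fold mean
score_bad_guess = r2(y_true, [3.0, 3.0, 3.0])   # constant prediction, off target

# The per-fold scores printed above; their average is the single summary number.
fold_scores = [-5.28213131, -2.73927962, -7.55241606, -2.5951924, -2.51933385]
mean_score = sum(fold_scores) / len(fold_scores)

print(score_mean_guess, score_bad_guess, mean_score)
```

With the real objects, `results.mean()` gives the average. As for why every fold is negative, one plausible cause is that `KFold` does not shuffle by default while the rows look sorted (the first rows printed above are in descending order of R&D Spend). `GradientBoostingRegressor` does build on regression trees, and trees cannot predict targets outside the range seen during training, so each unshuffled test fold may fall outside what the model has learned. Trying `KFold(n_splits=5, shuffle=True, random_state=1)` would be a reasonable first check.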
[Discussion]:
Tags: python machine-learning scikit-learn gradient-descent