How to find the feature names of the coefficients using scikit linear regression? Answers

【Question title】: How to find the feature names of the coefficients using scikit linear regression?
【Posted】: 2016-04-11 12:54:10
【Question】:

from sklearn import linear_model

# training the models
model_1_features = ['sqft_living', 'bathrooms', 'bedrooms', 'lat', 'long']
model_2_features = model_1_features + ['bed_bath_rooms']
model_3_features = model_2_features + ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long']

model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])

model_2 = linear_model.LinearRegression()
model_2.fit(train_data[model_2_features], train_data['price'])

model_3 = linear_model.LinearRegression()
model_3.fit(train_data[model_3_features], train_data['price'])

# extracting the coefficients
print(model_1.coef_)
print(model_2.coef_)
print(model_3.coef_)

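If I change the order of the features, the coefficients are still printed in the same order, so I would like to know the mapping between features and coefficients.

【Comments】:

How would you change the order of the features? I usually use something like zip(coef, featurenames) to print it correctly.
@RobinSpiess For example: model_e_features = ['bedrooms_squared', 'log_sqft_living', 'lat_plus_long'] + model_2_features
This is related to this more general question: stackoverflow.com/questions/40485285/…

Tags: python machine-learning scikit-learn linear-regression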

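【Solution 1】:

The trick is that once you have trained your model, you know the order of the coefficients: it is the order of the feature columns you passed to fit().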

model_1 = linear_model.LinearRegression()
model_1.fit(train_data[model_1_features], train_data['price'])
print(list(zip(model_1.coef_, model_1_features)))

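This prints each coefficient together with its feature. (Tested with a pandas DataFrame.)

If you want to reuse the coefficients later, you can also put them in a dictionary: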

coef_dict = {}
for coef, feat in zip(model_1.coef_,model_1_features):
    coef_dict[feat] = coef

(You can test this yourself by training two models on the same features but, as you said, with the order of the features shuffled.)
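
The check suggested above can be sketched roughly as follows. The synthetic DataFrame and its weights are made up here as a stand-in for train_data, so treat it as an illustration under those assumptions rather than the asker's actual setup:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical stand-in for train_data): price depends on
# three features with known weights and no noise.
rng = np.random.default_rng(0)
train_data = pd.DataFrame(rng.normal(size=(200, 3)),
                          columns=['sqft_living', 'bathrooms', 'bedrooms'])
train_data['price'] = (3.0 * train_data['sqft_living']
                       - 2.0 * train_data['bathrooms']
                       + 0.5 * train_data['bedrooms'])

features_a = ['sqft_living', 'bathrooms', 'bedrooms']
features_b = ['bedrooms', 'sqft_living', 'bathrooms']  # same features, shuffled

model_a = LinearRegression().fit(train_data[features_a], train_data['price'])
model_b = LinearRegression().fit(train_data[features_b], train_data['price'])

# coef_ follows the column order used in fit(), so zipping each model's
# coefficients with its own feature list gives a name -> coefficient mapping.
coef_a = dict(zip(features_a, model_a.coef_))
coef_b = dict(zip(features_b, model_b.coef_))

# Both mappings should agree per feature name, despite the different order.
for name in features_a:
    print(name, round(coef_a[name], 3), round(coef_b[name], 3))
```

The raw coef_ arrays of the two models differ in order, but the per-name values match, which is the point of zipping against the list that was actually passed to fit().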

【Comments】:

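I think it should be print(list(zip(model_1.coef_, model_1_features))), i.e. coef_ rather than coef_[0]. Otherwise zip has nothing to iterate over.
@AlexFedulov Ah yes, thanks. I think the DataFrame I used to test the code in my example may have led sklearn to believe I was providing multiple targets. Because coef_ returns a 2D array when multiple targets are given, I had to use coef_[0]. But normally coef_ should give the correct result.
When I run print(list(zip(model_1.coef_, model_1_features))) in Jupyter, the output is not easy to read (the coefficient array is shown first, with the feature list stacked below it). When I reshape both so that it is the other way around, my printout contains some fluff that also makes it hard to read. For example: dtype='<U11')), (array([-0.47048405]), array([' feature1],
Doesn't work for me. NameError: name 'classifier_features' is not defined
@robin Spiess This is not a good solution (though that is not your fault). If I run 200 models over the course of a project, keeping the input names in a separate dictionary requires me to maintain 400 "things": one model object and one input list per model. If the relevant inputs were bundled into the predictor instead, I would only need to maintain 200 things. In other systems, such as SAS, you only need to supply a file with the same names and types as the original training set. With sklearn, the positions have to be right as well.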
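
On the point about keeping the names bundled with the predictor: newer scikit-learn releases (1.0 and later, to my knowledge) record the column names on the fitted estimator as feature_names_in_ when it is fitted on a DataFrame with string column names. A minimal sketch, using made-up data and falling back gracefully if the attribute is absent:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: two named features with known weights.
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(50, 2)), columns=['lat', 'long'])
y = 1.5 * X['lat'] - 0.5 * X['long']

model = LinearRegression().fit(X, y)

# feature_names_in_ is only set when X carries string column names
# (e.g. a DataFrame); older versions or plain arrays will not have it.
names = getattr(model, 'feature_names_in_', None)
if names is not None:
    coef_by_name = dict(zip(names, model.coef_))
    print(coef_by_name)
else:
    print("feature_names_in_ not available; keep the feature list yourself")
```

With this, the model object alone is enough to recover the mapping, as long as it was fitted on a DataFrame rather than on a bare NumPy array.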

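【Solution 2】:

This is what I use for pretty-printing the coefficients in Jupyter. I'm not sure I understand why order is an issue: as far as I know, the order of the coefficients should match the order of the input data you gave it.

Note that the first line assumes you have a Pandas data frame called df in which you originally stored your data before turning it into a numpy array for the regression: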

fieldList = np.array(list(df)).reshape(-1,1)

coeffs = np.reshape(np.round(clf.coef_,5),(-1,1))
coeffs=np.concatenate((fieldList,coeffs),axis=1)
print(pd.DataFrame(coeffs,columns=['Field','Coeff']))
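
A possible variation on the same idea is to build a pandas Series indexed by column name, which avoids mixing names and numbers in one NumPy array and so keeps the coefficients numeric. This is only a sketch: df here is a made-up frame containing just the feature columns, and clf is fitted on its NumPy values as described above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical feature frame and target; df holds only the feature columns.
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(100, 3)), columns=['a', 'b', 'c'])
target = 2.0 * df['a'] - 1.0 * df['c']

# Fit on plain NumPy arrays, as in the answer above.
clf = LinearRegression().fit(df.to_numpy(), target.to_numpy())

# Index the rounded coefficients by the original column names.
coeffs = pd.Series(np.round(clf.coef_, 5), index=df.columns, name='Coeff')
coeffs.index.name = 'Field'
print(coeffs)
```

Because the values stay floats, the result can be sorted or filtered directly, for example with coeffs.sort_values().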

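【Comments】:

【Solution 3】:

@Robin posted a great answer, but I had to tweak it slightly to get it working the way I wanted. The tweak concerns the dimension of the 'coef_' np.array, which in my case was 2D, so I changed it to model_1.coef_[0,:], as follows: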

coef_dict = {}
for coef, feat in zip(model_1.coef_[0,:],model_1_features):
    coef_dict[feat] = coef

The dictionary was then created as I had pictured it, with {'feature_name' : coefficient_value} pairs.
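
Whether coef_[0,:] is needed seems to depend on the shape of the target passed to fit(). The sketch below, on made-up data, shows the two cases side by side; np.ravel is used here as one way to handle the single-target 2D case and is not something the answers above prescribe:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: two features, one target.
rng = np.random.default_rng(3)
X = rng.normal(size=(80, 2))
y = 4.0 * X[:, 0] - 1.0 * X[:, 1]
features = ['lat', 'long']

model_1d = LinearRegression().fit(X, y)                 # y has shape (80,)
model_2d = LinearRegression().fit(X, y.reshape(-1, 1))  # y has shape (80, 1)

print(model_1d.coef_.shape)  # 1-D: one coefficient per feature
print(model_2d.coef_.shape)  # 2-D: one row per target

# For a single target, flattening makes both cases zip cleanly with the names.
# With several targets, each row of coef_ would need its own mapping instead.
coef_dict = dict(zip(features, np.ravel(model_2d.coef_)))
print(coef_dict)
```

A column-shaped target such as train_data[['price']] (a one-column DataFrame) falls into the second case, which may explain why some people need the extra index and others do not.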

【Comments】:

【Solution 4】:

Borrowing from Robin, but simplifying the syntax:

coef_dict = dict(zip(model_1_features, model_1.coef_))

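An important note about zip: zip does not check that its inputs are of equal length, so it is especially important to confirm that the number of features matches the number of coefficients (which may not be the case in more complicated models). If one input is longer than the other, zip stops at the end of the shorter one and silently drops the extra values of the longer one. Note the missing 7 in the following example: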

In [1]: [i for i in zip([1, 2, 3], [4, 5, 6, 7])]
Out[1]: [(1, 4), (2, 5), (3, 6)]
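
One way to guard against that silent truncation is to compare the lengths before zipping. The helper below is hypothetical (coef_mapping is a name made up for this sketch); on Python 3.10 and later, zip(..., strict=True) offers a built-in alternative that raises on a length mismatch.

```python
def coef_mapping(feature_names, coefs):
    """Pair feature names with coefficients, refusing mismatched lengths."""
    feature_names = list(feature_names)
    coefs = list(coefs)
    if len(feature_names) != len(coefs):
        raise ValueError(
            f"{len(feature_names)} feature names but {len(coefs)} coefficients"
        )
    return dict(zip(feature_names, coefs))


# Usage with illustrative values; with a fitted model you would pass
# model_1_features and model_1.coef_ instead.
coef_dict = coef_mapping(['sqft_living', 'bathrooms'], [0.3, -1.2])
print(coef_dict)
```

Raising an error here is a deliberate choice: a mapping that quietly omits a feature is harder to notice than a failed call.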

【Comments】:

【Solution 5】:

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression

regressor = LinearRegression()
regressor.fit(X_train, y_train)

coef_table = pd.DataFrame(list(X_train.columns)).copy()
coef_table.insert(len(coef_table.columns),"Coefs",regressor.coef_.transpose())

【Comments】:

You can create a data frame with the feature names in one column and the coefficients of those features in another.