Stepwise Regression in Python

【Question title】: Stepwise Regression in Python
【Posted】: 2013-03-04 05:23:25
【Question】:

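How can I perform stepwise regression in Python? SciPy has methods for OLS, but I cannot find a way to do it stepwise. Any help would be much appreciated. Thanks.

Edit: I am trying to build a linear regression model. I have 5 independent variables and, using forward stepwise regression, I want to select variables so that my model has the lowest p-value. The following link explains the objective: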

https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk

Thanks again.

【Comments】:

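scikits.learn has LARS/lasso, if that is of any use: scikit-learn.org/dev/modules/linear_model.html#lars-lasso
Could you elaborate on what criterion you want to use to select predictors? If you would like an example, could you post or link to some sample data?
Building a model on p-values is not recommended. They are more of a sanity check; other criteria (e.g. AIC or BIC) are more appropriate.
The link seems to be broken: We're sorry, the page you've requested could not be located. You can return to the Mihaylo Home Page or report an error to the Webmaster.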

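Tags: python scipy regression

【Answer 1】:

Trevor Smith and I wrote a small forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or, with a bit more work, to select based on the p-values of the beta coefficients.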

【Comments】:

【Answer 2】:

You can try mlxtend, which offers several selection methods.

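from sklearn.linear_model import LinearRegression
from mlxtend.feature_selection import SequentialFeatureSelector as sfs

clf = LinearRegression()

# Build step forward feature selection
# (k_features must not exceed the number of columns in X_train)
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform SFS (plain sequential forward selection, since floating=False)
sfs1 = sfs1.fit(X_train, y_train)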
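
If mlxtend is not available, scikit-learn ships a similar selector. The sketch below uses `sklearn.feature_selection.SequentialFeatureSelector`, which is a different implementation from the mlxtend class above, on synthetic data invented for this example (`X_demo`, `y_demo` are hypothetical names).

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (illustrative only): y depends on columns 0 and 3.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 3] + rng.normal(scale=0.5, size=200)

selector = SequentialFeatureSelector(
    LinearRegression(),
    n_features_to_select=2,   # how many columns to keep
    direction="forward",      # "backward" starts from all columns instead
    scoring="r2",
    cv=5,
)
selector.fit(X_demo, y_demo)

# Indices of the columns that were kept.
chosen = np.flatnonzero(selector.get_support())
print(chosen)
```

Because each candidate is scored by cross-validated R² rather than in-sample R², adding a pure-noise column does not automatically look like an improvement.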

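【Comments】:

【Answer 3】:

You can do forward-backward selection based on a statsmodels.api.OLS model, as shown in this answer.

However, this answer describes why you should not use stepwise selection for econometric models in the first place.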

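【Comments】:

I would like to point out that data partitioning should address the overfitting/data-dredging concerns raised in the article David linked. One of the answers posted there is about data partitioning: stats.stackexchange.com/a/20860/48197 That said, the textbook (Data Mining for Business Analytics, Wiley) discusses approaches to data partitioning. In other words, stepwise selection should be fine as long as you do not take the results of the training model straight into production; you need to run k-fold tests on validation data and end up with a workable list of variables.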
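
One way to read the partitioning advice above is to keep the selection step inside the cross-validation loop, so each fold chooses its variables without seeing the held-out data. The following is a minimal sketch under that assumption, using scikit-learn's `Pipeline` and `cross_val_score` on synthetic data; the names `pipe`, `X_all`, and `y_all` are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only): y depends on columns 0 and 3.
rng = np.random.default_rng(1)
X_all = rng.normal(size=(300, 5))
y_all = 3.0 * X_all[:, 0] - 2.0 * X_all[:, 3] + rng.normal(scale=0.5, size=300)

# Selection and final fit are one estimator, so both are redone per fold.
pipe = Pipeline([
    ("select", SequentialFeatureSelector(
        LinearRegression(), n_features_to_select=2, direction="forward", cv=3)),
    ("ols", LinearRegression()),
])

# Out-of-fold R^2 for the whole select-then-fit procedure.
scores = cross_val_score(pipe, X_all, y_all, cv=5, scoring="r2")
print(scores.mean())
```

The scores describe the procedure as a whole rather than one particular set of variables, which is what protects against an optimistic estimate.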

【Answer 4】:

Statsmodels has additional regression methods: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you implement stepwise regression.

【Comments】:

404 page not found :(

【Answer 5】:

# Import the statsmodels API: the array-based OLS class is exposed by
# statsmodels.api (statsmodels.formula.api provides the formula interface `ols`)
import statsmodels.api as sm

# X_opt holds all the independent-variable columns of matrix X;
# in this case we have 5 independent variables.
# Note: sm.OLS does not add an intercept, so X must already contain
# a constant column if one is wanted.
X_opt = X[:,[0,1,2,3,4]]

# Run OLS on X_opt and store the fitted results in regressor_OLS
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

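The summary method prints the p-value of each variable in the column labelled "P>|t|". Find the variable with the highest p-value. Suppose x3 has the highest value, say 0.956. Remove that column from your array and repeat all the steps.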

X_opt = X[:,[0,1,3,4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

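Repeat this until every column whose p-value is above the significance level (e.g. 0.05) has been removed. In the end, X_opt will contain only the variables whose p-values are below the significance level.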

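【Comments】:

【Answer 6】:

I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS

My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can use them with Pipeline and GridSearchCV.

The essential part of my code is as follows: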

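# Imports needed by this excerpt
import pandas as pd
import statsmodels.api as sm

# The functions below are methods of the selection class (hence `self`);
# self.myBic is not shown in this excerpt.

################### Criteria ###################
def processSubset(self, X,y,feature_index):
    # Fit OLS on the selected columns and record rsq_adj, BIC and rsq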
    regr = sm.OLS(y,X[:,feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}

################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                            if p not in predictors_index]

    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index+[p]
        new_predictors_index.sort()
        results.append(self.processSubset(X,y,new_predictors_index))
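    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the candidate with the highest R-squared (all candidates at this step
    # have the same number of predictors, so rsq_adj would rank them the same way)
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]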
    # Return the best model, along with model's other  information
    return best_model

def forwardK(self,X_est,y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []

    M = min(fK,X_est.shape[1])

    for i in range(1,M+1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
        predictors_index = models_fwd.loc[i,'predictors_index']

    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
    return best_model_fwd, best_predictors

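【Comments】:

While I appreciate your contribution, I cannot resist pointing out that selecting a model on r2 alone (as seems to be done here?) is not a good idea.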
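
To illustrate the concern in the comment above: for nested OLS models, R² cannot decrease when columns are added, so comparing models of different sizes on R² alone always favors the largest one. The sketch below contrasts R² with a BIC computed from the usual Gaussian form n·ln(RSS/n) + k·ln(n), up to an additive constant. The helper `fit_stats` and the data are assumptions for illustration, and this formula is not necessarily what the repository's `myBic` computes.

```python
import numpy as np


def fit_stats(X, y):
    """Return (R-squared, BIC) for OLS of y on X plus an intercept."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = float(np.sum((y - A @ beta) ** 2))
    tss = float(np.sum((y - y.mean()) ** 2))
    k = A.shape[1]  # number of fitted coefficients
    r2 = 1.0 - rss / tss
    bic = n * np.log(rss / n) + k * np.log(n)  # Gaussian BIC up to a constant
    return r2, bic


# Synthetic data (illustrative only): y depends on columns 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=200)

r2_small, bic_small = fit_stats(X[:, [0, 3]], y)  # the two true predictors
r2_full, bic_full = fit_stats(X, y)               # all five columns
print(r2_small, r2_full)    # R^2 of the larger model is never lower
print(bic_small, bic_full)  # lower BIC is better
```

A penalized criterion such as BIC or adjusted R² charges for each extra coefficient, so the largest model does not win by default.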

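【Answer 7】:

Here is a method I just wrote that uses the "mixed selection" described in An Introduction to Statistical Learning. As input, it takes:

lm: a fitted statsmodels OLS model, i.e. sm.OLS(Y, X).fit(), where X is an array of n ones (n is the number of data points) and Y is the response in the training data
curr_preds: a list containing ['const']
potential_preds: a list of all potential predictors. A pandas DataFrame X_mix is also needed; it holds all the data, including 'const' and every column corresponding to a potential predictor
tol: optional. The maximum p-value; 0.05 if not specified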

import statsmodels.api as sm

# Note: y (the response) and X_mix (the DataFrame described above) are read
# from the enclosing scope, as in the original version.
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
  while len(potential_preds) > 0:
    index_best = -1 # index of the best candidate predictor in this pass
    best_r_squared = lm.rsquared_adj # adjusted R-squared of the current model
    # check whether any candidate improves the adjusted R-squared
    for curr, pred in enumerate(potential_preds):
      preds = curr_preds + [pred] # current predictors plus one candidate
      lm_new = sm.OLS(y, X_mix[preds]).fit()
      new_r_sq = lm_new.rsquared_adj
      if new_r_sq > best_r_squared:
        best_r_squared = new_r_sq
        index_best = curr

    if index_best != -1: # a candidate improved adjusted R-squared; move it to curr_preds
      curr_preds.append(potential_preds.pop(index_best))
    else: # no remaining candidate improves adjusted R-squared; exit loop
      break

    # fit a new model with the new predictors and look at the p-values
    pvals = sm.OLS(y, X_mix[curr_preds]).fit().pvalues
    # remove every non-constant predictor whose p-value exceeds the tolerance
    for feat in [f for f in pvals.index if pvals[f] > tol and f != 'const']:
      curr_preds.remove(feat)

    # refit so the next pass compares against the current model, not the initial one
    lm = sm.OLS(y, X_mix[curr_preds]).fit()

  return curr_preds

【Comments】: