Python：仅使用 1 个外生变量执行数百万次简单线性回归的最快方法答案

【问题标题】：Python: Fastest way to perform millions of simple linear regression with 1 exogenous variable onlyPython：仅使用 1 个外生变量执行数百万次简单线性回归的最快方法
【发布时间】：2020-10-16 21:34:47
【问题描述】：

我正在对时间序列数据执行组件明智回归。这基本上是我们将 y 与 x1，仅针对 x₂ 的 y，...，并采用最减少平方残差之和的回归并将其添加为基学习器。这被重复 M 次，最终模型是许多简单线性回归的总和，形式为 y 对 x_i（仅 1 个外生变量），基本上使用线性回归作为基础学习器进行梯度提升.

问题在于，由于我正在对时间序列数据执行滚动窗口回归，因此我必须执行超过一百万个 OLS 的 N × M × T 回归。虽然每个 OLS 都非常快，但在我弱小的笔记本电脑上运行需要几个小时。

目前，我正在使用 statsmodels.OLS.fit() 作为获取每个 y 的参数与 x_i 这样的线性回归的方法。 z_matrix 是数据矩阵，i 表示要对回归进行切片的第 i^th 列。行数约为 100，z_matrix 的大小约为 100 × 500。

    ols_model = sm.OLS(endog=endog, exog=self.z_matrix[:, i][..., None]).fit()
    return ols_model.params, ols_model.ssr, ols_model.fittedvalues[..., None]

我从 2016 年 Fastest way to calculate many regressions in python? 的前一篇文章中读到，使用重复调用 statsmodels 效率不高，我尝试了其中一个答案，其中建议使用 numpy 的 pinv，但不幸的是速度较慢：

    # slower: 40sec vs 30sec for statsmodel for 100 repeated runs of 150 linear regressions
    params = np.linalg.pinv(self.z_matrix[:, [i]]).dot(endog)
    y_hat = self.z_matrix[:, [i]]@params
    ssr = sum((y_hat-endog)**2)
    return params, ssr, y_hat

有没有人有更好的建议来加快线性回归的计算？我只需要估计的参数、残差平方和和预测的 ŷ 值。谢谢！

【问题讨论】：

参考问题是针对常见的 exog X 的情况。这里，X 不同，y 相同。最有可能最快的是基于矢量化相关系数。 scipy.stats.linregress 仅使用普通协方差来计算回归系数。我想这可以向量化。
我认为，对于“减少平方残差和的回归”的排名，在这种情况下，也可以直接从简单的相关性中获得，而不需要计算回归结果。
您好约瑟夫，感谢您的回复。您如何建议在不计算回归的情况下获得最大程度地降低 SSR 的预测变量？仅仅是与 y 最相关的预测变量 Xi 吗？

标签： python numpy regression linear-regression statsmodels

【解决方案1】：

这是一种方法，因为您总是在没有常数的情况下运行回归。这段代码在大约 0.5 秒内运行了大约 90 万个模型。它保留了 sse、每个 900K 回归的预测值和估计参数。

最大的想法是利用一个变量对另一个变量的回归背后的数学运算，即叉积与内积的比率（模型不包含常数）。这可以修改为还包括一个常数，方法是使用移动窗口降均值来估计截距。

import numpy as np
from statsmodels.regression.linear_model import OLS
import datetime

gen = np.random.default_rng(20210514)

# Number of observations
n = 1000
# Number of predictors
m = 1000
# Window size
w = 100
# Simulate data
y = gen.standard_normal((n, 1))
x = gen.standard_normal((n, m))

now = datetime.datetime.now()
# Compute rolling covariance and variance-like terms
# These assume the model is y = x*b + e w/o a constant
c = np.r_[np.zeros((1, m)), np.cumsum(x * y, axis=0)]
v = np.r_[np.zeros((1, m)), np.cumsum(x * x, axis=0)]
c_trimmed = c[w:] - c[:-w]
v_trimmed = v[w:] - v[:-w]
# Parameters are just the ratio
params = c_trimmed / v_trimmed

# Build a selector array to quickly reshape y and the columns of x
step = np.arange(m - w + 1)
sel = np.arange(w)
locs = step[:, None] + sel
# Get the blocked reshape of y. It has n - w + 1 rows with window observations
# and looks like
# [[y[0],y[1],...,y[99]],
#  [y[1],y[2],...,y[100]],
#  ...,
#  [y[900],y[901],...,y[999]],
y_block = y[locs, 0]
# Storage for the predicted values and the sse
y_pred = np.empty((x.shape[1],) + y_block.shape)
sse = np.empty((m - w + 1, n))
# Easiest to loop over columns.
# Could do broadcasting tricks, but noth worth the trouble since number of columns is modest
for i in range(x.shape[0]):
    # Reshape a columns of x like y
    x_block = x[locs, i]
    # Get the parameters and make sure it is 2d with shape (m-w+1, 1)
    # so the broadcasting works
    p = params[:, i][:, None]
    # Get the predicted value
    y_pred[i] = x_block * p
    # And the sse
    sse[:, i] = ((y_block - y_pred[i]) ** 2).sum(1)
print(f"Time: {(datetime.datetime.now() - now).total_seconds()}s")
# Some test code
# Test any single observation
start = 124
assert start <= m - w
column = 342
assert column < x.shape[1]
res = OLS(y[start : start + 100], x[start : start + 100, [column]]).fit()
np.testing.assert_allclose(res.params[0], params[start, column])
np.testing.assert_allclose(res.fittedvalues, y_pred[column, start])
np.testing.assert_allclose(res.ssr, sse[start, column])

【讨论】：