Answers: sklearn linear regression for a 2D array

【Question title】: sklearn linear regression for 2D array
【Posted】: 2020-04-20 15:49:24
【Question description】:

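I have a NumPy 2D array in which the rows are individual time series and the columns correspond to time points. I want to fit a regression line to each row to measure the trend of each time series. I suppose I could do it (inefficiently) with a loop like this: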

array2D = ...
for row in array2D:
    coeffs = sklearn.linear_model.LinearRegression().fit(row[:, None], range(len(row))).coef_
    ...

Is there a way to do this without a loop? And what would the final shape of coeffs be?

【Question discussion】:

Tags: python-3.x scikit-learn numpy-ndarray

【Solution 1】:

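The coefficients that minimize the linear regression error are

$$\hat{\beta} = (X^\top X)^{-1} X^\top y$$

You can use numpy to solve this for all rows at once.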

import numpy as np
from sklearn.linear_model import LinearRegression

def solve(timeseries):

    n_samples = timeseries.shape[1]
    # slope and offset/bias
    n_features = 2
    n_series = timeseries.shape[0]

    # For a single time series, X would be of shape
    # (n_samples, n_features) however in this case
    # it will be (n_samples, n_features, n_series)
    # The bias is added by having features being all 1's
    X = np.ones((n_samples, n_features, n_series))
    X[:, 1, :] = timeseries.T

    y = np.arange(n_samples)

    # A is the matrix to be inverted and will
    # be of shape (n_series, n_features, n_features)
    A = X.T @ X.transpose(2, 0, 1)
    A_inv = np.linalg.inv(A) 

    # Do the other multiplications step by step
    B = A_inv @ X.T
    C = B @ y 

    # Return only the slopes (which is what .coef_ does in sklearn)
    return C[:,1]

array2D = np.random.random((3,10))
coeffs_loop = np.empty(array2D.shape[0])
for i, row in enumerate(array2D):
    # coef_ has shape (1,) for a single feature, so take its only element
    coeffs = LinearRegression().fit(row[:, None], range(len(row))).coef_
    coeffs_loop[i] = coeffs[0]

coeffs_vectorized = solve(array2D)

print(np.allclose(coeffs_loop, coeffs_vectorized))
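
For a single feature plus an intercept, the slope also has a simple closed form (covariance divided by variance), which avoids building and inverting a stack of matrices. The following is a minimal sketch under that assumption, keeping the same orientation as solve (the time index is regressed on each row). The function name slopes_closed_form is hypothetical and not part of the original answer.

```python
import numpy as np

def slopes_closed_form(timeseries):
    """Slope of regressing the time index on each row (same orientation as solve)."""
    timeseries = np.asarray(timeseries, dtype=float)
    t = np.arange(timeseries.shape[1], dtype=float)
    # Center both variables; the intercept then drops out of the slope formula.
    x_c = timeseries - timeseries.mean(axis=1, keepdims=True)
    t_c = t - t.mean()
    # slope_i = sum(x_c_i * t_c) / sum(x_c_i ** 2), computed for every row at once
    return (x_c @ t_c) / np.einsum("ij,ij->i", x_c, x_c)

rng = np.random.default_rng(0)
array2D = rng.random((3, 10))
slopes = slopes_closed_form(array2D)
print(slopes.shape)  # (3,), one slope per time series
```

This also answers the shape question: the result is a 1D array with one entry per row of array2D.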

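【Discussion】:

Hmm... two more questions: 1. Why not use sklearn's LinearRegression instead of rolling your own? 2. Can we somehow eliminate the "for" loop? My array2D has >200k rows.
1) Because sklearn doesn't do what you need it to do. LinearRegression's fit method expects an array of shape (n_samples, n_features). 2) The only for loop in the code above is there to verify that the solve function gets the correct answer.
@MarkLavin Alternatively, LinearRegression().fit(array2D.T, np.arange(array2D.shape[1])).coef_ would also give you roughly the same answer if you had more samples than time series, but it sounds like that isn't the case.
Could you explain why you prefer the form LinearRegression().fit(THE_TIMESERIE, THE_RANGE) for time series rather than LinearRegression().fit(THE_RANGE, THE_TIMESERIE)? I've been thinking about it and I still don't understand why you would use the first version.
@BernardoResolve Because that's what the OP used. Swapping them solves the inverse regression problem.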
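
To make the point about the two orientations concrete, here is a small sketch using np.polyfit on synthetic data (the data and variable names are made up for illustration). The two fits answer different questions, and their slopes are not reciprocals of each other in general: for simple linear regression their product equals the squared correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.arange(20, dtype=float)
series = 0.5 * t + rng.normal(scale=2.0, size=t.size)  # noisy upward trend

# Orientation used by the OP: fit(THE_TIMESERIE, THE_RANGE)
slope_t_on_series = np.polyfit(series, t, 1)[0]
# Swapped orientation: fit(THE_RANGE, THE_TIMESERIE)
slope_series_on_t = np.polyfit(t, series, 1)[0]

r = np.corrcoef(t, series)[0, 1]
# The two slopes multiply to r**2, so they are reciprocals
# only when the correlation is perfect (|r| == 1).
product = slope_t_on_series * slope_series_on_t
print(np.isclose(product, r ** 2))
```

If the goal is "how much does the series change per time step", the swapped orientation (time as X, series as y) is the one that gives the slope in those units.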

【Solution 2】:

For people like me who prefer to have the range as X and the time series data as y.

import numpy as np

def linear_fit(periods, timeseries):
    # Build the design matrix: a column of ones (intercept) next to the periods
    X_mat = np.vstack((np.ones(len(periods)), periods)).T

    # Normal equation (linear regression via matrix multiplication):
    # (X^T X)^-1 X^T, applied to all time series at once
    tmp = np.linalg.inv(X_mat.T.dot(X_mat)).dot(X_mat.T)
    return tmp.dot(timeseries.T)[1]  # row 1 => coef_, row 0 => intercept_

X = np.arange(8) # the periods, or the common range for all time series

y = np.array([ # time series
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 10., 20., 30., 40., 50., 60., 70.]
])

linear_fit(X, y)
# array([12.16666667, 12.16666667, 10.        ])

PS: this approach (linear regression via matrix multiplication) is a goldmine for large datasets.
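
As a possible variation on the approach above, np.linalg.lstsq solves the same least-squares problem without forming an explicit inverse, which is generally more robust numerically, and it accepts a 2D right-hand side so all series are still fitted in one call. This is a sketch rather than the answer's own method, and the name linear_fit_lstsq is hypothetical.

```python
import numpy as np

def linear_fit_lstsq(periods, timeseries):
    """Fit y = intercept + slope * period for every row of timeseries."""
    periods = np.asarray(periods, dtype=float)
    # Design matrix: a column of ones (intercept) and the shared periods
    X_mat = np.column_stack((np.ones_like(periods), periods))
    # One column of the right-hand side per time series
    rhs = np.asarray(timeseries, dtype=float).T
    coef, *_ = np.linalg.lstsq(X_mat, rhs, rcond=None)
    intercepts, slopes = coef  # coef has shape (2, n_series)
    return slopes, intercepts

X = np.arange(8)
y = np.array([
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 0., 0., 0., 0., 73.92, 0., 114.32],
    [0., 10., 20., 30., 40., 50., 60., 70.],
])
slopes, intercepts = linear_fit_lstsq(X, y)
print(slopes)
```

On the sample data from this answer, the slopes should agree with the output of linear_fit shown above.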

【Discussion】: