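How to create a class that manually performs linear regression

【Question title】: How to create a class that manually performs linear regression
【Posted】: 2021-10-30 18:16:55
【Question】:

I'm a student just starting out with linear regression. We have been using the manual regression formula (X.T * X)**-1 * X.T * y, and we were given an example with a simple array: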

from numpy.linalg import inv
import numpy as np
from matplotlib import pyplot as plt

X = np.array([[1, 50], [1, 60], [1, 70], [1, 100]]) # it MUST have this 1... a trivial variable... i don't understand for what
y = np.array([[10], [30], [40], [50]])

w = inv((X.T).dot(X)).dot(X.T).dot(y)
print(f'w_1 = {round(w[0][0], 2)},\nw_2 = {round(w[1][0], 3)}')

X_min = X[:, 1].min()
X_max = X[:, 1].max()

X_support = np.linspace(X_min, X_max, num=100)
Y_model = w[0][0] + w[1][0] * X_support

plt.scatter(x=X[:, 1], y=y, color='g', alpha=0.8)
plt.plot(X_support, Y_model)

plt.show()

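Now I want to do the same thing with multiple variables, using the Boston dataset. I need to create a class that works "the same" as LinearRegression(). It must have a .fit() method and a .predict() method. We got zero explanation of what to do when the array has more than one column... so I'm confused.

This is what I did at first: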

from numpy.linalg import inv
import numpy as np
from matplotlib import pyplot as plt
from sklearn.datasets import load_boston

class CustomLinearReg:
    def __init__(self):
        pass

    def fit(X, y):
        return inv((X.T).dot(X)).dot(X.T).dot(y)

    def predict(X):
        pass

boston_dataset = load_boston()

X = boston_dataset.data
y = boston_dataset.target

reg = CustomLinearReg.fit(X, y)
reg

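But it returns only 1 coefficient, and I'm not sure whether it's correct... I also don't understand where the second one should come from. After that, I figured I needed that "1", the trivial variable, and did this: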

boston_dataset = load_boston()

X = boston_dataset.data.tolist()

for n1, x in enumerate(X):
    for n2, y in enumerate(x):
        X[n1][n2] = [1, y]

X = np.array(X)
y = boston_dataset.target

reg = CustomLinearReg.fit(X, y)
reg

But it raises:

ValueError: shapes (2,13,506) and (506,13,2) not aligned: 506 (dim 2) != 13 (dim 1)

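I tried more things, such as computing each coefficient one at a time in a loop... but failed.

Please help me solve this.

I need a class whose .fit(X, y) returns the pairs of coefficients, and a .predict() method that produces the model after .fit().

【Comments】: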

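Why did you write these lines: for n1, x in enumerate(X): for n2, y in enumerate(x): X[n1][n2] = [1, y]
To make a numeric array like [1, 2, 3, 4, 5] look like [[1, 1], [1, 2], [1, 3], [1, 4], [1, 5]]. To add the trivial variable.
Do you realize that in this line, inv((X.T).dot(X)).dot(X.T).dot(y), when X.shape is (506, 13, 2) you are trying to multiply 3D arrays?
OK, you need to split the dataset into x_train and y_train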
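
To illustrate the shape problem raised in the comments, here is a minimal sketch. `X_raw` is a made-up 5x3 array standing in for the real feature matrix, and the names `X_wrong` and `X_design` are only illustrative. Pairing every value with a 1 produces a 3D array, while prepending a single column of ones keeps the array 2D:

```python
import numpy as np

# Hypothetical stand-in for a feature matrix: 5 samples, 3 features.
X_raw = np.arange(15, dtype=float).reshape(5, 3)

# Pairing every single value with a 1 (as the loop in the question does)
# yields a 3-D array, which the 2-D matrix formula cannot handle.
X_wrong = np.array([[[1.0, v] for v in row] for row in X_raw])
print(X_wrong.shape)   # (5, 3, 2)

# Prepending ONE column of ones keeps the array 2-D: one extra column in total.
ones = np.ones((X_raw.shape[0], 1))
X_design = np.hstack([ones, X_raw])
print(X_design.shape)  # (5, 4)
```

With the 2D version, X.T @ X is an ordinary square matrix (4x4 here), so inv() and the rest of the formula apply unchanged.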

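Tags: python linear-regression

【Solution 1】:

I think you are mixing up a few things. Let's start with your formula: (X.T * X)**-1 * X.T * y. It comes from the method of least squares. Take the equation of a simple straight line through the origin: y(x) = a*x.

If you want the value of a that gives the best fit to your data points, you can indeed use your formula for the regression. X is the vector of your x values, for example X = [1,2,3,4]. Y is the vector of your y values, for example Y = [2,4,6,8]. The result of the equation, call it P, is the parameter vector. In this case the equation gives P = 2, so just a scalar, which is the value of a.

If you have more than one parameter to find, for example for the line y(x) = a*x + b, the vector becomes a matrix. First note that a*x + b is the same as [a, b] * [x, 1]^T. That is where the ones in the X matrix come from.

In this made-up example, X = [[1,1],[2,1],[3,1],[4,1]] and Y stays the same. Your equation now gives P = [2, 0], which means a = 2 and b = 0.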
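
Written out for this one-parameter case, with the example vectors above, the formula reduces to a ratio of two sums:

```latex
P = (X^{T}X)^{-1}X^{T}Y
  = \frac{1\cdot 2 + 2\cdot 4 + 3\cdot 6 + 4\cdot 8}{1^{2} + 2^{2} + 3^{2} + 4^{2}}
  = \frac{60}{30}
  = 2
```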
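
The two-parameter example can be checked numerically. This is a small sketch using the made-up values from this answer; the variable names are only illustrative:

```python
import numpy as np
from numpy.linalg import inv

x = np.array([1.0, 2.0, 3.0, 4.0])
Y = np.array([2.0, 4.0, 6.0, 8.0])

# Column of x values plus a column of ones, matching y = a*x + b.
X = np.column_stack([x, np.ones_like(x)])

# Normal equation: P = (X^T X)^-1 X^T Y
P = inv(X.T @ X) @ X.T @ Y
a, b = P
print(np.round(P, 6))
```

Up to floating-point rounding, P comes out as a = 2 and b = 0. The order of the parameters follows the order of the columns in X.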

Modifying your class gives something like:

from numpy.linalg import inv
import numpy as np


class CustomLinearReg:
    def __init__(self):
        self.a=0
        self.b=0
        
        
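    def fit(self, X, y):
        # Normal equation: R = (X^T X)^-1 X^T y
        R=inv(X.T@X)@X.T@y
        # X has the column of ones first, so R[0] is the intercept and R[1] the slope.
        self.a=R[0]
        self.b=R[1]
    
    def predict(self, X):
        # X must carry the same leading column of ones that was used in fit.
        return X@np.array([self.a,self.b])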


X = np.array([[1, 50], [1, 60], [1, 70], [1, 100]])
y = np.array([10, 30, 40, 50])

clr = CustomLinearReg()
clr.fit(X,y)

# checking on training data. 
print(clr.predict(X))

# predicting on new data
print(clr.predict(np.array([[1,20],[1,35],[1,60]])))

Output:

[18.21428571 25.35714286 32.5        53.92857143]
[-3.21428571  7.5        25.35714286]

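【Comments】:

It would be more logical to pass only your input and output data to the class methods and to build the matrix containing the column of ones inside the class.
Here is the input data that has to be handled: from sklearn.datasets import load_boston; data = load_boston()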
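
A minimal sketch of that suggestion follows. It assumes the column of ones is added inside the class and that any number of feature columns is allowed. The attribute names `intercept_` and `coef_` and the helper `_add_ones` are choices made for this illustration, not part of the original answer. Because `load_boston` has been removed from recent scikit-learn releases, a synthetic array with 13 feature columns stands in for the real data:

```python
import numpy as np
from numpy.linalg import inv


class CustomLinearReg:
    """Least-squares regression via the normal equation (illustrative sketch)."""

    def __init__(self):
        self.intercept_ = None
        self.coef_ = None

    @staticmethod
    def _add_ones(X):
        # Prepend a column of ones so the first weight acts as the intercept.
        X = np.asarray(X, dtype=float)
        if X.ndim == 1:
            X = X.reshape(-1, 1)
        return np.hstack([np.ones((X.shape[0], 1)), X])

    def fit(self, X, y):
        Xd = self._add_ones(X)
        # Normal equation: w = (Xd^T Xd)^-1 Xd^T y
        w = inv(Xd.T @ Xd) @ Xd.T @ np.asarray(y, dtype=float)
        self.intercept_, self.coef_ = w[0], w[1:]
        return self

    def predict(self, X):
        w = np.concatenate([[self.intercept_], self.coef_])
        return self._add_ones(X) @ w


# Synthetic stand-in for a multi-feature dataset: 200 samples, 13 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 13))
true_w = np.arange(1, 14, dtype=float)
y = 5.0 + X @ true_w   # noiseless, so the fit should recover the weights

reg = CustomLinearReg().fit(X, y)
print(reg.intercept_)
print(reg.coef_)
print(reg.predict(X[:3]))
```

The caller passes only the raw features and targets. fit() returns one intercept plus one coefficient per column, which for 13 features means 14 numbers in total rather than pairs of coefficients.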

【Solution 2】:

If the data has one column of x_train values and a second column of y_train values, the code is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from numpy.linalg import inv

class CustomLinearReg:
    def __init__(self):
        pass

    @staticmethod
    def fit(X, y):
        # Normal equation: w = (X^T X)^-1 X^T y
        return inv((X.T).dot(X)).dot(X.T).dot(y)

    @staticmethod
    def predict(w, X):
        # Build an x grid slightly wider than the training range and evaluate the line on it.
        X_min = X[:, 1].min() - X[:, 1].max() / 10
        X_max = X[:, 1].max() + X[:, 1].max() / 10
        X_support = np.linspace(X_min, X_max, num=20)
        prediction = w[0] + w[1] * X_support
        return prediction, X_support

data = pd.read_csv('..//..//Materials//3.10_non_linear.csv', sep=',')

# Prepend the column of ones to the single feature.
X = np.array(data.x_train.apply(lambda x: [1, x]).tolist())
y = data.y_train

w = CustomLinearReg.fit(X, y)
y_model, X_support = CustomLinearReg.predict(w, X)


plt.scatter(x=X[:, 1], y=y, color='g', alpha=0.8)
plt.plot(X_support, y_model)

plt.show()

But mine has multiple columns (features). So....

data:
     x_train     y_train
0   0.138368    0.838812
1   0.157237    0.889313
2   0.188684    1.430040
3   0.685553    1.717309
4   0.874237    2.032588
5   1.182421    1.860341
6   1.251605    1.878928
7   1.270474    2.430015
8   1.402553    2.327856
9   1.427711    2.203649
10  1.471737    2.207708
11  1.534632    1.388039
12  1.553500    1.718544
13  1.842816    2.103264
14  2.018921    2.295177
15  2.289369    1.965152
16  2.641579    0.745949
17  2.685606    1.160798
18  2.798816    0.847264
19  2.823974    0.755585
20  2.912027    1.304593
21  2.924606    1.066442
22  3.270527    0.676944
23  3.314553    0.579166
24  3.528395    0.133375
25  3.597580    0.171235
26  3.761106    0.196110
27  3.773685    -0.072016
28  3.918343    0.118110
29  4.107027    0.466673
30  4.107027    0.315611
31  4.125895    0.214945
32  4.188790    0.050313
33  4.333448    0.106148
34  4.541001    0.057132
35  4.572448    -0.057252
36  4.622764    0.449234
37  4.641632    -0.336120
38  4.937238    -0.038237
39  5.125922    0.095250
40  5.157369    0.313029
41  5.258001    -0.231257
42  5.308317    0.024936
43  5.415238    0.418719
44  5.446685    0.165727
45  5.509580    0.266092
46  5.610212    0.669440
47  5.949843    0.892383
48  5.968712    1.265869
49  5.968712    0.664839

【Comments】:

Does this answer your question?
No )) This answers the question for the case where the data is an array of x_train and y_train. I have an array from: from sklearn.datasets import load_boston; data = load_boston()
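
For a multi-column array like the one described in the last comment, the same formula applies once a single column of ones is placed in front of the features. The sketch below uses a synthetic 506x13 array as a stand-in for the real dataset, which is an assumption. It also compares the result with `np.linalg.lstsq`, which solves the same least-squares problem without forming an explicit inverse:

```python
import numpy as np

# Synthetic stand-in for a multi-feature dataset (assumption: 506 rows, 13 columns).
rng = np.random.default_rng(1)
X = rng.normal(size=(506, 13))
y = 3.0 + X @ np.linspace(-1.0, 1.0, 13) + rng.normal(scale=0.1, size=506)

# One column of ones in front, then one column per feature.
Xd = np.hstack([np.ones((X.shape[0], 1)), X])

# Normal equation, as in the formula from the question.
w_normal = np.linalg.inv(Xd.T @ Xd) @ Xd.T @ y

# Same least-squares problem, solved without an explicit matrix inverse.
w_lstsq, *_ = np.linalg.lstsq(Xd, y, rcond=None)

print(w_normal.shape)                  # (14,)
print(np.allclose(w_normal, w_lstsq))
```

On well-conditioned data the two weight vectors should agree closely. When columns are nearly collinear, the lstsq route is usually the safer choice.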