Train score diminishes after polynomial regression degree increases (answers)

【Question title】: Train score diminishes after polynomial regression degree increases
【Posted】: 2018-05-22 21:41:20
【Question】:

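I am trying to use linear regression to fit a polynomial to a set of points sampled from a sine signal with some added noise, using linear_model.LinearRegression from sklearn.

As expected, the training and validation scores increase as the degree of the polynomial increases. After a degree of around 20, however, things start to get weird: the scores start to go down, and the model returns polynomials that look nothing like the data I trained it on.

Below are some plots where this can be seen, along with the code that generates the regression models and the plots: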

How well things work up to degree=17. Original data vs. predictions:

After that it gets worse:

Validation curve for increasing polynomial degree:

from sklearn.pipeline import make_pipeline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import validation_curve

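def make_data(N, err=0.1, rseed=1):
    # Noisy samples of sin(x) for x drawn uniformly from [0, 10)
    rng = np.random.RandomState(rseed)  # use the rseed argument, not a hard-coded 1
    x = 10 * rng.rand(N)
    X = x[:, None]  # column vector, as scikit-learn expects
    y = np.sin(x) + 0.1 * rng.randn(N)
    if err > 0:
        # second, independent noise term scaled by err
        y += err * rng.randn(N)
    return X, y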

def PolynomialRegression(degree=4):
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression())


X, y = make_data(400)

X_test = np.linspace(0, 10, 500)[:, None]
degrees = np.arange(0, 40)

plt.figure(figsize=(16, 8))
plt.scatter(X.flatten(), y)
for degree in degrees:
    y_test = PolynomialRegression(degree).fit(X, y).predict(X_test)
    plt.plot(X_test, y_test, label='degree={0}'.format(degree))
plt.title('Original data VS predicted values for different degrees')
plt.legend(loc='best');


degree = np.arange(0, 40)
train_score, val_score = validation_curve(PolynomialRegression(), X, y,
                                          param_name='polynomialfeatures__degree',
                                          param_range=degree, cv=7)

plt.figure(figsize=(12, 6))
plt.plot(degree, np.median(train_score, 1), marker='o', 
         color='blue', label='training score')
plt.plot(degree, np.median(val_score, 1), marker='o',
         color='red', label='validation score')
plt.legend(loc='best')
plt.ylim(0, 1)
plt.title('Validation curve, increasing the degree of the polynomial')
plt.xlabel('degree')
plt.ylabel('score');

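I know that the validation score is expected to go down as the complexity of the model increases, but why does the training score go down as well? What am I missing here?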

【Question discussion】:

Tags: python machine-learning scikit-learn linear-regression polynomial-approximations

【Solution 1】:

First, you can fix it by setting the model's normalize flag to True:

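def PolynomialRegression(degree=4):
    # normalize=True was deprecated in scikit-learn 1.0 and removed in 1.2,
    # so this call only works on older releases.
    return make_pipeline(PolynomialFeatures(degree),
                         LinearRegression(normalize=True))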
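On scikit-learn 1.2 and later, where LinearRegression no longer accepts normalize, a roughly equivalent fix is to scale the polynomial features inside the pipeline. The sketch below is an assumption rather than part of the original answer: ScaledPolynomialRegression is a hypothetical helper name, StandardScaler stands in for the old flag, and the data only mimics the question's setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler


def ScaledPolynomialRegression(degree=4):
    # Hypothetical helper (not from the original answer): expand to polynomial
    # features, standardize each feature column, then fit ordinary least squares.
    return make_pipeline(PolynomialFeatures(degree),
                         StandardScaler(),
                         LinearRegression())


# Data in the spirit of the question: noisy sin(x) on [0, 10)
rng = np.random.RandomState(1)
x = 10 * rng.rand(400)
X = x[:, None]
y = np.sin(x) + 0.1 * rng.randn(400)

# Training R^2 for a few degrees; values are printed rather than assumed
for degree in (5, 9, 20, 30):
    model = ScaledPolynomialRegression(degree).fit(X, y)
    print(degree, round(model.score(X, y), 3))
```

Standardizing differs from the old flag only by a constant factor per column, so in exact arithmetic the fitted predictions should match.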

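But why? In linear regression, fit() finds the best-fitting model using the Moore–Penrose inverse, which is a common way of computing the least-squares solution. When you add polynomial powers of your values, the augmented features become very large very quickly if you do not normalize: with x going up to 10, the degree-20 feature alone reaches about 10^20. These large values dominate the cost computed by least squares, so the model ends up fitting the largest values, i.e. the high-order polynomial terms, rather than the data.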

With normalization, the plots look better and behave the way they should.
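The scaling argument above can be checked numerically. The sketch below is an illustration under stated assumptions, not the answer's own code: it rebuilds the polynomial features with NumPy for x in [0, 10) and uses np.linalg.cond as a rough proxy for how hard the least-squares problem is. The helper names polynomial_features, center and standardize are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(1)
x = 10 * rng.rand(400)  # same x range as the question


def polynomial_features(x, degree):
    # Columns x^1 .. x^degree (the constant column is left out,
    # since the intercept is handled separately)
    return np.vander(x, degree + 1, increasing=True)[:, 1:]


def center(A):
    # Subtract each column's mean
    return A - A.mean(axis=0)


def standardize(A):
    # Center each column and divide by its standard deviation
    return center(A) / A.std(axis=0)


for degree in (5, 10, 20, 30):
    A = polynomial_features(x, degree)
    # Condition numbers beyond roughly 1e16 exceed float64 precision,
    # so very large values are only indicative.
    print(degree,
          "largest feature: %.1e" % A.max(),
          "cond raw: %.1e" % np.linalg.cond(center(A)),
          "cond scaled: %.1e" % np.linalg.cond(standardize(A)))
```

Scaling does not remove the collinearity between powers of x, so the scaled problem still gets harder as the degree grows. It only removes the part of the problem caused by the columns' very different magnitudes.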

【Discussion】:

【Solution 2】:

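The training error is expected to go down as well, because the model overfits the training data. The validation error goes down because of the Taylor series expansion of the sine function. So, as you increase the degree of the polynomial, your model improves and fits the sine curve better.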

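In an ideal scenario, if the target were not a function whose expansion goes on to infinite degree, you would see the training error go down (not monotonically, but in general) and the validation error go up beyond a certain degree (high at low degrees, lower at somewhat higher degrees, then increasing after that).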
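The Taylor-series remark above can be illustrated with a small numerical check. This is only a sketch: a least-squares fit is not a Taylor polynomial, so it shows no more than that sin(x) on [0, 10] needs a fairly high degree before a truncated expansion around 0 becomes accurate. The helper names taylor_sin and max_error are hypothetical.

```python
import math

import numpy as np


def taylor_sin(x, degree):
    # Taylor polynomial of sin around 0, keeping odd powers up to `degree`
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for k in range(1, degree + 1, 2):
        sign = -1.0 if (k // 2) % 2 else 1.0
        total += sign * x**k / float(math.factorial(k))
    return total


def max_error(degree):
    # Largest absolute error of the truncated series on a grid over [0, 10]
    grid = np.linspace(0, 10, 1001)
    return np.max(np.abs(taylor_sin(grid, degree) - np.sin(grid)))


for degree in (5, 15, 25, 35):
    print(degree, "%.2e" % max_error(degree))
```

The error is dominated by the right end of the interval, where the terms 10^k / k! keep growing until k is around 10 and only then start to shrink.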

【Discussion】: