【Question title】: Gaussian Processes in scikit-learn: good performance on training data, bad performance on testing data
【Posted】: 2020-02-22 12:34:51
【Question】:

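I wrote a Python script that uses scikit-learn to fit Gaussian processes to some data.

In short: the problem I am facing is that, while the Gaussian processes seem to learn the training dataset very well, their predictions for the testing dataset are off, and it seems to me that there is a normalization problem behind this.

In detail: my training dataset is a set of 1500 time series. Each time series has 50 time components. The mapping learned by the Gaussian processes is between a set of three coordinates x, y, z (which represent the parameters of my model) and one time series. In other words, there is a 1:1 mapping between x, y, z and one time series, and the GPs learn this mapping. The idea is that, by giving the trained GPs new coordinates, they should be able to give me the predicted time series associated with those coordinates.

Here is my code: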

from __future__ import division
import numpy as np
from matplotlib import pyplot as plt

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

coordinates_training = np.loadtxt(...) # read coordinates training x, y, z from file
coordinates_testing = np.loadtxt(...) # read coordinates testing x, y, z from file

# z-score of the coordinates for the training and testing data.
# Note I am using the mean and std of the training dataset ALSO to normalize the testing dataset

mean_coords_training = np.zeros(3)
std_coords_training = np.zeros(3)

for i in range(3):
    mean_coords_training[i] = coordinates_training[:, i].mean()
    std_coords_training[i] = coordinates_training[:, i].std()

    coordinates_training[:, i] = (coordinates_training[:, i] - mean_coords_training[i])/std_coords_training[i]
    coordinates_testing[:, i] = (coordinates_testing[:, i] - mean_coords_training[i])/std_coords_training[i]

time_series_training = np.loadtxt(...)# reading time series of training data from file
number_of_time_components = np.shape(time_series_training)[1] # 100 time components

# z_score of the time series
mean_time_series_training = np.zeros(number_of_time_components)
std_time_series_training = np.zeros(number_of_time_components)
for i in range(number_of_time_components):
    mean_time_series_training[i] = time_series_training[:, i].mean()
    std_time_series_training[i] = time_series_training[:, i].std()
    time_series_training[:, i] = (time_series_training[:, i] - mean_time_series_training[i])/std_time_series_training[i]

time_series_testing = np.loadtxt(...)# reading test data from file
# the number of time components is the same for training and testing dataset

# z-score of testing data, again using mean and std of training data
for i in range(number_of_time_components):
    time_series_testing[:, i] = (time_series_testing[:, i] - mean_time_series_training[i])/std_time_series_training[i]

# GPs        

pred_time_series_training = np.zeros((np.shape(time_series_training)))
pred_time_series_testing = np.zeros((np.shape(time_series_testing)))

# Instantiate a Gaussian Process model
kernel = 1.0 * Matern(nu=1.5)
gp = GaussianProcessRegressor(kernel=kernel)

for i in range(number_of_time_components):
    print("time component", i)

    # Fit to data using Maximum Likelihood Estimation of the parameters
    gp.fit(coordinates_training, time_series_training[:,i])

    # Predict on the training and testing coordinates (ask for the predictive std as well)
    y_pred_train, sigma_train = gp.predict(coordinates_training, return_std=True)
    y_pred_test, sigma_test = gp.predict(coordinates_testing, return_std=True)

    pred_time_series_training[:,i] = y_pred_train*std_time_series_training[i] + mean_time_series_training[i]
    pred_time_series_testing[:,i] = y_pred_test*std_time_series_training[i] + mean_time_series_training[i]


# plot training
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
        ax[i].plot(time_series_training[100*i], color='blue', label='Original training')
        ax[i].plot(pred_time_series_training[100*i], color='black', label='GP predicted - training')

# plot testing
fig, ax = plt.subplots(5, figsize=(10,20))
for i in range(5):
        ax[i].plot(time_series_testing[100*i], color='blue', label='Original testing')
        ax[i].plot(pred_time_series_testing[100*i], color='black', label='GP predicted - testing')

Here is an example of the performance on the training data. Here is an example of the performance on the testing data.

【Comments】:

    Tags: python machine-learning scikit-learn normalization


    【Answer 1】:

    First of all, you should use the sklearn preprocessing tools to process your data.

    from sklearn.preprocessing import StandardScaler
    

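    There are other useful tools for organizing data, but this particular one standardizes it. Second, you should normalize the training set and the test set with the same parameters. The model fits the "geometry" of the data to determine its parameters; if you train the model on one scale and evaluate it on another, it is like using the wrong system of units.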

    scale = StandardScaler()
    training_set = scale.fit_transform(data_train)
    test_set = scale.transform(data_test)
    

    This applies the same transformation to both sets.
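A minimal sketch of the point about shared parameters, assuming synthetic stand-ins for the coordinates (the arrays, shapes, and random values below are illustrative and not taken from the question). It checks that `StandardScaler.transform` applies the training mean and standard deviation to the test set, which is the same operation as the manual z-score loop in the question.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-ins for the (x, y, z) coordinates; shapes are illustrative only.
data_train = rng.normal(loc=[1.0, 10.0, -5.0], scale=[0.5, 2.0, 3.0], size=(200, 3))
data_test = rng.normal(loc=[1.0, 10.0, -5.0], scale=[0.5, 2.0, 3.0], size=(50, 3))

scale = StandardScaler()
training_set = scale.fit_transform(data_train)  # learns mean_ and scale_ from the training data
test_set = scale.transform(data_test)           # reuses the training statistics, no refit

# Manual z-score of the test set using the training mean and (population) std.
manual_test = (data_test - data_train.mean(axis=0)) / data_train.std(axis=0)
print(np.allclose(test_set, manual_test))
```

If the manual loop and the scaler agree like this, the coordinate scaling in the question is already consistent between the two sets, so the source of the test error is probably somewhere else.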

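    Finally, you need to normalize the features, not the target; that is, normalize the X inputs rather than the Y outputs. Normalization helps the model find the answer faster by changing the topology of the objective function during optimization, and the outputs do not affect this.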
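One way to avoid hand-written target scaling is sketched below, under stated assumptions: the data are synthetic, `toy_series` is a hypothetical stand-in for the real coordinate-to-time-series map, and the array sizes are chosen only to keep the example fast. It puts `StandardScaler` and `GaussianProcessRegressor` in a pipeline and uses the regressor's `normalize_y=True` option, so predictions come back in the original units of the targets. A single multi-output fit is used here instead of the per-component loop in the question; this shares kernel hyperparameters across outputs, which is a modeling choice and not necessarily the better one.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)


def toy_series(coords, n_time=5):
    # Hypothetical smooth map from (x, y, z) to a short time series.
    t = np.linspace(0.0, 1.0, n_time)
    return (coords[:, [0]] * np.sin(2 * np.pi * t)
            + coords[:, [1]] * t
            + coords[:, [2]] ** 2
            + 100.0)


def rmse(a, b):
    # Root-mean-square error over all samples and time components.
    return float(np.sqrt(np.mean((a - b) ** 2)))


X_train = rng.uniform(-1.0, 1.0, size=(80, 3))
X_test = rng.uniform(-1.0, 1.0, size=(20, 3))
Y_train, Y_test = toy_series(X_train), toy_series(X_test)

# The scaler is fitted on the training inputs only and reused at predict time.
model = make_pipeline(
    StandardScaler(),
    GaussianProcessRegressor(kernel=1.0 * Matern(nu=1.5), normalize_y=True),
)
model.fit(X_train, Y_train)

pred_train = model.predict(X_train)  # shape (80, 5), original target units
pred_test = model.predict(X_test)    # shape (20, 5), original target units

# Compare both errors in the same (original) units.
rmse_train = rmse(pred_train, Y_train)
rmse_test = rmse(pred_test, Y_test)
print(rmse_train, rmse_test)
```

With the default `alpha=1e-10` and no noise term in the kernel, the regressor nearly interpolates its training points. A near-perfect training fit is therefore expected and says little about test performance; comparing the two errors in the same units is the more informative check.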

    I hope this answers your question.

    【Comments】:
