Multiple regression on time series sensor data: answers

【Question title】: Multiple regression on Time Series sensor data
【Posted】: 2019-10-03 02:22:45
【Question】:

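I am working on a regression problem where I have 12 sensor data columns (independent variables) and 1 output column, all sampled at 48 kHz. In total I have 420 seconds of training data. In the test dataset I have the 12 sensor data columns and need to predict the output.

So far I have tried classical machine learning algorithms that do not take the time dimension into account. I am new to time series and am not sure whether this is really a time series forecasting problem.

I am not sure whether I can treat it as a multivariate time series problem and try an LSTM/RNN. I have been following https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/#comment-442845, but I cannot work out how to predict on the test data.

Do I need to append a new column to turn the test data from (length, 12) into (length, 13), then predict row by row and use each output in the next iteration?

Also, is the above the right way to approach this kind of problem, or should I be considering something else?

Update: I am updating my question in response to the comments below. Suppose my training data looks like the following (the headers are renamed only to make the explanation clearer). I am training the same LSTM network as in the link above. I created Y(t), Y(t-1), x1(t-1), x2(t-1), x3(t-1), x4(t-1), x5(t-1), x6(t-1) using the series_to_supervised function.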

                           Y     x1   x2    x3         x4      x5      x6
date                                                                          
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     
2010-01-02 03:00:00      181.0   -7  -5.0  1022.0      SE     5.36     1     
2010-01-02 04:00:00      138.0   -7  -5.0  1022.0      SE     6.25     2

Now I have test data without the Y column. For example,

                          x1   x2    x3         x4      x5      x6
date                                                                          
2010-01-02 00:00:00      -11  -6.0  1020.0      SE     1.79     0     
2010-01-02 01:00:00      -12  -1.0  1020.0      SE     2.68     0     
2010-01-02 02:00:00      -10  -4.0  1021.0      SE     3.57     0     
2010-01-02 03:00:00      -7   -2.0  1022.0      SE     5.36     1     
2010-01-02 04:00:00      -7   -5.0  1022.0      SE     6.25     2

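What I did: I added a dummy Y column filled with 0 and replaced its first value with the mean of the training Y column. My idea is to use the value predicted at t-1 in the next prediction. I do not know of an easy way to do that, so I came up with the following logic.

Code snippet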

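# test_pd is a pandas DataFrame of shape (N, 6): sensor columns only
# train_pd is a pandas DataFrame of shape (N, 7): Y plus the sensor columns

# Add a placeholder Y as the first column, seeded with the training mean
train_out_mean = train_pd['Y'].mean()
test_pd.insert(0, 'Y', 0.0)
test_pd.iloc[0, 0] = train_out_mean
# Reshape to the 3D [samples, timesteps, features] layout the LSTM expects
test_pd = test_pd.values.reshape((test_pd.shape[0], 1, test_pd.shape[1]))
out_list = list()
out_list.append(train_out_mean)
for i in range(test_pd.shape[0]):

    y = loaded_model.predict(test_pd[i].reshape(1, test_pd.shape[1], test_pd.shape[2]))
    y = y[0][0]  # scalar prediction for this row
    out_list.append(y)
    if i + 1 >= test_pd.shape[0]:
        break
    # Feed the prediction back as Y(t-1) for the next row
    test_pd[i+1][0][0] = y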

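I have two follow-up questions.

Is the above approach theoretically correct for this problem?
If so, is there a better way to predict on the test dataset?

【Comments】:

Tags: python machine-learning keras time-series lstm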

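【Answer 1】:

Before reaching for a more complex algorithm such as an LSTM, I would consider starting with a simpler approach.

On StackOverflow you are expected to ask objective questions about code, so if you share some of your code here we can try to help you.

Suppose you have a time series like this (the example from the link):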

                     pollution  dew  temp   press wnd_dir  wnd_spd  snow  rain
date                                                                          
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     0
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     0
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     0
2010-01-02 03:00:00      181.0   -7  -5.0  1022.0      SE     5.36     1     0
2010-01-02 04:00:00      138.0   -7  -5.0  1022.0      SE     6.25     2     0

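A simpler approach: MLP Regressor

In a simpler approach, say you want to predict pollution, you can build an MLP Regressor. For training, you split the data into 7 features (dew, temp, press, wnd_dir, wnd_spd, snow, rain) used to predict pollution. For example: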

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn import metrics

# `dataset` is the pollution DataFrame shown above
data = dataset.values

# integer-encode the wind direction (column 4)
encoder = LabelEncoder()
data[:,4] = encoder.fit_transform(data[:,4])

# scale every column to [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(data)

# the first column (pollution) is the target, the rest are features
y, X = np.split(scaled,[1],axis=1)
y = y.ravel()  # MLPRegressor expects a 1D target for a single output

mlp = MLPRegressor(learning_rate_init=0.001)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

print (X_train.shape, y_train.shape)
print (X_test.shape, y_test.shape)

mlp.fit(X_train,y_train)

y_prediction = mlp.predict(X_test)

print("R2 score:", metrics.r2_score(y_test, y_prediction))

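Output (the exact score varies between runs because the split and the weight initialization are random):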

R2 score: 0.30376681842945985

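With an LSTM (you need 3D input: [samples, timesteps, features])

Now suppose that some features (wind, pressure, etc.) at times t-1 and t-2 (1 hour and 2 hours earlier) have some influence on time t. You then want to treat the problem as a time series, capturing how (for example) wind speed evolves over a period of time. At that point using an LSTM makes sense.

For that, the series_to_supervised function (from the linked example) will help you create the new features...

The series_to_supervised function has 4 parameters:

data: the sequence of observations, as a list or a 2D NumPy array.
n_in: the number of lag observations used as input (X). Values may be in [1..len(data)].
n_out: the number of observations used as output (y). Values may be in [0..len(data)-1].
dropnan: boolean, whether to drop rows that contain NaN values.

So, suppose this series has just one feature X and a label y: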

                        X   y
2018-01-01 00:00:00     1   2
2018-01-01 01:00:00     2   3
2018-01-01 02:00:00     3   4
2018-01-01 03:00:00     4   5
2018-01-01 04:00:00     5   6
2018-01-01 05:00:00     6   7
2018-01-01 06:00:00     7   8
2018-01-01 07:00:00     8   9
2018-01-01 08:00:00     9   10
2018-01-01 09:00:00     10  11

Calling series_to_supervised(df.values, n_in=2, n_out=1, dropnan=False) gives you something like the following (I renamed the columns to make it easier to follow):

                        X(t-2)   y(t-2)   X(t-1)   y(t-1)   X(t)   y(t)
2018-01-01 00:00:00       NaN     NaN     NaN        NaN     1      2
2018-01-01 01:00:00       NaN     NaN     1.0        2.0     2      3
2018-01-01 02:00:00       1.0     2.0     2.0        3.0     3      4
2018-01-01 03:00:00       2.0     3.0     3.0        4.0     4      5
2018-01-01 04:00:00       3.0     4.0     4.0        5.0     5      6
2018-01-01 05:00:00       4.0     5.0     5.0        6.0     6      7
2018-01-01 06:00:00       5.0     6.0     6.0        7.0     7      8
2018-01-01 07:00:00       6.0     7.0     7.0        8.0     8      9
2018-01-01 08:00:00       7.0     8.0     8.0        9.0     9      10
2018-01-01 09:00:00       8.0     9.0     9.0        10.0    10     11
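
For reference, here is a minimal sketch of what such a function could look like, written with pandas `shift` and `concat`. It follows the parameter list above, but it is an illustrative reimplementation rather than the exact code from the linked blog, and the `var1`, `var2` column names are an assumption of this sketch.

```python
import pandas as pd


def series_to_supervised(data, n_in=1, n_out=1, dropnan=True):
    """Frame a (multivariate) series as lagged inputs plus current/future outputs."""
    df = pd.DataFrame(data)
    n_vars = df.shape[1]
    cols, names = [], []
    # Input sequence: t-n_in, ..., t-1
    for i in range(n_in, 0, -1):
        cols.append(df.shift(i))
        names += [f"var{j + 1}(t-{i})" for j in range(n_vars)]
    # Output sequence: t, t+1, ..., t+n_out-1
    for i in range(n_out):
        cols.append(df.shift(-i))
        suffix = "(t)" if i == 0 else f"(t+{i})"
        names += [f"var{j + 1}{suffix}" for j in range(n_vars)]
    framed = pd.concat(cols, axis=1)
    framed.columns = names
    # Rows at the edges have no complete history and contain NaN
    return framed.dropna() if dropnan else framed


# Toy series from the example: var1 is X, var2 is y
toy = pd.DataFrame({"X": range(1, 11), "y": range(2, 12)})
framed = series_to_supervised(toy.values, n_in=2, n_out=1, dropnan=False)
print(framed.head(3))
```

With dropnan=True the first two rows, which lack a full two-step history, would be dropped, leaving 8 usable samples.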

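So with this approach, to make a forecast we need at least the two previous records, X(t-2, t-1) and y(t-2, t-1), in order to predict the future value y(t).

Why do you need to do this? This is where I start answering your question. For an LSTM you need to turn your 2D data into 3D.

So before using the LSTM you need to reshape the input to 3D [samples, timesteps, features]. Transforming your data with this function is therefore just a preparation step.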
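
As a rough illustration of that reshape, the sketch below turns a 2D matrix of lagged columns into the 3D layout using NumPy only. The matrix is made-up placeholder data, and the column order (oldest time step first, features grouped within each time step) is an assumption that matches the table layout shown earlier.

```python
import numpy as np

n_samples, n_timesteps, n_features = 8, 2, 2

# Hypothetical supervised matrix: each row holds the lagged values
# [X(t-2), y(t-2), X(t-1), y(t-1)], ordered oldest time step first.
flat = np.arange(n_samples * n_timesteps * n_features, dtype=float).reshape(
    n_samples, n_timesteps * n_features
)

# Regroup each row into one sub-row per time step:
# [samples, timesteps, features]
lstm_input = flat.reshape(n_samples, n_timesteps, n_features)

print(lstm_input.shape)  # (8, 2, 2)
print(lstm_input[0])     # first sample: one row per time step
```

The plain reshape only works because the lagged columns are already grouped by time step. If they were grouped by feature instead, the columns would need to be reordered first.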

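To answer your question: you do not simply append one column. You need to transform the data so that you have new features at t-n, ..., t-3, t-2, t-1 with which to predict some feature at t.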
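
When the lagged features include the target itself (such as y(t-1)), the test data has no target column to lag. One possible way to handle that is to predict step by step and feed each prediction back in as the next row's lagged target. The sketch below shows only that loop. `predict_one` and `toy_model` are hypothetical stand-ins for a trained model, not part of Keras or of the linked blog.

```python
import numpy as np


def recursive_forecast(predict_one, sensors_prev, y_start):
    """Predict step by step, feeding each prediction back as y(t-1).

    predict_one  : hypothetical callable mapping a 1D feature row
                   [y(t-1), x1(t-1), ..., xk(t-1)] to a scalar y(t).
    sensors_prev : array of shape (n_steps, k) with the lagged sensor values.
    y_start      : initial guess for y(t-1), e.g. the training mean.
    """
    y_prev = float(y_start)
    predictions = []
    for row in np.asarray(sensors_prev, dtype=float):
        features = np.concatenate(([y_prev], row))
        y_prev = float(predict_one(features))
        predictions.append(y_prev)
    return np.array(predictions)


# Stand-in for a trained model, used only to make the sketch runnable
def toy_model(features):
    return 0.5 * features[0] + features[1:].sum()


sensors = np.array([[1.0, 2.0], [0.0, 1.0], [2.0, 2.0]])
preds = recursive_forecast(toy_model, sensors, y_start=4.0)
print(preds)
```

With a Keras model, `predict_one` would presumably wrap `model.predict` on the row reshaped to [1, timesteps, features]. Because every prediction depends on the previous one, errors can accumulate over long sequences, so it is worth comparing against a model that uses only lagged sensor values.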

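I suggest you first follow the steps of the pollution example on the blog you cited, and only then try to adapt it to your case.

【Comments】:

Thank you for the detailed explanation. After reading your comments I have updated my question.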