【Question title】: LSTM on sequential data, predicting a discrete column
【Posted】: 2019-01-14 13:06:15
【Question】:

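I'm new to ML and have only scratched the surface, so I apologize if my question doesn't make sense.

I have a sequence of continuous measurements of an object (capturing its weight, size, temperature, and so on) and a discrete column that determines a property of the object (a finite range of integers, e.g. 0, 1, 2). This is the column I want to predict.

The data in question really is a sequence, because the value of the property column may vary depending on its surrounding context, and the sequence itself may also have some periodic behavior. In short: the order of the data matters to me.

A small example is shown in the table below

Note that two rows contain identical data but have different values in the Property field. The idea is that the value of the Property field may depend on the preceding rows, so the order of the rows matters.

My question is: what kind of approach, tool, or technique should I use to solve this problem?

I know about classification algorithms, but I suspect they don't apply here, because the data in question is sequential and I don't want to ignore that.

I tried using a Keras LSTM, pretending the Property column is continuous as well. However, the predictions I get this way are usually just a constant decimal value, which is meaningless in this context.

What would be the best approach to this kind of problem?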

【Comments】:

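Why do you need deep learning? Couldn't you just use logistic regression? Also, I don't understand how your continuous variables differ from the discrete one. Do you mean the latter is categorical?
If the property has a fixed range of values, treat it as a categorical label and do multi-label classification. You talk about sequences, but I can't see any sequence in your data. Each record has several features, but each appears only once. Time series are a different matter; drop the RNN and stick with a plain classifier (a deep model or a simpler one, as Josh suggests).
@GPhilo Thanks for your reply. The data is a sequence in the sense that the order of the rows matters. For example, in the table you can see two rows with identical data and different Property values. The idea is that the Property value also depends on the preceding rows.
@JoshFriedlander I'm not sure; I'm trying to understand what the best option is here. Yes, the Property field is categorical: the set of values it can take is finite. My motivation for using deep learning is that I found examples dealing with time series, and the example above is representative of those.
You may want to look into AR or ARIMA, which are tools for modeling time series data.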

Tags: tensorflow keras deep-learning classification lstm

【Answer 1】:

# note: this code targets the TensorFlow 1.x API (tf.contrib was removed in 2.x)
import tensorflow as tf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler

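# list the columns explicitly so Property is column 0, as the X/y split below expects
df = pd.DataFrame({'Temperature': [183, 10.7, 24.3, 10.7],
                   'Weight': [8, 11.2, 14, 11.2],
                   'Size': [3.97, 7.88, 11, 7.88],
                   'Property': [0,1,2,0]},
                  columns=['Property', 'Temperature', 'Weight', 'Size'])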

# print first 5 rows
df.head()

# adjust target(t) to depend on input (t-1)
df.Property = df.Property.shift(-1)

# parameters
time_steps = 1
inputs = 3
outputs = 1

# remove nans as a result of the shifted values
df = df.iloc[:-1,:]

# convert to numpy
df = df.values

Data preprocessing

# rescale every column (inputs and target) to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
df = scaler.fit_transform(df)

# X_y_split
train_X = df[:, 1:]
train_y = df[:, 0]

# reshape input to 3D array
train_X = train_X[:,None,:]

# reshape output to 1D array
train_y = np.reshape(train_y, (-1,outputs))

Model parameters

learning_rate = 0.001
epochs = 500
batch_size = int(train_X.shape[0]/2)
length = train_X.shape[0]
display = 100
neurons = 100

# clear graph (if any) before running
tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, time_steps, inputs])
y = tf.placeholder(tf.float32, [None, outputs])

# LSTM Cell
cell = tf.contrib.rnn.BasicLSTMCell(num_units=neurons, activation=tf.nn.relu)
cell_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# pass into Dense layer
stacked_outputs = tf.reshape(cell_outputs, [-1, neurons])
out = tf.layers.dense(inputs=stacked_outputs, units=outputs)

# squared error loss or cost function for linear regression
loss = tf.losses.mean_squared_error(labels=y, predictions=out)
# optimizer to minimize cost
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

Run in a session

with tf.Session() as sess:
    # initialize all variables
    tf.global_variables_initializer().run()

    # Train the model
    for steps in range(epochs):
        mini_batch = zip(range(0, length, batch_size),
                   range(batch_size, length+1, batch_size))

        # train data in mini-batches
        for (start, end) in mini_batch:
            sess.run(training_op, feed_dict = {X: train_X[start:end,:,:],
                                               y: train_y[start:end,:]})

        # print training performance 
        if (steps+1) % display == 0:
            # evaluate loss function on training set
            loss_fn = loss.eval(feed_dict = {X: train_X, y: train_y})
            print('Step: {}  \tTraining loss (mse): {}'.format((steps+1), loss_fn))

    # Test model
    y_pred = sess.run(out, feed_dict={X: train_X})

    plt.title("LSTM RNN Model", fontsize=12)
    plt.plot(train_y, "b--", markersize=10, label="targets")
    plt.plot(y_pred, "k--", markersize=10, label=" prediction")
    plt.legend()
    plt.xlabel("Period")
    plt.show()

Output:
Step: 100       Training loss (mse): 0.15871836245059967
Step: 200       Training loss (mse): 0.03062588907778263
Step: 300       Training loss (mse): 0.0003023963945452124
Step: 400       Training loss (mse): 1.7712079625198385e-07
Step: 500       Training loss (mse): 8.750407516633363e-12
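
The regression variant predicts Property on the min-max-scaled axis, so its raw outputs are decimals in roughly [0, 1]. A minimal sketch of mapping them back to integer classes, assuming the scaler saw Property values between 0 and 2; the function name `to_discrete` and the sample predictions are hypothetical and not part of the answer's code:

```python
def to_discrete(scaled_preds, prop_min=0, prop_max=2):
    """Invert min-max scaling, then snap each value to the nearest valid class."""
    classes = []
    for p in scaled_preds:
        value = p * (prop_max - prop_min) + prop_min   # undo (x - min) / (max - min)
        nearest = int(round(value))
        classes.append(min(max(nearest, prop_min), prop_max))  # clip to the valid range
    return classes

# Hypothetical network outputs on the scaled axis
example_preds = [0.49, 1.02, -0.03]
print(to_discrete(example_preds))
```

Rounding makes the output discrete, but it still treats the classes as ordered numbers, which is one reason a classification loss is usually a better fit for a categorical target.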

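Assumptions

I assume the target Property is the output one time step after the input sequence.
If that is not the case, the input/output framing of the data can easily be remodeled to fit the use case more closely. I think the general idea here is to show how to solve a multivariate time-series prediction problem with TensorFlow.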
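
The reframing mentioned above can be sketched without any framework. This is an illustration under stated assumptions, not the answer's code: `make_windows` is a hypothetical helper that pairs each label with the `time_steps` rows that precede it, so `time_steps=1` reproduces the "target at t from input at t-1" setup and larger values give the model more context.

```python
def make_windows(features, labels, time_steps):
    """Pair each label with the `time_steps` rows that precede it.

    features: list of rows (each row is a list of numbers)
    labels:   list of class labels, one per row
    Returns (X, y) where X[i] is a window of rows and y[i] is the label
    of the row that immediately follows that window.
    """
    X, y = [], []
    for t in range(time_steps, len(features)):
        X.append(features[t - time_steps:t])  # rows t-time_steps .. t-1
        y.append(labels[t])                   # label at row t
    return X, y

# The four rows used in the answer: Temperature, Weight, Size
features = [[183, 8, 3.97], [10.7, 11.2, 7.88], [24.3, 14, 11], [10.7, 11.2, 7.88]]
labels = [0, 1, 2, 0]

X1, y1 = make_windows(features, labels, time_steps=1)
X2, y2 = make_windows(features, labels, time_steps=2)
print(len(X1), y1)  # 3 samples
print(len(X2), y2)  # 2 samples
```

Each element of `X` has shape (time_steps, number of features), which is the per-sample layout an LSTM input expects once the list is converted to an array.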

Update: classification variant

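The code below models the use case as a classification problem, in which the RNN tries to predict the class membership of a given input sequence.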

Again, I assume that the target at time t depends on the input sequence at t-1.
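
The loss used by the classification code is softmax cross-entropy. A minimal pure-Python sketch of what it computes for a single sample, assuming three classes; the helper names (`softmax`, `cross_entropy`, `one_hot`) and the logits are illustrative and are not taken from TensorFlow:

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes=3):
    """Encode an integer class as a vector with a single 1."""
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def cross_entropy(target, logits):
    """Cross-entropy between a one-hot target and the softmax of the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

logits = [0.2, 3.0, -1.0]                  # hypothetical network output favouring class 1
print(cross_entropy(one_hot(1), logits))   # small loss: prediction agrees with the label
print(cross_entropy(one_hot(2), logits))   # larger loss: prediction disagrees
```

Unlike squared error on the raw integer, this loss has no notion of class 2 being "further" from class 0 than class 1 is, which matches a categorical Property.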

# note: this code targets the TensorFlow 1.x API (tf.contrib was removed in 2.x)
import tensorflow as tf
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

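# list the columns explicitly so Property is column 0, as the X/y split below expects
df = pd.DataFrame({'Temperature': [183, 10.7, 24.3, 10.7],
                   'Weight': [8, 11.2, 14, 11.2],
                   'Size': [3.97, 7.88, 11, 7.88],
                   'Property': [0,1,2,0]},
                  columns=['Property', 'Temperature', 'Weight', 'Size'])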

# print first 5 rows
df.head()

# adjust target(t) to depend on input (t-1)
df.Property = df.Property.shift(-1)

# parameters
time_steps = 1
inputs = 3
outputs = 3

# remove nans as a result of the shifted values
df = df.iloc[:-1,:]

# convert to numpy
df = df.values

Data preprocessing

# X_y_split
train_X = df[:, 1:]
train_y = df[:, 0]

# rescale each feature column to the range [0, 1]
scaler = MinMaxScaler(feature_range=(0, 1))
train_X = scaler.fit_transform(train_X)

# reshape input to 3D array
train_X = train_X[:,None,:]

# one-hot encode the outputs
onehot_encoder = OneHotEncoder()
encode_categorical = train_y.reshape(len(train_y), 1)
train_y = onehot_encoder.fit_transform(encode_categorical).toarray()

Model parameters

learning_rate = 0.001
epochs = 500
batch_size = int(train_X.shape[0]/2)
length = train_X.shape[0]
display = 100
neurons = 100

# clear graph (if any) before running
tf.reset_default_graph()

X = tf.placeholder(tf.float32, [None, time_steps, inputs])
y = tf.placeholder(tf.float32, [None, outputs])

# LSTM Cell
cell = tf.contrib.rnn.BasicLSTMCell(num_units=neurons, activation=tf.nn.relu)
cell_outputs, states = tf.nn.dynamic_rnn(cell, X, dtype=tf.float32)

# pass into Dense layer
stacked_outputs = tf.reshape(cell_outputs, [-1, neurons])
out = tf.layers.dense(inputs=stacked_outputs, units=outputs)

# softmax cross-entropy loss for multi-class classification
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits_v2(
        labels=y, logits=out))

# optimizer to minimize cost
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate)
training_op = optimizer.minimize(loss)

Define classification evaluation metrics

accuracy = tf.metrics.accuracy(labels =  tf.argmax(y, 1),
                          predictions = tf.argmax(out, 1),
                          name = "accuracy")
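# note: tf.metrics.precision and tf.metrics.recall cast their inputs to bool,
# so class 0 counts as negative and classes 1 and 2 as positive
precision = tf.metrics.precision(labels=tf.argmax(y, 1),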
                                 predictions=tf.argmax(out, 1),
                                 name="precision")
recall = tf.metrics.recall(labels=tf.argmax(y, 1),
                           predictions=tf.argmax(out, 1),
                           name="recall")
# F1 is the harmonic mean of precision and recall
f1 = 2 * precision[1] * recall[1] / ( precision[1] + recall[1] )

Run in a session

with tf.Session() as sess:
    # initialize all variables
    tf.global_variables_initializer().run()
    tf.local_variables_initializer().run()

    # Train the model
    for steps in range(epochs):
        mini_batch = zip(range(0, length, batch_size),
                   range(batch_size, length+1, batch_size))

        # train data in mini-batches
        for (start, end) in mini_batch:
            sess.run(training_op, feed_dict = {X: train_X[start:end,:,:],
                                               y: train_y[start:end,:]})

        # print training performance 
        if (steps+1) % display == 0:
            # evaluate loss function on training set
            loss_fn = loss.eval(feed_dict = {X: train_X, y: train_y})
            print('Step: {}  \tTraining loss: {}'.format((steps+1), loss_fn))

    # evaluate model accuracy (new names keep the metric tensors from being overwritten)
    acc, prec, rec, f1_score = sess.run([accuracy, precision, recall, f1],
                                        feed_dict = {X: train_X, y: train_y})

    print('\nEvaluation  on training set')
    print('Accuracy:', acc[1])
    print('Precision:', prec[1])
    print('Recall:', rec[1])
    print('F1 score:', f1_score)

Output:

Step: 100       Training loss: 0.5373622179031372
Step: 200       Training loss: 0.33380019664764404
Step: 300       Training loss: 0.176949605345726
Step: 400       Training loss: 0.0781424418091774
Step: 500       Training loss: 0.0373661033809185

Evaluation  on training set
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 score: 1.0
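
The reported scores can be cross-checked by hand. A small sketch, assuming the model predicted all three shifted targets correctly (which the accuracy of 1.0 implies); `per_class_scores` is a hypothetical helper that computes one-vs-rest precision, recall and F1 for each class:

```python
def per_class_scores(y_true, y_pred, num_classes=3):
    """Return a (precision, recall, f1) tuple per class, one-vs-rest."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores.append((precision, recall, f1))
    return scores

y_true = [1, 2, 0]   # shifted Property targets from the answer
y_pred = [1, 2, 0]   # assumed perfect predictions on the training set
scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores) / len(scores)
print(scores, macro_f1)
```

Because these numbers come from the same three rows the model was trained on, they say nothing about how it would behave on unseen data.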

【Comments】:

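Ekaba, thank you for writing this up. A few questions as I study your code. 1) Does the (0, 1) MinMax scaler scale each column independently, or does it use the same scaling for all columns? 2) Is there a way to tell TF that the Property column is an integer field and that the loss function should be the failure rate of predicting the correct class? 3) Finally, how can I see the predicted Property for a particular row? Sorry for the flood of questions; I'm slowly trying to learn ML.
(1) The MinMax scaler "scales and translates each feature individually such that it is in the given range on the training set, i.e. between zero and one". (2) Yes. We would treat it as a classification problem using the softmax cross-entropy loss function. For that, the Property column is converted into a one-hot encoded array. (3) That's straightforward: we simply print the variable y_pred. With a larger dataset, we can easily write a lookup mechanism. No worries, I'm glad to help. If you'd like, I can update the code based on this discussion. Just let me know. Cheers.
Thank you. I appreciate you taking the time to reply in such depth. I'd be grateful if you could modify the code to reflect the discussion, since I would surely mess something up at this point. Also a follow-up question: does your approach take into account that the Property at step i also depends in some way on the preceding rows, or do I have to encode that into the rows myself (by adding the previous values)? Also, when testing accuracy, does the model make sure it is not told about row i+1 when it is queried for the Property of row i?
I have updated the code for a classification setting, where Property denotes class membership. This code assumes that output t depends on input t-1. By changing how the dataset is framed, the output can be made to depend on earlier time_steps such as t-2, and so on.
Note: I also updated the regression version to fix some observed errors, and to reflect the assumption that output t is predicted from the input sequence at t-1.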
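
Two of the points discussed above, per-column scaling and reading off the predicted Property for a given row, can be illustrated in plain Python. This is a sketch under assumptions: `min_max_scale_columns` and `argmax` are hypothetical helpers, and `row_logits` stands in for whatever the network's output layer would return.

```python
def min_max_scale_columns(rows):
    """Scale every column independently to [0, 1] (feature-wise min-max scaling)."""
    scaled_cols = []
    for col in zip(*rows):                 # iterate over columns
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0            # avoid dividing by zero for constant columns
        scaled_cols.append([(v - lo) / span for v in col])
    return [list(r) for r in zip(*scaled_cols)]

def argmax(values):
    """Index of the largest value, i.e. the predicted class."""
    return max(range(len(values)), key=lambda i: values[i])

# Temperature, Weight, Size for the three training rows
rows = [[183, 8, 3.97], [10.7, 11.2, 7.88], [24.3, 14, 11]]
scaled = min_max_scale_columns(rows)

# Hypothetical per-row class scores, one list of three logits per input row
row_logits = [[0.1, 2.3, -0.4], [-1.0, 0.2, 1.9], [3.1, 0.0, -0.5]]
predicted = [argmax(scores) for scores in row_logits]
for i, cls in enumerate(predicted):
    print("row", i, "-> predicted Property", cls)
```

In the TensorFlow 1.x session from the answer, the equivalent lookup would be something like `sess.run(tf.argmax(out, 1), feed_dict={X: train_X})`, which returns one class index per row.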