【Posted】: 2020-05-10 21:41:30
【Question】:
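I can currently train an LSTM network with a single csv file, following this tutorial: https://machinelearningmastery.com/how-to-develop-lstm-models-for-time-series-forecasting/
The code below generates sliding windows in which the features of the last n_steps time steps are stored to predict the actual target (similar to: Keras LSTM - feed sequence data with Tensorflow dataset API from the generator):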
#%% Import
import pandas as pd
import tensorflow as tf
from numpy import array  # needed by split_sequences below
from tensorflow.keras.models import Sequential, model_from_json
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense
# for path
import pathlib
import os
#%% Define functions
# Function to split multivariate input data into samples according to the number of timesteps (n_steps) used for the prediction ("sliding window")
def split_sequences(sequences, n_steps):
    X, y = list(), list()
    for i in range(len(sequences)):
        # find end of this pattern
        end_ix = i + n_steps
        # check if beyond maximum index of input data
        if end_ix > len(sequences):
            break
        # gather input and output parts of the data in corresponding format (depending on n_steps)
        seq_x, seq_y = sequences[i:end_ix, :-1], sequences[end_ix-1, -1]
        X.append(seq_x)
        y.append(seq_y)
        # Append: Adds its argument as a single element to the end of a list. The length of the list increases by one.
    return array(X), array(y)
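# Set source files
# NOTE: dir_of_file is not defined in the original snippet; assumed here to be the folder containing this script
dir_of_file = os.path.dirname(os.path.abspath(__file__))
csv_train_path = os.path.join(dir_of_file, 'SimulationData', 'SimulationTrainData', 'SimulationTrainData001.csv')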
# Load data
df_train = pd.read_csv(csv_train_path, header=0, parse_dates=[0], index_col=0)
#%% Select features and target
features_targets_considered = ['Fz1', 'Fz2', 'Fz3', 'Fz4', 'Fz5', 'Fz_res']
n_features = len(features_targets_considered)-1 # subtract the target
features_targets_train = df_train[features_targets_considered]
# "Convert" to array
train_values = features_targets_train.values
# Set number of previous timesteps, which are considered to predict
n_steps = 100
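# Convert into input (n_steps x n_features, here 100x5) and output (1) values
X, y = split_sequences(train_values, n_steps)
# NOTE: test_values is not defined in this snippet; it is assumed to be built from a test csv file in the same way as train_values
X_test, y_test = split_sequences(test_values, n_steps)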
#%% Define model
model = Sequential()
model.add(LSTM(200, activation='relu', return_sequences=True, input_shape=(n_steps, n_features)))
model.add(LSTM(200, activation='relu', return_sequences=True))
model.add(LSTM(200, activation='relu'))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mse', metrics=['mae'])
#%% Fit model
history = model.fit(X, y, epochs=200, verbose=1)
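To make the window layout concrete, here is a small sketch of what split_sequences returns. The toy array (6 time steps, 2 feature columns, 1 target column) and n_steps=3 are made up for illustration; the function body is the same logic as above, written with np.array.

```python
import numpy as np

def split_sequences(sequences, n_steps):
    """Same sliding-window logic as in the question."""
    X, y = [], []
    for i in range(len(sequences)):
        end_ix = i + n_steps
        if end_ix > len(sequences):
            break
        X.append(sequences[i:end_ix, :-1])    # features of n_steps rows
        y.append(sequences[end_ix - 1, -1])   # target of the last row in the window
    return np.array(X), np.array(y)

# Toy data (made up): 6 time steps, 2 feature columns, last column is the target
toy = np.array([[10, 15, 25],
                [20, 25, 45],
                [30, 35, 65],
                [40, 45, 85],
                [50, 55, 105],
                [60, 65, 125]])

X_toy, y_toy = split_sequences(toy, n_steps=3)
print(X_toy.shape)  # (4, 3, 2): 6 - 3 + 1 windows, 3 steps, 2 features
print(y_toy)        # [ 65  85 105 125]
```

Every window overlaps its neighbour by n_steps - 1 rows, so the materialised X is roughly n_steps times larger than the raw data.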
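I would now like to extend this example so that the network is trained efficiently with many csv files. In the data folder I have the files "SimulationTrainData001.csv", "SimulationTrainData002.csv", ..., "SimulationTrainData300.csv" (about 14 GB). To achieve this, I tried to adapt the code from this input pipeline example: https://www.tensorflow.org/guide/data#consuming_sets_of_files, which works to a certain extent. With this change I can list the training files in the folder: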
# Set source folders
csv_train_path = os.path.join(dir_of_file, 'SimulationData', 'SimulationTrainData')
csv_train_path = pathlib.Path(csv_train_path)
#%% Show five example files from training folder
list_ds = tf.data.Dataset.list_files(str(csv_train_path/'*'))
for f in list_ds.take(5):
    print(f.numpy())
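One problem is that the files in that example are flower pictures rather than time series values, and I don't know at which point I can use the split_sequences(sequences, n_steps) function to create the sliding windows that provide the data format needed to train the LSTM network.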
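One possible place for the windowing step is on the per-file dataset of rows, before files are combined. The following is only a sketch under assumptions: the helper name window_rows and the toy array are made up, and it uses Dataset.window with flat_map rather than split_sequences itself, while aiming for the same (window, target) layout.

```python
import numpy as np
import tensorflow as tf

def window_rows(rows, n_steps):
    """Map a dataset of rows [f1, ..., fk, target] to (window, target) pairs.

    Hypothetical helper that mirrors split_sequences: each window holds the
    features of n_steps consecutive rows, and the target comes from the last
    row of the window.
    """
    windows = rows.window(n_steps, shift=1, drop_remainder=True)
    # Each window is itself a small dataset; batch it into one (n_steps, k+1) tensor
    windows = windows.flat_map(lambda w: w.batch(n_steps, drop_remainder=True))
    return windows.map(lambda w: (w[:, :-1], w[-1, -1]))

# Toy data (made up): 6 time steps, 2 feature columns, last column is the target
toy = np.arange(18, dtype=np.float32).reshape(6, 3)
rows = tf.data.Dataset.from_tensor_slices(toy)

pairs = list(window_rows(rows, n_steps=3).as_numpy_iterator())
print(len(pairs))         # 4 windows, the same count split_sequences would give
print(pairs[0][0].shape)  # (3, 2)
print(pairs[0][1])        # 8.0, the target in the third row
```

Because the windows are produced lazily, nothing larger than one window per element has to be held in memory, and changing n_steps only means rebuilding the dataset object.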
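Also, as far as I know, training benefits when the windows generated from different files are shuffled. I could apply split_sequences(sequences, n_steps) to every csv file (generating X_test, y_test), join the results into one large variable or file, and shuffle the windows, but I don't think that is an efficient approach, and it would have to be redone whenever n_steps changes.
I would be very grateful if someone could suggest an (established) method or example for preprocessing my data.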
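For illustration, one way such a pipeline could be wired together is sketched below. It is not an established recipe from the tutorial. It assumes TensorFlow 2.4 or later (for output_signature and tf.data.AUTOTUNE). The names windows_from_csv and make_dataset, the demo files, the reduced column list, and all sizes are made up. Each file is windowed lazily by a Python generator, several files are interleaved, and a shuffle buffer mixes windows across files.

```python
import os
import tempfile

import numpy as np
import pandas as pd
import tensorflow as tf

N_STEPS = 3                          # demo value; the question uses 100
COLUMNS = ['Fz1', 'Fz2', 'Fz_res']   # demo subset; the last column is the target
N_FEATURES = len(COLUMNS) - 1

def windows_from_csv(path):
    """Hypothetical generator: yield (window, target) pairs for one csv file."""
    # from_generator passes string arguments as bytes, hence decode()
    df = pd.read_csv(path.decode(), header=0, index_col=0)
    values = df[COLUMNS].values.astype(np.float32)
    for end_ix in range(N_STEPS, len(values) + 1):
        yield values[end_ix - N_STEPS:end_ix, :-1], values[end_ix - 1, -1]

def make_dataset(file_pattern, batch_size, shuffle_buffer):
    """Hypothetical helper: interleave per-file windows, then shuffle and batch."""
    signature = (tf.TensorSpec(shape=(N_STEPS, N_FEATURES), dtype=tf.float32),
                 tf.TensorSpec(shape=(), dtype=tf.float32))
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = files.interleave(
        lambda path: tf.data.Dataset.from_generator(
            windows_from_csv, args=(path,), output_signature=signature),
        cycle_length=4)              # number of files read in turn
    ds = ds.shuffle(shuffle_buffer)  # mixes windows only within the buffer
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)

# Demo files (made up): 3 small csv files with 5 rows each
tmp_dir = tempfile.mkdtemp()
for k in range(3):
    demo = pd.DataFrame(np.random.rand(5, 3), columns=COLUMNS)
    demo.index.name = 'time'
    demo.to_csv(os.path.join(tmp_dir, f'demo{k:03d}.csv'))

ds = make_dataset(os.path.join(tmp_dir, '*.csv'), batch_size=4, shuffle_buffer=100)
n_windows = sum(int(x.shape[0]) for x, _ in ds)
print(n_windows)  # 3 files * (5 - 3 + 1) windows = 9
```

A dataset built this way could be passed directly to model.fit(ds, epochs=...). The shuffle is only approximate: windows are mixed across the files currently being interleaved and within the buffer, so cycle_length and shuffle_buffer trade memory against randomness. A Python generator is also likely slower than a pure tf.data csv reader.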
【Comments】:
Tags: python csv tensorflow lstm