This question may be of some help... although the answers mostly relate to TensorFlow V1.x.
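A CSV dataset may not be necessary for this task. The data size you indicated will probably fit in memory, and a tf.data.Dataset may wrap your data in more complexity than it adds in useful functionality. So long as ALL your data is integers, you can do this without datasets (shown below).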
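If you persist with the CSV dataset approach, understand that CSVs are used in many ways and that there are different approaches to loading them (e.g. see here and here). Because CSVs can have a variety of column types (numerical, boolean, text, categorical, etc.), the first step is usually to load the CSV data in a column-oriented format. This gives access to the columns via their labels, which is useful for preprocessing. However, you probably want to feed rows of data to your model, so the translation from columns to rows may be one source of confusion. At some point you will probably need to convert your integer data to float, but this may happen as a side effect of certain preprocessing.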
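To make the column-versus-row distinction concrete, here is a minimal sketch using only the Python standard library. It is not what tf.data does internally; the CSV text, the column names (`label`, `f1`, `f2`) and the helper functions `load_columns` and `columns_to_rows` are all made up for illustration. It loads the data column by column, reads one column by its label, then rebuilds rows of floats of the kind a model would consume.

```python
import csv
import io

# Hypothetical CSV content; column names and values are invented for this sketch.
CSV_TEXT = """label,f1,f2
1,10,20
0,30,40
2,50,60
"""

def load_columns(text):
    """Read CSV text into a dict mapping each header label to a list of ints."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {name: [] for name in reader.fieldnames}
    for record in reader:
        for name, value in record.items():
            columns[name].append(int(value))
    return columns

def columns_to_rows(columns, feature_names):
    """Turn the selected columns into a list of rows, converting ints to floats."""
    selected = [columns[name] for name in feature_names]
    return [[float(v) for v in row] for row in zip(*selected)]

columns = load_columns(CSV_TEXT)
labels = columns["label"]                      # column access by label
rows = columns_to_rows(columns, ["f1", "f2"])  # one list per training example

print(labels)
print(rows)
```

The `zip(*selected)` call is the column-to-row step: it takes one value from each column at a time, so each resulting row holds the features of a single example.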
So long as your CSVs contain only integers, have no missing data, and include a header row, you can do this without a tf.data.Dataset, step by step as follows:
import numpy as np
from numpy import genfromtxt
import tensorflow as tf
# skip_header=1 drops the header row so it is not parsed as a row of NaN
train_data = genfromtxt('train set.csv', delimiter=',', skip_header=1)
test_data = genfromtxt('test set.csv', delimiter=',', skip_header=1)
train_labels = train_data[:,[0]]
test_labels = test_data[:,[0]]
train_labels = tf.keras.utils.to_categorical(train_labels)
# count the classes used in the training set; one-hot encode the test set on the same
# basis, even if the test set only uses a subset of the classes learned in training
K = train_labels.shape[1]
test_labels = tf.keras.utils.to_categorical(test_labels, K)
train_data = np.delete(train_data, (0), axis=1) # delete label column
test_data = np.delete(test_data, (0), axis=1) # delete label column
# Data will have been read in as float... but you may want scaling/normalization...
scale = lambda x: x/1000.0 - 500.0 # change to suit
train_data = scale(train_data) # scale() returns a new array, so keep the result
test_data = scale(test_data)
N_train = len(train_data[0]) # columns in training set
N_test = len(test_data[0]) # columns in test set
if N_train != N_test:
    print("Datasets have incompatible column counts: %d vs %d" % (N_train, N_test))
    raise SystemExit(1)
M_train = len(train_data) # rows in training set
M_test = len(test_data) # rows in test set
print("Training data size: %d rows x %d columns" % (M_train, N_train))
print("Test set data size: %d rows x %d columns" % (M_test, N_test))
print("Training to predict %d classes" % (K))
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(H, activation='relu', input_dim=N_train)) # H (hidden units) not yet defined...
...
model.compile(...)
model.fit( train_data, train_labels, ... ) # see docs for shuffle, batch, etc
model.evaluate( test_data, test_labels )
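The reason for passing K when encoding the test labels can be seen in a small NumPy-only sketch. The `one_hot` helper below is a hypothetical stand-in written for illustration, not the Keras implementation; it assumes, as `to_categorical` is documented to do, that the class count is inferred as the largest label plus one when none is given. The label values are invented.

```python
import numpy as np

def one_hot(labels, num_classes=None):
    """One-hot encode integer class labels (illustrative stand-in).

    If num_classes is None it is inferred as max(label) + 1, which is why
    a test set that lacks the highest class would get fewer columns.
    """
    labels = np.asarray(labels, dtype=np.int64).reshape(-1)
    if num_classes is None:
        num_classes = int(labels.max()) + 1
    encoded = np.zeros((labels.size, num_classes), dtype=np.float32)
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded

# Hypothetical label columns, shaped (rows, 1) like train_data[:, [0]]
train_labels = np.array([[0.0], [2.0], [1.0], [2.0]])
test_labels = np.array([[0.0], [1.0]])  # class 2 never appears here

train_onehot = one_hot(train_labels)
K = train_onehot.shape[1]             # class count fixed by the training set
test_inferred = one_hot(test_labels)  # width inferred from the test set alone
test_fixed = one_hot(test_labels, K)  # width forced to match training

print(train_onehot.shape, test_inferred.shape, test_fixed.shape)
# prints: (4, 3) (2, 2) (2, 3)
```

Only `test_fixed` has the same number of columns as the training labels, so only it matches the shape of a model output layer sized for K classes.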