【发布时间】:2019-03-15 13:56:54
【问题描述】:
我有一个大数据集,数据集有两个特征,第一个特征是数据,第二个特征是标签,数据集大小约为6GB,当我运行代码如下:
#data_from_dataset represent data from 4G dataset, data_from_dataset
#type is ndarray,The data_from_dataset shape is two dimension like (a
#very large num,15)
#label_from_dataset represent label from 4G dataset,,label_from_dataset type
#is ndarray also ndarray
#label_from_dataset #shape is two dimension like (a very large num,15)
data_from_dataset, label_from_dataset = load_train_data()
#calc total batch count
num_batch = len(data_from_dataset) // hp.batch_size
# Convert to tensor
X = tf.convert_to_tensor(data_from_dataset, tf.int32)
Y = tf.convert_to_tensor(label_from_dataset, tf.int32)
# Create Queues
input_queues = tf.train.slice_input_producer([X, Y])
# create batch queues
x, y = tf.train.shuffle_batch(input_queues,
num_threads=20,
batch_size=hp.batch_size,
capacity=hp.batch_size*64,
min_after_dequeue=hp.batch_size*32,
allow_smaller_final_batch=False)
运行很长时间后运行很慢,控制台提示错误如下:
Error:cannot create a tensor larger than 2GB
这些代码行似乎有问题:
# Convert to tensor
X = tf.convert_to_tensor(data_from_dataset, tf.int32)
Y = tf.convert_to_tensor(label_from_dataset, tf.int32)
我修改了将 NUMPY 转换为 TFRECORD 的代码,如下所示:
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def save_tfrecords(data_from_dataset, label_from_dataset, desfile):
with tf.python_io.TFRecordWriter(desfile) as writer:
for i in range(len(data_from_dataset)):
features = tf.train.Features(
feature = {
"data": _int64_feature(data[i]),
"label": _int64_feature(label[i])
}
)
example = tf.train.Example(features = features)
serialized = example.SerializeToString()
writer.write(serialized)
def read_and_decode(filename_queue):
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
features = tf.parse_single_example(
serialized_example,
features={
'data': tf.FixedLenFeature([], tf.string),
'label': tf.FixedLenFeature([], tf.string),
})
sent = features['data']
tag = features['label']
sent_decode=tf.decode_raw(sent,tf.int32)
sent_decode=tf.decode_raw(tag,tf.int32)
return sent, tag
fname_out="out.tfrecord"
save_tfrecords(data_from_dataset, label_from_dataset, fname_out)
filename_queue = tf.train.string_input_producer(fname_out, shuffle=True)
example, label = read_and_decode(filename_queue, 2)
x, y = tf.train.shuffle_batch([example, label],
num_threads=20,
batch_size=hp.batch_size,
capacity=hp.batch_size*64,
min_after_dequeue=hp.batch_size*32,
allow_smaller_final_batch=False)
它在代码行上提示错误如下:
def _int64_feature(value):
return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
Error:only length-1 arrays can be converted to python scalars
如何将 numpy 转换为 tfrecord ?还有其他方法吗?
【问题讨论】:
-
data[i]、label[i]的形状是什么?
-
data[i] 类型为元组,形状为(15,) label[i] 类型为元组,形状为(15,)
标签: tensorflow