【Question title】: How to convert numpy.ndarray to tfrecord?
【Posted】: 2019-03-15 13:56:54
【Question】:

I have a large dataset with two features: the first is the data and the second is the label. The dataset is about 6 GB. When I run the following code:

# data_from_dataset holds the data from the 4G dataset. It is an ndarray
# with a two-dimensional shape like (a very large number, 15).
# label_from_dataset holds the labels from the 4G dataset. It is also an
# ndarray with a two-dimensional shape like (a very large number, 15).

data_from_dataset, label_from_dataset = load_train_data()

# Compute the total number of batches
num_batch = len(data_from_dataset) // hp.batch_size

# Convert to tensor
X = tf.convert_to_tensor(data_from_dataset, tf.int32)
Y = tf.convert_to_tensor(label_from_dataset, tf.int32)

# Create Queues
input_queues = tf.train.slice_input_producer([X, Y])


# create batch queues
x, y = tf.train.shuffle_batch(input_queues,
                            num_threads=20,
                            batch_size=hp.batch_size,
                            capacity=hp.batch_size*64,
                            min_after_dequeue=hp.batch_size*32,
                            allow_smaller_final_batch=False)

It runs very slowly, and after a long time the console reports the following error:

Error:cannot create a tensor larger than 2GB
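
A rough size estimate shows why the conversion hits this limit. In graph mode, tf.convert_to_tensor stores the array in the graph as a constant, and a serialized protocol buffer is capped at 2 GB. The sketch below assumes int32 values (4 bytes each) and 15 columns, as described in the question; the row count is a hypothetical figure chosen for illustration, not a value taken from the question.

```python
# Estimate the size of a dense int32 array of shape (rows, 15)
# and compare it with the 2 GB protocol buffer limit.
PROTO_LIMIT_BYTES = 2 ** 31   # 2 GB
BYTES_PER_INT32 = 4
COLUMNS = 15                  # from the shape described in the question


def array_bytes(rows, columns=COLUMNS, itemsize=BYTES_PER_INT32):
    """Return the number of bytes needed for a dense rows x columns array."""
    return rows * columns * itemsize


def max_rows_under_limit(columns=COLUMNS, itemsize=BYTES_PER_INT32):
    """Return the largest row count whose array stays below the limit."""
    return (PROTO_LIMIT_BYTES - 1) // (columns * itemsize)


hypothetical_rows = 50_000_000  # assumed row count, for illustration only
print(array_bytes(hypothetical_rows))                      # 3000000000
print(array_bytes(hypothetical_rows) > PROTO_LIMIT_BYTES)  # True
print(max_rows_under_limit())                              # 35791394
```

Under these assumptions, any array with more than about 35.8 million rows cannot be embedded as a single constant, whatever the batch size.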

These lines seem to be the problem:

# Convert to tensor
X = tf.convert_to_tensor(data_from_dataset, tf.int32)
Y = tf.convert_to_tensor(label_from_dataset, tf.int32)

I changed the code to convert the NumPy arrays to a TFRecord file, as follows:

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def save_tfrecords(data_from_dataset, label_from_dataset, desfile):
    with tf.python_io.TFRecordWriter(desfile) as writer:
        for i in range(len(data_from_dataset)):
            features = tf.train.Features(
                feature = {
                    "data": _int64_feature(data[i]),
                    "label": _int64_feature(label[i])

                }
            )
            example = tf.train.Example(features = features)
            serialized = example.SerializeToString()
            writer.write(serialized)

def read_and_decode(filename_queue):
    reader = tf.TFRecordReader()
    _, serialized_example = reader.read(filename_queue)
    features = tf.parse_single_example(
        serialized_example,
        features={
            'data': tf.FixedLenFeature([], tf.string),
            'label': tf.FixedLenFeature([], tf.string),
        })

    sent = features['data']
    tag = features['label']
    sent_decode=tf.decode_raw(sent,tf.int32)
    sent_decode=tf.decode_raw(tag,tf.int32)

    return sent, tag

fname_out="out.tfrecord"
save_tfrecords(data_from_dataset, label_from_dataset, fname_out)
filename_queue = tf.train.string_input_producer(fname_out, shuffle=True)
example, label = read_and_decode(filename_queue, 2)
x, y = tf.train.shuffle_batch([example, label],
                                num_threads=20,
                                batch_size=hp.batch_size,
                                capacity=hp.batch_size*64,
                                min_after_dequeue=hp.batch_size*32,
                                allow_smaller_final_batch=False)

It reports the following error at these lines:

   def _int64_feature(value):
      return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

Error:only length-1 arrays can be converted to python scalars
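
This message is the one NumPy raises when a multi-element array is coerced to a single Python integer, which is what wrapping a whole row as value=[value] asks for. A minimal sketch with NumPy alone, assuming a row of length 15 as in the question; the variable names are illustrative.

```python
import numpy as np

row = np.arange(15, dtype=np.int32)  # stand-in for one row, e.g. data[i]

# Coercing the whole row to one integer fails, because a list such as
# [row] has a single element that is itself a 15-element array.
try:
    int(row)
    coerced = True
except TypeError:
    coerced = False
print(coerced)  # False

# Flattening the row to a list of Python ints gives one scalar per entry.
# A list like this is the form tf.train.Int64List(value=...) expects.
values = row.tolist()
print(len(values), type(values[0]).__name__)  # 15 int
```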

How do I convert NumPy arrays to TFRecord? Is there another way to do this?

【Comments】:

  • What are the shapes of data[i] and label[i]?
  • data[i] is a tuple with shape (15,); label[i] is a tuple with shape (15,).

Tags: tensorflow


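【Solution 1】:

tf.train.Int64List(value=[value]) fails here because value is an array, and every element of the list must be a single integer. One way around this is to serialize the array to bytes and store it with tf.train.BytesList instead: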

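    import numpy as np
    import tensorflow as tf

    def _bytes_feature(value):
        # Wrap a bytes object in a Feature holding a single-element BytesList.
        return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

    data = np.random.rand(15,)
    raw = data.tobytes()  # tostring() is a deprecated alias of tobytes()
    with tf.python_io.TFRecordWriter('file.tfrecords') as writer:
        example = tf.train.Example(features=tf.train.Features(feature={'1': _bytes_feature(raw)}))
        writer.write(example.SerializeToString())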

You can then decode it with tf.decode_raw. To inspect the tfrecord file, you can use the following:

for record in tf.python_io.tf_record_iterator('file.tfrecords'):
    example = tf.train.Example()
    example.ParseFromString(record)
    raw = example.features.feature['1'].bytes_list.value[0]
    # The dtype must match the array that was written (np.random.rand gives float64).
    your_data = np.frombuffer(raw, dtype=np.float64)
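
Raw bytes carry neither dtype nor shape, so the reader has to supply both. The sketch below shows the round trip with NumPy alone, assuming int32 rows of length 15 as in the question; the same bytes could be stored in a tf.train.BytesList as in the answer above.

```python
import numpy as np

# Two int32 rows of length 15, standing in for a slice of the dataset.
block = np.arange(30, dtype=np.int32).reshape(2, 15)

raw = block.tobytes()          # serialize to a bytes object
print(len(raw))                # 120 (30 values x 4 bytes)

# Decoding with the matching dtype and shape recovers the array.
restored = np.frombuffer(raw, dtype=np.int32).reshape(2, 15)
print(np.array_equal(block, restored))  # True

# Decoding with the wrong dtype silently yields a different element count.
wrong = np.frombuffer(raw, dtype=np.int64)
print(wrong.size)              # 15
```

The same consideration applies to tf.decode_raw: the dtype passed there should be the one the array had when it was serialized.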

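【Comments】:

  • I tried your suggestion and it failed. The error is on this line: str = data[i].tostring() #Convert to save space
  • data[i] should be of type numpy.array.
  • I modified the code as you suggested, and the console reports: ValueError: GraphDef cannot be large than 2GB., stackoverflow.com/questions/55311269/…
  • This is an internal limit in TensorFlow. You can use placeholders as described at tensorflow.org/guide/datasets, or split the array into several equal chunks. If you hit this error while converting to tfrecords, it is better to use numpy.split.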
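
The chunking idea from the last comment can be sketched with np.array_split, which, unlike np.split, also accepts a chunk count that does not divide the row count evenly. The array size and chunk count below are small illustrative assumptions; in practice each chunk would be written out (for example to its own tfrecord file) before moving on to the next.

```python
import numpy as np

# Small stand-in for the large (rows, 15) arrays from the question.
data = np.arange(100 * 15, dtype=np.int32).reshape(100, 15)
labels = np.zeros((100, 15), dtype=np.int32)

num_chunks = 8  # assumed value; pick it so each chunk stays well under 2 GB

data_chunks = np.array_split(data, num_chunks)
label_chunks = np.array_split(labels, num_chunks)

for index, (d, lab) in enumerate(zip(data_chunks, label_chunks)):
    # Each pair (d, lab) covers the same rows and can be processed on its own.
    print(index, d.shape, lab.shape, d.nbytes)

print(sum(len(d) for d in data_chunks))  # 100
```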