Array read from tfrecord does not match the array written to it

【Question title】: Array Read From tfrecord Does Not Match Array Written To It
【Posted】: 2021-10-06 08:36:03
【Question】:

For some reason, the numpy array (of shape 55,290) that I write to a TensorFlow record does not match what I get when I read the same record back in.

This is the code I use to write the tfrecord:

import numpy as np
import tensorflow as tf

def _int64_feature(value):
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def _float_feature(value):
    """Returns a float_list from an iterable of floats (e.g. a 1-D array)."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=value))

def serialize_data(X, y):
    feature = {
        'n_wavelength_channels': _int64_feature(55),
        'n_time_steps': _int64_feature(290),
        'rel_radii': _float_feature(y),
        'rel_flux': _float_feature(X.flatten()),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature)).SerializeToString()

def tf_record_generator():
    X_file_chunk = ["E:/ml_data_challenge_database/noisy_train/0001_01_01.txt"]
    y_file_chunk = ["E:/ml_data_challenge_database/params_train/0001_01_01.txt"]

    for X_file, y_file in zip(X_file_chunk, y_file_chunk):
        # Drop the first 10 columns: (55, 300) -> (55, 290)
        X = np.genfromtxt(X_file, dtype=np.float32)[:, 10:]
        y = np.genfromtxt(y_file, dtype=np.float32)
        yield serialize_data(X, y)

n_splits = 1
tfrecord_filename = "training_record_{}.tfrecords"

for index in range(n_splits): # Number of splits
    writer = tf.data.experimental.TFRecordWriter(tfrecord_filename.format(index))

    serialized_features_dataset = tf.data.Dataset.from_generator(tf_record_generator, output_types=tf.string, output_shapes=())

    writer.write(serialized_features_dataset)

This is the code I use to read back the record I just wrote:

import matplotlib.pyplot as plt
import tensorflow as tf

def parse_record(record):
    name_to_features = {
        'n_wavelength_channels': tf.io.FixedLenFeature([], tf.int64),
        'n_time_steps': tf.io.FixedLenFeature([], tf.int64),
        'rel_radii': tf.io.FixedLenFeature([55], tf.float32),
        'rel_flux': tf.io.FixedLenFeature([55*290], tf.float32),
    }
    return tf.io.parse_single_example(record, name_to_features)
def decode_record(record):
    parsed_record = parse_record(record)
    flux = parsed_record['rel_flux']
    radii = parsed_record['rel_radii']
    return flux, radii
def get_batched_dataset(filenames):
    option_no_order = tf.data.Options()
    option_no_order.experimental_deterministic = False
    dataset = tf.data.Dataset.list_files(filenames)
    dataset = dataset.with_options(option_no_order)
    dataset = dataset.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
    dataset = dataset.map(decode_record, num_parallel_calls=tf.data.AUTOTUNE)

    dataset = dataset.repeat()
    dataset = dataset.shuffle(2048)
    dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)
    dataset = dataset.prefetch(tf.data.AUTOTUNE)

    return dataset
def get_training_dataset():
    return get_batched_dataset(training_filenames)

BATCH_SIZE=1
training_filenames = tf.io.gfile.glob("training_record_*.tfrecords")
training_data = get_training_dataset()
X_batch, y_batch = next(iter(training_data))

def show_batch(X_batch, y_batch):
    for i in X_batch:
        plt.plot(i.reshape(290,55))
        plt.show()


show_batch(X_batch.numpy(), y_batch.numpy())

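This is part of the input pipeline for a neural network I am working on. I modified it to create a tfrecord from a single training observation and then read that observation back out.

The output from the tfrecord looks like this:

And this is what the original observation looks like: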

X = np.genfromtxt("E:/ml_data_challenge_database/noisy_train/0001_01_01.txt")
plt.plot(X.T[10:,:])
plt.show()

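(All 55 rows are plotted at once.)

The y values read from the tfrecord do match the true y values, but I don't know why the X data appears to be wrong. I have been following some guides closely, but I am very new to working with TF data. Could someone look at my code and point out what I might be doing wrong? Thank you very much!

Here is a Google Drive link to the X data (referenced as "X_file_chunk" inside tf_record_generator), and here is one to the y data (also inside tf_record_generator).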

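【Comments】:

Just a guess: your input data is np.float64, but in TensorFlow you use float, which is equivalent to np.float32.
That is something I wanted to check, so I added "dtype=np.float32" to the genfromtxt() calls in tf_record_generator() to make sure the raw data is read in the same format TF writes, but the result is the same as above. I suspect it has something to do with the shape of the X array, since the y values read from the tfrecord match the true y values.
Sounds reasonable. I am currently a bit confused about the shapes, and without your data we cannot reproduce your problem. Could you generate some dummy data with the correct shapes and add it to your question? Maybe you will find the answer by doing so.
I just added links to the same X and y data I used above. X is an array of shape (55,300), but I truncate the first 10 columns in tf_record_generator(). y is an array of shape (55).
That works too, but your first file is private.

Tags: python tensorflow neural-network tensorflow-datasets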

【Solution 1】:

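You are mixing up the dimensions when you reshape back to 2D: it should be i.reshape(55,290).T. X.flatten() writes the (55,290) array out row by row, so the flat vector has to be reshaped to (55,290) first and only then transposed to match the orientation of the original plot.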

With that change, the plot is identical to the one from the original data.
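To see why the order of the two dimensions matters, here is a minimal NumPy-only sketch. It does not touch TFRecords at all. The small sizes (4 and 6) are hypothetical stand-ins for 55 and 290, and the array contents are made up for illustration.

```python
import numpy as np

# Hypothetical small sizes standing in for 55 channels x 290 time steps.
n_channels, n_steps = 4, 6

# Made-up data: each element has a unique value so misplacement is easy to spot.
original = np.arange(n_channels * n_steps, dtype=np.float32).reshape(n_channels, n_steps)

# flatten() uses row-major (C) order, the same call used in serialize_data.
flat = original.flatten()

# Reshaping with the dimensions swapped scrambles the rows.
wrong = flat.reshape(n_steps, n_channels)

# Reshaping with the original dimensions restores the array exactly.
right = flat.reshape(n_channels, n_steps)

print(np.array_equal(right, original))      # round trip succeeds
print(np.array_equal(wrong, original.T))    # swapped reshape is not a transpose
print(right.T.shape)                        # orientation used for plotting
```

The point of the sketch is that reshape(n_steps, n_channels) is not the same operation as a transpose: it only reinterprets the flat buffer with different row lengths, so every "row" ends up holding values from the wrong channel.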

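By the way, your data really is in float64 format, so when you read and plot the raw data you are using float64, whereas the data coming from tf.Dataset is float32. That is not the reason your plots differ, though.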
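As a rough illustration of how small the float64 to float32 difference is, the sketch below casts a few made-up values (they are not taken from the question's data files) and measures the rounding error.

```python
import numpy as np

# Made-up float64 values, chosen because none is exactly representable in float32.
x64 = np.array([0.1, 1.0 / 3.0, 1.000123456789], dtype=np.float64)

# Cast down to float32, the precision stored in a tf.train.FloatList.
x32 = x64.astype(np.float32)

# Largest absolute rounding error introduced by the cast.
max_abs_diff = np.max(np.abs(x64 - x32.astype(np.float64)))

print(x32.dtype)
print(max_abs_diff)
print(np.allclose(x64, x32, rtol=1e-6))
```

For values of this magnitude the error is far below anything visible in a line plot, which is consistent with the dtype not being the cause of the mismatch.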

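【Comments】:

Wow, so it turned out to be something simple in the end. Thank you very much for taking a look, I really appreciate it!
As usual. I would appreciate it if you accepted the answer.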