Taking all combinations of records from two TFRecords: answers

【Question title】: Taking all combinations of records from two TFRecords
【Posted】: 2020-02-04 17:06:21
【Question description】:

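I have two TFRecords, A and B, of different sizes and containing different data elements.

I need to take every possible pair of records from A and B. So, during training or testing, I want the end-of-epoch signal to fire only after all combinations have been exhausted, after which the process should resume with the next epoch.

Of course, I also want to specify a batch size while doing this.

I went through the documentation for tf.data.Dataset but did not find anything like this.

This could of course be done by writing a Python generator. Unfortunately, that does not help, because according to the documentation a Python generator is bound by the GIL (global interpreter lock).

So, suppose

A contains {image1, image2, image3} and B contains {im1, im2, im3, im4, im5, im6}, and I have specified a batch size of 2. Then I want the output to look like this: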

(image1, im1) and (image2, im4)

(image3, im2) and (image1, im2)

(image2, im1) and (image2, im3)

..............

15 more combinations

Then the next epoch starts.

How can this be achieved in TensorFlow?

【Comments】:

Tags: python tensorflow tensorflow2.0

【Solution 1】:

There are several SO posts on how to compute the Cartesian product of two arrays with NumPy or TensorFlow.
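
For inputs that do fit in memory, the Cartesian product can be sketched with NumPy as follows. This is only an illustration on assumed toy data: the arrays `a` and `b` stand in for the decoded records, and the product is built over indices so the same idea would carry over to arrays of images.

```python
import numpy as np

# Toy stand-ins for the records of A (3 items) and B (6 items).
a = np.array([1, 2, 3])
b = np.array([10, 20, 30, 40, 50, 60])

# Index grids enumerate every (i, j) combination exactly once.
ia, ib = np.meshgrid(np.arange(len(a)), np.arange(len(b)), indexing="ij")
pairs = np.stack([a[ia.ravel()], b[ib.ravel()]], axis=1)  # shape (18, 2)

# Shuffle the rows (pairs) in place, then group them into batches of 2 pairs.
rng = np.random.default_rng(0)
rng.shuffle(pairs)
batches = pairs.reshape(-1, 2, 2)  # shape (9, 2, 2): 9 batches per epoch

print(batches[0])
```

One pass over `batches` is one epoch: every one of the 3 × 6 = 18 pairs appears exactly once.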

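If your arrays are too large for in-memory computation, your best option is probably to use two tf.data.Dataset objects (one opened on each array) and run a double loop: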

for a in dataset_A:
    for b in dataset_B.batch(2):
        # Pair the current A element with each of the two B elements in this batch.
        batch = [[a, b[0]], [a, b[1]]]  # Or something similar (TF should have a function for this)

Iterating over a dataset inside a function decorated with @tf.function is known to be fast.
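
The double loop can also be expressed inside the tf.data pipeline itself, which keeps the pairing out of Python. The following is a minimal sketch, not a tuned implementation. It assumes TensorFlow 2.x, and `dataset_a` and `dataset_b` are toy stand-ins for the two parsed TFRecord datasets.

```python
import tensorflow as tf

# Toy stand-ins; in practice these would be parsed tf.data.TFRecordDataset objects.
dataset_a = tf.data.Dataset.range(3)         # plays the role of A (3 records)
dataset_b = tf.data.Dataset.range(100, 106)  # plays the role of B (6 records)

# For each element of A, emit one (a, b) pair per element of B.
pairs = dataset_a.flat_map(lambda a: dataset_b.map(lambda b: (a, b)))

# 3 * 6 = 18 pairs per epoch; a buffer of 18 gives a full shuffle for this toy size.
batched = pairs.shuffle(buffer_size=18).batch(2)

for epoch in range(2):
    epoch_pairs = []
    num_batches = 0
    for a_batch, b_batch in batched:  # one full pass over `batched` is one epoch
        num_batches += 1
        epoch_pairs.extend(zip(a_batch.numpy().tolist(), b_batch.numpy().tolist()))
    print(f"epoch {epoch}: {num_batches} batches, {len(set(epoch_pairs))} distinct pairs")
```

Each pass over `batched` ends only after all 18 combinations have been produced, and the default `reshuffle_each_iteration=True` reorders them for the next epoch. Note that B is re-read once per element of A, and for large inputs a shuffle buffer covering all pairs may not fit in memory, so a smaller buffer (and a less uniform shuffle) would be a trade-off to consider.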

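【Comments】:

【Solution 2】:

You can use the tf.data.Dataset.from_generator function, where the generator function implements your logic, e.g. the cross product of two other datasets. To draw a random pair of samples from the zipped datasets db1 and db2, I shuffle each dataset independently.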

import tensorflow as tf
tf.compat.v1.enable_eager_execution()  # Only needed on TF 1.x; eager execution is the default in TF 2.x

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]

db1 = tf.data.Dataset.from_tensor_slices(A).shuffle(len(A)).repeat()
db2 = tf.data.Dataset.from_tensor_slices(B).shuffle(len(B)).repeat()

def cross_db_generator():
    for db1_example, db2_example in zip(db1, db2):
        print(db1_example.numpy(), db2_example.numpy())
        yield db1_example, db2_example


cross_db = tf.data.Dataset.from_generator(cross_db_generator, output_types=(tf.uint8, tf.uint8))
cross_db = cross_db.batch(2)

# repeat() makes both source datasets endless, so only take a few batches here.
for sample in cross_db.take(3):
    print((sample[0][0].numpy(), sample[1][0].numpy()), (sample[0][1].numpy(), sample[1][1].numpy()))
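
The same random pairing of two independently shuffled datasets can also be written without a Python generator by zipping them with tf.data.Dataset.zip. This is a sketch assuming TensorFlow 2.x. Like the generator version, it samples random pairs and does not guarantee that every combination is visited within an epoch.

```python
import tensorflow as tf

A = [1, 2, 3, 4]
B = [5, 6, 7, 8]

db1 = tf.data.Dataset.from_tensor_slices(A).shuffle(len(A)).repeat()
db2 = tf.data.Dataset.from_tensor_slices(B).shuffle(len(B)).repeat()

# Pair the two shuffled streams element by element inside the tf.data runtime.
zipped = tf.data.Dataset.zip((db1, db2)).batch(2)

sampled = []
for a_batch, b_batch in zipped.take(3):  # the streams are endless, so take 3 batches
    sampled.extend(zip(a_batch.numpy().tolist(), b_batch.numpy().tolist()))
print(sampled)
```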

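【Comments】:

Using from_generator is not efficient because it is bound by Python's GIL, right?
Sorry, I have not looked into the efficiency of from_generator. You may want to check out this question on parallelizing tf.data.Dataset.from_generator.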