Unable to train on TPU in Google Colab (InternalError: 9 root error(s) found): answers

【Question title】: Unable to train on TPU in Google Colab, InternalError: 9 root error(s) found
【Posted】: 2022-01-19 09:05:45
【Question description】:

BATCH_SIZE = 64
HEIGHT, WIDTH = 124, 124

# Training set:   14906 samples, 6 classes
# Validation set:  3726 samples, 6 classes

with strategy.scope():
  model = create_model()
  model = complile_model(model,lr=0.0001)
  callbacks = create_callbacks()
epochs = 5
steps_per_epoch  = 14906//BATCH_SIZE
validation_steps = 3726//BATCH_SIZE

history = model.fit(train_dataset,
                    epochs=epochs,
                    steps_per_epoch=steps_per_epoch,
                    validation_data=validation_dataset, 
                    validation_steps=validation_steps)

I am trying to train this model on the TPU provided by Google Colab, but I cannot get it to work. Please help me resolve this issue. A screenshot is attached.

【Question discussion】:

Tags: data-science google-colaboratory

【Solution 1】:

ImageDataGenerator uses a PyFunction under the hood, which is not compatible with TPUs. You have to load the images with the tf.data API instead. This tutorial explains how to do that.
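
As a rough illustration of loading images with tf.data rather than ImageDataGenerator, here is a minimal sketch. The helper names `load_image` and `make_dataset` are hypothetical, and the sketch assumes JPEG files, parallel lists of file paths and integer labels, and paths that are readable by whichever host runs the input pipeline. Only TensorFlow ops are used, so no Python function is involved.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE
HEIGHT, WIDTH = 124, 124   # image size from the question
BATCH_SIZE = 64            # batch size from the question


def load_image(path, label):
    """Read one JPEG file and resize it using TensorFlow ops only (hypothetical helper)."""
    image = tf.io.read_file(path)
    image = tf.io.decode_jpeg(image, channels=3)
    # resize returns float32; dividing by 255 scales pixels to [0, 1].
    image = tf.image.resize(image, [HEIGHT, WIDTH]) / 255.0
    return image, label


def make_dataset(paths, labels, batch_size=BATCH_SIZE, training=True):
    """Build a tf.data pipeline from file paths and integer labels (hypothetical helper)."""
    dataset = tf.data.Dataset.from_tensor_slices((paths, labels))
    if training:
        # Shuffle the (cheap) path/label pairs, then repeat indefinitely.
        dataset = dataset.shuffle(len(paths)).repeat()
    dataset = dataset.map(load_image, num_parallel_calls=AUTOTUNE)
    # drop_remainder=True gives every batch the same static shape.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(AUTOTUNE)
```

The result of `make_dataset(train_paths, train_labels)` could then be passed to `model.fit` as `train_dataset`, with `steps_per_epoch` set because the training dataset repeats indefinitely.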

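【Discussion】:

Which tutorial? I really thought that only ImageDataGenerator would cause the problem!
Dataset.from_generator is not expected to work on TPUs, because it uses py_function underneath, which is incompatible with the Cloud TPU 2VM setup. If you want to read from a large dataset, try materializing it on disk and using TFRecordDataset instead.
Please add some explanation to your answer so that others can learn from it.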
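
The suggestion in the comments above, materializing the data on disk and reading it back with TFRecordDataset, might look roughly like the sketch below. The function names, the feature keys `"image"` and `"label"`, and the choice of JPEG-encoded bytes are all assumptions made for illustration; none of them come from the thread.

```python
import tensorflow as tf


def serialize_example(image_bytes, label):
    """Pack one encoded image and its integer label into a serialized tf.train.Example."""
    feature = {
        "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    }
    example = tf.train.Example(features=tf.train.Features(feature=feature))
    return example.SerializeToString()


def write_tfrecord(samples, output_path):
    """Write an iterable of (image_bytes, label) pairs to one TFRecord file."""
    with tf.io.TFRecordWriter(output_path) as writer:
        for image_bytes, label in samples:
            writer.write(serialize_example(image_bytes, label))


def parse_example(serialized):
    """Decode one record into (image tensor, label) using TensorFlow ops only."""
    spec = {
        "image": tf.io.FixedLenFeature([], tf.string),
        "label": tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    image = tf.io.decode_jpeg(parsed["image"], channels=3)
    return image, parsed["label"]
```

Once the file is written, `tf.data.TFRecordDataset(path).map(parse_example)` reads it back without any Python callback in the pipeline.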

【Solution 2】:

The dataset must call repeat():

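import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE

# parse_tfrecord_fn and prepare_sample are user-defined and not shown in this answer.
def get_dataset(filenames, batch_size):
    dataset = (
        tf.data.TFRecordDataset(filenames, num_parallel_reads=AUTOTUNE)
        .map(parse_tfrecord_fn, num_parallel_calls=AUTOTUNE)
        .map(prepare_sample, num_parallel_calls=AUTOTUNE)
        .repeat()  # loop over the data indefinitely so fit() does not run out of batches
        .shuffle(batch_size * 10)
        .batch(batch_size)
        .prefetch(AUTOTUNE)
    )
    return dataset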
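
To see why repeat() matters when steps_per_epoch is set, it may help to work through the numbers from the question. The arithmetic below is only a sketch: it assumes fit() keeps drawing batches from the same iterator across epochs, and that a single pass keeps the final partial batch.

```python
import math

TRAIN_SAMPLES = 14906   # training set size from the question
BATCH_SIZE = 64         # batch size from the question
EPOCHS = 5              # epochs from the question

# Steps per epoch as computed in the question (integer division).
steps_per_epoch = TRAIN_SAMPLES // BATCH_SIZE

# Total number of batches fit() asks for over the whole run.
batches_requested = steps_per_epoch * EPOCHS

# Batches a single, non-repeated pass over the data can supply.
batches_in_one_pass = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)

print(steps_per_epoch, batches_requested, batches_in_one_pass)
# prints: 232 1160 233
```

Under those assumptions a non-repeated dataset supplies 233 batches while training asks for 1160, so the input would be exhausted during the second epoch.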

【Discussion】: