InternalError when using TPU for training Keras model: Answers

【Question title】: InternalError when using TPU for training Keras model
【Posted】: 2022-01-25 11:05:31
【Question description】:

I am trying to fine-tune a BERT model from TensorFlow Hub on Google Colab, following this link.

However, I run into the following error:

InternalError: RET_CHECK failure (third_party/tensorflow/core/tpu/graph_rewrite/distributed_tpu_rewrite_pass.cc:2047) arg_shape.handle_type != DT_INVALID  input edge: [id=2693 model_preprocessing_67660:0 -> cluster_train_function:628]

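It occurs when I call model.fit(...).

The error only happens when I try to use a TPU (everything runs fine on CPU, but training takes a very long time).

Here is my code for setting up the TPU and the model:

TPU setup: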

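import os
import tensorflow as tf

# Have TF Hub read models uncompressed from remote storage instead of caching them locally
os.environ["TFHUB_MODEL_LOAD_FORMAT"]="UNCOMPRESSED"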

cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='')
tf.config.experimental_connect_to_cluster(cluster_resolver)
tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
strategy = tf.distribute.TPUStrategy(cluster_resolver)

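Model setup:

import tensorflow_hub as hub
import tensorflow_text as text  # registers the custom ops used by the BERT preprocessing model

def build_classifier_model():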
  text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
  preprocessing_layer = hub.KerasLayer('https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3', name='preprocessing')
  encoder_inputs = preprocessing_layer(text_input)
  encoder = hub.KerasLayer('https://tfhub.dev/google/experts/bert/wiki_books/sst2/2', trainable=True, name='BERT_encoder')
  outputs = encoder(encoder_inputs)
  net = outputs['pooled_output']
  net = tf.keras.layers.Dropout(0.1)(net)
  net = tf.keras.layers.Dense(1, activation=None, name='classifier')(net)
  return tf.keras.Model(text_input, net)

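Model training:

# Assumed import (not shown in the question): the optimizer helper from the TensorFlow Model Garden
from official.nlp import optimization

with strategy.scope():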

  bert_model = build_classifier_model()
  loss = tf.keras.losses.BinaryCrossentropy(from_logits=True)
  metrics = tf.metrics.BinaryAccuracy()
  epochs = 1
  steps_per_epoch = 1280000
  num_train_steps = steps_per_epoch * epochs
  num_warmup_steps = int(0.1*num_train_steps)

  init_lr = 3e-5
  optimizer = optimization.create_optimizer(init_lr=init_lr,
                                          num_train_steps=num_train_steps,
                                          num_warmup_steps=num_warmup_steps,
                                          optimizer_type='adamw')
  bert_model.compile(optimizer=optimizer,
                         loss=loss,
                         metrics=metrics)
  print(f'Training model')
  history = bert_model.fit(x=X_train, y=y_train,
                               validation_data=(X_val, y_val),
                               epochs=epochs)

Note that X_train is a numpy array of type str with shape (1280000,), and y_train is a numpy array with shape (1280000, 1).

【Comments】:

Why not use a GPU instead? A TPU needs its input data in a special format.

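Tags: python tensorflow machine-learning bert-language-model tpu

【Solution 1】:

I don't know exactly what changes you made to the code, and I don't know your dataset. However, I can see that you are trying to train on the whole dataset in a single epoch and are passing the dataset size directly as the number of steps per epoch. I suggest writing it like this instead.

Set batch_size to a power of two (for example 16 or 32). If you don't want to batch the dataset, just set batch_size to 1.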

batch_size = 16
steps_per_epoch = training_data_size // batch_size
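To make the arithmetic concrete, here is a minimal sketch that applies the formula above to the 1,280,000 examples mentioned in the question. The helper name `steps_per_epoch_for` and its `drop_remainder` flag are hypothetical, introduced only for illustration; they are not part of Keras or of the original answer.

```python
import math


def steps_per_epoch_for(training_data_size: int, batch_size: int,
                        drop_remainder: bool = True) -> int:
    """Number of optimizer steps needed to see the dataset once.

    Hypothetical helper: with drop_remainder=True the last partial batch
    is skipped (floor division, as in the answer); otherwise it is counted.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    if drop_remainder:
        return training_data_size // batch_size
    return math.ceil(training_data_size / batch_size)


# Dataset size taken from the question: X_train has shape (1280000,).
training_data_size = 1_280_000

for batch_size in (1, 16, 32):
    steps = steps_per_epoch_for(training_data_size, batch_size)
    print(f"batch_size={batch_size:>2} -> steps_per_epoch={steps}")
```

With batch_size = 1 the step count equals the dataset size, which is the value hard-coded in the question; with batch_size = 16 it drops to 80000.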

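The problem with your code is most likely the training dataset size. I think the value you passed in by hand for the training dataset size is wrong.

If you are loading the dataset from tfds, use (as shown in the link):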

train_dataset, train_data_size = load_dataset_from_tfds(
  in_memory_ds, tfds_info, train_split, batch_size, bert_preprocess_model)

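If you are using a custom dataset, store the size of the cleaned dataset in a variable and use that variable wherever the training data size is needed. Try to avoid hard-coding values in the code.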
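As a rough sketch of that advice, the snippet below derives every step count from the data itself instead of from literals. The function name `training_schedule`, its parameters, and the toy data are assumptions made for illustration; the 10% warmup fraction mirrors the `int(0.1*num_train_steps)` line in the question.

```python
def training_schedule(examples, batch_size, epochs, warmup_fraction=0.1):
    """Derive step counts from the dataset rather than hard-coding them.

    Hypothetical helper; `examples` is any sized collection, such as a
    list or a numpy array of cleaned training texts.
    """
    training_data_size = len(examples)          # measured, not typed in by hand
    steps_per_epoch = training_data_size // batch_size
    num_train_steps = steps_per_epoch * epochs
    num_warmup_steps = int(warmup_fraction * num_train_steps)
    return {
        "training_data_size": training_data_size,
        "steps_per_epoch": steps_per_epoch,
        "num_train_steps": num_train_steps,
        "num_warmup_steps": num_warmup_steps,
    }


# Toy stand-in for a cleaned custom dataset (1000 placeholder sentences).
cleaned_texts = [f"example sentence {i}" for i in range(1000)]

schedule = training_schedule(cleaned_texts, batch_size=16, epochs=3)
print(schedule)
```

The resulting values can then be passed on to the optimizer and to fit(), so that changing the dataset or the batch size never leaves a stale constant behind.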

【Comments】:

Yes, you are right. Training in batches is much better than the approach I was trying here, thank you!