Keras multi GPU using TensorFlow MirroredStrategy

【Question title】: Keras multi GPU using TensorFlow MirroredStrategy
【Posted】: 2020-11-13 21:58:15
【Question description】:

es = EarlyStopping(monitor='val_loss', mode='min', patience=100, restore_best_weights=True, verbose=0)
strategy = tf.distribute.MirroredStrategy(devices=['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3'])
with strategy.scope():
   model = RESNET()
history = model.fit(samples2Fit, validation_data=samples2Validate, epochs=args.epochs, callbacks=[es], verbose=0)

The RESNET() model is compiled with: model.compile(loss=tf.keras.losses.Huber(), optimizer=tf.keras.optimizers.Adam(epsilon=1e-08), metrics=[tf.keras.losses.Huber()]), and all other modules also come from tensorflow.keras.*

When I run this program with 4 GPUs, I get the following error: ValueError: Please use tf.keras.losses.Reduction.SUM or tf.keras.losses.Reduction.NONE for loss reduction when loss is used with tf.distribute.Strategy outside the built-in training loops...

I am following the example given at https://keras.io/guides/distributed_training/, so what am I missing, and why do I need to use these reductions? What does "outside the built-in training loops" mean?

【Comments】:

stackoverflow.com/questions/60106201/…

Tags: keras tensorflow2.0 multi-gpu

【Solution 1】:

Try calling fit inside the strategy scope:

with strategy.scope():
   model = RESNET()
   history = model.fit(samples2Fit, validation_data=samples2Validate, 
         epochs=args.epochs, callbacks=[es], verbose=0)

By default, MirroredStrategy uses NcclAllReduce() for cross_device_ops. From the documentation:

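cross_device_ops: optional, a descendant of CrossDeviceOps. If this is not set, NcclAllReduce() will be used by default. One would customize this if NCCL isn't available or if a special implementation that exploits the particular hardware is available.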
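As a minimal sketch of passing this argument explicitly, the snippet below builds a MirroredStrategy with tf.distribute.HierarchicalCopyAllReduce() in place of the default. The small Sequential model is a hypothetical stand-in for RESNET(), which is not shown in the question. The devices argument is left out here, so the strategy should pick up whatever GPUs are visible. Whether this changes the reported error is not something this sketch can confirm.

```python
import tensorflow as tf

# Replace the default NcclAllReduce() with another CrossDeviceOps implementation.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Hypothetical stand-in for RESNET(); variables are created inside the scope.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(4,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(
        loss=tf.keras.losses.Huber(),
        optimizer=tf.keras.optimizers.Adam(epsilon=1e-08),
    )
```

tf.distribute.ReductionToOneDevice() is another implementation accepted by the same parameter.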

You can try different cross_device_ops options: https://www.tensorflow.org/api_docs/python/tf/distribute/MirroredStrategy
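On the question of why the error asks for SUM or NONE: in a custom training loop, the per-replica results are summed across replicas, so a loss that is already averaged over each replica's local batch ends up too large by a factor of the replica count. The pure-Python sketch below illustrates only that arithmetic, using made-up per-example loss values and 4 replicas as assumptions. It does not explain why the message appeared with model.fit in this particular setup.

```python
# Made-up per-example losses for one global batch (assumption for illustration).
per_example_losses = [0.5, 1.5, 2.0, 4.0, 1.0, 3.0, 0.0, 4.0]
num_replicas = 4
global_batch_size = len(per_example_losses)
per_replica_batch = global_batch_size // num_replicas

# Split the global batch evenly across replicas, as a mirrored setup would.
shards = [
    per_example_losses[i * per_replica_batch:(i + 1) * per_replica_batch]
    for i in range(num_replicas)
]

# The value we actually want: the mean loss over the global batch.
true_mean = sum(per_example_losses) / global_batch_size

# Each replica averages over its local batch, then replicas are summed.
mean_then_sum = sum(sum(s) / len(s) for s in shards)

# Each replica sums its losses and divides by the GLOBAL batch size,
# then replicas are summed.
sum_then_scale = sum(sum(s) / global_batch_size for s in shards)

print(true_mean, mean_then_sum, sum_then_scale)
```

With Reduction.SUM or Reduction.NONE the scaling by the global batch size is left to the caller, which is what the second variant does.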
