【Question title】: How to find the origin of a TensorFlow NaN error on a multi-GPU system with NVIDIA Tesla P100 GPUs?
【Posted】: 2018-06-29 02:03:41
【Question description】:

Baseline situation

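I am training on a GPU cluster with 8 NVIDIA Tesla P100 GPUs. The script is based on the TensorFlow tutorial Convolutional Neural Networks. As training data I created binary files based on the CIFAR-10 data set, each containing only 5,000 of the 50,000 images. I use only one of these files per training run.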

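Here are some basic facts about the training:

Epochs: 100,000
Batch size: 128
Number of training examples: 5,000
Initial learning rate: 0.1
Learning rate decay factor: 0.1
Number of epochs per decay: 350.0
No exponential moving average
...
If you need more information, please leave a comment.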
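
To make the schedule implied by these settings concrete, here is a minimal sketch of a staircase exponential decay computed in plain Python. The function names are hypothetical, and the optional division by the number of GPUs is an assumption about how the multi-GPU script derives its steps per epoch; the question does not show that part of the code.

```python
def decay_steps(num_examples, batch_size, epochs_per_decay, num_gpus=1):
    """Number of optimizer steps between two learning-rate drops."""
    # Assumption: with several towers, one step consumes num_gpus batches.
    batches_per_epoch = num_examples / batch_size / num_gpus
    return int(batches_per_epoch * epochs_per_decay)


def staircase_lr(initial_lr, decay_factor, step, steps_per_decay):
    """Staircase decay: initial_lr * decay_factor ** floor(step / steps_per_decay)."""
    return initial_lr * decay_factor ** (step // steps_per_decay)


steps = decay_steps(num_examples=5000, batch_size=128, epochs_per_decay=350.0)
for step in (0, steps, 2 * steps):
    print(step, staircase_lr(0.1, 0.1, step, steps))
```

With the values listed above and a single GPU, the learning rate drops by a factor of 10 every 13,671 steps.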

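The problem is that if I train with more than one GPU, a NaN error occurs sooner or later. Within a few epochs the loss value explodes from ~0.4 to values above ~1e+26, then to what I assume is infinity, and then the NaN error appears.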
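
The step from "infinity" to NaN can be reproduced with ordinary floating-point arithmetic. The sketch below uses Python's 64-bit floats as a stand-in for the 32-bit floats in the TensorFlow graph, so the exact overflow threshold differs, but the rules are the same: an overflowing product becomes inf, and operations such as inf - inf or inf * 0 yield NaN.

```python
import math

big = 1e308                     # large, but still finite
overflowed = big * 10           # exceeds the representable range
diff = overflowed - overflowed  # inf - inf is undefined
prod = overflowed * 0.0         # inf * 0 is undefined

print(overflowed, math.isinf(overflowed))
print(diff, math.isnan(diff))
print(prod, math.isnan(prod))
```

This matches the order of the messages further down: the checks first report Inf values and later Inf and NaN values.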

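What I have done so far

So far I have tried the following to pin down the origin of the NaN error so that I can fix it.

I added tf.check_numerics() after almost every returned tensor. (Error message below.)
I added tf.add_check_numerics_ops(). Its error messages are just as hard for me to understand as those of tf.check_numerics() :-)
I checked the input data for NaN; the data is fine.
I lowered the learning rate, which only makes the error appear later.
- Initial learning rate: 0.01
- Learning rate decay factor: 0.01
I ran simpleP2P (CUDA sample), and the test passed. (Output below.)
I modified the TensorFlow tutorial code only so that tf.train.string_input_producer() gets 1 file instead of 5: filename_queue = tf.train.string_input_producer(['/path/traindata.bin']), and set the number of training examples to 5,000.
I kept the variables on gpu:0 instead of cpu:0 (as described in the TensorFlow tutorial) and trained only on gpu:1-7. That was too slow to be an option, so I aborted it. (Maybe I also did something wrong.)
I trained with only 4 GPUs (0-3 or 4-7), but the NaN error still appeared, just later.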
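
For the input-data check, one thing that can be verified without TensorFlow is the binary file itself. The sketch below assumes the file follows the standard CIFAR-10 binary layout (one label byte followed by 3,072 pixel bytes per record). The function name is hypothetical and the path in the usage comment is the placeholder from the list above. Raw bytes cannot hold NaN, so the sketch only looks for a broken record layout or out-of-range labels.

```python
RECORD_BYTES = 1 + 32 * 32 * 3  # 1 label byte + 3072 pixel bytes


def check_cifar10_records(data, num_classes=10):
    """Return (record_count, indices of records with an invalid label)."""
    if len(data) % RECORD_BYTES != 0:
        raise ValueError("size is not a multiple of %d bytes" % RECORD_BYTES)
    count = len(data) // RECORD_BYTES
    # The first byte of every record is the class label.
    bad = [i for i in range(count) if data[i * RECORD_BYTES] >= num_classes]
    return count, bad


# Usage (placeholder path):
# with open("/path/traindata.bin", "rb") as f:
#     print(check_cifar10_records(f.read()))
```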

System

Linux kernel: 4.4.72-18.12-default x86_64
8x NVIDIA Tesla P100-PCIE-16GB
CUDA 8.0 - V8.0.61
TensorFlow 1.4.1
Python3

Some code and error messages

tf.check_numerics() error message:

2018..: E tensorflow/core/kernels/check_numerics_op.cc:157] abnormal_detected_host @0x12e49401900 = {0, 1} NaN: cnn()conv2
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: E tensorflow/core/kernels/check_numerics_op.cc:157] abnormal_detected_host @0x12e49401a00 = {0, 1} NaN: cnn()conv2
2018..: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: NaN: cnn()conv2 : Tensor had Inf values
     [[Node: tower_5/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:5"](tower_5/conv2/conv2)]]
2018-01-19 17:31:30.439453: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: NaN: cnn()conv2 : Tensor had Inf values
     [[Node: tower_5/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:5"](tower_5/conv2/conv2)]]
Traceback (most recent call last):
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: cnn()conv2 : Tensor had Inf values
     [[Node: tower_7/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:7"](tower_7/conv2/conv2)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cnn_base.py", line 1703, in <module>
    training()
  File "cnn_base.py", line 1314, in training
    _, loss_value = sess.run([train_op, loss])
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: cnn()conv2 : Tensor had Inf values
     [[Node: tower_7/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:7"](tower_7/conv2/conv2)]]

Caused by op 'tower_7/conv2/CheckNumerics_3', defined at:
  File "cnn_base.py", line 1703, in <module>
    training()
  File "cnn_base.py", line 1228, in training
    loss = tower_loss(scope, image_batch, label_batch)
  File "cnn_base.py", line 1110, in tower_loss
    logits = cnn(images)
  File "cnn_base.py", line 1018, in cnn
    conv2 = tf.check_numerics(conv2, 'NaN: cnn()conv2')
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 569, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): NaN: cnn()conv2 : Tensor had Inf values
     [[Node: tower_7/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:7"](tower_7/conv2/conv2)]]

Sometimes the NaN error comes from pool1, norm1, conv2, ... or local3, but never from x. And it is not always the same GPU.

def cnn(x):
    #### NaN detect:
    if DEBUG_NAN:
        x = tf.check_numerics(x, 'NaN: cnn(x)')

    #conv1
    with tf.variable_scope('conv1') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64], stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(x, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv1 = tf.nn.relu(pre_activation, name=scope.name)
        #### NaN detect:
        if DEBUG_NAN:
            conv1 = tf.check_numerics(conv1, 'NaN: cnn()conv1')

    # pool1
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME', name='pool1')
    #### NaN detect:
    if DEBUG_NAN:
        pool1 = tf.check_numerics(pool1, 'NaN: cnn()pool1')

    # norm1
    norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm1')
    #### NaN detect:
    if DEBUG_NAN:
        norm1 = tf.check_numerics(norm1, 'NaN: cnn()norm1')

    # conv2
    with tf.variable_scope('conv2') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64], stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(pre_activation, name=scope.name)
        #### NaN detect:
        if DEBUG_NAN:
            conv2 = tf.check_numerics(conv2, 'NaN: cnn()conv2')
    ...
    #norm2
    ...
    #pool2
    ...
    #local3
    ...
    #local4
    ...
    #linear layer
    ...
    return softmax_linear
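
Because the failing layer and GPU change from run to run, it can help to tally them across several saved logs. The following sketch is a hypothetical helper that extracts the check message and device from CheckNumerics node lines in the format shown in the traceback above. It counts matching lines, not crashes, since one crash prints the same node several times.

```python
import re
from collections import Counter

# Matches e.g. CheckNumerics[T=DT_FLOAT, message="...", _device=".../device:GPU:5"]
NODE_RE = re.compile(
    r'CheckNumerics\[T=\w+, message="([^"]*)", _device="[^"]*/device:(\w+:\d+)"'
)


def tally_failures(log_text):
    """Count (message, device) pairs found in CheckNumerics node lines."""
    return Counter(NODE_RE.findall(log_text))


sample = (
    '[[Node: tower_5/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, '
    'message="NaN: cnn()conv2", '
    '_device="/job:localhost/replica:0/task:0/device:GPU:5"](tower_5/conv2/conv2)]]'
)
print(tally_failures(sample))
```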

Output of simpleP2P:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8
> GPU0 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU1 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU2 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU3 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU4 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU5 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU6 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)
> GPU7 = "Tesla P100-PCIE-16GB" IS  capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU1) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU2) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU3) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU0) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU2) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU3) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU0) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU1) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU3) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU0) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU1) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU2) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU5) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU6) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU7) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU4) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU6) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU7) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU4) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU5) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU7) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU4) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU5) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU6) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla P100-PCIE-16GB (GPU0) supports UVA: Yes
> Tesla P100-PCIE-16GB (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 12.16GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
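
The peer-access table above is easier to read once it is reduced to groups. The sketch below is a hypothetical helper that parses the "Peer access from ... : Yes/No" lines and groups GPUs that report the same set of peers. It assumes peer access is symmetric and transitive, which holds for the output shown here, where GPUs 0-3 and GPUs 4-7 form two separate groups.

```python
import re

PEER_RE = re.compile(r"\(GPU(\d+)\) -> .*\(GPU(\d+)\) : (Yes|No)")


def peer_groups(output):
    """Group GPU indices that report identical peer-access sets."""
    peers = {}
    for src, dst, answer in PEER_RE.findall(output):
        group = peers.setdefault(int(src), {int(src)})
        if answer == "Yes":
            group.add(int(dst))
    # GPUs with the same peer set (including themselves) form one group.
    return sorted({tuple(sorted(group)) for group in peers.values()})


sample = "\n".join([
    "> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU1) : Yes",
    "> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU0) : Yes",
])
print(peer_groups(sample))
```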

Edit

I forgot the error message from tf.add_check_numerics_ops():

Traceback (most recent call last):
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
    return fn(*args)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
    status, run_metadata)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
    c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: average_gradients(expanded_g) : Tensor had Inf and NaN values
         [[Node: CheckNumerics_30 = CheckNumerics[T=DT_FLOAT, message="NaN: average_gradients(expanded_g)", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims_30)]]
         [[Node: tower_6/total_loss/_2216 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:6", send_device_incarnation=1, tensor_name="edge_4923_tower_6/total_loss", _device="/job:localhost/replica:0/task:0/device:GPU:6"](tower_6/total_loss)]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "cnn_base.py", line 1704, in <module>
    training()
  File "cnn_base.py", line 1312, in training
    nan_debug, _, loss_value = sess.run([check_op, train_op, loss])
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
    run_metadata_ptr)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
    options, run_metadata)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: average_gradients(expanded_g) : Tensor had Inf and NaN values
         [[Node: CheckNumerics_30 = CheckNumerics[T=DT_FLOAT, message="NaN: average_gradients(expanded_g)", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims_30)]]
         [[Node: tower_6/total_loss/_2216 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:6", send_device_incarnation=1, tensor_name="edge_4923_tower_6/total_loss", _device="/job:localhost/replica:0/task:0/device:GPU:6"](tower_6/total_loss)]]


Caused by op 'CheckNumerics_30', defined at:
  File "cnn_base.py", line 1704, in <module>
    training()
  File "cnn_base.py", line 1241, in training
    grads = average_gradients(tower_grads)
  File "cnn_base.py", line 1142, in average_gradients
    expanded_g = tf.check_numerics(expanded_g, 'NaN: average_gradients(expanded_g)')
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 569, in check_numerics
    "CheckNumerics", tensor=tensor, message=message, name=name)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
    op_def=op_def)
  File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InvalidArgumentError (see above for traceback): NaN: average_gradients(expanded_g) : Tensor had Inf and NaN values
         [[Node: CheckNumerics_30 = CheckNumerics[T=DT_FLOAT, message="NaN: average_gradients(expanded_g)", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims_30)]]
         [[Node: tower_6/total_loss/_2216 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:6", send_device_incarnation=1, tensor_name="edge_4923_tower_6/total_loss", _device="/job:localhost/replica:0/task:0/device:GPU:6"](tower_6/total_loss)]]

Edit 2

Evolution of the loss value before the NaN error:

| Step: 71001 | Loss: 0.408
| Step: 71002 | Loss: 0.334
| Step: 71003 | Loss: 0.366
| Step: 71004 | Loss: 0.535
| Step: 71005 | Loss: 0.580
| Step: 71006 | Loss: 0.665
| Step: 71007 | Loss: 0.973
| Step: 71008 | Loss: 1.532
| Step: 71009 | Loss: 1.926
| Step: 71010 | Loss: 3.996
| Step: 71011 | Loss: 3.897
| Step: 71012 | Loss: 48.157
| Step: 71013 | Loss: 116.674
| Step: 71014 | Loss: 81.629
| Step: 71015 | Loss: 605.457
| Step: 71016 | Loss: 5922.730
| Step: 71017 | Loss: 44706.512
| Step: 71018 | Loss: 153461.141
| Step: 71019 | Loss: 3288852.750
| Step: 71020 | Loss: 100990616.000
| Step: 71021 | Loss: 191808240.000
| Step: 71022 | Loss: 198109808.000
| Step: 71023 | Loss: 644734183800832.000
| Step: 71024 | Loss: 10551573931360256.000
| Step: 71025 | Loss: 14357759286057107456.000
| Step: 71026 | Loss: 4102828570323191104191619661824.000
| Step: 71027 | Loss: nan
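
The log shows the loss climbing for more than twenty steps before it becomes NaN, so the run could be stopped (and a checkpoint inspected) well before that point. A minimal sketch of such a check follows. The function name and the thresholds (`factor`, `window`) are arbitrary illustrative choices, and the check assumes positive loss values.

```python
import math


def first_divergent_step(losses, factor=10.0, window=10):
    """Index of the first loss that is non-finite or exceeds `factor`
    times the smallest loss among the preceding `window` values."""
    for i, loss in enumerate(losses):
        if not math.isfinite(loss):
            return i
        recent = losses[max(0, i - window):i]
        if recent and loss > factor * min(recent):
            return i
    return None


# First values from the log above, starting at step 71001.
losses = [0.408, 0.334, 0.366, 0.535, 0.580, 0.665, 0.973,
          1.532, 1.926, 3.996, 3.897, 48.157, 116.674]
print(first_divergent_step(losses))  # 9, i.e. step 71010
```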

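【Comments】:

When you said Tesla P100, I just assumed you meant the car. Probably not the same thing.
Oh, you're right :) I'll change the question. Thanks!

Tags: python python-3.x tensorflow

【Solution 1】:

I found the cause of the NaN error. In hindsight, I have to say it was right under my nose the whole time.

Short version:

I was using tf.train.GradientDescentOptimizer() together with tf.train.exponential_decay() for optimization. Changing it to tf.train.AdamOptimizer() solved my problem.

Long version:

So it was not the GPU cluster but the optimization algorithm. I did not notice this right away, because when I used only one GPU on the cluster the total loss value did not become infinite, whereas with multiple GPUs the loss values add up and then reach the infinite range. Only when I ran the script for a very long time on my local machine (with an NVIDIA GTX 770) did I get the NaN error there too. At that point I knew it had nothing to do with the NVIDIA Tesla P100. A GitHub issue led me to take a closer look at tf.train.GradientDescentOptimizer(). Switching the optimizer now seems to have solved my problem.

The TensorFlow tutorial Convolutional Neural Networks uses tf.train.GradientDescentOptimizer(). I have now changed the code from: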

lr = tf.train.exponential_decay(get_initial_learning_rate(), 
        global_step, 
        decay_steps, 
        get_learning_rate_decay_factor(), 
        staircase=True)

opt = tf.train.GradientDescentOptimizer(lr)

to:

opt = tf.train.AdamOptimizer(
        get_initial_learning_rate(),  # 0.001
        beta1=0.9,
        beta2=0.999,
        epsilon=1e-08,
        use_locking=False)

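【Comments】:

Why would multi-GPU training with SGD lead to NaN, but not with Adam? And if the summed loss becomes infinite, what does that have to do with SGD?
The NaN values are the result of a division by zero. Adam prevents this by adding a very small value to the denominator.
The zero comes from the sum of all loss values, which at some point can no longer be stored and becomes zero.
I am facing the same problem with dice loss. By the way, I don't quite understand what "the zero comes from the sum of all loss values, which at some point can no longer be stored and becomes zero" means?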
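
To show what the "very small value in the denominator" refers to, here is a scalar sketch of one plain gradient-descent update next to the first Adam update, using the textbook Adam formulation. TensorFlow's implementation may place epsilon slightly differently, and the function names are hypothetical. The sketch does not confirm the explanation given in the comments. It only shows two properties of the update rules: the SGD step grows with the gradient, while the Adam step stays close to the learning rate and remains defined when the gradient is zero.

```python
import math


def sgd_step(grad, lr):
    """Plain gradient descent: the update scales linearly with the gradient."""
    return -lr * grad


def adam_first_step(grad, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8):
    """First Adam update for one scalar parameter, starting from m = v = 0."""
    m = (1 - beta1) * grad
    v = (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1)  # bias correction at t = 1
    v_hat = v / (1 - beta2)
    # Without epsilon, a zero gradient would give 0 / 0 here.
    return -lr * m_hat / (math.sqrt(v_hat) + epsilon)


for g in (0.0, 1.0, 1e6):
    print(g, sgd_step(g, lr=0.1), adam_first_step(g))
```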