【Posted】: 2018-06-29 02:03:41
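【Question】:
Baseline
I am training on a GPU cluster with 8 NVIDIA Tesla P100 GPUs. The script is based on the TensorFlow tutorial Convolutional Neural Networks. As training data, I created binary files based on the CIFAR-10 data set, each containing only 5,000 of the 50,000 images. I use only one of these files per training run.
Here is some basic information about the training: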
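- Epochs: 100,000
- Batch size: 128
- Number of training examples: 5,000
- Initial learning rate: 0.1
- Learning rate decay factor: 0.1
- Number of epochs per decay: 350.0
- No exponential moving average
- ...
- If you need more information, please leave a comment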
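To make the schedule implied by these settings concrete, here is a minimal sketch of a staircase exponential decay in plain Python. It assumes the script keeps the CIFAR-10 tutorial's formula unchanged (decay steps = batches per epoch times epochs per decay, with the batch size of 128 used as-is); the function and constant names are only illustrative.

```python
# Sketch of a staircase learning-rate schedule in the style of the
# CIFAR-10 tutorial. Assumption: the tutorial's formula is unchanged.

NUM_EXAMPLES_PER_EPOCH = 5000      # training examples in the single file
BATCH_SIZE = 128
NUM_EPOCHS_PER_DECAY = 350.0
INITIAL_LEARNING_RATE = 0.1
LEARNING_RATE_DECAY_FACTOR = 0.1


def decay_steps():
    """Number of training steps between two learning-rate drops."""
    batches_per_epoch = NUM_EXAMPLES_PER_EPOCH / BATCH_SIZE
    return int(batches_per_epoch * NUM_EPOCHS_PER_DECAY)


def learning_rate(step):
    """Staircase decay: the rate is multiplied by the factor every decay_steps()."""
    num_decays = step // decay_steps()
    return INITIAL_LEARNING_RATE * LEARNING_RATE_DECAY_FACTOR ** num_decays


if __name__ == "__main__":
    print("decay steps:", decay_steps())
    for step in (0, 20000, 71000):
        print(step, learning_rate(step))
```

Under these assumptions the rate drops every 13,671 steps, so by step 71,000 it would already have been reduced five times.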
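The problem is that when I train on multiple GPUs, a NaN error occurs sooner or later. The loss explodes within a few epochs from ~0.4 to values above ~1e+26, presumably all the way to infinity, and then the NaN error appears.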
What I have done so far
So far I have tried the following to locate the source of the NaN error so that I can fix it.
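- I added tf.check_numerics() after almost every tensor that is returned. (Error message below.)
- I added tf.add_check_numerics_ops(). Its error messages are just as hard for me to understand as those from tf.check_numerics() :-)
- I checked the input data for NaNs; the data is fine.
- I lowered the learning rate, which only makes the error appear later.
  - Initial learning rate: 0.01
  - Learning rate decay factor: 0.01
- Ran simpleP2P (CUDA sample), and the test passed. (Output below.)
- Modified the TensorFlow tutorial code only so that tf.train.string_input_producer() receives 1 file instead of 5: filename_queue = tf.train.string_input_producer([/path/traindata.bin]), and set the number of training examples to 5,000.
- Stored the variables on gpu:0 instead of cpu:0 (see the TensorFlow tutorial here and here) and trained on gpu:1-7 only. This was too slow to be an option, so I aborted it. (Maybe I also did something wrong.)
- Trained with only 4 GPUs (0-3 or 4-7), but the NaN error still appeared later.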
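For the input-data check mentioned in the list above, here is a minimal sketch of a structural sanity check for a CIFAR-10 style binary file. It assumes the custom file keeps the standard CIFAR-10 binary layout (1 label byte followed by 3,072 pixel bytes per record); the function name is hypothetical. Because labels and pixels are stored as bytes, the raw file cannot itself contain NaN, so the sketch only looks at record count and label range.

```python
# Sketch: validate raw bytes of a CIFAR-10 style binary file.
# Assumption: standard layout of 1 label byte + 32*32*3 pixel bytes per record.

LABEL_BYTES = 1
IMAGE_BYTES = 32 * 32 * 3
RECORD_BYTES = LABEL_BYTES + IMAGE_BYTES
NUM_CLASSES = 10


def check_cifar10_records(data):
    """Return (num_records, indices of records whose label is out of range)."""
    if len(data) % RECORD_BYTES != 0:
        raise ValueError(
            "size %d is not a multiple of the record size %d"
            % (len(data), RECORD_BYTES)
        )
    num_records = len(data) // RECORD_BYTES
    bad_labels = [
        i for i in range(num_records)
        if data[i * RECORD_BYTES] >= NUM_CLASSES  # first byte of each record
    ]
    return num_records, bad_labels
```

Reading the training file in binary mode and passing its bytes to this function should report 5,000 records and no bad labels for a file like the one described above.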
System
- Linux kernel: 4.4.72-18.12-default x86_64
- 8x NVIDIA Tesla P100-PCIE-16GB
- CUDA 8.0 - V8.0.61
- TensorFlow 1.4.1
- Python 3
Some code and error messages
tf.check_numerics() error message:
2018..: E tensorflow/core/kernels/check_numerics_op.cc:157] abnormal_detected_host @0x12e49401900 = {0, 1} NaN: cnn()conv2
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: W tensorflow/core/kernels/queue_base.cc:303] _3_prefetch_queue/fifo_queue: Skipping cancelled dequeue attempt with queue not closed
2018..: E tensorflow/core/kernels/check_numerics_op.cc:157] abnormal_detected_host @0x12e49401a00 = {0, 1} NaN: cnn()conv2
2018..: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: NaN: cnn()conv2 : Tensor had Inf values
[[Node: tower_5/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:5"](tower_5/conv2/conv2)]]
2018-01-19 17:31:30.439453: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: NaN: cnn()conv2 : Tensor had Inf values
[[Node: tower_5/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:5"](tower_5/conv2/conv2)]]
Traceback (most recent call last):
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: cnn()conv2 : Tensor had Inf values
[[Node: tower_7/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:7"](tower_7/conv2/conv2)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cnn_base.py", line 1703, in <module>
training()
File "cnn_base.py", line 1314, in training
_, loss_value = sess.run([train_op, loss])
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: cnn()conv2 : Tensor had Inf values
[[Node: tower_7/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:7"](tower_7/conv2/conv2)]]
Caused by op 'tower_7/conv2/CheckNumerics_3', defined at:
File "cnn_base.py", line 1703, in <module>
training()
File "cnn_base.py", line 1228, in training
loss = tower_loss(scope, image_batch, label_batch)
File "cnn_base.py", line 1110, in tower_loss
logits = cnn(images)
File "cnn_base.py", line 1018, in cnn
conv2 = tf.check_numerics(conv2, 'NaN: cnn()conv2')
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 569, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): NaN: cnn()conv2 : Tensor had Inf values
[[Node: tower_7/conv2/CheckNumerics_3 = CheckNumerics[T=DT_FLOAT, message="NaN: cnn()conv2", _device="/job:localhost/replica:0/task:0/device:GPU:7"](tower_7/conv2/conv2)]]
Sometimes the NaN error comes from pool1, norm1, conv2, ... or local3, but never from x. And it is not always the same GPU.
def cnn(x):
    #### NaN detect:
    if DEBUG_NAN:
        x = tf.check_numerics(x, 'NaN: cnn(x)')
    # conv1
    with tf.variable_scope('conv1') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 3, 64], stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(x, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.0))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv1 = tf.nn.relu(pre_activation, name=scope.name)
    #### NaN detect:
    if DEBUG_NAN:
        conv1 = tf.check_numerics(conv1, 'NaN: cnn()conv1')
    # pool1
    pool1 = tf.nn.max_pool(conv1, ksize=[1, 3, 3, 1], strides=[1, 2, 2, 1], padding='SAME', name='pool1')
    #### NaN detect:
    if DEBUG_NAN:
        pool1 = tf.check_numerics(pool1, 'NaN: cnn()pool1')
    # norm1
    norm1 = tf.nn.lrn(pool1, 4, bias=1.0, alpha=0.001 / 9.0, beta=0.75, name='norm1')
    #### NaN detect:
    if DEBUG_NAN:
        norm1 = tf.check_numerics(norm1, 'NaN: cnn()norm1')
    # conv2
    with tf.variable_scope('conv2') as scope:
        kernel = _variable_with_weight_decay('weights', shape=[5, 5, 64, 64], stddev=5e-2, wd=0.0)
        conv = tf.nn.conv2d(norm1, kernel, [1, 1, 1, 1], padding='SAME')
        biases = _variable_on_cpu('biases', [64], tf.constant_initializer(0.1))
        pre_activation = tf.nn.bias_add(conv, biases)
        conv2 = tf.nn.relu(pre_activation, name=scope.name)
    #### NaN detect:
    if DEBUG_NAN:
        conv2 = tf.check_numerics(conv2, 'NaN: cnn()conv2')
    ...
    # norm2
    ...
    # pool2
    ...
    # local3
    ...
    # local4
    ...
    # linear layer
    ...
    return softmax_linear
Output of simpleP2P:
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 8
> GPU0 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU1 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU2 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU3 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU4 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU5 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU6 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
> GPU7 = "Tesla P100-PCIE-16GB" IS capable of Peer-to-Peer (P2P)
Checking GPU(s) for support of peer to peer memory access...
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU1) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU2) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU3) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU0) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU0) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU2) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU3) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU1) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU0) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU1) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU3) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU2) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU0) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU1) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU2) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU4) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU5) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU6) : No
> Peer access from Tesla P100-PCIE-16GB (GPU3) -> Tesla P100-PCIE-16GB (GPU7) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU5) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU6) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU4) -> Tesla P100-PCIE-16GB (GPU7) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU4) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU6) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU5) -> Tesla P100-PCIE-16GB (GPU7) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU4) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU5) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU6) -> Tesla P100-PCIE-16GB (GPU7) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU0) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU1) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU2) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU3) : No
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU4) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU5) : Yes
> Peer access from Tesla P100-PCIE-16GB (GPU7) -> Tesla P100-PCIE-16GB (GPU6) : Yes
Enabling peer access between GPU0 and GPU1...
Checking GPU0 and GPU1 for UVA capabilities...
> Tesla P100-PCIE-16GB (GPU0) supports UVA: Yes
> Tesla P100-PCIE-16GB (GPU1) supports UVA: Yes
Both GPUs can support UVA, enabling...
Allocating buffers (64MB on GPU0, GPU1 and CPU Host)...
Creating event handles...
cudaMemcpyPeer / cudaMemcpy between GPU0 and GPU1: 12.16GB/s
Preparing host buffer and memcpy to GPU0...
Run kernel on GPU1, taking source data from GPU0 and writing to GPU1...
Run kernel on GPU0, taking source data from GPU1 and writing to GPU0...
Copy data back to host from GPU0 and verify results...
Disabling peer access...
Shutting down...
Test passed
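The peer-access matrix above is long, so a short sketch for summarising it may help. It parses the "Peer access from ... : Yes/No" lines and groups GPUs that can reach each other; the function name is hypothetical, and the regular expression assumes the exact line format printed above.

```python
import re

# Matches e.g. "> Peer access from X (GPU0) -> X (GPU1) : Yes"
PEER_LINE = re.compile(r"\(GPU(\d+)\)\s*->.*\(GPU(\d+)\)\s*:\s*(Yes|No)")


def peer_groups(log_text):
    """Group GPU ids into sets that are connected through peer access."""
    neighbours = {}
    for match in PEER_LINE.finditer(log_text):
        src, dst = int(match.group(1)), int(match.group(2))
        neighbours.setdefault(src, set())
        neighbours.setdefault(dst, set())
        if match.group(3) == "Yes":
            neighbours[src].add(dst)
            neighbours[dst].add(src)

    groups, seen = [], set()
    for gpu in sorted(neighbours):
        if gpu in seen:
            continue
        # Depth-first walk over the "Yes" links starting at this GPU.
        stack, group = [gpu], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups
```

Applied to the full output above, this should yield two groups, GPUs 0-3 and GPUs 4-7, which matches the two 4-GPU subsets tried earlier.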
Edit
I forgot the error message from tf.add_check_numerics_ops():
Traceback (most recent call last):
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1323, in _do_call
return fn(*args)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1302, in _run_fn
status, run_metadata)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/errors_impl.py", line 473, in __exit__
c_api.TF_GetCode(self.status.status))
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: average_gradients(expanded_g) : Tensor had Inf and NaN values
[[Node: CheckNumerics_30 = CheckNumerics[T=DT_FLOAT, message="NaN: average_gradients(expanded_g)", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims_30)]]
[[Node: tower_6/total_loss/_2216 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:6", send_device_incarnation=1, tensor_name="edge_4923_tower_6/total_loss", _device="/job:localhost/replica:0/task:0/device:GPU:6"](tower_6/total_loss)]]
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "cnn_base.py", line 1704, in <module>
training()
File "cnn_base.py", line 1312, in training
nan_debug, _, loss_value = sess.run([check_op, train_op, loss])
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 889, in run
run_metadata_ptr)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1120, in _run
feed_dict_tensor, options, run_metadata)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1317, in _do_run
options, run_metadata)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/client/session.py", line 1336, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: NaN: average_gradients(expanded_g) : Tensor had Inf and NaN values
[[Node: CheckNumerics_30 = CheckNumerics[T=DT_FLOAT, message="NaN: average_gradients(expanded_g)", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims_30)]]
[[Node: tower_6/total_loss/_2216 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:6", send_device_incarnation=1, tensor_name="edge_4923_tower_6/total_loss", _device="/job:localhost/replica:0/task:0/device:GPU:6"](tower_6/total_loss)]]
Caused by op 'CheckNumerics_30', defined at:
File "cnn_base.py", line 1704, in <module>
training()
File "cnn_base.py", line 1241, in training
grads = average_gradients(tower_grads)
File "cnn_base.py", line 1142, in average_gradients
expanded_g = tf.check_numerics(expanded_g, 'NaN: average_gradients(expanded_g)')
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/ops/gen_array_ops.py", line 569, in check_numerics
"CheckNumerics", tensor=tensor, message=message, name=name)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/home/freundlicher/tfEnv/lib/python3.4/site-packages/tensorflow/python/framework/ops.py", line 1470, in __init__
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): NaN: average_gradients(expanded_g) : Tensor had Inf and NaN values
[[Node: CheckNumerics_30 = CheckNumerics[T=DT_FLOAT, message="NaN: average_gradients(expanded_g)", _device="/job:localhost/replica:0/task:0/device:CPU:0"](ExpandDims_30)]]
[[Node: tower_6/total_loss/_2216 = _Send[T=DT_FLOAT, client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:6", send_device_incarnation=1, tensor_name="edge_4923_tower_6/total_loss", _device="/job:localhost/replica:0/task:0/device:GPU:6"](tower_6/total_loss)]]
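The trace above reports Inf and NaN values inside average_gradients. As a plain-Python illustration (not the actual TensorFlow code), the sketch below shows how a single non-finite per-tower value contaminates the average over all towers, and how opposite infinities turn into NaN. The function names and sample values are made up for illustration.

```python
import math


def average_across_towers(tower_values):
    """Plain arithmetic mean of one gradient entry over all towers."""
    return sum(tower_values) / len(tower_values)


def non_finite_towers(tower_values):
    """Indices of towers whose value is Inf or NaN."""
    return [i for i, v in enumerate(tower_values) if not math.isfinite(v)]


if __name__ == "__main__":
    # One tower overflowed: the average becomes Inf for every tower's update.
    one_inf = [0.02, -0.01, float("inf"), 0.03]
    print(average_across_towers(one_inf), non_finite_towers(one_inf))

    # Two towers overflowed with opposite signs: Inf + (-Inf) gives NaN.
    mixed = [float("inf"), float("-inf"), 0.1]
    print(average_across_towers(mixed), non_finite_towers(mixed))
```

This is one possible reading of why the check on the averaged gradients fires on the CPU even though the overflow would originate on a single tower; it does not identify which tower or op overflowed first.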
Edit 2
Evolution of the loss value before the NaN error:
| Step: 71001 | Loss: 0.408
| Step: 71002 | Loss: 0.334
| Step: 71003 | Loss: 0.366
| Step: 71004 | Loss: 0.535
| Step: 71005 | Loss: 0.580
| Step: 71006 | Loss: 0.665
| Step: 71007 | Loss: 0.973
| Step: 71008 | Loss: 1.532
| Step: 71009 | Loss: 1.926
| Step: 71010 | Loss: 3.996
| Step: 71011 | Loss: 3.897
| Step: 71012 | Loss: 48.157
| Step: 71013 | Loss: 116.674
| Step: 71014 | Loss: 81.629
| Step: 71015 | Loss: 605.457
| Step: 71016 | Loss: 5922.730
| Step: 71017 | Loss: 44706.512
| Step: 71018 | Loss: 153461.141
| Step: 71019 | Loss: 3288852.750
| Step: 71020 | Loss: 100990616.000
| Step: 71021 | Loss: 191808240.000
| Step: 71022 | Loss: 198109808.000
| Step: 71023 | Loss: 644734183800832.000
| Step: 71024 | Loss: 10551573931360256.000
| Step: 71025 | Loss: 14357759286057107456.000
| Step: 71026 | Loss: 4102828570323191104191619661824.000
| Step: 71027 | Loss: nan
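To pin down where a log like the one above starts to diverge, here is a small sketch that parses the "| Step: N | Loss: x" lines and reports the first step whose loss is non-finite or far above the recent average. The factor of 10 and the window of 5 steps are arbitrary assumptions, and the function names are hypothetical.

```python
import math
import re

# Matches e.g. "| Step: 71001 | Loss: 0.408" (the loss may also be "nan").
LOSS_LINE = re.compile(r"Step:\s*(\d+)\s*\|\s*Loss:\s*(\S+)")


def parse_loss_log(text):
    """Return a list of (step, loss) tuples found in the log text."""
    return [(int(step), float(loss)) for step, loss in LOSS_LINE.findall(text)]


def first_divergent_step(history, factor=10.0, window=5):
    """First step whose loss is non-finite or exceeds `factor` times the
    mean of the previous `window` losses; None if no such step exists."""
    for i in range(window, len(history)):
        step, loss = history[i]
        recent = [value for _, value in history[i - window:i]]
        baseline = sum(recent) / window
        if not math.isfinite(loss) or loss > factor * baseline:
            return step
    return None
```

With these thresholds the log above is flagged at step 71012, although the loss is already rising steadily from around step 71004. Saving a checkpoint shortly before such a step would make it possible to replay the divergence with more checks enabled.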
【Comments】:
- When you say Tesla P100, I just assumed you meant the car. Probably not the same thing.
- Oh, you're right :) I'll change the question. Thanks!
Tags: python python-3.x tensorflow