【Question title】: TensorFlow seq2seq training suddenly super slow
【Posted】: 2017-06-04 04:09:13
【Question description】:

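I am training a seq2seq NMT (neural machine translation) model using the code from https://google.github.io/seq2seq/. After I interrupted the training process, the restarted run became extremely slow (from 1.2 steps/sec down to 0.07 steps/sec). Has anyone else experienced this? How can I debug it? I have been running this for several weeks and really don't want to give up on it... Thanks!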

The last lines of the normal training run:

INFO:tensorflow:loss = 0.585205, step = 830853 (79.477 sec)
INFO:tensorflow:global_step/sec: 1.24179
INFO:tensorflow:loss = 0.267574, step = 830953 (80.529 sec)

The first lines of the super-slow training run:

INFO:tensorflow:global_step/sec: 0.0746058
INFO:tensorflow:loss = 0.554718, step = 830854 (1340.379 sec)
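
One way to quantify a drop like this is to pull the `global_step/sec` values out of the training log and compare them. The following is a minimal sketch, assuming the log format shown above; the helper name `steps_per_sec` and the two sample strings are illustrative only and are not part of seq2seq or TensorFlow.

```python
import re

# Matches lines such as "INFO:tensorflow:global_step/sec: 1.24179".
RATE_RE = re.compile(r"global_step/sec:\s*([0-9.]+)")

def steps_per_sec(log_text):
    """Return every global_step/sec value found in the log, in order."""
    return [float(m.group(1)) for m in RATE_RE.finditer(log_text)]

# Sample lines copied from the two runs above.
before = "INFO:tensorflow:global_step/sec: 1.24179"
after = "INFO:tensorflow:global_step/sec: 0.0746058"

# Ratio of the old throughput to the new one.
slowdown = steps_per_sec(before)[0] / steps_per_sec(after)[0]
print("slowdown factor: %.1fx" % slowdown)
```

A slowdown of more than an order of magnitude, as here, is the kind of gap one might expect if training has silently moved from the GPU to the CPU, so the startup log of the restarted run is a good place to look next.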

【Comments】:

    Tags: tensorflow recurrent-neural-network


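    【Answer 1】:

    Ah, after looking at the log more carefully, it seems a recent CUDA (NVIDIA driver) update caused the problem: the loaded kernel module (375.51) no longer matched libcuda (375.66), so TensorFlow could not find a working GPU device. A later automatic update resolved it. Here is the log: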

    2017-06-02 22:47:22.684229: E tensorflow/stream_executor/cuda/cuda_driver.cc:405] failed call to cuInit: CUDA_ERROR_NO_DEVICE
    2017-06-02 22:47:22.684260: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:158] retrieving CUDA diagnostic information for host: POSingularity
    2017-06-02 22:47:22.684276: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:165] hostname: POSingularity
    2017-06-02 22:47:22.684305: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:189] libcuda reported version is: 375.66.0
    2017-06-02 22:47:22.684331: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:369] driver version file contents: """NVRM version: NVIDIA UNIX x86_64 Kernel Module  375.51  Wed Mar 22 10:26:12 PDT 2017
    GCC version:  gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) 
    """
    2017-06-02 22:47:22.684352: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:193] kernel reported version is: 375.51.0
    2017-06-02 22:47:22.684368: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] kernel version 375.51.0 does not match DSO version 375.66.0 -- cannot find working devices in this configuration
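
The mismatch reported in the last log line can also be detected programmatically, for example in a script that checks the startup log before a long run continues. The sketch below assumes the exact message wording shown in the log above; other TensorFlow versions may phrase it differently. The helper name `driver_mismatch` is hypothetical and is not a TensorFlow API.

```python
import re

# Pattern based on the message in the log above (an assumption about wording).
MISMATCH_RE = re.compile(
    r"kernel version (\S+) does not match DSO version (\S+)"
)

def driver_mismatch(log_text):
    """Return (kernel_version, dso_version) if the log reports a mismatch, else None."""
    m = MISMATCH_RE.search(log_text)
    return (m.group(1), m.group(2)) if m else None

# Sample line copied from the log above (timestamp omitted).
sample = (
    "E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:303] "
    "kernel version 375.51.0 does not match DSO version 375.66.0 "
    "-- cannot find working devices in this configuration"
)

result = driver_mismatch(sample)
if result:
    print("kernel module %s != libcuda %s" % result)
```

When the two versions differ, the user-space library has typically been upgraded while the older kernel module is still loaded; a reboot or a driver reinstall usually brings them back in line.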
    

    【Comments】:
