【Question title】: Why can't I run a tensorflow session on CPU while one GPU device's memory is all allocated?
【Posted】: 2018-11-05 15:19:47
【Question description】:

On the tensorflow website (https://www.tensorflow.org/guide/using_gpu) I found the following code for manually specifying the CPU instead of a GPU:

import tensorflow as tf

# Creates a graph.
with tf.device('/cpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

I tried running it on my machine (which has 4 GPUs) and got the following error:

2018-11-05 10:02:30.636733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 0 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:18:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-05 10:02:30.863280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1392] Found device 1 with properties: 
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:3b:00.0
totalMemory: 10.92GiB freeMemory: 10.76GiB
2018-11-05 10:02:31.117729: E tensorflow/core/common_runtime/direct_session.cc:158] Internal: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_OUT_OF_MEMORY; total memory reported: 11721506816
Traceback (most recent call last):
  File "./tf_test.py", line 10, in <module>
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
  File ".../anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1566, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File ".../anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 636, in __init__
    self._session = tf_session.TF_NewSession(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: Failed to create session.

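It seems that when I create the session, tensorflow tries to initialize a stream executor on every device. Unfortunately, my colleague is using one of the GPUs right now. I would hope that his full use of one GPU would not prevent me from using another device (GPU or CPU), but that does not appear to be the case.

Does anyone know a workaround? Perhaps something to add to the config? Is this something that could be fixed in tensorflow?

FYI, here is the output of "gpustat -upc":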

<my_hostname>  Mon Nov  5 10:19:47 2018
[0] GeForce GTX 1080 Ti | 36'C,   0 % |    10 / 11178 MB |
[1] GeForce GTX 1080 Ti | 41'C,   0 % |    10 / 11178 MB |
[2] GeForce GTX 1080 Ti | 38'C,   0 % | 11097 / 11178 MB | <my_colleague>:python2/148901(11087M)
[3] GeForce GTX 1080 Ti | 37'C,   0 % |    10 / 11178 MB |
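
The listing above can also be checked programmatically. Below is a minimal sketch, assuming the exact line format shown (index in brackets, then "used / total MB") and assuming the gpustat indices match the GPU indices tensorflow uses. The function name `free_gpu_indices` and the `max_used_fraction` threshold are hypothetical, chosen only for illustration.

```python
import re

# Matches lines such as:
# [2] GeForce GTX 1080 Ti | 38'C,   0 % | 11097 / 11178 MB | user:python2/1(11087M)
_GPU_LINE = re.compile(r"\[(\d+)\].*?\|\s*(\d+)\s*/\s*(\d+)\s*MB")


def free_gpu_indices(gpustat_text, max_used_fraction=0.5):
    """Return indices of GPUs whose used/total memory is below the threshold."""
    free = []
    for line in gpustat_text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match is None:
            continue  # header line or unrecognized format
        index, used, total = (int(group) for group in match.groups())
        if total > 0 and float(used) / total < max_used_fraction:
            free.append(index)
    return free


sample = """\
<my_hostname>  Mon Nov  5 10:19:47 2018
[0] GeForce GTX 1080 Ti | 36'C,   0 % |    10 / 11178 MB |
[1] GeForce GTX 1080 Ti | 41'C,   0 % |    10 / 11178 MB |
[2] GeForce GTX 1080 Ti | 38'C,   0 % | 11097 / 11178 MB | <my_colleague>:python2/148901(11087M)
[3] GeForce GTX 1080 Ti | 37'C,   0 % |    10 / 11178 MB |
"""

print(free_gpu_indices(sample))  # [0, 1, 3]
```

Here GPU 2 is the only one excluded, because 11097 of its 11178 MB are in use.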

【Question comments】:

    Tags: python tensorflow gpgpu


    【Solution 1】:

    OK... so with my colleague's help, I have a workable solution. The key is indeed a change to the config. Specifically, something like this:

    config.gpu_options.visible_device_list = '0'
    

    This ensures that tensorflow only sees GPU 0.

    In fact, I was able to run the following script:

    #!/usr/bin/env python

    import tensorflow as tf

    # '/gpu:2' is the third entry of visible_device_list below.
    with tf.device('/gpu:2'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    config = tf.ConfigProto(log_device_placement=True)
    config.gpu_options.visible_device_list = '0,1,3'
    sess = tf.Session(config=config)
    # Runs the op.
    print(sess.run(c))
    

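    Note that this code actually asks to run on GPU 2 (which, as you may recall, is the one that is full). This is an important point: GPUs are renumbered according to visible_device_list, so in the code above, "with gpu:2" refers to the third GPU in the list ('0,1,3'), which is actually physical GPU 3. This can bite you if you try the following: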
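
The renumbering described above can be modeled in a few lines of plain Python. This is only an illustrative sketch of the mapping, not tensorflow code; the helper name `physical_gpu` is hypothetical.

```python
def physical_gpu(visible_device_list, logical_index):
    """Map the N in '/gpu:N' to the physical GPU id it refers to."""
    physical_ids = [int(item) for item in visible_device_list.split(',') if item.strip()]
    return physical_ids[logical_index]


print(physical_gpu('0,1,3', 2))  # 3
print(physical_gpu('0,1,3', 0))  # 0
```

The logical index is just a position in the list, so the same '/gpu:2' can point at a different physical card whenever the list changes.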

    #!/usr/bin/env python

    import tensorflow as tf

    # Fails: only one GPU is visible, so the only valid GPU device is '/gpu:0'.
    with tf.device('/gpu:1'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)
    # Creates a session with log_device_placement set to True.
    config = tf.ConfigProto(log_device_placement=True)
    config.gpu_options.visible_device_list = '1'
    sess = tf.Session(config=config)
    # Runs the op.
    print(sess.run(c))
    

    The problem is that it looks for the second GPU in the list, but the visible list contains only one GPU. The error you get is:

    InvalidArgumentError (see above for traceback): Cannot assign a device for operation 'a': Operation was explicitly assigned to /device:GPU:1 but available devices are [ /job:localhost/replica:0/task:0/device:CPU:0, /job:localhost/replica:0/task:0/device:GPU:0 ]. Make sure the device specification refers to a valid device.
         [[Node: a = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [2,3] values: [1 2 3]...>, _device="/device:GPU:1"]()]]
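
A mismatch like this can be caught before the session is created. The following is a rough sketch in plain Python, assuming device strings of the form '/gpu:N' or '/device:GPU:N'; the helper `device_is_available` is hypothetical and not part of tensorflow.

```python
import re


def device_is_available(device, visible_device_list):
    """Return True if a '/gpu:N' device string fits inside the visible list."""
    match = re.match(r"^/(?:device:)?gpu:(\d+)$", device, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("not a GPU device string: {!r}".format(device))
    visible_count = len([item for item in visible_device_list.split(',') if item.strip()])
    return int(match.group(1)) < visible_count


print(device_is_available('/gpu:1', '1'))      # False: the failing case above
print(device_is_available('/gpu:0', '1'))      # True
print(device_is_available('/gpu:2', '0,1,3'))  # True
```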
    

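    It still seems strange to me that I have to specify a list of GPUs when I want to run on the CPU. I tried an empty list and it failed, so if all 4 GPUs were in use I would have no workaround. Does anyone have a better idea?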
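
One option that is often used for the all-GPUs-busy case, and that was not tested in this answer, is the CUDA environment variable `CUDA_VISIBLE_DEVICES`. Setting it to an empty string hides every GPU from CUDA for the current process. The sketch below assumes the variable is set before tensorflow is imported and that the TF 1.x API from the question is installed; the helper name `hide_all_gpus` is hypothetical, and whether this avoids the error on this particular machine is an assumption.

```python
import os


def hide_all_gpus():
    """Hide every GPU from CUDA for this process.

    Assumption: this runs before tensorflow (or any other CUDA user)
    is imported, because CUDA reads the variable when it initializes.
    """
    os.environ["CUDA_VISIBLE_DEVICES"] = ""


hide_all_gpus()

try:
    import tensorflow as tf  # imported only after the variable is set
except ImportError:
    tf = None  # lets the sketch run on machines without tensorflow

if tf is not None and hasattr(tf, "Session"):  # TF 1.x API, as in the question
    with tf.device('/cpu:0'):
        a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
        b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
        c = tf.matmul(a, b)
    sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
    print(sess.run(c))
```

The same effect can be had from the shell by prefixing the command with `CUDA_VISIBLE_DEVICES=""`, which avoids touching the script at all.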

    【Comments】:
