Can CPU code exist under "with tf.device(gpu_id):" in the multi-GPU case? Answers

【Question title】: Can CPU code exist under "with tf.device(gpu_id):" in multiple GPU case?
【Posted】: 2018-02-05 06:25:34
【Question description】:

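Hello, I am new to TensorFlow, and I have been assigned a task to modify "Demo.py" in the GitHub project "tf-faster-rcnn" so that it does multi-GPU inference.

This is roughly what I plan to do (assuming I have as many images as GPUs; for simplicity, I will use a queue that is not shown here):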

for id, gpu in gpu_dict.items():
    with tf.device(gpu):
        im_detect(sess, net, images[id])

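The "im_detect" function is provided in the source file (I can call it directly), and it contains some non-GPU code (such as conditionals and data preparation):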

def im_detect(sess, net, im):
  blobs, im_scales = _get_blobs(im)
  assert len(im_scales) == 1, "Only single-image batch implemented"

  im_blob = blobs['data']
  blobs['im_info'] = np.array([im_blob.shape[1], im_blob.shape[2], im_scales[0]], dtype=np.float32)

  _, scores, bbox_pred, rois = net.test_image(sess, blobs['data'], blobs['im_info'])

  boxes = rois[:, 1:5] / im_scales[0]
  scores = np.reshape(scores, [scores.shape[0], -1])
  bbox_pred = np.reshape(bbox_pred, [bbox_pred.shape[0], -1])
  if cfg.TEST.BBOX_REG:
    # Apply bounding-box regression deltas
    box_deltas = bbox_pred
    pred_boxes = bbox_transform_inv(boxes, box_deltas)
    pred_boxes = _clip_boxes(pred_boxes, im.shape)
  else:
    # Simply repeat the boxes, once for each class
    pred_boxes = np.tile(boxes, (1, scores.shape[1]))

  return scores, pred_boxes

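Since I have never worked with GPUs before and I am new to TensorFlow, I would like to ask: is it OK to assign a function call like this to each GPU in TensorFlow?

---------------- Update below ----------------

I know TensorFlow has an "allow_soft_placement" option that assigns such non-GPU code to the CPU. But when there are multiple GPUs, how does a single CPU handle these requests from all of them? Should I create one CPU thread per GPU?

【Comments】: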

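Where are the TensorFlow ops defined in im_detect? You won't be able to return your results directly from im_detect the way you intend. When you set up tf ops, they do not run until you subsequently call session.run (probably after looping over the devices). That said, multiple GPUs and one CPU can work together within one graph. The ops will need unique references, though, so that there is no ambiguity about what runs where. For example, GPU:0 might take its data from a "data_prep_0" tensor on the CPU, while GPU:1 would need to reference "data_prep_1" on the CPU.
@JoshuaR. sess.run(tensors) is called in the "test_image" function inside "im_detect" (the call trace is quite long in this case...). Thanks for the help; it looks like I don't need to worry too much about concurrency here.
If you do it that way, you will not be running ops on both GPUs at the same time. Your loop will wait for each session.run to finish before moving on to the next GPU device. I added a schematic code example to my answer below that may give you something to build on.
@JoshuaR. Thanks for such a detailed reply! github.com/tensorflow/models/blob/master/tutorials/image/… This is a multi-GPU training example for TensorFlow, and I think it follows the same idea as mine (a loop, with only one session)?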
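
To illustrate the point made in the comments about deferred execution, here is a minimal sketch. It assumes the TensorFlow 1.x graph API, imported through tensorflow.compat.v1 so that it also runs on TensorFlow 2.x; the names x and total are made up for the illustration and do not come from tf-faster-rcnn.

```python
# On TensorFlow 2.x the 1.x graph API lives under compat.v1;
# on a recent TensorFlow 1.x, `import tensorflow as tf` is enough.
import tensorflow.compat.v1 as tf

graph = tf.Graph()
with graph.as_default():
    # Building ops only adds nodes to the graph; nothing is computed yet.
    x = tf.placeholder(tf.float32, shape=[None], name="x")
    total = tf.reduce_sum(x)

# At this point `total` is a symbolic tensor, not a number.
is_symbolic = isinstance(total, tf.Tensor)

with tf.Session(graph=graph) as sess:
    # The computation happens here, and this call blocks until it is done.
    value = sess.run(total, feed_dict={x: [1.0, 2.0, 3.0]})
```

Because each session.run call blocks, a Python loop that calls it once per device processes the devices one after another.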

Tags: python tensorflow deep-learning gpu

【Solution 1】:

Yes. From https://www.tensorflow.org/programmers_guide/using_gpu: if an op has no CUDA kernel, the session config's allow_soft_placement parameter allows TensorFlow to fall back to the CPU.

myConf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
sess = tf.Session(config=myConf)

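Sometimes you won't want this, e.g. if you are trying to verify that all the ops you expect are actually running on the GPU.

You can also explicitly assign ops to the CPU by using with tf.device('/cpu:0'): inside a with tf.device('/gpu:0'): block.

I tend to use strict placement and then explicitly assign incompatible ops to the CPU when TensorFlow complains. That way I can be sure that every suitable op is GPU-optimized.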
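
As a rough sketch of the nested device blocks described above (again assuming the TensorFlow 1.x graph API via tensorflow.compat.v1; the op names prep and total are invented for the example), the requested device of each op can be inspected right after graph construction:

```python
# On TensorFlow 2.x the 1.x graph API lives under compat.v1;
# on a recent TensorFlow 1.x, `import tensorflow as tf` is enough.
import tensorflow.compat.v1 as tf

graph = tf.Graph()
with graph.as_default():
    with tf.device('/gpu:0'):
        # The inner block overrides the outer one for ops created inside it.
        with tf.device('/cpu:0'):
            prep = tf.constant([1.0, 2.0, 3.0], name='prep')
        # Back in the outer block: this op is requested on the GPU.
        total = tf.reduce_sum(prep, name='total')

# Requested devices are recorded when the ops are created.
prep_device = prep.device
total_device = total.device

# Soft placement lets the GPU-pinned op fall back to the CPU
# when the requested device is not available.
soft = tf.ConfigProto(allow_soft_placement=True)
with tf.Session(graph=graph, config=soft) as sess:
    value = sess.run(total)
```

With allow_soft_placement left at its default of False, the same session.run would be expected to fail with a placement error on a machine without '/gpu:0', which is the "complaint" that strict placement relies on.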

Update:

Here is some schematic code that should outline how to run computations on the GPUs in parallel.

import tensorflow as tf  # TensorFlow 1.x API

graph = tf.Graph()

with graph.as_default():

    gpus = ['/gpu:0', '/gpu:1']
    results = []
    datasets = []

    for idx, gpu in enumerate(gpus):
        with tf.device(gpu):
            # assign data prep ops to CPU
            # (or use soft placement and leave out the next line).
            with tf.device('/cpu:0'):
                datasets.append(tf.placeholder(tf.float32, name='Features' + str(idx)))

            # Computationally expensive ops get assigned to GPU, but make reference
            # to specific non-GPU ops on CPU.
            results.append(tf.reduce_sum(datasets[idx]))

myConf = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)

with tf.Session(graph=graph, config=myConf) as session:

    # Now, using the graph set up previously, evaluate results
    # using both gpu devices (each of these ops depends on independent
    # cpu ops). The placeholders must be fed; the values are example inputs.
    res0, res1 = session.run([results[0], results[1]],
                             feed_dict={datasets[0]: [1.0, 2.0],
                                        datasets[1]: [3.0, 4.0]})

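【Discussion】:

Hi, thanks for your reply! If I have 2 GPUs and assign work to both of them, there should be 2 tensors flowing at the same time. But I have only one CPU to handle the non-GPU code. Should I assign one CPU thread per GPU, or can TensorFlow handle that for me?
@JiangWenbo, that's an interesting question. Are you running all devices in the same session? I can't see where your session is defined. Are your inference tasks unrelated (i.e. "embarrassingly parallel"), or do you intend for your GPUs to work together on the same inference at the same time (i.e. a single graph distributed across them)?
Yes, I intend to run all devices in the same session, so everything will be under that session. And yes, I am running inference on unrelated test images (one GPU handles one image at a time).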
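
On the question of one thread per GPU, the sketch below contrasts a sequential loop of blocking calls with one worker thread per device. It uses only the Python standard library and is not TensorFlow code: fake_detect is a hypothetical stand-in for a blocking call such as im_detect, and the sleep only simulates a busy device.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_detect(gpu, image):
    """Hypothetical stand-in for a blocking per-GPU inference call."""
    time.sleep(0.1)  # pretend the device is busy
    return (gpu, image.upper())


gpus = ['/gpu:0', '/gpu:1']
images = ['img_a', 'img_b']

# Sequential loop: the second call starts only after the first returns.
sequential = [fake_detect(g, im) for g, im in zip(gpus, images)]

# One worker thread per device: the blocking calls overlap in time.
# pool.map returns results in the order of its inputs.
with ThreadPoolExecutor(max_workers=len(gpus)) as pool:
    threaded = list(pool.map(fake_detect, gpus, images))
```

Both variants produce the same results; only the timing differs. When all per-GPU ops are instead fetched in a single session.run call, as in the answer's schematic code, TensorFlow schedules the independent ops itself, so explicit Python threads should not be needed in that setup.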