TensorFlow distributed: passing devices (answer)

【Question title】: TensorFlow distributed passing devices
【Posted】: 2016-03-06 08:32:41
【Question】:

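I recently installed the version of TensorFlow for distributed processing. Following the trend, I tried to use multiple GPUs across multiple computers, and I also found the white paper for some additional specification. I can run a server and one worker on 2 different computers, which have 2 GPUs and 1 GPU respectively, and, using a grpc session, allocate and run the program in either remote or local mode.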

I started a local TensorFlow server on the remote computer with:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='local|localhost:2500' --job_name=local --task_id=0 &

and on the server with:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500,prs|192.168.170.226:2500' --job_name=worker --task_id=0 \
--job_name=prs --task_id=0 &

However, when I try to specify devices so that the graph runs on both computers at the same time, Python shows this error:

 Could not satisfy explicit device specification '/job:worker/task:0'

This happens when I use:

with tf.device("/job:prs/task:0/device:gpu:0"):
  x = tf.placeholder(tf.float32, [None, 784], name='x-input')
  W = tf.Variable(tf.zeros([784, 10]), name='weights')
with tf.device("/job:prs/task:0/device:gpu:1"):
  b = tf.Variable(tf.zeros([10]), name='bias')
# Use a name scope to organize nodes in the graph visualizer
with tf.device("/job:worker/task:0/device:gpu:0"):
  with tf.name_scope('Wx_b'):
    y = tf.nn.softmax(tf.matmul(x, W) + b)

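The same error appears even when I change the job name. So I am wondering whether Add a New Device is needed, or whether I did something wrong when initializing the cluster.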

【Comments】:

Tags: python tensorflow

【Answer 1】:

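worker is actually the name of the job, that is, the named group of tasks in the cluster, not the name of a single machine.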

Your first bazel call should look like this:

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500;192.168.170.226:2501' --job_name=worker --task_id=0 &

Run it on the first node, 192.168.170.193.

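The worker job lists the addresses of both nodes, and each task ID refers to one of those running nodes. You must start the server on both nodes with a different task ID on each, so next run: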

bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server \
--cluster_spec='worker|192.168.170.193:2500;192.168.170.226:2501' --job_name=worker --task_id=1 &

on your second node, 192.168.170.226.
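The two launch commands above differ only in --task_id. As a minimal sketch (the helper names format_cluster_spec and launch_commands are hypothetical, not part of TensorFlow), the flag values can be derived from one job-to-address mapping, using the separators seen in this thread: ';' between the tasks of a job and ',' between jobs.

```python
# Hypothetical helpers: they only build command strings, they do not start anything.
SERVER_BINARY = (
    "bazel-bin/tensorflow/core/distributed_runtime/rpc/grpc_tensorflow_server"
)


def format_cluster_spec(jobs):
    """Join jobs with ',' and the task addresses inside a job with ';'."""
    return ",".join(
        "{}|{}".format(name, ";".join(addresses))
        for name, addresses in jobs.items()
    )


def launch_commands(jobs):
    """Return (address, command) pairs, one per task in the cluster."""
    spec = format_cluster_spec(jobs)
    commands = []
    for name, addresses in jobs.items():
        # The position of an address in its job's list is its task ID.
        for task_id, address in enumerate(addresses):
            command = "{} --cluster_spec='{}' --job_name={} --task_id={}".format(
                SERVER_BINARY, spec, name, task_id
            )
            commands.append((address, command))
    return commands


jobs = {"worker": ["192.168.170.193:2500", "192.168.170.226:2501"]}
for address, command in launch_commands(jobs):
    print(address, "->", command)
```

Every node receives the same --cluster_spec value; only --job_name and --task_id identify which entry the local server plays.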

Then run:

with tf.device("/job:worker/task:0/device:gpu:0"):
  x = tf.placeholder(tf.float32, [None, 784], name='x-input')
  W = tf.Variable(tf.zeros([784, 10]), name='weights')
with tf.device("/job:worker/task:0/device:gpu:1"):
  b = tf.Variable(tf.zeros([10]), name='bias')
# Use a name scope to organize nodes in the graph visualizer
with tf.device("/job:worker/task:1/device:gpu:0"):
  with tf.name_scope('Wx_b'):
    y = tf.nn.softmax(tf.matmul(x, W) + b)
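A device string such as /job:worker/task:1/device:gpu:0 can only be satisfied if that job and task index exist in the cluster spec the servers were started with. The sketch below illustrates that naming rule only; it is not TensorFlow's placement logic, and parse_device and task_exists are hypothetical names. One plausible reading of the error in the question is that the session was talking to a server whose spec contained only the local job.

```python
import re

# Matches "/job:<name>/task:<index>" with an optional "/device:<type>:<index>".
DEVICE_RE = re.compile(
    r"^/job:(?P<job>[^/]+)/task:(?P<task>\d+)"
    r"(?:/device:(?P<type>\w+):(?P<index>\d+))?$"
)


def parse_device(spec):
    """Return (job, task_index) from a device string, or raise ValueError."""
    match = DEVICE_RE.match(spec)
    if match is None:
        raise ValueError("unrecognized device string: {!r}".format(spec))
    return match.group("job"), int(match.group("task"))


def task_exists(spec, jobs):
    """True if the job and task named in spec appear in the cluster mapping."""
    job, task = parse_device(spec)
    return job in jobs and 0 <= task < len(jobs[job])


answer_cluster = {"worker": ["192.168.170.193:2500", "192.168.170.226:2501"]}
question_local_cluster = {"local": ["localhost:2500"]}

print(task_exists("/job:worker/task:1/device:gpu:0", answer_cluster))
print(task_exists("/job:worker/task:0", question_local_cluster))
```

The check says nothing about whether the named GPU exists on that machine; a task with a single GPU still cannot serve device:gpu:1.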

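【Comments】:

Welcome to SO. Please visit the help center to see how to answer a question. If you are running into the same problem as the asker, please post that as a comment instead.
This answer is good, but it would be better if you explained how to use ClusterSpec and Server.
The TF developers have greatly improved their documentation, which can be found here: tensorflow.org/how_tos/distributed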
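For readers following up on the ClusterSpec and Server comment, here is a minimal sketch of the same two-task worker job expressed in Python. It assumes the TensorFlow 1.x API (tf.train.ClusterSpec, tf.train.Server); newer releases expose the server class under different module paths. The helper names start_worker and master_target are hypothetical.

```python
# Same job layout as the --cluster_spec flag in the answer.
CLUSTER = {"worker": ["192.168.170.193:2500", "192.168.170.226:2501"]}


def start_worker(task_index):
    """Start one task of the 'worker' job; call once on each node."""
    import tensorflow as tf  # imported lazily; TensorFlow 1.x API assumed

    cluster = tf.train.ClusterSpec(CLUSTER)
    server = tf.train.Server(cluster, job_name="worker", task_index=task_index)
    server.join()  # block and serve requests from the other tasks


def master_target(task_index=0):
    """gRPC address a client session would connect to."""
    return "grpc://" + CLUSTER["worker"][task_index]
```

Under these assumptions, node 192.168.170.193 would call start_worker(0), node 192.168.170.226 would call start_worker(1), and a client would open its session with tf.Session(master_target()).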