Meaning of inter_op_parallelism_threads and intra_op_parallelism_threads

【Question title】: Meaning of inter_op_parallelism_threads and intra_op_parallelism_threads
【Posted】: 2017-05-05 03:45:46
【Question】:

Can someone please explain the following TensorFlow terms:

inter_op_parallelism_threads
intra_op_parallelism_threads

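Alternatively, please provide links to a source with a proper explanation.

I ran some tests by changing the parameters, but the results were not consistent enough to draw a conclusion.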

【Comments】:

Tags: python parallel-processing tensorflow distributed-computing

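【Answer 1】:

The inter_op_parallelism_threads and intra_op_parallelism_threads options are documented in the source of the tf.ConfigProto protocol buffer. These options configure two thread pools that TensorFlow uses to parallelize execution, as the comments describe: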

// The execution of an individual op (for some op types) can be
// parallelized on a pool of intra_op_parallelism_threads.
// 0 means the system picks an appropriate number.
int32 intra_op_parallelism_threads = 2;

// Nodes that perform blocking operations are enqueued on a pool of
// inter_op_parallelism_threads available in each process.
//
// 0 means the system picks an appropriate number.
//
// Note that the first Session created in the process sets the
// number of threads for all future sessions unless use_per_session_threads is
// true or session_inter_op_thread_pool is configured.
int32 inter_op_parallelism_threads = 5;

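There are several possible forms of parallelism when running a TensorFlow graph, and these options provide some control over multi-core CPU parallelism:

If your TensorFlow graph contains many operations that are independent (because there is no directed path between them in the dataflow graph), TensorFlow will attempt to run them concurrently, using a thread pool with inter_op_parallelism_threads threads. If those operations have a multithreaded implementation, they will (in most cases) share the same thread pool for intra-op parallelism.

Finally, both configuration options take a default value of 0, which means "the system picks an appropriate number." Currently, this means that each thread pool will have one thread per CPU core in your machine.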
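As a rough analogy for the two pools described above, here is a minimal sketch that uses only the Python standard library. It is not TensorFlow's scheduler: the names resolve_threads, inter_op_pool, intra_op_pool, and parallel_sum are hypothetical, the thread counts are arbitrary, and Python threads do not give true CPU parallelism for pure-Python work. It only illustrates the structure: independent "ops" are dispatched on one pool, and an op with a multithreaded implementation fans its own work out to the other.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def resolve_threads(requested):
    """Mimic the '0 means the system picks' rule; here 0 maps to the core count."""
    return requested if requested > 0 else (os.cpu_count() or 1)


# Two separate pools, loosely analogous to the two TensorFlow options.
inter_op_pool = ThreadPoolExecutor(max_workers=resolve_threads(2))
intra_op_pool = ThreadPoolExecutor(max_workers=resolve_threads(0))


def parallel_sum(values, chunks=4):
    """One 'op' with a multithreaded implementation: split work across the intra-op pool."""
    size = max(1, len(values) // chunks)
    pieces = [values[i:i + size] for i in range(0, len(values), size)]
    return sum(intra_op_pool.map(sum, pieces))


# Two independent "ops" (no data dependency between them) run concurrently
# on the inter-op pool; each one fans out its own work to the intra-op pool.
future_a = inter_op_pool.submit(parallel_sum, list(range(1000)))
future_b = inter_op_pool.submit(parallel_sum, list(range(1000, 2000)))

# A third "op" depends on both results, so it can only run after they finish.
total = future_a.result() + future_b.result()
print(total)

inter_op_pool.shutdown()
intra_op_pool.shutdown()
```

If the two calls to parallel_sum were instead chained, with the second consuming the first's result, the inter-op pool would have nothing to overlap, which is why a purely linear graph gains nothing from a larger inter-op setting.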

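【Comments】:

Can this be used to parallelize my code across multiple CPUs? How can I use these features to achieve fault tolerance if one of the machines in the cluster fails?
These options control the maximum amount of parallelism you can get from running your TensorFlow graph. However, they rely on the operations that you run having parallel implementations (as many of the standard kernels do) for intra-op parallelism, and on the availability of independent ops to run in the graph for inter-op parallelism. If (for example) your graph is a linear chain of operations, and those operations have only serial implementations, then these options won't add parallelism. These options are unrelated to fault tolerance (or distributed execution).
It seems these two options only apply to CPUs, not GPUs? If I have several tf.add_n operators built on parallel matrix multiplications and run them on a GPU, how is the parallelization done by default, and can I control it?
How much does setting both values to 1 affect speed? Does setting both to 1 mean that TensorFlow will use only one thread? (I just tried it, and as soon as training starts I can see usage go up on all my cores, but I don't really see a difference in speed.)
@mrry So, if I understand the answer correctly, intra controls the number of cores (within one node) and inter controls the number of nodes, right? Or, loosely speaking, intra works like OpenMP and inter works like OpenMPI? Please correct me if I'm wrong.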

【Answer 2】:

To get the best performance from a machine, change the parallelism threads and OpenMP settings for the TensorFlow backend as shown below (from here):

import tensorflow as tf  # TensorFlow 1.x API (tf.ConfigProto, tf.Session)

# NUM_PARALLEL_EXEC_UNITS denotes the number of cores per socket in the machine.
# When NUM_PARALLEL_EXEC_UNITS=0 the system chooses appropriate settings.
NUM_PARALLEL_EXEC_UNITS = 4  # placeholder value (assumption): replace with your core count

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS,
                        inter_op_parallelism_threads=2,
                        allow_soft_placement=True,
                        device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

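In response to the comment below: [source]

allow_soft_placement=True

If you want TensorFlow to automatically choose an existing and supported device to run an operation when the specified device does not exist, set allow_soft_placement to True in the configuration options when creating the session. In short, it lets placement fall back to an available device instead of failing; it does not control GPU memory allocation, which is handled by separate options such as gpu_options.allow_growth.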

【Comments】:

What is allow_soft_placement=True?
Answered the question in the post.

【Answer 3】:

TensorFlow 2.0 compatible answer: if we want to execute in graph mode in TensorFlow 2.0, the function with which we can configure inter_op_parallelism_threads and intra_op_parallelism_threads is

tf.compat.v1.ConfigProto.
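
A minimal configuration sketch of how this might look in practice, assuming TensorFlow 2.x is installed. The thread counts (4 and 2) and the small matmul graph are illustrative assumptions, not recommendations or part of the original answer.

```python
import tensorflow as tf

# Graph mode has to be enabled before any ops are created.
tf.compat.v1.disable_eager_execution()

# Thread counts are arbitrary illustrative values (assumption).
config = tf.compat.v1.ConfigProto(
    intra_op_parallelism_threads=4,
    inter_op_parallelism_threads=2,
)

# A tiny graph: one matmul, whose kernel may use the intra-op pool.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
product = tf.matmul(a, a)

with tf.compat.v1.Session(config=config) as sess:
    print(sess.run(product))
```

For eager mode, TensorFlow 2.x also exposes tf.config.threading.set_intra_op_parallelism_threads and tf.config.threading.set_inter_op_parallelism_threads, which would need to be called at program start before TensorFlow initializes its runtime.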

【Comments】: