Why is the performance of concurrent.futures.ProcessPoolExecutor very low? Answers

【Question title】: Why is the performance of concurrent.futures.ProcessPoolExecutor very low?
【Posted】: 2018-02-09 21:02:24
【Question】:

I am trying to use concurrent.futures.ProcessPoolExecutor in Python 3 to process a large matrix in parallel. The general structure of the code is:

from concurrent.futures import ProcessPoolExecutor, as_completed

class X(object):

    # self.matrix: a large scipy csr_matrix, set elsewhere in the class

    def f(self, i, row_i):
        ...  # <cpu-bound process>

    def fetch_multiple(self, ids):
        with ProcessPoolExecutor() as executor:
            futures = [executor.submit(self.f, i, self.matrix.getrow(i)) for i in ids]
            return [f.result() for f in as_completed(futures)]

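self.matrix is a large scipy csr_matrix. f is my concurrent function: it takes one row of self.matrix and applies a CPU-bound process to it. Finally, fetch_multiple runs multiple instances of f in parallel and returns the results.

The problem is that, when I run the script, all CPU cores are less than 50% busy (screenshot of per-core CPU usage omitted).

Why are the cores not all busy?

I think the problem is the large self.matrix object and the passing of row vectors between processes. How can I solve this?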

【Comments】:

Tags: python-3.x threadpool python-multithreading concurrent.futures process-pool

【Solution 1】:

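Yes. The overhead should not be that big, but it is likely the reason your CPUs appear idle (although they should be busy passing the data around anyway).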
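To get a feel for where such overhead can come from, here is a minimal sketch using only the standard library. ProcessPoolExecutor pickles the callable and its arguments for every submitted task, and a bound method such as self.f is pickled together with its instance. The class X, the function g, and the plain list standing in for the csr_matrix are illustrative assumptions, not the asker's actual code.

```python
import pickle


class X:
    def __init__(self, n):
        # Stand-in for the large matrix: n floats held on the instance.
        self.matrix = [float(k) for k in range(n)]

    def f(self, i, row_i):
        return i, sum(row_i)


def g(i, row_i):
    """Module-level version of f that carries no instance with it."""
    return i, sum(row_i)


x = X(100_000)
row = x.matrix[:10]

# Roughly what gets serialized per task: the callable plus its arguments.
bound_task = pickle.dumps((x.f, 0, row))  # drags the whole instance along
plain_task = pickle.dumps((g, 0, row))    # only a reference to g plus the row

print(len(bound_task), len(plain_task))
```

In this sketch the task built from the bound method is far larger than the one built from the module-level function, because the former includes the entire stand-in matrix.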

But try the recipe here, which uses shared memory to pass a "pointer" to the object to the subprocesses.

http://briansimulator.org/sharing-numpy-arrays-between-processes/

Quoting from there:

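import numpy
from multiprocessing import sharedctypes

# S is an existing numpy float64 array
size = S.size
shape = S.shape
S.shape = size                               # flatten in place
S_ctypes = sharedctypes.RawArray('d', S)     # copy the values into shared memory
S = numpy.frombuffer(S_ctypes, dtype=numpy.float64, count=size)  # view on the shared buffer
S.shape = shape                              # restore the original shape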

Now we can send S_ctypes and shape to a child process in multiprocessing, and convert it back to a numpy array in the child process as follows:

from numpy import ctypeslib
S = ctypeslib.as_array(S_ctypes)
S.shape = shape

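Dealing with reference counting would normally be tricky, but I suppose numpy.ctypeslib takes care of that. So just coordinate which row numbers you pass to the subprocesses, in such a way that no two of them work on the same data.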
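As a rough sketch of that last point, assuming Python 3.7+ (where ProcessPoolExecutor accepts initializer and initargs): the three arrays behind a CSR matrix (scipy exposes them as .data, .indices and .indptr) are copied once into RawArray buffers, every worker receives them once at start-up, and each task then carries only a row number. The names init_worker, row_sum and _shared are made up for illustration, and row_sum merely stands in for the CPU-bound f. The buffers are read with plain slicing here to keep the sketch dependency-free; the numpy.frombuffer wrapping from the quoted recipe could be applied inside the worker instead.

```python
from concurrent.futures import ProcessPoolExecutor
from multiprocessing.sharedctypes import RawArray

# Hypothetical per-process storage, filled in once by init_worker.
_shared = {}


def init_worker(data, indices, indptr):
    """Runs once in each worker: remember the shared CSR buffers."""
    _shared["data"] = data
    _shared["indices"] = indices
    _shared["indptr"] = indptr


def row_sum(i):
    """Stand-in for the CPU-bound f: sum the stored values of row i."""
    start, end = _shared["indptr"][i], _shared["indptr"][i + 1]
    return i, sum(_shared["data"][start:end])


def fetch_multiple(data, indices, indptr, ids):
    """Submit only row numbers; the buffers reach workers via the initializer."""
    with ProcessPoolExecutor(initializer=init_worker,
                             initargs=(data, indices, indptr)) as executor:
        return dict(executor.map(row_sum, ids, chunksize=64))


if __name__ == "__main__":
    # A tiny 3x4 matrix in CSR form, copied once into shared memory:
    # [[1, 0, 2, 0],
    #  [0, 0, 0, 0],
    #  [0, 3, 4, 5]]
    data = RawArray("d", [1.0, 2.0, 3.0, 4.0, 5.0])
    indices = RawArray("i", [0, 2, 1, 2, 3])
    indptr = RawArray("i", [0, 2, 2, 5])
    print(fetch_multiple(data, indices, indptr, range(3)))
```

Because row_sum is a module-level function rather than a bound method, no instance is pickled along with each task, and the per-task payload is just an integer. The workers only read the buffers in this sketch, so no locking is involved.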

【Comments】: