Why is communication via shared memory so much slower than via queues? (Answers)

【Question title】: Why is communication via shared memory so much slower than via queues?
【Posted】: 2014-10-05 23:03:15
【Question】:

I am using Python 2.7.5 on a recent-vintage Apple MacBook Pro, which has four hardware and eight logical CPUs; i.e., the sysctl utility gives:

$ sysctl hw.physicalcpu
hw.physicalcpu: 4
$ sysctl hw.logicalcpu
hw.logicalcpu: 8

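I need to perform some rather complex processing on a large 1-D list or array, and then save the result as an intermediate output that will be used again later in a subsequent calculation within my application. The structure of my problem lends itself naturally to parallelization, so I thought I would try Python's multiprocessing module to subdivide the 1-D array into several pieces (either 4 or 8, I'm not yet sure which), perform the calculations in parallel, and then reassemble the resulting output into its final format.

I'm trying to decide whether to use multiprocessing.Queue() (message queues) or multiprocessing.Array() (shared memory) as my preferred mechanism for communicating the results from the child processes back to the main parent process, and I have been experimenting with a couple of "toy" models to make sure I understand how the multiprocessing module actually works. I've come across a rather unexpected result, however: of two essentially equivalent solutions to the same problem, the version that uses shared memory for interprocess communication seems to need far more execution time (about 30 times more!) than the version that uses message queues.

Below are two versions of sample source code for a "toy" problem that generates a long sequence of random numbers using parallel processes and communicates the combined result back to the parent process in two different ways: first using message queues, then using shared memory.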

Here is the version that uses message queues:

import random
import multiprocessing
import datetime

def genRandom(count, id, q):

    print("Now starting process {0}".format(id))
    output = []
    # Generate a list of random numbers, of length "count"
    for i in xrange(count):
        output.append(random.random())
    # Write the output to a queue, to be read by the calling process 
    q.put(output)

if __name__ == "__main__":
    # Number of random numbers to be generated by each process
    size = 1000000
    # Number of processes to create -- the total size of all of the random
    # numbers generated will ultimately be (procs * size)
    procs = 4

    # Create a list of jobs and queues 
    jobs = []
    outqs = []
    for i in xrange(0, procs):
        q = multiprocessing.Queue()
        p = multiprocessing.Process(target=genRandom, args=(size, i, q))
        jobs.append(p)
        outqs.append(q)

    # Start time of the parallel processing and communications section
    tstart = datetime.datetime.now()    
    # Start the processes (i.e. calculate the random number lists)      
    for j in jobs:
        j.start()

    # Read out the data from the queues
    data = []
    for q in outqs:
        data.extend(q.get())

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
    # End time of the parallel processing and communications section
    tstop = datetime.datetime.now()
    tdelta = datetime.timedelta.total_seconds(tstop - tstart)

    msg = "{0} random numbers generated in {1} seconds"
    print(msg.format(len(data), tdelta))

When I run it, I get a result that typically looks something like this:

$ python multiproc_queue.py
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 0.514805 seconds

Now, here is the equivalent code, refactored slightly so that it uses shared memory instead of queues:

import random
import multiprocessing
import datetime

def genRandom(count, id, d):

    print("Now starting process {0}".format(id))
    # Generate a list of random numbers, of length "count", and write them
    # directly to a segment of an array in shared memory
    for i in xrange(count*id, count*(id+1)):
        d[i] = random.random()

if __name__ == "__main__":
    # Number of random numbers to be generated by each process
    size = 1000000
    # Number of processes to create -- the total size of all of the random
    # numbers generated will ultimately be (procs * size)
    procs = 4

    # Create a list of jobs and a block of shared memory
    jobs = []
    data = multiprocessing.Array('d', size*procs)
    for i in xrange(0, procs):
        p = multiprocessing.Process(target=genRandom, args=(size, i, data))
        jobs.append(p)

    # Start time of the parallel processing and communications section
    tstart = datetime.datetime.now()    
    # Start the processes (i.e. calculate the random number lists)      
    for j in jobs:
        j.start()

    # Ensure all of the processes have finished
    for j in jobs:
        j.join()
    # End time of the parallel processing and communications section
    tstop = datetime.datetime.now()
    tdelta = datetime.timedelta.total_seconds(tstop - tstart)

    msg = "{0} random numbers generated in {1} seconds"
    print(msg.format(len(data), tdelta))

When I run the shared memory version, however, the typical result looks more like this:

$ python multiproc_shmem.py 
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 15.839607 seconds

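My question: why is there such a huge difference in execution speed (roughly 0.5 seconds versus 15 seconds, a factor of 30!) between the two versions of my code? And in particular, how can I modify the shared memory version to make it run faster?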

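【Comments】:

Move the queue.put in the first example into the for loop so that it matches the d[i] assignment in the second example. As written, the two techniques are not comparable: the queue version uses the queue lazily (one put per process), while the shared memory version uses shared memory eagerly (one write per element).

Tags: python performance multiprocessing message-queue shared-memory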

【Solution 1】:

This is because multiprocessing.Array uses a lock by default to prevent multiple processes from accessing it at once:

multiprocessing.Array(typecode_or_type, size_or_initializer, *, lock=True)

...

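If lock is True (the default) then a new lock object is created to synchronize access to the value. If lock is a Lock or RLock object then that will be used to synchronize access to the value. If lock is False then access to the returned object will not be automatically protected by a lock, so it will not necessarily be "process-safe".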
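To make the three cases in the quoted documentation concrete, here is a small sketch that creates an array each way and checks what comes back. It is written for Python 3 (the question uses Python 2.7), and the helper name has_builtin_lock is made up for this illustration.

```python
import multiprocessing


def has_builtin_lock(arr):
    """Return True if the array is a synchronized wrapper carrying its own lock."""
    return hasattr(arr, "get_lock")


# lock=True (the default): a synchronized wrapper with a newly created lock.
locked = multiprocessing.Array('d', 4)

# lock=<Lock or RLock>: the wrapper reuses the lock object that is passed in.
my_lock = multiprocessing.Lock()
shared_lock = multiprocessing.Array('d', 4, lock=my_lock)

# lock=False: a plain ctypes array in shared memory, with no wrapper or lock.
unlocked = multiprocessing.Array('d', 4, lock=False)

print(has_builtin_lock(locked))
print(shared_lock.get_lock() is my_lock)
print(has_builtin_lock(unlocked))
```

Only the first two objects expose get_lock(); the third is the raw array, which is why access to it is not synchronized.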

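This means you are not really writing to the array concurrently: only one process can access it at a time, and every d[i] assignment has to acquire and release the lock. Since your example workers do almost nothing except array writes, constantly waiting on this lock badly hurts performance. If you pass lock=False when you create the array, the performance is much better:

With lock=True: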

Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 4.811205 seconds

With lock=False:

Now starting process 0
Now starting process 3
Now starting process 1
Now starting process 2
4000000 random numbers generated in 0.192473 seconds
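For reference, a minimal sketch of the lock=False variant might look like the following. It is a Python 3 rewrite rather than the asker's exact Python 2.7 script, the function names fill_segment and generate are illustrative, and the array size is kept small so the example finishes quickly.

```python
import multiprocessing
import random


def fill_segment(shared, worker_id, count):
    """Write `count` random numbers into this worker's own slice of `shared`."""
    start = worker_id * count
    for i in range(start, start + count):
        shared[i] = random.random()


def generate(procs, size):
    """Fill a shared array of procs * size doubles using `procs` processes."""
    # lock=False returns a raw ctypes array, so element writes take no lock.
    # This is safe here only because each worker writes a disjoint slice.
    shared = multiprocessing.Array('d', procs * size, lock=False)
    jobs = [
        multiprocessing.Process(target=fill_segment, args=(shared, i, size))
        for i in range(procs)
    ]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    return shared


if __name__ == "__main__":
    data = generate(procs=4, size=1000)
    print("{0} random numbers generated".format(len(data)))
```

The only change that matters relative to the slow version is the lock=False argument; the per-element loop is otherwise the same.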

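Note that using lock=False means you have to protect access to the Array yourself whenever you do something that is not safe. Your example has each process write to its own distinct section, so it is fine. But if you tried to read from the array while the writes are in progress, or had different processes write to overlapping sections, you would need to acquire a lock manually.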
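As an illustration of the overlapping-write case, here is a hedged Python 3 sketch in which several processes add into the same cells of a lock=False array and guard the update with an explicit multiprocessing.Lock. The function name add_to_totals and the sample values are assumptions made for this example, not part of the original code.

```python
import multiprocessing


def add_to_totals(totals, lock, values):
    """Add `values` element-wise into the shared `totals` array."""
    # Every worker touches the same cells, so the read-modify-write must be
    # guarded. The lock is taken once per batch rather than once per element,
    # which keeps contention low.
    with lock:
        for i, value in enumerate(values):
            totals[i] += value


if __name__ == "__main__":
    lock = multiprocessing.Lock()
    totals = multiprocessing.Array('d', 3, lock=False)
    jobs = [
        multiprocessing.Process(target=add_to_totals,
                                args=(totals, lock, [1.0, 2.0, 3.0]))
        for _ in range(4)
    ]
    for job in jobs:
        job.start()
    for job in jobs:
        job.join()
    print(list(totals))
```

Holding the lock for a whole batch is a design choice: it avoids the per-element locking cost that made the lock=True version slow, at the price of serializing each batch.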

【Comments】:

Thanks! I would never have guessed in a million years that memory locking was causing all the extra overhead.