【发布时间】:2014-10-05 23:03:15
【问题描述】:
我在最近的老式 Apple MacBook Pro 上使用 Python 2.7.5,它有四个硬件和八个逻辑 CPU;即,sysctl 实用程序给出:
$ sysctl hw.physicalcpu
hw.physicalcpu: 4
$ sysctl hw.logicalcpu
hw.logicalcpu: 8
我需要对大型一维列表或数组执行一些相当复杂的处理,然后将结果保存为中间输出,稍后将在我的应用程序的后续计算中再次使用它。我的问题的结构很自然地适合并行化,所以我想我会尝试使用 Python 的多处理模块将一维数组细分为几块(4 块或 8 块,我还不确定哪个),执行并行计算,然后将结果输出重新组合成最终格式。我正在尝试决定是使用multiprocessing.Queue()(消息队列)还是multiprocessing.Array()(共享内存)作为将结果计算从子进程传回主父进程的首选机制,我一直在尝试几个“玩具”模型,以确保我了解多处理模块的实际工作原理。然而,我遇到了一个相当出乎意料的结果:在为同一问题创建两个本质上等效的解决方案时,使用共享内存进行进程间通信的版本似乎比使用消息的版本需要更多的执行时间(比如多 30 倍!)排队。下面,我为一个“玩具”问题提供了两个不同版本的示例源代码,它使用并行进程生成一长串随机数,并以两种不同的方式将聚集的结果传回父进程:首先使用消息队列,第二次使用共享内存。
这是使用消息队列的版本:
import random
import multiprocessing
import datetime
def genRandom(count, id, q):
print("Now starting process {0}".format(id))
output = []
# Generate a list of random numbers, of length "count"
for i in xrange(count):
output.append(random.random())
# Write the output to a queue, to be read by the calling process
q.put(output)
if __name__ == "__main__":
# Number of random numbers to be generated by each process
size = 1000000
# Number of processes to create -- the total size of all of the random
# numbers generated will ultimately be (procs * size)
procs = 4
# Create a list of jobs and queues
jobs = []
outqs = []
for i in xrange(0, procs):
q = multiprocessing.Queue()
p = multiprocessing.Process(target=genRandom, args=(size, i, q))
jobs.append(p)
outqs.append(q)
# Start time of the parallel processing and communications section
tstart = datetime.datetime.now()
# Start the processes (i.e. calculate the random number lists)
for j in jobs:
j.start()
# Read out the data from the queues
data = []
for q in outqs:
data.extend(q.get())
# Ensure all of the processes have finished
for j in jobs:
j.join()
# End time of the parallel processing and communications section
tstop = datetime.datetime.now()
tdelta = datetime.timedelta.total_seconds(tstop - tstart)
msg = "{0} random numbers generated in {1} seconds"
print(msg.format(len(data), tdelta))
当我运行它时,我得到的结果通常看起来像这样:
$ python multiproc_queue.py
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 0.514805 seconds
现在,这是等效的代码段,但稍作重构,使其使用共享内存而不是队列:
import random
import multiprocessing
import datetime
def genRandom(count, id, d):
print("Now starting process {0}".format(id))
# Generate a list of random numbers, of length "count", and write them
# directly to a segment of an array in shared memory
for i in xrange(count*id, count*(id+1)):
d[i] = random.random()
if __name__ == "__main__":
# Number of random numbers to be generated by each process
size = 1000000
# Number of processes to create -- the total size of all of the random
# numbers generated will ultimately be (procs * size)
procs = 4
# Create a list of jobs and a block of shared memory
jobs = []
data = multiprocessing.Array('d', size*procs)
for i in xrange(0, procs):
p = multiprocessing.Process(target=genRandom, args=(size, i, data))
jobs.append(p)
# Start time of the parallel processing and communications section
tstart = datetime.datetime.now()
# Start the processes (i.e. calculate the random number lists)
for j in jobs:
j.start()
# Ensure all of the processes have finished
for j in jobs:
j.join()
# End time of the parallel processing and communications section
tstop = datetime.datetime.now()
tdelta = datetime.timedelta.total_seconds(tstop - tstart)
msg = "{0} random numbers generated in {1} seconds"
print(msg.format(len(data), tdelta))
但是,当我运行共享内存版本时,典型的结果看起来更像这样:
$ python multiproc_shmem.py
Now starting process 0
Now starting process 1
Now starting process 2
Now starting process 3
4000000 random numbers generated in 15.839607 seconds
我的问题:为什么我的代码的两个版本之间的执行速度存在如此巨大的差异(大约 0.5 秒对 15 秒,是 30 倍!)?特别是,如何修改共享内存版本以使其运行得更快?
【问题讨论】:
-
将第一个示例中的
queue.put移动到for循环中,使其与第二个示例中的d[i]相同。目前,您无法比较这两种技术,因为queue一种被大量使用,而共享内存一种被急切使用。
标签: python performance multiprocessing message-queue shared-memory