【发布时间】:2020-03-27 15:21:12
【问题描述】:
我正在编写一个需要处理大量数据的脚本。我意识到脚本的并行组件实际上对具有大量单独数据点的实例没有帮助。我将创建临时文件并并行运行它们。我在qsub 上运行它,所以我将通过-pe threaded $N_JOBS 分配一定数量的线程(在这个小例子中是4)。
我的最终目标是使用我分配的线程之一启动每个进程,然后等待所有作业完成后再继续。
但是,我只使用过process = subprocess.Popen 和process.communicate() 来运行shell 作业。由于僵尸进程,我过去在使用process.wait() 时遇到了一些麻烦。
如何修改我的run 函数来启动作业,而不是等待完成,然后开始下一个作业,然后在所有作业运行后,等待所有作业完成?
如果不清楚,请告诉我,我可以更好地解释。在下面的示例中(可能是一个糟糕的示例?),我想使用 4 个单独的线程(我不知道如何设置这个 b/c 我只为简单的并行化做过joblib.Parallel)其中每个线程运行命令echo '$THREAD' && sleep 1。所以最终它应该花费 1 秒多一点而不是 ~4 秒。
我找到了这篇文章:Python threading multiple bash subprocesses?,但我不确定如何使用我的run 脚本来适应我的情况。
import sys, subprocess, time
# Number of jobs
N_JOBS=4
# Run command
def run(
cmd,
popen_kws=dict(),
):
# Run
f_stdout = subprocess.PIPE
f_stderr = subprocess.PIPE
# Execute the process
process_ = subprocess.Popen(cmd, shell=True, stdout=f_stdout, stderr=f_stderr, **popen_kws)
# Wait until process is complete and return stdout/stderr
stdout_, stderr_ = process_.communicate() # Use this .communicate instead of .wait to avoid zombie process that hangs due to defunct. Removed timeout b/c it's not available in Python 2
# Return code
returncode_ = process_.returncode
return {"process":process_, "stdout":stdout_, "stderr":stderr_, "returncode":returncode_}
# Commands
cmds = list(map(lambda x:"echo '{}' && sleep 1".format(x), range(1, N_JOBS+1)))
# ["echo '1'", "echo '2'", "echo '3'", "echo '4'"]
# Start time
start_time = time.time()
results = dict()
for thread, cmd in enumerate(cmds, start=1):
# Run command but don't wait for it to finish (Currently, it's waiting to finish)
results[thread] = run(cmd)
# Now wait until they are all finished
print("These jobs took {} seconds\n".format(time.time() - start_time))
print("Here's the results:", *results.items(), sep="\n")
print("\nContinue with script. .. ...")
# These jobs took 4.067937850952148 seconds
# Here's the results:
# (1, {'process': <subprocess.Popen object at 0x1320766d8>, 'stdout': b'1\n', 'stderr': b'', 'returncode': 0})
# (2, {'process': <subprocess.Popen object at 0x1320547b8>, 'stdout': b'2\n', 'stderr': b'', 'returncode': 0})
# (3, {'process': <subprocess.Popen object at 0x132076ba8>, 'stdout': b'3\n', 'stderr': b'', 'returncode': 0})
# (4, {'process': <subprocess.Popen object at 0x132076780>, 'stdout': b'4\n', 'stderr': b'', 'returncode': 0})
# Continue with script. .. ...
我已经尝试按照 multiprocessing https://docs.python.org/3/library/multiprocessing.html 上的文档进行操作,但要根据我的情况调整它真的很令人困惑:
# Run command
def run(
cmd,
errors_ok=False,
popen_kws=dict(),
):
# Run
f_stdout = subprocess.PIPE
f_stderr = subprocess.PIPE
# Execute the process
process_ = subprocess.Popen(cmd, shell=True, stdout=f_stdout, stderr=f_stderr, **popen_kws)
return process_
# Commands
cmds = list(map(lambda x:"echo '{}' && sleep 0.5".format(x), range(1, N_JOBS+1)))
# ["echo '1'", "echo '2'", "echo '3'", "echo '4'"]
# Start time
start_time = time.time()
results = dict()
for thread, cmd in enumerate(cmds, start=1):
# Run command but don't wait for it to finish (Currently, it's waiting to finish)
p = multiprocessing.Process(target=run, args=(cmd,))
p.start()
p.join()
results[thread] = p
【问题讨论】:
标签: python multithreading parallel-processing subprocess popen