【Posted】:2015-08-19 08:54:43
【Question】:
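I wrote a function that takes a list of peptides (biological sequences as strings) and returns a Pandas DataFrame (samples as rows, descriptors as columns). "my_function(pep_list)" takes pep_list as its argument and returns the DataFrame. It iterates over each peptide sequence in pep_list, computes the descriptors, combines all the data into a pandas DataFrame, and returns df.
Example:
pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF", "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]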
I want to parallelize this code using the algorithm given below:
1. Get the number of available processors as n:
n = multiprocessing.cpu_count()
2. Split pep_list into n sub-lists (pseudocode):
sub_list_of_pep_list = pep_list/n
sub_list_of_pep_list = [["DAAAAEF","DAAAREF","DAAANEF"],["DAAADEF","DAAACEF","DAAAEEF"],["DAAAQEF","DAAAGEF","DAAAHEF"],["DAAAIEF","DAAALEF","DAAAKEF"]]
3. Run "my_function()" on each core (example with 4 cores):
df0 = my_function(sub_list_of_pep_list[0])
df1 = my_function(sub_list_of_pep_list[1])
df2 = my_function(sub_list_of_pep_list[2])
df3 = my_function(sub_list_of_pep_list[3])
4. Join all the results: df = pd.concat([df0, df1, df2, df3])
5. Return df, with an n-fold speedup.
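The steps above can be sketched with multiprocessing.Pool from the standard library. This is only a minimal sketch under stated assumptions: my_function here is a hypothetical stand-in that computes two toy descriptors (length and alanine count), and split_into_chunks and parallel_my_function are illustrative helper names that do not appear in the original post.

```python
import multiprocessing

import pandas as pd


def my_function(pep_list):
    """Hypothetical stand-in for the real descriptor calculation."""
    rows = [{"length": len(pep), "n_ala": pep.count("A")} for pep in pep_list]
    return pd.DataFrame(rows, index=list(pep_list))


def split_into_chunks(items, n):
    """Split items into at most n contiguous, nearly equal sub-lists."""
    size, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when n > len(items)
            chunks.append(items[start:end])
        start = end
    return chunks


def parallel_my_function(pep_list, n_procs=None):
    """Run my_function on each sub-list in a worker process and join the results."""
    n_procs = n_procs or multiprocessing.cpu_count()
    chunks = split_into_chunks(pep_list, n_procs)
    with multiprocessing.Pool(processes=n_procs) as pool:
        frames = pool.map(my_function, chunks)  # results come back in chunk order
    return pd.concat(frames)


if __name__ == "__main__":
    pep_list = ["DAAAAEF", "DAAAREF", "DAAANEF", "DAAADEF", "DAAACEF", "DAAAEEF",
                "DAAAQEF", "DAAAGEF", "DAAAHEF", "DAAAIEF", "DAAALEF", "DAAAKEF"]
    df = parallel_my_function(pep_list, n_procs=4)
    print(df.shape)
```

The actual speedup depends on how expensive the real descriptor calculation is compared with the cost of starting processes and sending data frames between them, so an n-fold gain is an upper bound rather than a guarantee.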
Please recommend the most suitable library for implementing this approach.
Thanks and regards.
Updated
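After some reading, I was able to write code that does what I expect: 1. Without parallelization, 10 peptide sequences take about 10 seconds. 2. With two processes, 12 peptides take about 6 seconds. 3. With four processes, 12 peptides take about 4 seconds.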
from multiprocessing import Process

# structure_gen is the descriptor function from my own code (not shown here).
def func1():
    structure_gen(pep_seq=["DAAAAEF", "DAAAREF", "DAAANEF"])

def func2():
    structure_gen(pep_seq=["DAAAQEF", "DAAAGEF", "DAAAHEF"])

def func3():
    structure_gen(pep_seq=["DAAADEF", "DAAALEF"])

def func4():
    structure_gen(pep_seq=["DAAAIEF", "DAAALEF"])

if __name__ == '__main__':
    # Start one process per hand-written function.
    p1 = Process(target=func1)
    p1.start()
    p2 = Process(target=func2)
    p2.start()
    p3 = Process(target=func3)
    p3.start()
    p4 = Process(target=func4)
    p4.start()
    # Wait for all four processes to finish.
    p1.join()
    p2.join()
    p3.join()
    p4.join()
This code handles 10 peptides easily, but I cannot apply it when pep_list contains 1 million peptides.
Thanks.
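Hand-writing one function per process, as in the update above, does not scale to a list of a million peptides. One possible alternative, sketched here under assumptions, is to let a process pool hand out the peptides in batches. compute_descriptors is a hypothetical per-peptide stand-in for the real calculation, describe_all is an illustrative name, and the chunksize values are arbitrary.

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product


def compute_descriptors(pep):
    """Hypothetical per-peptide stand-in for the real calculation."""
    return {"peptide": pep, "length": len(pep), "n_ala": pep.count("A")}


def describe_all(pep_list, n_procs=None, chunksize=1000):
    """Map compute_descriptors over pep_list; rows come back in input order."""
    with ProcessPoolExecutor(max_workers=n_procs) as executor:
        # chunksize sends peptides to the workers in batches, which reduces
        # inter-process overhead when the list is very long.
        return list(executor.map(compute_descriptors, pep_list, chunksize=chunksize))


if __name__ == "__main__":
    # Illustrative input: all 8,000 peptides of the form "DA" + xyz + "EF".
    amino_acids = "ACDEFGHIKLMNPQRSTVWY"
    pep_list = ["DA" + "".join(t) + "EF" for t in product(amino_acids, repeat=3)]
    rows = describe_all(pep_list, n_procs=4, chunksize=500)
    print(len(rows), rows[0])
    # With pandas installed, pd.DataFrame(rows) turns the rows into one data frame.
```

Because the pool distributes the work itself, the same code applies to 12 peptides or to a million without writing a separate function for each process.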
【Comments】:
-
Process(target=my_function, args=(each_item_in_sub_list,)).start() You can spawn more processes than the number of CPUs.
-
Please elaborate if you can, thanks.
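The Process(target=..., args=...) pattern from the first comment could be elaborated roughly as follows. A Process does not return the value of its target, so this sketch passes results back through a multiprocessing.Queue. my_function is again a hypothetical stand-in, and worker and run_in_processes are illustrative names, not part of the original code.

```python
from multiprocessing import Process, Queue


def my_function(pep_list):
    """Hypothetical stand-in: one row of toy descriptors per peptide."""
    return [{"peptide": pep, "length": len(pep)} for pep in pep_list]


def worker(index, sub_list, queue):
    """Run my_function on one sub-list and send the result back, tagged by index."""
    queue.put((index, my_function(sub_list)))


def run_in_processes(sub_lists):
    """Start one process per sub-list and return the results in sub-list order."""
    queue = Queue()
    procs = [Process(target=worker, args=(i, sub, queue))
             for i, sub in enumerate(sub_lists)]
    for p in procs:
        p.start()
    # Read the results before joining: a process that has put items on a
    # queue may not terminate until those items have been consumed.
    results = dict(queue.get() for _ in procs)
    for p in procs:
        p.join()
    return [results[i] for i in range(len(sub_lists))]


if __name__ == "__main__":
    sub_lists = [["DAAAAEF", "DAAAREF", "DAAANEF"], ["DAAADEF", "DAAACEF", "DAAAEEF"]]
    for rows in run_in_processes(sub_lists):
        print(rows)
```

The sketch starts one process per sub-list rather than one per peptide, since starting a process has a cost that would add up quickly for a very long list.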
Tags: python pandas python-multithreading python-multiprocessing