Parallel BLAST program takes too long: answers

【Question title】: Parallel BLAST program takes too long
【Posted】: 2021-03-11 03:25:57
【Question】:

When I run the following code, I do not get even a single BLAST result. If anyone can spot the mistake, could you point it out?

from Bio.Blast import NCBIWWW
from Bio import SeqIO
from Bio.Blast import NCBIXML
from multiprocessing import Pool
import time


def blast_sequences_parallel(seq_record):
    result_handle = NCBIWWW.qblast("blastn", "nt", seq_record.seq, entrez_query='txid10239[viruses]')
    blast_records = NCBIXML.parse(result_handle)
    return blast_records


if __name__ == "__main__":
    file = "file.fa"
    get_number_of_seqs(file)
    seq_records = SeqIO.parse(file, "fasta")
    t1 = time.time()
    p = Pool()
    results = p.map(blast_sequences_parallel, seq_records)
    p.close()
    p.join()

    print("Pool took:", time.time() - t1)
    print(results)

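I have 73,000 sequences to run, so I am trying to make it faster. I am running it on a supercomputer. Any suggestions on how much memory and how many cores/nodes I need? I have also tried the following command in the shell: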

blastn -query file.fa -remote

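But I get an error message saying I need to download the database? Is there a way to run the search against the online server? If so, can I restrict the search to viral genomes only?

【Comments】:

Tags: python parallel-processing bioinformatics biopython blast

【Solution 1】: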

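For 73K sequences, you should download the appropriate database and run BLAST locally rather than trying to run it online. Biopython also has a wrapper for this (from Bio.Blast.Applications import NcbiblastxCommandline, or NcbiblastnCommandline for blastn), but running BLAST from the command line will be easier. See also the relevant Biopython docs.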
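As a rough sketch of the command-line route, the snippet below splits a FASTA file into chunks and runs a locally installed blastn on each chunk from Python. It assumes BLAST+ is on the PATH and that a nucleotide database has already been downloaded or built. The database name viral_db, the chunk size, and the helper names split_fasta, blastn_command and run_chunks are placeholders for illustration, not part of the original answer.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def split_fasta(path, chunk_size, out_dir):
    """Write the records of a FASTA file into files of at most chunk_size records."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks, handle, count = [], None, 0
    with open(path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                # Start a new chunk file every chunk_size records.
                if count % chunk_size == 0:
                    if handle:
                        handle.close()
                    chunk_path = out_dir / f"chunk_{len(chunks):04d}.fa"
                    handle = open(chunk_path, "w")
                    chunks.append(chunk_path)
                count += 1
            if handle:
                handle.write(line)
    if handle:
        handle.close()
    return chunks


def blastn_command(query, db, out, threads=1):
    """Build the argument list for a local blastn run with tabular output."""
    return [
        "blastn",
        "-query", str(query),
        "-db", str(db),          # hypothetical local database name
        "-out", str(out),
        "-outfmt", "6",          # tab-separated hits, one per line
        "-num_threads", str(threads),
    ]


def run_chunks(chunks, db, workers=4):
    """Run blastn on every chunk; threads are enough because the work happens in subprocesses."""
    def run(chunk):
        out = chunk.with_suffix(".tsv")
        subprocess.run(blastn_command(chunk, db, out), check=True)
        return out

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, chunks))
```

A call such as run_chunks(split_fasta("file.fa", 1000, "chunks"), "viral_db") would then leave one .tsv result file next to each chunk. On a cluster, each chunk could just as well be submitted as a separate job instead of using a thread pool.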

NCBI provides a set of prebuilt databases (or you can build your own from a FASTA file with makeblastdb): https://ftp.ncbi.nlm.nih.gov/blast/db/
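If none of the prebuilt databases fits, building one from a FASTA file might look like the sketch below. It assumes BLAST+ is installed so that makeblastdb is on the PATH. The file name viruses.fa, the database name viral_db, and the helper names are made up for illustration.

```python
import subprocess


def makeblastdb_command(fasta, db_name, dbtype="nucl"):
    """Build the argument list for makeblastdb (nucl for nucleotide, prot for protein)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("dbtype must be 'nucl' or 'prot'")
    return ["makeblastdb", "-in", str(fasta), "-dbtype", dbtype, "-out", str(db_name)]


def build_database(fasta, db_name, dbtype="nucl"):
    """Run makeblastdb; raises CalledProcessError if the tool exits with an error."""
    subprocess.run(makeblastdb_command(fasta, db_name, dbtype), check=True)
```

Calling build_database("viruses.fa", "viral_db") would write the database files under the viral_db prefix, which can then be passed to blastn through its -db option.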

【Comments】: