在 python 中对齐 DNA 序列答案

【问题标题】：Aligning DNA sequences inside python在 python 中对齐 DNA 序列
【发布时间】：2016-02-10 22:46:46
【问题描述】：

我有数千个范围在 100 到 5000 bp 之间的 DNA 序列，我需要比对并计算指定对的同一性得分。 Biopython pairwise2 做得很好，但仅适用于短序列，当序列大小大于 2kb 时，即使使用了 'score_only' 和 'one_alignment_only' 选项，它也会显示严重的内存泄漏，从而导致 'MemoryError'！

whole_coding_scores={}
from Bio import pairwise2
for genes in whole_coding: # whole coding is a <25Mb dict providing DNA sequences
   alignment=pairwise2.align.globalxx(whole_coding[genes][3],whole_coding[genes][4],score_only=True,one_alignment_only=True)
   whole_coding_scores[genes]=alignment/min(len(whole_coding[genes][3]),len(whole_coding[genes][4]))

超级计算机返回的结果：

Max vmem         = 256.114G  #Memory usage of the script
failed assumedly after job because:
job 4945543.1 died through signal XCPU (24)

我知道还有其他对齐工具，但它们主要可以将分数写入输出文件，需要再次读取和解析以检索和使用对齐分数。是否有任何工具可以像pairwise2那样在python环境中对齐序列并返回对齐分数但没有内存泄漏？

【问题讨论】：

标签： python alignment biopython

【解决方案1】：

对于全局对齐可以尝试 NWalign https://pypi.python.org/pypi/nwalign/。我没有使用它，但似乎您可以在脚本中恢复对齐分数。

否则 EMBOSS 工具可能会有所帮助：http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

【讨论】：

【解决方案2】：

首先，我为此使用了 BioPython 的针。可以在here找到一个不错的方法（忽略遗留设计:-)）

第二：也许你可以避免使用生成器将整个集合放入内存？我不知道你的“whole_coding”对象来自哪里。但是，如果它是一个文件，请确保您没有读取整个文件，然后遍历内存对象。例如：

whole_coding = open('big_file', 'rt').readlines() # Will consume memory

但是

for gene in open('big_file', 'rt'):     # will not read the whole thing into memory first
    process(gene)

如果你需要处理，你可以写一个生成器函数：

def gene_yielder(filename):
    for line in open('filename', 'rt'):
        line.strip()   # Here you preprocess your data
        yield line     # This will return

然后

for gene in  gene_yielder('big_file'):
    process_gene(gene)

基本上，您希望您的程序充当管道：事物流过它并得到处理。准备肉汤时不要将其用作烹饪锅：加入所有东西，然后加热。我希望这种比较不会牵强 :-)

【讨论】：

实际上，如果不从 python 中写入和读取数据，似乎没有任何解决方案。我还用 Biopython 针解决了这个问题，做了一些额外的工作，但完成了工作。其次，整个 dict 是一个

【解决方案3】：

Biopython 可以（现在）。 Biopython 版本中的pairwise2 模块。 1.68 （快得多）并且可以花费更长的序列。这是新旧pairwise2的比较（在32位Python 2.7.11上，内存限制为2 GByte，64位Win7，Intel Core i5，2.8 GHz）：

from Bio import pairwise2

length_of_sequence = ...
seq1 = '...'[:length_of_sequence]  # Two very long DNA sequences, e.g. human and green monkey
seq2 = '...'[:length_of_sequence]  # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp)
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)

老成对2
- 最大长度/时间：~1,900 个字符/10 秒
新的成对2
- 最大长度/时间：~7450 个字符/12 秒
- 1,900 个字符的时间：1 秒

将score_only 设置为True，新的pairwise2 可以在6 秒内完成两个约8400 个字符的序列。

【讨论】：