【问题标题】:how to extend ambiguous dna sequence如何扩展不明确的 dna 序列
【发布时间】:2014-12-18 17:05:06
【问题描述】:

假设您有这样的 DNA 序列:

AATCRVTAA

其中RV 是DNA 核苷酸的模糊值,其中R 代表AGV 代表ACG

是否有一种 Biopython 方法可以生成上述模糊序列可以表示的所有不同序列组合?

例如,这里的输出将是:

AATCAATAA
AATCACTAA
AATCAGTAA
AATCGATAA
AATCGCTAA
AATCGGTAA

【问题讨论】:

    标签: python biopython dna-sequence


    【解决方案1】:

    也许是一种更短更快的方法,因为这个函数很可能会用于非常大的数据:

    from Bio import Seq
    from itertools import product
    
    def extend_ambiguous_dna(seq):
       """return list of all possible sequences given an ambiguous DNA input"""
       d = Seq.IUPAC.IUPACData.ambiguous_dna_values
       return [ list(map("".join, product(*map(d.get, seq)))) ]
    

    使用map 可以让您的循环在C 而非Python 中执行。这应该比使用普通循环甚至列表推导式要快得多。

    现场测试

    使用d 的简单字典,而不是ambiguous_na_values 返回的字典

    from itertools import product
    import time
    
    d = { "N": ["A", "G", "T", "C"], "R": ["C", "A", "T", "G"] }
    seq = "RNRN"
    
    # using list comprehensions
    lst_start = time.time()
    [ "".join(i) for i in product(*[ d[j] for j in seq ]) ]
    lst_end = time.time()
    
    # using map
    map_start = time.time()
    [ list(map("".join, product(*map(d.get, seq)))) ]
    map_end = time.time()
    
    lst_delay = (lst_end - lst_start) * 1000
    map_delay = (map_end - map_start) * 1000
    
    print("List delay: {} ms".format(round(lst_delay, 2)))
    print("Map delay: {} ms".format(round(map_delay, 2)))
    

    输出:

    # len(seq) = 2:
    List delay: 0.02 ms
    Map delay: 0.01 ms
    
    # len(seq) = 3:
    List delay: 0.04 ms
    Map delay: 0.02 ms
    
    # len(seq) = 4
    List delay: 0.08 ms
    Map delay: 0.06 ms
    
    # len(seq) = 5
    List delay: 0.43 ms
    Map delay: 0.17 ms
    
    # len(seq) = 10
    List delay: 126.68 ms
    Map delay: 77.15 ms
    
    # len(seq) = 12
    List delay: 1887.53 ms
    Map delay: 1320.49 ms
    

    显然map 更好,但只有 2 或 3 倍。它肯定可以进一步优化。

    【讨论】:

    • 由于Biopython的变化,可能与this有关,此代码需要更改才能工作。将第一行更改为import Bio.Data.IUPACData as bdi,然后将d = Seq.IUPAC.IUPACData.ambiguous_dna_values 更改为d = bdi.ambiguous_dna_values
    【解决方案2】:

    我最终编写了自己的函数:

    from Bio import Seq
    from itertools import product
    
    def extend_ambiguous_dna(seq):
       """return list of all possible sequences given an ambiguous DNA input"""
       d = Seq.IUPAC.IUPACData.ambiguous_dna_values
       r = []
       for i in product(*[d[j] for j in seq]):
          r.append("".join(i))
       return r 
    
    In [1]: extend_ambiguous_dna("AV")
    Out[1]: ['AA', 'AC', 'AG']
    

    它允许您使用

    生成给定尺寸的每个模式
    In [2]: extend_ambiguous_dna("NN")
    
    Out[2]: ['GG', 'GA', 'GT', 'GC',
             'AG', 'AA', 'AT', 'AC',
             'TG', 'TA', 'TT', 'TC',
             'CG', 'CA', 'CT', 'CC']
    

    希望这会为其他人节省时间!

    【讨论】:

      【解决方案3】:

      我不确定 biopython 的方法可以做到这一点,但这里有一个使用 itertools 的方法:

      s = "AATCRVTAA"
      ambig = {"R": ["A", "G"], "V":["A", "C", "G"]}
      groups = itertools.groupby(s, lambda char:char not in ambig)
      splits = []
      for b,group in groups:
          if b:
              splits.extend([[g] for g in group])
          else:
              for nuc in group:
                  splits.append(ambig[nuc])
      answer = [''.join(p) for p in itertools.product(*splits)]
      

      输出:

      In [189]: answer
      Out[189]: ['AATCAATAA', 'AATCACTAA', 'AATCAGTAA', 'AATCGATAA', 'AATCGCTAA', 'AATCGGTAA']
      

      【讨论】:

        【解决方案4】:

        还有一个 itertools 解决方案:

        from itertools import product
        import re
        
        lu = {'R':'AG', 'V':'ACG'}
        
        def get_seqs(seq):
            seqs = []
            nrepl = seq.count('R') + seq.count('V')
            sp_seq = [a for a in re.split(r'(R|V)', seq) if a]
            pr_terms = [lu[a] for a in sp_seq if a in 'RV']
        
            for cmb in product(*pr_terms):
                seqs.append(''.join(sp_seq).replace('R', '%s').replace('V', '%s') % cmb)
            return seqs
        
        seq = 'AATCRVTAA'
        
        print 'seq: ', seq
        print '\n'.join(get_seqs(seq))
        
        seq1 = 'RAATCRVTAAR'
        print 'seq: ', seq1
        print '\n'.join(get_seqs(seq1))
        

        输出:

        seq:  AATCRVTAA
        AATCAATAA
        AATCACTAA
        AATCAGTAA
        AATCGATAA
        AATCGCTAA
        AATCGGTAA
        seq:  RAATCRVTAAR
        AAATCAATAAA
        AAATCAATAAG
        AAATCACTAAA
        AAATCACTAAG
        AAATCAGTAAA
        AAATCAGTAAG
        AAATCGATAAA
        AAATCGATAAG
        AAATCGCTAAA
        AAATCGCTAAG
        AAATCGGTAAA
        AAATCGGTAAG
        GAATCAATAAA
        GAATCAATAAG
        GAATCACTAAA
        GAATCACTAAG
        GAATCAGTAAA
        GAATCAGTAAG
        GAATCGATAAA
        GAATCGATAAG
        GAATCGCTAAA
        GAATCGCTAAG
        GAATCGGTAAA
        GAATCGGTAAG
        

        【讨论】:

        • 在我们有两个或多个相同的相邻模糊代码(如“RRATCGGTAAA”)的特殊情况下输出错误
        猜你喜欢
        • 1970-01-01
        • 2013-05-26
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 2014-02-22
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多