Algorithm optimization to find possible amino acid sequences with total mass m

【Question title】: Algorithm optimization to find possible amino acid sequences with total mass m [duplicate]
【Posted】: 2013-12-10 10:09:13
【Question】:

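This is a homework assignment. I have solved it, but I am trying to find a faster solution.

The problem is as follows: I need to figure out how many possible amino acid (aa) sequences have a total mass m. I have a table of amino acids (single-letter strings) and their corresponding masses (int), which I keep in a dictionary.

My initial solution was to create every possible combination of aa and compare each combination's total mass with the mass m. This works for small values of m, but once m reaches the hundreds the number of combinations becomes very large.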

I made a few small optimizations that got it to work for m [...]

Here is what I have so far:

totalmass = m

def pepList():
    tempList = ['']
    temp2List = []
    length = 0
    total = 0
    aminoList = 'GASPVTCINDKEMHFRYW'  #this are all the aminoacids

    while length < maxLength:
        for i in tempList:
            for j in aminoList:
                pepMass = peptideMass(i+j, massTable) #find the mass of 
                                                      #this peptide
                if pepMass == totalmass:
                    total += 1
                elif pepMass <= totalmass:
                    temp2List.append(i+j)


        tempList = []
        for i in temp2List:
            tempList.append(i)
        temp2List = []
        length = length + 1

    print (total)

pepList()

I can get the answer for m = 300 in about a second, but m = 500 takes about 40 seconds.

I tried an alternative approach using itertools, but it was not any faster:

total = 0
pepList = []

for i in range(maxLength+1):
    for p in itertools.combinations_with_replacement(aminoList, i): 
    #order matters for the total number of peptides but not for calculating 
    #the total mass
        amino = ''.join(p)
        if peptideMass(amino, massTable) == mass:
            pepList.append(amino)

print (len(pepList))

newpepList = []

for i in pepList:

    for p in itertools.permutations(i, r = len(i)): 
    #I use permutations here to get the total number because order matters
        if p not in newpepList:
            newpepList.append(p)

            total +=1

print (total)

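Sample input: m = 270. Output: 22

【Comments on the question】:

Welcome to StackOverflow. Your question is good: you provided a clear explanation and code. It would help, though, if you added some sample input and output so that other users can check that the solution they give you is correct. There is also one small issue: your question is very specific to your project and problem, so it is not applicable to everyone. To get better help, try to ask a more general question, perhaps focusing on one aspect of the code, and ask for help optimizing it in a new question.
Thanks. The other question seems to be exactly the same problem, so I will take a look at it.

Tags: python optimization python-3.x bioinformatics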

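【Solution 1】:

The order in which the amino acids appear does not change the mass: AAC weighs the same as ACA and CAA.

Therefore this can be reduced to a linear programming problem: find the values of the coefficients such that M = a*A + b*C + c*D + d*E + e*G + ... + r*W

Once you have found a solution, you can generate all possible permutations of that set of amino acids, or, if you only need the number of permutations, compute it directly.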
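The direct count mentioned above is a multinomial coefficient: a composition with n residues in total, of which n_1, n_2, ... are identical, has n! / (n_1! * n_2! * ...) distinct orderings. A minimal sketch follows; the function name `count_arrangements` is made up for illustration and is not part of the answer's code.

```python
from collections import Counter
from math import factorial


def count_arrangements(composition):
    """Return the number of distinct orderings of the letters in composition."""
    counts = Counter(composition)            # how often each residue occurs
    total = factorial(sum(counts.values()))  # n! for all residues together
    for n in counts.values():
        total //= factorial(n)               # divide out orderings of identical residues
    return total


print(count_arrangements("AAC"))  # 3: AAC, ACA, CAA
```

For a composition such as GGGV this gives 4! / (3! * 1!) = 4, without generating any of the orderings.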

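Edit:

As @Hooked pointed out, this is not linear programming, for two reasons: first, we need integer coefficients, and second, we are looking for all combinations rather than a single optimal solution.

I have devised a recursive generator, as follows: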

from math import floor, ceil
import profile

amino_weight = {
    'A':  71.038,
    'C': 103.009,
    'D': 115.027,
    'E': 129.043,
    'F': 147.068,
    'G':  57.021,
    'H': 137.059,
    'I': 113.084,
    'K': 128.095,
    'L': 113.084,   # you omitted leucine?
    'M': 131.040,
    'N': 114.043,
    'P':  97.053,
    'Q': 128.059,   # you omitted glutamine?
    'R': 156.101,
    'S':  87.032,
    'T': 101.048,
    'V':  99.068,
    'W': 186.079,
    'Y': 163.063
}

def get_float(prompt):
    while True:
        try:
            return float(raw_input(prompt))
        except ValueError:
            pass

# This is where the fun happens!
def get_mass_combos(aminos, pos, lo, hi, cutoff):
    this = aminos[pos]         # use a pointer into the string, to avoid copying 8 million partial strings around
    wt = amino_weight[this]
    kmax = int(floor(hi / wt))
    npos = pos - 1
    if npos:                   # more aminos to consider recursively
        for k in xrange(0, kmax + 1):
            mass    = k * wt
            nlo     = lo - mass
            nhi     = hi - mass
            ncutoff = cutoff - mass
            if nlo <= 0. and nhi >= 0.:
                # we found a winner!
                yield {this: k}
            elif ncutoff < 0.:
                # no further solution is possible
                break
            else:
                # recurse
                for cc in get_mass_combos(aminos, npos, nlo, nhi, ncutoff):
                    if k > 0: cc[this] = k
                    yield cc
    else:                      # last amino - it's this or nothing
        kmin = int(ceil(lo / wt))
        for k in xrange(kmin, kmax+1):
            yield {this: k}

def to_string(combo):
    keys = sorted(combo)
    return ''.join(k*combo[k] for k in keys)

def total_mass(combo):
    return sum(amino_weight[a]*n for a,n in combo.items())

def fact(n):
    num = 1
    for i in xrange(2, n+1):
        num *= i
    return num

def permutations(combo):
    num = 0
    div = 1
    for v in combo.values():
        num += v
        div *= fact(v)
    return fact(num) / div

def find_combos(lo, hi):
    total = 0
    bases = []
    aminos = ''.join(sorted(amino_weight, key = lambda x: amino_weight[x]))
    for combo in get_mass_combos(aminos, len(aminos)-1, lo, hi, hi - amino_weight[aminos[0]]):
        base = to_string(combo)
        bases.append(base)
        mass = total_mass(combo)
        cc = permutations(combo)
        total += cc
        print("{} (mass {}, {} permutations)".format(base, mass, cc))
    print('Total: {} bases, {} permutations'.format(len(bases), total))

def main():
    lo = get_float('Bottom of target mass range? ')
    hi = get_float('Top of target mass range? ')

    prof = profile.Profile()
    prof.run('find_combos({}, {})'.format(lo, hi))
    prof.print_stats()

if __name__=="__main__":
    main()

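It also searches a mass range, using floating-point amino acid masses. On my machine (i5-870), searching for masses between 748.0 and 752.0 returns 7,505 bases, totaling 9,400,528 permutations, in 3.82 seconds.

【Comments】:

If your coefficients a, b, c, d... must be integers, is it still considered a linear programming problem? I ask because that constraint (en.wikipedia.org/wiki/Linear_programming#Integer_unknowns) seems to make it NP-hard...
I tried this code, but at first it did not work when looking for an exact result (i.e. hi=lo). After some debugging, I found that for k in xrange(kmin, kmax+1) in get_mass_combos() never executed anything, because kmax+1 was always the same as kmin. I tried to fix it, but I am still not quite sure how the else statement is supposed to work. In the end I fixed it by changing if npos to if pos >= 0 and removing the else entirely.
@user3032890: (cringe) No, don't do that. The else statement handles the last amino acid to be considered, i.e. the point where there is no further recursion. At that point, only if kmin [...] almost but not quite all of the time. Doing what you suggest will completely break the algorithm, returning wrong solutions (or none at all).
I understand that it handles the last amino acid, but I don't understand how. As it stands, your code seems to ignore the last amino acid, and I don't know why. What I did was extend the recursion by one more step, and it works fine. You can try it yourself. Using the rounded mass table and a target mass of 270, you should get 22 permutations. In that case, all possible peptides are the permutations of GGGV, GGR, GAAA, GVN, AAK and NR. Your algorithm outputs only AAK and NR; it ignores all peptides containing G.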
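The test case in the last comment (target mass 270, expected count 22) can be checked independently of the generator. Below is a minimal sketch that counts ordered sequences by dynamic programming over the mass. It is a different technique from the answer's recursive generator, not a fix for it. The table `INT_MASS` is an assumption: the values of `amino_weight` rounded to integers, restricted to the 18 letters in the question's `aminoList` (no L, no Q). The function name `count_sequences` is made up for illustration.

```python
# Assumed integer masses: amino_weight rounded, for the question's 18 letters only.
INT_MASS = {
    'G': 57, 'A': 71, 'S': 87, 'P': 97, 'V': 99, 'T': 101,
    'C': 103, 'I': 113, 'N': 114, 'D': 115, 'K': 128, 'E': 129,
    'M': 131, 'H': 137, 'F': 147, 'R': 156, 'Y': 163, 'W': 186,
}


def count_sequences(target, masses=INT_MASS):
    """Count ordered amino acid sequences whose integer masses sum to target."""
    # ways[m] holds the number of ordered sequences with total mass m.
    ways = [0] * (target + 1)
    ways[0] = 1  # the empty sequence has mass 0
    for m in range(1, target + 1):
        # A sequence of mass m ends in some residue of mass w <= m;
        # what precedes it is any sequence of mass m - w.
        ways[m] = sum(ways[m - w] for w in masses.values() if w <= m)
    return ways[target]


print(count_sequences(270))  # 22
```

The work grows with target times the number of amino acids, so masses in the hundreds or thousands stay cheap. The 22 sequences are exactly the orderings of GGGV (4), GGR (3), GAAA (4), GVN (6), AAK (3) and NR (2) listed in the comment.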

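【Solution 2】:

Inspired by Hugh's answer: here is a solution using numpy. It always computes all combinations, so it uses a lot of memory, but it has linear complexity.

The idea is to store all possible coefficient arrays in a single numpy array (C) and use numpy.dot to compute the mass sum for each coefficient array, storing the results in the array PROD.

The problem then becomes finding which elements of PROD contain the desired mass M and returning (based on the element's index in PROD) the actual PEP_NAMES.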
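The same idea can be sketched on a toy input. The version below is written without numpy so that it runs on the standard library alone: each tuple from `itertools.product` plays the role of one row of C, and the inner sum plays the role of the dot product with the mass vector. The function name `subsets_with_mass` and the three-entry toy table are made up for illustration.

```python
from itertools import product


def subsets_with_mass(names, masses, target):
    """Return every subset (each entry used at most once) whose masses sum to target."""
    hits = []
    # Each coeffs tuple is one row of the 0/1 coefficient matrix C.
    for coeffs in product((0, 1), repeat=len(masses)):
        # Dot product of the coefficient row with the mass vector.
        total = sum(c * m for c, m in zip(coeffs, masses))
        if total == target:
            hits.append(''.join(n for n, c in zip(names, coeffs) if c))
    return hits


# Toy data for illustration only.
print(subsets_with_mass(['G', 'A', 'S'], [57, 71, 87], 128))  # ['GA']
```

Two properties of this scheme are worth keeping in mind. The coefficients are only 0 or 1, so a composition that repeats a residue (GG, with mass 114, in the toy table) can never be found. The number of rows is 2**n for n amino acids, which is why memory use is high.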

It would be interesting to know whether it actually produces correct results:

import sys
import numpy as np

def main():
    try:
        M = int(sys.argv[1])
    except Exception:
        print "Usage: %s M" % sys.argv[0]
        print "    M = total mass"
        sys.exit()

    PEP_NAMES      =          ['G', 'A', 'S', 'P', 'V', 'T', 'C', 'I', 'N', 'D', 'K', 'E', 'M', 'H', 'F', 'R', 'Y', 'W']
    # random data
    PEP_MASSES     = np.array([ 71,  99,  14,  37,  61,  63,  83,   3,  52,  43,   2,  80,  18,  37,  56,  36,  96,  13])
    LEN_PEP_MASSES = len(PEP_MASSES)
    NUM_COMB       = 2**LEN_PEP_MASSES-1

    # numpy array containing all possible coefficients
    C = np.array([[int(x) for x in np.binary_repr(K, width=LEN_PEP_MASSES)] for K in xrange(NUM_COMB)])
    # each element is an array of coefficients representing a number between 0 and NUM_COMB in binary form
    print "type(C)      = %s" % type(C)
    print "type(C[0])   = %s" % type(C[0])
    print "C.shape      = %s" % str(C.shape)
    print "C[0].shape   = %s" % str(C[0].shape)
    print "C[0]         = %s" % C[0]
    print "C[15]        = %s" % C[15]
    print "C[255]       = %s" % C[255]

    # Calculate sum of all combinations
    PROD = C.dot(PEP_MASSES)

    # find the ones that match M
    valid_combinations = [(i,x) for i,x in enumerate(PROD) if x == M]
    print 'Found %d possibilities with total mass = %d:' % (len(valid_combinations), M)
    print valid_combinations
    for comb_index, comb_mass in valid_combinations:
        # work back the combinations in string format
        comb_str = [PEP_NAMES[i] for i,x in enumerate(C[comb_index]) if x==1]
        print '%10d --> %s' % (comb_index, ''.join(comb_str))

if __name__ == '__main__':
    main()

Sample output:

python test.py 750
type(C)      = <type 'numpy.ndarray'>
type(C[0])   = <type 'numpy.ndarray'>
C.shape      = (262143, 18)
C[0].shape   = (18,)
C[0]         = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
C[15]        = [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1]
C[255]       = [0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1]
Found 24 possibilities with total mass = 750:
[(130815, 750), (196478, 750), (204671, 750), (208895, 750), (212575, 750), (220155, 750), (221039, 750), (225263, 750), (227455, 750), (228059, 750), (228943, 750), (229151, 750), (236542, 750), (244446, 750), (244695, 750), (252910, 750), (257914, 750), (260062, 750), (260814, 750), (260988, 750), (261022, 750), (261063, 750), (261750, 750), (262109, 750)]
    130815 --> ASPVTCINKEMHFRYW
    196478 --> GSPVTCINDEMHFRY
    204671 --> GATCINDEMHFRYW
    208895 --> GAVCINDKEMHFRYW
    212575 --> GAVTCINEHFRYW
    220155 --> GAPTCNDKEMHFYW
    221039 --> GAPTCINDEMFRYW
    225263 --> GAPVCINDKEMFRYW
    227455 --> GAPVTCEMHFRYW
    228059 --> GAPVTCNKEHFYW
    228943 --> GAPVTCINEFRYW
    229151 --> GAPVTCINDHFRYW
    236542 --> GASTCNDKEMHFRY
    244446 --> GASVTCNKEHFRY
    244695 --> GASVTCNDKEHRYW
    252910 --> GASPTCNDKEMFRY
    257914 --> GASPVCINDEMHFY
    260062 --> GASPVTINDKEHFRY
    260814 --> GASPVTCNKEFRY
    260988 --> GASPVTCNDEMHFR
    261022 --> GASPVTCNDKHFRY
    261063 --> GASPVTCNDKERYW
    261750 --> GASPVTCINEMHRY
    262109 --> GASPVTCINDKEHFRW

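It takes about 15 seconds to run on my laptop.

Note that it gives you all combinations (!), i.e. the order of the elements does not matter. If you need all permutations, you just need to iterate over each result and generate them.

【Comments】: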