Checking for specific amino acids in specific positions in a multiple sequence alignment: answers

【Question title】: Checking for specific amino acids in specific positions in a multiple sequence alignment
【Posted】: 2016-01-07 04:53:33
【Question description】:

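There is a similar question on Stack Overflow, but it uses the Linux terminal (Search for specific characters in specific positions of line). I want to do something similar in Python, and I can't quite figure out a Pythonic way to do it without writing every membership check by hand.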

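I want to search a multiple sequence alignment for specific amino acids at specific positions. I have defined the alignment positions in an index list,

e.g. Index = [1, 100, 235, 500].

I have also defined the amino acids I want at each of these positions.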

Res1 = ["A","G"]
Res2 = ["T","F"]
Res3 = ["S","W"]
Res4 = ["H","J"]

I am currently doing something like this:

for m in records_dict:
    seq = records_dict[m].seq
    # One hand-written membership check per position
    if (seq[Index[0]] in Res1 and seq[Index[1]] in Res2
            and seq[Index[2]] in Res3 and seq[Index[3]] in Res4):
        print(m)

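Now suppose I have a list of 40 residues to check. I know I have to write out the residue lists by hand, but surely there is an easier way to run the checks, using a while loop or something similar.

Also, is there a way to build in a fallback so that, if no sequence passes all 40 membership checks, I get the 5 sequences that come closest to passing all 40, with output such as: sequence "m" has 30/40 matches, together with the list of the 30 that matched and the 10 that did not?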

【Question discussion】:

Tags: python alignment sequence membership biopython

【Solution 1】:

I assume you want to check whether a residue from Res1 is at Index[0], one from Res2 at Index[1], and so on.

res = [Res1, Res2, Res3, Res4]
for m in records_dict:
    match = 0
    match_log = []
    # Pair each alignment position with its own list of allowed residues
    for i, allowed in zip(Index, res):
        if records_dict[m].seq[i] in allowed:
            match += 1
            match_log.append(i)

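With this small piece of code you can count the matches and keep track of the indexes at which they occur for each record in records_dict.

If you want to check whether ResX occurs at several positions, or if you don't want to keep the index list in the same order as the Res lists, I would define a dictionary of lists whose keys are the ResX groups and whose values are lists of indexes: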
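For the second part of the question (the closest sequences, and which checks failed), the same loop can keep a per-sequence score. The following is only a minimal sketch: plain strings stand in for `records_dict[m].seq`, and the toy sequences, `score_record`, and `best_matches` are hypothetical names made up for illustration, not part of Biopython.

```python
# Toy stand-in for records_dict: record id -> aligned sequence (plain str).
# Assumption: with Biopython, str(records_dict[m].seq) would play this role.
toy_records = {
    "seq1": "ATSH",
    "seq2": "GFWH",
    "seq3": "ATSK",
    "seq4": "CCCC",
}
Index = [0, 1, 2, 3]
res = [["A", "G"], ["T", "F"], ["S", "W"], ["H", "J"]]


def score_record(seq, positions, allowed_lists):
    """Return (matched_positions, mismatched_positions) for one sequence."""
    matched, mismatched = [], []
    for i, allowed in zip(positions, allowed_lists):
        if seq[i] in allowed:
            matched.append(i)
        else:
            mismatched.append(i)
    return matched, mismatched


def best_matches(records, positions, allowed_lists, n=5):
    """Return up to n records with the most matching positions, best first."""
    scored = []
    for name, seq in records.items():
        matched, mismatched = score_record(seq, positions, allowed_lists)
        scored.append((name, matched, mismatched))
    # sorted() is stable, so ties keep their original order
    scored.sort(key=lambda item: len(item[1]), reverse=True)
    return scored[:n]


for name, matched, mismatched in best_matches(toy_records, Index, res, n=3):
    print(f"{name}: {len(matched)}/{len(Index)} matches; "
          f"matched={matched}, mismatched={mismatched}")
```

With 40 positions the only change would be longer `Index` and `res` lists; the loop itself stays the same.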

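to_check = {}
# Lists are unhashable, so each residue list is converted to a tuple to serve as a key
to_check[tuple(Res1)] = [index1, index2]
to_check[tuple(Res2)] = [index1, ..., indexN]
...
to_check[tuple(ResX)] = [indexI, ..., indexJ]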

Then, use:

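match_log = {}
for m in records_dict:
    match_log[m] = {}
    # .items() yields (key, value) pairs; iterating the dict alone yields only keys
    for res, indexes in to_check.items():
        match_log[m][res] = []
        for i in indexes:
            if records_dict[m].seq[i] in res:
                match_log[m][res].append(i)
        # Number of matching positions for this residue group
        nb_match = len(match_log[m][res])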

Or, in a more Pythonic way, using filter:

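match_log = {}
for m in records_dict:
    match_log[m] = {}
    for res, indexes in to_check.items():
        # In Python 3, filter() returns an iterator, so wrap it in list() before calling len()
        match_log[m][res] = list(filter(lambda i: records_dict[m].seq[i] in res, indexes))
        nb_match = len(match_log[m][res])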
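To make the dictionary variant concrete, here is a small self-contained sketch. It assumes tuples for the residue groups (a Python list cannot be a dictionary key), uses made-up toy sequences and positions, and lets plain strings stand in for `records_dict[m].seq`.

```python
# Toy stand-in for records_dict: record id -> aligned sequence (plain str).
toy_records = {"seq1": "ATSHA", "seq2": "GFWKC"}

# Residue group (a tuple, so it is hashable) -> positions where it is expected.
to_check = {
    ("A", "G"): [0, 4],
    ("T", "F"): [1],
    ("H", "J"): [3],
}

match_log = {}
for m, seq in toy_records.items():
    match_log[m] = {}
    for res, indexes in to_check.items():
        # Keep only the positions whose residue belongs to this group
        match_log[m][res] = [i for i in indexes if seq[i] in res]

# Total number of matching positions per record, summed over all groups
nb_match = {m: sum(len(hits) for hits in log.values())
            for m, log in match_log.items()}

print(match_log)
print(nb_match)
```

A list comprehension is used here in place of `filter`; both select the same positions.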

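【Discussion】:

Thanks Nizil, this helps. But what should I do if the Index order differs from the order of the Res lists? I assume you would have to enter it by hand. For example, if index[0] == ResAsp and index[2] == ResGlu (instead of Res1 and Res2). Would I put all the ResX lists in another list and iterate over both at the same time?
@user1998510 Using a dictionary should be more effective; I have updated my answer ;)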
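
One possible way around the ordering concern raised in the comment is to key the dictionary by position instead, so that each position carries its own allowed residues. This is an alternative sketch rather than the answerer's code; `allowed_at`, `report`, and the toy data are hypothetical, and plain strings again stand in for `records_dict[m].seq`.

```python
# Position -> set of residues accepted at that position (hypothetical values).
allowed_at = {
    0: {"A", "G"},
    2: {"D"},  # e.g. an Asp-only position
    3: {"E"},  # e.g. a Glu-only position
}

# Toy stand-in for records_dict: record id -> aligned sequence (plain str).
toy_records = {"seq1": "AKDE", "seq2": "GKDQ", "seq3": "CKAE"}

report = {}
for name, seq in toy_records.items():
    # Visit positions in sorted order so the list of failures is stable
    failed = [pos for pos in sorted(allowed_at)
              if seq[pos] not in allowed_at[pos]]
    report[name] = failed
    passed = len(allowed_at) - len(failed)
    print(f"{name}: {passed}/{len(allowed_at)} matches, failed positions: {failed}")
```

Because each position names its own residue set, the order in which the entries are written no longer matters.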