查找重叠和非重叠区域答案

【问题标题】：Finding overlapping and non overlapping regions查找重叠和非重叠区域
【发布时间】：2014-04-02 09:48:07
【问题描述】：

我有一个包含数百万次命中的 BLAST 表格输出。Con 是我的序列，P 是蛋白质命中。我有兴趣区分与下面说明的 3 个案例相对应的命中。它们都应该打印在 3 个单独的新文件中，并且文件 1 中的 contigs 不应该在文件 2,3 等中。如何做到这一点？

con1 ----------------------- (Contigs with both overlapping and non overlapping hits)
   p1---- p2 ------    p4---
               p3-----
con2 --------------------- (only overlapping)  con3 ----------------(only non overlp)
   p1 -----                                       p1 ----  p2 -----
      p2 -------
          p3 -----

如果我知道蛋白质的起始和结束位点，就可以识别重叠或非重叠；如果 S1 0。我的输入文件，即

contig  protein start end
con1    P1  481 931
con1    P2  140 602
con1    P3  232 548
con2    P4  335 406
con2    P5  642 732
con2    P6  2282    2433
con2    P7  729 812
con3    P8  17  148
con3    P9  289 45

我的脚本（这只会打印出不重叠的匹配项）

from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = []
    overst =  False
    for i in range(1,len(hits)):
        (p, c) = hits[i-1], hits[i]
        if c[2] > p[3]:
            if not overst: nonover.append(p)
            nonover.append(c)
            overst = True  
    return nonover

fh = open('file.txt')
oh = open('result.txt', 'w')
for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        for hit in nonoverlapping(hits):
            oh.write('\t'.join([str(f) for f in hit])+'\n')

【问题讨论】：

标签： python bioinformatics

【解决方案1】：

我会做这样的事情。为两个命中定义一个“重叠”函数，然后测试每个重叠群是否全部重叠、部分重叠或不重叠。然后将所有 contigs 写入所需的文件：

from itertools import groupby

def overlaps(a, b):
    result = True
    # Supposing a[2] is the start, a[3] the end. 
    # If end before start, they are not overlapping
    if a[3] < b[2] or b[3] < a[2]:
        result = False

    return result

def test_overlapping(hits):
    overlapping = 'None'
    overlapping_count = 0
    for i in range(len(hits)-1):
        if overlaps(hits[i], hits[i+1]):
            overlapping_count += 1


    if overlapping_count == 0:
        overlapping = 'None'
    elif overlapping_count == len(hits) -1:
        overlapping = 'All'
    else:
        overlapping = 'Some'

    return overlapping


fh = open('file.txt')

file_all = open('result_all.txt', 'w')
file_some = open('result_some.txt', 'w')
file_none = open('result_none.txt', 'w')

line = fh.readline() # quit header

for qid, grp in groupby(fh, lambda l: l.split()[0]):
    hits = []
    for line in grp:
        hsp = line.split()
        hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
        hits.append(hsp)
    if len(hits) > 1: 
        hits.sort(key=lambda x: x[2])
        overlapping = test_overlapping(hits)
        out_file = file_none
        if overlapping == 'All':
            out_file = file_all
        elif overlapping == 'Some':
            out_file = file_some

        for h in hits:
            out_file.write('\t'.join([str(v) for v in h]))
            out_file.write('\n')


file_all.close()
file_some.close()
file_none.close()

【讨论】：

谢谢，我知道了你的算法路线，但是 cud 没有运行脚本。如果我只添加 if c[2] p[ 3]：行？
我测试了整个东西，现在放上去。这个对我有用。问题是，对于每个 contig，您需要查看它是否具有重叠读取，以查看要在哪个文件中写入它，对吗？我就是这么理解的。让我知道这是否解决了问题:)
你知道，它似乎有效，我只需要仔细检查，因为我的数据非常庞大，非常感谢！
不客气 :) 代码没有经过详尽的测试，这只是一个似乎可行的想法，但如果您发现任何问题，您可以随时说些什么。