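【Posted】: 2014-04-02 09:48:07
【Question】:
I have a BLAST tabular output with millions of hits. Con is my sequence (contig) and P is a protein hit. I want to separate the hits that correspond to the 3 cases illustrated below. Each case should be printed to its own new file, and contigs in file 1 should not appear in files 2, 3, etc. How can I do this?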
con1 -----------------------   (contigs with both overlapping and non-overlapping hits)
  p1----   p2 ------   p4---
              p3-----
con2 ---------------------   (only overlapping)     con3 ----------------   (only non-overlapping)
  p1 -----                                            p1 ----   p2 -----
     p2 -------
         p3 -----
Overlap or non-overlap can be identified if I know the start and end positions of the protein hits; if S1 ... 0.
My input file:
contig protein start end
con1 P1 481 931
con1 P2 140 602
con1 P3 232 548
con2 P4 335 406
con2 P5 642 732
con2 P6 2282 2433
con2 P7 729 812
con3 P8 17 148
con3 P9 289 45
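The overlap test described above can be sketched as follows. This is only an illustration, not the asker's code: the function name `overlaps` is hypothetical, coordinates are assumed to be inclusive, and start and end are swapped when start > end (the con3 P9 row lists 289 before 45, which might be a reverse-strand hit or a typo).

```python
def overlaps(a, b):
    """Return True if two (start, end) intervals share at least one position.

    Assumption: coordinates are inclusive, and a pair given as
    (larger, smaller) describes the same span as (smaller, larger).
    """
    a_start, a_end = sorted(a)
    b_start, b_end = sorted(b)
    return a_start <= b_end and b_start <= a_end


# Rows taken from the example input above.
p1 = (481, 931)  # con1 P1
p2 = (140, 602)  # con1 P2
p4 = (335, 406)  # con2 P4
p5 = (642, 732)  # con2 P5

print(overlaps(p1, p2))  # True: 481..931 and 140..602 share 481..602
print(overlaps(p4, p5))  # False: 406 < 642
```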
My script (this only prints the non-overlapping hits):
from itertools import groupby

def nonoverlapping(hits):
    """Returns a list of non-overlapping hits."""
    nonover = []
    overst = False  # True once the previous hit has already been added
    for i in range(1, len(hits)):
        (p, c) = hits[i-1], hits[i]
        # The current hit starts after the previous one ends: no overlap
        if c[2] > p[3]:
            if not overst:
                nonover.append(p)
            nonover.append(c)
            overst = True
    return nonover

# Assumes file.txt has no header line and is already grouped by contig
with open('file.txt') as fh, open('result.txt', 'w') as oh:
    for qid, grp in groupby(fh, lambda l: l.split()[0]):
        hits = []
        for line in grp:
            hsp = line.split()
            hsp[2], hsp[3] = int(hsp[2]), int(hsp[3])
            hits.append(hsp)
        if len(hits) > 1:
            # Sort the hits of this contig by start position
            hits.sort(key=lambda x: x[2])
            for hit in nonoverlapping(hits):
                oh.write('\t'.join([str(f) for f in hit])+'\n')
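One possible way to sort contigs into the three cases from the question is sketched below. It is a rough illustration under stated assumptions, not a definitive answer: `classify_contigs` and the category names are hypothetical, coordinates are treated as inclusive, start and end are swapped when start > end, and a contig with a single hit is counted as non-overlapping (the script above skips such contigs instead).

```python
from collections import defaultdict


def classify_contigs(rows):
    """Map each contig to 'both', 'overlapping', or 'non_overlapping'.

    rows: iterable of (contig, protein, start, end) tuples.
    A hit counts as overlapping if it shares a position with any other
    hit on the same contig (inclusive coordinates; an assumption).
    """
    by_contig = defaultdict(list)
    for contig, _protein, start, end in rows:
        lo, hi = sorted((int(start), int(end)))  # tolerate start > end
        by_contig[contig].append((lo, hi))

    categories = {}
    for contig, hits in by_contig.items():
        hits.sort()  # by start, then end
        flags = []
        max_end_before = None  # largest end among earlier hits
        for i, (lo, hi) in enumerate(hits):
            # Overlaps an earlier hit if it starts before one of them ends
            hits_earlier = max_end_before is not None and lo <= max_end_before
            # Overlaps a later hit if the next hit starts before this one ends
            hits_later = i + 1 < len(hits) and hits[i + 1][0] <= hi
            flags.append(hits_earlier or hits_later)
            max_end_before = hi if max_end_before is None else max(max_end_before, hi)
        if all(flags):
            categories[contig] = "overlapping"
        elif not any(flags):
            categories[contig] = "non_overlapping"
        else:
            categories[contig] = "both"
    return categories


# con1 and con2 rows from the example input in the question.
rows = [
    ("con1", "P1", 481, 931),
    ("con1", "P2", 140, 602),
    ("con1", "P3", 232, 548),
    ("con2", "P4", 335, 406),
    ("con2", "P5", 642, 732),
    ("con2", "P6", 2282, 2433),
    ("con2", "P7", 729, 812),
]
print(classify_contigs(rows))
# {'con1': 'overlapping', 'con2': 'both'}
```

Each input line could then be written to one of three output files chosen by the category of its contig, so no contig ends up in more than one file. This sketch holds all hits in memory; for millions of hits in a file already grouped by contig, the same per-contig logic could be applied inside a `groupby` loop like the one in the script above.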
【Comments】:
Tags: python bioinformatics