【Question title】: Python - how to get rid of the boundary range overlap between lines
【Posted】: 2017-01-02 13:51:37
【Question】:

I have a file that looks like this:

1 33725 36725 ENHANCER0002
1 711760 714760 ENHANCER0003
1 724150 727150 ENHANCER0004
1 725455 728455 ENHANCER0005
1 871280 874410 ENHANCER0006
1 874180 877180 ENHANCER0007
1 900540 903540 ENHANCER0008
1 901475 904475 ENHANCER0009
1 910260 913260 ENHANCER00010
1 933355 936355 ENHANCER00011
1 947660 950660 ENHANCER00012
1 1013530 1016530 ENHANCER00013
. . .
1 2477030 2480030 ENHANCER00043
1 2478160 2481160 ENHANCER00044
1 2478845 2481845 ENHANCER00045

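The two middle columns are my lower and upper boundaries. As in lines 3-4 or lines 5-6, the boundaries sometimes overlap. I need to reshape the file so that, when boundaries overlap, only the lowest lower boundary and the highest upper boundary are printed. I am looking for a Python solution, and this is my code: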

def write_line(chr_no,tmp_l,tmp_h,cnt,filename):
    filename.write(str(chr_no)+"\t"+str(tmp_l)+"\t"+str(tmp_h)+"\t"+"ENHANCER000"+str(cnt)+"\n")


inf = open("/home/firat/Desktop/Onder_Lab/Kenan/enhancers_bj.bed","r")
outf = open("/home/firat/Desktop/deneme_v3.bed","w")

cnt = 0
tmp_l=0
tmp_h=0

tmp_list = []

for line in inf:
    cnt += 1
    line = line.split(' ')
    current_low = line[1]
    current_high = line[2]
    previous_low = tmp_l
    previous_high = tmp_h
    if (int(current_low) <= int(previous_high)):
        tmp_list.append(int(current_low))
        tmp_list.append(int(current_high))
        tmp_list.append(int(previous_low))
        tmp_list.append(int(previous_high))
        write_line(line[0],min(tmp_list),max(tmp_list),cnt,outf)
        tmp_l = min(tmp_list)
        tmp_h = max(tmp_list)
        tmp_list = []
    else:
        write_line(line[0], previous_low, previous_high, cnt, outf)
        tmp_l= current_low
        tmp_h= current_high

Although my solution seems to work, the output looks like this:

1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 724150 728455 ENHANCER0006
1 871280 877180 ENHANCER0007
1 871280 877180 ENHANCER0008
1 900540 904475 ENHANCER0009
1 900540 904475 ENHANCER00010
1 910260 913260 ENHANCER00011
1 933355 936355 ENHANCER00012
1 947660 950660 ENHANCER00013
1 1013530 1016530 ENHANCER00014
. . .
1 2477030 2481160 ENHANCER00044
1 2477030 2481845 ENHANCER00045
1 2477030 2481845 ENHANCER00046

As mentioned, when boundaries overlap, the output contains duplicate lines. There are also cases where 3 lines overlap, as at the very bottom. The expected output should look like this:

1 27460 30460 ENHANCER0002
1 33725 36725 ENHANCER0003
1 711760 714760 ENHANCER0004
1 724150 728455 ENHANCER0005
1 871280 877180 ENHANCER0006
1 900540 904475 ENHANCER0007
1 910260 913260 ENHANCER0008
. . .
1 2477030 2481845 ENHANCER00046

What is wrong with my code, and how can I improve it so that it works even when more than 2 lines overlap?

【Question comments】:

    Tags: python arrays algorithm


    【Solution 1】:

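    Your code seems overly complicated for a simple task. You don't need four variables (tmp_l, tmp_h, previous_low and previous_high), and you don't need to maintain a list of the current overlapping intervals either. All you need to keep track of is the low and the high of the current group of overlapping intervals.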

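    The actual problem with your code, however, is that you call write_line on every iteration. Instead, call write_line only when the current low is greater than the previous high, which means the previous group of overlapping intervals has ended, and once more after the loop finishes.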

    The following code works:

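    # Defaults meaning "no group seen yet"; cnt is initialised to 0 as in the question.
    previous_low = 0
    previous_high = 0

    for line in inf:  # a file object is iterated line by line
        cnt += 1  # counts input lines, so IDs skip numbers after a merge
        line = line.split()  # split on any whitespace (spaces or tabs)
        current_low = int(line[1])
        current_high = int(line[2])
        if current_low <= previous_high:
            # Overlaps the current group: extend its upper boundary.
            previous_high = max(previous_high, current_high)
        else:
            # The previous group has ended: write it out, then start a new one.
            if previous_high > 0:
                write_line(line[0], previous_low, previous_high, cnt, outf)
            previous_low = current_low
            previous_high = current_high

    # Write the last group of overlapping intervals.
    if previous_high > 0:
        write_line(line[0], previous_low, previous_high, cnt, outf)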
    

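    The check if previous_high > 0 is needed so that the default values of previous_low and previous_high (0, 0) are not written out. The extra write_line after the for loop is needed to output the last group of overlapping intervals.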
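
One way to avoid relying on 0, 0 as a "nothing seen yet" marker is to use None for the current group and to emit the last group after the loop. The sketch below is only an illustration under some assumptions: the input is whitespace-separated and sorted by lower boundary within each chromosome, merged records are numbered consecutively, and merged_groups and the io.StringIO sample are hypothetical names and data, not part of the original code.

```python
import io


def merged_groups(lines):
    """Yield (chrom, low, high) for each run of overlapping intervals.

    Assumes the lines are sorted by lower boundary within each chromosome.
    """
    current = None  # None means "no group started yet"
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        chrom, low, high = fields[0], int(fields[1]), int(fields[2])
        if current is not None and chrom == current[0] and low <= current[2]:
            # Overlap: keep the lowest low, extend to the highest high.
            current = (chrom, current[1], max(current[2], high))
        else:
            if current is not None:
                yield current  # the previous group has ended
            current = (chrom, low, high)
    if current is not None:
        yield current  # the last group is only known after the loop


# Illustrative data taken from the sample in the question.
sample = io.StringIO(
    "1 711760 714760 ENHANCER0003\n"
    "1 724150 727150 ENHANCER0004\n"
    "1 725455 728455 ENHANCER0005\n"
)
out = io.StringIO()
for number, (chrom, low, high) in enumerate(merged_groups(sample), start=1):
    # Consecutive numbering is an assumption about the desired IDs.
    out.write(f"{chrom}\t{low}\t{high}\tENHANCER000{number}\n")
print(out.getvalue(), end="")
```

Because the function accepts any iterable of lines, an open file can be passed in place of the io.StringIO object, and the chromosome comparison keeps intervals on different chromosomes from being merged.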

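    This approach also works when more than 2 intervals overlap, provided the lines are sorted by lower boundary, as they are in the sample.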
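
To check the case of more than two overlapping lines, here is a minimal sketch that applies the same rule to the three lines at the bottom of the sample. merge_pairs is a hypothetical helper that works on (low, high) pairs assumed to be sorted by low. It takes the larger of the two upper boundaries, so an interval nested inside a longer one does not shrink the group.

```python
def merge_pairs(pairs):
    """Merge overlapping (low, high) pairs; pairs must be sorted by low."""
    merged = []
    for low, high in pairs:
        if merged and low <= merged[-1][1]:
            # Overlaps the last group: extend its upper boundary if needed.
            merged[-1] = (merged[-1][0], max(merged[-1][1], high))
        else:
            merged.append((low, high))
    return merged


# The three overlapping lines from the end of the sample collapse into one.
bottom = [(2477030, 2480030), (2478160, 2481160), (2478845, 2481845)]
print(merge_pairs(bottom))  # [(2477030, 2481845)]

# A made-up nested case: (100, 900) fully contains (200, 300).
print(merge_pairs([(100, 900), (200, 300), (950, 1000)]))  # [(100, 900), (950, 1000)]
```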

    【Comments】:
