【Question title】:I made a code with Biopython but it does not work every time. What is wrong with my code?
【Posted】:2022-11-17 19:27:19
【Question description】:

I have a FASTA file containing sequences ordered from 1 (the first sequence, starting at >*) to n (the last one). The content looks like this:

 >TRINITY_GG_10000_c0_g1_i1.p2 TRINITY_GG_10000_c0_g1~~TRINITY_GG_10000_c0_g1_i1.p2  ORF type:complete len:381 (+),score=55.64 TRINITY_GG_10000_c0_g1_i1:244-1386(+)
MNSFLSIRKRTSLATASKTRQLNWKPAKVSIRVTSNDKKLPVTQADVARKETSKHVSMLE
TTPKLKKSFIFMAGRVVRVMIGSFLVLFALLHMGILHTLSPAVKKGLGNFSSRTWQAAEQ
IFTGKWEDHEATATAFEHGF*
>TRINITY_GG_10000_c0_g1_i1.p1 TRINITY_GG_10000_c0_g1~~TRINITY_GG_10000_c0_g1_i1.p1  ORF type:5prime_partial len:1567 (-),score=319.89 TRINITY_GG_10000_c0_g1_i1:1694-6394(-)
SPNAVQQVPVQSPNAVQQVPVQSPNAVQQVPVQSARAIQQVPNQNPNAVQQWTRHPGAMQ
QPVQDSRAIQQQQQNNSSVQAQPQATGHHARQVDESTTRSGPEVPVSSQQGHTNAPSDV*
>TRINITY_GG_10000_c0_g1_i1.p........

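I also have another text file containing numbers that correspond to the positions of some of the sequences in the first FASTA file. Its content looks like this: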

10140
10178
11626
12110
12119
n

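I tried to write a program that extracts from the FASTA file the sequences corresponding to the numbers in the text file, but it does not work properly. The number of extracted sequences does not match the number of sequences requested in the text file. What is wrong with my program?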

import sys
fasta_name = sys.argv[1]
nums_name = sys.argv[2]
out_name = sys.argv[3]

from Bio import SeqIO

fasta_sequences = list(SeqIO.parse(fasta_name, "fasta"))


nums_file = open(nums_name,"r")
nums=nums_file.readlines()
nums_file.close()

out_file = open(out_name,"w")
out_file.close()
out_file = open(out_name,"a+")

numsAsInt= [int(num[:-1]) for num in nums]
indexes = set(range(1,len(fasta_sequences)+1)).intersection(set(numsAsInt))

for ind in indexes:
        fasta = fasta_sequences[ind-1]
        name, sequence = fasta.id, str(fasta.seq)
        out_file.write(">"+name+"\n")
        out_file.write(sequence+"\n")

out_file.close()

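I tried to solve this problem, but as a Python beginner I could not get any further. What can I try next?

【Comments】:

  • How is it wrong? What have you done to debug this?

Tags: python sequence extract biopython fasta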


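【Solution 1】:

Hey, I hope you still need an answer:

Rather than listing the errors in the question, I am providing my answer as code that I tested and that works.

I also provide another, more Biopython-style way to do this: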
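One line in the question that could explain why the script works only some of the time is `int(num[:-1])`: it removes the last character of every line, which is a newline on most lines but a digit when the final line of the file has no trailing newline. The sketch below illustrates this with a made-up `readlines()` result; the helper names `parse_slicing` and `parse_stripping` are hypothetical and exist only for this comparison.

```python
def parse_slicing(lines):
    """Approach from the question: drop the last character of every line."""
    return [int(line[:-1]) for line in lines]


def parse_stripping(lines):
    """Strip surrounding whitespace and skip blank lines before converting."""
    return [int(line.strip()) for line in lines if line.strip()]


# Hypothetical readlines() result: the last line has no trailing newline.
lines = ["10140\n", "10178\n", "12119"]

print(parse_slicing(lines))    # [10140, 10178, 1211]  (last value lost a digit)
print(parse_stripping(lines))  # [10140, 10178, 12119]
```

With slicing, a blank line in the file would also raise `ValueError`, because `int("")` is not valid; stripping and skipping empty lines avoids both problems.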

#!/bin/python3

import sys
fasta_name = 'test.fasta'
nums_name = 'test.list'
out_name = 'out2.fasta'

from Bio import SeqIO
from Bio import Seq

fasta_sequences = list(SeqIO.parse(fasta_name, "fasta"))
#print the number of sequences in the file

"""
nums_file = open(nums_name,"r") # 
nums=nums_file.readlines()
nums_file.close()
#produced: ['1\n', '3\n', '4'] these are strings not ints
    ['1\n', '3\n', '4'] needs to be [1,3,4] fix file readlines

"""

#nicer way to read in the list of numbers
nums=[]
with open(nums_name, 'r') as f:
    nums_raw=f.readlines()
    #strip newlines if they exist
    nums=[x.strip() for x in nums_raw]
    #turn nums into integers
    nums=[int(x) for x in nums]
    

out_file = open(out_name,"w")
out_file.close()
out_file = open(out_name,"a+")

#numsAsInt= [int(num[:-1]) for num in nums] 
# caused an error and is now no longer needed since we already have ints
numsAsInt=nums
indexes = set(range(1,len(fasta_sequences)+1)).intersection(set(numsAsInt))

#you can directly iterate over the SeqIO object and provide the indexes as a list
for ind in nums:
        fasta = fasta_sequences[ind-1] #generally it would be advisable to start indexes from 0
        name, sequence = fasta.id, str(fasta.seq)
        out_file.write(">"+name+"\n")
        out_file.write(sequence+"\n")

out_file.close()

# a more  biopython way is this:
fasta_sequences = list(SeqIO.parse(fasta_name, "fasta"))
nums=[]
with open(nums_name, "r") as f:
    nums=[int(x.strip()) for x in f.readlines()]
selected_seqs = [fasta_sequences[ind-1] for ind in nums]
SeqIO.write(selected_seqs, out_name, "fasta")        


 
 

The last approach is the shortest and most efficient.
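One thing to keep in mind with direct list indexing: a number larger than the number of sequences raises `IndexError`, and a `0` silently selects the last sequence because `ind-1` becomes `-1`. The set intersection in the question hides such numbers instead, and also discards the requested order, which may be why the output count did not match the list. A minimal sketch that keeps the requested order and reports unusable numbers could look like this; `select_by_position` is a hypothetical helper, and plain strings stand in for the `SeqRecord` objects.

```python
def select_by_position(records, positions):
    """Return (selected, missing) for 1-based positions, keeping request order."""
    selected, missing = [], []
    for pos in positions:
        if 1 <= pos <= len(records):
            selected.append(records[pos - 1])
        else:
            # Out-of-range numbers are collected instead of being dropped silently.
            missing.append(pos)
    return selected, missing


# Plain strings stand in for SeqRecord objects in this example.
records = ["seq1", "seq2", "seq3"]
selected, missing = select_by_position(records, [3, 1, 7])

print(selected)  # ['seq3', 'seq1']
print(missing)   # [7]
```

With Biopython, `records` would be `list(SeqIO.parse(fasta_name, "fasta"))` and `selected` could be passed to `SeqIO.write(selected, out_name, "fasta")`, as in the code above.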
