【Question title】: Parsing two files to pool data and create a new FASTA file
【Posted】: 2020-10-08 16:41:21
【Question】:

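I have two files, human.fa and protein-coding_gene.txt (which holds information on hundreds of different proteins). I have to parse protein-coding_gene.txt and then parse human.fa (10 protein names) to pool the data into a new FASTA file.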

protein-coding_gene.txt:

Protein1 PreviousNames1 PreviousSymbols1 Symbol1 Chromosome1
Protein2 PreviousNames2 PreviousSymbols2 Symbol2 Chromosome2

human.fa:

>Protein1  Sequence1
>Protein2 Sequence2

I need a new FASTA file that outputs:

>Protein1 Synonyms1 Chromosome1 Sequence1
>Protein2 Synonyms2 Chromosome2 Sequence2

My current code is:

class Protein:
    
    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome
             
Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        parseline = line.rstrip().split("\t")
        Name = parseline[2]
        Synonyms = parseline[6]
        Chromosome = parseline[7]
        Proteins.append(Protein(Name, Synonyms, Chromosome))


f = open("human.fa")

seqs = {}
for i in f:
    line = i.strip()
    if line[0] == '>':
        l = line.split()
        gene = l[0][1:]
        seqs[gene] = ''
    else:
        seqs[gene] = seqs[gene] + line

        
f.close()

        
for p in Proteins:
    print(p.Name, p.Synonyms, p.Chromosome, sep=",")

for name, seq in seqs.items():
        print (name, seq)
        

from Bio import SeqIO
        
newhuman = []
SeqIO.write(newhuman, "fastaML.fa", "fasta")

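Right now it prints everything I want from the protein-coding file (name, synonyms, chromosome) and prints the whole human.fa file. I need it to filter the data and print only the 10 protein names from the FASTA file, together with their information from protein-coding_gene.txt and their sequences. Any help would be greatly appreciated.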

【Question discussion】:

    Tags: python biopython fasta


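    【Solution 1】:

    The output format you want is not valid FASTA, which expects the sequence on the lines after the ">" header rather than on the header line itself. If you still want exactly that output in fastaML.fa, do not use SeqIO.write(); use basic file handling instead.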
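
    If a file that standard FASTA parsers can read is acceptable instead, one option is to keep the synonyms and chromosome on the header line and move the sequence to the lines that follow. The sketch below is only an illustration under assumptions that are not in the original post: the helper name write_fasta_record, the synonyms=/chromosome= header layout, the 60-character line width, and the placeholder sequence are all made up for the example.

```python
import io


def write_fasta_record(handle, name, synonyms, chromosome, sequence, width=60):
    """Write one record: a '>' header line, then the sequence in fixed-width lines.

    Hypothetical helper for illustration; the header layout is an assumption.
    """
    # The first word after '>' is the identifier; most FASTA parsers treat
    # the rest of the header line as a free-text description.
    header = ">{} synonyms={} chromosome={}".format(
        name, ",".join(synonyms), chromosome
    )
    handle.write(header + "\n")
    # Split the sequence into chunks of at most `width` characters.
    for start in range(0, len(sequence), width):
        handle.write(sequence[start:start + width] + "\n")


# In-memory buffer standing in for open('fastaML.fa', 'w').
buffer = io.StringIO()
write_fasta_record(
    buffer,
    "Protein1",
    ["PreviousNames1", "PreviousSymbols1", "Symbol1"],
    "Chromosome1",
    "MKT" * 30,  # placeholder sequence, not real data
)
fasta_text = buffer.getvalue()
print(fasta_text)
```

    Joining the synonyms with commas keeps the header on a single line without extra spaces inside the synonym field; any other separator would work as long as whatever reads the file agrees on it.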

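    class Protein:

        def __init__(self, Name, Synonyms, Chromosome):
            self.Name = Name
            self.Synonyms = Synonyms
            self.Chromosome = Chromosome
            self.Sequence = ""  # stays empty if human.fa has no entry for this protein

        def add_sequence(self, Sequence):
            self.Sequence = Sequence

    # Parse the annotation file: name, three synonym columns, chromosome.
    Proteins = []
    with open('protein-coding_gene.txt', 'r') as file:
        for line in file:
            parseline = line.split()
            if len(parseline) < 5:
                continue  # skip blank or incomplete lines
            Name = parseline[0]
            Synonyms = parseline[1:4]
            Chromosome = parseline[4]
            Proteins.append(Protein(">" + Name, Synonyms, Chromosome))

    # Parse human.fa into a dict that maps ">Name" to its sequence.
    seqs = {}
    gene = ""
    with open("human.fa") as f:
        for i in f:
            line = i.strip()
            if not line:
                continue  # skip blank lines
            if line[0] == '>':
                l = line.split()
                gene = l[0]
                # The sequence may follow the name on the header line.
                seqs[gene] = l[1] if len(l) > 1 else ""
            elif gene:
                seqs[gene] = seqs[gene] + line

    # Attach each sequence with a direct dictionary lookup.
    for p in Proteins:
        if p.Name in seqs:
            p.add_sequence(seqs[p.Name])

    # Write only the proteins that also appear in human.fa.
    with open('fastaML.fa', 'w') as file:
        for p in Proteins:
            if p.Name in seqs:
                file.write(p.Name + " " + p.Synonyms[0] + " " + p.Synonyms[1] + " " + p.Synonyms[2] + " " + p.Chromosome + " " + p.Sequence + "\n")
                # A single space is used as the separator here. Modify it as needed.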
    

    Here is a working repl for your reference.

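    【Comments】:

    • Thanks for the help. I am getting an AttributeError; could you help me see where it comes from?
    • I reviewed your suggested edit, but I had to reject the changes: this is a general answer meant to give readers the idea, and it may help someone else in the future. Looking at your code, I can tell the AttributeError comes from seqs[gene] = '', which you left empty. You should use seqs[gene] = l[1] instead.
    • You will also run into another error in file.write(), because it does not accept multiple arguments. That is why I used string concatenation there.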
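
    The point in the last comment, that file.write() takes exactly one string, can be illustrated with a small sketch. It is only an illustration: the field values are the placeholders from the question, and an in-memory io.StringIO buffer is assumed as a stand-in for the real output file.

```python
import io

# Placeholder values taken from the question's desired output.
fields = ["Protein1", "Synonyms1", "Chromosome1", "Sequence1"]

out = io.StringIO()  # stands in for open('fastaML.fa', 'w')

# write() accepts exactly one string, so passing the fields separately fails.
error_message = ""
try:
    out.write(*fields)
except TypeError as exc:
    error_message = str(exc)

# Option 1: build a single string first, then write it.
out.write(">" + " ".join(fields) + "\n")

# Option 2: print() accepts several values and can target a file object.
print(">" + fields[0], *fields[1:], file=out)

result = out.getvalue()
print(result)
```

    Both options produce the same line, so the choice is mostly a matter of taste; print() adds the separator and the trailing newline itself.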