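[Posted]: 2020-10-08 16:41:21
[Question]:
I have two files, human.fa and protein-coding_gene.txt (which holds information on hundreds of different proteins). I need to parse protein-coding_gene.txt, then parse human.fa (10 protein names), and combine the two into a new FASTA file.
protein-coding_gene.txt: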
Protein1 PreviousNames1 PreviousSymbols1 Symbol1 Chromosome1
Protein2 PreviousNames2 PreviousSymbols2 Symbol2 Chromosome2
human.fa:
>Protein1 Sequence1
>Protein2 Sequence2
I need a new FASTA file that outputs:
>Protein1 Synonyms1 Chromosome1 Sequence1
>Protein2 Synonyms2 Chromosome2 Sequence2
My current code is:
from Bio import SeqIO

class Protein:
    def __init__(self, Name, Synonyms, Chromosome):
        self.Name = Name
        self.Synonyms = Synonyms
        self.Chromosome = Chromosome

# Parse the tab-separated annotation file into Protein objects.
Proteins = []
with open('protein-coding_gene.txt', 'r') as file:
    for line in file:
        parseline = line.rstrip().split("\t")
        Name = parseline[2]
        Synonyms = parseline[6]
        Chromosome = parseline[7]
        Proteins.append(Protein(Name, Synonyms, Chromosome))

# Parse human.fa into a dict of {gene name: sequence}.
seqs = {}
with open("human.fa") as f:
    for i in f:
        line = i.strip()
        if line.startswith('>'):
            # The gene name is the first word of the header, without '>'.
            l = line.split()
            gene = l[0][1:]
            seqs[gene] = ''
        else:
            seqs[gene] = seqs[gene] + line

# Print every annotation, then every sequence (no filtering yet).
for p in Proteins:
    print(p.Name, p.Synonyms, p.Chromosome, sep=",")
for name, seq in seqs.items():
    print(name, seq)

# newhuman is still empty, so this writes no records.
newhuman = []
SeqIO.write(newhuman, "fastaML.fa", "fasta")
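Right now it prints everything I want from the protein-coding file (name, synonyms, chromosome) and prints the entire human.fa file. I need it to sort and print only the 10 protein names from the FASTA file, together with their information from protein-coding_gene.txt and their sequences. Any help would be appreciated.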
[Discussion]:
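One possible way to do the merge described in the question is to key the annotations by protein name and keep only the names that also appear in human.fa. The sketch below uses only the standard library and is not a definitive solution. It rests on several assumptions: the column indices 2, 6 and 7 are taken from the question's code, the names match exactly between the two files, "sort" means alphabetical order by name, and the sequence should go on the line after the header as in ordinary FASTA. The function names load_annotations, read_fasta and write_annotated_fasta are hypothetical helpers invented for this illustration.

```python
def load_annotations(path, name_col=2, synonyms_col=6, chromosome_col=7):
    """Map protein name -> (synonyms, chromosome) from a tab-separated file.

    The default column indices are assumptions copied from the question's code.
    """
    annotations = {}
    needed = max(name_col, synonyms_col, chromosome_col)
    with open(path) as handle:
        for raw in handle:
            fields = raw.rstrip("\n").split("\t")
            if len(fields) <= needed:
                continue  # skip blank or short lines
            annotations[fields[name_col]] = (
                fields[synonyms_col],
                fields[chromosome_col],
            )
    return annotations


def read_fasta(path):
    """Map record name (first word after '>') -> concatenated sequence."""
    seqs = {}
    name = None
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line:
                continue  # ignore empty lines
            if line.startswith(">"):
                parts = line[1:].split()
                name = parts[0] if parts else ""
                seqs[name] = ""
            elif name is not None:
                seqs[name] += line
    return seqs


def write_annotated_fasta(seqs, annotations, out_path):
    """Write records present in both inputs, sorted by name.

    Returns the number of records written.
    """
    written = 0
    with open(out_path, "w") as out:
        for name in sorted(seqs):
            if name not in annotations:
                continue  # no annotation for this FASTA record
            synonyms, chromosome = annotations[name]
            out.write(f">{name} {synonyms} {chromosome}\n{seqs[name]}\n")
            written += 1
    return written
```

With the file names from the question, the call would look like write_annotated_fasta(read_fasta("human.fa"), load_annotations("protein-coding_gene.txt"), "fastaML.fa"). Looking names up in a dictionary avoids scanning the whole annotation list once per FASTA record.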