【问题标题】:Protein to RNA codons蛋白质到 RNA 密码子
【发布时间】:2021-01-23 01:43:10
【问题描述】:

我有这个问题,我们需要编写一个代码,该代码采用蛋白质 fasta 文件和蛋白质序列标识符,并计算 fasta 文件中序列的所有可能 RNA 组合,条件是组合总数应该小于 5000。

我开始制作一个 RNA 密码子字典,然后我制作了一个函数,将 fasta 文件的元素(氨基酸)放入一个列表中,然后我尝试从该列表中进行组合,但我得到了一个错误,我试过但不知道问题出在哪里,如果有人可以检查代码并告诉我出了什么问题,我将不胜感激。

import itertools

from Bio import SeqIO

RNA_codon_table = {
'A': ('GCU', 'GCC', 'GCA', 'GCG'),
'C': ('UGU', 'UGC'),
'D': ('GAU', 'GAC'),
'E': ('GAA', 'GAG'),
'F': ('UUU', 'UUC'),
'G': ('GGU', 'GGC', 'GGA', 'GGG'),
'H': ('CAU', 'CAC'),
'I': ('AUU', 'AUC', 'AUA'),
'K': ('AAA', 'AAG'),
'L': ('UUA', 'UUG', 'CUU', 'CUC', 'CUA', 'CUG'),
'M': ('AUG',),
'N': ('AAU', 'AAC'),
'P': ('CCU', 'CCC', 'CCA', 'CCG'),
'Q': ('CAA', 'CAG'),
'R': ('CGU', 'CGC', 'CGA', 'CGG', 'AGA', 'AGG'),
'S': ('UCU', 'UCC', 'UCA', 'UCG', 'AGU', 'AGC'),
'T': ('ACU', 'ACC', 'ACA', 'ACG'),
'V': ('GUU', 'GUC', 'GUA', 'GUG'),
'W': ('UGG',),
'Y': ('UAU', 'UAC'),}
 
def protein_fasta (protein_file):
   protein_sequence = []
   protein = SeqIO.parse(protein_file, format = 'fasta')
   for Seqrecord in protein: 
      protein_sequence.append(Seqrecord.seq)
   print (protein_sequence)


for seq in protein_sequence:
     codons = [ list(RNA_codon_table[key]) for key in protein_sequence ]
print(list(itertools.product(codons)))

对不起,我不知道如何附加fasta文件,但这是里面的顺序:

>seq_compl complete sequence
IEEATHMTPCYELHGLRWVQIQDYAINVMQCL

这是我得到的错误:

 KeyError                                  Traceback (most recent call last)
<ipython-input-65-3dd46947c505> in <module>
----> 1 all_combinations ('short_protein.fasta')

<ipython-input-64-45a50fffc1d9> in all_combinations(protein_file)
      5        protein_sequence.append(Seqrecord.seq)
      6 
----> 7    codons = [ list(RNA_codon_table[key]) for key in protein_sequence 
]
      8    print(list(itertools.product(codons)))

<ipython-input-64-45a50fffc1d9> in <listcomp>(.0)
      5        protein_sequence.append(Seqrecord.seq)
      6 
----> 7    codons = [ list(RNA_codon_table[key]) for key in protein_sequence 
 ]
      8    print(list(itertools.product(codons)))

 KeyError: Seq('IEEATHMTPCYELHGLRWVQIQDYAINVMQCL')

【问题讨论】:

  • 能否请edit 包含错误的回溯。附加一个小的 FASTA 文件和一些重现问题的简单示例数据也会很有用。另请参阅提供minimal reproducible example 的指南。可能还要说明哪个库提供了SeqIO 类。
  • 哦,好吧,我写蛋白质序列和错误
  • 我希望这会更好?
  • 我添加了缺少的import,但缩进仍然有些奇怪。我提供的链接提供了额外的指导;我建议你点击并阅读它。

标签: python bioinformatics biopython


【解决方案1】:

根据您的示例,protein_sequence 变量当前在 protein_fasta 函数的本地范围内声明。

您需要先将此函数的结果分配给一个变量,然后才能对其进行迭代。

例如,将您的 print 切换为 return:

def protein_fasta (protein_file):
   protein_sequence = []
   protein = SeqIO.parse(protein_file, format = 'fasta')
   for Seqrecord in protein: 
      protein_sequence.append(Seqrecord.seq)
   return protein_sequence

并确保调用并分配函数的结果:

protein_sequence = protein_fasta(protein_file)

现在你有了可以迭代的东西。

我发现你的 for 循环还有一个问题。你没有对 seq 做任何事情。在这种情况下,大概 protein_sequence 应该换成 seq。我还取出了包装 RNA_codon_table 的列表,因为我认为在这种情况下不需要它:

for seq in protein_sequence:
    codons = [ RNA_codon_table[key] for key in seq ]
    print(list(itertools.product(*codons)))

【讨论】:

  • 好的,非常感谢,我找到了丢失的部分,但是当我想编写函数时,我只写:protein_fasta ('protein.fasta') 吗?因为它只是打印文件中的序列,并没有给我密码子..
  • SeqIO.parse 函数只根据他们的文档获取文件句柄或文件名。因此,如果您的输入文件名为 protein.fasta,那应该可以!如果您没有将 print 更改为 return 语句,这可能就是它只打印文件中的序列记录而不是打印密码子的原因。
  • 哦,明白了,非常感谢!
【解决方案2】:

您的蛋白质串将产生数十亿种组合:

from itertools import product,islice
def protGen(proteins):
   for codons in product(*(RNA_codon_table[P] for P in proteins)):
       yield "".join(codons) 

计数组合:

proteins = "IEEATHMTPCYELHGLRWVQIQDYAINVMQCL"
    
count = 1
for P in proteins: count *= len(RNA_codon_table[P])

print(count) # 37,572,373,905,408 combinations

输出:

for protSeq in islice(protGen(proteins),500): # first 500
    print(protSeq)

AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGUCUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAAUGCCUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUUUA
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUUUG
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUU
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUC
AUUGAAGAAGCUACUCAUAUGACUCCUUGUUAUGAAUUACAUGGUUUACGUUGGGUUCAAAUUCAAGAUUAUGCUAUUAAUGUUAUGCAGUGUCUA

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 2013-01-16
    • 1970-01-01
    • 1970-01-01
    • 2020-12-22
    • 2020-07-01
    • 1970-01-01
    • 2012-06-27
    • 1970-01-01
    相关资源
    最近更新 更多