Extracting CDS sequences in Biopython (answers)

【Question title】: Extracting CDS sequences in Biopython
【Posted】: 2018-03-30 05:28:41
【Question】:

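Hi everyone,

I'm starting to program with Biopython, and I'd like to know how to extract gene sequences and protein identifiers from a genome GenBank file (*.gb) that contains the coordinates of all features.

My idea is to create a text file containing the protein identifier, the gene coordinates, and the gene sequence.

If you have any ideas, I'd be grateful.

This is what I have tried so far: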

for seq_record in seq_record.features: 
    if seq_record.type == 'CDS':
       x=seq_record.qualifiers['protein_id']
       i=seq_record.location._start.position
       f=seq_record.location._end.position
       sq = seq_record.seq
       FEAT_LIST.append('START END STRAND ID')
       FEAT_LIST.append(str(((i, f), s, x, sq)))
       print(FEAT_LIST)

However, I get this error: sq = seq_record.seq AttributeError: 'SeqFeature' object has no attribute 'seq'

Thanks for your help.

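【Question comments】:

Welcome to StackOverflow! What have you tried? Have you looked at the [Biopython](www.biopython.org) tutorial/wiki? The short answer is: if nothing there does what you want, write a parser.

Tags: sequences biopython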

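【Solution 1】:

FeatureLocation has a handy extract method. It takes the parent sequence and, when you pass it the parent SeqRecord, gives you back a new SeqRecord object. On that object you can use the usual .seq to get the sequence: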

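from Bio import SeqIO

for rec in SeqIO.parse("sequence.gb", "genbank"):
    for feature in rec.features:
        if feature.type == "CDS":
            # Coordinates and strand of the feature
            print(feature.location)
            # Qualifier values are lists; not every CDS has a protein_id
            print(feature.qualifiers.get("protein_id"))
            # Extract against the parent record, then take its sequence
            print(feature.location.extract(rec).seq)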
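
To make it clearer why extract needs the parent sequence, here is a rough pure-Python sketch of the idea for a simple, single-span location: slice the parent with 0-based, end-exclusive coordinates (the convention Biopython locations use) and reverse-complement the slice for a minus-strand feature. The helper name extract_simple is made up for this illustration. It is not Biopython's implementation, and it ignores join() locations and ambiguous bases.

```python
# Watson-Crick complements for an unambiguous DNA alphabet.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_simple(parent, start, end, strand=1):
    """Return parent[start:end], reverse-complemented when strand is -1.

    Hypothetical helper for illustration only. Coordinates are 0-based
    and end-exclusive, like Python slices.
    """
    piece = parent[start:end]
    if strand == -1:
        # Complement each base, then reverse the order.
        piece = piece.translate(COMPLEMENT)[::-1]
    return piece


genome = "AAATGCCCTTT"
print(extract_simple(genome, 2, 6))      # forward-strand slice
print(extract_simple(genome, 2, 6, -1))  # same span read from the minus strand
```

A feature only stores where it sits; the bases themselves live on the record, which is why the asker's seq_record.seq lookup on a feature fails.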

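【Comments】:

【Solution 2】:

I suggest you look at the Biopython documentation for the SeqIO and SeqRecord objects, for example parse and read. The GenBank format is implemented in the parser, so you should have no trouble reading the file. In practice you only need to pass "genbank" as the format argument.

Here you even have an example of reading a GenBank file.

Edit: So I think you're having trouble iterating over the records. The problem I see is a confusion between SeqRecord objects and SeqFeature objects. You can't do this: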

for seq_record in seq_record.features:

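because inside that loop seq_record is a SeqFeature object, not a SeqRecord, and a SeqFeature has no seq attribute. When you first parse the GenBank file, you iterate over SeqRecord objects: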

from Bio import SeqIO

for record in SeqIO.parse('my_file.gbk', 'genbank'):
    print("Record %s has %i features and sequence: %s" % (record.id, len(record.features), record.seq))

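Each SeqRecord object has a seq attribute and a list of SeqFeature objects in its features attribute. If you want to do something with the features, you need to iterate over them for each record.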
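
As a sketch of how the loop from the question could be restructured along those lines, the code below keeps the record and the feature in separately named variables and writes one tab-separated line per CDS. The function names cds_rows and write_cds_table, the column layout, and the "NA" placeholder for a missing protein_id are my own choices for illustration, not anything Biopython prescribes. The Bio import sits inside write_cds_table only so that cds_rows can be tried on its own.

```python
import csv


def cds_rows(record):
    """Yield (protein_id, start, end, strand, sequence) for each CDS in a record."""
    for feature in record.features:
        if feature.type != "CDS":
            continue
        # Qualifier values are lists; some CDS features have no protein_id.
        protein_id = feature.qualifiers.get("protein_id", ["NA"])[0]
        loc = feature.location
        # The bases live on the record, so extract() is given record.seq.
        sequence = str(feature.extract(record.seq))
        yield protein_id, int(loc.start), int(loc.end), loc.strand, sequence


def write_cds_table(gb_path, out_path):
    """Write one tab-separated line per CDS found in a GenBank file."""
    from Bio import SeqIO  # imported here so cds_rows() can be used without it

    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(["protein_id", "start", "end", "strand", "sequence"])
        for record in SeqIO.parse(gb_path, "genbank"):
            writer.writerows(cds_rows(record))
```

Calling write_cds_table("my_file.gbk", "cds_table.tsv") would then produce the text file described in the question; both file names here are placeholders.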

【Comments】:

Thanks for the information. However, my problem appears when I try to run this code: `for seq_record in seq_record.features:` `if seq_record.type == 'CDS':` `x=seq_record.qualifiers['protein_id']` `i=seq_record.location._start.position` `f=seq_record.location._end.position` `sq = seq_record.seq` `FEAT_LIST.append('START END STRAND ID')` `FEAT_LIST.append(str(((i, f), s, x, sq)))` `print(FEAT_LIST)` I get this error: sq = seq_record.seq AttributeError: 'SeqFeature' object has no attribute 'seq'
I edited my answer to try to be more explanatory. I also think the code you posted should be visible in your question, so I have submitted an edit for that.

【Solution 3】:

The SeqFeature object has an extract method, which saves you from digging into the FeatureLocation. It returns a new sequence.

from Bio.Seq import Seq
from Bio.SeqFeature import SeqFeature, FeatureLocation
# Bio.Alphabet was removed in Biopython 1.78, so no alphabet argument is passed
seq = Seq("MKQHKAMIVALIVICITAVVAAL")
f = SeqFeature(FeatureLocation(8, 15), type="domain")
f.extract(seq)
# returns: Seq('VALIVIC')

【Comments】: