如何识别基因组中特定位置的特征答案

【问题标题】：How to identify what feature(s) are at a specific location in a genome如何识别基因组中特定位置的特征
【发布时间】：2013-07-25 21:02:38
【问题描述】：

我有兴趣确定基因组特定位置的特征（即基因/cds）。例如，什么基因（如果有）包含位置 2,000,000。我知道如何使用for 循环并遍历基因组中的每个特征（代码包含在下面）来做到这一点，但这是我想做数亿次随机化研究的一部分，而这个将花费比我想要的更长的时间。

下面包含的代码是我正在尝试做的更具体的示例：

from Bio import SeqIO
import random
GenomeSeq = SeqIO.read(open("reference_sequence.gbk", "r"), "genbank")

interesting_position = random.randint(0, len(GenomeSeq))

for feature in GenomeSeq.features:  # loop each position through whole genome
    # In this particular case I'm interested in focusing on cds, but
    # in others, I may be interested in other feature types?
    if feature.type == "CDS":  
        if (feature.location._start.position <= interesting_position and 
            feature.location._end.position >= interesting_position):
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

我考虑过制作一个字典，其中基因中的每个位置对应一个键，特征 ID 作为值，因为查找比循环更快，但似乎应该有一种方法做GenomeSeq[interestion_position].qualifiers['gene']

【问题讨论】：

可能是GenomeSeq[interesting_position].features() 之类的？
@verbsintransit 是的，这很好，但它似乎不起作用，我得到一个属性错误（'str' 对象没有属性'features'）。这是行之有效的事情，还是您希望看到的事情？

标签： python biopython

【解决方案1】：

我从未使用过 BioPython，但我发现它位于其文档中：http://biopython.org/wiki/SeqIO

from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
print records[0].id  #first record
print records[-1].id #last record

这就是你要找的吗？

【讨论】：

这会根据它们在 fasta 文件中的顺序访问不同的序列条目，而不是根据随机位置识别特定基因。

【解决方案2】：

相当老了，但我会试一试。假设您要检查给定基因组的随机点列表（而不是许多基因组的固定点集）：

from Bio import SeqIO
import random

GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")

# Here I generate a lot of random points in the genome in a set
interesting_positions = set(random.sample(xrange(1, len(GenomeSeq)), 20000))


for feature in GenomeSeq.features:

    if feature.type == "CDS":
        # Here I create a set for every position covered by a feature
        feature_span = set(xrange(feature.location._start.position,
                                  feature.location._end.position))

        # And this is the "trick": check if the two sets overlaps: if any point
        # is in the interesting_positions AND in the points covered by this
        # feature, we are interested in the feature.
        if feature_span & interesting_positions:
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

对于 20.000 个点和 4.7Mb 的基因组，循环在我的 crapy-2003 计算机中大约需要 3 秒，对于 200.000 个随机点需要 5 秒。

经过分析和一点重构，我提取了所有只需要计算一次的行即可：

from Bio import SeqIO
import random


def sample_genome(feature_spans, interesting_positions):
  for span in feature_spans:  
      if span & interesting_positions:
          # Do your business here
          print span[1].qualifiers.get("gene") or span[1]

if __name__ == "__main__":
    GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")
    genome_length = len(GenomeSeq)

    features = [f for f in GenomeSeq.features if f.type == "CDS"]

    spans = [(set(
        xrange(feature.location._start.position,
               feature.location._end.position)), feature)
        for feature in features]

    for i in range(20):
        positions = set(random.sample(xrange(1, genome_length), 200))
        sample_genome(spans, positions)

每个样本大约需要 0.2 秒 + 读取/准备基因组需要 4 秒。在我的这台旧机器上完成 1.000.000 个样本大约需要 50 个小时。作为一个随机抽样，在多核机器上启动一些进程，明天你就完成了。

【讨论】：

感谢您的回复，这实际上是我还没有解决的问题。看起来这实际上是同一个问题：每个随机集必须遍历基因组特征 1 次。 200 个随机点是合理的，但理想情况下我想重复至少 100 万次（1388 小时）。
@ded 我在答案中添加了最快的代码。如果效果好，请告诉我。