【Question title】: Access sequence element from fasta record using Biopython Entrez
【Posted】: 2013-07-20 05:05:27
【Question】:

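I have a list of RefSeq IDs (key_list) that I use to pull down sequence records with Biopython's Entrez module. I only want to access the sequence in each returned fasta record, and I don't want to write the records to a file.

I am trying the following code: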

for key in key_list:
   Entrez.email = "myemailaddress"
   handle = Entrez.efetch(db='nuccore', id=key, rettype='fasta')
   record = SeqIO.parse(handle, "fasta")
   for seq_record in SeqIO.parse(record, "fasta"):
    print seq_record.seq

When I run it, I get this error:

File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.py", line 538, in parse
  yield r
File "/usr/lib64/python2.6/contextlib.py", line 34, in __exit__
  self.gen.throw(type, value, traceback)
File "/usr/lib64/python2.6/site-packages/Bio/File.py", line 59, in as_handle
  yield handleish
File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
  for r in i:
File "/usr/lib64/python2.6/site-packages/Bio/SeqIO/FastaIO.py", line 37, in FastaIterator
  line = handle.readline()
AttributeError: 'generator' object has no attribute 'readline'

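If I use handle.read() I can get the whole fasta record back, but at this stage I only want to access the nucleotide sequence.

Can anyone help me with this?

Many thanks.

【Question comments】:

    Tags: python biopython fasta ncbi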


    【Solution 1】:

    This is what you need.

    Instead of:

    handle = Entrez.efetch(db='nuccore', id=key, rettype='fasta')
    

    Try this:

    handle = Entrez.efetch(db="nucleotide", id=key, retmode="xml") # retmode as 'xml' , db='nucleotide'
    features = Entrez.read(handle)[0]
    sequence = features['GBSeq_sequence'] # this is your sequence!
    

    This returns your sequence as a string:

    'ggctcgcatctctccttcacgcgcccgccgccttacctgaggccgccatccacgccggttgagtcgcgttctgccgcctcccgcctgtggtgcctcctgaactacgtccgccgtctaggtaagtttagagctcaggtcgagaccgggcctttgtccggcgctcccttggagcctacctagactcagccggctctccacgctttgcctgaccctgcttgctcaactctacgtctttgtttcgttttctgttctgcgccgttacagatcgaaagttccacccctttccctttcattcacgactgactgccggcttggcccacggccaagtaccggcaactctgctggctcggagccagcgacagcccattctatagcactctccaggagagaaatttagtacacagttgggggctcgtccgggattcgagcgcccctttattccctaggcaatgggccaaatcttttcccgtagcgctagccctattccgcggccgccccgggggctggccgctcatcactggcttaacttcctccaggcggcatatcgcctagaacccggtccctccagttacgatttccaccagttaaaaaaatttcttaaaatagctttagaaacaccggtctggatctgccccattaactactccctcctagccagcctactcccaaaaggataccccggccgggtgaatgaaattttacacatactcatccaaacccaagcccagatcccgtcccgccccgcgccgccgccgccgtcatcctccacccacgaccccccggattctgacccacaaatcccccctccctatgttgagcctacagccccccaagtccttccagtcatgcacccacatggtgcccctcccaaccaccgcccatggcaaatgaaagacctacaggccattaagcaagaagtctcccaagcggcccctggaagcccccagtttatgcagaccatccggcttgcggtgcagcagtttgaccccactgccaaagacctccaagacctcctgcagtacctttgctcctccctcgtggcttccctccatcaccagcagctagatagccttatatcagaggccgaaactcgaggtattacaggttataaccccttagccggtcccctccgtgtccaagccaacaatccacaacaacaaggattaaggcgagaataccagcaactctggctcgccgccttcgccgccctgccagggagtgccaaagacccttcctgggcctctatcctccaaggcctggaggagccttaccacgccttcgtagaacgcctcaacatagctcttgacaatgggctgccagaaggcacgcccaaagaccccattttacgttccttagcctactctaatgcaaacaaagaatgccaaaaattactacaggcccgagggcacactaatagccctctaggagatatgttgcgggcttgtcaggcctggacccccaaagacaaaaccaaagtgttagttgtccagcctaaaaaaccccccccaaatcagccgtgcttccggtgcgggaaagcaggccactggagtcgggactgcactcagcctcgtcctccccctgggccatgccccctatgtcaagatccaactcactggaagcgagactgcccccgcctaaagcccactatcccagaaccagagccagaggaggatgccctcctattagatctccccgccgacatcccacacccaaaaaactccatagggggggaggtttaacctccccccccacattacagcaagtccttcctaaccaagacccaacatctattctgccagttataccgttagatcccgcccgtcggcccgtaattaaagcccagattgacacccagaccagccacccaaagactatcgaagctctactagatacaggagcagacatgacagtccttccgatagccttgttctcaagtaatactcccctcaaaaacacatccgtgttaggggcagggggccaaacccaagatcactt
taagctcacctcccttcctgtgctaatacgcctccctttccggacgacgcctattgttttaacatcttgcctagttgataccaaaaacaactgggccatcataggtcgtgatgccttacaacaatgccaaggcgtcctgtacctccctgaggcaaaaaggccgcctgtaatcttgccaatacaggcgccagctgtccttgggctagaacacctcccaaggccccccgaaatcagccagttccctttaaaccagaacgcctccaggccttgcaacacttggtccggaaggccctggaggcaggccatatcgaaccctacaccgggccaggaaataacccagtattcccagttaaaaaagccaatggaacctggcgattcatccacgacctgcgggccactaactctctaaccatagatctctcatcatcttcccccgggccccctgacttgtccagcctgccaactacactagcccacttacaaactatagaccttaaagacgcctttttccaaatccccctacctaaacagttccagccctactttgctttcactgtcccacagcagtgtaactacggccccggcactagatacgcctggagagtactaccccaagggtttaaaaatagtcccaccctgttcgaaatgcagctggcccatatcctgcagcccattcggcaagccttcccccaatgcactattcttcagtacatggatgacattctcctggcaagcccctcccatgcggacctgcaactactctcagaggccacaatggcttccctaatctcccatgggttgcctgtgtccgaaaacaaaacccagcaaacccctggaacaattaagttcctagggcaaataatttcacctaatcacctcacttatgatgcagtccccaaggtacctatacggtcccgctgggcgctacctgaacttcaagccctacttggcgagattcagtgggtctccaaaggaactcctaccttacgccagccccttcacagtctctactgtgccttacaaaggcatactgatccccgagaccaaatatatttaaatccttctcaagttcaatcattagtgcagctgcggcaggccctgtcacagaactgccgcagtagactagtccaaaccctgcccctcctaggggctattatgctgaccctcactggcaccaccactgtggtgttccagtccaagcagcagtggccacttgtctggctacatgcccccctaccccacactagccagtgcccctgggggcagctacttgcctcagctgtgttattactcgacaaatacaccttgcaatcctatggactactctgccaaaccatacatcataacatctccacccaaaccttcaaccaattcattcaaacatctgaccaccccagtgttcctatcttactccaccacagtcaccgattcaaaaatttaggtgcccagactggagaactttggaacacttttcttaaaacaactgccccattggctcctgtgaaagcccttatgccagtgtttactctttcccctgtgatcataaacaccgccccttgcctgttttcagacggatccacctcccaggcagcctatattctctgggacaagcatatattgtcacaaagatcattcccccttccgccaccgcacaagtcggcccaacgggccgaacttctcggacttttgcatggcctctccagcgcccgttcgtggcgctgtctcaacatatttctagactccaagtatctttatcattaccttcggacccttgccctaggcaccttccaaggcaggtcctctcaggccccctttcaggccctcctgccccgcttactatcgcgtaaggtcgtctatttgcaccacgttcgcagccataccaatctacctgatcccatctccaggctcaacgctctcacagatgccctactaatcacccctgtcctgcagctctctcctgcagacctacacagtttcacccattgcggacagacggccct
cacactgcaaggggcaaccacaactgaggcctccaatatcctgcgctcttgccacgcctgccgcaaaaataacccacaacatcagatgcctcaaggacacatccgccgtggcctactccctaaccacatctggcaaggcgacattacccatttcaaatataaaaatacactgtatcgccttcatgtatgggtagacaccttttcaggagccatctcagctacccaaaagagaaaagaaacaagctcagaagctatttcctctttgctccaggccattgcctatctaggcaagcctagctacataaacacagacaatggccctgcctatatttcccaagacttcctcaatatgtgtacctcccttgctattcgccatactacccatgtcccctacaatccaaccagctccggacttgtagaacgctctaatggcattcttaaaaccctattatataagtactttactgacaaacccgacctacctatggataatgctctatccatagccctatggacaatcaaccacctaaatgtattaaccaactgccacaaaacccgatggcagcttcaccactccccccgactccagccgatcccagagacacattccctcagcaataaacaaacccattggtattatttcaagcttcctggtcttaatagccgccagtggaaaggaccacaggaggctcttcaagaagctgccggcgctgctctcatcccggtaagcgctagttctgcccagtggatcccgtggaggctcctcaagcgagctgcatgcccaagacccgtcggaggccccgccgatcccaaagaaaaagaccaccaacaccatgggtaagtttctcgccactttgattttattcttccagttctgccccctcatcctcggtgattacagccccagctgctgtactctcacagttggagtctcctcataccactctaaaccctgcaatcctgcccagccagtttgttcatggaccctcgacctgctggccctttcagcagatcaggccctacagccaccctgccctaatctagtaagttactccagctaccatgccacctattccctatatctattccctcattggatcaaaaagccaaaccgaaatggcggaggctattattcagcctcttattcagacccttgttccttaaaatgcccatacctagggtgccaatcatggacctgcccctatacaggagccgtctccagcccctactggaaatttcagcaagatgtcaattttactcaagaagtttcacacctcaatattaatctccatttttcaaaatgcggtttttccttctcccttctagtcgacgctccaggatatgaccccatctggttccttaataccgaacccagccaactgcctcccaccgcccctcctctactctcccactctaacctagaccatatcctcgagccctctataccatggaaatcaaaactcctgactcttgtccagttaaccctacaaagcactaattatacttgcattgtctgtatcgatcgtgccagcctatccacttggcacgtcctatactctcccaacgtctctgttccatccccttcttctacccccctcctttacccatcgttagcgcttccagccccccacctgacgttaccatttaactggacccactgctttgacccccagattcaagctatagtctcctccccctgtcataactccctcatcctgccccccttttccttgtcacctgttcccacgctaggatcccgctcccgccgagcagtaccggtggcggtctggcttgtctccgccctggccatgggagccggagtggctggcaggattaccggctccatgtccctcgcctcaggaaagagcctcctacatgaggtggacaaagatatttcccaattaactcaagcaatagtcaaaaaccacaaaaatctgctcaaaattgcacagtatgctgcccagaacagacgaggccttgatctcctgttctgggagcaa
ggaggattatgcaaagcattacaagaacagtgctgttttctaaatattactaattcccatgtctcaatactacaagagagacccccccttgaaaatcgagtcctgactggctggggccttaactgggaccttggcctctcacagtgggctcgagaagccttacaaactggaatcacccttgtcgcgctactccttcttgttatccttgcaggaccatgcatcctccgtcagctacgacacctcccctcgcgcgtcagatacccccattactctcttataaaccctgagtcatccctgtaaaccaagcacacaattattgcaaccacatcgcctccagcctcccctgccaataattaacctctcccatcaaatcctccttctcctgcagcaacctcctccgttcagcctccaaggactccacctcgccttccaactgtctagtatagccatcaacccccaactcctgcattttttctttcctagcactatgctgtttcgccttctcagccccttgtctccacttgcgctcacggcgctcctgctcttcctgctttctccgggcgaagtcagcggccttctcctccgcccgcttcctgcgccgtgccttctcctcttccttccttttcaaatactcagcaatctgcttttcctcctctttctcccgctctttttttcgcttcctcttctcctcagcccgtcgctgccgatcacgatgcgtttccccgcgaggtggcgctttcccccctggagggccccgtcgcagccggccgcggctttcctcttctagagatagcaaaccgtcaagcacagtttcctcctcctccttgtcctttaactcttcctccaaggataatagcccgtccaccaattcctccaccagcaggtcctccgggcatggaacaggcaaacatcgaaacagccctacggatacaaagttaaccatgcttattatcagcccacttcccagggtttggacagagtcttcttttcggatacccagtctacgtgtttggagactgtgtacaaggcgactggtgccccatctctgggggactatgttcggcccgcctacatcgtcacgccctactggccacctgtccagagcatcagatcacctgggaccccatcgatggacgcgttatcggctcagctctacagttccttatccctcgactcccctccttccccacccagagaacctctaagacccttaaggtccttaccccgccaatcactcatacaacccccaacattccaccctccttcctccaggccatgcgcaaatactcccccttccgaaatggatacatggaacccacccttgggcagcacctcccaaccctgtcttttccagaccccggactccggccccaaaacctgtacaccctctggggaggctccgttgtctgcatgtacctctaccagctttccccccccatcacctggcccctcctgccccatgtgattttttgccaccccggccagctcggggccttcctcaccaatgttccctacaaacgaatagaaaaactcctctataaaatttcccttaccacaggggccctaataattctacccgaggactgtttgcccaccacccttttccagcctgctagggcacccgtcacgctgacagcctggcaaaacggcctccttccgttccactcaaccctcaccactccaggccttatttggacatttaccgatggcacgcctatgatttccgggccctgccctaaagatggccagccatctttagtactacagtcctcctcctttatatttcacaaatttcaaaccaaggcctaccacccctcatttctactctcacacggcctcatacagtactcttcctttcataatttgcatctcctatttgaagaatacaccaacatccccatttctctactttttaacgaaaaagaggcagatgacaatgaccatgagccccaaatatcccccgggggcttagagcctctcagtgaaaaacatttccgtga
aacagaagtctgagaaggtcagggcccagaataaggctctgacgtctccccccggaggacagctcagcaccagctcaggctaggccctgacgtgtccccctaaagacaaatcataagctcagacctccgggaagccaccgggaaccacccatttcctccccatgtttgtcaagccgtcctcaggcgttgacgacaacccctcacctcaaaaaacttttcatggcacgcatacggctcaataaaataacaggagtctataaaagcgtggggacagttcaggagggggctcgcatctctccttcacgcgcccgccgccttacctgaggccgccatccacgccggttgagtcgcgttctgccgcctcccgcctgtggtgcctcctgaactacgtccgccgtctaggtaagtttagagctcaggtcgagaccgggcctttgtccggcgctcccttggagcctacctagactcagccggctctccacgctttgcctgaccctgcttgctcaactcta'
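
    The lookup above can be sketched against plain dictionaries, so that it runs without network access. This is only an illustration in Python 3 syntax: it assumes the parsed result is a list of dict-like records, as the indexing `Entrez.read(handle)[0]` suggests. The helper name `extract_sequences` and the `DEMO_` records are hypothetical. `GBSeq_primary-accession` is taken from the GBSeq XML format, and `.get` is used on the assumption that some records may not carry a `GBSeq_sequence` entry.

```python
def extract_sequences(parsed_records):
    """Map each record's accession to its sequence string (None if absent)."""
    sequences = {}
    for record in parsed_records:
        # Fall back to a placeholder key if the accession field is missing.
        accession = record.get("GBSeq_primary-accession", "unknown")
        # .get avoids a KeyError for records without a sequence entry.
        sequences[accession] = record.get("GBSeq_sequence")
    return sequences


# Hypothetical stand-in for the list returned by Entrez.read(handle).
parsed = [
    {"GBSeq_primary-accession": "DEMO_0001", "GBSeq_sequence": "ggctcgcatc"},
    {"GBSeq_primary-accession": "DEMO_0002"},
]
result = extract_sequences(parsed)
print(result)
```

    With a real handle, `Entrez.read(handle)` would take the place of `parsed`.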
    

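    【Comments】:

      【Solution 2】:

      I'm fairly sure that when you parse a fasta file with Biopython, it organizes the information into a dictionary. You can check how everything is organized by printing: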

      print dir(seq_record)
      

      I know that when parsing GenBank files, each seq_record has a dictionary called features, so assuming FASTA files are organized the same way, you could access the sequence like this:

      for record in SeqIO.parse(handle, "fasta"):
          for f in record.features:
              print "sequence"
              print dir(f) # Print the attributes of f to make sure that "sequence" used in the above line is in fact a key in the dictionary, if not pick the correct key to use above
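
      For comparison, here is a minimal sketch of parsing the handle only once, which is what the traceback in the question points to: there, the generator returned by the first `SeqIO.parse` call was passed into a second `SeqIO.parse` call, which expects a file-like handle. As far as I know, `SeqIO.parse` yields `SeqRecord` objects whose sequence is in `.seq`, while `.features` is a list that stays empty for FASTA input. The sketch uses Python 3 syntax and an in-memory `StringIO` as a stand-in for the Entrez handle, so the record names and the helper `sequences_from_handle` are hypothetical.

```python
from io import StringIO

from Bio import SeqIO


def sequences_from_handle(handle):
    """Parse a FASTA handle once and return each record's sequence as a string."""
    return [str(record.seq) for record in SeqIO.parse(handle, "fasta")]


# Stand-in for the handle returned by Entrez.efetch(db="nuccore", id=key, rettype="fasta");
# the two records below are made up for illustration.
fake_handle = StringIO(
    ">demo_1 hypothetical record\n"
    "GGCTCGCATC\n"
    "TCTCCTTCAC\n"
    ">demo_2\n"
    "ACGT\n"
)
sequences = sequences_from_handle(fake_handle)
print(sequences)
```

      Nothing is written to disk here; the sequences are read straight from the handle.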
      

      【Comments】:
