Biopython's ESearch does not give me the full IdList

【Question title】: Biopython's ESearch does not give me full IdList
【Posted】: 2017-10-05 07:00:32
【Question】:

I am trying to search for some articles with the following code:

handle = Entrez.esearch(db="pubmed", term="lung+cancer")
record = Entrez.read(handle)

From record['Count'] I can see that there are 293279 results, but when I look at record['IdList'] it gives me only 20 IDs. Why is that, and how do I get all 293279 records?

【Comments】:

Tags: python biopython pubmed

【Answer 1】:

By default, Entrez.esearch returns only 20 records, which keeps NCBI's servers from being overloaded. To get the full list of IDs, raise the retmax parameter:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"     # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="lung+cancer")
>>> record = Entrez.read(handle)
>>> count = record['Count']
>>> handle = Entrez.esearch(db="pubmed", term="lung+cancer", retmax=count)
>>> record = Entrez.read(handle)
>>> print(len(record['IdList']))
293279
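
If one very large request is not practical (a server may limit how many IDs it returns per call), the list can also be collected page by page with ESearch's retstart and retmax parameters. Below is a minimal sketch of the paging arithmetic only. `page_params` is a hypothetical helper name, not part of Biopython, and the page size is an arbitrary choice:

```python
def page_params(count, page_size=10000):
    """Yield (retstart, retmax) pairs that together cover `count` results."""
    for retstart in range(0, count, page_size):
        # The last page may be shorter than page_size
        yield retstart, min(page_size, count - retstart)


# Split the 293279 results from the question into pages of 100000
pages = list(page_params(293279, page_size=100000))
print(pages)
```

Each pair would then be passed on as Entrez.esearch(db="pubmed", term="lung+cancer", retstart=retstart, retmax=retmax), with the IdList of every page appended to one combined list.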

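The way to download all of the records is to use Entrez.epost.

From chapter 9.4 of the Biopython tutorial:

EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.

To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations, anything). When you make a request with EFetch, your list of IDs, the database, etc. are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don't cope well).

Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an "HTML post" internally, rather than an "HTML get", getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.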

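[...] The returned XML includes two important strings, QueryKey and WebEnv, which together define your history session. You would extract these values for use with another Entrez call such as EFetch.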
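
To make those two strings concrete, here is a rough sketch that pulls QueryKey and WebEnv out of an EPost-style XML reply using only the standard library. The sample XML is made up for illustration (the WebEnv value is a placeholder, not a real session), and `parse_epost` is a hypothetical helper. With Biopython, Entrez.read() does this parsing for you.

```python
import xml.etree.ElementTree as ET

# Illustrative reply only; a real WebEnv is a long server-generated string
SAMPLE_REPLY = """<ePostResult>
    <QueryKey>1</QueryKey>
    <WebEnv>EXAMPLE_WEBENV_PLACEHOLDER</WebEnv>
</ePostResult>"""


def parse_epost(xml_text):
    """Return (query_key, webenv) from an EPost-style XML document."""
    root = ET.fromstring(xml_text)
    return root.findtext("QueryKey"), root.findtext("WebEnv")


query_key, webenv = parse_epost(SAMPLE_REPLY)
print(query_key, webenv)
```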

Read chapter 9.15: Searching for and downloading sequences using the history to learn how to use QueryKey and WebEnv.
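
As a hedged aside on that chapter: ESearch accepts usehistory="y", in which case the search result itself carries WebEnv and QueryKey, so the IDs may not need to be uploaded again with EPost. A minimal sketch follows. `search_with_history` is a hypothetical wrapper, and the Entrez module is passed in as an argument so the function can be tried with a stand-in object:

```python
def search_with_history(entrez, db, term):
    """Run ESearch with history enabled; return (count, webenv, query_key).

    `entrez` is expected to behave like Bio.Entrez (esearch and read).
    """
    handle = entrez.esearch(db=db, term=term, usehistory="y")
    record = entrez.read(handle)
    handle.close()
    return int(record["Count"]), record["WebEnv"], record["QueryKey"]
```

With Biopython installed, this would be called as search_with_history(Entrez, "pubmed", "lung+cancer") after `from Bio import Entrez` and setting Entrez.email. The returned values can then be handed to Entrez.efetch through its webenv and query_key arguments.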

A complete working example:

from Bio import Entrez
import time

Entrez.email = "A.N.Other@example.com" 
handle = Entrez.esearch(db="pubmed", term="lung+cancer")
record = Entrez.read(handle)

count = int(record['Count'])
handle = Entrez.esearch(db="pubmed", term="lung+cancer", retmax=count)
record = Entrez.read(handle)

id_list = record['IdList']
post_xml = Entrez.epost("pubmed", id=",".join(id_list))
search_results = Entrez.read(post_xml)

webenv = search_results["WebEnv"]
query_key = search_results["QueryKey"] 

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

batch_size = 200
with open("lung_cancer.txt", "w") as out_handle:
    for start in range(0, count, batch_size):
        end = min(count, start + batch_size)
        print("Going to download record %i to %i" % (start + 1, end))
        attempt = 0
        success = False
        while attempt < 3 and not success:
            attempt += 1
            try:
                fetch_handle = Entrez.efetch(db="pubmed",
                                             retstart=start, retmax=batch_size,
                                             webenv=webenv, query_key=query_key)
                success = True
            except HTTPError as err:
                if 500 <= err.code <= 599:
                    # Server-side error: wait, then try again
                    print("Received error from server %s" % err)
                    print("Attempt %i of 3" % attempt)
                    time.sleep(15)
                else:
                    raise
        if not success:
            # All three attempts failed; stop rather than reuse a stale handle
            raise RuntimeError("Could not download records %i to %i" % (start + 1, end))
        data = fetch_handle.read()
        fetch_handle.close()
        out_handle.write(data)
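
The retry logic in the loop above can be factored into a small reusable function. This is only a sketch under assumptions: `fetch_with_retry` is a hypothetical name, the fetch step is passed in as a callable, and `sleep` is injectable so the behaviour can be checked without actually waiting. It targets Python 3 only.

```python
import time
from urllib.error import HTTPError


def fetch_with_retry(fetch, attempts=3, delay=15, sleep=time.sleep):
    """Call fetch(); retry on HTTP 5xx errors, up to `attempts` calls in total."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except HTTPError as err:
            # Non-5xx errors and the final failed attempt are re-raised
            if not 500 <= err.code <= 599 or attempt == attempts:
                raise
            print("Received error from server %s" % err)
            print("Attempt %i of %i" % (attempt, attempts))
            sleep(delay)
```

Inside the download loop it could be used as fetch_handle = fetch_with_retry(lambda: Entrez.efetch(db="pubmed", retstart=start, retmax=batch_size, webenv=webenv, query_key=query_key)).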

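【Comments】:

By the way, do you know whether I can use the method you posted to fetch descriptions from the GEO database (db=gds), for example this one: ncbi.nlm.nih.gov/sites/GDSbrowser?acc=GDS5648
Yes, just replace every instance of "pubmed" in the script above with "gds".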