【Question title】: UnicodeDecodeError when using Biopython to obtain abstracts from efetch
【Posted】: 2017-03-21 11:34:17
【Question description】:

Recently I have been using Biopython to extract some abstracts from PubMed. My code is written in Python 3 and looks like this:

from Bio import Entrez

Entrez.email = "myemail@example.com"    # Always tell NCBI who you are


def get_number():    #Get the total number of abstract available in Pubmed
    handle = Entrez.egquery(term="allergic contact dermatitis ")
    record = Entrez.read(handle)
    for row in record["eGQueryResult"]:
        if row["DbName"]=="pubmed":
            return int(row["Count"])


def get_id():    #Get all the ID of the abstract available in Pubmed
    handle = Entrez.esearch(db="pubmed", term="allergic contact dermatitis ", retmax=200)
    record = Entrez.read(handle)
    idlist = record["IdList"]
    return idlist

idlist = get_id()

for ids in idlist:    #Download the abstract based on their ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")    # Retmode Can Be txt / json / xml / csv
    f = open("{}.txt".format(ids), "w")    # Create a TXT file with the name of ID
    f.write(handle.read())    #Write the abstract to the TXT file

I wanted to get 200 abstracts, but only three or four were downloaded successfully. Then this error appeared:

UnicodeDecodeError: 'cp950' codec can't decode byte 0xc5 in position 288: illegal multibyte sequence

It seems that handle.read() has trouble with abstracts that contain certain symbols or words. I tried using print to find out the class of handle:

handle = Entrez.efetch(db="pubmed", id=idlist, rettype="abstract", retmode="text")
print(handle)

The result is:

<_io.TextIOWrapper encoding='cp950'>
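The encoding='cp950' in this output suggests that the response text is being decoded with the Windows locale codec rather than with the encoding the server used. The sketch below reproduces that kind of mismatch without any network access. It assumes the response bytes are UTF-8 (not confirmed in the thread), and the sample string is made up for illustration:

```python
# Made-up sample text: "\u0151" (o with double acute) becomes the two
# bytes 0xC5 0x91 when encoded as UTF-8.
sample = "Erd\u0151s"
raw = sample.encode("utf-8")

# Decode the UTF-8 bytes with cp950, which is what a text wrapper
# created with encoding='cp950' would attempt.
cp950_error = None
try:
    raw.decode("cp950")
except UnicodeDecodeError as exc:
    cp950_error = exc

# Decoding with the matching codec succeeds.
decoded = raw.decode("utf-8")

print("cp950 decode failed:", cp950_error is not None)
print("utf-8 round trip ok:", decoded == sample)
```

If this is indeed the cause, the failure depends on which characters an abstract contains, which would explain why a few downloads succeed before one fails.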

I have searched many pages for a solution, but none of them worked. Can anyone help?

【Question discussion】:

Tags: python-3.x biopython pubmed


【Solution 1】:

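Your code runs fine for me, so this is an encoding problem on your side. You can open the file in binary mode and encode the text as UTF-8 yourself. Try a workaround like this: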

for ids in idlist:    # Download each abstract by its ID
    handle = Entrez.efetch(db="pubmed", id=ids, rettype="abstract", retmode="text")
    with open("{}.txt".format(ids), "wb") as f:    # Binary mode: the file receives explicit UTF-8 bytes
        f.write(handle.read().encode('utf-8'))
    handle.close()    # Release the connection once the abstract is saved
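The workaround above controls the encoding used when writing. The traceback in the question is a decode error, though, so the read side may need the same treatment. The following is a minimal sketch of fixing the encoding explicitly on both sides, independent of the system locale. It uses an in-memory byte stream as a stand-in for the HTTP response. The helper names `read_as_utf8` and `save_abstract`, the ID `12345`, and the sample text are hypothetical, and it assumes the server sends UTF-8:

```python
import io
import tempfile
from pathlib import Path


def read_as_utf8(binary_stream):
    """Decode a binary stream as UTF-8, ignoring the locale default."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8").read()


def save_abstract(pmid, text, directory="."):
    """Write one abstract to <pmid>.txt as UTF-8 and return the path."""
    path = Path(directory) / "{}.txt".format(pmid)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    return path


# Stand-in for the raw response body (made-up content, not real PubMed data).
fake_response = io.BytesIO("Erd\u0151s P. Contact dermatitis.".encode("utf-8"))
text = read_as_utf8(fake_response)

with tempfile.TemporaryDirectory() as tmp:
    saved = save_abstract("12345", text, tmp)
    round_trip = saved.read_text(encoding="utf-8")

print("round trip ok:", round_trip == text)
```

Passing `encoding="utf-8"` to `open` in text mode has the same effect as opening in binary mode and calling `.encode('utf-8')` by hand, and the `with` block closes the file even if writing fails.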

【Discussion】:
