【Posted】: 2014-08-02 00:22:55
【Question】:
I am trying to parse a PubMed Central XML file with Biopython's Bio.Entrez parsing functions. This is what I have tried so far:
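import glob
from Bio import Entrez

def read_xml(handle, outh):
    # outh is not used in this excerpt
    records = Entrez.parse(handle)
    for record in records:
        print(record)

for xmlfile in glob.glob('samplepmcxml.xml'):
    print(xmlfile)
    fh = open(xmlfile, "r")
    # outfp is an output handle opened elsewhere in the full script (not shown)
    read_xml(fh, outfp)
    fh.close()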
I get the following error:
Traceback (most recent call last):
  File "3parse_info_from_pmc_nxml.py", line 78, in <module>
    read_xml (fh, outfp)
  File "3parse_info_from_pmc_nxml.py", line 10, in read_xml
    for record in records:
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 137, in parse
    self.parser.Parse(text, False)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/Parser.py", line 165, in startNamespaceDeclHandler
    raise NotImplementedError("The Bio.Entrez parser cannot handle XML data that make use of XML namespaces")
NotImplementedError: The Bio.Entrez parser cannot handle XML data that make use of XML namespaces
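The exception comes from the parser's startNamespaceDeclHandler, so it is triggered by an `xmlns:...` declaration in the file rather than by a missing DTD. PMC article files typically declare at least the `xlink` and `mml` prefixes on the root `article` element. As a rough check, the sketch below lists the namespace declarations in a document using only the standard library. The inline `SAMPLE_NXML` and the helper name `list_namespaces` are made up for illustration; a real file would be opened in binary mode and passed in instead.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-in for a PMC article file.
SAMPLE_NXML = b"""<?xml version="1.0"?>
<article xmlns:xlink="http://www.w3.org/1999/xlink"
         xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <front><article-meta><title-group>
    <article-title>Example title</article-title>
  </title-group></article-meta></front>
</article>
"""


def list_namespaces(handle):
    """Return (prefix, uri) pairs for every namespace declared in the XML."""
    # With events=("start-ns",), iterparse yields ("start-ns", (prefix, uri)).
    return [ns for _event, ns in ET.iterparse(handle, events=("start-ns",))]


namespaces = list_namespaces(io.BytesIO(SAMPLE_NXML))
print(namespaces)
```

Any non-empty result means the file uses XML namespaces, which is exactly the condition the Bio.Entrez parser refuses in the traceback above.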
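I have downloaded the archivearticle.dtd file. Do I need to install any other DTD files that describe the schema of PMC files? Has anyone successfully parsed PMC articles with the Bio.Entrez functions or by any other method?
Thanks for your help!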
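On the "any other method" part of the question, one option that does not depend on Bio.Entrez is the standard library's xml.etree.ElementTree, which accepts namespaced XML. The sketch below is only an illustration under assumptions: the inline `SAMPLE_NXML`, the placeholder PMC id, and the helper name `summarize_article` are hypothetical, and the element paths follow the usual `front/article-meta` layout of PMC article XML, which should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

# Namespace URI bound to the xlink prefix in the sample below.
XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical, heavily trimmed stand-in for a PMC article file.
SAMPLE_NXML = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <article-meta>
      <article-id pub-id-type="pmc">0000000</article-id>
      <title-group>
        <article-title>Example title</article-title>
      </title-group>
    </article-meta>
  </front>
  <body>
    <p>See <ext-link ext-link-type="uri"
        xlink:href="http://example.org/data">the data</ext-link>.</p>
  </body>
</article>
"""


def summarize_article(xml_text):
    """Pull a few fields out of an article document given as a string."""
    root = ET.fromstring(xml_text)
    meta = "front/article-meta/"
    title = root.findtext(meta + "title-group/article-title")
    pmc_id = root.findtext(meta + "article-id[@pub-id-type='pmc']")
    # Namespaced attributes are addressed as "{namespace-uri}local-name".
    links = [el.get("{%s}href" % XLINK) for el in root.iter("ext-link")]
    return {"title": title, "pmc_id": pmc_id, "links": links}


summary = summarize_article(SAMPLE_NXML)
print(summary)
```

For a file on disk, `ET.parse(path).getroot()` would take the place of `ET.fromstring(xml_text)`.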
【Discussion】:
Tags: python xml-parsing dtd biopython