Handling multiple nodes when parsing XML with Python

【Question title】: Handling multiple nodes when parsing XML with Python
【Posted】: 2017-04-27 15:35:29
【Question】:

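For an assignment, I need to parse a 2-million-line XML file and load the data into a MySQL database. Since we use a Python environment with sqlite for the class, I am trying to use Python to parse the file. Keep in mind that I am just learning Python, so everything is new!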

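I have made several attempts, but I keep failing and getting frustrated. For efficiency, I am testing my code on only a small piece of the complete XML, shown here: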

<pub>
<ID>7</ID>
<title>On the Correlation of Image Size to System Accuracy in Automatic Fingerprint Identification Systems</title>
<year>2003</year>
<booktitle>AVBPA</booktitle>
<pages>895-902</pages>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>

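First attempt

Here I managed to extract the data from every tag, except when there are multiple authors under the <authors> tag. I am trying to loop through each node in the authors tag, count them, build a temporary array of those authors, and then insert them into my database with SQL. I get "15" as the number of authors, but there are clearly only 4! How do I fix this?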

from xml.dom import minidom

xmldoc= minidom.parse("test.xml")

pub = xmldoc.getElementsByTagName("pub")[0]
ID = pub.getElementsByTagName("ID")[0].firstChild.data
title = pub.getElementsByTagName("title")[0].firstChild.data
year = pub.getElementsByTagName("year")[0].firstChild.data
booktitle = pub.getElementsByTagName("booktitle")[0].firstChild.data
pages = pub.getElementsByTagName("pages")[0].firstChild.data
authors = pub.getElementsByTagName("authors")[0]
author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

print(ID)
print(title)
print(year)
print(booktitle)
print(pages)
print(author)

【Question comments】:

Tags: python mysql xml

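【Solution 1】:

Note that what you are getting here is the number of characters in the first author's name ("J. K. Schneider" is 15 characters long), because the code restricts the result to the first author only (index 0) and then takes its length: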

author = authors.getElementsByTagName("author")[0].firstChild.data
num_authors = len(author)
print("Number of authors: ", num_authors )

Simply don't restrict the result, so that you get all the author elements:

author = authors.getElementsByTagName("author")
num_authors = len(author)
print("Number of authors: ", num_authors )

You can use a list comprehension to get a list of all the author names, rather than the author elements:

author = [a.firstChild.data for a in authors.getElementsByTagName("author")]
print(author)
# [u'J. K. Schneider', u'C. E. Richardson', u'F. W. Kiefer', u'Venu Govindaraju']
# (Python 2 output; Python 3 prints the same list without the u prefixes)
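
The list comprehension above can be extended to what the question is ultimately after: visiting every <pub> element and storing one row per author. The following is only a minimal sketch under stated assumptions. The helper `text_of`, the table name `pub_author`, and its columns are hypothetical names chosen for illustration, and an in-memory sqlite3 database stands in for the real database. The sample XML is a shortened copy of the record from the question.

```python
import sqlite3
from xml.dom import minidom

# Shortened copy of the sample record from the question.
SAMPLE = """<pub>
<ID>7</ID>
<year>2003</year>
<authors>
    <author>J. K. Schneider</author>
    <author>C. E. Richardson</author>
    <author>F. W. Kiefer</author>
    <author>Venu Govindaraju</author>
</authors>
</pub>"""


def text_of(parent, tag):
    """Hypothetical helper: text of the first <tag> element under parent."""
    return parent.getElementsByTagName(tag)[0].firstChild.data


doc = minidom.parseString(SAMPLE)

# Assumption: an in-memory sqlite3 database with a made-up schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pub_author (pub_id INTEGER, position INTEGER, name TEXT)"
)

for pub in doc.getElementsByTagName("pub"):
    pub_id = int(text_of(pub, "ID"))
    # No [0] index here, so every <author> element is visited.
    names = [a.firstChild.data for a in pub.getElementsByTagName("author")]
    conn.executemany(
        "INSERT INTO pub_author VALUES (?, ?, ?)",
        [(pub_id, pos, name) for pos, name in enumerate(names, start=1)],
    )
conn.commit()

rows = conn.execute(
    "SELECT name FROM pub_author WHERE pub_id = 7 ORDER BY position"
).fetchall()
print("Number of authors:", len(rows))
```

Using `?` placeholders lets the database driver handle quoting, which matters for names that contain apostrophes or other special characters.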

【Discussion】:

I knew I needed to access each variable in the array, but I wasn't sure of the syntax. Thank you so much!
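Hey @har07, I've made progress, but some of my XML data is "bad" in a sense... I have an entry whose name contains a special character such as "í", which appears in the XML file as an entity reference (presumably something like "&iacute;"). How do I get these special language characters into Python? The error I get is "ExpatError: undefined entity:".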
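
On the undefined-entity error in the last comment: XML itself predefines only five named entities (&amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, &amp;apos;), so an HTML-style name such as &amp;iacute; is rejected by the parser unless a DTD declares it. One possible workaround, which is not taken from the thread, is to rewrite the known HTML entity names as numeric character references before parsing. This is a sketch only. The function name `replace_html_entities` and the sample author name are made up for illustration.

```python
import re
from html.entities import name2codepoint
from xml.dom import minidom

# The five entity names that XML predefines must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def replace_html_entities(text):
    """Hypothetical helper: turn HTML named entities into numeric references."""

    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML's own or unknown names as they are
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# Made-up sample; the real file would be read from disk as text first.
sample = "<authors><author>Mart&iacute;n &amp; Co</author></authors>"
doc = minidom.parseString(replace_html_entities(sample))

author = doc.getElementsByTagName("author")[0]
# Join all child text nodes in case the parser splits the text.
name = "".join(child.data for child in author.childNodes)
print(name)
```

Numeric character references are valid in any XML document, so the parser accepts the rewritten text without a DTD. Entity names that are neither predefined by XML nor known to `html.entities` are left unchanged and would still raise the same error.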