Parsing XML to a hash table: answers

【Question title】: Parsing XML to a hash table
【Posted】: 2010-12-26 21:01:39
【Question description】:

I have an XML file in the following format:

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>

<type...>
</type>
</doc>

I want to parse this document and build a hash table:

{X: {"A": [(100,80), (200,90)], "B": [(100,20), (20,90)]}, Y: .....}

How would I do this in Python?

【Comments】:

This kind of question has been asked several times already. The answers there may help you. stackoverflow.com/questions/191536/… stackoverflow.com/questions/471946/…

Tags: python xml dom

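【Solution 1】:

Don't reinvent the wheel. Use the Amara toolkit. After all, variable names are just keys in a dictionary. http://www.xml3k.org/Amara

【Discussion】:

Another link - xml.com/pub/a/2005/01/19/amara.html You end up with a variable doc, which has doc.id, which has doc.id.type[0], and then doc.id.type[0].min, ... and so on. Super easy to access!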

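【Solution 2】:

I disagree with the suggestion in the other answers to use minidom: it is a so-so Python adaptation of a standard originally conceived for other languages, usable but not a great fit. The recommended approach in modern Python is ElementTree.

The same interface is also implemented in the third-party module lxml, which is faster, but unless you need blazing speed, the version included in the Python standard library is fine (and faster than minidom anyway). The key is to program to that interface; then you can always switch to a different implementation of the same interface when needed, with minimal changes to your own code.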
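One common way to program to the shared interface is to choose the implementation at import time. This is a minimal sketch, assuming lxml may or may not be installed; the inline sample string and the `names` variable are illustrative and not part of the original answer.

```python
# Prefer the faster third-party implementation when it is installed,
# otherwise fall back to the standard library module with the same interface.
try:
    from lxml import etree as et
except ImportError:
    from xml.etree import ElementTree as et

# Illustrative input, not read from a real file.
root = et.fromstring('<doc><id name="X"/><id name="Y"/></doc>')

# Iterating an element and calling .get() works the same under either module.
names = [child.get("name") for child in root]
print(names)  # ['X', 'Y']
```

Everything after the import refers only to the alias `et`, so swapping implementations touches a single line.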

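For example, after the needed imports &c, the following code is a minimal implementation of your example (it doesn't verify that the XML is correct, it just extracts the data assuming correctness; adding various checks is pretty easy, of course):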

from xml.etree import ElementTree as et  # or, import any other, faster version of ET

def xml2data(xmlfile):
  # Parse the file and walk doc -> id -> type -> min/max.
  tree = et.parse(xmlfile)
  data = {}
  # Iterate over elements directly; Element.getchildren() was removed in Python 3.9.
  for anid in tree.getroot():
    currdict = data[anid.get('name')] = {}
    for atype in anid:
      currlist = currdict[atype.get('name')] = []
      for c in atype:
        # Each <min>/<max> child contributes a (val, id) tuple of strings.
        currlist.append((c.get('val'), c.get('id')))
  return data

Given your example input, this produces the result you want (with the attribute values as strings).
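The answer notes that adding checks is easy. The following is a rough sketch of what that might look like, not the answerer's code: the function name `xml2data_checked`, the specific checks, and the conversion to integers (to match the tuples in the question) are all assumptions made for illustration.

```python
import io
from xml.etree import ElementTree as et

def xml2data_checked(source):
    """Return {id_name: {type_name: [(val, id), ...]}} with integer values."""
    root = et.parse(source).getroot()
    if root.tag != "doc":
        raise ValueError(f"expected <doc> as the root element, got <{root.tag}>")
    data = {}
    for anid in root.findall("id"):          # only direct <id> children
        id_name = anid.get("name")
        if id_name is None:
            raise ValueError("<id> element without a name attribute")
        currdict = data[id_name] = {}
        for atype in anid.findall("type"):   # only direct <type> children
            type_name = atype.get("name")
            if type_name is None:
                raise ValueError("<type> element without a name attribute")
            currlist = currdict[type_name] = []
            for c in atype:
                if c.tag not in ("min", "max"):
                    raise ValueError(f"unexpected <{c.tag}> inside <type>")
                try:
                    # int(None) raises TypeError, int("abc") raises ValueError.
                    currlist.append((int(c.get("val")), int(c.get("id"))))
                except (TypeError, ValueError):
                    raise ValueError(f"<{c.tag}> needs integer val and id attributes")
    return data

# Illustrative input wrapped in a file-like object; a file name works as well.
sample = io.StringIO("""<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
  </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>""")
result = xml2data_checked(sample)
print(result)
# {'X': {'A': [(100, 80), (200, 90)], 'B': [(100, 20), (20, 90)]}}
```

Using `findall("id")` and `findall("type")` silently skips unrelated siblings; whether to skip or reject them is a design choice that depends on how strict the input format is meant to be.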

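【Discussion】:

for child in node.getchildren(): is unnecessary; use for child in node: instead.
Warning: the xml.etree.ElementTree module is not secure against maliciously constructed data. If you need to parse untrusted or unauthenticated data, see XML vulnerabilities. Just to be cautious.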
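As a partial illustration of the warning above, here is a sketch that refuses any document declaring a DOCTYPE before handing it to ElementTree. The helper name `parse_without_doctype` is hypothetical, and the check only addresses attacks that rely on a DTD (such as entity expansion); it is not a complete defense. The XML vulnerabilities section of the Python documentation and the third-party defusedxml package cover the topic more fully.

```python
from xml.etree import ElementTree as et
from xml.parsers import expat

def parse_without_doctype(text):
    """Parse XML text, refusing any document that declares a DOCTYPE."""
    def reject_doctype(name, system_id, public_id, has_internal_subset):
        raise ValueError("DOCTYPE declarations are not accepted")

    # First pass: a bare expat parser whose only job is to spot a DOCTYPE.
    checker = expat.ParserCreate()
    checker.StartDoctypeDeclHandler = reject_doctype
    checker.Parse(text, True)

    # Second pass: no DTD was declared, so build the tree as usual.
    return et.fromstring(text)

root = parse_without_doctype('<doc><id name="X"/></doc>')
print(root.tag)  # doc
```

Parsing twice costs extra time, which seems an acceptable trade for small configuration-style files like the one in the question.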

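【Solution 3】:

As others have said, minidom is the way to go here. You open (and parse) the file, and as you walk the nodes you check whether each one is relevant and should be read. That way you also know whether you want to read its child nodes.

I threw this together, and it seems to do what you want. Some values are read by attribute position rather than attribute name. And there is no error handling. The print() at the end means it's Python 3.x.

I'll leave improving it as an exercise; I just wanted to post a snippet to get you started.

Happy hacking! :)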

xml.txt

<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
   </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>

parsexml.py

from xml.dom import minidom
data={}
doc=minidom.parse("xml.txt")
for n in doc.childNodes[0].childNodes:
    if n.localName=="id":
        id_name = n.attributes.item(0).nodeValue
        data[id_name] = {}
        for j in n.childNodes:
            if j.localName=="type":
                type_name = j.attributes.item(0).nodeValue
                data[id_name][type_name] = [(),()]
                for k in j.childNodes:
                    if k.localName=="min":
                        data[id_name][type_name][0] = \
                            (k.attributes.item(1).nodeValue, \
                             k.attributes.item(0).nodeValue)
                    if k.localName=="max":
                        data[id_name][type_name][1] = \
                            (k.attributes.item(1).nodeValue, \
                             k.attributes.item(0).nodeValue)
print (data)

Output:

{'X': {'A': [('100', '80'), ('200', '90')], 'B': [('100', '20'), ('20', '90')]}}
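Reading attributes by position depends on the order in which the parser happens to store them, so looking them up by name is more robust. The following is a minimal sketch of the same minidom walk using getAttribute; the function name `minidom_to_dict` and the inline `SAMPLE` string are illustrative and not part of the original answer.

```python
from xml.dom import minidom

SAMPLE = """<doc>
<id name="X">
  <type name="A">
    <min val="100" id="80"/>
    <max val="200" id="90"/>
  </type>
  <type name="B">
    <min val="100" id="20"/>
    <max val="20" id="90"/>
  </type>
</id>
</doc>"""

def minidom_to_dict(xml_text):
    """Build {id_name: {type_name: [(val, id), ...]}} from XML text."""
    doc = minidom.parseString(xml_text)
    data = {}
    for id_node in doc.documentElement.childNodes:
        # Skip whitespace text nodes; only <id> elements matter here.
        if id_node.nodeType != id_node.ELEMENT_NODE or id_node.tagName != "id":
            continue
        types = data[id_node.getAttribute("name")] = {}
        for type_node in id_node.childNodes:
            if type_node.nodeType != type_node.ELEMENT_NODE or type_node.tagName != "type":
                continue
            pairs = types[type_node.getAttribute("name")] = []
            for child in type_node.childNodes:
                if child.nodeType == child.ELEMENT_NODE and child.tagName in ("min", "max"):
                    # Look attributes up by name, not by position.
                    pairs.append((child.getAttribute("val"), child.getAttribute("id")))
    return data

result = minidom_to_dict(SAMPLE)
print(result)
# {'X': {'A': [('100', '80'), ('200', '90')], 'B': [('100', '20'), ('20', '90')]}}
```

Unlike the original snippet, this version appends entries in document order instead of assigning min and max to fixed slots, which is an assumption about the desired behavior.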

【Discussion】:

Sorry, wrong room. The ugly code contest is about to end.

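【Solution 4】:

Another XML parsing library: http://www.crummy.com/software/BeautifulSoup/

The documentation on parsing XML starts here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Parsing%20XML

【Discussion】:

I'm more familiar with BeautifulSoup and with parsing URLs than with local XML files, so this is a good solution for me.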

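【Solution 5】:

Why not try something like the PyXml library? They have plenty of documentation and tutorials.

【Discussion】:

Warning, Norwegian Blue parrot syndrome: the last release was 5 years ago. There are no Windows installers for Python 2.5 and 2.6.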

【Solution 6】:

I'd recommend using the minidom library.

The documentation is very good, so you should be up and running in no time.

Dan.

【Discussion】: