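XML Parsing with Python and minidom: Answers

【Question title】: XML Parsing with Python and minidom
【Posted】: 2009-10-20 19:36:07
【Question】:

I am using Python (minidom) to parse an XML file, and I want to print a hierarchy that looks something like this (indentation is used here to show the significant hierarchical relationships):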

My Document
Overview
    Basic Features
    About This Software
        Platforms Supported

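Instead, the program iterates over the nodes multiple times and prints duplicate nodes, as shown below. (Looking at the node list on each iteration, it is obvious why it does this, but I can't seem to find a way to get the node list I am looking for.)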

My Document
Overview
Basic Features
About This Software
Platforms Supported
Basic Features
About This Software
Platforms Supported
Platforms Supported

Here is the XML source file:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

Here is the Python program:

import xml.dom.minidom

dom = xml.dom.minidom.parse("test.xml")
Topic = dom.getElementsByTagName('Topic')
for node in Topic:
    # Print the text of every Title found under this Topic
    alist = node.getElementsByTagName('Title')
    for a in alist:
        Title = a.firstChild.data
        print(Title)

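I can get the program to work correctly by not nesting the "Topic" elements, that is, by renaming the lower-level topics to something like "SubTopic1" and "SubTopic2". However, I want to take advantage of the built-in XML hierarchy without needing different element names; it seems I should be able to nest "Topic" elements, and there should be some way to know which level of "Topic" I am currently looking at.

I have tried a number of different XPath functions without success.

【Comments】:

If you want the first output, you could just print the text of every element; it is not clear to me how the structure affects the desired output.

Tags: python xml minidom

【Solution 1】:

getElementsByTagName is recursive: you get all descendants with a matching tagName. Because your Topics contain other Topics that also have Titles, the call picks up the lower-level Titles multiple times.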
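To see the recursion concretely, here is a minimal sketch that counts how many Title descendants each Topic reports. The helper name titles_per_topic is made up for illustration, and the XML from the question is inlined (without its XML declaration) so the snippet runs on its own.

```python
import xml.dom.minidom

# The Topic/Title structure from the question, inlined for a self-contained run.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def titles_per_topic(xml_text):
    """Return (own title, number of Title descendants) for every Topic."""
    dom = xml.dom.minidom.parseString(xml_text)
    result = []
    for topic in dom.getElementsByTagName('Topic'):
        # This call searches the whole subtree below topic, not just its children.
        titles = topic.getElementsByTagName('Title')
        # The first match in document order is the Topic's own Title here.
        result.append((titles[0].firstChild.data, len(titles)))
    return result


for name, count in titles_per_topic(XML):
    print(name, count)
```

The counts come out as 1, 4, 1, 2, 1, which add up to the nine lines printed by the program in the question.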

If you want to ask only for the matching direct children, and you have no XPath available, you can write a simple filter, e.g.:

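import xml.dom.minidom

def getChildrenByTagName(node, tagName):
    # Yield only the direct child elements of node whose tag matches
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE and (tagName == '*' or child.tagName == tagName):
            yield child

document = xml.dom.minidom.parse("test.xml")  # file name taken from the question
for topic in document.getElementsByTagName('Topic'):
    title = next(getChildrenByTagName(topic, 'Title'))  # first direct Title child
    print(title.firstChild.data)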
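If switching libraries is an option, the limited XPath support in the standard library's xml.etree.ElementTree already matches direct children only, which also makes it easy to track the nesting level. This is a rough sketch of that alternative, not minidom code; the generator name walk is hypothetical and the question's XML is inlined without its declaration.

```python
import xml.etree.ElementTree as ET

# The Topic/Title structure from the question, inlined for a self-contained run.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def walk(parent, depth=0):
    """Yield (depth, title) for each Topic below parent, in document order."""
    # A bare tag name in findall() matches direct children only.
    for topic in parent.findall('Topic'):
        # findtext() returns the text of the first direct Title child.
        yield depth, topic.findtext('Title')
        # Nested Topics sit one level deeper.
        yield from walk(topic, depth + 1)


root = ET.fromstring(XML)
for depth, title in walk(root):
    print('    ' * depth + title)
```

With the question's XML this prints the indented hierarchy the asker was after, with "My Document" and "Overview" at the left margin.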

【Comments】:

Thanks for the attempt. It didn't work, but it gave me some ideas. The following works (same general idea; FWIW, the nodeType is ELEMENT_NODE):

import xml.dom.minidom
from xml.dom.minidom import Node

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    for child in node.childNodes:
        if child.localName == 'Title':
            yield child

Topic = dom.getElementsByTagName('Topic')
for node in Topic:
    alist = getChildrenByTitle(node)
    for a in alist:
        # Title = a.firstChild.data
        Title = a.childNodes[0].nodeValue
        print(Title)

【Solution 2】:

The following works:

import xml.dom.minidom

dom = xml.dom.minidom.parse("docmap.xml")

def getChildrenByTitle(node):
    # Yield only the direct children of node that are Title elements
    for child in node.childNodes:
        if child.localName == 'Title':
            yield child

Topic = dom.getElementsByTagName('Topic')
for node in Topic:
    alist = getChildrenByTitle(node)
    for a in alist:
        Title = a.childNodes[0].nodeValue
        print(Title)

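【Comments】:

I would call the function getTitle (or get_title), and have it return only the first direct child Title element rather than all of them (since each element should have only one title anyway).
Maybe that is what I am not getting. I want the titles of all the direct children. Maybe a better name would be getTitlesOfChildren.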
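The helper suggested in the first comment could look roughly like the sketch below. It also addresses the "which level am I on" part of the question by counting Topic ancestors through parentNode. The names get_title, topic_depth and outline are illustrative only, and the code assumes every Topic has a non-empty Title.

```python
import xml.dom.minidom

# The Topic/Title structure from the question, inlined for a self-contained run.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def get_title(topic):
    """Return the text of the first direct Title child of topic, or None."""
    for child in topic.childNodes:
        if child.nodeType == child.ELEMENT_NODE and child.tagName == 'Title':
            return child.firstChild.data  # assumes the Title is not empty
    return None


def topic_depth(topic):
    """Count the Topic elements that enclose topic by walking up parentNode."""
    depth = 0
    parent = topic.parentNode
    while parent is not None:
        if parent.nodeType == parent.ELEMENT_NODE and parent.tagName == 'Topic':
            depth += 1
        parent = parent.parentNode
    return depth


def outline(xml_text):
    """Return (depth, title) for every Topic, in document order."""
    dom = xml.dom.minidom.parseString(xml_text)
    return [(topic_depth(t), get_title(t))
            for t in dom.getElementsByTagName('Topic')]


for depth, title in outline(XML):
    print('    ' * depth + title)
```

Here the recursive getElementsByTagName call is harmless, because each Topic is visited once and only its own direct Title is read.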

【Solution 3】:

I think this will help:

import xml.dom.minidom

# Read the file and parse its contents; the with block closes the file
with open("file.xml", 'r') as f:
    data = f.read()
doc = xml.dom.minidom.parseString(data)

# In this file every Topic has exactly one Title, so the i-th Title
# in document order belongs to the i-th Topic
for i, topic in enumerate(doc.getElementsByTagName('Topic')):
    title = doc.getElementsByTagName('Title')[i].firstChild.nodeValue
    print(title)

Output:

My Document
Overview
Basic Features
About This Software
Platforms Supported

【Comments】:

【Solution 4】:

You can use the following generator to walk the tree and get the titles together with their indentation level:

def f(elem, level=-1):
    if elem.nodeName == "Title":
        yield elem.childNodes[0].nodeValue, level
    elif elem.nodeType == elem.ELEMENT_NODE:
        for child in elem.childNodes:
            for e, l in f(child, level + 1):
                yield e, l

If you test it with your file:

import xml.dom.minidom as minidom
doc = minidom.parse("test.xml")
list(f(doc.documentElement))  # start at the root element; the Document node itself is not an ELEMENT_NODE

you will get a list containing the following tuples:

(u'My Document', 1), 
(u'Overview', 1), 
(u'Basic Features', 2), 
(u'About This Software', 2), 
(u'Platforms Supported', 3)

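This is only a basic idea, of course, and needs fine-tuning. If you just want spaces at the start of each line, you can code that directly in the generator, although keeping the level gives you more flexibility. You could also detect the first level automatically (initializing the level to -1 here is just a crude workaround...).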
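One possible way to do that fine-tuning is sketched below: the generator yields ready-to-print indented strings, and it works out the first level by counting only enclosing Topic elements, so no -1 start value is needed. The name indented_titles and its parameters are invented for this sketch, and the question's XML is inlined without its declaration.

```python
import xml.dom.minidom

# The Topic/Title structure from the question, inlined for a self-contained run.
XML = """<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>"""


def indented_titles(node, topics_open=0, indent='    '):
    """Yield each Title's text, indented once per enclosing Topic beyond the first.

    topics_open is the number of Topic elements that enclose node's children.
    """
    for child in node.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue  # skip whitespace text nodes and the like
        if child.tagName == 'Title':
            # A Title inside a single Topic is top level, hence the "- 1".
            yield indent * max(topics_open - 1, 0) + child.firstChild.data
        else:
            # Only Topic elements add a level; wrappers such as DOCMAP do not.
            deeper = topics_open + (1 if child.tagName == 'Topic' else 0)
            yield from indented_titles(child, deeper, indent)


doc = xml.dom.minidom.parseString(XML)
for line in indented_titles(doc):
    print(line)
```

Because only Topic elements change the level, the same generator can be started from the Document node or from the root element and gives the same indentation.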

【Comments】:

This is exactly what I had been trying to do all day, before I came across generators. Thanks a lot.

【Solution 5】:

A recursive function:

import xml.dom.minidom

def traverseTree(element, depth=0):
    for child in element.childNodes:
        if child.nodeType == child.TEXT_NODE:
            # Print the text of Title elements, indented by nesting depth
            if element.tagName == 'Title':
                print(depth * '    ', child.data)
        elif child.nodeType == child.ELEMENT_NODE:
            traverseTree(child, depth + 1)

filename = 'sample.xml'
dom = xml.dom.minidom.parse(filename)
traverseTree(dom.documentElement)

Your XML:

<?xml version="1.0" encoding="UTF-8"?>
<DOCMAP>
    <Topic Target="ALL">
        <Title>My Document</Title>
    </Topic>
    <Topic Target="ALL">
        <Title>Overview</Title>
        <Topic Target="ALL">
            <Title>Basic Features</Title>
        </Topic>
        <Topic Target="ALL">
            <Title>About This Software</Title>
            <Topic Target="ALL">
                <Title>Platforms Supported</Title>
            </Topic>
        </Topic>
    </Topic>
</DOCMAP>

The output you wanted:

 $ python parse_sample.py 
      My Document
      Overview
          Basic Features
          About This Software
              Platforms Supported

【Comments】: