How to properly parse parent/child XML with Python

【Question title】: How to properly parse parent/child XML with Python
【Posted】: 2014-01-15 10:32:24
【Question description】:

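I have been working on an XML parsing problem for the past few days and I just can't figure it out. I have used both Python's built-in ElementTree and the LXML library, with the same results. I would like to keep using ElementTree if possible, but if that library has limitations then LXML is fine. Please see the XML sample below. What I am trying to do is find a connection element and see which classes that element contains. I expect every connection to contain at least one class. If it does not have at least one class, I want to know that. The problem I am facing is that my code returns all of the classes in the document for every connection, not just the classes of that specific connection.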

<test>
  <connections>
    <connection>
      <id>10</id>
      <classes>
        <class>
          <classname>DVD</classname>
        </class>
        <class>
          <classname>DVD_TEST</classname>
        </class>
      </classes>
    </connection>
    <connection>
      <id>20</id>
      <classes>
        <class>
          <classname>TV</classname>
        </class>
      </classes>
    </connection>
  </connections>
</test>

For example, here is my Python code and the output it returns:

            for parentConnection in elemetTree.getiterator('connection'):
                # print parentConnection.tag
                for childConnection in parentConnection:
                    # print childConnection.text
                    if childConnection.tag == 'id':
                        connID = childConnection.text
                        print connID
                for p in tree.xpath('./connections/connection/classes/class'):
                    for attrib in p.attrib:
                        print '@' + attrib + '=' + p.attrib[attrib]

                    children = p.getchildren()
                    for child in children:
                        print child.text

Here is the output:

10
DVD
DVD_TEST
TV

20
DVD
DVD_TEST
TV

As you can see, I print the text of the CONNECTION ID and then the text of each CLASSNAME. However, both connections list the same CLASSNAME text. The output should look like this:

10
DVD
DVD_TEST

20
TV

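As the manually edited example above shows, each connection ID (parent) should have its own classes/classnames (children). I just can't figure out how to make this work. If any of you know how to do this, I would love to hear it.

I have tried building data structures and following other examples on this forum, but could not get them to work.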

【Question discussion】:

Tags: python-2.7 xml-parsing parent-child lxml elementtree

【Solution 1】:

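My solution does not use xpath. I suggest digging further into the lxml documentation. There may be more elegant and direct ways to achieve this. There is a lot to explore!

Solution: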

from lxml import etree
from io import BytesIO


class FindClasses(object):
    @staticmethod
    def parse_xml(xml_string):
        parser = etree.XMLParser()
        fs = etree.parse(BytesIO(xml_string), parser)
        fstring = etree.tostring(fs, pretty_print=True)
        element = etree.fromstring(fstring)
        return element

    def find(self, xml_string):
        for parent in self.parse_xml(xml_string).getiterator('connection'):
            for child in parent:
                if child.tag == 'id':
                    print child.text
                    self.find_classes(child)

    @staticmethod
    def find_classes(id_element):
        connection = id_element.getparent()  # step up from <id> to its own <connection>
        for section in connection:  # children of <connection>: <id> and <classes>
            for class_element in section:  # children of <classes>: each <class> (<id> has none)
                for classname in class_element:  # children of <class>: <classname>
                    print classname.text
        print

if __name__ == '__main__':
    xml_file = open('foo.xml', 'rb')  #foo.xml or path to your xml file
    xml = xml_file.read()
    f = FindClasses()
    f.find(xml)

Output:

10
DVD
DVD_TEST

20
TV

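【Discussion】:

Looks like a great example, and I have tested it. How would I make it read from a file, rather than from an xml string as your example shows?
@user2643864 Made the changes. See if that helps. If you find it helpful, please accept my answer.

【Solution 2】: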

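Your problem lies in your xpath expression. It knows nothing about the logic of your nested for loops: it is called on tree, so it searches the whole document no matter which connection the outer loop is currently on. The result of: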

tree.xpath('./connections/connection/classes/class')

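is a list of every element that matches the pattern given to xpath. In this case, every <class> element matching the pattern is selected (this is actually what makes xpath so powerful: when your data is stored this way, it can select all of those nodes at once).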
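
One way to avoid this is to run the lookup relative to each connection element instead of the whole tree. The following is a minimal sketch under some assumptions, not the answerer's own code: it uses the standard-library ElementTree, the helper name classes_by_connection is made up for illustration, and a third connection with id 30 and no classes has been added to the sample XML (it is not in the question) to show how a connection without classes could be reported. Each print call takes a single argument, so the sketch should behave the same on Python 2.7 and Python 3.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, plus a hypothetical third connection
# (id 30) that has no <classes> element at all.
XML = """<test>
  <connections>
    <connection>
      <id>10</id>
      <classes>
        <class><classname>DVD</classname></class>
        <class><classname>DVD_TEST</classname></class>
      </classes>
    </connection>
    <connection>
      <id>20</id>
      <classes>
        <class><classname>TV</classname></class>
      </classes>
    </connection>
    <connection>
      <id>30</id>
    </connection>
  </connections>
</test>"""


def classes_by_connection(root):
    """Return a list of (connection id, [class names]) pairs."""
    result = []
    for connection in root.iter('connection'):
        conn_id = connection.findtext('id')
        # The path is evaluated relative to this <connection>, so only
        # its own <classname> elements are matched.
        names = [el.text
                 for el in connection.findall('./classes/class/classname')]
        result.append((conn_id, names))
    return result


root = ET.fromstring(XML)
for conn_id, names in classes_by_connection(root):
    print(conn_id)
    if names:
        for name in names:
            print(name)
    else:
        # An empty list means this connection has no classes.
        print('(no classes)')
    print('')
```

With lxml the same idea should carry over: its elements also provide findall, and calling xpath on the connection element with a path that starts with ./ is evaluated relative to that element rather than the document.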

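【Discussion】:

Any idea how to fix this? Is there a code sample? I am struggling to figure out how to get around it.
I tried removing the XPATH expression and just using iterparse, and got the same problem.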