How to get specific nodes in an XML file with Python: answers

【Question title】: How to get specific nodes in an XML file with Python
【Posted】: 2020-06-06 13:42:44
【Question】:

I am looking for a way to get specific tags from a very large XML document, using Python's built-in DOM module.
For example:

<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>

<AssetType longname="camera" shortname="cam" shortnames="cams">
  <type>
    cam1
  </type>
  <type>
    cam2
  </type>
  <type>
    cam4
  </type>
</AssetType>

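I want to retrieve the values of the child nodes of the AssetType node whose attribute is longname="characters", so that the result is 'pub', 'geo', 'rig'.
Keep in mind that I have more than 1000 nodes.
Thanks in advance.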

【Comments】:

Tags: python xml

【Solution 1】:

Assuming your document is named assets.xml and has the following structure:

<assets>
    <AssetType>
        ...
    </AssetType>
    <AssetType>
        ...
    </AssetType>
</assets>

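You can then do the following:

import xml.etree.ElementTree as ET

tree = ET.parse("assets.xml")
root = tree.getroot()
# ElementTree only accepts relative paths on an element, hence the leading "."
for asset_type in root.findall(".//AssetType[@longname='characters']"):
    # Iterate over the element directly; getchildren() was removed in Python 3.9
    for child in asset_type:
        print(child.text.strip())  # strip the whitespace around pub / geo / rig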
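
Because the question mentions a very large document, a streaming variant of the same idea may be worth considering. The following is only a sketch using the standard library's ElementTree.iterparse; the sample data, the helper name types_for and the use of an in-memory buffer instead of a real file are assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample data standing in for a large assets.xml file.
SAMPLE = b"""<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type>
    <type> geo </type>
    <type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type>
  </AssetType>
</assets>"""


def types_for(source, longname):
    """Return the stripped <type> texts of AssetType elements with this longname."""
    found = []
    # An "end" event fires once an element and all of its children are parsed.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "AssetType":
            continue  # keep <type> children intact until their parent ends
        if elem.get("longname") == longname:
            found.extend((child.text or "").strip() for child in elem.findall("type"))
        elem.clear()  # release the children of the AssetType we are done with
    return found


# source may be a filename or a binary file object; a buffer is used here.
result = types_for(io.BytesIO(SAMPLE), "characters")
print(result)
```

In real use, source would be the path to the XML file. Clearing each AssetType after it has been inspected keeps only one such subtree in memory at a time.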

【Comments】:

【Solution 2】:

If you don't mind loading the whole document into memory:

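from lxml import etree

fname = "assets.xml"  # path to the XML file (assumed name)
data = etree.parse(fname)
# Collect the stripped text of every <type> under the matching AssetType
result = [node.text.strip()
    for node in data.xpath("//AssetType[@longname='characters']/type")]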

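You may need to remove the spaces at the start of the tags for this to work.

【Comments】:

This is my approach as well. Keep in mind that it requires the lxml module, which is not part of the Python standard library. That said, I am currently using it in a project where some of the XML files are 65 MB in size, and it does not complain (unlike the script's author).
+1 for lxml.etree, which is far superior to the ElementTree that ships with Python.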

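【Solution 3】:

You can use the pulldom API to parse large files without loading everything into memory at once. It offers a more convenient interface than SAX, at only a slight cost in performance.

It essentially lets you stream through the XML file until you find the part you are interested in, and then switch to regular DOM operations.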


from xml.dom import pulldom

# http://mail.python.org/pipermail/xml-sig/2005-March/011022.html
def getInnerText(oNode):
    """Recursively concatenate the text and CDATA content under oNode."""
    rc = ""
    for node in oNode.childNodes:
        if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
            rc += node.data
        elif node.nodeType == node.ELEMENT_NODE:
            rc += getInnerText(node)  # recurse into child elements
        # other node types (comments, processing instructions, ...) are skipped
    return rc


# xml_file is either a filename or a file
stream = pulldom.parse(xml_file)
for event, node in stream:
    if event == pulldom.START_ELEMENT and node.nodeName == "AssetType":
        if node.getAttribute("longname") == "characters":
            stream.expandNode(node)  # node now contains a mini-dom tree
            type_nodes = node.getElementsByTagName('type')
            for type_node in type_nodes:
                # type_text holds the text inside the <type> element
                type_text = getInnerText(type_node).strip()
                print(type_text)

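【Comments】:

【Solution 4】:

Use the xml.sax module. Build your own handler and check in startElement whether the name is AssetType. That way you can act only while an AssetType node is being processed.

Here is an example handler that shows how to build one (although it is not the prettiest way; back then I did not know all of Python's cool tricks ;-)).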
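
As a rough illustration of the handler idea described above (not the example this answer links to), a minimal sketch with the standard xml.sax module could look like the following. The class name AssetTypeHandler, its attributes and the sample data are hypothetical.

```python
import xml.sax

# Hypothetical sample data; a real script would parse the large file instead.
SAMPLE = b"""<assets>
  <AssetType longname="characters" shortname="chr" shortnames="chrs">
    <type> pub </type>
    <type> geo </type>
    <type> rig </type>
  </AssetType>
  <AssetType longname="camera" shortname="cam" shortnames="cams">
    <type> cam1 </type>
  </AssetType>
</assets>"""


class AssetTypeHandler(xml.sax.ContentHandler):
    """Collect <type> texts found inside AssetType elements with a given longname."""

    def __init__(self, longname):
        super().__init__()
        self.longname = longname
        self.in_asset = False  # True while inside a matching AssetType
        self.buffer = None     # list of text chunks while inside a <type>
        self.types = []

    def startElement(self, name, attrs):
        if name == "AssetType":
            self.in_asset = attrs.get("longname") == self.longname
        elif name == "type" and self.in_asset:
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.buffer is not None:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "type" and self.buffer is not None:
            self.types.append("".join(self.buffer).strip())
            self.buffer = None
        elif name == "AssetType":
            self.in_asset = False


handler = AssetTypeHandler("characters")
xml.sax.parseString(SAMPLE, handler)
print(handler.types)
```

For a file on disk, xml.sax.parse(filename, handler) takes the place of parseString, and the document is never held in memory as a whole.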

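【Comments】:

【Solution 5】:

You can use XPath, e.g. "//AssetType[@longname='characters']/xyz", where xyz is the name of the child element (type in the question's data). Note the @: longname is an attribute, not a child element.

For XPath libraries in Python, see http://www.somebits.com/weblog/tech/python/xpath.html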
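
In XPath, @longname refers to an attribute while a bare longname refers to a child element, so the two predicates select different nodes. The sketch below shows the difference using the XPath subset supported by the standard library's ElementTree; the second AssetType, which stores longname as a child element, is a made-up shape added only for contrast.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the first AssetType uses an attribute (as in the question),
# the second stores longname as a child element instead.
root = ET.fromstring("""<assets>
  <AssetType longname="characters"><type>pub</type><type>geo</type></AssetType>
  <AssetType><longname>characters</longname><type>other</type></AssetType>
</assets>""")

# [@longname='characters'] tests the attribute.
by_attribute = [t.text for t in root.findall(".//AssetType[@longname='characters']/type")]

# [longname='characters'] tests the text of a child element named longname.
by_child = [t.text for t in root.findall(".//AssetType[longname='characters']/type")]

print(by_attribute)
print(by_child)
```

With the question's data only the attribute form matches, because its AssetType elements have no longname child.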

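【Comments】:

【Solution 6】:

Similar to eswald's solution: it again strips whitespace and again loads the document into memory, but returns the three text items at once.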

from lxml import etree

data = """<AssetType longname="characters" shortname="chr" shortnames="chrs">
  <type>
    pub
  </type>
  <type>
    geo
  </type>
  <type>
    rig
  </type>
</AssetType>
"""

doc = etree.XML(data)

for asset in doc.xpath('//AssetType[@longname="characters"]'):
  # text() returns the raw strings, so strip the surrounding whitespace
  threetypes = [ x.strip() for x in asset.xpath('./type/text()') ]
  print(threetypes)

【Comments】: