python ElementTree.Element missing text? Answers

【Question title】: python ElementTree.Element missing text?
【Posted】: 2018-10-16 00:24:28
【Question description】:

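So, I'm parsing this moderately sized XML file (about 27K lines). Not far into it, I see unexpected behavior from ElementTree.Element: I get Element.text for one entry but not for the next, even though the text is there in the source XML, as you can see: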

<!-- language: lang-xml -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xs:enumeration value="24">
   <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
   </xs:annotation>
</xs:enumeration>
<xs:enumeration value="25">
   <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price; ‘cover 2’ is defined as the inside front cover of a book</xs:documentation>
   </xs:annotation>
</xs:enumeration>

When I encounter an enumeration tag, I call this function:

import xml.etree.cElementTree as ElementTree
...
    def _parse_list_item(xmlns: str, list_id: int, itemElement: ElementTree.Element) -> ListItem:
      if isinstance(itemElement, ElementTree.Element):
        if itemElement.attrib['value'] is not None:
            item_id = itemElement.attrib['value']  # string
            if list_id == 6 and (item_id == '25' or item_id=='24'):
                print(list_id, item_id)  # <== debug break point here
            desc = None
            notes = ""
            for child in itemElement:
                if child.tag == (xmlns + 'annotation'):
                    for grandchild in child:
                        if grandchild.tag == (xmlns + 'documentation'):
                            if desc is None:
                                desc = grandchild.text
                            else:
                                if len(notes)>0:
                                    notes += " "  # add a space
                                notes += grandchild.text or ""
            if item_id is not None and desc is not None:
                return Codex.ListItem({'itemId': item_id, 'listId': list_id, 'description': desc, 'notes': notes})

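If I put a breakpoint on the print statement, then when I reach the enumeration node for "24" I can inspect the text of the grandchild nodes, and it is as shown in the XML, i.e. "UPC12..." or "AKA item...". But when I reach the enumeration node for "25" and look at the grandchild text, it is None.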

When I remove the xs: namespace prefix by pre-filtering the XML file, the grandchild text comes through fine.

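Is it possible that I've exceeded some size limit, or is there a syntax problem?
Sorry for the less-than-pythonic code, but I wanted to be able to inspect all the intermediate values in PyCharm. This is Python 3.6.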

Thanks for any insight you can offer!

【Question comments】:

Tags: python xml python-3.x xml-parsing elementtree

【Solution 1】:

In the for loop, this condition is never satisfied: if child.tag == (xmlns + 'annotation'):

Why?

Try printing the child's tag. If we assume your namespace (xmlns) is 'Steve', then:

print(child.tag) will output {Steve}annotation, not Steveannotation.

Given that, if child.tag == (xmlns + 'annotation'): is always False.
You should change it to: if child.tag == ('{'+xmlns+'}annotation'):

By the same logic, you will also have to change this condition:

if grandchild.tag == (xmlns + 'documentation'):

to:

if grandchild.tag == ('{'+xmlns+'}documentation'):
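
The tag format this answer relies on can be checked directly. The following is a minimal sketch, assuming a trimmed copy of the XML from the question; the constant `XS_URI` and the helper `qualified` are hypothetical names introduced only for illustration. It shows that ElementTree expands the `xs:` prefix into `{namespace-uri}localname`, so a comparison only matches when the braces are present.

```python
import xml.etree.ElementTree as ET

# Trimmed excerpt of the schema from the question (one enumeration only).
XML = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="24">
    <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>
"""

root = ET.fromstring(XML)
enum = root[0]

# ElementTree replaces the "xs:" prefix with "{namespace-uri}".
print(enum.tag)  # {http://www.w3.org/2001/XMLSchema}enumeration

XS_URI = "http://www.w3.org/2001/XMLSchema"  # bare URI, no braces (assumed here)


def qualified(local_name):
    """Build the '{uri}name' form that Element.tag uses."""
    return "{" + XS_URI + "}" + local_name


annotation = enum[0]
print(annotation.tag == qualified("annotation"))   # braces present: matches
print(annotation.tag == XS_URI + "annotation")     # braces missing: no match
```

Whether the comparison in the question fails therefore depends on whether `xmlns` already contains the braces, which the question's code does not show.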

【Comments】:

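Sorry, I see more information is needed. The XML file contains this line near the top:
```
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
```
So before calling this function, I have already parsed that line and set xmlns to {http://www.w3.org/2001/XMLSchema}, so when we encounter child.tag = '{http://www.w3.org/2001/XMLSchema}annotation', xmlns+'annotation' matches... I'm considering pre-processing the file to strip the prefix from the tags, if that is what's tripping me up.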
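@SteveL: Information in comments should be part of the question. You should provide a minimal reproducible example.
@mzjn - Thanks - the information has been moved from the comment into the body of the question.
@SteveL: OK, but I still can't copy and paste the code and run it. You haven't provided a minimal but complete piece of code that reproduces the problem.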
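
For reference, a self-contained example of the kind requested in the comments above might look like the sketch below. It is not the asker's code: the helper `collect_docs` is a hypothetical stand-in for `_parse_list_item`, the second documentation text is shortened, and the XML is parsed in one go with `fromstring`. It uses `xmlns` with the braces included, as described in the first comment.

```python
import xml.etree.ElementTree as ET

# Excerpt from the question, with the closing tag added and notes shortened.
XML = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="24">
    <xs:annotation>
      <xs:documentation>UPC12 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
  <xs:enumeration value="25">
    <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
      <xs:documentation>AKA item/price</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>
"""

# Namespace with braces included, so XS + 'annotation' equals Element.tag.
XS = "{http://www.w3.org/2001/XMLSchema}"


def collect_docs(item):
    """Return (value, description, notes) for one xs:enumeration element."""
    desc = None
    notes = []
    for annotation in item.findall(XS + "annotation"):
        for doc in annotation.findall(XS + "documentation"):
            text = doc.text or ""
            if desc is None:
                desc = text          # first documentation is the description
            else:
                notes.append(text)   # later ones are collected as notes
    return item.get("value"), desc, " ".join(notes)


root = ET.fromstring(XML)
results = [collect_docs(e) for e in root.findall(XS + "enumeration")]
for row in results:
    print(row)
```

On this small excerpt both items yield their text, so the excerpt alone does not reproduce the missing-text behavior; whatever triggers it is in code or data not shown in the question.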

【Solution 2】:

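So, in the end, I solved my problem by running a pre-processing pass over the XML file to strip the xs: namespace prefix from all opening/closing XML tags, after which I was able to process the file successfully with the function defined above. I don't know why the namespace caused the problem, but perhaps there is a bug in cElementTree involving namespace prefixes in large XML files. To @mzjn - I expect building a minimal example would be difficult, since it did process hundreds of items correctly before failing, so I would have had to provide a fairly large XML file at the very least. Thanks for being a sounding board, though.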
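
A similar effect can be had without rewriting the file, by stripping the `{uri}` part from each tag after parsing. This is only a sketch under assumptions: `strip_namespaces` is a hypothetical helper, it ignores namespaced attributes, and it does not explain why the original run failed.

```python
import xml.etree.ElementTree as ET

XML = """\
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:enumeration value="25">
    <xs:annotation>
      <xs:documentation>UPC12+5 (item-specific) on cover 2</xs:documentation>
    </xs:annotation>
  </xs:enumeration>
</xs:schema>
"""


def strip_namespaces(root):
    """Remove the '{uri}' part from every element tag, in place."""
    for elem in root.iter():
        # Comments and processing instructions have non-string tags; skip them.
        if isinstance(elem.tag, str) and elem.tag.startswith("{"):
            elem.tag = elem.tag.split("}", 1)[1]
    return root


root = strip_namespaces(ET.fromstring(XML))

# Tags can now be matched by their plain local names.
docs = [d.text for d in root.iter("documentation")]
print(root.tag, docs)
```

With the namespaces removed, a function like the one in the question could compare against plain 'annotation' and 'documentation' (that is, with xmlns set to an empty string).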

【Comments】: