reference to invalid character number: (Python ElementTree parse) Answers

[Question title]: reference to invalid character number: (Python ElementTree parse)
[Posted]: 2019-09-18 03:01:46
[Question description]:

I have an XML file with the following content:

    <word>vegetation</word>
    <word>cover</word>
    <word>(&#x2;31%</word>
    <word>split_identifier ;</word>
    <word>Still</word>
    <word>and</word>

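When I read the file with ElementTree's parse, it gives me this error:

xml.etree.ElementTree.ParseError: reference to invalid character number

(because `&#x2;` is supposed to be "~").

How should I handle these? I am not sure how many other such symbols I will get in the future.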

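[Question discussion]:

Are you sure &#x2; is the HTML code for a tilde? I think &#126; is the correct code: w3schools.com/charsets/tryit.asp?deci=126 (and try: w3schools.com/charsets/tryit.asp?deci=2).
Actually, it is a "~" in the PDF. I use PDFBox to convert the PDF to text (XML), and then I parse that XML with the Python library ElementTree, which is where the problem shows up. So in the XML it is &#x2;. I don't mind if that character does not come through correctly, but I want the file to parse without any errors, because the rest of the content is important.
Can't you replace the code with a space?
Sure, that is an option. But how do I do that? Is there a Unicode codec that can do it?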

Tags: python elementtree

[Solution 1]:

If you want to get rid of those special characters, you can scrub the input XML as a string before parsing it:

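    import re
    import xml.etree.ElementTree as ET

    # `response` is an HTTP response object obtained elsewhere
    respXML = response.content.decode("utf-16")
    # Remove every numeric character reference, e.g. &#x2; or &#126;
    scrubbedXML = re.sub(r'&#(?:[xX][0-9a-fA-F]+|[0-9]+);', '', respXML)
    respRoot = ET.fromstring(scrubbedXML)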
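A self-contained sketch of the same scrubbing idea, applied to a string shaped like the sample in the question. The helper names (`strip_invalid_char_refs`, `_is_valid_xml_char`) and the `<doc>` wrapper element are hypothetical and exist only for illustration. It assumes you want to drop only the references whose code point XML 1.0 forbids, so a valid reference such as `&#126;` still reaches the parser:

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#126;) and hexadecimal (&#x7E;) character references.
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def _is_valid_xml_char(cp):
    """True if code point cp is allowed by the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def strip_invalid_char_refs(text, replacement=""):
    """Replace character references that XML 1.0 does not allow."""
    def _sub(match):
        hex_digits, dec_digits = match.groups()
        cp = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Keep valid references untouched; the parser resolves them later.
        return match.group(0) if _is_valid_xml_char(cp) else replacement
    return _CHAR_REF.sub(_sub, text)


# Hypothetical sample; <doc> is an assumed root element.
sample = "<doc><word>cover</word><word>(&#x2;31%</word><word>a&#126;b</word></doc>"
root = ET.fromstring(strip_invalid_char_refs(sample))
words = [w.text for w in root.iter("word")]
print(words)  # ['cover', '(31%', 'a~b']
```

Passing `replacement=" "` swaps each invalid reference for a space instead of deleting it, which is the option raised in the question discussion. For a file on disk, read it into a string first, scrub it, and then call `ET.fromstring` on the result.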

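If you would rather keep the special characters, you can resolve them before parsing. In your case they look like HTML character references, so you can use Python's html module: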

    import html
    import xml.etree.ElementTree as ET

    # Resolve character references first, then parse the result
    respRoot = ET.fromstring(html.unescape(response.content.decode("utf-16")))
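One thing worth checking before running `html.unescape` over a whole document: it resolves every reference it finds, including `&amp;` and `&lt;`, so text that was well-formed can stop being so. The sketch below uses made-up sample strings (not taken from the original file) to show both effects:

```python
import html
import xml.etree.ElementTree as ET

# Made-up sample strings; <doc> is an assumed root element.
ok = "<doc><word>(&#x2;31%</word></doc>"
risky = "<doc><word>a &amp; b</word></doc>"

# html.unescape follows the HTML5 rules, under which a reference to the
# control character U+0002 resolves to an empty string, so this now parses.
root = ET.fromstring(html.unescape(ok))
text = root.find("word").text
print(text)

# The escaped ampersand becomes a bare "&", which is not well-formed XML.
try:
    ET.fromstring(html.unescape(risky))
    parsed = True
except ET.ParseError:
    parsed = False
print(parsed)
```

If the document may contain escaped markup characters, scrubbing only the numeric references is likely the safer of the two approaches.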

[Discussion]: