Unicode Disappearing in html.parser: Answers

【Question title】: Unicode Disappearing in html.parser
【Posted】: 2013-04-28 03:49:21
【Question】:

I'm extracting HTML from some web pages that contain Unicode characters, like this:

import urllib.request

def extract(url):
    """Adapted from Python3_Google_Search.py"""
    user_agent = ("Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) "
                  "AppleWebKit/525.13 (KHTML, like Gecko) "
                  "Chrome/0.2.149.29 Safari/525.13")
    request = urllib.request.Request(url)
    request.add_header("User-Agent", user_agent)
    response = urllib.request.urlopen(request)
    # Decode the raw bytes; this assumes the page is served as UTF-8.
    html = response.read().decode("utf8")
    return html

As you can see, I'm decoding correctly, so html is now a Unicode string. When I print html, I can see the Unicode characters.

I'm using html.parser to parse the HTML, and I subclass it:

from html.parser import HTMLParser

class Parser(HTMLParser):
    def __init__(self):
        super().__init__()  # initialize the base parser's state
        ## some init stuff
    #### rest of class

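When the HTML is parsed through the class's handle_data, the Unicode characters seem to be removed; they simply disappear. The documentation doesn't mention anything about encodings. Why does the HTML parser drop non-ASCII characters, and how can I fix this?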

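【Comments】:

What program/tool are you using to view the output?
1. Are you 100% sure the data your script receives contains the characters, and 2. how are you verifying that the non-ASCII characters have "disappeared"?
I used Emacs in a terminal (with Unicode encoding enabled), and then Mac TextEdit.
@MartijnPieters, when I print html just before returning from the extract function, I see: <td>&Ouml;sterreich</td>. So yes, I'm 100% sure my script receives the correct Unicode characters. I'm verifying that the Unicode characters have disappeared by opening the text file I write out and seeing that they aren't there.
@Darksky: Those are HTML escape codes, which use only ASCII characters. Something else is removing them, and so far this has nothing to do with Python. &Ouml; is 6 characters: an ampersand, an uppercase O, lowercase u, m and l, and then a semicolon.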
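
To illustrate the point in the last comment, here is a minimal sketch showing that the escaped string contains only ASCII characters until the reference is decoded. It assumes Python 3.4 or later, where html.unescape is available; the variable names are made up for the example.

```python
import html

escaped = "<td>&Ouml;sterreich</td>"

# The reference "&Ouml;" is six plain ASCII characters, so nothing
# in this string is non-ASCII yet.
only_ascii = all(ord(ch) < 128 for ch in escaped)
print(only_ascii)  # True

# html.unescape replaces named and numeric character references
# with the characters they stand for.
decoded = html.unescape(escaped)
print(decoded)  # <td>Österreich</td>
```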

Tags: python unicode utf-8 python-3.x python-unicode

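【Solution 1】:

Apparently, html.parser calls handle_entityref when it encounters a named character reference such as &Ouml;, which is how the non-ASCII characters in my pages were written. It passes in the name of the reference, and to convert that to a Unicode character I used: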

html.entities.html5[name]

Python's documentation doesn't mention this. I have never seen documentation worse than Python's.
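
A minimal sketch of how the approach in this answer could be wired into a parser subclass. Several things here are assumptions rather than part of the answer: the class name TextCollector and its text() helper are hypothetical, and convert_charrefs=False (available since Python 3.4) is passed so that references are still routed to handle_entityref and handle_charref; with the newer default, the parser converts them itself before calling handle_data. Note also that most keys in html.entities.html5 end with a semicolon, so the sketch looks up name + ";" instead of name alone.

```python
from html.entities import html5
from html.parser import HTMLParser


class TextCollector(HTMLParser):  # hypothetical name, for illustration only
    def __init__(self):
        # Keep references going through handle_entityref / handle_charref.
        super().__init__(convert_charrefs=False)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

    def handle_entityref(self, name):
        # name arrives without '&' and ';', e.g. "Ouml".
        char = html5.get(name + ";")
        # Unknown names are written back unchanged instead of being dropped.
        self.chunks.append(char if char is not None else "&" + name + ";")

    def handle_charref(self, name):
        # Numeric references: "214" (decimal) or "xD6" (hexadecimal).
        if name.lower().startswith("x"):
            code = int(name[1:], 16)
        else:
            code = int(name)
        self.chunks.append(chr(code))

    def text(self):
        return "".join(self.chunks)


parser = TextCollector()
parser.feed("<td>&Ouml;sterreich &#214; &#xD6;</td>")
parser.close()
result = parser.text()
print(result)  # Österreich Ö Ö
```

Without the two reference handlers, the base class's default no-op methods would silently discard the references, which matches the "disappearing" characters described in the question.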

【Comments】: