HTMLParser.HTMLParser().unescape() doesn't work: answers

【Question title】: HTMLParser.HTMLParser().unescape() doesn't work
【Posted】: 2013-07-19 16:48:55
【Question description】:

I want to convert HTML entities back into human-readable characters, e.g. '&pound;' to '£', '&deg;' to '°', and so on.

I have read several posts about this problem:

Converting html source content into readable format with Python 2.x

Decode HTML entities in Python string?

Convert XML/HTML Entities into Unicode String in Python

Following their advice, I chose the undocumented function unescape(), but it doesn't work for me...

My code sample is as follows:

import HTMLParser

htmlParser = HTMLParser.HTMLParser()
decoded = htmlParser.unescape('&copy; 2013')
print decoded

When I run this Python script, the output is still:

&copy; 2013

instead of

© 2013

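I am using Python 2.X on Windows 7, working in a Cygwin console. I googled and did not find any similar problem. Can anyone help me with this?

【Question comments】:

I tried calling it from both the command line and IDLE, and it does work for me (Python 2.7 on Windows 8).

Tags: python html unicode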

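【Solution 1】:

Apparently HTMLParser.unescape was a bit more primitive before Python 2.6: as the sessions below show, Python 2.5 returns '&copy;' unchanged, while 2.6/2.7 decode it.

Python 2.5: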

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
'&copy;'

Python 2.6/2.7:

>>> import HTMLParser
>>> HTMLParser.HTMLParser().unescape('&copy;')
u'\xa9'

Compare the 2.5 implementation with the 2.6 implementation / 2.7 implementation.

【Comments】:

In Python 3.4+ it is html.unescape().
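
For reference, a minimal Python 3 sketch of the html.unescape() call mentioned in this comment. The sample strings are made up for illustration; they cover named, decimal, and hexadecimal character references.

```python
import html

# Illustrative inputs (not from the original post): named entities plus
# decimal and hexadecimal numeric references for the copyright sign.
samples = ["&copy; 2013", "&pound;5", "25&deg;C", "&#169;", "&#xa9;"]

# html.unescape() is documented and available since Python 3.4.
decoded = [html.unescape(s) for s in samples]
```

Unlike the Python 2 method in the question, this is a plain module-level function, so no parser instance is needed.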

【Solution 2】:

This site lists some solutions; here is one of them:

from xml.sax.saxutils import escape, unescape

html_escape_table = {
    '"': "&quot;",
    "'": "&apos;",
    "©": "&copy;"
    # etc...
}
html_unescape_table = {v:k for k, v in html_escape_table.items()}

def html_unescape(text):
    return unescape(text, html_unescape_table)

It's not the prettiest, though, since you have to list every escaped symbol by hand.
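
One possible way around the manual listing, as a sketch rather than part of the original answer: build the table from the standard library's HTML 4 entity list. This is written for Python 3 (html.entities; in Python 2 the same mapping lives in htmlentitydefs and would need unichr instead of chr). It still only covers named entities, not numeric references such as &#169;.

```python
from html.entities import name2codepoint
from xml.sax.saxutils import unescape

# Map "&name;" to its character for every entity in the HTML 4 list.
# "&amp;" is deliberately left out: saxutils.unescape() already replaces
# &amp; itself, and does so last, so "&amp;copy;" is not decoded twice.
html_unescape_table = {
    "&%s;" % name: chr(codepoint)
    for name, codepoint in name2codepoint.items()
    if name != "amp"
}
# &apos; comes from XML rather than the HTML 4 list, so add it explicitly.
html_unescape_table["&apos;"] = "'"

def html_unescape(text):
    """Decode named HTML entities in text using the generated table."""
    return unescape(text, html_unescape_table)
```

For example, html_unescape('&copy; 2013') returns '© 2013' without any hand-written table entries.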

Edit:

How about this?

import htmllib

def unescape(s):
    p = htmllib.HTMLParser(None)
    p.save_bgn()
    p.feed(s)
    return p.save_end()

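【Comments】:

Hi, thanks for your answer. But the content of my HTML page is unknown, so unless I list all the HTML special characters...

【Solution 3】:

In Python 3.9, calling HTMLParser().unescape(<str>) raises AttributeError: 'HTMLParser' object has no attribute 'unescape'

You can update it to: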

import html
html.unescape(<str>)
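
If the same script has to run on both old and new interpreters, one option is an import fallback. This is a sketch, not something the answers above prescribe; it assumes that no third-party package named html shadows the standard module on Python 2.

```python
try:
    # Python 3.4+: the documented module-level function.
    from html import unescape
except ImportError:
    # Python 2.6/2.7 fallback: the undocumented method from the question.
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

decoded = unescape("&copy; 2013")
```

Either branch leaves a callable named unescape, so the rest of the code does not need to know which interpreter it is running on.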
