Parse HTML with HTMLParser in Python 3: Answers

【Question title】: Parse HTML with HTMLParser in Python3
【Posted】: 2013-04-17 15:36:22
【Question description】:

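I have some Python 3 code that successfully parses HTML with HTMLParser on Windows. The problem is that I also want to run the script on Linux, and there it does not seem to work.

I retrieve the HTML with the following: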

html = urllib.request.urlopen(url).read()
html_str = str(html)
parse = MyHTMLParser()
parse.feed(html_str)

The raw output of html looks like this:

b'\n \n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\n
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n
    <html xmlns="http://www.w3.org/1999/xhtml">\n
        <head>\n

html is binary, so I convert it to a string so that parse.feed does not complain. The problem is that the HTML I get after converting to a string looks like this:

'b\'\\n \\n<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\\n
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\\n
<html xmlns="http://www.w3.org/1999/xhtml">\\n
    <head>\\n

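As you can see, there are several \\n sequences. Windows does not care about them, but on Linux they are treated as escape sequences, so the HTML cannot be parsed. I do not remember the exact error right now, but it was something like can't parse \\

I tried using re to strip the extra \ with re.sub("\\","",html_str), but it seems to have no effect on Windows, and on Linux I get an error as well.

This is the error I get on Linux when I try re.sub on the HTML: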

>>> re.sub("\\","",html_str)
Traceback (most recent call last):
  File "/usr/lib/python3.1/sre_parse.py", line 194, in __next
    c = self.string[self.index + 1]
IndexError: string index out of range

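Any idea how to remove the extra \ from html_str so that it can be parsed on Linux?

【Question comments】:

\\n is not an escape sequence on Linux. \\n is two characters: a backslash (escaped as \\ so that the output is a valid Python bytes literal) and an n character. These characters mean the same thing on Windows and Linux. Can you look up the exact error and traceback?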
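
The point made in this comment can be checked in any Python 3 interpreter. The snippet below is a small sketch, with made-up sample bytes standing in for the downloaded page, showing that str() on a bytes object returns its repr, in which each newline appears as two separate characters regardless of platform:

```python
# Sample bytes standing in for a downloaded page (illustrative only).
html = b'\n<html>\n'

# str() without an encoding returns the repr of the bytes object,
# including the leading b and the surrounding quotes.
html_str = str(html)
print(html_str)

# Characters 2 and 3 are a literal backslash and the letter n,
# not a newline character.
print(html_str[2] == '\\', html_str[3] == 'n')

# Decoding, by contrast, yields a real newline as the first character.
decoded = html.decode('ascii')
print(decoded[0] == '\n')
```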

Tags: linux windows parsing python-3.x html-parsing

【Solution 1】:

In Python 3, you cannot convert bytes to str the way you are doing it:

html_str = str(html)

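This worked in Python 2 because bytes and str were the same type, but in Python 3 you get the repr of the bytes object instead of its text. To decode the bytes, either pass an encoding argument to str(), or use:

html_str = html.decode(encoding)

If you cannot get the charset from the HTTP headers, you can try to guess it, or use chardet to determine the correct encoding.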
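
A minimal sketch of the whole flow follows, under some assumptions: TagCollector and decode_body are hypothetical names introduced here for illustration (the question's MyHTMLParser is not shown), the sample bytes are made up, and UTF-8 is an assumed fallback when no charset is declared. With a real response, the charset could be read from the headers as shown in the comment.

```python
from html.parser import HTMLParser
from typing import List, Optional


class TagCollector(HTMLParser):
    """Hypothetical parser that records the start tags it sees."""

    def __init__(self) -> None:
        super().__init__()
        self.tags: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def decode_body(raw: bytes, charset: Optional[str], fallback: str = "utf-8") -> str:
    """Decode raw bytes with the declared charset, else the assumed fallback."""
    return raw.decode(charset or fallback, errors="replace")


# With a real response the charset could come from the HTTP headers, e.g.:
#   response = urllib.request.urlopen(url)
#   charset = response.headers.get_content_charset()  # None if not declared
#   raw = response.read()
raw = b'\n<html xmlns="http://www.w3.org/1999/xhtml">\n<head>\n</head>\n</html>'
html_str = decode_body(raw, charset=None)

parser = TagCollector()
parser.feed(html_str)  # html_str now holds real newlines, not backslash-n pairs
print(parser.tags)
```

Passing errors="replace" is a design choice for the sketch: it keeps parsing going when the guessed encoding is wrong, at the cost of replacement characters in the text.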

【Comments】:

Note that str(html, 'ascii') is equivalent to html.decode('ascii').
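
The equivalence can be confirmed with a quick check. This is a sketch using a made-up UTF-8 sample rather than ASCII, to show that the two spellings agree for other encodings and for the optional errors argument as well:

```python
html = b"<p>caf\xc3\xa9</p>"  # made-up UTF-8 sample bytes

# Both spellings decode the bytes to the same text.
via_str = str(html, "utf-8")
via_decode = html.decode("utf-8")
print(via_str == via_decode)

# The errors argument is accepted by both forms and behaves the same way.
lenient_str = str(html, "ascii", "replace")
lenient_decode = html.decode("ascii", "replace")
print(lenient_str == lenient_decode)
```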