Error: Can't use string pattern on a bytes-like object [duplicate]

【Question title】: Error: Can't use string pattern on a bytes-like object [duplicate]
【Posted】: 2014-03-31 16:38:28
【Question】:

I'm running this code with Python 3.2.3:

import re

regex = '<title>(.+?)</title>'
pattern = re.compile(regex)

Then I use findall to search for the pattern:

titles = re.findall(pattern,html)
print(titles)

The html object holds the HTML source fetched from a particular URL:

html = response.read()

I get the error "can't use a string pattern on a bytes-like object". I tried using:

regex = b'<title>(.+?)</title>'

But then every result comes back with a "b" prefix. Why? Thanks.

【Question comments】:

What is html, and why are you using a regex to parse HTML?
What is the html object? Try using str(html). What happens?
Which HTML parser for Python do you recommend, Ignacio?

Tags: python regex string compilation byte

【Solution 1】:

urllib.request responses give you bytes, not Unicode strings. That is why the re pattern has to be a bytes object as well, and why the results you get back are bytes again.
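A minimal sketch of that type rule. The hard-coded bytes value stands in for a real response.read() result, and the sample markup is made up for illustration:

```python
import re

# Stand-in for response.read(); the markup is an assumption for this example.
html = b"<html><head><title>Example page</title></head></html>"

# A str pattern applied to bytes data raises TypeError.
try:
    re.findall("<title>(.+?)</title>", html)
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# A bytes pattern works on bytes data, and the matches are bytes as well.
titles = re.findall(b"<title>(.+?)</title>", html)
print(titles)  # [b'Example page']
```

Pattern and data must be the same type: either both str or both bytes.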

You can decode the response using the encoding the server gives you in the HTTP headers:

html = response.read()
# no codec set? We default to UTF-8 instead, a reasonable assumption
codec = response.info().get_param('charset', 'utf8')
html = html.decode(codec)

Now that you have Unicode text, you can use a Unicode (str) regular expression on it as well.
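A short sketch of the decoded path. The raw bytes and the codec value are hard-coded here as assumptions; in the answer's code they come from response.read() and the Content-Type header:

```python
import re

# Stand-in for response.read(); assumed to be UTF-8 encoded for this example.
raw = "<title>Caf\u00e9 menu</title>".encode("utf-8")
codec = "utf-8"  # assumed; normally taken from the response headers

html = raw.decode(codec)  # bytes -> str
titles = re.findall("<title>(.+?)</title>", html)
print(titles)  # ['Café menu']
```

The matches are plain str objects, so no b prefix appears when they are printed.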

The above can still raise a UnicodeDecodeError if the server lied about the encoding, or if no encoding was set and the UTF-8 default turns out to be wrong too.
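One possible way to keep going when that happens is to fall back to a lossy decode. This is only a sketch: decode_html is a hypothetical helper, not part of the original answer, and replacing undecodable bytes is one design choice among several (guessing another codec is another).

```python
def decode_html(raw, codec="utf-8"):
    """Decode raw bytes; on failure, retry as UTF-8 with U+FFFD substitution."""
    try:
        return raw.decode(codec)
    except (UnicodeDecodeError, LookupError):
        # LookupError covers a charset name that Python does not recognise.
        return raw.decode("utf-8", errors="replace")


# Latin-1 bytes mislabelled as UTF-8: strict decoding fails, the fallback does not.
mislabelled = "caf\u00e9".encode("latin-1")
text = decode_html(mislabelled, "utf-8")
print(text)
```

The undecodable byte shows up as the replacement character U+FFFD, so the text is usable but no longer a faithful copy of the input.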

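In any case, return values shown as b'...' are bytes objects: raw string data that has not been decoded to Unicode yet. They are nothing to worry about if you know the correct encoding for the data.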
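To illustrate that the b prefix belongs to the printed representation and not to the data, here is a small sketch with a made-up match value, assuming the data is UTF-8:

```python
# One match, as a bytes-pattern findall would return it (value is illustrative).
title = b"Example page"
print(title)  # b'Example page'

# Decoding with the known encoding gives an ordinary str.
decoded = title.decode("utf-8")
print(decoded)  # Example page
```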

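【Comments】:

This reflects the general rule for reading and writing string data: decode input to Unicode as you read it, and encode Unicode strings just before you write them. All text inside your program should be handled as Unicode.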
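A minimal sketch of that rule, using in-memory byte streams as stand-ins for a real input and output. The Latin-1 input encoding and UTF-8 output encoding are assumptions chosen for the demonstration:

```python
import io

# In-memory stand-ins for a byte source and a byte sink (assumed for the demo).
source = io.BytesIO("na\u00efve caf\u00e9".encode("latin-1"))
sink = io.BytesIO()

text = source.read().decode("latin-1")  # decode at the input boundary
text = text.upper()                     # work on str inside the program
sink.write(text.encode("utf-8"))        # encode at the output boundary

result = sink.getvalue()
print(text)
```

Only the two boundary lines need to know about encodings; everything in between operates on str.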