Why is Python insisting on using ascii? Answers

【Question title】: Why is Python insisting on using ascii?
【Posted】: 2013-06-06 06:48:37
【Question】:

When parsing HTML files with Requests and Beautiful Soup, the following line throws an exception on some web pages:

if 'var' in str(tag.string):

Here is the context:

response = requests.get(url)  
soup = bs4.BeautifulSoup(response.text.encode('utf-8'))

for tag in soup.findAll('script'):
    if 'var' in str(tag.string):    # This is the line throwing the exception
        print(tag.string)

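Here is the exception:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 15: ordinal not in range(128)

I have tried the BeautifulSoup line both with and without the encode('utf-8') call, and it makes no difference. I did notice that on the pages throwing the exception there is a character Ã in a comment in the javascript, even though the encoding reported by response.encoding is ISO-8859-1. I realize I could remove the offending characters with unicodedata.normalize, but I would prefer to convert the tag variable to utf-8 and keep the characters. None of the following helps to change the variable to utf-8: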

tag.encode('utf-8')
tag.decode('ISO-8859-1').encode('utf-8')
tag.decode(response.encoding).encode('utf-8')

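What must I do to this string to turn it into usable utf-8?

【Comments】:

You tried those approaches but kept doing: if 'var' in str(tag.string):??
@PauloBu: No, of course I use the output of the conversion!

Tags: python utf-8 ascii beautifulsoup python-requests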

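【Solution 1】:

OK, basically you are getting an HTTP response encoded in Latin-1. The character giving you trouble is indeed Ã: a Latin-1 code table shows that 0xC3 is exactly that character in Latin-1.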
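To make the byte-level claim concrete, here is a minimal sketch of how the single byte 0xC3 reads under Latin-1 versus UTF-8. It is written for Python 3, which is an assumption (the thread itself is Python 2 era), and the variable names are purely illustrative.

```python
# The byte the traceback complains about.
raw = b"\xc3"

# Latin-1 maps each byte 0x00-0xFF to the code point with the same number,
# so 0xC3 decodes to U+00C3 (capital A with tilde).
as_latin1 = raw.decode("latin-1")

# In UTF-8, 0xC3 only starts a two-byte sequence, so alone it is invalid.
try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Re-encoding the decoded character as UTF-8 yields two bytes.
as_utf8 = as_latin1.encode("utf-8")
```

The same byte therefore means different things depending on which codec is applied, which is why knowing the source encoding matters before decoding.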

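I think you blindly tested every decode/encode combination you could think of. First of all, if you do this: if 'var' in str(tag.string): Python will complain whenever the string contains non-ASCII characters, because str() converts it with the default ascii codec.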
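The failure mode can be reproduced without any HTML at all. The sketch below assumes Python 3, where the ascii conversion has to be spelled out explicitly; in Python 2 the same conversion happens implicitly inside str(). The helper name ascii_safe and the sample strings are hypothetical.

```python
def ascii_safe(text):
    """Return True if text survives the ascii codec that Python 2's str() applies."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


plain = "var x = 1;"
comment = "// \u00c3 in a comment"  # contains the offending character

# Comparing text with text needs no codec at all, so this never raises.
found_in_plain = "var" in plain
found_in_comment = "var" in comment
```

The point of the comparison at the end is that dropping the str() call removes the conversion, and with it the error.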

Looking at the code you shared with us, the right approach IMHO would be:

import bs4
import requests

response = requests.get(url)  # url is the page being scraped, as in the question
# decode the latin-1 bytes to unicode
#soup = bs4.BeautifulSoup(response.text.decode('latin-1'))
# try this line instead; from_encoding applies to byte input, so pass the
# raw bytes (response.content) rather than the already decoded response.text
soup = bs4.BeautifulSoup(response.content, from_encoding=response.encoding)

for tag in soup.findAll('script'):
    # the soup is now built from unicode strings, so its elements can be
    # treated as unicode; tag.string is None for some script tags
    if tag.string and u'var' in tag.string:
        # now if you want output in utf-8
        print(tag.string.encode('utf-8'))

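Edit: the encoding section of the Beautiful Soup 4 documentation should be useful to you.

Basically, the logic is:

You get some bytes encoded in encoding X
You decode them by doing bytes.decode('X'), which returns a unicode string
You work with unicode
You encode the unicode into some encoding Y for output: ubytes.encode('Y')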
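The four steps above can be sketched as a single function. This is a minimal illustration in Python 3 (an assumption; in Python 2 the types would be str and unicode instead of bytes and str), and transcode and the sample bytes are hypothetical names, not part of any library.

```python
def transcode(raw, source_encoding, target_encoding="utf-8"):
    """Decode bytes from source_encoding, then re-encode as target_encoding."""
    # Steps 1-2: bytes in encoding X become a unicode string.
    text = raw.decode(source_encoding)
    # Step 3: any searching, slicing or comparing happens on `text` here.
    # Step 4: the unicode string becomes bytes in encoding Y for output.
    return text.encode(target_encoding)


# A JavaScript-style comment holding the Latin-1 byte from the question.
latin1_bytes = b"// \xc3"
utf8_bytes = transcode(latin1_bytes, "latin-1")
```

Keeping decoding at the input boundary and encoding at the output boundary means everything in between deals with one type only.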

Hope this sheds some light on the problem.

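【Comments】:

Thanks. Instead of response.text.decode('latin-1') I am trying response.text.decode(response.encoding), because this application needs to work with other sites as well. That line is now throwing the error message (though at a different position, of course). Is there a generic way to handle any encoding?
What is the error now? That is the way to work with any encoding: you get the response encoding, decode with it, work with unicode and encode into utf-8. What error is thrown now, and what does response.encoding look like?
Same error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 5837-5838: ordinal not in range(128), now on this line: soup = bs4.BeautifulSoup(response.text.decode(response.encoding)) (all copied from the CLI error message). The page I am parsing in this example is poemhunter.com/poems/hate (not my site, just an example I stumbled upon).
I edited the code in the answer where the BeautifulSoup object is instantiated. I also gave you a link to some useful documentation. I'll take a look at that page. Let me know if it works.
Thanks, sending the from_encoding= encoding you mentioned does seem to help! I am testing now. Thanks for the link to the relevant part of the documentation.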

【Solution 2】:

You can also try parsing the page with the Unicode Dammit lib (it is part of BS4). A detailed description is here: http://scriptcult.com/subcategory_176/article_852-use-beautifulsoup-unicodedammit-with-lxml-html.html
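As a rough, stdlib-only illustration of the general idea of trying candidate encodings in order, consider the sketch below. It is not Unicode Dammit's actual algorithm or API, only a simplified stand-in; the function name decode_with_candidates and the default candidate list are assumptions, and it targets Python 3.

```python
def decode_with_candidates(raw, candidates=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess did not fit, try the next one
    raise ValueError("none of the candidate encodings fit")


# A lone 0xC3 is invalid UTF-8, so the Latin-1 guess is used.
text_a, used_a = decode_with_candidates(b"// \xc3")
# Valid UTF-8 input is accepted by the first guess.
text_b, used_b = decode_with_candidates("// \u00c3".encode("utf-8"))
```

Latin-1 accepts every possible byte, so it only makes sense as the last candidate; placed first, it would mask every other guess.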
