Answers to: Python encode issue with BeautifulSoup

【Question title】: Python encode issue with BeautifulSoup
【Posted】: 2011-07-02 14:34:33
【Question description】:

Hello, I have an encoding problem.

When I pass the string to BeautifulSoup, all the national characters are lost:

addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
content = urllib2.urlopen(addr).read()
html_pag = BeautifulSoup(content)  # <- this is where I lose all national letters
table_html = html_pag.find("div", id="808")

At the top of my file:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re , sys
reload(sys)
sys.setdefaultencoding("utf-8")

【Question comments】:

The code you posted works and preserves all the "national" characters.

Tags: python encoding utf-8 ascii beautifulsoup

【Solution 1】:

According to the BeautifulSoup documentation, all input is converted to Unicode internally:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Hello")
soup.contents[0]
# u'Hello'
soup.originalEncoding
# 'ascii'

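If your input does not declare an encoding (for example in a meta tag), BeautifulSoup guesses one. You can disable the guessing by passing the fromEncoding parameter to BeautifulSoup to state the input encoding explicitly: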

soup = BeautifulSoup("hello", fromEncoding="UTF-8")

Or is your real problem that the result looks "corrupted" when printed to the console?
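
To make the two possibilities above concrete, here is a rough stdlib-only sketch. It is not BeautifulSoup's own detection logic; the sample string is borrowed from the page in the question, and the variable names are made up for the example. Decoding the same bytes with the wrong codec raises no error but garbles the national characters, while encoding for an ASCII-only console with "replace" turns them into question marks.

```python
# -*- coding: utf-8 -*-
# Illustrative sketch only: plain text/bytes handling, no BeautifulSoup involved.

text = u"Kapusta w\u0142oska"   # "Kapusta włoska"
raw = text.encode("utf-8")      # the bytes a UTF-8 page would deliver

right = raw.decode("utf-8")     # correct codec: round-trips cleanly
wrong = raw.decode("latin-1")   # wrong codec: no exception, but the text is garbled

# What an ASCII-only console can end up showing when unencodable
# characters are replaced instead of raising an error.
lossy = text.encode("ascii", "replace")

print(repr(right))
print(repr(wrong))
print(repr(lossy))
```

If the parsed text looks like `right` inside the program but like `lossy` on screen, the parsing is probably fine and the terminal encoding is the thing to check.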

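【Comments】:

FYI: his web page declares the encoding correctly, both in the Content-Type header and in a meta tag. I suspect your "real problem" guess is what is actually going on...
Note that in BeautifulSoup 4, fromEncoding has been renamed to from_encoding.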

【Solution 2】:

Your code works fine:

>>> addr = "http://zjazdowa.com.pl/index.php/aktualne-ceny-warzyw-i-owocow-.html"
>>> content = urllib2.urlopen(addr).read()
>>> html_pag = BeautifulSoup(content)  # <- this is where I lose all national letters
>>> table_html = html_pag.find("div", id="808")
>>> print table_html.findAll('td')[8].string
Kapusta włoska

A few notes on this part:

#!/usr/bin/python2.7
# -*- coding: utf-8 -*-
from BeautifulSoup import BeautifulSoup
import urllib2, string, re , sys
reload(sys)
sys.setdefaultencoding("utf-8")

reload reloads a module. I'm not sure what you hoped to achieve by reloading sys, but it isn't gaining you anything.
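
For context, in Python 2 the reload(sys) call is usually there because site.py removes sys.setdefaultencoding at startup and reloading brings it back. A commonly suggested alternative is to leave the default alone and convert explicitly at the program's boundaries. The sketch below shows that idea. The helper names to_text and to_bytes are hypothetical (they are not part of BeautifulSoup or the standard library), and UTF-8 is an assumption based on the page in the question.

```python
# -*- coding: utf-8 -*-
# Hypothetical helpers: decode on the way in, encode on the way out,
# instead of changing the interpreter-wide default encoding.

def to_text(data, encoding="utf-8"):
    """Return text for `data`, decoding only if it is a byte string."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


def to_bytes(text, encoding="utf-8"):
    """Encode text explicitly, e.g. before writing to a file or socket."""
    return text.encode(encoding)


raw = b"Kapusta w\xc5\x82oska"   # UTF-8 bytes, as read from the network
text = to_text(raw)              # work with text inside the program
out = to_bytes(text)             # encode again only at the output boundary

print(repr(text))
print(repr(out))
```

With the conversions spelled out like this, the encoding used at each step is visible in the code rather than hidden in a global setting.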

【Comments】: