Problem with scraping data using BeautifulSoup: question and answers

【Question title】: Problem with scraping data using BeautifulSoup
【Posted】: 2011-03-10 15:53:36
【Question description】:

I wrote the following trial code to retrieve the titles of legislative acts from the European Parliament.

import urllib2
from BeautifulSoup import BeautifulSoup

search_url = "http://www.europarl.europa.eu/sides/getDoc.do?type=REPORT&mode=XML&reference=A7-2010-%.4d&language=EN"

for number in xrange(1,10):   
    url = search_url % number
    page = urllib2.urlopen(url).read()
    soup = BeautifulSoup(page)
    title = soup.findAll("title")
    print title

However, whenever I run it I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 20, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2013' in position 70: ordinal not in range(128)

I have narrowed it down to BeautifulSoup failing to read the fourth document in the loop. Can anyone explain what I am doing wrong?

Kind regards,

Thomas

【Question discussion】:

Tags: python loops beautifulsoup web-scraping

【Solution 1】:

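BeautifulSoup works in Unicode, so it is not responsible for that encoding error. More likely, your problem comes from the print statement: your standard output seems to be ASCII (i.e. sys.stdout.encoding is 'ascii' or absent), so you will indeed get such errors when trying to print a string containing non-ASCII characters.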
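
The failure can be reproduced without BeautifulSoup or the network. What follows is a minimal sketch, assuming Python 3 (the thread itself uses Python 2); the helper name can_encode is hypothetical, and the sample string is taken from the fourth title in the question's loop:

```python
# Sketch (Python 3): the en dash U+2013 has no ASCII representation,
# so encoding it for an ASCII-only standard output fails.
title = "REPORT on equality between women and men in the European Union \u2013 2009"


def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` (hypothetical helper)."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


print(can_encode(title, "ascii"))  # False
print(can_encode(title, "utf-8"))  # True
```

Only the fourth document trips the error because it is the only title in this range containing a character outside ASCII.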

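What is your OS? How is your console, a.k.a. terminal, set up (e.g. if on Windows, which "code page")? Have you set PYTHONIOENCODING in your environment to control sys.stdout.encoding, or are you just hoping the encoding will be picked up automatically?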
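
To answer the sys.stdout.encoding question, and to sidestep the detection when it goes wrong, something like the following may help. It is a sketch in Python 3, not the answerer's code; utf8_writer is a hypothetical helper name:

```python
import io
import sys


def utf8_writer(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8")


if __name__ == "__main__":
    # The encoding Python selected for standard output.
    print(getattr(sys.stdout, "encoding", None))
```

In a script, utf8_writer(sys.stdout.buffer) would give a stream that writes UTF-8 regardless of what was detected. Alternatively, setting PYTHONIOENCODING=utf-8 in the environment before starting Python should make sys.stdout use UTF-8 without any code changes.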

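On my Mac, where the encoding is detected correctly, running your code (except that, for clarity, I also print the number along with each title) works fine and shows: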

$ python ebs.py 
1 [<title>REPORT Report on the proposal for a Council regulation temporarily suspending autonomous Common Customs Tariff duties on imports of certain industrial products into the autonomous regions of Madeira and the Azores - A7-0001/2010</title>]
2 [<title>REPORT Report on the proposal for a Council directive concerning mutual assistance for the recovery of claims relating to taxes, duties and other measures - A7-0002/2010</title>]
3 [<title>REPORT Report on the proposal for a regulation of the European Parliament and of the Council amending Council Regulation (EC) No 1085/2006 of 17 July 2006 establishing an Instrument for Pre-Accession Assistance (IPA) - A7-0003/2010</title>]
4 [<title>REPORT on equality between women and men in the European Union – 2009 - A7-0004/2010</title>]
5 [<title>REPORT Report on the proposal for a Council decision on the conclusion by the European Community of the Convention on the International Recovery of Child Support and Other Forms of Family Maintenance - A7-0005/2010</title>]
6 [<title>REPORT on the proposal for a Council directive on administrative cooperation in the field of taxation - A7-0006/2010</title>]
7 [<title>REPORT Report on promoting good governance in tax matters - A7-0007/2010</title>]
8 [<title>REPORT Report on the proposal for a Council Directive amending Directive 2006/112/EC as regards an optional and temporary application of the reverse charge mechanism in relation to supplies of certain goods and services susceptible to fraud - A7-0008/2010</title>]
9 [<title>REPORT Recommendation on the proposal for a Council decision concerning the conclusion, on behalf of the European Community, of the Additional Protocol to the Cooperation Agreement for the Protection of the Coasts and Waters of the North-East Atlantic against Pollution - A7-0009/2010</title>]
$

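【Discussion】:

Hi Alex, I am indeed on a Mac. How is yours set up? For now I am just hoping the encoding gets picked up automatically (I am still learning this whole confusing encoding business :))
@Thomas, I have not done any setup; it works out of the box (I believe utf8 is the default for Terminal.App; if not, then that is the only thing I set, in Terminal's preferences). What is sys.stdout.encoding in your Python (and, in fact, which Python and which MacOSX do you have? I have OSX 10.5, and it works with the Apple-distributed Python 2.5 as well as the python.org distributions of 2.4, 2.6 and 3.1, all out of the box with no environment variables set).
Hi Alex, I am using MacOSX 10.5.8 and Python 2.6.
So what does sys.stdout.encoding show in a Py2.6 session? (I do not see how an editor would change things, unless perhaps you are running your Python code from inside your editor rather than in a plain Terminal.App?)
If I run sys.stdout.encoding I get "us-ascii". Thanks for taking the time to help me, Alex!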

【Solution 2】:

Replacing

print title

with

for t in title:
    print(t)

or

print('\n'.join(t.string for t in title))

works. I am not entirely sure why print <somelist> sometimes works and sometimes does not.
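
If the terminal really is ASCII-only, one way to keep per-item printing from raising is to encode explicitly and replace whatever cannot be represented. This is a sketch in Python 3; to_printable is a hypothetical helper, and the sample strings are made up to stand in for the text of the title tags:

```python
def to_printable(text, encoding="ascii"):
    """Round-trip `text` through `encoding`, replacing unencodable characters with '?'."""
    return text.encode(encoding, errors="replace").decode(encoding)


# Made-up sample data; the first entry contains an en dash (U+2013).
titles = ["Report A \u2013 2009", "Report B"]
print("\n".join(to_printable(t) for t in titles))
```

This trades fidelity for robustness: the en dash is shown as a question mark rather than aborting the loop.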

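【Discussion】:

【Solution 3】:

If you want to print the titles to a file, you need to specify an encoding that can represent non-ASCII characters; utf8 should work fine. To do that, add: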

import codecs

out = codecs.open('titles.txt', 'w', 'utf8')

at the top of the script,

and print to the file:

print >> out, title
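
For comparison, the same idea in Python 3 could look like the sketch below. The function name write_titles and the sample title are assumptions for illustration; each title is written as plain text, one per line, instead of printing the whole result list:

```python
import io
import os
import tempfile


def write_titles(path, titles):
    """Write one title per line to `path`, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as out:
        for title in titles:
            out.write(title + "\n")


if __name__ == "__main__":
    # Write to a temporary directory so the sketch leaves the working directory alone.
    target = os.path.join(tempfile.mkdtemp(), "titles.txt")
    write_titles(target, ["on equality between women and men \u2013 2009"])
    with io.open(target, "rb") as handle:
        # Print the raw bytes, which is safe even on an ASCII-only terminal.
        print(handle.read())
```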

【Discussion】:

Hi Maltjuv, thanks for your help, but it still gives me the same error.