Beautifulsoup not working on a specific site

【Question title】: Beautifulsoup not working on a specific site
【Posted】: 2014-05-06 23:54:15
【Question description】:

I am trying to parse this site, but for reasons I cannot understand, nothing happens.

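import urllib2
from bs4 import BeautifulSoup

url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
response = urllib2.urlopen(url).read()  # raw HTML of the page
doc = BeautifulSoup(response)  # no parser specified, so bs4 picks one
divs = doc.findAll('div')
print len(divs) # prints 0.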

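The site lists real estate ads for Rio de Janeiro, Brazil. I could not find anything in the HTML source that would prevent Beautifulsoup from working. Could it be the size of the page?

I am using Enthought Canopy Python 2.7.6, IPython Notebook 2.0, Beautifulsoup 4.3.2.

【Comments】:

The same code works fine for me; it prints 560...
Following the tip below, my environment only works with the 'html.parser' setting.

Tags: python html python-2.7 html-parsing beautifulsoup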

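【Solution 1】:

This happens because you are letting BeautifulSoup choose the most appropriate parser for you, and which one it picks depends on which modules are installed in your Python environment.

According to the documentation:

The first argument to the BeautifulSoup constructor is a string or an open filehandle: the markup you want parsed. The second argument is how you'd like the markup parsed.

If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser.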
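
To see which of these parsers a given environment could offer, a rough check like the one below may help. It is only a sketch: it is written for Python 3 (the thread itself uses Python 2.7), the helper name `available_parsers` is made up for illustration, and it only tests whether the `lxml` and `html5lib` packages are importable, which is assumed to be a reasonable proxy for Beautiful Soup being able to use them.

```python
import importlib.util

# Optional third-party parser libraries, keyed by the name passed to BeautifulSoup.
# (Hypothetical helper data; both names match the package import names.)
OPTIONAL_PARSERS = {"lxml": "lxml", "html5lib": "html5lib"}


def available_parsers():
    """Return the parser names that appear usable in this environment."""
    found = []
    for parser_name, module_name in OPTIONAL_PARSERS.items():
        # find_spec returns None when a top-level module cannot be found.
        if importlib.util.find_spec(module_name) is not None:
            found.append(parser_name)
    # html.parser ships with Python itself, so it is always listed.
    found.append("html.parser")
    return found


print(available_parsers())
```

If the list contains only `html.parser`, the default choice and an explicit `'html.parser'` argument should behave the same; if it also contains `lxml` or `html5lib`, the default may be one of those instead.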

So, different parsers give different results:

>>> from bs4 import BeautifulSoup
>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> len(BeautifulSoup(response, 'lxml').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html.parser').find_all('div'))
558
>>> len(BeautifulSoup(response, 'html5lib').find_all('div'))
0

The solution is to explicitly specify a parser that can handle this particular page; you may need to install lxml or html5lib.

See also: Differences between parsers.
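
The comparison above can be wrapped in a small helper that runs the same markup through each parser and skips the ones that are not installed. This is a minimal sketch in Python 3 syntax, not part of the original answer: `count_divs_by_parser` and the `sample` markup are hypothetical names introduced here, and `FeatureNotFound` is the exception bs4 raises when a requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_divs_by_parser(markup, parsers=("lxml", "html.parser", "html5lib")):
    """Parse the same markup with each parser and count the <div> tags found."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # The parser library is not installed in this environment.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("div"))
    return counts


# Small stand-in document; for the real page, pass the downloaded HTML instead.
sample = "<html><body><div><div>x</div></div></body></html>"
print(count_divs_by_parser(sample))
```

A parser that reports 0 (or far fewer tags than the others) for the real page is the one to avoid; pass one of the remaining names as the second argument to `BeautifulSoup`.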

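【Discussion】:

Interesting. I thought the only difference between the parsers was speed, and that the default parser was the most lenient one?
@dilbert If you mean html.parser, then it depends on the Python version; on 2.7, yes, it is more lenient than it used to be. html5lib should be the most lenient here but, as you can see, it fails to parse the page correctly.
@alecxe I had already tried the lxml and html5lib parsers and got the same result: nothing. But with html.parser, which I did not know about, it works fine. Could the Enthought distribution be the cause of the problem? Anyway, since it works with html.parser, I can finally move on. Thanks!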

【Solution 2】:

Something is wrong with your environment. This is the output I get:

>>> url = 'http://www.zap.com.br/imoveis/rio-de-janeiro+rio-de-janeiro/apartamento-padrao/venda/'
>>> from bs4 import BeautifulSoup
>>> import urllib2
>>> response = urllib2.urlopen(url).read()
>>> doc = BeautifulSoup(response)
>>> divs = doc.findAll('div')
>>> print len(divs)
558

【Discussion】:

What do you suggest I look for? I have no idea what is wrong with my environment.