Beautifulsoup soup.body returns None

【Question title】: Beautifulsoup soup.body returns None
【Posted】: 2014-09-20 20:49:51
【Question】:

What could cause BeautifulSoup to return None for soup.body when soup.title returns the expected result?

This is the link to the page I am parsing: http://goo.gl/6T3RKV

print(soup.prettify())

prints the exact HTML of the page.

【Comments】:

Tags: python python-2.7 html-parsing beautifulsoup

【Solution 1】:

This is caused by differences between the BeautifulSoup parsers:

>>> from urllib2 import urlopen
>>> from bs4 import BeautifulSoup
>>> url = 'http://www.emploi.ma/offre-emploi-maroc/commerciaux-en-emission-appels-1019077'
>>> soup = BeautifulSoup(urlopen(url), "html5lib")
>>> print soup.body
None

>>> soup = BeautifulSoup(urlopen(url), "html.parser")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
... 

>>> soup = BeautifulSoup(urlopen(url), "lxml")
>>> print soup.body
<body class="not-front not-logged-in page-node node-type-offre no-sidebars candidate-context full-node layout-main-last sidebars-split font-size-12 grid-type-960 grid-width-16 role-other" id="pid-node-1019077">
<div class="page" id="page">
...

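As you can see, html5lib fails to get the body from this particular HTML, while html.parser and lxml both find it. And, according to the documentation, html5lib is chosen as the default when lxml is not installed:

"If you don't specify anything, you'll get the best HTML parser that's installed. Beautiful Soup ranks lxml's parser as being the best, then html5lib's, then Python's built-in parser."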
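
One way to avoid depending on whichever parser happens to be installed is to name the parser explicitly and check that soup.body is not None before using it. The following is only a minimal sketch, assuming bs4 is installed; parse_with_body and the sample markup are hypothetical names introduced here for illustration and are not part of Beautiful Soup:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def parse_with_body(markup, parsers=("lxml", "html.parser", "html5lib")):
    """Try each named parser in order and return (parser_name, soup) for the
    first one whose tree contains a <body>. Returns (None, None) if none does.
    Hypothetical helper, not a Beautiful Soup API."""
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # This parser library is not installed; move on to the next one.
            continue
        if soup.body is not None:
            return name, soup
    return None, None


# Made-up sample markup; in the question this would be the downloaded page.
html = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"
name, soup = parse_with_body(html, parsers=("html.parser",))
print("%s: %s" % (name, soup.body.p.get_text()))
```

Passing the parser name as the second argument is the same call used in the session above; the loop only adds a fallback so that a parser that cannot find the body on a given page is skipped instead of silently producing None.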

【Comments】:

Thanks a lot! Damn, you saved me from spending so much time on this!