BeautifulSoup (bs4) parsing wrong: answers

【Question title】: BeautifulSoup (bs4) parsing wrong
【Posted】: 2015-07-09 08:31:15
【Question】:

I am parsing this sample document with bs4 under Python 2.7.6:

<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>

using:

from bs4 import BeautifulSoup as BS
...
tree = BS(fh)

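HTML has long allowed end tags to be omitted for various element types, including P (check the schema, or a parser). However, bs4's prettify() on this document shows that it does not close any of these paragraphs until it sees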

Tags: python html python-2.7 bs4

【Solution 1】:

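The documentation at http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser explains how to make BS4 use a different parser. The default is apparently html.parser, which the BS4 documentation says was broken before Python 2.7.3, but which evidently still shows the problem described above in 2.7.6.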
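A rough way to see where the behaviour may come from is to look at what Python's built-in parser reports for the sample document. The sketch below uses only the standard library and is written for Python 3 (an assumption; the question uses 2.7.6, where the module is named HTMLParser). The TagLogger class and the SAMPLE constant are names made up for this illustration. The built-in parser is event-based and reports only the tags that literally appear in the source, so it does not supply the omitted </p> tags itself.

```python
from html.parser import HTMLParser

# The sample document from the question.
SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>"""


class TagLogger(HTMLParser):
    """Hypothetical helper: record every start/end tag event that is reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


logger = TagLogger()
logger.feed(SAMPLE)
logger.close()

# Count how many <p> start tags and </p> end tags were actually reported.
p_starts = sum(1 for kind, tag in logger.events if kind == "start" and tag == "p")
p_ends = sum(1 for kind, tag in logger.events if kind == "end" and tag == "p")
print(p_starts, p_ends)  # prints "6 3"
```

Six paragraphs are opened but only three end tags are reported, so any tree builder layered on these events has to decide for itself where the unclosed paragraphs end.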

Switching to "lxml" did not work for me, but switching to "html5lib" produces the correct result:

tree = BS(htmSource, "html5lib")
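To check the difference between parsers on your own installation, something like the following sketch may help. It assumes Python 3 with the third-party beautifulsoup4 package installed, and optionally html5lib. The helper has_nested_paragraphs and the SAMPLE constant are hypothetical names introduced here. It reports whether any <p> in the parsed tree ends up inside another <p>.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# The sample document from the question.
SAMPLE = """<html>
<body>
<p>HTML allows omitting P end-tags.

<p>Like that and this.

<p>And this, too.

<p>What happened?</p>

<p>And can we <p>nest a paragraph, too?</p></p>

</body>
</html>"""


def has_nested_paragraphs(markup, parser):
    """Return True if any <p> in the parsed tree contains another <p>."""
    soup = BeautifulSoup(markup, parser)
    return any(p.find("p") is not None for p in soup.find_all("p"))


for parser in ("html.parser", "html5lib"):
    try:
        print(parser, has_nested_paragraphs(SAMPLE, parser))
    except FeatureNotFound:
        # bs4 raises this when the requested parser library is not installed.
        print(parser, "is not installed")
```

Naming the parser explicitly in every BeautifulSoup(...) call also keeps results reproducible across machines, since the parser bs4 picks when none is given depends on which libraries are installed.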

【Discussion】: