BeautifulSoup different parsers

【Question title】: BeautifulSoup different parsers
【Posted】: 2019-09-16 17:22:04
【Question description】:

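Can anyone explain in detail the differences between parsers such as html.parser and html5lib? I stumbled upon a strange behavior: when using html.parser, it ignores all tags at a particular position in the document. Take a look at this code: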

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)

This returns an empty list, whereas html5lib returns the "a" tags as expected. Does anyone know why?
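
One way to investigate this without BeautifulSoup in the picture is to log the events that Python's standard-library `html.parser.HTMLParser` emits, since the 'html.parser' option is built on it. The following is a minimal sketch, not part of the original question; the names `EventLogger` and `trace` are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records the events the stdlib tokenizer emits."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        # Called for "<![ ... ]>" style marked sections that the parser
        # does not otherwise interpret.
        self.events.append(("unknown_decl", data))


def trace(markup):
    """Feed markup to the logger and return the list of recorded events."""
    logger = EventLogger()
    logger.feed(markup)
    logger.close()
    return logger.events
```

Passing the `html` string from the question to `trace` and looking for `("start", "a")` entries shows whether the tokenizer ever reported the anchors. Depending on the Python version, the text following the stray `<![endif]` may instead appear inside a single `unknown_decl` or `comment` event, which would explain why `find_all('a')` finds nothing.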

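I have read the documentation, but its explanation of the different parsers is vague.

I also noticed that html5lib drops invalid markup such as nested form tags. Is there a way to use html5lib, so that the html.parser behavior above is avoided, and still keep invalid markup such as nested form tags? (When parsing with html5lib, one of the form tags is removed.)

Thanks in advance.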

【Question discussion】:

Possible duplicate of python: difference between 'lxml' and "html.parser" and "html5lib" with beautiful soup?

Tags: python-3.x beautifulsoup

【Solution 1】:

You can use lxml, which is very fast, and get all the tags with either find_all or select.

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)

or

from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
 <![endif]-->
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
    <a href="test"></a>
   <!--[if lte IE 8]>
  <![endif]-->
  </body>
</html>
"""

soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
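
To see how the parsers differ on the same input, they can be run side by side. This is a rough sketch rather than part of the original answer: the function name `count_links_per_parser` is made up, and it assumes `bs4` is installed, while `lxml` and `html5lib` are optional (a parser that is not installed is reported as `None`).

```python
def count_links_per_parser(markup, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <a> tags found, or None if unavailable}."""
    # Imported here so the helper can be defined even where bs4 is absent.
    from bs4 import BeautifulSoup, FeatureNotFound

    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(markup, name)
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is missing.
            counts[name] = None
            continue
        counts[name] = len(soup.find_all("a"))
    return counts
```

Calling `count_links_per_parser(html)` with the `html` string above gives one count per parser, which makes it easy to check which parsers recover the four links on a given installation.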

【Discussion】: