Parsing an unbalanced HTML file with Beautiful Soup 4: answers

【Question title】: Parse unbalanced html file with Beautiful Soup 4
【Posted】: 2017-01-23 18:24:12
【Question description】:

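I am parsing partial HTML files whose tags are not balanced.

Suppose the first line is missing from this partial HTML file. Can Beautiful Soup still parse the rest of the file, and can I still extract the information inside the different tags?

Thank you very much for your help.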

Example Domain</title>   <!-- <====missing tag in this line -->

<meta charset="utf-8" />
<meta http-equiv="Content-type" content="text/html; charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body {
    background-color: #f0f0f2;
    margin: 0;
    padding: 0;
    font-family: "Open Sans", "Helvetica Neue", Helvetica, Arial, sans-serif;

}
div {
    width: 600px;
    margin: 5em auto;
    padding: 50px;
    background-color: #fff;
    border-radius: 1em;
}
a:link, a:visited {
    color: #38488f;
    text-decoration: none;
}
@media (max-width: 700px) {
    body {
        background-color: #fff;
    }
    div {
        width: auto;
        margin: 0 auto;
        border-radius: 0;
        padding: 1em;
    }
}
</style>

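【Comments】:

You need to specify a non-default parser. You could try lxml or html5lib. I don't have much experience with this myself, though.
Here is what I got when I tried lxml: "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?" Switching to the html5lib parser gave a similar error: "bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: html5lib. Do you need to install a parser library?" I tried to pip install both libraries, but that failed. I am on OS X 10.9.5 with Python 3.4.4. Any ideas are appreciated!
Did you get an error message from pip? I ran pip install html5lib, and the following code worked for me: from bs4 import BeautifulSoup; soup = BeautifulSoup("<span>asdf", "html5lib"); print(soup)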
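
The FeatureNotFound error quoted above generally means the requested parser is not importable by the interpreter running the script. As a rough sketch (the helper name make_soup and the preference order are illustrative assumptions, not something from the thread), one could try each parser in turn and fall back to the built-in html.parser:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first parser that is installed."""
    for name in preferred:
        try:
            return BeautifulSoup(markup, name), name
        except FeatureNotFound:
            # This parser library is not installed for this interpreter; try the next one.
            continue
    raise RuntimeError("no usable parser found")


soup, used = make_soup("<span>asdf")
print(used)                           # which parser was actually picked
print(soup.find("span").get_text())   # text inside the unclosed <span>
```

If lxml or html5lib is expected but never picked, installing it with the same interpreter (for example python3 -m pip install html5lib) is the usual thing to check.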

Tags: python html beautifulsoup

【Solution 1】:

Use one of the more capable parsers (html5lib is more robust, but slower). The resulting trees differ between parsers:

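from bs4 import BeautifulSoup

# lxml: the stray leading text ends up inside a <p> in <body>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'lxml')
#<html><body><p>Example Domain   <!-- <====missing tag in this line -->
#<meta charset="utf-8"/>

# html5lib: an empty <head> is created and the text goes straight into <body>
with open('foo.html') as f:
    soup = BeautifulSoup(f, 'html5lib')
#<html><head></head><body>Example Domain   <!-- <====missing tag in this line -->
#
#<meta charset="utf-8"/>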
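
To illustrate the second half of the question (extracting information from tags after the damaged line), here is a minimal sketch. It assumes a shortened inline copy of the question's fragment instead of foo.html, and it uses the built-in html.parser so that no extra library is needed; lxml or html5lib would build a slightly different tree, as shown above.

```python
from bs4 import BeautifulSoup

# Shortened copy of the fragment from the question; the opening <title> tag is missing.
fragment = """Example Domain</title>
<meta charset="utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<style type="text/css">
body { margin: 0; }
</style>
"""

soup = BeautifulSoup(fragment, "html.parser")

# Tags after the broken line are still found normally.
metas = soup.find_all("meta")
viewport = soup.find("meta", attrs={"name": "viewport"})
style = soup.find("style")

print(len(metas))
print(viewport["content"])
print(style.string.strip())

# The unmatched </title> does not create a <title> element,
# so the title text survives only as a loose string in the tree.
print(soup.title)
```

The attrs={"name": ...} form is used because name is also the first positional parameter of find(), so it cannot be passed as a keyword filter.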

【Discussion】: