Web Scraping Basics: The beautifulsoup Library

I. Basic usage of beautifulsoup

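Put simply, Beautiful Soup is a Python library whose main job is extracting data from web pages. It parses HTML or XML that you have already downloaded; it does not fetch pages itself. The official description reads:

Beautiful Soup provides simple, Pythonic functions for navigating, searching, and modifying the parse tree.
It is a toolkit that parses a document and hands you the data you want to extract. Because it is simple, a complete application takes very little code.

For more, see the official documentation.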

1. Installation

pip3 install beautifulsoup4
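
To confirm the install worked, a quick check like the following can help. It is only a minimal sketch: it assumes the package was installed into the same interpreter that runs the script, and it relies on the `__version__` attribute that the bs4 package exposes.

```python
import bs4

# The pip package is named beautifulsoup4, but the importable module is bs4.
# __version__ is a string set by the installed package.
print(bs4.__version__)
```

If the import fails with ModuleNotFoundError, the package most likely went into a different Python environment than the one running the script.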

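(1) Parsers

Beautiful Soup supports the HTML parser in Python's standard library as well as several third-party parsers. If no third-party parser is installed, it uses Python's built-in html.parser. The lxml parser is more capable and faster, so installing it is recommended: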

pip3 install lxml

Another option is html5lib, a pure-Python parser that parses pages the same way a web browser does. Install it with:

pip install html5lib
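
Since lxml and html5lib are optional, a script may not know which parsers are present. The sketch below shows one possible way to handle that: try the parsers in order of preference and fall back to html.parser. The helper name `make_soup` and its `preferred` parameter are hypothetical, introduced only for illustration; `FeatureNotFound` is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: parse `markup` with the first installed parser.

    Returns a (soup, parser_name) tuple.
    """
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("no usable parser found")


# Deliberately sloppy markup: the <p> and <b> tags are never closed.
soup, used = make_soup("<p>Hello<b>world")
print(used)              # name of the parser that was actually used
print(soup.b.get_text())
```

Naming the parser explicitly, as done here, keeps results consistent across machines, because different parsers can build different trees from invalid markup.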

(2) Parser comparison

[Image: parser comparison table]

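2. Quick start

The following HTML is used as an example throughout. It is an excerpt from Alice's Adventures in Wonderland (referred to below as the "Alice" document):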

<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>

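Parsing this markup with BeautifulSoup yields a BeautifulSoup object, which can print the document with standard indentation:

from bs4 import BeautifulSoup

# html_doc is a str holding the HTML shown above
soup = BeautifulSoup(html_doc, 'html.parser')  # returns a bs4.BeautifulSoup object; parser: html.parser

print(soup.prettify())  # print the tree with standard indentation

Output: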

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
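
Beyond pretty-printing, the parsed object supports the navigating and searching mentioned in the introduction. The following is a minimal, self-contained sketch that stores the Alice document in a string and pulls out the title and the three links. The variable names (`title`, `links`, `last_link`) are illustrative choices, not part of the original tutorial.

```python
from bs4 import BeautifulSoup

# The "Alice" document from the quick start, stored as a string.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Attribute access returns the first matching tag; .string is its text.
title = soup.title.string

# find_all returns every match; class_ filters on the class attribute.
links = [(a.get_text(), a["href"]) for a in soup.find_all("a", class_="sister")]

# find with a keyword argument filters on that attribute, here id.
last_link = soup.find(id="link3")

print(title)
for name, href in links:
    print(name, href)
print(last_link.get_text())
```

Attribute access such as `soup.title` is convenient for the first occurrence of a tag, while `find_all` is the usual choice when every match is needed.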
