Web Crawler (Part 1): Crawling a Simple Website

I. Crawling a simple website

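In this chapter we use Python to crawl posts from cnblogs (博客园) and parse the data we get back, focusing on how to fetch that data with several different Python libraries. The listings are written for Python 2 (they use print statements and urllib2). The first step in crawling any site is to analyze its requests, using Chrome's developer tools, Fiddler, or Charles. If you are not familiar with request analysis, search online for an introduction.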
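
Request analysis usually ends with a set of request headers copied from the browser's Network panel. As a minimal sketch, assuming Python 3 and a hypothetical helper named parse_raw_headers (it is not part of requests or urllib), the copied "Name: value" lines can be turned into the kind of dict that the listings in this chapter pass as headers:

```python
def parse_raw_headers(raw_text):
    """Turn copied "Name: value" lines into a headers dict (hypothetical helper)."""
    headers = {}
    for line in raw_text.splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority"
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so values containing ":" stay intact
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        headers[name.strip()] = value.strip()
    return headers


# Sample text as it might be pasted from the browser
# (the values are the ones used in this chapter's listings)
RAW = """
Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4
Connection: keep-alive
Host: www.cnblogs.com
"""

headers = parse_raw_headers(RAW)
print(headers["Host"])  # prints: www.cnblogs.com
```

The resulting dict can be passed as the headers argument in the same way as the hand-written dicts in the listings.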

1. Fetching data with the requests library

Code 1




# -*- coding: utf-8 -*-

# Set the default string encoding (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# Import requests; BeautifulSoup is used later by Code 3
import requests
from bs4 import BeautifulSoup

# Target page URL
url = 'https://www.cnblogs.com/'

# Build the request headers.
# 'br' is left out of Accept-Encoding because requests cannot decode
# Brotli responses unless Brotli support is installed.
headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4',
        'Connection': 'keep-alive',
        'Host': 'www.cnblogs.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}


# Fetch the page
page = requests.get(url, headers=headers)

# Print the page content
print page.text
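
Code 1 advertises compressed encodings in its Accept-Encoding header, and requests decompresses gzip and deflate responses before exposing page.text. A lower-level client has to do that step itself. The following is only a rough illustration of the step, assuming Python 3; the helper name decompress_body is hypothetical and is not how requests implements it internally.

```python
import gzip
import zlib


def decompress_body(raw, content_encoding):
    """Undo the Content-Encoding of a response body (hypothetical helper)."""
    encoding = (content_encoding or "").strip().lower()
    if encoding == "gzip":
        return gzip.decompress(raw)
    if encoding == "deflate":
        try:
            # Most servers send a zlib-wrapped stream
            return zlib.decompress(raw)
        except zlib.error:
            # Some send a raw deflate stream without the zlib header
            return zlib.decompress(raw, -zlib.MAX_WBITS)
    # No encoding, or one this sketch does not handle: return the bytes as-is
    return raw


# Simulate a gzip-compressed response body and recover the original bytes
body = "博客园".encode("utf-8")
compressed = gzip.compress(body)
print(decompress_body(compressed, "gzip").decode("utf-8"))  # prints: 博客园
```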

2. Fetching data with urllib2's Request class

Code 2


# -*- coding: utf-8 -*-

# Set the default string encoding (Python 2 only)
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

from urllib2 import Request, urlopen, URLError, HTTPError

# Target page URL
url = 'https://www.cnblogs.com/'

# Build the request headers
headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Content-Type': 'text/html;charset=utf-8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4',
        'Connection': 'keep-alive',
        'Host': 'www.cnblogs.com',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'
}

# Build the request
req = Request(url, headers=headers)

# Send the request. HTTPError is a subclass of URLError, so catch it first.
try:
    response = urlopen(req)
except HTTPError as e:
    print 'HTTP error:', e.code
    raise
except URLError as e:
    print 'Failed to reach the server:', e.reason
    raise

# Read the page data (a byte string, not a requests response object)
page = response.read()

print page
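
urllib2 exists only in Python 2. As a hedged sketch of how Code 2 might look in Python 3, where the same classes live in urllib.request, the example below builds the request and decodes the response bytes. The function names build_request, decode_body and fetch are hypothetical, the header set is shortened, and UTF-8 is assumed when the server names no charset.

```python
from urllib.request import Request, urlopen

URL = "https://www.cnblogs.com/"

# A reduced header set; the values are copied from Code 2
HEADERS = {
    "Accept-Language": "zh-CN,zh;q=0.8,en-US;q=0.6,en;q=0.4",
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/61.0.3163.100 Safari/537.36"
    ),
}


def build_request(url=URL, headers=None):
    """Create the Request object without sending anything."""
    return Request(url, headers=headers or HEADERS)


def decode_body(raw, charset=None):
    """Decode raw bytes, assuming UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch(url=URL):
    """Send the request and return the page as text (needs network access)."""
    with urlopen(build_request(url), timeout=10) as response:
        # The charset, if any, comes from the Content-Type response header
        charset = response.headers.get_content_charset()
        return decode_body(response.read(), charset)
```

Calling print(fetch()) would then correspond to the final print in Code 2. On failure, urlopen raises HTTPError or URLError, both of which are found in urllib.error in Python 3.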

3. Parsing the HTML page with BeautifulSoup

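Append Code 3 to the end of Code 1. To append it to Code 2 instead, also add the line from bs4 import BeautifulSoup and pass page rather than page.text to BeautifulSoup, because response.read() already returns the raw HTML.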

Code 3

   
# Parse the page (page.text is the HTML returned by requests in Code 1)
soup = BeautifulSoup(page.text, "lxml")

for item in soup.select('div#post_list div.post_item'):
    print '---------'
    # The link under <h3> carries the post title (文章标题) and post URL (文章链接)
    title_link = item.select('div.post_item_body > h3 > a')[0]
    print "文章标题:" + title_link.string
    print "文章链接:" + title_link.attrs.get('href')
    # The link in the following <div> carries the author name (作者昵称) and blog URL (博客链接)
    author_link = item.select('div.post_item_body > div > a')[0]
    print "作者昵称:" + author_link.string
    print "博客链接:" + author_link.attrs.get('href')
    print '---------'

This prints:


---------
文章标题:Centos7搭建swarm集群
文章链接:http://www.cnblogs.com/shihuayun/p/7635329.html
作者昵称:美好人生shy
博客链接:http://www.cnblogs.com/shihuayun/
---------
---------
文章标题:H5音频处理的一些小知识
文章链接:http://www.cnblogs.com/interesting-me/p/7634276.html
作者昵称:我吃小月饼
博客链接:http://www.cnblogs.com/interesting-me/
---------
---------
文章标题:C#进阶之AOP
文章链接:http://www.cnblogs.com/xlxr45/p/7635297.html
作者昵称:一截生长
博客链接:http://www.cnblogs.com/xlxr45/
---------
---------
文章标题:逐步理解SpringMVC
文章链接:http://www.cnblogs.com/huhu1203/p/7633060.html
作者昵称:呼呼呼呼呼65
博客链接:http://www.cnblogs.com/huhu1203/
---------

... (remaining output omitted)
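
To make it easier to see what the selectors in Code 3 are matching, here is a rough stdlib-only sketch in Python 3. It uses html.parser instead of BeautifulSoup and does not implement CSS selectors: it simply takes the first link inside each h3 as the title and the first link after it as the author. The sample markup is an assumption inferred from those selectors (only the titles, names and URLs come from the output above), and the class name PostListParser is hypothetical.

```python
from html.parser import HTMLParser

# Hand-written sample; the real page markup may differ
SAMPLE = """
<div id="post_list">
  <div class="post_item">
    <div class="post_item_body">
      <h3><a href="http://www.cnblogs.com/shihuayun/p/7635329.html">Centos7搭建swarm集群</a></h3>
      <div><a href="http://www.cnblogs.com/shihuayun/">美好人生shy</a></div>
    </div>
  </div>
  <div class="post_item">
    <div class="post_item_body">
      <h3><a href="http://www.cnblogs.com/xlxr45/p/7635297.html">C#进阶之AOP</a></h3>
      <div><a href="http://www.cnblogs.com/xlxr45/">一截生长</a></div>
    </div>
  </div>
</div>
"""


class PostListParser(HTMLParser):
    """Collect title/link/author/blog for each div.post_item (simplified)."""

    def __init__(self):
        super().__init__()
        self.posts = []
        self._current = None  # record for the post_item being read
        self._in_h3 = False   # True while inside an <h3>
        self._field = None    # text field fed by the currently open <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and "post_item" in classes:
            # A new post starts: open a fresh record
            self._current = {"title": "", "link": None, "author": "", "blog": None}
            self.posts.append(self._current)
        elif tag == "h3":
            self._in_h3 = True
        elif tag == "a" and self._current is not None:
            post = self._current
            if self._in_h3 and post["link"] is None:
                post["link"] = attrs.get("href")
                self._field = "title"
            elif not self._in_h3 and post["link"] is not None and post["blog"] is None:
                post["blog"] = attrs.get("href")
                self._field = "author"

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_h3 = False
        elif tag == "a":
            self._field = None

    def handle_data(self, data):
        # Only text inside a tracked <a> is kept
        if self._current is not None and self._field:
            self._current[self._field] += data.strip()


parser = PostListParser()
parser.feed(SAMPLE)
for post in parser.posts:
    print(post["title"], post["link"], post["author"], post["blog"])
```

For real pages the BeautifulSoup selectors in Code 3 remain the more robust choice, since this sketch depends on the order of the links within each item.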
