03: Scraping Web Page Data with requests and BeautifulSoup

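1.1 Review of common crawler module commands

1. The requests module

1. pip install requests

2. response = requests.get('http://www.baidu.com/') # fetch the page at the given URL

3. response.text # response body decoded to text (str)

4. response.content # response body as raw bytes

5. response.encoding = 'utf-8' # decode the body as UTF-8

response.encoding = response.apparent_encoding # decode using the encoding detected from the page content

6. response.cookies # cookies returned by the server (a cookie jar)

response.cookies.get_dict() # the same cookies as a plain dict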
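The calls above can be combined into a small helper. This is a minimal sketch, not a definitive implementation: the function names `decode_body` and `fetch_html`, the `timeout` value, and the decision to raise on HTTP errors are assumptions made for illustration.

```python
import requests


def decode_body(response, encoding=None):
    """Return the body as str, using an explicit encoding or the detected one."""
    # apparent_encoding is a guess based on the bytes; pass `encoding` to override it.
    response.encoding = encoding or response.apparent_encoding
    return response.text


def fetch_html(url, encoding=None, timeout=10):
    """Download `url` and return (html_text, cookies_dict)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()          # raise requests.HTTPError on 4xx/5xx
    return decode_body(response, encoding), response.cookies.get_dict()


# Usage (needs network access):
#   html, cookies = fetch_html('http://www.baidu.com/')
#   print(html[:200], cookies)
```

Keeping the decoding step separate from the download makes it possible to check the encoding logic without touching the network.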

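2. The BeautifulSoup module

1. pip install beautifulsoup4

2. Parse the text into a BeautifulSoup object

    1) html.parser is part of the Python standard library, so nothing extra needs to be installed

    soup = BeautifulSoup(response.text, features='html.parser')

    2) lxml is a third-party library (pip install lxml) but performs better (use this one in production)

    soup = BeautifulSoup(response.text, features='lxml')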
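If a script has to run on machines where lxml may not be installed, one option is to try lxml first and fall back to html.parser. The sketch below assumes that behaviour is wanted; `make_soup` is a hypothetical helper name, and the HTML string is made up for the demonstration.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html, preferred='lxml'):
    """Parse html with the preferred parser, falling back to html.parser."""
    try:
        return BeautifulSoup(html, features=preferred)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser (e.g. lxml) is not installed.
        return BeautifulSoup(html, features='html.parser')


soup = make_soup("<div id='i1'><p>hello</p></div>")
print(soup.find('p').text)
```

The two parsers can build slightly different trees for malformed HTML, so a fallback like this is best suited to pages whose markup is reasonably clean.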

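3. .find(): returns a single object (the first match), or None if nothing matches

    1) Find the first div in the scraped content

target = soup.find('div')

    2) Find a div in the scraped content whose id attribute is 'i1'

target = soup.find('div', id='i1')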

4. .find_all(): returns a list of objects

1) Find all li tags inside the target object obtained above

li_list = target.find_all('li')
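The difference between the two methods is easiest to see on a small document. The HTML below is invented for the demonstration; only `find` and `find_all` come from the notes above.

```python
from bs4 import BeautifulSoup

html = """
<div id="i1">
  <ul>
    <li>first</li>
    <li>second</li>
  </ul>
</div>
<div id="i2"><ul><li>third</li></ul></div>
"""
soup = BeautifulSoup(html, features='html.parser')

target = soup.find('div', id='i1')       # first matching Tag
missing = soup.find('div', id='nope')    # None, no exception is raised
li_list = target.find_all('li')          # only the li tags inside the i1 div

print([li.text for li in li_list])       # ['first', 'second']
print(missing)                           # None
```

Because `.find()` returns None on a miss, calling `.find_all()` on its result without a check raises AttributeError when the element is absent.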

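5. Read attributes and text from an object returned by .find()

    a.attrs.get('href') # href attribute of this a tag (the link URL), or None if it is absent

    a.find('h3').text # text of the first h3 tag inside the a tag

    img_url = a.find('img').attrs.get('src') # src attribute of the first img tag inside the a tag (the image URL)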
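Chained calls such as `a.find('h3').text` fail with AttributeError when the inner tag is missing. A defensive variant might look like the sketch below; `extract_item`, the dictionary keys, and the sample HTML are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup


def extract_item(li):
    """Return {'href', 'title', 'img'} for one <li>, or None if it has no link."""
    a = li.find('a')
    if a is None:
        return None
    h3 = a.find('h3')
    img = a.find('img')
    return {
        'href': a.attrs.get('href'),
        'title': h3.text.strip() if h3 is not None else None,
        'img': img.attrs.get('src') if img is not None else None,
    }


html = """
<ul>
  <li><a href="/news/1.html"><img src="//www2.example.com/1.jpg"><h3> Title one </h3></a></li>
  <li><a href="/news/2.html">no heading or image</a></li>
  <li>no link at all</li>
</ul>
"""
soup = BeautifulSoup(html, features='html.parser')
items = [extract_item(li) for li in soup.find_all('li')]
```

Here `items[0]` carries all three values, `items[1]` has None for the title and image, and `items[2]` is None because the li contains no a tag.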

1.2 Scraping pages that require login and pages that do not

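import uuid

import requests
from bs4 import BeautifulSoup

response = requests.get(
    url='http://www.autohome.com.cn/news/'
)

response.encoding = response.apparent_encoding          # decode using the encoding detected from the page content

# 1 Parse the text into a BeautifulSoup object
# soup = BeautifulSoup(response.text, features='lxml')        # lxml is a third-party library but performs better (use it in production)
soup = BeautifulSoup(response.text, features='html.parser')  # html.parser ships with Python; nothing to install

# 2 Find the element whose id is "auto-channel-lazyload-article"
target = soup.find(id="auto-channel-lazyload-article")

# 3.1 .find() returns only the first match; .find_all('li') returns every li tag
# 3.2 Tag name and attributes can be combined, e.g. .find('div', id='i1')
# 3.3 .find() returns a single object; .find_all() returns a list
li_list = target.find_all('li') if target is not None else []   # target is None if the element is missing

for i in li_list:
    a = i.find('a')
    if a is None:
        continue
    print(a.attrs.get('href'))                   # URL of this a tag
    # a.find('h3') returns an object; .text gives its text content
    h3 = a.find('h3')
    img = a.find('img')
    if h3 is None or img is None:                # skip entries without a title or an image
        continue
    txt = h3.text                                # text of the first h3 inside the a tag
    print(txt, type(txt))
    img_url = img.attrs.get('src')               # src attribute of the img tag (the image URL)
    file_name = str(uuid.uuid4()) + '.jpg'

    # The site returns rewritten image URLs (starting with //www2), so convert them to a full URL first
    if img_url and img_url.startswith('//www2'):
        img_url2 = img_url.replace('//www2', 'http://www3')
        img_response = requests.get(url=img_url2)
        with open(file_name, 'wb') as f:
            f.write(img_response.content)        # save the image locally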

Example 1: Scraping the Autohome news page (a page that does not require login)
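The image URLs in Example 1 start with `//`, meaning they are protocol-relative. The example handles them with a site-specific string replacement. A more general way to turn such a value into a full URL is `urllib.parse.urljoin` from the standard library. This sketch swaps in that technique for illustration only: it does not reproduce the www2-to-www3 host change from the example, `absolute_url` is a hypothetical helper name, and the example.com addresses are placeholders.

```python
from urllib.parse import urljoin

PAGE_URL = 'http://www.autohome.com.cn/news/'


def absolute_url(src, base=PAGE_URL):
    """Resolve a relative or protocol-relative URL against the page URL."""
    return urljoin(base, src)


print(absolute_url('//www2.example.com/pic/1.jpg'))   # http://www2.example.com/pic/1.jpg
print(absolute_url('/static/logo.png'))               # http://www.autohome.com.cn/static/logo.png
print(absolute_url('https://cdn.example.com/2.jpg'))  # https://cdn.example.com/2.jpg
```

A protocol-relative URL inherits the scheme of the page it came from, while an already absolute URL is returned unchanged.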