Crawler modules: the requests module

I. Installing the module

　pip install requests

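II. Introduction to crawlers

　What a crawler is: a program that imitates a browser to send requests, downloads the responses, extracts the useful data from them and saves that data to a database.

　Why crawlers are valuable: they collect useful data and store it in a database for later use.

　Basic workflow of a crawler: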


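1. Send the request
Use an HTTP library to send a request (a Request) to the target site.
A Request contains the request headers, the request body, etc.
 
2. Get the response content
If the server responds normally, you get a Response.
A Response may contain HTML, JSON, images, video, etc.
 
3. Parse the content
HTML data: regular expressions, or third-party parsing libraries such as BeautifulSoup and pyquery
JSON data: the json module
Binary data: write it to a file opened in binary ("b") mode
 
4. Save the data
Database
File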
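
Steps 3 and 4 can be sketched without touching the network. This is only an illustration under stated assumptions: the three response bodies, the `links` table and the file name `logo.png` are made up for the example, and only the standard library is used (re, json, sqlite3) instead of BeautifulSoup or pyquery.

```python
import json
import os
import re
import sqlite3
import tempfile

# Hypothetical response bodies standing in for what a server might return.
HTML_BODY = '<html><body><a href="/a.html">A</a><a href="/b.html">B</a></body></html>'
JSON_BODY = '{"title": "demo", "pages": 2}'
BINARY_BODY = b"\x89PNG\r\n\x1a\n"  # the first bytes of a PNG file


def parse_links(html):
    """Step 3 (HTML): pull every href value out with a regular expression."""
    return re.findall(r'href="(.*?)"', html)


def save_links(conn, urls):
    """Step 4 (database): store the extracted links in a table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (url TEXT)")
    conn.executemany("INSERT INTO links (url) VALUES (?)", [(u,) for u in urls])
    conn.commit()


links = parse_links(HTML_BODY)   # parse HTML
data = json.loads(JSON_BODY)     # parse JSON with the json module

conn = sqlite3.connect(":memory:")  # an in-memory database keeps the sketch self-contained
save_links(conn, links)
stored = [row[0] for row in conn.execute("SELECT url FROM links ORDER BY rowid")]

# Step 4 (file): binary data is written to a file opened in "wb" mode.
with tempfile.TemporaryDirectory() as tmp:
    image_path = os.path.join(tmp, "logo.png")
    with open(image_path, "wb") as f:
        f.write(BINARY_BODY)
    with open(image_path, "rb") as f:
        written = f.read()

print(links, data, stored, written == BINARY_BODY)
```

In a real crawler the three bodies would come from the Response obtained in step 2.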

　Requests and responses:


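Request: the user sends their information to the server (socket server) through the browser (socket client).
 
Response: the server receives the request, analyses the information the user sent and returns data (the returned data may reference further resources such as images, JS and CSS).
 
After receiving the Response, a browser parses its content and displays it to the user. A crawler imitates the browser's request, receives the Response and extracts the useful data from it instead.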

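III. URLs

　Every URL is made up of a protocol, a host (an IP address, or a domain name that resolves to one) and a port. When the port is omitted it defaults to 80 for HTTP and 443 for HTTPS.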
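
The three parts can be pulled out of a URL with the standard library. A minimal sketch: the helper `describe` and the table `DEFAULT_PORTS` are names made up for this example, and the default port has to be filled in by hand because `urlsplit` reports `None` when the URL does not spell one out.

```python
from urllib.parse import urlsplit

# Default ports filled in by hand; urlsplit() does not supply them.
DEFAULT_PORTS = {"http": 80, "https": 443}


def describe(url):
    """Return (protocol, host, port) for a URL, applying the default port if needed."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return parts.scheme, parts.hostname, port


print(describe("http://127.0.0.1/index.html"))
print(describe("http://127.0.0.1:8080/"))
print(describe("https://www.baidu.com/s?wd=python"))
```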

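　User-Agent: identifies which browser (client) sent the request.

　Referer: the page the request came from.

　What hotlinking is: a site embeds links to resources (images, for example) that are hosted on someone else's site, so the visits count for the linking site while the other site serves the data.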
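
One reason Referer matters to a crawler is that a site may check it to block hotlinking. The following is only a guess at how such a check might look on the server side, not any particular server's implementation; `ALLOWED_HOSTS` and `is_hotlink` are hypothetical names, and treating an empty Referer as acceptable is an assumption.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts that are allowed to embed this site's resources.
ALLOWED_HOSTS = {"example.com", "www.example.com"}


def is_hotlink(referer):
    """Return True when the Referer names a host outside the allowed list."""
    if not referer:
        # Assumption: a missing Referer (e.g. a direct visit) is let through.
        return False
    return urlsplit(referer).hostname not in ALLOWED_HOSTS


print(is_hotlink("https://www.example.com/gallery"))
print(is_hotlink("https://other-site.test/page"))
print(is_hotlink(""))
```

A crawler that sends a Referer from the target site itself would pass a check like this, which is why the login example later in these notes sets the header.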

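　Key parts of a request: the method (request type), the URL, the request headers and the request body.

　The response:

　　Preserve log: an option in the browser's developer tools that keeps the record of every request across page jumps, which is what you need when capturing traffic

　　Location: when a response carries a Location header, the client is redirected to the address it gives

　　Set-Cookie: tells the client which cookies to store

　　Preview: a view of the data returned by the target host

　Key parts of a response: the status code, the response headers and the response body.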
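
To make the three parts concrete, here is a small sketch that splits a raw HTTP response into status code, headers and body. The response bytes in `RAW` are invented for the example (a redirect that also sets a cookie), and `split_response` is a made-up helper built on the standard library's `http.client.parse_headers`.

```python
import http.client
import io

# A made-up raw response: a redirect that also sets a cookie.
RAW = (
    b"HTTP/1.1 302 Found\r\n"
    b"Location: /settings/emails\r\n"
    b"Set-Cookie: user_session=abc123; Path=/; HttpOnly\r\n"
    b"Content-Length: 11\r\n"
    b"\r\n"
    b"redirecting"
)


def split_response(raw):
    """Split raw response bytes into (status code, headers, body)."""
    fp = io.BytesIO(raw)
    status_line = fp.readline().decode("iso-8859-1").strip()
    _version, code, _reason = status_line.split(" ", 2)
    headers = http.client.parse_headers(fp)  # reads up to the blank line
    body = fp.read()                         # everything after it is the body
    return int(code), headers, body


status, headers, body = split_response(RAW)
print(status)
print(headers["Location"])
print(headers.get_all("Set-Cookie"))
print(body)
```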

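IV. The requests module

　requests: sends requests the way a browser would.

　requests.get: sends a GET request (requests.post sends a POST request)

　　headers: the request headers, used to imitate a browser.

　　params: the query parameters; requests URL-encodes them and appends them to the URL

　　　wd: the search term; pn: the paging parameter

　　cookies: the cookies to send with the request

　status_code: the status code of the response.

　text: the response body as text

　encoding: the encoding used to decode text; it can be set explicitly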

import requests

num = 1  # value for the pn (paging) parameter
url = 'https://www.baidu.com/s'
response = requests.get(url,
             headers={
                 'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'
             },
             # params is URL-encoded and appended to the URL as ?wd=...&pn=...
             params={
                 'wd':'美女',
                 'pn':num
             },
             cookies={
                 'user_session':'CpxXbf5MvLuoRxVeIqUNHs6WlwUOkF4vMqcZ2IoKAZ5Sia'
             }
             )

# True if the text appears anywhere in the response body
print('378533872@qq.com' in response.text)
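
What headers, params and cookies turn into can be inspected without sending anything, by building a request and calling `prepare()` on it. A small sketch, assuming requests is installed; the search term and the cookie value are placeholders chosen for the example.

```python
from urllib.parse import parse_qs, urlsplit

import requests

# Build the request but do not send it; the values below are placeholders.
req = requests.Request(
    "GET",
    "https://www.baidu.com/s",
    headers={"User-Agent": "Mozilla/5.0"},
    params={"wd": "爬虫", "pn": 1},
    cookies={"user_session": "demo-value"},
)
prepared = req.prepare()

print(prepared.method)             # the request type
print(prepared.url)                # params appear URL-encoded after the "?"
print(prepared.headers["Cookie"])  # cookies are sent as a Cookie header

# Decoding the query string gives the original values back.
query = parse_qs(urlsplit(prepared.url).query)
print(query)
```

The non-ASCII search term shows up percent-encoded in `prepared.url`, which is the encoding step that params takes care of.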


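　requests.post: sends a POST request

　　data: the request body (the form fields)

　　　commit: the value of the form's submit button ('Sign in' in the login example)

　　　utf8: a form field that marks the submission as UTF-8 encoded

　　　authenticity_token: the CSRF token

　　　login: the user name

　　　password: the password

　　allow_redirects: whether to follow redirects. True (the default) follows them; False does not

　cookies: the cookies returned with the response

　　get_dict: converts them to a dict

　headers: the response headers

　history: the redirect responses received before the final one

　requests can handle cookies and sessions for us: requests.session() returns a Session object that stores the cookies it receives and sends them back automatically, so they no longer have to be passed by hand.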
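
The effect of a Session, allow_redirects and history can be tried against a throwaway local server instead of a real site. This is a sketch under assumptions: requests is installed, and the handler `DemoHandler`, the paths `/session` and `/settings`, the cookie value and the credentials are all invented for the example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class DemoHandler(BaseHTTPRequestHandler):
    """Toy server: POST sets a cookie and redirects, GET checks the cookie."""

    def do_POST(self):
        self.rfile.read(int(self.headers.get("Content-Length", 0)))  # discard the form body
        self.send_response(302)
        self.send_header("Location", "/settings")
        self.send_header("Set-Cookie", "user_session=demo123; Path=/")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "user_session=demo123" in self.headers.get("Cookie", "")
        body = b"welcome back" if logged_in else b"please sign in"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the output quiet


def new_session():
    session = requests.Session()
    session.trust_env = False  # ignore proxy settings from the environment
    return session


server = HTTPServer(("127.0.0.1", 0), DemoHandler)  # port 0: pick any free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port
form = {"login": "demo", "password": "secret"}

session = new_session()
# allow_redirects=False: we see the 302 itself, with its Location header.
first = session.post(base + "/session", data=form, allow_redirects=False)
# The session kept the cookie from the 302, so this request is "logged in".
second = session.get(base + "/settings")
# A fresh session has no cookie and is treated as anonymous.
anonymous = new_session().get(base + "/settings")
# allow_redirects=True: the redirect is followed and recorded in history.
followed = new_session().post(base + "/session", data=form, allow_redirects=True)
history_codes = [r.status_code for r in followed.history]
session_cookies = session.cookies.get_dict()

server.shutdown()
server.server_close()

print(first.status_code, first.headers["Location"])
print(second.text, "|", anonymous.text)
print(history_codes, followed.text)
```

With a Session there is no need to call `cookies.get_dict()` and pass the result along by hand between requests.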

import re

import requests

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.84 Safari/537.36'

# Step 1: load the login page to get the initial cookies and the CSRF token.
response = requests.get('https://github.com/login',
                        headers={
                            'User-Agent': USER_AGENT,
                        },
                        )

cookies = response.cookies.get_dict()
authenticity_token = re.findall(r'name="authenticity_token".*?value="(.*?)"', response.text, re.S)[0]

# Step 2: submit the login form together with the cookies from step 1.
response = requests.post('https://github.com/session',
                         cookies=cookies,
                         headers={
                             'User-Agent': USER_AGENT,
                             'Referer': 'https://github.com/',
                         },
                         data={
                             'commit': 'Sign in',
                             'utf8': '✓',
                             'authenticity_token': authenticity_token,
                             'login': 'egonlin',
                             'password': 'lhf@123',
                         },
                         # do not follow the redirect, so the cookies set by this response can be read
                         allow_redirects=False
                         )

login_cookies = response.cookies.get_dict()

# print(response.status_code)
# print('Location' in response.headers)
# print(response.text)
# print(response.history)

# Step 3: request a page that requires login, sending the cookies issued at login.
response = requests.get('https://github.com/settings/emails',
                        cookies=login_cookies,
                        headers={
                            'Referer': 'https://github.com/',
                            'User-Agent': USER_AGENT,
                        })

# True if the account's email address appears in the page, i.e. the login worked
print('378533872@qq.com' in response.text)
