The Requests Module in Python

Official Requests documentation (Chinese edition): http://docs.python-requests.org/zh_CN/latest/
1. Requests quick start

1. What Requests does: it sends HTTP requests and retrieves the response data.

2. Requests is a third-party module and must be installed into the Python environment separately: pip install requests (or pip3 install requests)

3. HTTP request types
import requests
r = requests.get('https://api.github.com/events')                              # GET
r1 = requests.get(url='http://dict.baidu.com/s', params={'wd': 'python'})      # GET with query parameters
r2 = requests.post("http://m.ctrip.com/post")                                  # POST
r3 = requests.put("http://m.ctrip.com/put")                                    # PUT
r4 = requests.delete("http://m.ctrip.com/delete")                              # DELETE
r5 = requests.head("http://m.ctrip.com/head")                                  # HEAD
r6 = requests.options("http://m.ctrip.com/get")                                # OPTIONS

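4. Each of the calls above returns a Response object. Its most commonly used attributes and methods are:
Attribute/method                  Description
status_code                       HTTP status code returned by the server
text                              Response body as a str, decoded by Requests with the encoding it inferred (see encoding)
content                           Response body as raw bytes
headers                           Response headers
request.headers                   Headers of the request that was sent
cookies                           Cookies set by the response (via Set-Cookie), as a RequestsCookieJar object
cookies.get_dict()                Response cookies as a dict
cookies.items()                   Response cookies as a list of (name, value) tuples
encoding                          Encoding Requests inferred for the response body; text is decoded with it
json()                            Parses the response body as JSON and returns the result
url                               Final URL of the response; it can differ from the requested URL, for example after a redirect
raw                               Raw socket response from the server (requires stream=True)
history                           Redirect history, as a list of Response objects
raise_for_status()                Raises an HTTPError if the status code is 4XX or 5XX; otherwise does nothing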
 
5. Using the requests library in detail

5.1 Passing URL parameters
url_params = {'key1': 'value1', 'key2': 'value2'}
r7 = requests.get("http://httpbin.org/get", params=url_params)  # pass parameters as a dict; keys whose value is None are not added to the URL's query string
print(r7.url)  # prints: http://httpbin.org/get?key1=value1&key2=value2
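The way params ends up in the URL can be checked without sending anything, by preparing the request locally. This is a minimal sketch; build_url is a hypothetical helper name introduced here for illustration.

```python
import requests


def build_url(base_url, params):
    """Return the final URL Requests would use, without sending the request."""
    prepared = requests.Request("GET", base_url, params=params).prepare()
    return prepared.url


# A key whose value is None is dropped from the query string.
print(build_url("http://httpbin.org/get", {"key1": "value1", "key2": None}))

# A list value repeats the key once per item.
print(build_url("http://httpbin.org/get", {"key1": "value1", "key2": ["a", "b"]}))
```

Preparing a request this way is handy for debugging query strings, because no network access is needed.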
  
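5.2 Getting/changing the page encoding
# -*- coding: utf-8 -*-
import requests
res = requests.get(url='http://dict.baidu.com/s', params={'wd': 'python'})
print(res.encoding)                # encoding Requests inferred for the page
res.encoding = 'ISO-8859-1'        # override the encoding used when decoding res.text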
 
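5.3 Getting the response content

5.3.1 res.content
# The body as bytes: the data exactly as fetched from the network, with no decoding applied, so its type is bytes.
# Non-ASCII text such as Chinese therefore shows up as escaped bytes (for example \xe4\xb8\xad) rather than as readable characters.
# Strings stored on disk or sent over the network are always bytes.
print(res.content)

5.3.2 res.text
# The body as text: the str that Requests gets by decoding response.content. Decoding needs an encoding, and Requests infers one on its own.
# The guess is sometimes wrong, which produces garbled text. In that case decode manually with an explicit encoding, e.g. response.content.decode('utf-8').
# response.content.decode() defaults to utf-8
# Common encodings: utf-8, gbk, gb2312, ascii, iso-8859-1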
 
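import requests
# target URL
url = 'https://www.baidu.com/'
# send a GET request to the target URL
resp = requests.get(url)
# print the response content
print(resp.text)  # may be garbled if the inferred encoding is wrong
print(resp.encoding)
print(resp.content.decode('utf-8'))  # decode with an explicit encoding to fix the garbling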
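The garbling can be reproduced offline. In this small sketch, raw_body is a made-up stand-in for resp.content, and the two decodes mimic a wrong guess and the correct explicit choice.

```python
# Stand-in for resp.content: UTF-8 encoded bytes, as a server might send them.
raw_body = "中文".encode("utf-8")

wrong = raw_body.decode("iso-8859-1")  # what a wrong encoding guess produces
right = raw_body.decode("utf-8")       # explicit, correct decoding

print(repr(wrong))  # six unrelated Latin-1 characters, one per byte
print(right)        # 中文
```

ISO-8859-1 maps every byte to some character, so the wrong decode never fails; it just silently returns the wrong text, which is why this kind of garbling is easy to miss.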

     
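5.3.3 Getting a JSON response
r = requests.get('https://api.github.com/events')
print(r.json())  # Requests has a built-in JSON decoder for working with JSON data
# Note: if JSON decoding fails, r.json() raises an exception. For example, if the body is empty (204 No Content) or is not valid JSON, r.json() raises a ValueError subclass (requests.exceptions.JSONDecodeError in Requests 2.27 and later).
# A successful call to r.json() does not mean the request succeeded. Some servers include a JSON object in a failed response (such as the error details of an HTTP 500), and that JSON is decoded and returned as well.
# To check that a request succeeded, use r.raise_for_status() or check that r.status_code is what you expect.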
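One way to combine the two checks is to test the status first and only then decode. The helper below is a sketch: parse_json_or_none is a hypothetical name, and it only relies on the raise_for_status() and json() methods of the response it is given.

```python
def parse_json_or_none(resp):
    """Return the decoded JSON body, or None when the body is not valid JSON.

    Lets the exception from resp.raise_for_status() propagate for 4XX/5XX
    responses, so an error payload is never mistaken for a successful result.
    """
    resp.raise_for_status()   # fail on error statuses first
    try:
        return resp.json()
    except ValueError:        # JSON decoding errors are ValueError subclasses
        return None
```

A typical call would look like data = parse_json_or_none(requests.get(url, timeout=5)).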
 

5.3.4 Getting the raw response content
# In the rare case that you want the raw socket response from the server, use r.raw. When you do, make sure stream=True was set on the initial request:
r8 = requests.get('https://api.github.com/events', stream=True)
print(r8.raw)
print(r8.raw.read(10))


6. Custom request headers
# Send a User-Agent in the request headers so the request looks like it comes from a browser
url = 'http://m.ctrip.com'
headers = {'User-Agent' : 'Mozilla/5.0 (Linux; Android 4.2.1; en-us; Nexus 4 Build/JOP40D) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.166 Mobile Safari/535.19'}
r9 = requests.post(url, headers=headers)
print(r9.request.headers)

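# httpbin.org echoes back details of HTTP requests and responses, such as cookies, IP, headers and login authentication, and supports GET, POST and other methods, which makes it useful for web development and testing.
# The following sections use it to walk through the requests library.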

7. More complex POST requests
# a) Pass a dict as data to send form-encoded data, like an HTML form
payload = {'key1': 'value1', 'key2': 'value2'}
r10 = requests.post("http://httpbin.org/post", data=payload)
print(r10.text)
# Output
# {
#   "args": {},
#   "data": "",
#   "files": {},
#   "form": {
#     "key1": "value1",
#     "key2": "value2"
#   },
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "23",
#     "Content-Type": "application/x-www-form-urlencoded",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0",
#     "X-Amzn-Trace-Id": "Root=1-614c46f4-62d1363a3780f5b76459094e"
#   },
#   "json": null,
#   "origin": "222.211.234.162",
#   "url": "http://httpbin.org/post"
# }
 
# b) Pass a list of tuples as data when several form fields share the same key
payload = [('key1', 'value1'), ('key1', 'value2')]
r11 = requests.post('http://httpbin.org/post', data=payload)
print(r11.text)
# Output
# {
#   ...
#   "form": {
#     "key1": [
#       "value1",
#       "value2"
#     ]
#   },
#   ...
# }
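What data actually puts on the wire can be inspected locally by preparing the request instead of sending it. The sketch below uses a hypothetical prepare_post helper; the last case uses the json parameter, which is not covered above and is included only as a contrast to form encoding.

```python
import json

import requests

URL = "http://httpbin.org/post"


def prepare_post(**kwargs):
    """Build the POST request Requests would send, without sending it."""
    return requests.Request("POST", URL, **kwargs).prepare()


# a) dict -> form-encoded body
form = prepare_post(data={"key1": "value1", "key2": "value2"})
print(form.headers["Content-Type"])
print(form.body)

# b) list of tuples -> the same key may appear more than once
repeated = prepare_post(data=[("key1", "value1"), ("key1", "value2")])
print(repeated.body)

# json= serializes the object and sets a JSON content type instead
as_json = prepare_post(json={"key1": "value1"})
print(as_json.headers["Content-Type"])
print(json.loads(as_json.body))
```

The form-encoded body in case a) is 23 characters long, which matches the Content-Length of 23 in the httpbin output for the dict example.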
 
8. POSTing a multipart-encoded file
# Example 1
url = 'http://httpbin.org/post'
with open('report.xls', 'rb') as f:  # open in binary mode; the file is closed once the request is done
    r = requests.post(url, files={'file': f})
print(r.text)

# Example 2: set the filename, content type and headers explicitly
url = 'http://httpbin.org/post'
with open('report.xls', 'rb') as f:
    files = {'file': ('report.xls', f, 'application/vnd.ms-excel', {'Expires': '0'})}
    r = requests.post(url, files=files)

# Example 3: send a string to be received as a file
import requests
url = 'http://httpbin.org/post'
files = {'file': ('report.csv', 'some,data,to,send\nanother,row,to,send\n')}
r = requests.post(url, files=files)
print(r.text)
# Output
# {
#   "args": {},
#   "data": "",
#   "files": {
#     "file": "some,data,to,send\nanother,row,to,send\n"
#   },
#   "form": {},
#   "headers": {
#     "Accept": "*/*",
#     "Accept-Encoding": "gzip, deflate",
#     "Content-Length": "184",
#     "Content-Type": "multipart/form-data; boundary=2e709f54c60d7bff4b99ee79b0e28197",
#     "Host": "httpbin.org",
#     "User-Agent": "python-requests/2.21.0",
#     "X-Amzn-Trace-Id": "Root=1-614d396f-42e527fa6c6baa544bb12d7d"
#   },
#   "json": null,
#   "origin": "222.211.234.162",
#   "url": "http://httpbin.org/post"
# }
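To see what the multipart body looks like before it reaches httpbin, the request from example 3 can be prepared locally. This is only an illustrative sketch; the boundary string is generated at random, so it differs on every run.

```python
import requests

files = {"file": ("report.csv", "some,data,to,send\nanother,row,to,send\n")}
prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()

# The Content-Type carries the boundary that separates the parts of the body.
print(prepared.headers["Content-Type"])

# The body is bytes: one part per file, each with its own Content-Disposition header.
print(prepared.body.decode("utf-8"))
```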

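9. Response status codes
r1 = requests.get('http://httpbin.org/get')
print(r1.status_code)
# For convenience, Requests also ships with a built-in status code lookup object:
print(r1.status_code == requests.codes.ok)  # True if the status code is 200
If the request failed (a 4XX client error or a 5XX server error response), call Response.raise_for_status() to raise an exception. For a status code below 400 it does nothing.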
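The behaviour of raise_for_status() can be tried without a server by filling in a Response object by hand. This is purely an illustration: make_response is a hypothetical helper, and real code gets its Response from requests.get() and friends.

```python
import requests


def make_response(status_code):
    """Build a bare Response by hand (illustration only; no request is sent)."""
    resp = requests.Response()
    resp.status_code = status_code
    resp.url = "http://httpbin.org/status/%d" % status_code
    return resp


for code in (200, 404, 500):
    try:
        make_response(code).raise_for_status()
        print(code, "ok")
    except requests.exceptions.HTTPError as exc:
        # exc.response is the Response that triggered the error
        print(code, "failed:", exc)
```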

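10. Response headers
r2 = requests.get('http://m.ctrip.com')
# a) View the server's response headers, presented as a Python dict
print(r2.headers)
# b) Two ways to access a response header field
# The headers dict is special: it is made just for HTTP headers. Per RFC 7230 (which obsoletes RFC 2616), HTTP header names are case-insensitive, so any capitalization works when accessing a field:
print(r2.headers['Content-Type'])
print(r2.headers.get('content-type'))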
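The case-insensitive lookup comes from the dict type Requests uses for headers, requests.structures.CaseInsensitiveDict, which can be tried on its own. The header value below is made up for the example.

```python
from requests.structures import CaseInsensitiveDict

headers = CaseInsensitiveDict({"Content-Type": "text/html; charset=utf-8"})

print(headers["content-type"])      # lookup ignores case
print(headers.get("CONTENT-TYPE"))
print("content-Type" in headers)    # membership test ignores case too
print(list(headers))                # iteration keeps the original capitalization
```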

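11. Cookies

11.1 Sending a cookie in the headers parameter
Websites often use the Cookie request header to keep track of a user's session, so we can add a Cookie entry to the headers parameter to imitate a normal logged-in user. We use the GitHub login as an example.

Capturing and analysing the GitHub login traffic
Open the browser, right-click and choose Inspect, open the Network tab and tick Preserve log
Visit the GitHub login URL https://github.com/login
After entering the username and password and signing in, visit a URL that only returns the correct content when logged in, for example click Your profile in the top-right corner to open https://github.com/USER_NAME
Once the URL is settled, work out the User-Agent and Cookie request headers needed to send that request
Copy the User-Agent and Cookie from the browser
The header names and values in the headers parameter must match those in the browser exactly
The value for the Cookie key in the headers dict is a string
Example: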
import requests
 
url = 'https://github.com/USER_NAME'

# build the request headers dict
headers = {
    # User-Agent copied from the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    # Cookie copied from the browser
    'Cookie': 'xxx (paste the copied cookie string here)'
}

# send the cookie string inside the headers dict
resp = requests.get(url, headers=headers)
print(resp.text)

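11.2 Sending cookies with the cookies parameter

Form of the cookies parameter: a dict
cookies = {"cookie name": "cookie value"}

The dict corresponds to the Cookie string in the request headers, in which the name=value pairs are separated by a semicolon followed by a space
The part to the left of each equals sign is the cookie name, which becomes a key of the cookies dict; the part to the right becomes the corresponding value
Note: cookies usually have an expiry time, and once they expire you need to obtain them again
Example 1: convert a cookie string copied from the browser into the dict the cookies parameter expects, then send the request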
import requests

url = 'https://github.com/USER_NAME'
 
# build the request headers dict
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
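# build the cookies dict
cookies_str = 'cookie string copied from the browser'
# split each pair on the first '=' only, because a cookie value may itself contain '='
cookies_dict = {name: value for name, _, value in (cookie.partition('=') for cookie in cookies_str.split('; '))}
# pass the cookies through the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)
print(resp.text)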
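Cookie values can themselves contain '=' (base64 padding is a common case), so splitting on every '=' and keeping only the last piece can truncate them. Below is a sketch of a converter that splits on the first '=' only; cookie_header_to_dict is a hypothetical helper name and the sample cookie string is made up.

```python
def cookie_header_to_dict(cookie_header):
    """Convert a 'name=value; name2=value2' Cookie string into a dict."""
    cookies = {}
    for pair in cookie_header.split(";"):
        # partition() splits on the first '=' only, so '=' inside a value survives
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip empty or malformed pieces that have no '='
            cookies[name] = value
    return cookies


print(cookie_header_to_dict("a=1; token=abc==; b=2"))
```

The resulting dict can be passed directly as the cookies argument of requests.get().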

Example 2:
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r4 = requests.get(url, cookies=cookies)
print(r4.text)
# Output
# {
#   "cookies": {
#     "cookies_are": "working"
#   }
# }

11.3 Reading cookies: converting the cookie jar object to a dict or a list
import requests
url = 'https://www.baidu.com/'
r3 = requests.get(url)
print(r3.cookies)#<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
print(r3.cookies.get_dict())#{'BDORZ': '27315'}
print(r3.cookies.items())#[('BDORZ', '27315')]

11.4 Accessing a cookie value by cookie name
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
print(r.cookies['example_cookie_name'])
#'example_cookie_value'

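12. Setting a timeout
# The timeout parameter tells requests to stop waiting for a response after the given number of seconds. Nearly all production code should set it; without it, your program may hang indefinitely.
# If no response arrives within timeout seconds of sending the request, an exception is raised.
r = requests.get('http://m.ctrip.com', timeout=0.001)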

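13. Using proxies
1) The proxies parameter specifies a proxy IP so that the forward proxy server at that address forwards the requests we send. First, a look at what proxy IPs and proxy servers are.
2) How a proxy is used: a proxy IP is an IP address that points to a proxy server, and the proxy server forwards our requests to the target server.
3) Forward proxies versus reverse proxies:
The proxy IP given in the proxies parameter points to a forward proxy server, which implies there are also reverse proxy servers. The difference is as follows.

The distinction is made from the point of view of the party sending the request.
A proxy that forwards requests on behalf of the browser or client (the sender) is a forward proxy. The browser knows the real IP address of the server that ultimately handles the request. Example: a VPN.
A proxy that forwards requests not on behalf of the browser or client but on behalf of the server that ultimately handles them is a reverse proxy. The browser does not know the server's real address. Example: nginx.
4) Types of proxy IP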

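A. By degree of anonymity, proxy IPs fall into three classes:

Transparent proxy: it puts its own IP in front of yours, but the target server can still find out who you are. The target server receives these request headers:

REMOTE_ADDR = Proxy IP
HTTP_VIA = Proxy IP
HTTP_X_FORWARDED_FOR = Your IP

Anonymous proxy: others can tell that you are using a proxy, but not who you are. The target server receives these request headers:

REMOTE_ADDR = proxy IP
HTTP_VIA = proxy IP
HTTP_X_FORWARDED_FOR = proxy IP

Elite proxy (high anonymity proxy): others cannot tell that you are using a proxy at all, which makes it the best choice of the three. The target server receives these request headers:

REMOTE_ADDR = Proxy IP
HTTP_VIA = not determined
HTTP_X_FORWARDED_FOR = not determined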

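B. The proxy must match the protocol the website uses. By the protocol of the proxied request, proxies can be divided into:

HTTP proxy: the target URL uses http
HTTPS proxy: the target URL uses https
SOCKS tunnel proxy (for example a SOCKS5 proxy):
    A SOCKS proxy simply relays packets and does not care which application protocol is used (FTP, HTTP, HTTPS and so on).
    A SOCKS proxy usually adds less latency than an HTTP or HTTPS proxy.
    A SOCKS proxy can forward both http and https requests.

5) Proxy IPs are used so that the server does not see all requests as coming from the same client, and so that frequent requests to one domain do not get our IP banned. Next, how the requests module uses proxy IPs.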

Usage example:
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.100:4444",
}
r = requests.get('http://m.ctrip.com', proxies=proxies)


# If the proxy requires a username and password, use this form:
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
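Which entry of the proxies dict applies depends on the scheme of the target URL. As a rough illustration, the select_proxy helper in requests.utils can be called directly to see which proxy would be picked; treating it as a stable public API is an assumption here, and no request is sent.

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.100:4444",
}

# The key is matched against the scheme of the target URL.
print(select_proxy("http://m.ctrip.com", proxies))
print(select_proxy("https://m.ctrip.com", proxies))

# No key matches this scheme, so no proxy is selected.
print(select_proxy("ftp://m.ctrip.com", proxies))
```

A dict that only contains an "http" key therefore leaves https URLs unproxied, which is easy to overlook when a site redirects from http to https.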

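14. Skipping CA certificate verification with the verify parameter
# When browsing the web you sometimes see a certificate warning
# Run the code below to see what happens when code requests such an insecure link
# It raises an exception (requests.exceptions.SSLError) whose message mentions the certificate problem, such as ssl.CertificateError ...
import requests
url = "https://sam.huat.edu.cn:8443/selfservice/"
response = requests.get(url)
# Workaround:
# To let the request go through, pass verify=False; requests then sends the request without verifying the server's CA certificate. This removes protection against man-in-the-middle attacks, so use it only with hosts you trust.
import requests
url = "https://sam.huat.edu.cn:8443/selfservice/"
response = requests.get(url, verify=False)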


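15. Redirection and history
# By default Requests follows redirects automatically for every method except HEAD. The history attribute of the response object lets you trace them.
# Response.history is a list of the Response objects that were created to complete the request, sorted from the oldest to the most recent response.
# For example, GitHub redirects all HTTP requests to HTTPS: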
import requests
headers = {'User-Agent':"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36"}
r = requests.get('http://github.com/',headers=headers)
print(r.url)#https://github.com/
print(r.status_code)#200
print(r.history)#[<Response [301]>]

# With GET, OPTIONS, POST, PUT, PATCH or DELETE you can disable redirect handling through the allow_redirects parameter:
r = requests.get('http://github.com', headers=headers, allow_redirects=False)
print(r.status_code)  # 301 (301 means the page has moved permanently)
print(r.history)  # []

# With HEAD you can enable redirect handling as well
r = requests.head('http://github.com', allow_redirects=True)
print(r.url)#'https://github.com/'
print(r.history) #[<Response [301]>]

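16. Errors and exceptions

16.1 Exceptions mentioned in the official documentation

In the event of a network problem (e.g. DNS failure, refused connection), Requests raises a ConnectionError exception.

If the HTTP request returned an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.

If a request times out, a Timeout exception is raised.

If a request exceeds the configured maximum number of redirects, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.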
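Because these exceptions share one base class and some inherit from two parents, the order of checks matters when telling them apart. The sketch below maps an exception to a short label, most specific first; describe_failure and its labels are hypothetical names chosen for illustration.

```python
from requests import exceptions as rex


def describe_failure(exc):
    """Map a Requests exception to a short label, most specific check first."""
    if isinstance(exc, rex.Timeout):           # ConnectTimeout and ReadTimeout
        return "timeout"
    if isinstance(exc, rex.ConnectionError):   # DNS failure, refused connection, ProxyError, SSLError
        return "connection"
    if isinstance(exc, rex.HTTPError):         # raised by raise_for_status()
        return "http"
    if isinstance(exc, rex.TooManyRedirects):
        return "redirects"
    if isinstance(exc, rex.RequestException):  # anything else raised by Requests
        return "other"
    raise TypeError("not a Requests exception")


for exc in (rex.ConnectTimeout(), rex.ReadTimeout(), rex.ProxyError(),
            rex.HTTPError(), rex.TooManyRedirects(), rex.InvalidURL()):
    print(type(exc).__name__, "->", describe_failure(exc))
```

Timeout is checked before ConnectionError because ConnectTimeout inherits from both; with the checks reversed, a connect timeout would be reported as a plain connection error. In real code the same ordering applies to except clauses around requests.get(url, timeout=...).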

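16.2 Hierarchy of the exceptions built into requests

IOError
 +-- RequestException  # base class: an ambiguous exception occurred while handling the request
      +-- HTTPError        # an HTTP error occurred
      +-- ConnectionError  # a connection error occurred
      |    +-- ProxyError  # a proxy error occurred
      |    +-- SSLError  # an SSL error occurred
      |    +-- ConnectTimeout(+-- Timeout)  # (multiple inheritance, likewise below) the request timed out while trying to connect to the remote server; requests that produce this error are safe to retry
      +-- Timeout  # the request timed out
      |    +-- ReadTimeout  # the server did not send any data in the allotted time
      +-- URLRequired  # a valid URL is required to make a request
      +-- TooManyRedirects  # too many redirects
      +-- MissingSchema(+-- ValueError) # the URL scheme (e.g. http or https) is missing
      +-- InvalidSchema(+-- ValueError) # the URL scheme is invalid; see defaults.py for valid schemes
      +-- InvalidURL(+-- ValueError)  # the URL is invalid
      |    +-- InvalidProxyURL  # the proxy URL is invalid
      +-- InvalidHeader(+-- ValueError)  # the header value is invalid
      +-- ChunkedEncodingError  # the server declared chunked encoding but sent an invalid chunk
      +-- ContentDecodingError(+-- BaseHTTPError)  # failed to decode the response content
      +-- StreamConsumedError(+-- TypeError)  # the content for this response was already consumed
      +-- RetryError  # custom retry logic failed
      +-- UnrewindableBodyError  # Requests encountered an error when trying to rewind a body

Warning
 +-- RequestsWarning  # base warning for Requests
      +-- FileModeWarning(+-- DeprecationWarning)  # a file was opened in text mode, but Requests determined its binary length
      +-- RequestsDependencyWarning  # an imported dependency does not match the expected version range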
 
 
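Common exceptions seen in practice:

(1) Connection timeout
The connection to the server could not be established within the given time; requests.exceptions.ConnectTimeout is raised

requests.get('http://github.com', timeout=0.001)
# Raises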
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f1b16da75f8>, 'Connection to github.com timed out. (connect timeout=0.001)'))

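(2) Connect and read timeouts
When the connect and read timeouts are given separately and the server does not respond within the read timeout, requests.exceptions.ReadTimeout is raised
- timeout=([connect timeout], [read timeout])
- connect: the time allowed for the client to establish a connection to the server
- read: the time the client waits before the server sends the first byte

requests.get('http://github.com', timeout=(6.05, 0.01))
# Raises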
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='github.com', port=80): Read timed out. (read timeout=0.01)

(3) Unknown server
requests.exceptions.ConnectionError is raised

requests.get('http://github.comasf', timeout=(6.05, 27.05))
# Raises
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.comasf', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f75826665f8>: Failed to establish a new connection: [Errno -2] Name or service not known',))

(4) Cannot connect to the proxy
The proxy server refuses the connection, or the port refuses connections or is not open; requests.exceptions.ProxyError is raised
requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "192.168.10.1:800"})
# Raises
requests.exceptions.ProxyError: HTTPConnectionPool(host='192.168.10.1', port=800): Max retries exceeded with url: http://github.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fce3438c6d8>: Failed to establish a new connection: [Errno 111] Connection refused',)))

(5) Timeout connecting to the proxy
The proxy server does not respond; requests.exceptions.ConnectTimeout is raised

requests.get('http://github.com', timeout=(6.05, 27.05), proxies={"http": "10.200.123.123:800"})
# Raises
requests.exceptions.ConnectTimeout: HTTPConnectionPool(host='10.200.123.123', port=800): Max retries exceeded with url: http://github.com/ (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7fa8896cc6d8>, 'Connection to 10.200.123.123 timed out. (connect timeout=6.05)'))

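(6) Proxy read timeout
The connection to the proxy succeeded and the proxy forwarded the request to the target site, but the proxy timed out reading from the target site
Even if the proxy itself is fast, a timeout at the target site is still reported against the proxy server
Assuming the proxy is usable, timeout covers connecting to and reading from the proxy server only; whether the proxy in turn connected to and read from the target site successfully is not tracked separately

requests.get('http://github.com', timeout=(2, 0.01), proxies={"http": "192.168.10.1:800"})

# Raises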
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='192.168.10.1:800', port=1080): Read timed out. (read timeout=0.5)

(7) Network environment problem
Possibly caused by the network being down; requests.exceptions.ConnectionError is raised

requests.get('http://github.com', timeout=(6.05, 27.05))
# Raises
requests.exceptions.ConnectionError: HTTPConnectionPool(host='github.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc8c17675f8>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))