1. GET requests with the requests module
Goal: crawl the Sogou homepage and save the page data.
import requests
url = "https://www.sogou.com/"
response = requests.get(url=url)
# Get the page data as a string
page = response.text
with open("./sogou.html", "w", encoding="utf-8") as fp:
    fp.write(page)
Some other commonly used response attributes:
# Get the page data as bytes
print(response.content)
# Get the response status code
print(response.status_code)
# Get the response headers
print(response.headers)
# Get the final request URL
print(response.url)
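The difference between response.content and response.text is simply bytes versus decoded text: .content is the raw body, .text is that body decoded with the encoding requests detects. A minimal offline sketch (the sample string is only an illustration, not real response data):

```python
# Stands in for response.content: the raw UTF-8 bytes of the body
raw = "搜狗搜索".encode("utf-8")
# Stands in for response.text: the bytes decoded back into a string
text = raw.decode("utf-8")
print(type(raw).__name__, type(text).__name__)  # bytes str
```

This is why .content is the right choice for binary downloads (images, archives), while .text suits HTML and JSON.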
For a GET request with parameters, you can either build the query string into the URL yourself or pass a dictionary via the params argument, as in the following code:
import requests
url = "https://www.sogou.com/web"
# Pack the parameters into a dictionary
params = {
    "query": "周杰伦",
    "ie": "utf-8",
}
response = requests.get(url=url, params=params)
print(response.status_code)
print(response.content)
The headers argument works in the same way.
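Under the hood, requests serializes the params dictionary into a percent-encoded query string and appends it to the URL after a "?". The standard library's urlencode produces the same encoding, so you can preview the result offline:

```python
from urllib.parse import urlencode

params = {"query": "周杰伦", "ie": "utf-8"}
# The same string requests would append to the URL
print(urlencode(params))  # query=%E5%91%A8%E6%9D%B0%E4%BC%A6&ie=utf-8
```

Note how the Chinese keyword is automatically UTF-8 percent-encoded; that is why passing a dict is safer than concatenating the URL by hand.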
2. POST requests
Goal: log in to Douban Movie and fetch the page returned after a successful login (Douban's login URL has since changed, so this is only an example).
import requests
# This URL is no longer valid; it is only an example
url = "https://accounts.douban.com/login"
# Pack the POST parameters
data = {
    "source": "movie",
    "redir": "https://movie.douban.com/",
    "form_email": "1111",  # your account
    "form_password": "11111",  # and password
    "login": "登录",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# Send the POST request
response = requests.post(url=url, data=data, headers=headers)
print(response.status_code)
print(response.text)
with open("./douban.html", "w", encoding="utf-8") as fp:
    fp.write(response.text)
3. Ajax GET requests
Goal: fetch the details of romance films from the Douban Movie ranking chart.
import requests
url = "https://movie.douban.com/j/chart/top_list?"
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "120",
    "limit": "20",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.get(url=url, params=params, headers=headers)
print(response.text)
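This endpoint returns JSON rather than HTML, so response.json() (or json.loads on response.text) is the natural way to consume it. A sketch using a hand-written sample whose field names are only assumed to resemble the real payload:

```python
import json

# Hand-written sample; the field names are assumptions, not the real schema
sample = '[{"rank": 1, "title": "sample title", "score": "9.6"}]'
movies = json.loads(sample)  # response.json() performs this step for you
for movie in movies:
    print(movie["rank"], movie["title"], movie["score"])
```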
4. Ajax POST requests
Goal: crawl the locations of KFC restaurants in a given city.
import requests
url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "北京",
    "pageIndex": "1",
    "pageSize": "10",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
response = requests.post(url=url, params=params, headers=headers)
print(response.text)
5. A combined example
Goal: crawl multiple pages of Sogou Zhihu search results for a given keyword.
import requests
import os
# Create a folder for the output
if not os.path.exists("./pages"):
    os.mkdir("./pages")
url = "https://zhihu.sogou.com/zhihu?"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# The keyword to search for
word = input("please enter your word:")
# The page range to crawl
start_num = int(input("enter start page number:"))
end_num = int(input("enter end page number:"))
for page in range(start_num, end_num + 1):
    param = {
        "query": word,
        "page": page,
        "ie": "utf-8",
    }
    response = requests.get(url=url, params=param, headers=headers)
    filename = word + str(page) + ".html"
    # Persist the data to disk
    with open("pages/%s" % filename, "w", encoding="utf-8") as fp:
        fp.write(response.text)
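The loop's bookkeeping, one params dict and one output filename per page, can be checked offline before firing any real requests (the keyword and range below are made up for illustration):

```python
word = "python"  # example keyword
start_num, end_num = 1, 3  # example page range
jobs = []
for page in range(start_num, end_num + 1):
    param = {"query": word, "page": page, "ie": "utf-8"}
    # One (params, filename) pair per page to crawl
    jobs.append((param, "pages/%s%d.html" % (word, page)))
print([name for _, name in jobs])
# → ['pages/python1.html', 'pages/python2.html', 'pages/python3.html']
```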
6. Working with cookies
Workflow: 1. log in and obtain the cookie; 2. carry that cookie with the request when fetching the personal homepage.
Note: use a session object to send the requests; it stores the cookies automatically.
import requests
session = requests.Session()
# Send the login request
login_url = "https://accounts.douban.com/passport/login"
data = {
    "source": "None",
    "redir": "https://movie.douban.com/people/123/",
    "form_email": "123",  # your account
    "form_password": "123",  # and password
    "login": "登录",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
}
# After this call the session holds the login cookies
session.post(url=login_url, data=data, headers=headers)
# Visit the personal homepage; the stored cookies are sent automatically
url = "https://movie.douban.com/people/123/"
response = session.get(url=url, headers=headers)
page = response.text
with open("./doubanlogin.html", "w", encoding="utf-8") as fp:
    fp.write(page)
Note: Douban has since changed its API, so the parameters above no longer work; the point is to understand the flow. You could also construct the cookie yourself to simulate a login, but that is rather tedious.
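If you do go the manual route just mentioned, the cookie string copied from a logged-in browser session is sent as an ordinary request header. The cookie name and value below are made-up placeholders, not real Douban cookies:

```python
headers = {
    "User-Agent": "Mozilla/5.0",  # abbreviated here
    # Copied from the browser's developer tools after logging in;
    # the name and value below are made-up placeholders
    "Cookie": "sessionid=placeholder-value; other=value",
}
# requests.get(url, headers=headers) would then send the cookie along
print(headers["Cookie"])
```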
7. Using proxies
import requests
proxies = {
    "http": "http://10.10.1.10:3128",
    "https": "http://10.10.1.10:1080",
}
url = "https://www.taobao.com"
requests.get(url=url, proxies=proxies)
Of course this proxy does not actually work; replace it with a valid proxy of your own. requests also supports SOCKS proxies, which require the PySocks library.
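For SOCKS, install the extra dependency (pip install requests[socks]) and use a socks5:// scheme in the proxy mapping; the host and port below are placeholders:

```python
# Requires: pip install requests[socks]
proxies = {
    "http": "socks5://127.0.0.1:1080",   # placeholder host:port
    "https": "socks5://127.0.0.1:1080",
}
# requests.get("https://www.taobao.com", proxies=proxies)
print(proxies["https"])
```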
The requests module offers far more than this; consult its documentation for the details you need.