一、Web Scraping Basics
1. What Is a Web Crawler?
A web crawler (spider) is a program that simulates a browser visiting websites and harvests data from the internet.
2. Is Web Scraping Legal?
Is scraping legal or illegal?
- It is not prohibited by law as such
- It does carry legal risk
- Distinguish benign crawlers from malicious crawlers
The risks of scraping fall into two broad areas:
- The crawler interferes with the normal operation of the target website
- The crawler harvests categories of data or information that are protected by law
How do you stay out of legal trouble while writing and running crawlers?
- Keep optimizing your program so it does not disrupt the target site's normal operation
- When using or redistributing scraped data, review the content; if it touches user privacy, trade secrets, or other sensitive material, stop scraping or distributing it immediately
3. A Closer Look at Crawlers
Crawlers classified by usage scenario:
- General-purpose crawlers: a core component of search-engine indexing systems; they fetch entire pages.
- Focused crawlers: built on top of general-purpose crawling; they extract specific parts of a page.
- Incremental crawlers: monitor a site for updates and fetch only newly published data.
The spear and shield of scraping (crawling vs. anti-crawling):
The robots.txt protocol: a gentlemen's agreement that declares which parts of a site may be crawled and which may not.
Example: www.taobao.com/robots.txt
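You can also check a site's robots.txt programmatically before crawling. A minimal sketch using the standard library's urllib.robotparser (the URL and user agents below are just examples):

```python
from urllib.robotparser import RobotFileParser

# Parse the site's robots.txt (example URL)
rp = RobotFileParser()
rp.set_url('https://www.taobao.com/robots.txt')
rp.read()

# Ask whether a given user agent may fetch a given path
print(rp.can_fetch('*', 'https://www.taobao.com/market/'))      # likely False here
print(rp.can_fetch('Baiduspider', 'https://www.taobao.com/'))
```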
4. The HTTP & HTTPS Protocols
(1) HTTP
Concept: a format for data exchange between a server and a client.
Common request headers:
- User-Agent: identifies the client sending the request
- Connection: whether to close or keep the connection alive after the request completes
Common response headers:
- Content-Type: the type of data the server returns to the client
(2) HTTPS
Concept: the secure hypertext transfer protocol, i.e., HTTP carried over an encrypted channel.
(3) Encryption schemes
HTTPS combines symmetric-key encryption (for the payload), asymmetric-key encryption (for the key exchange), and certificates issued by a certificate authority (to authenticate the server's public key).
二、The requests Module
1. First Blood with requests
The requests module: a powerful third-party Python library for making network requests; simple, convenient, and efficient.
Purpose: simulate a browser sending requests.
How to use it (the requests coding workflow): specify the URL → send the request → get the response data → persist the data.
Environment setup: pip install requests
Hands-on code:
```python
import requests

if __name__ == '__main__':
    # Step 1: specify the URL
    url = 'https://www.sogou.com/'
    # Step 2: send the request; get() returns a response object
    response = requests.get(url=url)
    # Step 3: get the response data (response.text is the page source as a string)
    page_text = response.text
    print(page_text)
    # Step 4: persist the data
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('Done scraping!')
```
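Beyond response.text, a few other attributes of the requests response object come up constantly in the examples below; a quick illustrative sketch:

```python
import requests

response = requests.get('https://www.sogou.com/')
print(response.status_code)              # HTTP status code, e.g. 200
print(response.encoding)                 # encoding guessed from the headers
print(response.headers['Content-Type'])  # response headers behave like a dict
binary = response.content                # raw bytes, used later for images/videos
text = response.text                     # str, decoded with response.encoding
```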
2. Consolidating requests: Case Studies
(1) A simple web-page collector
```python
'''UA detection: a portal site's server inspects the User-Agent of each request.
If the User-Agent identifies a normal browser, the request is treated as legitimate;
if it is not browser-based, the request is considered abnormal (a crawler)
and the server may well refuse it. The countermeasure: UA spoofing.'''
import requests

if __name__ == '__main__':
    # UA spoofing: pretend to be Chrome
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.sogou.com/web'
    kw = input('Enter a word:')
    # Query-string parameters are passed as a dict via params
    param = {
        'query': kw
    }
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    fileName = kw + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(fileName, 'saved successfully!!')
```
(2) Cracking Baidu Translate
- A POST request (carrying form parameters)
- The response is JSON data
```python
import requests
import json

if __name__ == '__main__':
    post_url = 'https://fanyi.baidu.com/sug'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    word = input('Enter a word:\n')
    # POST form data goes in the data parameter
    data = {
        'kw': word
    }
    response = requests.post(url=post_url, data=data, headers=headers)
    # .json() deserializes the JSON response into a Python object
    dict_obj = response.json()
    print(dict_obj)
    fileName = word + '.json'
    with open(fileName, 'w', encoding='utf-8') as fp:
        # ensure_ascii=False keeps non-ASCII characters readable
        json.dump(dict_obj, fp=fp, ensure_ascii=False)
    print('Over!')
```
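One caveat worth remembering: response.json() only works when the server actually returns JSON; on an HTML error page it raises an exception. A small defensive sketch (the URL is just a placeholder):

```python
import requests

response = requests.post('https://fanyi.baidu.com/sug', data={'kw': 'dog'})
try:
    payload = response.json()
except ValueError:
    # Not JSON - probably an error page or an anti-crawl response
    print('Unexpected content type:', response.headers.get('Content-Type'))
    payload = None
```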
(3) Douban Movies
```python
import requests
import json

if __name__ == '__main__':
    # This Ajax endpoint returns the ranking data the page loads dynamically
    url = 'https://movie.douban.com/j/chart/top_list'
    param = {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': '0',   # offset of the first movie to fetch
        'limit': '20'   # number of movies per request
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    response = requests.get(url=url, params=param, headers=headers)
    list_data = response.json()
    with open('./douban.json', 'w', encoding='utf-8') as fp:
        json.dump(list_data, fp=fp, ensure_ascii=False)
    print('Over!')
```
3. Homework: KFC Restaurant Lookup
```python
import requests
import json

if __name__ == '__main__':
    # The keyword search is a POST request to this Ajax endpoint
    post_url = 'https://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'
    keyword = input('Enter the city to search:')
    data = {
        'cname': '',
        'pid': '',
        'keyword': keyword,
        'pageindex': '1',
        'pageSize': '10'
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    response = requests.post(url=post_url, data=data, headers=headers)
    page = response.json()
    # Avoid shadowing the built-in dict
    for store in page['Table1']:
        StoreName = store['storeName']
        address = store['addressDetail']
        print('StoreName:' + StoreName, 'address:' + address + '\n')
```
4. Comprehensive Exercise: NMPA (National Medical Products Administration)
```python
import requests
import json

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    id_list = []        # IDs of all companies
    all_data_list = []  # detail data of all companies
    # Step 1: collect the company IDs from the paginated list endpoint
    url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsList'
    for page in range(1, 11):
        page = str(page)
        data = {
            'on': 'true',
            'page': page,
            'pageSize': '15',
            'productName': '',
            'conditionType': '1',
            'applyname': '',
            'applysn': '',
        }
        json_ids = requests.post(url=url, headers=headers, data=data).json()
        for dic in json_ids['list']:
            id_list.append(dic['ID'])
    # Step 2: fetch the detail data for each ID
    post_url = 'http://scxk.nmpa.gov.cn:81/xk/itownet/portalAction.do?method=getXkzsById'
    for _id in id_list:
        data = {
            'id': _id
        }
        json_detail = requests.post(url=post_url, data=data, headers=headers).json()
        all_data_list.append(json_detail)
        all_data_list.append('---------------------------------------------------------')
    with open('./allData.json', 'w', encoding='utf-8') as fp:
        json.dump(all_data_list, fp=fp, ensure_ascii=False, indent=4)
    print('Over!')
```
三、Data Parsing
1. Data Parsing Overview
- Focused crawling: extracting specified content from a page.
- Coding workflow: 1. specify URL → 2. send request → 3. get response data → 4. parse the data → 5. persist the result
- Categories of data parsing:
  - regular expressions
  - bs4 parsing
  - xpath parsing (the main focus)
- Parsing principle: the local text to extract is stored between tags or in their attributes, so
  - locate the specified tags, then
  - extract (parse) the data stored in those tags or their attributes; a minimal illustration follows.
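A minimal sketch of the locate-then-extract idea on an inline HTML snippet, using only the standard library (the snippet and pattern are made up for illustration):

```python
import re

html = '<div class="pic"><img src="/imgs/cat.jpg" alt="a cat"></div>'

# Locate the target tag, then extract the value stored in its src attribute
ex = '<div class="pic">.*?<img src="(.*?)" alt=.*?</div>'
print(re.findall(ex, html, re.S))   # ['/imgs/cat.jpg']
```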
2. Scraping Image Data: Regular Expressions
| Function | Description |
| --- | --- |
| re.search() | Scans a string for the first position matching the regex; returns a match object |
| re.match() | Matches the regex from the beginning of the string; returns a match object |
| re.findall() | Searches the string; returns all matching substrings as a list |
| re.split() | Splits a string wherever the regex matches; returns a list |
| re.finditer() | Searches the string; returns an iterator whose elements are match objects |
| re.sub() | Replaces every regex match in a string; returns the resulting string |

| Modifier | Description |
| --- | --- |
| re.I | Case-insensitive matching |
| re.L | Locale-aware matching |
| re.M | Multi-line matching; affects ^ and $ |
| re.S | Makes . match any character, including newlines |
| re.U | Interprets characters per the Unicode character set; affects \w, \W, \b, \B |
| re.X | Allows a more flexible (verbose) format so regexes are easier to read |
A regex cheat sheet:

```
Single characters:
    .  : any character except newline
    [] : e.g. [aoe], [a-w] - matches any one character in the set
    \d : digit, same as [0-9]
    \D : non-digit
    \w : digit, letter, underscore, or CJK character
    \W : the opposite of \w
    \s : any whitespace (space, tab, form feed, ...), same as [ \f\n\r\t\v]
    \S : non-whitespace
Quantifiers:
    *     : any number of times (>= 0)
    +     : at least once (>= 1)
    ?     : optional (0 or 1 times)
    {m}   : exactly m times
    {m,}  : at least m times, e.g. hello{3,}
    {m,n} : between m and n times
Anchors:
    $ : match at the end
    ^ : match at the start
Groups:
    (ab)
Greedy mode:            .*
Non-greedy (lazy) mode: .*?
re.I : ignore case
re.M : multi-line matching
re.S : single-line (dot-all) matching
re.sub(pattern, replacement, string)
```
```python
'''Regex practice'''
import re

key = "javapythonc++php"
re.findall('python', key)[0]            # 'python'

key = "<html><h1><hello world><h1></html>"
re.findall('<h1>(.*)<h1>', key)[0]      # '<hello world>'

string = '我喜欢身高为170的女孩'
re.findall(r'\d+', string)              # ['170']

# Extract http:// and https://
key = 'http://www.baidu.com and https://boob.com'
re.findall('https?://', key)            # ['http://', 'https://']

# Extract <hello>, ignoring the case of the html tags
key = 'lalala<hTml><hello></HtMl>hahah'
re.findall('<[Hh][Tt][mM][lL]>(.*)</[Hh][Tt][mM][lL]>', key)

# Extract 'hit.'
key = 'bobo@hit.edu.com'
re.findall(r'h.*?\.', key)              # non-greedy, stops at the first dot

# Match sas and saas
key = 'sasa and sas and saaas'
re.findall('sa{1,2}s', key)
```
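The table above describes six re functions, but the practice uses only findall; a quick sketch of the others on a toy string:

```python
import re

text = 'dog cat dog'

print(re.search('cat', text).span())    # (4, 7) - first match anywhere
print(re.match('dog', text).group())    # 'dog' - anchored at the start
print(re.split(r'\s+', text))           # ['dog', 'cat', 'dog']
for m in re.finditer('dog', text):      # iterate over match objects
    print(m.start(), m.group())
print(re.sub('dog', 'fox', text))       # 'fox cat fox'
```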
```python
import requests

if __name__ == '__main__':
    url = 'https://pic.qiushibaike.com/system/pictures/12409/124098453/medium/YNPHJQC101MS31E1.jpg'
    # .content returns the response body as raw bytes - use it for binary data
    img_data = requests.get(url=url).content
    with open('./qiutu.jpg', 'wb') as fp:
        fp.write(img_data)
```
3. A Regex Parsing Case Study
```python
'''Target HTML structure:
<div class="thumb">
<a href="/article/124098472" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12409/124098472/medium/HSN2WWN0TP1VUPNG.jpg" alt="糗事#124098472" class="illustration" width="100%" height="auto">
</a>
</div>'''
import re
import os
import requests

if __name__ == '__main__':
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    url = 'https://www.qiushibaike.com/imgrank/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    page_text = requests.get(url=url, headers=headers).text
    # Capture the src attribute of every thumbnail image
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_src_list = re.findall(ex, page_text, re.S)
    print(img_src_list)
    for src in img_src_list:
        src = 'https:' + src   # the src is protocol-relative
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        imgPath = './qiutuLibs/' + img_name
        with open(imgPath, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!')
```
```python
# Paginated version: crawl pages 1-10
import re
import os
import requests

if __name__ == '__main__':
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    # Generic URL template; %d is replaced with the page number
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    for pageNum in range(1, 11):
        new_url = url % pageNum
        page_text = requests.get(url=new_url, headers=headers).text
        ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        print(img_src_list)
        for src in img_src_list:
            src = 'https:' + src
            img_data = requests.get(url=src, headers=headers).content
            img_name = src.split('/')[-1]
            imgPath = './qiutuLibs/' + img_name
            with open(imgPath, 'wb') as fp:
                fp.write(img_data)
            print(img_name, 'downloaded!')
```
4. bs4 Parsing Overview
5. bs4 Parsing in Detail
- How to instantiate a BeautifulSoup object:
  - Import the package: from bs4 import BeautifulSoup
  - Instantiate the object by either:
    - (1) loading the data of a local html file into it, or
    - (2) loading page source fetched from the internet into it.
- Methods and attributes for data parsing:
  - soup.tagName: returns the first occurrence of tagName in the document
  - soup.find(tagName): equivalent to soup.tagName; can also locate by attribute, e.g. soup.find('div', class_='song')
  - soup.find_all(): returns all tags meeting the criteria
  - soup.select('selector'): takes an id, class, tag, or hierarchical selector and returns a list
- Getting the text between tags: soup.a.text / soup.a.string / soup.a.get_text()
  - text / get_text(): return all text inside a tag, including nested tags
  - string: returns only the tag's direct text content
- Getting an attribute value: soup.a['href']
The local test.html used by the demos below:
```html
<html lang="en">
<head>
<meta charset="UTF-8" />
<title>测试bs4</title>
</head>
<body>
<div>
<p>百里守约</p>
</div>
<div class="song">
<p>李清照</p>
<p>王安石</p>
<p>苏轼</p>
<p>柳宗元</p>
<a title="赵匡胤" target="_self">
<span>this is span</span>
宋朝是最强大的王朝,不是军队的强大,而是经济很强大,国民都很有钱</a>
<a href="" class="du">总为浮云能蔽日,长安不见使人愁</a>
<img src="http://www.baidu.com/meinv.jpg" alt="" />
</div>
<div class="tang">
<ul>
<li><a title="qing">清明时节雨纷纷,路上行人欲断魂,借问酒家何处有,牧童遥指杏花村</a></li>
<li><a title="qin">秦时明月汉时关,万里长征人未还,但使龙城飞将在,不教胡马度阴山</a></li>
<li><a alt="qi">岐王宅里寻常见,崔九堂前几度闻,正是江南好风景,落花时节又逢君</a></li>
<li><a class="du">杜甫</a></li>
<li><a class="du">杜牧</a></li>
<li><b>杜小月</b></li>
<li><i>度蜜月</i></li>
<li><a id="feng">凤凰台上凤凰游,凤去台空江自流,吴宫花草埋幽径,晋代衣冠成古丘</a></li>
</ul>
</div>
</body>
</html>
```
```python
from bs4 import BeautifulSoup

if __name__ == '__main__':
    # Load the local test.html into a BeautifulSoup object
    fp = open('./test.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    print(soup.a)                                  # first <a> in the document
    print(soup.div)                                # first <div>
    print(soup.find('div'))                        # same as soup.div
    print(soup.find('div', class_='song'))         # locate by attribute
    print(soup.find_all('a'))                      # every <a>
    print(soup.select('.tang'))                    # class selector; returns a list
    print(soup.select('.tang > ul > li > a')[0])   # > is one level down
    print(soup.select('.tang > ul a')[0])          # a space crosses multiple levels
    print(soup.select('.tang > ul a')[0].text)
    print(soup.select('.tang > ul a')[0].get_text())
    print(soup.select('.tang > ul a')[0].string)
    print(soup.select('.tang > ul a')[0]['href'])  # attribute value
```
6. A Hands-On bs4 Case Study
```python
# Scrape the chapter titles and contents of Romance of the Three Kingdoms
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'   # avoid mojibake in the chapter titles
    page_text = response.text
    soup = BeautifulSoup(page_text, 'lxml')
    # Each <li> in the table of contents holds one chapter link
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        detail_url = 'http://www.shicimingju.com' + li.a['href']
        detail_response = requests.get(url=detail_url, headers=headers)
        detail_response.encoding = 'utf-8'
        detail_page_text = detail_response.text
        detail_soup = BeautifulSoup(detail_page_text, 'lxml')
        div_tag = detail_soup.find('div', class_='chapter_content')
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, 'scraped!')
    fp.close()
```
7. xpath Parsing Basics
- xpath parsing: the most commonly used, convenient, and efficient parsing technique, and the most portable one.
- How xpath parsing works:
  - (1) instantiate an etree object and load the page source to be parsed into it;
  - (2) call the etree object's xpath method with an xpath expression to locate tags and capture content.
- Environment setup: pip install lxml (the lxml parser)
- How to instantiate an etree object: from lxml import etree
  - (1) load the source of a local html file: etree.parse(filePath)
  - (2) load page source fetched from the internet: etree.HTML(page_text)
- xpath('xpath expression'):
  - / locates from the root node, or denotes one level of hierarchy;
  - // denotes any number of levels, or locates from anywhere in the document;
  - attribute matching: tag[@attrName="attrValue"]
  - index matching: tag[@attrName="attrValue"]/p[3] - note that indexing starts at 1
  - getting text: /text() returns a tag's direct text content; //text() returns all text content, including nested tags
  - getting attributes: /@attrName, e.g. img/@src
A demo against the test.html from above, plus a few extra expressions, follows.
```python
from lxml import etree

if __name__ == "__main__":
    # Parse the local test.html into an etree object
    tree = etree.parse('test.html')
    # The src attribute of the <img> inside the div whose class is "song"
    r = tree.xpath('//div[@class="song"]/img/@src')
    print(r)
```
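A few more expressions against the same test.html, as a sketch of the operators listed above (the exact results depend on the file contents):

```python
from lxml import etree

tree = etree.parse('test.html')
print(tree.xpath('/html/body/div'))              # / = one level at a time
print(tree.xpath('//div'))                       # // = any depth
print(tree.xpath('//div[@class="song"]/p[3]'))   # index locating, 1-based
print(tree.xpath('//div[@class="tang"]//li[5]/a/text()'))  # direct text
print(tree.xpath('//li[7]//text()'))             # all nested text
```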
8. xpath in Action: 58.com Second-Hand Housing
```python
import requests
from lxml import etree

if __name__ == '__main__':
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://bj.58.com/ershoufang/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Each listing is a <div> under <section class="list">
    div_list = tree.xpath('//section[@class="list"]/div')
    fp = open('58.txt', 'w', encoding='utf-8')
    for div in div_list:
        # ./ continues the xpath from the current div (local parsing)
        title = div.xpath('./a/div[2]//h3/text()')[0]
        fp.write(title + '\n\n')
    fp.close()
    print('---------------Over!------------------')
```
9. xpath Parsing Case Studies
(1) Downloading 4K wallpapers
```python
import requests
from lxml import etree
import os

if __name__ == "__main__":
    url = 'http://pic.netbian.com/4kmeinv/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
    }
    response = requests.get(url=url, headers=headers)
    page_text = response.text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    if not os.path.exists('./picLibs'):
        os.mkdir('./picLibs')
    for li in li_list:
        img_src = 'http://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # The site serves gbk; re-encode the mis-decoded name to fix mojibake
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = 'picLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!!!')
    print('------------------------OVER!---------------------------------')
```
(2) Scraping the names of all Chinese cities
```python
import requests
from lxml import etree

if __name__ == '__main__':
    '''First version - parse hot cities and the full city list separately:
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Data parsing
    hot_li_list = tree.xpath('//div[@class="bottom"]/ul/li')
    all_city_names = []
    # Parse the hot city names
    for li in hot_li_list:
        hot_city_name = li.xpath('./a/text()')[0]
        all_city_names.append(hot_city_name)
    # Parse all city names
    city_names_list = tree.xpath('.//div[@class="bottom"]/ul/div[2]/li')
    for li in city_names_list:
        city_name = li.xpath('./a/text()')[0]
        all_city_names.append(city_name)
    print(all_city_names, len(all_city_names))'''
    # Second version - one xpath with | (union) covering both groups of <a> tags
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://www.aqistudy.cn/historydata/'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    a_list = tree.xpath('//div[@class="bottom"]/ul/li/a | //div[@class="bottom"]/ul/div[2]/li/a')
    all_city_names = []
    for a in a_list:
        a_name = a.xpath('./text()')[0]
        all_city_names.append(a_name)
    print(all_city_names, len(all_city_names))
```
10. xpath Homework: Scraping Free Resume Templates from 站长素材 (sc.chinaz.com)
```python
import os
import requests
from lxml import etree

if __name__ == '__main__':
    if not os.path.exists('./jianli'):
        os.mkdir('./jianli')
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
    }
    url = 'https://sc.chinaz.com/jianli/free_%d.html'
    page = int(input('How many pages do you want to scrape:'))
    for pageNum in range(1, page + 1):   # include the last page
        if pageNum == 1:
            # Page 1 has its own URL format
            new_url = 'https://sc.chinaz.com/jianli/free.html'
        else:
            new_url = url % pageNum
        page_text = requests.get(url=new_url, headers=headers).text
        tree = etree.HTML(page_text)
        url_div_list = tree.xpath('//*[@id="container"]/div')
        for div in url_div_list:
            detail_url = 'https:' + div.xpath('./a/@href')[0]
            detail_page_text = requests.get(url=detail_url, headers=headers).text
            detail_tree = etree.HTML(detail_page_text)
            # Fix mojibake in the template name (the page is served as utf-8)
            name = detail_tree.xpath('//h1/text()')[0].encode('iso-8859-1').decode('utf-8')
            download_url = detail_tree.xpath('//*[@id="down"]/div[2]/ul/li[1]/a/@href')[0]
            file_path = 'jianli/' + name + '.rar'
            download_content = requests.get(url=download_url, headers=headers).content
            with open(file_path, 'wb') as fp:
                fp.write(download_content)
            print(name, 'downloaded')
    print('-------------------------------OVER!---------------------------------------')
```
四、Captchas
1. Introduction to Captcha Recognition
The love-hate relationship between captchas and crawlers:
- Anti-crawl mechanism: the captcha. Recognizing the data in a captcha image is required to simulate a login.
Approach to recognizing captchas: use a third-party captcha-solving platform instead of reading them manually.
2. Workflow of a Captcha-Solving Platform
(Note: the 云打码 platform went offline while the author was studying, so 超级鹰 (Chaojiying) is used instead; similar platforms can be found with a quick search.)
- Register: as a user
- Log in: as a user
- Check your balance of credits (on first use, binding WeChat earns 1000 free credits; afterwards a small top-up of 1 yuan is enough)
- Create a software ID (bottom-left of the user center)
- Download the sample code (under the developer docs)
```python
from lxml import etree   # used by the gushiwen login script below
import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the image ID of a wrongly solved captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


def tranformImgCode(imgPath, imgType):
    # Fill in your own account, password, and software ID
    chaojiying = Chaojiying_Client('account here', 'password here', 'software ID here')
    im = open(imgPath, 'rb').read()
    return chaojiying.PostPic(im, imgType)['pic_str']


print(tranformImgCode('./a.jpg', 1902))
```
3. Recognizing the Captcha on gushiwen.cn
```python
# Continues from the Chaojiying_Client / tranformImgCode code above
session = requests.Session()   # a Session keeps cookies across requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url = 'https://so.gushiwen.cn/user/login.aspx?from=http://so.gushiwen.cn/user/collect.aspx'
page_text = session.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Download the captcha image with the same session so the cookie matches
img_src = 'https://so.gushiwen.org' + tree.xpath('//*[@id="imgCode"]/@src')[0]
img_data = session.get(img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(img_data)
code_text = tranformImgCode('./code.jpg', 1902)
print(code_text)
login_url = 'https://so.gushiwen.cn/user/login.aspx?from=http%3a%2f%2fso.gushiwen.cn%2fuser%2fcollect.aspx'
data = {
    # __VIEWSTATE and __VIEWSTATEGENERATOR are hidden form fields from the page
    '__VIEWSTATE': 'f1ECt6+6MPtdTZMJtYOYS/7ww2d/DPy9t8JQcIt1QuOneLTbNQuYqPcCjZNbDAbfb9vj3k6f0M7EKTf0YqElM1k1A5ELwyTvUzBii+9LDRBbIMmc/jb0DJPsYfI=',
    '__VIEWSTATEGENERATOR': 'C93BE1AE',
    'from': 'http://so.gushiwen.cn/user/collect.aspx',
    'email': 'account',
    'pwd': 'password',
    'code': code_text,
    'denglu': '登录',
}
page_text_login = session.post(url=login_url, headers=headers, data=data).text
with open('./gushiwen.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text_login)
```
五、Advanced requests
1. The Simulated-Login Workflow
Simulated login: scraping user-specific information that requires being logged in.
Task: simulate logging in to renren.com
- Clicking the login button fires a POST request
- The POST request carries the login information entered beforehand (username, password, captcha, ...)
- The captcha changes dynamically on every request
2. Simulated Login to renren.com
```python
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url = 'http://www.renren.com/SysHome.do'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
# Download the captcha image so it can be recognized
code_img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
code_img_data = requests.get(url=code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)
login_url = ' '   # the login URL captured from the browser's dev tools
data = {
    # the login form fields go here
}
response = requests.post(url=login_url, headers=headers, data=data)
print(response.status_code)
with open('./login.html', 'w', encoding='utf-8') as fp:
    fp.write(response.text)
```
```python
'''The video author's original source code'''
from CodeClass import YDMHttp   # the Yundama (云打码) sample client
import requests
from lxml import etree


def getCodeText(imgPath, codeType):
    username = 'bobo328410948'
    password = 'bobo328410948'
    appid = 6003
    appkey = '1f4b564483ae5c907a1d34f8e2f2776c'
    filename = imgPath
    codetype = codeType
    timeout = 20
    result = None
    if username == 'username':
        print('Set the parameters before testing')
    else:
        yundama = YDMHttp(username, password, appid, appkey)
        uid = yundama.login()
        print('uid: %s' % uid)
        balance = yundama.balance()
        print('balance: %s' % balance)
        cid, result = yundama.decode(filename, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
    return result


headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
url = 'http://www.renren.com/SysHome.do'
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)
code_img_src = tree.xpath('//*[@id="verifyPic_login"]/@src')[0]
code_img_data = requests.get(url=code_img_src, headers=headers).content
with open('./code.jpg', 'wb') as fp:
    fp.write(code_img_data)
result = getCodeText('code.jpg', 1000)
print(result)
login_url = 'http://www.renren.com/ajaxLogin/login?1=1&uniqueTimestamp=2019431046983'
data = {
    'email': 'www.zhangbowudi@qq.com',
    'icode': result,
    'origURL': 'http://www.renren.com/home',
    'domain': 'renren.com',
    'key_id': '1',
    'captcha_type': 'web_login',
    'password': '06768edabba49f5f6b762240b311ae5bfa4bcce70627231dd1f08b9c7c6f4375',
    'rkey': '1028219f2897941c98abdc0839a729df',
    'f': 'https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3Dgds6TUs9Q1ojOatGda5mVsLKC34AYwc5XiN8OuImHRK%26wd%3D%26eqid%3D8e38ba9300429d7d000000035cedf53a',
}
response = requests.post(url=login_url, headers=headers, data=data)
print(response.text)
print(response.status_code)
```
3. Simulated Login with Cookie Handling
```python
# Manually copying the Cookie header works but is not recommended:
#   headers = {'Cookie': 'xxxx'}
# Better: create a Session, issue the login POST above via session.post(...)
# instead of requests.post(...); the Session stores the cookies produced by
# the login and sends them automatically on every later request.
session = requests.Session()
detail_url = 'http://www.renren.com/976279344/profile'
detail_page_text = session.get(url=detail_url, headers=headers).text
with open('bobo.html', 'w', encoding='utf-8') as fp:
    fp.write(detail_page_text)
```
4. Proxy Theory
- Proxies: the counter to IP-ban anti-crawl mechanisms.
- What is a proxy? A proxy server that forwards your requests.
- What proxies do for a crawler:
  - break through limits placed on your own IP
  - hide your real IP, protecting you from retaliation
- Related sites: any of the public free or paid proxy-IP providers
- Proxy IP types:
  - http: usable only for http URLs
  - https: usable only for https URLs
- Proxy anonymity levels:
  - transparent: the server knows a proxy was used and knows your real IP
  - anonymous: the server knows a proxy was used but not your real IP
  - elite (high anonymity): the server knows neither that a proxy was used nor your real IP
5. Proxies in Crawler Code
```python
import requests

url = 'http://www.baidu.com/s?wd=ip'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
# The proxies dict maps the URL scheme to the proxy address
page_text = requests.get(url=url, headers=headers, proxies={"http": "http://124.205.155.153:9090"}).text
with open('ip.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)
```
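Free proxies die quickly, so in practice it helps to supply both schemes and to fail fast; a hedged sketch (the proxy address is a placeholder you would replace with a live one):

```python
import requests

proxies = {
    'http': 'http://111.222.333.444:8888',    # placeholder proxy for http URLs
    'https': 'http://111.222.333.444:8888',   # placeholder proxy for https URLs
}
try:
    # timeout keeps a dead proxy from hanging the crawler
    response = requests.get('https://www.baidu.com/', proxies=proxies, timeout=5)
    print(response.status_code)
except requests.exceptions.RequestException as e:
    print('Proxy failed, try the next one:', e)
```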
六、High-Performance Asynchronous Crawling
1. Asynchronous Crawling Overview
- Synchronous: program units completing a task coordinate with each other through some form of communication during execution. For example, updating product inventory in a shopping system uses row locks as the communication signal, forcing concurrent update requests to queue and run in order; the inventory update is synchronous. In short, synchronous means ordered.
- Asynchronous: program units complete a task without needing to coordinate during execution; unrelated units can run asynchronously. For example, when a crawler downloads pages, the scheduler can move on to other tasks right after invoking a download, with no further communication needed; downloads and saves of different pages are independent and finish at unpredictable times. In short, asynchronous means unordered.
- Goal: use asynchrony to make data scraping high-performance.
The synchronous baseline below downloads three files one after another:
```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}
urls = [
    'https://downsc.chinaz.net/Files/DownLoad/jianli/202102/jianli14667.rar',
    'https://downsc.chinaz.net/Files/DownLoad/jianli/202102/jianli14665.rar',
    'https://downsc.chinaz.net/Files/DownLoad/jianli/202102/jianli14648.rar'
]

def get_content(url):
    print('Scraping:', url)
    response = requests.get(url=url, headers=headers)
    if response.status_code == 200:
        return response.content

def parse_content(content):
    print('Length of the response data:', len(content))

# Each download blocks the next one - this loop is fully serial
for url in urls:
    content = get_content(url)
    parse_content(content)
```
2. Multithreading and Multiprocessing
Ways to make a crawler asynchronous:
- Multithreading / multiprocessing (not recommended):
  - benefit: blocking operations can each get their own thread or process, so they run asynchronously
  - drawback: you cannot spawn threads or processes without limit
3. Thread Pools and Process Pools
- Thread pools / process pools (use in moderation):
  - benefit: they lower the frequency of thread/process creation and destruction, nicely reducing system overhead
  - drawback: the number of threads or processes in the pool has an upper bound
4. Basic Use of a Thread Pool
```python
# Serial baseline: each simulated download blocks for 2 seconds
import time

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)
    print('Downloaded:', name)

name_list = ['xiaozi', 'aa', 'bb', 'cc']
start_time = time.time()
for i in range(len(name_list)):
    get_page(name_list[i])
end_time = time.time()
print('%d second' % (end_time - start_time))   # ~8 seconds
```
```python
# Thread-pool version: the 4 blocking downloads run concurrently
import time
from multiprocessing.dummy import Pool   # dummy gives a *thread* pool

start_time = time.time()

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)
    print('Downloaded:', name)

name_list = ['xiaozi', 'aa', 'bb', 'cc']
pool = Pool(4)                 # a pool of 4 threads
pool.map(get_page, name_list)  # blocks until every task finishes
end_time = time.time()
print(end_time - start_time)   # ~2 seconds
```
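multiprocessing.dummy is a legacy convenience; the same pattern with the standard library's concurrent.futures, as an alternative sketch:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def get_page(name):
    print('Downloading:', name)
    time.sleep(2)
    print('Downloaded:', name)

name_list = ['xiaozi', 'aa', 'bb', 'cc']
start_time = time.time()
# map() schedules all tasks; the with-block waits for them to finish
with ThreadPoolExecutor(max_workers=4) as executor:
    executor.map(get_page, name_list)
print(time.time() - start_time)   # ~2 seconds
```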
5. A Thread-Pool Case Study
```python
# Scrape videos from pearvideo.com with a thread pool.
# The real video URL is served by an Ajax endpoint and then "de-obfuscated":
# the fake segment in the returned srcUrl is replaced with cont-<id>.
import requests
import os
from multiprocessing.dummy import Pool
from lxml import etree
import random

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
}

if __name__ == '__main__':
    if not os.path.exists('./video'):
        os.mkdir('./video')
    url = 'https://www.pearvideo.com/category_5'
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="listvideoListUl"]/li')
    urls = []   # one {'name': ..., 'url': ...} dict per video
    for li in li_list:
        detail_url = 'https://www.pearvideo.com/' + li.xpath('./div/a/@href')[0]
        detail_page_text = requests.get(url=detail_url, headers=headers).text
        detail_tree = etree.HTML(detail_page_text)
        name = detail_tree.xpath('//*[@id="detailsbd"]/div[1]/div[2]/div/div[1]/h1/text()')[0]
        # The numeric content id, e.g. video_1746655 -> 1746655
        str_ = str(li.xpath('./div/a/@href')[0]).split('_')[1]
        ajax_url = 'https://www.pearvideo.com/videoStatus.jsp?'
        params = {
            'contId': str_,
            'mrd': str(random.random())
        }
        # Without the matching Referer the endpoint claims the video is gone
        ajax_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36',
            'Referer': 'https://www.pearvideo.com/video_' + str_
        }
        dic_obj = requests.get(url=ajax_url, params=params, headers=ajax_headers).json()
        video_url = dic_obj["videoInfo"]['videos']["srcUrl"]
        # Rebuild the real URL: swap the fake timestamp segment for cont-<id>
        video_true_url = ''
        s_list = str(video_url).split('/')
        for i in range(0, len(s_list)):
            if i < len(s_list) - 1:
                video_true_url += s_list[i] + '/'
        else:
                ss_list = s_list[i].split('-')
                for j in range(0, len(ss_list)):
                    if j == 0:
                        video_true_url += 'cont-' + str_ + '-'
                    elif j == len(ss_list) - 1:
                        video_true_url += ss_list[j]
                    else:
                        video_true_url += ss_list[j] + '-'
        dic = {
            'name': name,
            'url': video_true_url
        }
        urls.append(dic)

    def get_video_data(dic):
        urll = dic['url']
        data = requests.get(url=urll, headers=headers).content
        path = './video/' + dic['name'] + '.mp4'
        print(dic['name'], 'downloading.......')
        with open(path, 'wb') as fp:
            fp.write(data)
        print(dic['name'] + '.mp4', 'downloaded!')

    # The downloads (the blocking part) go through the thread pool
    pool = Pool(4)
    pool.map(get_video_data, urls)
    pool.close()
    pool.join()
```
6. A Review of Coroutine Concepts
- event_loop: the scheduler; functions are registered on the loop, which invokes them when their conditions are met.
- coroutine: declared with async def; calling such a function returns a coroutine object instead of running the body.
- task: a further wrapping of a coroutine object that carries its running state.
- future: essentially the same as a task - a handle for work that will run or is running.
- async / await: async declares a coroutine; await suspends the coroutine at a blocking point until the awaited operation finishes.
7. A Review of Coroutine Operations
```python
import asyncio

async def request(url):
    print('Requesting url:', url)
    print('Request succeeded:', url)
    return url

# Calling an async function returns a coroutine object; nothing runs yet
c = request('www.baidu.com')

def callback_func(task):
    # task.result() is the coroutine's return value
    print(task.result())

loop = asyncio.get_event_loop()
task = asyncio.ensure_future(c)        # wrap the coroutine in a task
task.add_done_callback(callback_func)  # bind a callback for when it finishes
loop.run_until_complete(task)          # register the task and run the loop
```
8. Multi-Task Asynchronous Coroutines
```python
import time
import asyncio

async def request(url):
    print('Downloading', url)
    # time.sleep() would block the loop; asyncio.sleep() must be awaited instead
    await asyncio.sleep(2)
    print('Downloaded', url)

start = time.time()
urls = [
    'www.baidu.com',
    'www.sougou.com',
    'www.goubanjia.com'
]
stasks = []
for url in urls:
    c = request(url)
    task = asyncio.ensure_future(c)
    stasks.append(task)
loop = asyncio.get_event_loop()
# asyncio.wait() suspends each task at its await point and switches to the next
loop.run_until_complete(asyncio.wait(stasks))
print(time.time() - start)   # ~2 seconds for all three tasks
```
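On newer Python versions (3.10+), get_event_loop() is deprecated in favor of asyncio.run(); the same three-task demo with run() and gather(), as a sketch:

```python
import time
import asyncio

async def request(url):
    print('Downloading', url)
    await asyncio.sleep(2)
    print('Downloaded', url)

async def main():
    urls = ['www.baidu.com', 'www.sougou.com', 'www.goubanjia.com']
    # gather() schedules all coroutines concurrently and waits for them
    await asyncio.gather(*(request(url) for url in urls))

start = time.time()
asyncio.run(main())           # creates, runs, and closes the event loop
print(time.time() - start)    # still ~2 seconds
```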
9. Why aiohttp Is Needed
```python
import requests
import asyncio
import time

start = time.time()
# A local test server with artificial delays (not included here)
urls = [
    'http://127.0.0.1:1080/bobo',
    'http://127.0.0.1:1080/jay',
    'http://127.0.0.1:1080/tom'
]

async def get_page(url):
    print('Downloading', url)
    # requests is synchronous: this call blocks the whole event loop,
    # so the three "coroutines" actually run one after another
    response = requests.get(url=url)
    print('Downloaded', response.text)

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print('Total time:', end - start)   # no speedup - hence the aiohttp module
```
10. aiohttp + Multi-Task Coroutines: a Truly Asynchronous Crawler
```python
import asyncio
import time
import aiohttp   # pip install aiohttp - an async-capable HTTP client

start = time.time()
urls = [
    'http://www.baidu.com',
    'http://www.sougou.com',
    'http://www.taobao.com'
]

async def get_page(url):
    # Every network step is awaited, so the loop can switch tasks meanwhile
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            # text() is a coroutine in aiohttp and must be awaited
            page_text = await response.text()
            print(len(page_text))

tasks = []
for url in urls:
    c = get_page(url)
    task = asyncio.ensure_future(c)
    tasks.append(task)
loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print('Total time:', end - start)
```
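Headers, proxies, and timeouts work much like in requests; a hedged sketch of the aiohttp equivalents (the proxy address is a placeholder):

```python
import asyncio
import aiohttp

async def fetch(url):
    timeout = aiohttp.ClientTimeout(total=10)   # overall deadline per session
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(
            url,
            headers={'User-Agent': 'Mozilla/5.0'},   # UA spoofing, as before
            # proxy='http://111.222.333.444:8888',   # placeholder proxy if needed
        ) as response:
            return await response.text()

print(len(asyncio.run(fetch('http://www.baidu.com'))))
```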
七、Handling Dynamically Loaded Data
1. Introduction to selenium
selenium: a browser-automation module that lets a crawler conveniently fetch dynamically loaded data and simulate human interaction with the page.
2. First Steps with selenium
The selenium workflow:
- Environment setup: pip install selenium
- Download the driver for your browser (Chrome and chromedriver in these examples)
```python
from selenium import webdriver
from lxml import etree
from time import sleep

# Instantiate a browser driven by chromedriver
bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('http://scxk.nmpa.gov.cn:81/xk/')
# page_source contains the page *after* JavaScript rendered the data
page_text = bro.page_source
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="gzlist"]/li')
for li in li_list:
    name = li.xpath('./dl/@title')[0]
    print(name)
sleep(5)
bro.quit()
```
3. Other selenium Automation Operations
```python
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.taobao.com/')
# Locate the search box by its id and type into it
search_input = bro.find_element_by_id('q')
search_input.send_keys('iphone')
# Execute arbitrary JavaScript - here, scroll one screen down
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
btn = bro.find_element_by_css_selector('.btn-search')
btn.click()
bro.get('https://baidu.com/')
sleep(2)
bro.back()      # browser Back button
sleep(2)
bro.forward()   # browser Forward button
sleep(5)
bro.quit()
```
4. Handling iframes + Action Chains
selenium and iframes:
- If the target tag lives inside an iframe, you must first call switch_to.frame(id) before locating it.
- Action chains (for dragging): from selenium.webdriver import ActionChains
  - instantiate an action chain: action = ActionChains(bro)
  - click_and_hold(div): click and hold the element
  - move_by_offset(x, y): move by the given offset
  - perform(): execute the queued actions immediately
  - action.release(): release the held element
```python
from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://www.runoob.com/try/try.php?filename=jqueryui-example-droppable')
# The draggable div lives inside an iframe - switch scope first
bro.switch_to.frame('iframeResult')
div = bro.find_element_by_id('draggable')
action = ActionChains(bro)
action.click_and_hold(div)
for i in range(5):
    # Drag 11px to the right, executing immediately each step
    action.move_by_offset(11, 0).perform()
    sleep(0.3)
action.release()
bro.quit()
```
5. Simulated QQ-Zone Login with selenium
```python
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
bro.get('https://qzone.qq.com/')
# The login form is inside an iframe
bro.switch_to.frame('login_frame')
# Switch from QR-code login to account/password login
a_tag = bro.find_element_by_id('switcher_plogin')
a_tag.click()
userName_tag = bro.find_element_by_id('u')
password_tag = bro.find_element_by_id('p')
sleep(1)
userName_tag.send_keys('QQ number')
password_tag.send_keys('QQ password')
sleep(1)
btn = bro.find_element_by_id('login_button')
btn.click()
sleep(3)
bro.quit()
```
6. Headless Browsers + Detection Evasion
```python
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.webdriver import ChromeOptions

# Headless mode: Chrome runs without opening a window
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')
# Evasion: hide the "controlled by automated test software" signal
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
bro = webdriver.Chrome(executable_path='./chromedriver.exe', chrome_options=chrome_options, options=option)
bro.get('https://www.baidu.com')
print(bro.page_source)
sleep(2)
bro.quit()
```
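In Selenium 4 the executable_path and chrome_options parameters were removed; the same setup with the current API, as a sketch (the chromedriver path is an assumption):

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument('--headless=new')   # modern headless flag
options.add_argument('--disable-gpu')
options.add_experimental_option('excludeSwitches', ['enable-automation'])

# The driver path now goes through a Service object
bro = webdriver.Chrome(service=Service('./chromedriver.exe'), options=options)
bro.get('https://www.baidu.com')
print(bro.page_source)
bro.quit()
```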
7. Basic Use of Chaojiying (超级鹰)
Chaojiying: https://www.chaojiying.com/about.html
- Register: as a regular user
- Log in: as a regular user
- Check your credits; top up if needed
- Software ID: create one
- Download the sample code
8. Simulated 12306 Login
Coding workflow:
- open the login page with selenium
- take a screenshot of the page selenium opened
- crop the captcha region out of the screenshot
- have Chaojiying recognize the captcha image (it returns click coordinates)
```python
import requests
from hashlib import md5


class Chaojiying_Client(object):
    def __init__(self, username, password, soft_id):
        self.username = username
        password = password.encode('utf8')
        self.password = md5(password).hexdigest()
        self.soft_id = soft_id
        self.base_params = {
            'user': self.username,
            'pass2': self.password,
            'softid': self.soft_id,
        }
        self.headers = {
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0)',
        }

    def PostPic(self, im, codetype):
        """
        im: image bytes
        codetype: captcha type, see http://www.chaojiying.com/price.html
        """
        params = {
            'codetype': codetype,
        }
        params.update(self.base_params)
        files = {'userfile': ('ccc.jpg', im)}
        r = requests.post('http://upload.chaojiying.net/Upload/Processing.php', data=params, files=files, headers=self.headers)
        return r.json()

    def ReportError(self, im_id):
        """
        im_id: the image ID of a wrongly solved captcha
        """
        params = {
            'id': im_id,
        }
        params.update(self.base_params)
        r = requests.post('http://upload.chaojiying.net/Upload/ReportError.php', data=params, headers=self.headers)
        return r.json()


from selenium import webdriver
import time
from PIL import Image   # pip install pillow - used to crop the screenshot
from selenium.webdriver import ActionChains

bro = webdriver.Chrome(executable_path='./chromedriver.exe')
# Evasion: undefine navigator.webdriver before any page script runs
bro.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
    Object.defineProperty(navigator, 'webdriver', {
      get: () => undefined
    })
    """
})
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
bro.maximize_window()
time.sleep(1)
# Switch to account/password login
zhanghao_tag = bro.find_element_by_class_name('login-hd-account')
zhanghao_tag.click()
time.sleep(1)
bro.save_screenshot('aa.png')
# Locate the captcha element; location/size give its bounding box
code_img_ele = bro.find_element_by_class_name('touclick-wrapper')
location = code_img_ele.location
print('location:', location)
size = code_img_ele.size
print('size:', size)
# The 1.25 factor compensates for the display's 125% scaling
rangle = (location['x'] * 1.25, location['y'] * 1.25, (location['x'] + size['width']) * 1.25, (location['y'] + size['height']) * 1.25)
i = Image.open('./aa.png')
code_img_name = './code.png'
frame = i.crop(rangle)
frame.save(code_img_name)
time.sleep(3)
chaojiying = Chaojiying_Client('Chaojiying account', 'Chaojiying password', 'software ID')
```