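1. A single-threaded crawler based on Requests and BeautifulSoup
1.1 BeautifulSoup usage summary
1. find: returns the first matching tag
tag = soup.find('a')
print(tag)
# The next two calls are equivalent: class_ is shorthand for attrs={'class': ...}
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tag)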
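2. find_all: returns all matching tags, including tags nested inside other tags. To search only the direct children of a tag, pass recursive=False (recursive controls whether the search descends into nested tags).
tags = soup.find_all('a')
print(tags)
# Same filters as for find, but every match is returned as a list
tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)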
3. get: returns the value of an attribute
img_url = soup.find('div',class_='main-image').find('img').get('src')
4. text: returns the text content of a tag
title = soup.find('h2',class_='main-title').text.strip()
1.2 A simple application: crawling images from mzitu
import os

import requests
from bs4 import BeautifulSoup

base_url = 'http://www.mzitu.com/'
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

r1 = requests.get(url=base_url)
soup = BeautifulSoup(r1.text, features='lxml')

# Collect the links to all albums
tags = soup.find(name='ul', id="pins").find_all('li')
url_list = []
for tag in tags:
    url = tag.find('span').find('a').get('href')
    url_list.append(url)

for url in url_list:
    # Fetch the album page
    r2 = requests.get(url=url)
    soup = BeautifulSoup(r2.text, features='lxml')
    title = soup.find('h2', class_='main-title').text.strip()
    # Total number of images in the album
    num = int(soup.find('div', class_='pagenavi').find_all('span')[-2].text)
    # Create the folder the images are saved to (no error if it already exists)
    path = os.path.join(BASE_DIR, title)
    os.makedirs(path, exist_ok=True)
    # Visit each page of the album and fetch its image URL
    for i in range(1, num + 1):
        url_new = "%s/%s" % (url, i)
        r3 = requests.get(url=url_new)
        soup = BeautifulSoup(r3.text, features='lxml')
        img_url = str(soup.find('div', class_='main-image').find('img').get('src'))
        # Send a Referer header to get past the image hotlink protection
        r4 = requests.get(url=img_url, headers={'Referer': url_new})
        # The file name is the last segment of the image URL
        parts = img_url.rsplit('/', maxsplit=1)
        file_name = os.path.join(path, parts[1])
        with open(file_name, 'wb') as f:
            f.write(r4.content)
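The inner loop above does three small pieces of URL work: it builds the address of each page of an album, sends that page address as the Referer, and takes the file name from the end of the image URL. A minimal sketch of those steps as standalone helpers could look like the following. The function names and the example URLs are made up for illustration and are not part of the original script; the query-string handling is an extra assumption.

```python
import posixpath
from urllib.parse import urlsplit


def page_url(album_url, page):
    """Build the URL of one page of an album, e.g. <album>/3."""
    return "%s/%s" % (album_url.rstrip("/"), page)


def image_filename(img_url):
    """Return the last path segment of an image URL, ignoring any query string."""
    return posixpath.basename(urlsplit(img_url).path)


def hotlink_headers(referer):
    """Headers that present the album page as the referring page."""
    return {"Referer": referer}


if __name__ == "__main__":
    album = "http://www.mzitu.com/12345"  # hypothetical album URL
    print(page_url(album, 3))
    print(image_filename("http://img.example.com/2018/01/a01.jpg?x=1"))
    print(hotlink_headers(page_url(album, 3)))
```

Keeping these steps in small functions makes them easy to check without sending any request.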
1.3 Simulating login to the chouti site and upvoting
import requests
from fake_useragent import UserAgent

agent = UserAgent()

# ############## Method 1 ##############
"""
# 1. Visit any page first to obtain a cookie
i1 = requests.get(
    url="https://dig.chouti.com/",
    headers={"User-Agent": agent.random},
)
i1_cookies = i1.cookies.get_dict()
print(i1_cookies)

# 2. Log in, sending the cookie from step 1; the server authorizes the gpsd value in that cookie
i2 = requests.post(
    url="https://dig.chouti.com/login",
    data={
        'phone': "8615057101356",
        'password': "199SulkyBuckets",
        'oneMonth': "1"
    },
    headers={"User-Agent": agent.random},
    cookies=i1_cookies,
)

# 3. Upvote (only the authorized gpsd cookie needs to be sent)
i3 = requests.post(
    url="https://dig.chouti.com/link/vote?linksId=19444596",
    headers={"User-Agent": agent.random},
    cookies=i1_cookies,
)
print(i3.text)
"""

# ############## Method 2 ##############
# A Session keeps cookies between requests, so they are not passed by hand
session = requests.Session()
i1 = session.get(url="https://dig.chouti.com", headers={"User-Agent": agent.random})
i2 = session.post(
    url="https://dig.chouti.com/login",
    data={
        'phone': "8615057101356",
        'password': "199SulkyBuckets",
        'oneMonth': "1"
    },
    headers={"User-Agent": agent.random}
)
i3 = session.post(
    url="https://dig.chouti.com/link/vote?linksId=19444596",
    headers={"User-Agent": agent.random}
)
print(i3.text)
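2. The Scrapy framework
Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a range of programs for data mining, information processing, or archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), and it can also be used to retrieve data returned by APIs (such as Amazon Associates Web Services) or as a general-purpose web crawler. Scrapy has a wide range of uses, including data mining, monitoring, and automated testing.
Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is outlined below.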
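Scrapy consists mainly of the following components:
- Engine (Scrapy Engine)
  Handles the data flow of the whole system and triggers events (the core of the framework).
- Scheduler
  Accepts the requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and removes duplicate URLs.
- Downloader
  Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted's efficient asynchronous model).
- Spiders
  Spiders do the main work: they extract the information you need, the so-called items, from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
- Item Pipeline
  Processes the items that the spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page has been parsed by a spider, its items are sent to the item pipeline and processed in several steps in a fixed order.
- Downloader Middlewares
  A layer between the Scrapy engine and the downloader that processes the requests and responses passing between them.
- Spider Middlewares
  A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
- Scheduler Middlewares
  A layer between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.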
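A Scrapy run proceeds roughly as follows:
- The engine takes a link (URL) from the scheduler for the next crawl
- The engine wraps the URL in a Request and passes it to the downloader
- The downloader fetches the resource and wraps it in a Response
- The spider parses the Response
- If an item is parsed out, it is handed to the item pipeline for further processing
- If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling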
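The run flow listed above can be sketched as a small synchronous loop. This is only a toy model under stated assumptions: the "web" is a hypothetical in-memory dictionary, all names (FAKE_WEB, download, parse, run) are invented for the illustration, and real Scrapy runs these steps asynchronously on Twisted rather than in a plain while loop.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", ["http://example.com/"]),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response."""
    text, links = FAKE_WEB[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider: yield one item per page, then every link found on it."""
    yield {"item": response["text"]}
    for link in response["links"]:
        yield link


def run(start_url):
    """Engine: move data between scheduler, downloader, spider and pipeline."""
    queue = deque([start_url])  # scheduler queue
    seen = {start_url}          # scheduler-side duplicate filter
    items = []                  # stands in for the item pipeline
    while queue:
        url = queue.popleft()              # take the next URL from the scheduler
        response = download(url)           # request goes in, response comes out
        for result in parse(response):     # the spider parses the response
            if isinstance(result, dict):   # items go to the pipeline
                items.append(result["item"])
            elif result not in seen:       # new links go back to the scheduler
                seen.add(result)
                queue.append(result)
    return items


if __name__ == "__main__":
    print(run("http://example.com/"))
```

Because the scheduler remembers every URL it has accepted, the three pages are each fetched once even though they link to one another.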
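2.1 Basic commands
1. scrapy startproject <project name>
   - Creates a project folder in the current directory (similar to Django)
2. scrapy genspider [-t template] <name> <domain>
   - Creates a spider
   For example:
      scrapy genspider -t basic oldboy oldboy.com
      scrapy genspider -t xmlfeed autohome autohome.com.cn
   PS:
      List the available templates: scrapy genspider -l
      Show the contents of a template: scrapy genspider -d <template name>
3. scrapy list
   - Lists the spiders in the project
4. scrapy crawl <spider name>
   - Runs a single spider; add --nolog to suppress the run log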
2.2 Selectors
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector
from scrapy.http import HtmlResponse

html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html, encoding='utf-8')

# HtmlXPathSelector is a legacy class; Selector is used for all examples below
# hxs = HtmlXPathSelector(response)
# print(hxs)
# Every <a> element in the document
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# <a> elements that are the second <a> child of their parent
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# <a> elements that have an id attribute
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# (the next two expressions are truncated in the source)
# hxs = Selector(response=response).xpath('//a[@>
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@>
# print(hxs)
# <a> elements whose href contains / starts with "link"
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# <a> elements whose id matches a regular expression
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath(r'//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# extract() returns every match as a list; extract_first() returns only the first one
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)

# Relative paths are evaluated from the current node
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # not equivalent: this looks for <a> one level deeper, under any child of <li>
#     # v = item.xpath('*/a/span')
#     print(v)
Automatic login and upvoting on chouti
import scrapy
from scrapy.selector import Selector
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar


class ChouTiSpider(scrapy.Spider):
    # Spider name; the crawl command starts the spider by this name
    name = "chouti"
    # Allowed domains
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}

    # Override the method that issues the first requests
    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        # Copy the cookies set by the first response into cookie_dict
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
        print(self.cookie_dict)
        # Field names follow the login form used in section 1.3
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            body='phone=8615057101356&password=199SulkyBuckets&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        # Reload the home page with the now-authorized cookies;
        # dont_filter=True because this URL has already been requested once
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # Selector replaces the legacy HtmlXPathSelector / select() API
        hxs = Selector(response=response)
        # The XPath below is truncated in the source; it must select the
        # news-item containers before this spider can run
        news_list = hxs.xpath('//div[@>')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' % (link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )
        # Follow the pagination links, skipping pages already requested
        # (the XPath in the next line is truncated in the source)
        # page_list = hxs.select('//div[@)]/@href').extract()
        # for page in page_list:
        #     page_url = 'http://dig.chouti.com%s' % page
        #     import hashlib
        #     hash = hashlib.md5()
        #     hash.update(bytes(page_url, encoding='utf-8'))
        #     key = hash.hexdigest()
        #     if key in self.has_request_set:
        #         pass
        #     else:
        #         self.has_request_set[key] = page_url
        #         yield Request(
        #             url=page_url,
        #             method='GET',
        #             callback=self.show
        #         )

    def do_favor(self, response):
        print(response.text)
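The login callback above flattens the cookie jar by walking its private _cookies mapping, which is nested as domain, then path, then cookie name. The sketch below shows the same flattening on the standard library's http.cookiejar.CookieJar, which can simply be iterated. It assumes, without having checked every Scrapy version, that Scrapy's CookieJar wraps this standard-library class and can be iterated the same way; make_cookie, cookies_to_dict and the cookie values are hypothetical.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, path="/"):
    """Build a minimal cookie; only name, value, domain and path matter here."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookies_to_dict(jar):
    """Flatten a jar (domain -> path -> name) into a plain {name: value} dict."""
    return {cookie.name: cookie.value for cookie in jar}


# Hypothetical cookies standing in for what the first response would set
jar = CookieJar()
jar.set_cookie(make_cookie("gpsd", "abc123", "dig.chouti.com"))
jar.set_cookie(make_cookie("route", "xyz", "dig.chouti.com"))
print(cookies_to_dict(jar))
```

Iterating the jar avoids depending on the layout of a private attribute, at the cost of losing the domain and path, exactly as the three nested loops do.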
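Note: set DEPTH_LIMIT = 1 in settings.py to limit the depth of the "recursion", that is, how many levels of links are followed.
To crawl the same page more than once, pass dont_filter=True to the Request so that Scrapy's built-in duplicate filter does not drop it.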
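2.3 Avoiding repeated visits
By default Scrapy uses scrapy.dupefilters.RFPDupeFilter to filter out duplicate requests. The related settings are:
DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory in which the record of visited requests is saved, e.g. /root/"  # the final path is /root/requests.seen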
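The idea behind such a duplicate filter can be sketched in a few lines: reduce each request to a fingerprint, remember the fingerprints already seen, and let requests marked dont_filter bypass the check. This is a simplified illustration, not Scrapy's implementation. The names fingerprint and SeenFilter are invented here, and the sketch hashes only the method and the raw URL, whereas a real filter would typically also normalize the URL and take the request body into account.

```python
import hashlib


def fingerprint(method, url):
    """Hash the request method and URL into a fixed-length key."""
    data = ("%s %s" % (method.upper(), url)).encode("utf-8")
    return hashlib.sha1(data).hexdigest()


class SeenFilter:
    """Toy duplicate filter: remembers the fingerprints it has been shown."""

    def __init__(self):
        self.seen = set()

    def should_fetch(self, method, url, dont_filter=False):
        """Return True if the request should be downloaded."""
        if dont_filter:
            # Bypass the filter entirely; nothing is recorded
            return True
        key = fingerprint(method, url)
        if key in self.seen:
            return False
        self.seen.add(key)
        return True


if __name__ == "__main__":
    f = SeenFilter()
    print(f.should_fetch("GET", "http://dig.chouti.com/"))
    print(f.should_fetch("GET", "http://dig.chouti.com/"))
    print(f.should_fetch("GET", "http://dig.chouti.com/", dont_filter=True))
```

Storing a fixed-length hash instead of the full URL keeps the record of visited requests compact, which matters when it is written to disk under JOBDIR.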
2.4 Crawling images from mzitu
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.selector import Selector

from ..items import MzituItem


class MeizituSpider(scrapy.Spider):
    name = 'meizitu'
    allowed_domains = ['mzitu.com']
    # start_urls = ['http://mzitu.com/']

    def start_requests(self):
        url = 'http://www.mzitu.com/all/'
        yield Request(url=url, method='GET', callback=self.main_page)

    def main_page(self, response):
        # Collect the URLs of all albums
        hxs = Selector(response=response).xpath('//p[contains(@class,"url")]/a/@href').extract()
        for url in hxs:
            req = Request(url=url, callback=self.fenye)
            yield req

    def fenye(self, response):
        # fenye = pagination: handles one page of an album
        # Extract the image URL and the title
        img_url = Selector(response=response).xpath('//div[@class="main-image"]//img/@src').extract_first().strip()
        title = Selector(response=response).xpath('//div[@class="main-image"]//img/@alt').extract_first().strip()
        yield MzituItem(img_url=img_url, title=title)
        # Collect the page links from the navigation bar at the bottom;
        # pages already requested are dropped by the duplicate filter
        xhs = Selector(response=response).xpath('//div[@class="pagenavi"]/a/@href').extract()
        for url in xhs:
            req = Request(
                url=url,
                callback=self.fenye,
            )
            yield req