1. A single-threaded crawler based on Requests and BeautifulSoup

1.1 Summary of BeautifulSoup usage

1. find: returns the first matching tag

tag = soup.find('a')
print(tag)
tag = soup.find(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tag = soup.find(name='a', class_='sister', recursive=True, text='Lacie')
print(tag)

2. find_all: returns every matching tag, including tags nested inside other tags. To skip nested tags and search only direct children, pass recursive=False.

tags = soup.find_all('a')
print(tags)
tags = soup.find_all(name='a', attrs={'class': 'sister'}, recursive=True, text='Lacie')
tags = soup.find_all(name='a', class_='sister', recursive=True, text='Lacie')
print(tags)

3. get: returns the value of an attribute

img_url = soup.find('div',class_='main-image').find('img').get('src')

4. text: returns the text content of a tag

title = soup.find('h2',class_='main-title').text.strip()
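
The four calls above can be tried on a small inline document. The sketch below assumes bs4 is installed and uses the built-in html.parser backend so that lxml is not required; the HTML snippet and the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

# A made-up document: two <a> tags directly inside the <div>, one nested deeper.
html = """
<div id="box">
  <a class="sister" href="/lacie">Lacie</a>
  <a class="sister" href="/tillie">Tillie</a>
  <p><a class="brother" href="/tom">Tom</a></p>
</div>
"""

soup = BeautifulSoup(html, features="html.parser")
box = soup.find("div", id="box")

# find returns only the first match (or None when nothing matches).
first = box.find("a")

# find_all searches every descendant by default ...
all_links = box.find_all("a")
# ... while recursive=False restricts the search to direct children.
direct_links = box.find_all("a", recursive=False)

# get reads an attribute value; text returns the tag's text content.
hrefs = [a.get("href") for a in all_links]
first_text = first.text.strip()

print(first_text, hrefs, len(direct_links))
```

Because find returns None when nothing matches, chained calls such as soup.find(...).find(...) raise AttributeError on pages that lack the expected structure, so a None check is worth adding in a real crawler.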

1.2 A simple application: crawling images from mzitu

import requests,os
from bs4 import BeautifulSoup


base_url = 'http://www.mzitu.com/'
BASE_DIR = os.path.dirname(os.path.abspath(__file__))

r1 = requests.get(url=base_url)
# print(r1.text)
soup = BeautifulSoup(r1.text,features='lxml')
# Collect the links to all albums
tags = soup.find(name='ul',id="pins").find_all('li')
url_list = []
for tag in tags:
    url = tag.find('span').find('a').get('href')
    # print(img_url)
    url_list.append(url)

for url in url_list:
    # Fetch the album page
    r2 = requests.get(url=url)
    soup = BeautifulSoup(r2.text,features='lxml')

    title = soup.find('h2',class_='main-title').text.strip()
    # img_url = soup.find('div',class_='main-image').find('img').get('src')
    # Total number of images in the album
    num = int(soup.find('div',class_='pagenavi').find_all('span')[-2].text)
    # Folder the images are saved to
    path = os.path.join(BASE_DIR,title)
    # print(path)
    os.makedirs(path, exist_ok=True)
    # Loop over the pages to get each image URL
    for i in range(1,num+1):
        url_new = "%s/%s"%(url,i)
        r3 = requests.get(url=url_new)
        soup = BeautifulSoup(r3.text,features='lxml')
        img_url = str(soup.find('div',class_='main-image').find('img').get('src'))
        # Add a Referer header to get past the image hotlink protection
        r4 = requests.get(url=img_url,
                    headers={'Referer':url_new})
        # print(type(img_url))
        # Avoid shadowing the built-in name dict
        parts = img_url.rsplit('/',maxsplit=1)
        file_name = os.path.join(path,parts[1])
        # print(file_name)
        with open(file_name,'wb') as f:
            f.write(r4.content)
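
The URL handling in the script above can be pulled into small helpers that are easy to check without touching the network. This is only a sketch: page_urls, image_file_name, and image_headers are hypothetical names, and the URLs are invented to show the shape of the results.

```python
import os
import posixpath
from urllib.parse import urlsplit


def page_urls(album_url, num_pages):
    """Return the URL of every page in an album: <album_url>/1 ... <album_url>/N."""
    base = album_url.rstrip("/")
    return ["%s/%s" % (base, i) for i in range(1, num_pages + 1)]


def image_file_name(img_url, folder):
    """Build the local path for an image from the last segment of its URL path."""
    name = posixpath.basename(urlsplit(img_url).path)
    return os.path.join(folder, name)


def image_headers(page_url):
    """Headers for the image request; Referer is the page that embeds the image."""
    return {"Referer": page_url}


# Hypothetical URLs, used only to show the shape of the results.
pages = page_urls("http://www.mzitu.com/12345", 3)
target = image_file_name("http://i.example.com/2018/01/01a01.jpg?v=2", "album")
print(pages)
print(target)
print(image_headers(pages[0]))
```

Splitting the URL with urlsplit before taking the last path segment keeps a query string such as ?v=2 out of the file name, which a plain rsplit('/') would include.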

1.3 Simulating login to the chouti site and upvoting

import requests
from fake_useragent import UserAgent

agent = UserAgent()
# ############## Approach 1 ##############
"""
## 1. First visit any page to obtain a cookie
i1 = requests.get(url="https://dig.chouti.com/",
                  headers={
                      "User-Agent":agent.random,
                  })
i1_cookies = i1.cookies.get_dict()
print(i1_cookies)

# ## 2. Log in, sending the cookie from step 1; the backend authorizes the gpsd value in that cookie
i2 = requests.post(
    url="https://dig.chouti.com/login",
    data={
        'phone': "8615057101356",
        'password': "199SulkyBuckets",
        'oneMonth': "1"
    },
    headers={"User-Agent":agent.random,},
    cookies=i1_cookies,
)

# ## 3. Upvote (only the already-authorized gpsd cookie needs to be sent)

i3 = requests.post(
    url="https://dig.chouti.com/link/vote?linksId=19444596",
    headers={"User-Agent":agent.random,},
    cookies=i1_cookies,
)
print(i3.text)
"""

# ############## Approach 2 ##############

# import requests

session = requests.Session()
i1 = session.get(url="https://dig.chouti.com",
                 headers={"User-Agent": agent.random})
i2 = session.post(
    url="https://dig.chouti.com/login",
    data={
        'phone': "8615057101356",
        'password': "199SulkyBuckets",
        'oneMonth': "1"
    },
    headers={"User-Agent": agent.random}
)
i3 = session.post(
    url="https://dig.chouti.com/link/vote?linksId=19444596",
    headers={"User-Agent": agent.random}
)
print(i3.text)
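
The difference between the two approaches is that a Session keeps the cookies itself and attaches them to later requests, so they do not have to be passed by hand. The sketch below shows this offline, assuming requests is installed: the cookie value "abc123" is made up, and the request is only prepared, never sent.

```python
import requests

session = requests.Session()

# Stand-in for the cookie that the first GET would normally store in the
# session; the value "abc123" is made up.
session.cookies.set("gpsd", "abc123", domain="dig.chouti.com", path="/")

# Build the vote request but only prepare it; nothing goes over the network.
request = requests.Request(
    "POST",
    "https://dig.chouti.com/link/vote",
    params={"linksId": "19444596"},
)
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers.get("Cookie"))
```

The prepared request already carries the Cookie header, which is what the cookies=i1_cookies argument does manually in approach 1.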

2. The Scrapy framework

 

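Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, including data mining, information processing, and archiving historical data.
It was originally designed for page scraping (more precisely, web scraping), but it can also be used to retrieve data returned by APIs (for example, Amazon Associates Web Services) or as a general-purpose web crawler. It is also used for monitoring and automated testing.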

 

Scrapy uses the Twisted asynchronous networking library to handle network communication. The overall architecture is roughly as follows:

[Figure: Scrapy architecture diagram]

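Scrapy consists mainly of the following components:

  • Engine (Scrapy Engine)
    Controls the data flow of the whole system and triggers events (the core of the framework).
  • Scheduler
    Accepts requests sent by the engine, pushes them onto a queue, and returns them when the engine asks again. It can be thought of as a priority queue of URLs (the addresses of the pages to crawl): it decides which URL is crawled next and removes duplicate URLs.
  • Downloader
    Downloads page content and returns it to the spiders (the Scrapy downloader is built on Twisted, an efficient asynchronous model).
  • Spiders
    Spiders do the main work: they extract the information you need, the so-called items (Item), from specific pages. They can also extract links so that Scrapy goes on to crawl the next page.
  • Item Pipeline
    Processes the items that spiders extract from pages. Its main jobs are persisting items, validating them, and removing unwanted information. After a page has been parsed by a spider, its items are sent to the item pipeline and processed by several components in a specific order.
  • Downloader Middlewares
    A layer between the Scrapy engine and the downloader that processes the requests and responses passing between them.
  • Spider Middlewares
    A layer between the Scrapy engine and the spiders that processes the spiders' response input and request output.
  • Scheduler Middlewares
    A layer between the Scrapy engine and the scheduler that processes the requests and responses sent from the engine to the scheduler.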

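Scrapy's run flow is roughly as follows:

    1. The engine takes a link (URL) from the scheduler for the next crawl
    2. The engine wraps the URL in a request (Request) and passes it to the downloader
    3. The downloader downloads the resource and wraps it in a response (Response)
    4. The spider parses the Response
    5. If an item (Item) is parsed out, it is handed to the item pipeline for further processing
    6. If a link (URL) is parsed out, the URL is handed to the scheduler to wait for crawling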
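
The six steps above can be sketched as a toy, synchronous loop. This is not how Scrapy is implemented (the real engine is asynchronous and built on Twisted); FAKE_WEB, Scheduler, download, parse, and run are hypothetical names invented for this illustration, and the "web" is just an in-memory dictionary.

```python
import heapq

# A fake "web": URL -> page, where a page holds outgoing links and one data value.
FAKE_WEB = {
    "http://example.com/": {"links": ["http://example.com/a", "http://example.com/b"], "data": "home"},
    "http://example.com/a": {"links": ["http://example.com/b"], "data": "page a"},
    "http://example.com/b": {"links": ["http://example.com/"], "data": "page b"},
}


class Scheduler:
    """Priority queue of URLs that drops URLs it has already seen."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker that keeps insertion order stable

    def enqueue(self, url, priority=0):
        if url in self._seen:
            return False
        self._seen.add(url)
        heapq.heappush(self._heap, (priority, self._counter, url))
        self._counter += 1
        return True

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def download(url):
    """Downloader: turn a request for a URL into a response."""
    return {"url": url, "body": FAKE_WEB[url]}


def parse(response):
    """Spider: yield items (dicts) and follow-up links (strings)."""
    yield {"url": response["url"], "data": response["body"]["data"]}
    for link in response["body"]["links"]:
        yield link


def run(start_url):
    scheduler, pipeline = Scheduler(), []
    scheduler.enqueue(start_url)
    while True:
        url = scheduler.next_url()          # step 1
        if url is None:
            break
        response = download(url)            # steps 2 and 3
        for result in parse(response):      # step 4
            if isinstance(result, dict):
                pipeline.append(result)     # step 5
            else:
                scheduler.enqueue(result)   # step 6
    return pipeline


items = run("http://example.com/")
print([item["data"] for item in items])
```

Each page is visited once even though the pages link to each other in a cycle, because the scheduler refuses URLs it has already queued.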

2.1 Basic commands

1. scrapy startproject project_name
   - Creates a project directory in the current directory (similar to Django)
 
2. scrapy genspider [-t template] <name> <domain>
   - Creates a spider
   For example:
      scrapy genspider -t basic oldboy oldboy.com
      scrapy genspider -t xmlfeed autohome autohome.com.cn
   PS:
      List all templates: scrapy genspider -l
      Show a template:    scrapy genspider -d template_name
 
3. scrapy list
   - Lists the spiders in the project
 
4. scrapy crawl spider_name --nolog (suppresses the run log)
   - Runs a single spider

2.2 Selectors

#!/usr/bin/env python
# -*- coding:utf-8 -*-
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
html = """<!DOCTYPE html>
<html>
    <head lang="en">
        <meta charset="UTF-8">
        <title></title>
    </head>
    <body>
        <ul>
            <li class="item-"><a id='i1' href="link.html">first item</a></li>
            <li class="item-0"><a id='i2' href="llink.html">first item</a></li>
            <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
        </ul>
        <div><a href="llink2.html">second item</a></div>
    </body>
</html>
"""
response = HtmlResponse(url='http://example.com', body=html,encoding='utf-8')
# hxs = Selector(response=response)  # HtmlXPathSelector is deprecated; Selector replaces it
# print(hxs)
# hxs = Selector(response=response).xpath('//a')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[2]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@id]')
# print(hxs)
# The attribute tests in the next two expressions were truncated in the source.
# hxs = Selector(response=response).xpath('//a[@>
# print(hxs)
# hxs = Selector(response=response).xpath('//a[@href="link.html"][@>
# print(hxs)
# hxs = Selector(response=response).xpath('//a[contains(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[starts-with(@href, "link")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]')
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/text()').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//a[re:test(@id, "i\d+")]/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('/html/body/ul/li/a/@href').extract()
# print(hxs)
# hxs = Selector(response=response).xpath('//body/ul/li/a/@href').extract_first()
# print(hxs)
 
# ul_list = Selector(response=response).xpath('//body/ul/li')
# for item in ul_list:
#     v = item.xpath('./a/span')
#     # or
#     # v = item.xpath('a/span')
#     # or
#     # v = item.xpath('*/a/span')
#     print(v)
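
A few of the simpler expressions above can be tried without installing Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree, which supports only a subset of XPath (no contains(), starts-with(), or re:test, for which Scrapy's Selector is still needed) and requires well-formed XML, so it uses a trimmed copy of the sample document.

```python
import xml.etree.ElementTree as ET

# A well-formed fragment of the sample document above (ElementTree needs valid XML).
fragment = """
<body>
    <ul>
        <li class="item-"><a id="i1" href="link.html">first item</a></li>
        <li class="item-0"><a id="i2" href="llink.html">first item</a></li>
        <li class="item-1"><a href="llink2.html">second item<span>vv</span></a></li>
    </ul>
    <div><a href="llink2.html">second item</a></div>
</body>
"""
root = ET.fromstring(fragment)

# .//a : every <a> below the current node
all_links = root.findall(".//a")
# .//a[@id] : only <a> elements that carry an id attribute
with_id = root.findall(".//a[@id]")
# [@href='...'] : test an attribute value
exact = root.findall(".//a[@href='link.html']")
# Paths relative to each <li>: 'a/span' and './a/span' are equivalent.
spans = [li.findall("a/span") for li in root.findall("./ul/li")]

print(len(all_links), [a.get("id") for a in with_id])
print([a.text for a in exact], [len(s) for s in spans])
```

Only the third <li> contains a <span> inside its link, so the relative query returns a match for that item alone.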

Automatic login and upvoting on chouti

import scrapy
from scrapy.http.request import Request
from scrapy.http.cookies import CookieJar
from scrapy import FormRequest


class ChouTiSpider(scrapy.Spider):
    # Name of the spider; used to start it with `scrapy crawl chouti`
    name = "chouti"
    # Allowed domains
    allowed_domains = ["chouti.com"]

    cookie_dict = {}
    has_request_set = {}
    # Override the default start method
    def start_requests(self):
        url = 'http://dig.chouti.com/'
        # return [Request(url=url, callback=self.login)]
        yield Request(url=url, callback=self.login)

    def login(self, response):
        # Copy the cookies set by the first response into cookie_dict
        cookie_jar = CookieJar()
        cookie_jar.extract_cookies(response, response.request)
        for k, v in cookie_jar._cookies.items():
            for i, j in v.items():
                for m, n in j.items():
                    self.cookie_dict[m] = n.value
        print(self.cookie_dict)
        req = Request(
            url='http://dig.chouti.com/login',
            method='POST',
            headers={'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'},
            # Same form fields as the Requests version in section 1.3
            body='phone=8615057101356&password=199SulkyBuckets&oneMonth=1',
            cookies=self.cookie_dict,
            callback=self.check_login
        )
        yield req

    def check_login(self, response):
        # print(response.text)
        req = Request(
            url='http://dig.chouti.com/',
            method='GET',
            callback=self.show,
            cookies=self.cookie_dict,
            dont_filter=True
        )
        yield req

    def show(self, response):
        # print(response.text)
        # The XPath for the news items was truncated in the source ('//div[@...').
        # The expression below is only a stand-in inferred from the loop body
        # (any div with a grandchild div.part2); replace it with the real selector.
        news_list = response.xpath('//div[*/div[@class="part2"]]')
        for new in news_list:
            # temp = new.xpath('div/div[@class="part2"]/@share-linkid').extract()
            link_id = new.xpath('*/div[@class="part2"]/@share-linkid').extract_first()
            yield Request(
                url='http://dig.chouti.com/link/vote?linksId=%s' %(link_id,),
                method='POST',
                cookies=self.cookie_dict,
                callback=self.do_favor
            )

        # Pagination (disabled). The XPath on the next line was truncated in the source.
        # page_list = hxs.select('//div[@)]/@href').extract()
        # for page in page_list:
        #
        #     page_url = 'http://dig.chouti.com%s' % page
        #     import hashlib
        #     hash = hashlib.md5()
        #     hash.update(bytes(page_url,encoding='utf-8'))
        #     key = hash.hexdigest()
        #     if key in self.has_request_set:
        #         pass
        #     else:
        #         self.has_request_set[key] = page_url
        #         yield Request(
        #             url=page_url,
        #             method='GET',
        #             callback=self.show
        #         )

    def do_favor(self, response):
        print(response.text)
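
The triple loop in login() walks a nested mapping of the form domain -> path -> name -> Cookie. The sketch below reproduces that layout with the standard library's http.cookiejar.CookieJar, on the assumption (suggested by the loop itself) that Scrapy's jar exposes the same structure; make_cookie and all cookie values are made up.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain, path="/"):
    """Build a minimal session cookie; all values here are made up."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True,
        domain_initial_dot=domain.startswith("."),
        path=path, path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


jar = CookieJar()
jar.set_cookie(make_cookie("gpsd", "abc123", ".chouti.com"))
jar.set_cookie(make_cookie("route", "node-1", "dig.chouti.com"))

# The private mapping walked in login(): domain -> path -> name -> Cookie.
nested = {}
for domain, paths in jar._cookies.items():
    for path, names in paths.items():
        for name, cookie in names.items():
            nested[name] = cookie.value

# Iterating over the jar yields the same Cookie objects through the public API.
flat = {cookie.name: cookie.value for cookie in jar}

print(nested)
print(flat == nested)
```

Iterating over the jar avoids depending on the private _cookies attribute. Either way, flattening by name alone loses the domain and path, so two cookies with the same name would overwrite each other.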

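Note: set DEPTH_LIMIT = 1 in settings.py to limit the depth of the "recursion", that is, how many levels of links are followed.

When the same page is requested more than once, pass dont_filter=True to the Request so that Scrapy's built-in duplicate filter does not drop the repeat.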

2.3 Avoiding repeated visits

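By default Scrapy uses scrapy.dupefilters.RFPDupeFilter to filter duplicate requests (older releases named the module scrapy.dupefilter). The related settings are:

DUPEFILTER_CLASS = 'scrapy.dupefilters.RFPDupeFilter'
DUPEFILTER_DEBUG = False
JOBDIR = "directory where the record of visited requests is saved, e.g. /root/"  # the resulting file is /root/requests.seen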
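
The idea behind this kind of filter can be sketched in a few lines: reduce each request to a fingerprint, keep the fingerprints in a set, and append them to a requests.seen file so that a later run can resume. This is a simplified illustration, not Scrapy's actual implementation; fingerprint, SeenFilter, and the example.com URLs are hypothetical.

```python
import hashlib
import os
import tempfile
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def fingerprint(method, url):
    """Hash the method and a normalized URL (query parameters sorted)."""
    parts = urlsplit(url)
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    normalized = urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))
    text = "%s %s" % (method.upper(), normalized)
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class SeenFilter:
    """Remember fingerprints in memory and append them to <job_dir>/requests.seen."""

    def __init__(self, job_dir):
        self.path = os.path.join(job_dir, "requests.seen")
        self.seen = set()
        if os.path.exists(self.path):
            with open(self.path) as f:
                self.seen.update(line.strip() for line in f)

    def request_seen(self, method, url):
        fp = fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        with open(self.path, "a") as f:
            f.write(fp + "\n")
        return False


job_dir = tempfile.mkdtemp()
dupefilter = SeenFilter(job_dir)
first = dupefilter.request_seen("GET", "http://example.com/page?a=1&b=2")
second = dupefilter.request_seen("GET", "http://example.com/page?b=2&a=1")
# A new filter pointed at the same directory picks up the saved fingerprints.
resumed = SeenFilter(job_dir).request_seen("GET", "http://example.com/page?a=1&b=2")
print(first, second, resumed)
```

Sorting the query parameters makes ?a=1&b=2 and ?b=2&a=1 count as the same request, and a request sent with dont_filter=True would simply bypass a check like request_seen.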

2.4 Crawling mzitu images

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.selector import Selector
from ..items import MzituItem

class MeizituSpider(scrapy.Spider):
    name = 'meizitu'
    allowed_domains = ['mzitu.com']
    # start_urls = ['http://mzitu.com/']

    def start_requests(self):
        url = 'http://www.mzitu.com/all/'
        yield Request(url=url,method='GET',callback=self.main_page)

    def main_page(self,response):
        # Get the URLs of all albums
        hxs = Selector(response = response).xpath('//p[contains(@class,"url")]/a/@href').extract()
        for url in hxs:
            req = Request(url = url,
                          callback=self.fenye)
            yield req

    def fenye(self,response):
        # Get the image URL and the title
        img_url = Selector(response=response).xpath('//div[@class="main-image"]//img/@src').extract_first().strip()
        title = Selector(response=response).xpath('//div[@class="main-image"]//img/@alt').extract_first().strip()
        yield MzituItem(img_url=img_url,title=title)
        # Get the page links from the navigation bar at the bottom
        xhs = Selector(response=response).xpath('//div[@class="pagenavi"]/a/@href').extract()
        for url in xhs:
            req = Request(
                url=url,
                callback=self.fenye,
            )
            yield req
meizitu.py
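
The spider only yields items with img_url and title; something still has to save the images. One possible item pipeline is sketched below. It is an assumption rather than the original project's code: MzituPipeline, fetch_image, the images folder, and the use of the site root as the Referer are all hypothetical, and the class is written as a plain Python class, so it would still need to be registered in ITEM_PIPELINES.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def fetch_image(img_url, referer):
    """Download the image bytes, sending a Referer header with the request."""
    request = Request(img_url, headers={"Referer": referer})
    with urlopen(request, timeout=10) as response:
        return response.read()


class MzituPipeline:
    """Hypothetical pipeline: save each item's image under <base_dir>/<title>/."""

    def __init__(self, base_dir="images", fetch=fetch_image):
        self.base_dir = base_dir
        self.fetch = fetch  # injectable so the class can be tried without a network

    def process_item(self, item, spider):
        folder = os.path.join(self.base_dir, item["title"])
        os.makedirs(folder, exist_ok=True)
        file_name = posixpath.basename(urlsplit(item["img_url"]).path)
        # Assumption: the site root is accepted as the Referer.
        data = self.fetch(item["img_url"], "http://www.mzitu.com/")
        with open(os.path.join(folder, file_name), "wb") as f:
            f.write(data)
        return item


# Offline trial with a stub in place of the real download.
pipeline = MzituPipeline(base_dir=tempfile.mkdtemp(),
                         fetch=lambda url, referer: b"fake-bytes")
item = {"img_url": "http://i.example.com/2018/01/01a01.jpg", "title": "demo"}
pipeline.process_item(item, spider=None)
saved = os.path.join(pipeline.base_dir, "demo", "01a01.jpg")
print(os.path.exists(saved))
```

The blocking urlopen call keeps the sketch short but stalls Scrapy's event loop while each image downloads, so it is only suitable for small experiments.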
