Premise:
Around May of last year I wrote an article on scraping the Maoyan movie chart with the Requests library. I happened to come across that article again today, and since I have recently been learning the Scrapy framework, I decided it would be fun to scrape the same chart once more, this time with Scrapy.
Description:
The Maoyan chart page scraped this time looks roughly like Figure 1-1. The fields to extract are the movie title, starring actors, release date, rating, and poster image link; the poster images are then downloaded and saved locally, as shown in Figure 1-2.
Figure 1-1
Figure 1-2
Crawler analysis:
1. Open the page in Chrome and press F12 to bring up the developer tools. Click the arrow icon in the top-left corner of the panel and hover over a movie title on the page; the corresponding element in the page source is highlighted, and its tags and attributes can be read straight from the source, as shown in Figure 2-1:
Figure 2-1
2. As the screenshot shows, the information we need sits in these nodes and their attribute values, so the next question is how to extract them. The simplest way is to select a node, right-click it, and choose "Copy - Copy XPath", then use the resulting XPath expression to locate the element and pull out its text or attributes; a small example of testing such an expression follows below. For the full XPath syntax, consult any reference you like.
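Before wiring the expressions into the spider, it helps to test them interactively, either with scrapy shell <url> or against a small hand-written fragment using Scrapy's Selector. The fragment below is invented to mirror the structure the spider's XPath assumes; it is not copied from the real Maoyan page:

from scrapy.selector import Selector

# toy markup with the same shape the spider's XPath expressions expect
html = '''
<dl class="board-wrapper">
  <dd>
    <a href="/films/1" title="霸王别姬"></a>
    <div><div><div>
      <p class="name">霸王别姬</p>
      <p class="star">主演:张国荣</p>
      <p class="releasetime">上映时间:1993-01-01</p>
    </div></div></div>
  </dd>
</dl>
'''

sel = Selector(text=html)
for dd in sel.xpath('//dl[@class="board-wrapper"]/dd'):
    print(dd.xpath('./a/@title').extract_first())                 # movie title
    print(dd.xpath('./div/div/div/p[2]/text()').extract_first())  # starring line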
Code:
spider file (top_100.py)
# -*- coding: utf-8 -*-
import urllib.parse

import scrapy

from maoyan.items import MaoyanItem


class Top100Spider(scrapy.Spider):
    name = 'top_100'
    allowed_domains = ['trade.maoyan.com']
    start_urls = ['https://trade.maoyan.com/board/4']

    def parse(self, response):
        dd_list = response.xpath('//dl[@class="board-wrapper"]/dd')
        for dd in dd_list:
            item = MaoyanItem()
            item['name'] = dd.xpath('./a/@title').extract_first()  # movie title
            item['starring'] = dd.xpath('./div/div/div/p[2]/text()').extract_first()  # starring actors
            if item['starring'] is not None:
                item['starring'] = item['starring'].strip()
            item['releasetime'] = dd.xpath('./div/div/div/p[3]/text()').extract_first()  # release date
            score_one = dd.xpath('./div/div/div[2]/p/i[1]/text()').extract_first()  # integer part of the rating
            score_two = dd.xpath('./div/div/div[2]/p/i[2]/text()').extract_first()  # decimal part of the rating
            item['score'] = (score_one or '') + (score_two or '')
            # follow the movie's detail page to pick up the poster image URL
            url = 'https://trade.maoyan.com' + dd.xpath('./a/@href').extract_first()
            yield scrapy.Request(
                url,
                callback=self.parse_detail,
                meta={'item': item},
            )

        # follow the "next page" link, if there is one
        next_page = response.xpath('//div[@class="pager-main"]/ul/li/a[contains(text(), "下一页")]/@href').extract_first()
        if next_page is not None:
            print('Now crawling: %s' % next_page)
            new_link = urllib.parse.urljoin(response.url, next_page)
            yield scrapy.Request(
                new_link,
                callback=self.parse,
            )

    def parse_detail(self, response):
        item = response.meta['item']
        item['image'] = response.xpath('//div[@class="celeInfo-left"]/div/img/@src').extract_first()  # poster image URL
        yield item
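With the spider in place (plus the item, pipeline, and settings shown below), the crawl is started from the project root with the command scrapy crawl top_100. If you prefer to launch it from a plain Python script instead, Scrapy's CrawlerProcess can be used; the following is only a minimal sketch, assuming it is run from the project root and that the spider module is named top_100.py (the module name is an assumption):

# run.py (sketch)
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from maoyan.spiders.top_100 import Top100Spider

process = CrawlerProcess(get_project_settings())  # load the project's settings.py
process.crawl(Top100Spider)
process.start()  # blocks until the crawl finishes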
items.py code
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MaoyanItem(scrapy.Item):
    name = scrapy.Field()         # movie title
    starring = scrapy.Field()     # starring actors
    releasetime = scrapy.Field()  # release date
    image = scrapy.Field()        # poster image URL
    score = scrapy.Field()        # rating
pipelines.py code
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

import scrapy
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline


# Use ImagesPipeline to download the poster images
class MaoyanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        print('item image is', item['image'])
        # schedule a download request for the poster URL collected by the spider
        yield scrapy.Request(item['image'])

    def item_completed(self, results, item, info):
        # results is a list of (success, result) tuples, one per image request
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item
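As the template comment notes, this pipeline only runs if it is registered in ITEM_PIPELINES, which the settings.py shown below does not do. Assuming the default layout generated by scrapy startproject maoyan, something like the following needs to be added to settings.py (note that ImagesPipeline also requires the Pillow library to be installed):

ITEM_PIPELINES = {
    'maoyan.pipelines.MaoyanPipeline': 300,
}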
settings.py code
# -*- coding: utf-8 -*-

# Scrapy settings for the maoyan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

import random

BOT_NAME = 'maoyan'

SPIDER_MODULES = ['maoyan.spiders']
NEWSPIDER_MODULE = 'maoyan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'maoyan (+http://www.yourdomain.com)'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Pool of User-Agent strings to choose from
USER_AGENTS_LIST = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]

# NOTE: this picks one User-Agent at random when Scrapy starts up;
# the same value is then used for the whole run.
USER_AGENT = random.choice(USER_AGENTS_LIST)

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': USER_AGENT,
}

# Directory where the downloaded poster images are saved
IMAGES_STORE = 'D:\\MaoYan'
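Because USER_AGENT is chosen once at startup, every request in a run carries the same header. If you want a fresh User-Agent on each request, a small downloader middleware can handle it. The sketch below is not part of the original article's code; it assumes the class lives in the project's middlewares.py and the class name is freely chosen:

# middlewares.py (sketch)
import random

from maoyan.settings import USER_AGENTS_LIST


class RandomUserAgentMiddleware:
    def process_request(self, request, spider):
        # overwrite the User-Agent header before the request is sent
        request.headers['User-Agent'] = random.choice(USER_AGENTS_LIST)
        return None  # let Scrapy continue handling the request

# enable it in settings.py:
# DOWNLOADER_MIDDLEWARES = {
#     'maoyan.middlewares.RandomUserAgentMiddleware': 400,
# }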
Summary:
That is how to scrape the Maoyan top 100 movie chart with Scrapy. The approach itself is not difficult; the main challenge lies in writing the XPath expressions that locate the elements and extract the data. Once the crawl succeeds, all that is left is to sit back and enjoy the movies. Have a great weekend!