Exercise Overview
Requirements:
This exercise applies Scrapy knowledge to crawl the short-comment data for the books on the first 2 pages (50 books) of the Douban Books Top 250 (https://book.douban.com/top250) and store it in an Excel file, with three fields:
Book title
Reviewer ID
Short-comment text
1. Create the Scrapy project
D:\USERDATA\python>scrapy startproject duanping
New Scrapy project 'duanping', using template directory 'c:\users\www1707\appdata\local\programs\python\python37\lib\site-packages\scrapy\templates\project', created in:
    D:\USERDATA\python\duanping

You can start your first spider with:
    cd duanping
    scrapy genspider example example.com

D:\USERDATA\python>
2. Create the spider file D:\USERDATA\python\duanping\duanping\spiders\duanping.py
import scrapy
import bs4
import re
import math
from ..items import DuanpingItem

class DuanpingItemSpider(scrapy.Spider):
    name = 'duanping'
    allowed_domains = ['book.douban.com']
    # First 2 pages of the Top 250 list, 25 books each = 50 books
    # (the original listed start=0 twice, which only covered the first page)
    start_urls = ['https://book.douban.com/top250?start=0',
                  'https://book.douban.com/top250?start=25']

    def parse(self, response):
        # Each book cover on the list page is an <a class="nbg"> linking to its detail page
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        datas = bs.find_all('a', class_='nbg')
        for data in datas:
            book_url = data['href']
            yield scrapy.Request(book_url, callback=self.parse_book)

    def parse_book(self, response):
        book_url = response.url  # cleaner than parsing str(response)
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        # The detail page links to the comments with text like "全部 112943 条"
        comments = int(bs.find('a', href=re.compile('^https://book.douban.com/subject/.*/comments/')).text.split(' ')[1])
        pages = math.ceil(comments / 20) + 1  # 20 short comments per page
        # for i in range(1, pages):          # all pages; capped at 2 below
        for i in range(1, 3):
            comments_url = '{}comments/hot?p={}'.format(book_url, i)
            yield scrapy.Request(comments_url, callback=self.parse_comment)

    def parse_comment(self, response):
        bs = bs4.BeautifulSoup(response.text, 'html.parser')
        book_name = bs.find('a', href=re.compile('^https://book.douban.com/subject/')).text
        datas = bs.find_all('li', class_='comment-item')
        for data in datas:
            item = DuanpingItem()
            item['book_name'] = book_name
            # index [1] skips the avatar link and takes the text link to the reviewer
            item['user_id'] = data.find_all('a', href=re.compile('^https://www.douban.com/people/'))[1].text
            item['comment'] = data.find('span', class_='short').text
            yield item
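The page-count arithmetic in parse_book can be checked in isolation: the detail page shows a total such as "全部 112943 条", each hot-comment page holds 20 short comments, and the +1 compensates for range() excluding its end value. A minimal sketch of that arithmetic (the count and subject URL are taken from the draft notes at the bottom of this post):

```python
import math

# Link text as it appears on the detail page
link_text = '全部 112943 条'
comments = int(link_text.split(' ')[1])

pages = math.ceil(comments / 20) + 1  # +1 because range(1, pages) stops before pages
print(comments)   # 112943
print(pages)      # 5649

# The spider itself caps each book at the first 2 hot-comment pages:
urls = ['https://book.douban.com/subject/1770782/comments/hot?p={}'.format(i)
        for i in range(1, 3)]
print(urls)
```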
3. Edit the item file D:\USERDATA\python\duanping\duanping\items.py
import scrapy

class DuanpingItem(scrapy.Item):
    book_name = scrapy.Field()
    user_id = scrapy.Field()
    comment = scrapy.Field()
4. Edit the pipeline file D:\USERDATA\python\duanping\duanping\pipelines.py
import openpyxl

class DuanpingPipeline(object):
    def __init__(self):
        self.wb = openpyxl.Workbook()
        self.ws = self.wb.active
        # Header row: book title, reviewer ID, comment text
        self.ws.append(['书名', '评论ID', '评论内容'])

    def process_item(self, item, spider):
        line = [item['book_name'], item['user_id'], item['comment']]
        self.ws.append(line)
        return item

    def close_spider(self, spider):
        self.wb.save('./save.xlsx')
        self.wb.close()
5. Edit D:\USERDATA\python\duanping\duanping\settings.py
BOT_NAME = 'duanping'

SPIDER_MODULES = ['duanping.spiders']
NEWSPIDER_MODULE = 'duanping.spiders'

USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 1

ITEM_PIPELINES = {
    'duanping.pipelines.DuanpingPipeline': 300,
}

# Optional CSV feed alongside the Excel pipeline; the format name must be lowercase
FEED_URI = './save.csv'
FEED_FORMAT = 'csv'
FEED_EXPORT_ENCODING = 'utf-8-sig'
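FEED_EXPORT_ENCODING = 'utf-8-sig' matters here: a plain UTF-8 CSV with Chinese text often opens as mojibake in Excel on Windows, while the -sig variant prepends a byte-order mark that Excel uses to detect the encoding. A stdlib-only sketch of what that setting actually writes:

```python
import csv
import io

# utf-8-sig emits a 3-byte BOM (EF BB BF) before the data; Excel reads it
# as an encoding hint, so the Chinese headers render correctly.
buf = io.BytesIO()
text = io.TextIOWrapper(buf, encoding='utf-8-sig', newline='')
csv.writer(text).writerow(['书名', '评论ID', '评论内容'])
text.flush()

print(buf.getvalue()[:3])  # b'\xef\xbb\xbf'
```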
6. Run the command: scrapy crawl duanping
Scratch notes below
1. Book list page
<a class="nbg" href="https://book.douban.com/subject/1083428/"
2. Book detail page
Total comment count: <a href="https://book.douban.com/subject/1770782/comments/">全部 112943 条</a>
3. Book short-comments page
Book title: <a href="https://book.douban.com/subject/1770782/">追风筝的人</a>
find_all: <li class="comment-item" data-cid="693413905">
Reviewer ID (index [1]): <a title="九尾黑猫" href="https://www.douban.com/people/LadyInSatin/">
<a href="https://www.douban.com/people/LadyInSatin/">九尾黑猫</a>
Short-comment text: <span class="short">“为你,千千万万遍。”我想,小说描写了一种最为诚挚的情感,而且它让你相信有些东西依然存在。在这个没有人相信承诺的年代,让人再次看到承诺背后那些美丽复杂的情感。这是一本好看的书,它让你重新思考。</span>
<li class="comment-item" data-cid="693413905">
  <div class="avatar">
    <a title="福根儿" href="https://www.douban.com/people/fugen/">
      <img src="https://img3.doubanio.com/icon/u3825598-141.jpg">
    </a>
  </div>
  <div class="comment">
    <h3>
      <span class="comment-vote">
        <span id="c-693413905" class="vote-count">4756</span>
        <a href="javascript:;" id="btn-693413905" class="j a_show_login" data-cid="693413905">有用</a>
      </span>
      <span class="comment-info">
        <a href="https://www.douban.com/people/fugen/">福根儿</a>
        <span class="user-stars allstar50 rating" title="力荐"></span>
        <span>2013-09-18</span>
      </span>
    </h3>
    <p class="comment-content">
      <span class="short">为你,千千万万遍。</span>
    </p>
  </div>
</li>
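The selectors in parse_comment map directly onto this fragment: the reviewer name is the text of the second link into douban.com/people/ (the first is the avatar), and the comment is the text of <span class="short">. A stdlib-only sketch of the same extraction (the spider uses BeautifulSoup; plain regex stands in here, on a trimmed copy of the snippet above):

```python
import re

# Trimmed copy of the comment-item fragment, avatar link omitted
html = '''<li class="comment-item" data-cid="693413905">
  <span class="comment-info">
    <a href="https://www.douban.com/people/fugen/">福根儿</a>
  </span>
  <p class="comment-content">
    <span class="short">为你,千千万万遍。</span>
  </p>
</li>'''

# Reviewer name: the text of the link into douban.com/people/
user_id = re.search(r'people/[^"]+/">([^<]+)</a>', html).group(1)
# Short comment: the text of <span class="short">
comment = re.search(r'<span class="short">([^<]+)</span>', html).group(1)

print(user_id)   # 福根儿
print(comment)   # 为你,千千万万遍。
```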