2019-07-11 Study Notes
Scraping Qiushibaike (糗事百科)
Create a launcher file in the root directory of the Scrapy project so you don't have to type the startup command on the command line every time:
```python
from scrapy import cmdline

# Equivalent to: cmdline.execute(['scrapy', 'crawl', 'qsbk_spider'])
cmdline.execute("scrapy crawl qsbk_spider".split())
```
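The equivalence noted in the comment comes down to `str.split()`, which with no arguments splits on runs of whitespace and produces exactly the argv-style list that `cmdline.execute` expects:

```python
# str.split() with no arguments splits on whitespace,
# yielding the same list as writing the arguments by hand.
cmd = "scrapy crawl qsbk_spider".split()
print(cmd)  # ['scrapy', 'crawl', 'qsbk_spider']
```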
The spider code is as follows:
```python
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http.response.html import HtmlResponse
from scrapy.selector.unified import SelectorList

from qsbk.items import QsbkItem


class QsbkSpiderSpider(scrapy.Spider):
    name = 'qsbk_spider'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['https://www.qiushibaike.com/text/page/1/']

    def parse(self, response):
        # response.xpath() returns a SelectorList
        duanzidivs = response.xpath("//div[@id='content-left']/div")
        for duanzidiv in duanzidivs:
            author = duanzidiv.xpath(".//h2/text()").get().strip()
            article = duanzidiv.xpath(".//div[@class='content']//text()").getall()
            article = "".join(article).strip()
            # QsbkItem accepts only the fields declared in items.py;
            # passing any extra or missing field raises an error
            item = QsbkItem(author=author, article=article)
            yield item
```
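The extraction logic in `parse` can be illustrated without Scrapy. The sketch below uses the standard library's `xml.etree` on a made-up snippet that only mimics the page's markup (the real page structure may differ); `itertext()` plays the role of `//text()` followed by `"".join(...)`:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet mimicking one post <div> from the page.
snippet = """
<div>
  <h2>
    some_author
  </h2>
  <div class="content"><span>line one</span><span>line two</span></div>
</div>
"""

root = ET.fromstring(snippet)
# like .//h2/text() followed by .strip()
author = root.find("h2").text.strip()
# itertext() gathers all descendant text nodes, like //text() in XPath,
# and "".join(...) merges them into a single string
article = "".join(root.find("div[@class='content']").itertext()).strip()
print(author)   # some_author
print(article)  # line oneline two
```

Note that joining with `""` concatenates the text fragments with no separator, which is exactly what the spider does with `getall()`.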
The pipeline code is as follows:
```python
import json


class QsbkPipeline(object):
    def __init__(self):
        self.fp = open("duanzi.json", 'w', encoding='utf-8')

    def open_spider(self, spider):
        print("Spider started...")

    def process_item(self, item, spider):
        item_json = json.dumps(dict(item), ensure_ascii=False)
        self.fp.write(item_json + '\n')
        return item

    def close_spider(self, spider):
        # close the output file so buffered lines are flushed to disk
        self.fp.close()
        print("Spider finished...")
```
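The reason for `ensure_ascii=False` is worth seeing directly: with the default setting, `json.dumps` escapes Chinese characters into `\uXXXX` sequences, which makes `duanzi.json` unreadable. A small standalone comparison (the sample item is made up):

```python
import json

# A made-up item of the same shape as QsbkItem.
item = {"author": "张三", "article": "一个段子"}

escaped = json.dumps(item)                       # default: ASCII-escaped
readable = json.dumps(item, ensure_ascii=False)  # keeps UTF-8 text as-is

print(readable)  # {"author": "张三", "article": "一个段子"}
```

Both forms round-trip back to the same dict through `json.loads`; only the on-disk readability differs.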
In settings.py, uncomment the following to enable the pipeline:
```python
ITEM_PIPELINES = {
    'qsbk.pipelines.QsbkPipeline': 300,
}
```
The items.py code is as follows:
```python
import scrapy


class QsbkItem(scrapy.Item):
    author = scrapy.Field()
    article = scrapy.Field()
```
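The "fixed fields" behavior the spider relies on (extra or misspelled fields cause an error) comes from `scrapy.Item` rejecting any key not declared as a `Field()`. A rough standard-library sketch of that idea, where `StrictItem` is a made-up class and not Scrapy's actual implementation:

```python
# Illustrative sketch only: scrapy.Item behaves roughly like a dict
# that rejects keys not declared as Field()s on the class.
class StrictItem(dict):
    fields = {"author", "article"}  # stands in for the Field() declarations

    def __init__(self, **kwargs):
        for key in kwargs:
            if key not in self.fields:
                raise KeyError(f"StrictItem does not support field: {key}")
        super().__init__(**kwargs)


item = StrictItem(author="someone", article="a joke")
print(item["author"])  # someone
```

Catching typos this way at item-creation time is one of the main reasons to declare items instead of yielding plain dicts from the spider.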