Scrape with scrapy using saved html pages: answers

【Question title】: Scrape with scrapy using saved html pages
【Posted】: 2018-11-09 10:03:58
【Question】:

I'm looking for a way to use Scrapy with HTML pages that I have saved on my computer. In my case, I get this error:

requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'

SPIDER_START_URLS = ["file:///home/stage/Guillaume/scraper_test/mypage/details.html"]

【Comments】:

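1. Unless I'm mistaken, Scrapy has supported the file: scheme for a long time. 2. Judging by the log you shared, the error looks like it was raised by requests, the well-known HTTP client library, not by Scrapy.
I really don't know at this point, since I'm new to this, so I won't waste any more time and will just use a static server.
Sorry for not being clear. I think you may need to provide more information (more log lines? some relevant code? etc.) before others can dig further and help.
Full log: Unhandled error in Deferred: 2018-11-09 13:05:25 [twisted] CRITICAL: Traceback (most recent call last): File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/twisted/internet/defer.py", line 1386, in _inlineCallbacks result = g.send(result) File "/home/stage/miniconda3/envs/scrapy_env/lib/python3.6/site-packages/scrapy/crawler.py", line 82, in crawl yield self.engine.open_spider(self.spider, start_requests) requests.exceptions.InvalidSchema: No connection adapters were found for 'file:///home/stage/Guillaume/scraper_test/mypage/details.html'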
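
The two routes raised in the comments (a file: URL and a local static server) can be sketched with the standard library alone. This is only an illustration under assumptions: the file name details.html, the page content, and the helper names serve_directory and demo are all made up here, and the asker's SPIDER_START_URLS setting is not shown to be consumed by this code. It builds a file:// URL with pathlib and serves the same directory over http://127.0.0.1 so that an HTTP-only client such as requests could fetch the page.

```python
import threading
import urllib.request
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path
from tempfile import TemporaryDirectory


class QuietHandler(SimpleHTTPRequestHandler):
    """Static file handler that does not log every request to stderr."""

    def log_message(self, format, *args):
        pass


def serve_directory(directory):
    """Serve `directory` on a free localhost port in a background thread."""
    handler = partial(QuietHandler, directory=str(directory))
    # Port 0 asks the operating system to pick any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def demo():
    """Return (file_url, http_url, body) for one temporary saved page."""
    with TemporaryDirectory() as tmp:
        # Hypothetical stand-in for the page saved on disk.
        page = Path(tmp) / "details.html"
        page.write_text("<html><body><h1>Details</h1></body></html>", encoding="utf-8")

        # Route 1: a well-formed file:// URL for the saved page.
        file_url = page.resolve().as_uri()

        # Route 2: the same page served over plain HTTP.
        server = serve_directory(tmp)
        try:
            http_url = f"http://127.0.0.1:{server.server_address[1]}/details.html"
            # An empty ProxyHandler keeps any configured proxy out of the way.
            opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
            with opener.open(http_url) as response:
                body = response.read().decode("utf-8")
        finally:
            server.shutdown()
            server.server_close()
    return file_url, http_url, body


if __name__ == "__main__":
    for value in demo():
        print(value)
```

Either URL form could then be tried as a start URL. The HTTP form sidesteps the InvalidSchema error because requests only ships connection adapters for http:// and https://.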

Tags: html web-scraping scrapy local

【Solution 1】:

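I have had great success using request_fingerprint to inject existing HTML files into HTTPCACHE_DIR (which is almost always .scrapy/httpcache/${spider_name}). Then turn on the http cache middleware, which defaults to the file-based cache storage and to the "Dummy Policy". That policy treats the on-disk file as authoritative and makes no network request when it finds the URL in the cache.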
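
As a rough sketch of the middleware side, the settings below are the ones I would expect to matter in a project's settings.py. The values for the directory, storage, policy and expiration are simply Scrapy's documented defaults written out explicitly. HTTPCACHE_IGNORE_MISSING is an optional extra that the answer does not mention and is included here as an assumption about what an offline run wants.

```python
# settings.py (sketch): replay saved pages from the HTTP cache instead of the network

# Turns on the HTTP cache downloader middleware (disabled by default).
HTTPCACHE_ENABLED = True

# A relative path is resolved inside the project's .scrapy data directory.
HTTPCACHE_DIR = "httpcache"

# File-based storage: one directory per spider, one subdirectory per request.
HTTPCACHE_STORAGE = "scrapy.extensions.httpcache.FilesystemCacheStorage"

# Dummy policy: any cached response is served as-is, with no revalidation.
HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"

# 0 means cached responses never expire.
HTTPCACHE_EXPIRATION_SECS = 0

# Optional assumption: ignore requests that are not in the cache rather than download them.
HTTPCACHE_IGNORE_MISSING = True
```

With these in place, the spider keeps requesting the original http(s) URLs, and the cache answers them from disk.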

I would expect the injection script to look something like this (it is only the general idea and is not even guaranteed to run):

import sys
from scrapy.extensions.httpcache import FilesystemCacheStorage
from scrapy.http import Request, HtmlResponse
from scrapy.settings import Settings

# this value is the actual URL from which the on-disk file was saved
# not the "file://" version
url = sys.argv[1]
html_filename = sys.argv[2]
with open(html_filename, 'rb') as fh:  # binary mode, so html_bytes really holds bytes
    html_bytes = fh.read()
req = Request(url=url)
resp = HtmlResponse(url=req.url, body=html_bytes, encoding='utf-8', request=req)
settings = Settings()
cache = FilesystemCacheStorage(settings)
spider = None  # fill in an instance of your Spider class here; the cache path is built from spider.name
cache.store_response(spider, req, resp)

【Comments】: