Scrape only what has been added since the last scrape: answers

【Question title】: Scrape just what has been added since last scrape
【Posted】: 2012-07-04 01:29:04
【Question description】:

I need to crawl a website whose links basically look like this:

www.website.com/link/page_1.html
www.website.com/link/page_2.html
www.website.com/link/page_3.html
...

The scraped content goes straight into a database through an item pipeline.

It is easy to tell Django something like:

if item exists do not insert it, otherwise insert it
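
As a minimal sketch of the "insert only if missing" rule above, the following stand-in uses Python's built-in sqlite3 rather than the Django models from the question. The class name `DedupPipeline`, the `items` table, and the `url`/`title` fields are assumptions made for illustration; only the `process_item(item, spider)` method shape follows Scrapy's item pipeline convention.

```python
import sqlite3


class DedupPipeline:
    """Hypothetical pipeline: store an item only if its URL is not stored yet."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        # The PRIMARY KEY on url also guards against duplicates at the DB level.
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (url TEXT PRIMARY KEY, title TEXT)"
        )

    def process_item(self, item, spider=None):
        # "if item exists do not insert it, otherwise insert it"
        exists = self.conn.execute(
            "SELECT 1 FROM items WHERE url = ?", (item["url"],)
        ).fetchone()
        if exists is None:
            self.conn.execute(
                "INSERT INTO items (url, title) VALUES (?, ?)",
                (item["url"], item["title"]),
            )
            self.conn.commit()
        return item
```

With Django models, `Model.objects.get_or_create(...)` plays the same role as the SELECT followed by INSERT here.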

But is there a way to scrape only the links that were added since the last scrape?

For example, after website.com inserts new items:

/link/page_1.html becomes /link/page_2.html
new items populate /link/page_1.html

At that point, what do I need to tell Scrapy so that it scrapes only the items added since the last scrape?
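
To make the scenario concrete, here is one possible reading of the requirement, sketched without Scrapy. It assumes that items are listed newest first and that older items only ever move to later pages. `fetch_page` and `seen_urls` are hypothetical names: the first stands for whatever downloads and parses a listing page, the second for the set of item URLs already stored in the database.

```python
def collect_new_items(fetch_page, seen_urls, max_pages=1000):
    """Walk listing pages from page 1 and stop at the first known item.

    fetch_page(n) is a hypothetical callable returning the item URLs found
    on page n, newest first. seen_urls holds URLs stored by earlier runs.
    """
    new_items = []
    for page_number in range(1, max_pages + 1):
        urls = fetch_page(page_number)
        if not urls:
            break  # ran past the last page
        for url in urls:
            if url in seen_urls:
                # Under the newest-first assumption, everything from here
                # on was already scraped by a previous run.
                return new_items
            new_items.append(url)
    return new_items
```

If the site does not list items strictly newest first, this early stop would miss items, and a per-item existence check is the safer fallback.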

【Question comments】:

【Solution 1】:

Recent versions of Scrapy can serialize pending requests to disk [1], and there is also a Redis integration by Rolando [2].
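
As a rough illustration of the on-disk option, Scrapy reads a `JOBDIR` setting that names a directory where the scheduler queue and the set of already-seen requests are persisted between runs. The directory path below is a placeholder, not something taken from the answer.

```python
# settings.py (fragment): keep crawl state between runs.
# "crawls/somespider-1" is a hypothetical path; use one directory per spider.
JOBDIR = "crawls/somespider-1"
```

The same setting can be passed on the command line with `scrapy crawl somespider -s JOBDIR=crawls/somespider-1`. One caveat, stated as an assumption about how this applies to the question: the persisted duplicate filter works on requests, so a listing URL such as page_1.html whose content changes may need to be requested with `dont_filter=True` in order to be fetched again.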

【Comments】: