Scrapy - Limiting middle directories (Python)

【Question title】: Scrapy - Limiting middle directories (Python)
【Posted】: 2013-10-21 22:26:04
【Question】:

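Is there a way, in an SgmlLinkExtractor rule, to allow only a limited number of directories (say, 3) between /static/ and /otherstuff/? In the examples below, EX1 would not be crawled (there are four directories between /static/ and /otherstuff/), but EX2 would be.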

EX1: http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap
EX2: http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap

Assume /static/ and /otherstuff/ are always on either side of the directories I care about.

Thanks a TON for any help!

【Question comments】:

Tags: python regex web-crawler scrapy

【Solution 1】:

You can either use a regular expression in the allow argument or a test function in the process_value argument. (See the docs.)

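Each has advantages and drawbacks, depending on what the links in your pages look like. With a regular expression, you test against the fully qualified URL (e.g. http://domain.com/foo/bar). With the process_value argument, you receive the raw value found in the web page (e.g. /foo/bar or, worse, a relative link).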

For example, the regular expression domain.com/(?:\w+/){1,3}\w+$ matches

domain.com/foo/bar
domain.com/foo/bar/foo
domain.com/foo/bar/foo/bar

but not

domain.com/foo/
domain.com/foo/bar/foo/bar/foo
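
The match list above can be checked outside Scrapy with the standard re module. This is only a sketch: it uses re.search so the pattern may match anywhere in the string, and the names PATTERN, candidates and results are made up for the illustration.

```python
import re

# Pattern copied from the answer: 1 to 3 "word/" segments after the
# domain, followed by a final word at the end of the string.
PATTERN = re.compile(r"domain.com/(?:\w+/){1,3}\w+$")

candidates = [
    "domain.com/foo/bar",
    "domain.com/foo/bar/foo",
    "domain.com/foo/bar/foo/bar",
    "domain.com/foo/",
    "domain.com/foo/bar/foo/bar/foo",
]

# Map each candidate to True (matches) or False (does not match).
results = {url: PATTERN.search(url) is not None for url in candidates}

for url, matched in results.items():
    print(url, matched)
```

Note that the dot in domain.com is unescaped, so it matches any character; writing domain\.com is stricter.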

If you use process_value, a function like this would work

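def filter_path(value):
    # keep hrefs with at least 2 and at most 3 '/' characters
    if 1 < value.count('/') < 4:
        return value
    # returning None tells the link extractor to ignore the link
    return None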

The function above assumes that the links in your HTML have root-relative href values such as /foo, /foo/bar/foo, etc.

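In your specific case, the regular expression would be something like domain.com/static/(?:\w+/){3}otherstuff, which requires exactly three directories; change the quantifier to {1,3} to allow up to three, which is what EX2 needs. The filter_path function could check value.startswith('/static/') and the suffix.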
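
As a rough sketch of both ideas for the /static/ ... /otherstuff/ case: the regex below uses {1,3} so that one to three directories are accepted, and this filter_path variant counts the directories between the two markers. The names STATIC_RE and max_dirs are assumptions for the example, and the filter assumes root-relative hrefs that contain a literal /otherstuff/ segment.

```python
import re

# One to three directories between /static/ and /otherstuff/.
STATIC_RE = re.compile(r"domain\.com/static/(?:\w+/){1,3}otherstuff/")

EX1 = "http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap"
EX2 = "http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap"


def filter_path(value, max_dirs=3):
    """process_value-style filter: return the href to keep it, None to drop it."""
    if not value.startswith('/static/'):
        return None
    # Keep the leading slash so '/otherstuff/' can be found as a whole segment.
    middle, sep, _rest = value[len('/static'):].partition('/otherstuff/')
    if not sep:
        return None  # no /otherstuff/ segment after /static/
    dirs = [part for part in middle.split('/') if part]
    if 1 <= len(dirs) <= max_dirs:
        return value
    return None


print(bool(STATIC_RE.search(EX1)), bool(STATIC_RE.search(EX2)))
print(filter_path('/static/d1/d2/otherstuff/otherstuff2/bunchacrap'))
```

The regex version is suited to the allow argument, since it sees absolute URLs; the function version is suited to process_value, since it sees the raw href.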

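If you are using the Rule class in a CrawlSpider, there is a third option. The process_links argument lets you pass a function that processes the list of links. For example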

def url_allowed(url):
    # check for the pattern /static/dir/dir/dir/ etc
    return True

def process_links(links):
    return [l for l in links if url_allowed(l.url)]
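
One possible way to fill in url_allowed is sketched below, assuming Python 3 and the directory limit from the question. The Link namedtuple is a hypothetical stand-in for the link objects Scrapy passes in (only their url attribute is used), and PATH_RE is a name invented for this example.

```python
import re
from collections import namedtuple
from urllib.parse import urlparse

# Hypothetical stand-in for Scrapy's link objects; only .url is needed here.
Link = namedtuple("Link", ["url"])

# Path must start with /static/, then 1 to 3 directories, then /otherstuff/.
PATH_RE = re.compile(r"^/static/(?:[^/]+/){1,3}otherstuff/")


def url_allowed(url):
    # Test only the path component, so scheme and host do not matter.
    return PATH_RE.match(urlparse(url).path) is not None


def process_links(links):
    return [l for l in links if url_allowed(l.url)]


links = [
    Link("http://www.domain.com/static/d1/d2/d3/d4/otherstuff/otherstuff2/bunchacrap"),
    Link("http://www.domain.com/static/d1/d2/otherstuff/otherstuff2/bunchacrap"),
]
kept = process_links(links)
print([l.url for l in kept])
```

In a CrawlSpider, such a function would be handed to the rule through its process_links argument, e.g. Rule(link_extractor, process_links=process_links), where link_extractor is whichever extractor the spider already uses.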

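【Comments】:

I can't upvote with my rep, but this is awesome - thank you so much for spelling out the alternatives! I'm actually using this project to learn by experimenting, so your alternatives taught me a lot. I'm especially excited to try out process_links and url_allowed!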