How can Scrapy crawl more URLs? Answers

【Question title】: How can scrapy crawl more urls?
【Posted】: 2012-06-26 10:46:52
【Question description】:

As we can see:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    sites = hxs.select('//ul/li')
    items = []

    for site in sites:
        item = Website()
        item['name'] = site.select('a/text()').extract()
        item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()
        item['description'] = site.select('text()').extract()
        items.append(item)

    return items

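Scrapy only fetches a single page response and finds the URLs inside that response. I think that is just a surface-level crawl!

But I want more URLs, down to a defined depth.

What can I do to achieve that?

Thank you!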

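【Comments】:

Tags: python scrapy

【Solution 1】:

I did not understand your question, but I noticed several problems in your code, some of which may be related to it (see the comments in the code):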

sites = hxs.select('//ul/li')
items = []

for site in sites:
    item = Website()
    # extract() returns a list, so I guess .extract()[0] is what you expect here
    item['name'] = site.select('a/text()').extract()
    # Maybe you expect '//a[...]' to get the links within `site`, but it actually gets the links from the entire page; you should use './/a[...]'.
    # And, again, this returns a list, not a single URL.
    item['url'] = site.select('//a[contains(@href, "http")]/@href').extract()

【Comments】:

【Solution 2】:

Take a look at the documentation on Requests and Responses.

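When you scrape the first page, you collect a set of links and use them to generate second requests, each of which leads to a second callback function that scrapes the second level. In the abstract this sounds complicated, but you will see from the example code in the documentation that it is quite simple.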

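In addition, the CrawlSpider example is more fully fleshed out and gives you template code that you may simply want to adapt to your situation.

Hope this gets you started.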

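【Comments】:

【Solution 3】:

You can crawl more pages with CrawlSpider, which can be imported from scrapy.contrib.spiders, and define your rules to determine which kinds of links you want the crawler to follow.

Follow the notes here to learn how to define rules.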

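By the way, consider changing the name of your callback function. From the documentation:

Warning

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.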

【Comments】: