【Question title】: Scrapy: only parse from pages with meta noindex
【Posted】: 2014-03-10 02:31:46
【Question】:

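I am trying to crawl a website and parse only the pages that carry a meta noindex tag. What happens is that the crawler crawls the first level but finishes after the first page. It does not seem to follow links. Here is my code: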

from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from wallspider.items import Website


class mydomainSpider(CrawlSpider):
    name = "0resultsTest"
    allowed_domains = ["www.mydomain.com"]
    start_urls = ["http://www.mydomain.com/cp/3944"]

    rules = (
    Rule(SgmlLinkExtractor(allow=(),deny=()), callback="parse_items", follow= True,),
    )

    def _response_downloaded(self, response):
        sel = HtmlXPathSelector(response)
        if sel.xpath('//meta[@content="noindex"]'):
            return super(mydomainSpider, self).parse_items(response)
        return

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//html')
        items = []

        for site in sites:
            item = Website()
            item['url'] = response.url
            item['referer'] = response.request.headers.get('Referer')
            item['title'] = site.xpath('/html/head/title/text()').extract()
            item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
            items.append(item)

        yield items

【Comments】:

    Tags: python web-crawler scrapy


    【Answer 1】:

    The original _response_downloaded calls the _parse_response function, which, besides calling the callback function, also follows links. From the Scrapy source:

    def _parse_response(self, response, callback, cb_kwargs, follow=True):
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item
    
        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item
    

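    You could add the link-following part to your override, though I don't think that is the best approach (the leading _ may hint at that). Why not check for the meta tag at the start of the parse_items function? If you don't want to repeat this test, you could even write a Python decorator.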
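
The decorator idea might look roughly like the sketch below. The name only_if_noindex is hypothetical, and the sketch assumes the response object exposes an .xpath() method; on older Scrapy versions you would wrap the response in a selector first, as the question's code does. When the page has no matching meta tag, the wrapper returns an empty list, so the callback produces no items while the follow logic in _parse_response shown above still runs.

```python
import functools


def only_if_noindex(callback):
    """Run the wrapped spider callback only for pages with a noindex meta tag.

    Hypothetical helper, not part of Scrapy.
    """
    @functools.wraps(callback)
    def wrapper(self, response, **kwargs):
        # Same XPath test as in the question; an empty result is falsy.
        if not response.xpath('//meta[@content="noindex"]'):
            return []  # no items for this page
        return callback(self, response, **kwargs)
    return wrapper
```

To use it, put @only_if_noindex directly above def parse_items(self, response): in the spider class.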

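    【Comments】:

    • Checking the meta tag at the start of my parse_items looks like the simplest approach. I'll give it a try, thanks again Guy!
    • My code below doesn't seem to parse any URLs. Am I checking the meta tag correctly before parsing?
    • Nope, your code looks fine. Try adding print/log statements to debug, e.g. print response.url right at the start of the parse_items function.
    • Found the error - ERROR: Spider must return Request, BaseItem or None, got 'list' in <GET http://www.mydomain.com/browse/electronics/digital-cameras/3944_3959/> Looks like I should return the items instead of yielding them.
    • Either yield the items one by one, or accumulate them in a list and return it, but you can't yield a list. Yielding item by item is better IMO.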
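
The difference between yielding a list and yielding (or returning) its items can be seen with plain Python generators, independent of Scrapy. This is only an illustration; the function names are made up for the example.

```python
def yield_whole_list(items):
    # The consumer receives ONE object, the list itself.
    # This is the shape that triggered "got 'list'" in the error above.
    yield items


def yield_each(items):
    # The consumer receives each item separately.
    for item in items:
        yield item


def return_list(items):
    # Returning the list hands the consumer an iterable of items.
    return items


items = ["a", "b"]
print(list(yield_whole_list(items)))  # [['a', 'b']]
print(list(yield_each(items)))        # ['a', 'b']
print(list(return_list(items)))       # ['a', 'b']
```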
    【Answer 2】:

    I believe checking the meta tag at the start of my parse_items, as @Guy Gavriely suggested, is my best option. I'll test the code below and see.

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    
    from wallspider.items import Website
    
    
    class mydomainSpider(CrawlSpider):
        name = "0resultsTest"
        allowed_domains = ["www.mydomain.com"]
        start_urls = ["http://www.mydomain.com/cp/3944"]
    
        rules = (
        Rule(SgmlLinkExtractor(allow=(),deny=()), callback="parse_items", follow= True,),
        )
    
        def parse_items(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.select('//html')
            items = []
    
            if hxs.xpath('//meta[@content="noindex"]'):
                for site in sites:
                    item = Website()
                    item['url'] = response.url
                    item['referer'] = response.request.headers.get('Referer')
                    item['title'] = site.xpath('/html/head/title/text()').extract()
                    item['robots'] = site.select('//meta[@name="robots"]/@content').extract()
                    items.append(item)
    
                yield items
    

    Update with working code. I needed to return the items instead of yielding them:

    from scrapy.selector import HtmlXPathSelector
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    
    from wallspider.items import Website
    
    
    class mydomainSpider(CrawlSpider):
        name = "0resultsTest"
        allowed_domains = ["www.mydomain.com"]
        start_urls = ["http://www.mydomain.com/cp/3944"]
    
        rules = (
        Rule(SgmlLinkExtractor(allow=(),deny=()), callback="parse_items", follow= True,),
        )
    
        def parse_items(self, response):
            hxs = HtmlXPathSelector(response)
            sites = hxs.xpath('//html')
            items = []
    
            # Only build items for pages that carry a noindex meta tag.
            if hxs.xpath('//meta[@content="noindex"]'):
                for site in sites:
                    item = Website()
                    item['url'] = response.url
                    item['referer'] = response.request.headers.get('Referer')
                    item['title'] = site.xpath('/html/head/title/text()').extract()
                    item['robots'] = site.xpath('//meta[@name="robots"]/@content').extract()
                    items.append(item)
    
                # Return the list itself; yielding it hands Scrapy a single 'list' object.
                return items
    

    【Comments】:
