【Question title】: Scrapy CrawlSpider not following links
【Posted】: 2015-06-09 03:23:54
【Question description】:

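I am trying to scrape some attributes from all (#123) detail pages listed on this category page - http://stinkybklyn.com/shop/cheese/ - but Scrapy does not follow the link pattern I set. I have also checked the Scrapy documentation and some tutorials, but no luck!

Below is the code: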

import scrapy

from scrapy.contrib.linkextractors import LinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule

class Stinkybklyn(CrawlSpider):
    name = "Stinkybklyn"
    allowed_domains = ["stinkybklyn.com"]
    start_urls = [
        "http://stinkybklyn.com/shop/cheese/chandoka",
    ]
    Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'),
         callback='parse_items', follow=True)


    def parse_items(self, response):
        print "response", response
        hxs= HtmlXPathSelector(response)
        title=hxs.select("//*[@id='content']/div/h4").extract()
        title="".join(title)
        title=title.strip().replace("\n","").lstrip()
        print "title is:",title

Can someone tell me what I am doing wrong here?

【Question discussion】:

    Tags: python web-scraping web-crawler scrapy scrapy-spider


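    【Solution 1】:

    The key problem with your code is that you have not set rules for the CrawlSpider: the Rule(...) call in the class body is never assigned to the rules attribute.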
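
To illustrate why the missing assignment matters, here is a minimal sketch that uses stand-in classes rather than Scrapy itself (FakeRule, BrokenSpider and FixedSpider are hypothetical names invented for this example). A bare expression in a class body is evaluated and then discarded, so nothing is bound to the name rules:

```python
class FakeRule(object):
    """Hypothetical stand-in for scrapy's Rule; it only stores a pattern."""

    def __init__(self, pattern):
        self.pattern = pattern


class BrokenSpider(object):
    # The object is created and immediately discarded: no name is bound to it.
    FakeRule(r'/shop/cheese/')


class FixedSpider(object):
    # Assigning the list to `rules` turns it into a class attribute.
    rules = [FakeRule(r'/shop/cheese/')]


broken_has_rules = hasattr(BrokenSpider, "rules")
fixed_patterns = [rule.pattern for rule in FixedSpider.rules]

print(broken_has_rules)
print(fixed_patterns)
```

In the real CrawlSpider the attribute would, as far as I know, fall back to the base class's empty default instead of being absent, which is why the spider runs without error but follows nothing.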

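    Other improvements I would suggest:

    • There is no need to instantiate HtmlXPathSelector; you can use response directly
    • select() is now deprecated; use xpath() instead
    • Get the text() of the title element so that you retrieve, for example, Chandoka rather than <h4>Chandoka</h4>
    • I think you meant to start from the cheese shop catalog page: http://stinkybklyn.com/shop/cheese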
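
The difference between selecting the element and selecting its text can be sketched without Scrapy. The example below swaps in the standard library's xml.etree.ElementTree as a stand-in for Scrapy's selectors, and the HTML fragment is made up for illustration; it is not the real page markup.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed fragment that mimics the structure targeted by the XPath.
SAMPLE = (
    "<html><body><div id='content'><div>"
    "<h4>\n  Chandoka\n</h4>"
    "</div></div></body></html>"
)

root = ET.fromstring(SAMPLE)
h4 = root.find(".//*[@id='content']/div/h4")

# Serialising the element keeps the surrounding tags.
element_markup = ET.tostring(h4, encoding="unicode")

# Reading only the text node and stripping whitespace gives the bare title.
text_only = h4.text.strip()

print(repr(element_markup))
print(repr(text_only))
```

With Scrapy, appending /text() to the XPath plays the role that h4.text plays here.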

    The complete code with the improvements applied:

    from scrapy.contrib.linkextractors import LinkExtractor
    from scrapy.contrib.spiders import CrawlSpider, Rule
    
    
    class Stinkybklyn(CrawlSpider):
        name = "Stinkybklyn"
        allowed_domains = ["stinkybklyn.com"]
    
        start_urls = [
            "http://stinkybklyn.com/shop/cheese",
        ]
    
        rules = [
            Rule(LinkExtractor(allow=r'\/shop\/cheese\/.*'), callback='parse_items', follow=True)
        ]
    
        def parse_items(self, response):
            title = response.xpath("//*[@id='content']/div/h4/text()").extract()
            title = "".join(title)
            title = title.strip().replace("\n", "").lstrip()
            print "title is:", title
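
To see which links the allow pattern would accept, the regular expression can be tried on a few URLs with the standard re module. This is a sketch under the assumption that LinkExtractor applies allow as a regex search against the absolute URL; the /shop/meat/salami URL is hypothetical and only serves as a negative case.

```python
import re

# Same pattern as in the rule above.
ALLOW = re.compile(r'\/shop\/cheese\/.*')

candidate_urls = [
    "http://stinkybklyn.com/shop/cheese",           # catalog page, no trailing slash
    "http://stinkybklyn.com/shop/cheese/",          # catalog page with trailing slash
    "http://stinkybklyn.com/shop/cheese/chandoka",  # detail page from the question
    "http://stinkybklyn.com/shop/meat/salami",      # hypothetical non-cheese page
]

# search() looks for the pattern anywhere in the URL, not only at the start.
matched = [url for url in candidate_urls if ALLOW.search(url)]

for url in matched:
    print(url)
```

Because the trailing .* also matches an empty string, the pattern accepts the catalog URL with a trailing slash as well as every detail page beneath it.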
    

    【Discussion】:

      【Solution 2】:

      You seem to have some syntax errors. Try this:

      import scrapy
      from scrapy.contrib.spiders import CrawlSpider, Rule
      from scrapy.contrib.linkextractors import LinkExtractor
      from scrapy.selector import HtmlXPathSelector
      
      
      class Stinkybklyn(CrawlSpider):
          name = "Stinkybklyn"
          allowed_domains = ["stinkybklyn.com"]
          start_urls = [
              "http://stinkybklyn.com/shop/cheese/",
          ]
      
          rules = (
                  Rule(LinkExtractor(allow=(r'/shop/cheese/')), callback='parse_items'),
      
              )
      
          def parse_items(self, response):
              # The method body must be indented one level deeper than the def line.
              print "response", response
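
A side note on the parentheses in the snippet above: in Python it is the trailing comma, not the parentheses, that makes a tuple. The minimal sketch below shows the difference; the variable names are made up for this example.

```python
# Parentheses alone only group an expression, so this is still a plain string.
pattern_only = (r'/shop/cheese/')

# The trailing comma is what creates a one-element tuple.
one_item_tuple = (r'/shop/cheese/',)

print(type(pattern_only).__name__)
print(type(one_item_tuple).__name__)
```

As far as I know, LinkExtractor accepts either a single pattern string or a list of patterns for allow, so the string form works there. For rules, keeping the trailing comma after the only Rule ensures the attribute stays a sequence.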
      

      【Discussion】:
