【Question title】: Scrapy CrawlSpider is not following next links
【Posted】: 2015-11-09 17:41:41
【Question】:

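I am using Scrapy to collect press releases from the Italian national police. The problem is that the scraper does not follow the "next" link, even though I set up a rule to find the "next" button (in Italian, "Successiva") and follow that link.

Here is my code.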

from scrapy.spiders import Rule, CrawlSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from items import ScrapyCrimeScraperItem
from string import replace

class ItalyScraper(CrawlSpider):
    name = 'italy_crawler_test'
    allowed_domains = ['poliziadistato.it']

    start_urls = [
        'http://www.poliziadistato.it/archivio/category/1298/2015/',
        'http://www.poliziadistato.it/archivio/category/1298/2015/9/'
    ]

    rules = (
        Rule(LinkExtractor(allow=('http://.*/articolo/view/*.....')), callback='parse_article', follow=False),
        Rule(LinkExtractor(restrict_xpaths=("/html/body/div[@class='container'[1]/div[@class='row']/div[@class='col-md-6 col-md-push-3 padding0']/div[@class='trecolonne']/div[@class='center']/div[@class='bar']/ul[@class='paginazione']/li/a[contains(""@title,""'Successiva')]",)), follow=True),
    )

    # def generate_article_links(self, response):
    #     for href in response.css('a'):
    #         url = href.extract()
    #         yield scrapy.Request(url, callback=self.parse_article)

    def parse_article(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = ScrapyCrimeScraperItem()
        item['city'] = response.selector.css('h1').extract()[0]
        item['country'] = 'italy'
        item['site_link'] = response.url
        item['article_link'] = response.url
        item['article_raw_text'] = self.remove_carriage_returns(
            response.selector.css('.resetfont p').extract()[0])
        item['article_raw_date'] = response.selector.css('.data').extract()[0]
        item['article_translated_text'] = ''
        item['article_translated_date'] = ''
        item['article_raw_markup'] = ''
        item['crimes'] = ''
        item['locations'] = ''
        item['dateformat'] = ''
        item['reserved1'] = ''
        item['reserved2'] = ''

        yield item

    def remove_carriage_returns(self, item):
        return item.replace("\n", " ")

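I have looked at some other answers to similar questions, but I am already setting follow=True explicitly in the second rule. Do I need a callback to generate the new requests, or shouldn't the follow argument take care of generating them?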

【Comments】:

    Tags: python web-scraping scrapy scrapy-spider


    【Answer 1】:

    I think you have just messed up the quotes in the XPath expression. Use a simpler one instead:

    //ul[@class="paginazione"]/li/a[contains(@title, "Successiva")]
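
One way to check the quoting is to look at the string Python actually builds from the question's restrict_xpaths literal. The sketch below is only an illustration: the long expression is copied as it appears in this post (which may differ from the asker's real file), and bracket_balance is a hypothetical helper that naively counts brackets without parsing XPath. Python joins adjacent string literals, so a doubled quote simply ends one literal and starts the next.

```python
# The restrict_xpaths value from the question, copied as it appears in this post.
# It is split over several lines here; adjacent string literals are concatenated,
# so the resulting string is the same as the single-line original.
original_xpath = (
    "/html/body/div[@class='container'[1]/div[@class='row']"
    "/div[@class='col-md-6 col-md-push-3 padding0']/div[@class='trecolonne']"
    "/div[@class='center']/div[@class='bar']/ul[@class='paginazione']"
    "/li/a[contains(""@title,""'Successiva')]"
)

# The shorter expression suggested in the answer.
simple_xpath = '//ul[@class="paginazione"]/li/a[contains(@title, "Successiva")]'


def bracket_balance(xpath):
    """Return count('[') - count(']'); a naive check, not an XPath parser."""
    return xpath.count("[") - xpath.count("]")


# Inspect what the LinkExtractor would receive and compare bracket counts.
print(original_xpath)
print("original balance:", bracket_balance(original_xpath))
print("simple balance:", bracket_balance(simple_xpath))
```

Printed this way, the doubled quotes turn out to collapse into an ordinary contains(@title,'Successiva') predicate, while the div[@class='container'[1] step, as transcribed here, leaves one predicate unclosed, and an unclosed predicate is not valid XPath. The shorter expression avoids both that typo and the long absolute path.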
    

    【Comments】:
