How to scrape a webpage of items where each item links to a new page (answers)

【Question title】: How to scrape webpage of items, each item has link to new page
【Posted】: 2019-12-03 00:44:42
【Question】:

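I am building a web crawler with Scrapy and Python. The page I am scraping lays out each item as a card. I can scrape some information (name, location) from the cards themselves, but I also want information that is reached by clicking a card > new page > clicking a button on the new page that opens a form > scraping a value from the form. How should I structure the parse function? Do I need nested loops or separate functions?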

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    start_urls = ["example.com/page"]
    def parse(self, response):
        for page_url in response.css('a[class ~= search-  card]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse)

        for vc in response.css('div#vc-profile.container').extract():
            item = StackItem()
            item['name'] = vc.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
            item['firm'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
            item['pos'] = vc.expath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
            em = vc.xpath('/*[@id="vc-profile"]/div/div[1]/div[2]/div[2]/div/div[1]/button').extract()
            item['email'] = em.xpath('//*[@id="email"]/value').extract()
            yield item

The scraper crawls, but outputs nothing.

【Comments】:

Tags: python authentication web-scraping pagination scrapy

【Solution 1】:

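The best approach is to create an item object on the first page, scrape the data you need there, and store it in the item. Then make a request to the new URL (card > new page > click the button to form) and pass the same item along with that request. Yielding the item from the final callback solves the problem.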

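【Comments】:

【Solution 2】:

You should probably split the scraper into one 'parse' method and one 'parse_item' method. The parse method walks the listing page and yields requests for the URLs of the items whose details you want. The parse_item method receives the response for each of those requests and extracts that item's details. It is hard to say exactly what it should look like without knowing the site, but it will probably be more or less like this: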

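import scrapy
from scrapy import Spider

# StackItem is assumed to be the Item class defined in the project's items.py.

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["example.com"]
    # start_urls must include the scheme (http:// or https://), or Scrapy rejects the request.
    start_urls = ["https://example.com/page"]

    def parse(self, response):
        # Listing page: follow every card link to its detail page.
        # The class name is assumed to be "search-card"; adjust it to the site's markup.
        for page_url in response.css('a[class~="search-card"]::attr(href)').extract():
            page_url = response.urljoin(page_url)
            yield scrapy.Request(url=page_url, callback=self.parse_item)

    def parse_item(self, response):
        # Detail page: query the response directly; there is no `vc` variable here.
        item = StackItem()
        item['name'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[1]/h1/text()').extract()
        item['firm'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[1]').extract()
        item['pos'] = response.xpath('//*[@id="vc-profile"]/div/div[2]/div[1]/div[2]/h2/text()[2]').extract()
        # Assumes the form is already in the page HTML and the address is the `value` attribute of #email.
        item['email'] = response.xpath('//*[@id="email"]/@value').extract()
        yield item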

【Comments】: