Python Scrapy - saving a 'category' for each entry based on the first webpage

【Question title】: Python Scrapy - saving a 'category' for each entry based on first webpage
【Posted】: 2020-11-30 21:24:13
【Question】:

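I am scraping BBC Food for recipes. The logic is as follows:

Main page with about 20 cuisines
-> in each cuisine, there are usually about 20 recipes per letter, spread over 1-3 pages.
-> in each recipe, I scrape about 6 things (ingredients, rating, etc.)

Therefore, my logic is: go to the main page, create a request, extract all the cuisine links, then follow each one; from there, extract every page of recipes, follow each recipe link, and finally get all the data from each recipe. Note that this is not finished yet, as I still need to make the spider go through all the letters.

I would like to have a 'category' column, i.e. every recipe under the 'African cuisine' link has a column that says 'African', every recipe under 'Italian cuisine' has an 'Italian' entry in that column, and so on.

Desired outcome: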

cook_time  prep_time  name  cuisine
  10         30         A      italian
  20         10         B      italian
  30         20         C      indian
  20         10         D      indian
  30         20         E      indian

Here is my spider:

import scrapy
from recipes_cuisines.items import RecipeItem

class ItalianSpider(scrapy.Spider):
    
    name = "italian_spider"
    
    def start_requests(self):
        start_urls =  ['https://www.bbc.co.uk/food/cuisines']
        for url in start_urls:
            yield scrapy.Request(url = url, callback = self.parse_cuisines)
    
    def parse_cuisines(self, response):
        cuisine_cards = response.xpath('//a[contains(@class,"promo__cuisine")]/@href').extract()
        for url in cuisine_cards:
            yield response.follow(url = url, callback = self.parse_main)
    
    def parse_main(self, response):
        recipe_cards = response.xpath('//a[contains(@class,"main_course")]/@href').extract()
        for url in recipe_cards:
            yield response.follow(url = url, callback = self.parse_card)
        next_page = response.xpath('//div[@class="pagination gel-wrap"]/ul[@class="pagination__list"]/li[@class="pagination__list-item pagination__priority--0"]/a[@class="pagination__link gel-pica-bold"]/@href').get()
        if next_page is not None:
            next_page_url = response.urljoin(next_page)
            print(next_page_url)
            yield scrapy.Request(url = next_page_url, callback = self.parse_main)

    def parse_card(self, response):
        item = RecipeItem()
        item['name'] = response.xpath('//h1[contains(@class,"title__text")]/text()').extract()
        item['prep_time'] = response.xpath('//div[contains(@class,"recipe-metadata-wrap")]/p[@class="recipe-metadata__prep-time"]/text()').extract_first()
        item['cook_time'] = response.xpath('//p[contains(@class,"cook-time")]/text()').extract_first()
        item['servings'] = response.xpath('//p[contains(@class,"serving")]/text()').extract_first()
        item['ratings_amount'] = response.xpath('//div[contains(@class,"aggregate-rating")]/span[contains(@class,"aggregate-rating__total")]/text()').extract()
        #item['ratings_amount'] = response.xpath('//*[@id="main-content"]/div[1]/div[4]/div/div[1]/div/div[1]/div[2]/div[1]/span[2]/text()').extract()
        item['ingredients'] = response.css('li.recipe-ingredients__list-item > a::text').extract()
        return item

And the items:

import scrapy


class RecipeItem(scrapy.Item):
    name = scrapy.Field()
    prep_time = scrapy.Field()
    cook_time = scrapy.Field()
    servings = scrapy.Field()
    ratings_amount = scrapy.Field()
    rating = scrapy.Field()
    ingredients = scrapy.Field()
    cuisine = scrapy.Field()

Note that I am saving the output via:

scrapy crawl italian_spider -o test.csv

I have read the docs and tried several things, such as adding the extracted cuisine to the parse_cuisine or parse_main methods, but they all yield an error.

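【Comments】:

Not sure where the category is encoded. If it is in the URL, you can get it from response.url somehow; otherwise I would assume it takes some extra scraping. You can then pass the category string as an optional argument to parse_card, as shown here: stackoverflow.com/a/60035564/9360161 (see the docs for your current Scrapy version, as interfaces change over time.)
Unfortunately not: the URL does not contain the cuisine, and it is not anywhere on the recipe page either. I will check the thread you linked.
No, the idea is that the cuisine category is probably on the main page (I assume, if it is not in the URL); you extract it there, i.e. in parse_main, and then pass it down to parse_card for each recipe page so it can be stored. The link contains an example of how to pass values down.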

Tags: python scrapy

【Solution 1】:

There are two ways to do this. The most common way to pass information from one page to another is to use cb_kwargs in your scrapy.Request:

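import scrapy

def parse_cousine(self, response):
    # The cuisine name is taken from the <h1> heading of the cuisine page.
    cousine = response.xpath('//h1/text()').get()
    # Select the href of each recipe link, not the serialized <a> element.
    for recipe_url in response.xpath('//div[@id="az-recipes--recipes"]//a[.//h3]/@href').getall():
        yield scrapy.Request(
            url=response.urljoin(recipe_url),
            callback=self.parse_recipe,
            cb_kwargs={'cousine': cousine},
        )

def parse_recipe(self, response, cousine):
    # Entries of cb_kwargs arrive as keyword arguments of the callback.
    print(cousine)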
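
To make the cb_kwargs mechanism more concrete, here is a small Scrapy-free sketch of the idea: each request carries a dict of extra keyword arguments, and the crawler hands them to the callback when the page is processed. Everything in it (FakeRequest, FAKE_PAGES, crawl and the page data) is hypothetical and only models the behaviour; it is not how Scrapy is implemented.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: URL, callback, extra kwargs."""
    url: str
    callback: Callable[..., Any]
    cb_kwargs: Dict[str, Any] = field(default_factory=dict)

# Hypothetical site: each page has a heading and a list of outgoing links.
FAKE_PAGES = {
    "/cuisines": {"title": "Cuisines", "links": ["/c/1", "/c/2"]},
    "/c/1": {"title": "italian", "links": ["/r/a", "/r/b"]},
    "/c/2": {"title": "indian", "links": ["/r/c"]},
    "/r/a": {"title": "A", "links": []},
    "/r/b": {"title": "B", "links": []},
    "/r/c": {"title": "C", "links": []},
}

def parse_cuisines(page):
    # Follow every cuisine link; nothing to pass along yet.
    for link in page["links"]:
        yield FakeRequest(link, parse_main)

def parse_main(page):
    # The cuisine is only known here, so attach it to each recipe request.
    cuisine = page["title"]
    for link in page["links"]:
        yield FakeRequest(link, parse_card, cb_kwargs={"cuisine": cuisine})

def parse_card(page, cuisine):
    # The value passed via cb_kwargs arrives as a normal keyword argument.
    yield {"name": page["title"], "cuisine": cuisine}

def crawl(start_url: str, callback: Callable[..., Any]) -> List[dict]:
    """Process requests in FIFO order and collect the yielded items."""
    pending = [FakeRequest(start_url, callback)]
    items = []
    while pending:
        request = pending.pop(0)
        page = FAKE_PAGES[request.url]
        for result in request.callback(page, **request.cb_kwargs):
            if isinstance(result, FakeRequest):
                pending.append(result)
            else:
                items.append(result)
    return items

items = crawl("/cuisines", parse_cuisines)
for item in items:
    print(item["name"], item["cuisine"])
```

In the real spider the same pattern would mean reading the cuisine in parse_main, passing it with cb_kwargs to parse_card, and assigning it to item['cuisine'] there.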

But on this website you can also find it on the recipe page itself (inside the ingredients section, after parsing the JSON):

import json

def parse_recipe(self, response):
    # Take the JSON-LD <script> block that describes the recipe and parse it.
    recipe_raw = response.xpath('//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()').get()
    recipe = json.loads(recipe_raw)
    cousine = recipe['recipeCuisine']

UPDATE: This XPath '//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()' finds script nodes that have a type attribute with the value application/ld+json and that also contain the string "@type":"Recipe" in the node's text.
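
Two details of that expression may be worth spelling out. In XPath, . is the current node, so contains(., ...) tests the text content of the script element. The \' sequences are only Python escaping for single quotes inside a single-quoted string, so the XPath engine never sees a backslash. The sketch below illustrates both points using only the standard library; html.parser is a stand-in for the XPath query (not what Scrapy uses), and SAMPLE_HTML is a made-up, heavily shortened page.

```python
import json
from html.parser import HTMLParser

# The escaped form and a triple-quoted form are the same string.
escaped = '//script[@type="application/ld+json"][contains(., \'"@type":"Recipe"\')]/text()'
plain = """//script[@type="application/ld+json"][contains(., '"@type":"Recipe"')]/text()"""
print(escaped == plain)

# Hypothetical recipe page with two JSON-LD blocks and one ordinary script.
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type":"WebSite","name":"Example"}</script>
<script type="application/ld+json">{"@type":"Recipe","name":"Pizza","recipeCuisine":"Italian"}</script>
<script type="text/javascript">var x = 1;</script>
</head><body></body></html>
"""

class JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json">."""

    def __init__(self):
        super().__init__()
        self.blocks = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        # Plays the role of //script[@type="application/ld+json"].
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._inside = True
            self.blocks.append("")

    def handle_data(self, data):
        if self._inside:
            self.blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

collector = JsonLdCollector()
collector.feed(SAMPLE_HTML)

# Plays the role of [contains(., '"@type":"Recipe"')]: a plain substring test.
needle = '"@type":"Recipe"'
recipe_raw = next(block for block in collector.blocks if needle in block)
recipe = json.loads(recipe_raw)
cuisine = recipe["recipeCuisine"]
print(cuisine)
```

Because the predicate is a plain substring test, it depends on the exact formatting of the JSON: if the page emitted "@type": "Recipe" with a space after the colon, the match would fail.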

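【Comments】:

Thank you, the second one worked, and I will try the first one as well. However, please help me understand your code so that I fully grasp what you did: what does this part do: [contains(., \'"@type":"Recipe"\')]? I understand that you are looking for where the JSON contains "@type":"Recipe", and I assume the backslash "\" is matched literally, but what about the dot? And what exactly does this line of code do?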