Get a log message from a Scrapy spider and assign it to a variable

【Question title】: Get a log message from scrapy spider and assign it to a variable
【Posted】: 2022-02-03 02:35:25
【Question】:

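I want to inspect the log messages from this logger: [scrapy.spidermiddlewares.httperror], and have a function take a specific action based on them. In other words, I want to assign the message to a variable as a string and then look for a keyword in that string.

I could not find a way to do this in the documentation, which only covers formatting logs.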

import scrapy

class spider1(scrapy.Spider):
    name = 'spider1'
    allowed_domains = []
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 2}
    start_urls = ['https://quotes.toscrape.com/']


    def parse(self, response):
        print(response.text)

Log example

2022-02-03 03:11:42 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <402 https://quotes.toscrape.com/>: HTTP status code is not handled or not allowed

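I want to assign the log message above to a variable.

I know I could write the whole log to a .txt file, but since I will be running multiple spiders in an infinite loop, that would produce a huge amount of data to iterate over.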

【Question discussion】:

Tags: python logging scrapy

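【Solution 1】:

You can use a logging filter and attach it to the specific scrapy.spidermiddlewares.httperror logger. Inside the filter, a regular expression can pick out the exact kind of error you are interested in, and the matching message can then be written to a file. See the sample code below: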

import logging
import re

import scrapy


class ContentFilter(logging.Filter):
    def filter(self, record):
        # getMessage() returns the message with its arguments applied;
        # record.msg alone may still be the unformatted template.
        message = record.getMessage()
        match = re.search(r'Ignoring response <.*>: HTTP status code is not handled or not allowed', message)
        if match:
            with open("logged_messages.log", "a") as f:
                f.write(message + '\n')
        # Always return True so that no record on this logger is suppressed.
        return True


class spider1(scrapy.Spider):
    name = 'spider1'
    allowed_domains = []
    custom_settings = {'CONCURRENT_REQUESTS_PER_DOMAIN': 2}
    start_urls = ['https://quotes.toscrape.com/']

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation first.
        super().__init__(*args, **kwargs)
        logger = logging.getLogger('scrapy.spidermiddlewares.httperror')
        logger.addFilter(ContentFilter())

    def parse(self, response):
        yield {
            "title": response.css("title::text").get()
        }
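
The filter above appends to a file, whereas the question asks for the message in a variable. As a rough sketch of one way to do that, the filter can keep matching messages in a list instead. `CollectingFilter` and its `keyword` argument are hypothetical names, not part of Scrapy, and the `logger.info` call below only simulates the middleware's log line so the snippet runs without a crawl.

```python
import logging


class CollectingFilter(logging.Filter):
    """Hypothetical filter that stores matching messages in memory."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword
        self.messages = []

    def filter(self, record):
        message = record.getMessage()  # message with its arguments applied
        if self.keyword in message:
            self.messages.append(message)
        return True  # never suppress the record itself


collector = CollectingFilter("HTTP status code is not handled")
logger = logging.getLogger("scrapy.spidermiddlewares.httperror")
logger.addFilter(collector)
# Needed in this bare script, where the effective level defaults to WARNING.
logger.setLevel(logging.INFO)

# Simulated log call (assumption: stands in for the real middleware).
logger.info(
    "Ignoring response %s: HTTP status code is not handled or not allowed",
    "<402 https://quotes.toscrape.com/>",
)

# The most recent matching message is now an ordinary string variable.
last_message = collector.messages[-1]
```

Keeping a reference to the filter instance (here `collector`) is what makes the captured strings reachable from the rest of the program. In a long-running loop the list would need to be trimmed or cleared periodically.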

You can read more about the logging module, and the customizations it allows, in the scrapy docs.
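
Once a message is available as a string, the status code and URL can be pulled out of it rather than just searching for a keyword. The following is a minimal sketch, assuming the message has the same shape as the sample log line in the question. The pattern, the group names and `parse_ignored_response` are illustrative, not a Scrapy API.

```python
import re

# Pattern modelled on the sample log line; group names are illustrative.
IGNORED_RESPONSE = re.compile(
    r"Ignoring response <(?P<status>\d{3}) (?P<url>[^>]+)>: "
    r"HTTP status code is not handled or not allowed"
)


def parse_ignored_response(message):
    """Return (status, url) for a matching message, or None otherwise."""
    match = IGNORED_RESPONSE.search(message)
    if match is None:
        return None
    return int(match.group("status")), match.group("url")


sample = (
    "Ignoring response <402 https://quotes.toscrape.com/>: "
    "HTTP status code is not handled or not allowed"
)
result = parse_ignored_response(sample)
```

With the status code as an integer, the calling function can branch on it directly, for example by treating 402 differently from 404.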

【Discussion】: