[Posted]: 2021-12-25 04:50:27
[Question]:
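I'm trying to integrate my own logger with my Scrapy project. The desired
result is for both my custom loggers and the Scrapy loggers to log to stderr
at the desired log levels. I have observed the following:
- Any module/class that uses its own logger seems to override the Scrapy logger,
because Scrapy's logging from within the relevant module/class appears to be
completely silenced.
- The above is confirmed whenever I disable all references to my custom
logger. For example, if I don't instantiate my custom logger in
forum.py, the Scrapy package resumes sending log output to stderr.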
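- I have tried this with both install_root_handler=True and
install_root_handler=False, and I see no difference in the logging output.
- I have confirmed that my logger is fetched correctly from my logging
config, since the returned logger object has the correct attributes.
- I have confirmed that my Scrapy settings are successfully passed to
CrawlerProcess.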
My project structure:
.
└── scraper
    ├── logging.cfg
    ├── main.py
    ├── scraper
    │   ├── __init__.py
    │   ├── forums.py
    │   ├── items.py
    │   ├── pipelines.py
    │   ├── runner.py
    │   ├── settings.py
    │   ├── spiders
    │   │   ├── __init__.py
    │   │   ├── forum.py
    │   │   └── someforum.py
    │   └── utils.py
    └── scrapy.cfg
The program is designed to be invoked from main.py.
Contents of main.py:
import os

from scraper import runner, utils
from scrapy.utils.project import get_project_settings

logger = utils.get_logger("launcher")


def main():
    """Entrypoint to the web scraper."""
    # Guarantee that we have the correct reference to the settings file
    os.environ.setdefault("SCRAPY_SETTINGS_MODULE", "scraper.settings")
    logger.info("Initializing spiders")
    runner.run_spiders(get_project_settings())


if __name__ == "__main__":
    main()
Contents of runner.py:
from scraper import forums, utils
from scraper.spiders.someforum import SomeForum
from scrapy.crawler import CrawlerProcess

logger = utils.get_logger("launcher")


def run_spiders(project_settings):
    process = CrawlerProcess(project_settings, install_root_handler=False)
    logger.info(
        f"Initialzing spider for {forums.someforum.get('forum_attrs').get('name')}"
    )
    process.crawl(
        SomeForum,
        **forums.someforum.get("forum_attrs"),
        post_attrs=forums.someforum.get("post_attrs"),
    )
    process.start()
Contents of logging.cfg:
[loggers]
keys=root,launcher,forum,item,pipeline
[logger_root]
level=DEBUG
handlers=basic
[logger_launcher]
level=DEBUG
handlers=basic
qualname=launcher
propagate=0
[logger_forum]
...
[logger_item]
...
[logger_pipeline]
...
# --
[logger_scrapy]
level=DEBUG
# --
[handlers]
keys=basic
[formatters]
keys=basic
[handler_basic]
class=StreamHandler
level=DEBUG
formatter=basic
[formatter_basic]
format=%(asctime)s - [%(name)s] - %(levelname)s - %(message)s
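Note that I want to be able to configure module-level logging verbosity
outside of a Python file. That is why logging.cfg contains a dummy scrapy
logger section: I can read it in settings.py and pass its values on to the
underlying CrawlSpider.
Contents of settings.py: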
import configparser

# LOG_CONF is defined elsewhere (not shown); it points to logging.cfg
config = configparser.RawConfigParser()
config.read(LOG_CONF)

LOG_LEVEL = config["logger_scrapy"].get("level")
LOG_FORMAT = config["formatter_basic"].get("format")
BOT_NAME = "scraper"
SPIDER_MODULES = ["scraper.spiders"]
NEWSPIDER_MODULE = "scraper.spiders"
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 8
ITEM_PIPELINES = {
    "scraper.pipelines.PostPipeline": 300,
}
The function utils.get_logger():
import logging
from logging import config

from settings import LOG_CONF


def get_logger(logger_name):
    """Returns logger"""
    # LOG_CONF just points to `logging.cfg` in root directory
    config.fileConfig(LOG_CONF)
    return logging.getLogger(logger_name)
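One detail of this helper worth checking: it re-runs fileConfig on every call, and fileConfig defaults to disable_existing_loggers=True, which disables any non-root logger that already exists and is not named in the config. The following is a minimal sketch of that standard-library behavior only. The inline CONFIG string is a trimmed, hypothetical stand-in for logging.cfg, and "somelib.crawler" is a made-up name standing in for a library logger such as the scrapy.* ones. Whether this is what happens in the project above is an assumption, not something the question confirms.

```python
import io
import logging
import logging.config

# Hypothetical inline stand-in for logging.cfg, trimmed to two loggers.
CONFIG = """\
[loggers]
keys=root,launcher

[logger_root]
level=DEBUG
handlers=basic

[logger_launcher]
level=DEBUG
handlers=basic
qualname=launcher
propagate=0

[handlers]
keys=basic

[formatters]
keys=basic

[handler_basic]
class=StreamHandler
level=DEBUG
formatter=basic

[formatter_basic]
format=%(asctime)s - [%(name)s] - %(levelname)s - %(message)s
"""

# Made-up name standing in for a library logger created at import time,
# before the config file is applied.
library_logger = logging.getLogger("somelib.crawler")
before = library_logger.disabled

# Default call: disable_existing_loggers is True, so loggers that already
# exist and are not listed under [loggers] get switched off.
logging.config.fileConfig(io.StringIO(CONFIG))
after_default = library_logger.disabled

# Same config, but existing loggers are explicitly left enabled.
logging.config.fileConfig(io.StringIO(CONFIG), disable_existing_loggers=False)
after_keep = library_logger.disabled

print(before, after_default, after_keep)
```

If this mechanism applies here, only library loggers created after the most recent get_logger() call would still produce output, which is something to compare against the log excerpts further down.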
If I run main.py with the launcher logger in place, the log output contains
something like this:
2021-11-12 16:37:23,994 - [launcher] - INFO - Initializing spiders
2021-11-12 16:37:24,016 - [launcher] - INFO - Initialzing spider for someforum
2021-11-12 16:37:24,045 - [scrapy.extensions.telnet] - INFO - Telnet Password: 62df42034f3a7f09
2021-11-12 16:37:24,070 - [scaper.spiders.forums] - INFO - Creating scrape time log directory for forum: 'someforum'
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post title
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post currency
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post price
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Searching post title for tags
2021-11-12 16:37:27,617 - [scraper.items] - DEBUG - Fetching post hash key
If I run main.py but remove the launcher logger from main.py and
runner.py, the log output contains the following:
2021-11-12 16:39:37,734 - [scrapy.utils.log] - INFO - Scrapy 2.5.1 started (bot: scraper)
2021-11-12 16:39:37,735 - [scrapy.utils.log] - INFO - Versions: lxml 4.6.4.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.2 (default, Dec 21 2020, 15:06:04) - [Clang 12.0.0 (clang-1200.0.32.29)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1l 24 Aug 2021), cryptography 35.0.0, Platform macOS-10.15.7-x86_64-i386-64bit
2021-11-12 16:39:37,735 - [scrapy.utils.log] - DEBUG - Using reactor: twisted.internet.selectreactor.SelectReactor
2021-11-12 16:39:37,748 - [scrapy.crawler] - INFO - Overridden settings:
{'BOT_NAME': 'scraper',
'CONCURRENT_REQUESTS': 8,
'LOG_FORMAT': '%(asctime)s - [%(name)s] - %(levelname)s - %(message)s',
'NEWSPIDER_MODULE': 'scraper.spiders',
'SPIDER_MODULES': ['scraper.spiders']}
2021-11-12 16:39:37,769 - [scrapy.extensions.telnet] - INFO - Telnet Password: 2baaa967d9d68933
2021-11-12 16:39:37,792 - [scrapy.middleware] - INFO - Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
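Scrapy's own log messages are once again written to stderr, which is great.
I just don't want my logs and the Scrapy logs to be mutually exclusive.
My gut tells me that I am messing up the root logger, or that my own loggers
are overriding the Scrapy loggers. I don't know how to fix this, so any
ideas/suggestions are appreciated. Thanks in advance!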
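One way to test that hunch is a variant of get_logger() that applies the config file only once and passes disable_existing_loggers=False, so that loggers created earlier by other packages are left enabled. This is a sketch under assumptions, not a verified fix for Scrapy: the inline CONFIG string is a trimmed, hypothetical stand-in for logging.cfg (the real helper would pass LOG_CONF), the _configured guard is an illustrative name, and "somelib.crawler" is a made-up stand-in for a scrapy.* logger.

```python
import io
import logging
import logging.config

# Hypothetical inline stand-in for logging.cfg; the real helper would pass LOG_CONF.
CONFIG = """\
[loggers]
keys=root,launcher

[logger_root]
level=DEBUG
handlers=basic

[logger_launcher]
level=DEBUG
handlers=basic
qualname=launcher
propagate=0

[handlers]
keys=basic

[formatters]
keys=basic

[handler_basic]
class=StreamHandler
level=DEBUG
formatter=basic

[formatter_basic]
format=%(asctime)s - [%(name)s] - %(levelname)s - %(message)s
"""

_configured = False  # illustrative module-level guard: apply the config once


def get_logger(logger_name):
    """Return a named logger, applying the file config on first use only."""
    global _configured
    if not _configured:
        # Leave loggers that already exist (e.g. a library's) enabled.
        logging.config.fileConfig(
            io.StringIO(CONFIG), disable_existing_loggers=False
        )
        _configured = True
    return logging.getLogger(logger_name)


# Made-up stand-in for a library logger that exists before the helper runs.
library_logger = logging.getLogger("somelib.crawler")

launcher = get_logger("launcher")
launcher.info("Initializing spiders")

# Not listed in the config, so it propagates to the root logger's handler.
library_logger.info("message from the library logger")
```

With this variant both messages should reach stderr through the basic handler: the launcher logger through its own handler, the library logger through the root logger. In the real project the root handler from logging.cfg would then be the one formatting Scrapy's records, which fits the existing install_root_handler=False call.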
Tags: python web-scraping logging scrapy