Scrapy: print results in real time rather than waiting for the crawl to finish

【Question title】: Scrapy to print results in real time rather than waiting for crawl to finish
【Posted】: 2020-12-02 18:27:45
【Question】:

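Can Scrapy print results in real time? I plan to crawl large sites, and I'm worried that if my VPN connection drops, the crawling effort will be wasted because nothing will have been written out.

I'm currently using a VPN with rotating user agents. I know that rotating proxies would be better than a VPN, but that is for a future upgrade of the script.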

import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

results = open('results.csv','w')

class TestSpider(CrawlSpider):
    name = "test"
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9','10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text) 
            print(response.url,">",pattern,'>',len(result), file = results)

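Thanks very much.

Update

harada's script works perfectly, with no changes needed other than the file it saves to. All I had to do was make a few modifications to my existing files, as shown below, to get everything working.

Spider - defined the items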

import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9','10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)

            # create a fresh item per pattern so each yielded item is independent
            item = TestItem()
            item['url'] = response.url
            item['pattern'] = pattern
            item['count'] = len(result)

            yield item

items.py - added the item fields

import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()

settings.py - uncommented ITEM_PIPELINES

ITEM_PIPELINES = {
   'test.pipelines.TestPipeline': 300,
}

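【Question comments】:

Tags: python scrapy

【Solution 1】:

You can add code to the pipeline that saves whatever data you have collected so far to a file. Keep a counter as an attribute of the pipeline, and when it reaches a certain threshold (say, every 1000 items yielded), write to the file. The code would look something like this. I tried to keep it as generic as possible.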

class MyPipeline:
    def __init__(self):
        # running total of items seen so far
        self.total_count = 0
        # every item collected so far (kept in memory)
        self.data = []

    def process_item(self, item, spider):
        self.data.append(item)
        self.total_count += 1
        if self.total_count % 1000 == 0:
            # every 1000 items, rewrite the file with everything collected so far
            self.write_data()
        return item

    def close_spider(self, spider):
        # write once more at the end so the last partial batch is not lost
        self.write_data()

    def write_data(self):
        # write to your file of choice; I'm not sure how your data is stored
        # during the crawl, so this simply dumps self.data
        with open("test.txt", "w") as myfile:
            myfile.write(f'{self.data}')
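
As a variation on the counter idea above, the pipeline could also write each item the moment it arrives and flush the file straight away, so that a dropped connection costs at most the item in flight. This is only a minimal sketch: the class name CsvAppendPipeline, the path results.csv and the column list are assumptions for illustration (the columns are taken from the item fields in the question), not part of the original answer.

```python
import csv


class CsvAppendPipeline:
    """Hypothetical pipeline: writes each item to a CSV file as it arrives."""

    # Assumed output path and column names (taken from the question's item fields)
    path = "results.csv"
    fields = ["url", "pattern", "count"]

    def open_spider(self, spider):
        # newline="" is what the csv module expects for files it writes to
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=self.fields)
        self.writer.writeheader()
        self.file.flush()

    def process_item(self, item, spider):
        # dict(item) works for scrapy.Item objects and plain dicts alike
        self.writer.writerow(dict(item))
        # flush so the row is handed to the OS even if the crawl dies right afterwards
        self.file.flush()
        return item

    def close_spider(self, spider):
        self.file.close()
```

To try it, the class would have to be registered in ITEM_PIPELINES in the same way as the pipeline in the question's update. Flushing on every item trades some throughput for safety; flushing every N items would be a middle ground.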

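【Comments】:

Thanks! I only started using Scrapy/Python a week ago, so I'm still working out how things fit together and what to add where in the process.
@AJ2 No problem, I edited the answer to make my example clearer. In principle you can (and I strongly recommend that you do) use feed exports. See more here: docs.scrapy.org/en/stable/topics/feed-exports.html
This is perfect! Thanks! I only had to make a few tweaks to my current files and it works perfectly.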
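
For the feed exports mentioned in the comments, a settings.py fragment along the following lines should split the output into a new file every 1000 items without a custom pipeline. Treat it as a sketch: the output path and field list are assumptions based on the item fields in the question, and batch delivery needs a reasonably recent Scrapy release, so check the linked documentation against your installed version.

```python
# settings.py (sketch) - the output path below is an assumption
FEEDS = {
    # %(batch_id)d is filled in by Scrapy with the batch number: 1, 2, 3, ...
    "output/results-%(batch_id)d.csv": {
        "format": "csv",
        # column order, taken from the item fields in the question
        "fields": ["url", "pattern", "count"],
    },
}

# start a new output file after this many items
FEED_EXPORT_BATCH_ITEM_COUNT = 1000
```

With this in place, the spider only has to yield its items; each completed batch file remains on disk even if the crawl is interrupted later.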