【Question title】: Scrapy to print results in real time rather than waiting for crawl to finish
【Posted】: 2020-12-02 18:27:45
【Question】:

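Can Scrapy print results in real time? I plan to crawl large sites, and I'm worried that if my VPN connection drops, the crawl effort will be wasted because nothing will have been written out.

I'm currently using a VPN with rotating user agents. I know rotating proxies would be better than a VPN, but that is for a future upgrade of the script.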

import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

results = open('results.csv','w')

class TestSpider(CrawlSpider):
    name = "test"
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]
        f.close()

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9','10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text) 
            print(response.url,">",pattern,'>',len(result), file = results)

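Many thanks.

Update

harada's script works perfectly, with no changes other than the file it saves to. All I needed to do was make a few modifications to my current files, as shown below, so that everything works.

Spider - defined the items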

import scrapy
import re
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from ..items import TestItem

class TestSpider(CrawlSpider):
    name = "test"
    with open("domains.txt", "r") as d:
        allowed_domains = [url.strip() for url in d.readlines()]

    with open("urls.txt", "r") as f:
        start_urls = [url.strip() for url in f.readlines()]
        f.close()

    rules = (Rule(LinkExtractor(allow=('/'), deny=('9','10')), follow=True, callback='parse_item'),)

    def parse_item(self, response):
        for pattern in ['Albert Einstein', 'Bob Marley']:
            result = re.findall(pattern, response.text)

            # Create a fresh item per pattern so an earlier yield is not overwritten
            item = TestItem()
            item['url'] = response.url
            item['pattern'] = pattern
            item['count'] = len(result)

            yield item

items.py - added the items as fields

import scrapy

class TestItem(scrapy.Item):
    url = scrapy.Field()
    pattern = scrapy.Field()
    count = scrapy.Field()

settings.py - uncommented ITEM_PIPELINES

ITEM_PIPELINES = {
   'test.pipelines.TestPipeline': 300,
}

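【Question discussion】:

    Tags: python scrapy


    【Solution 1】:

    You can add code to your pipeline that saves the data collected so far to a file. Add a counter as a variable on the pipeline, and when it reaches a certain threshold (say, every 1000 items yielded), write to the file. The code would look something like this. I tried to keep it as generic as possible.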

    class MyPipeline:
        def __init__(self):
            # variable that keeps track of the total number of items yielded
            self.total_count = 0
            self.data = []
    
        def process_item(self, item, spider):
            self.data.append(item)
            self.total_count += 1
            if self.total_count % 1000 == 0:
                # write to your file of choice....
                # I'm not sure how your data is stored throughout the crawling process
                # If it's a variable of the pipeline like self.data,
                # then just write that to the file
                with open("test.txt", "w") as myfile:
                    myfile.write(f'{self.data}')
    
            return item
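
A variation on the pipeline above, offered only as a minimal sketch: with mode "w" and a check every 1000 items, anything collected after the last multiple of 1000 never reaches disk, and the whole list is rewritten each time. The version below appends each batch instead and flushes the remainder in `close_spider`, which Scrapy calls on a pipeline when the spider closes. The class name `BatchedFilePipeline`, the file name `results.jl`, and the JSON Lines format are assumptions for illustration, not part of the original answer.

```python
import json


class BatchedFilePipeline:
    """Hypothetical pipeline: buffers items and appends them to a JSON Lines file."""

    def __init__(self, path="results.jl", batch_size=1000):
        self.path = path              # assumed output file name
        self.batch_size = batch_size  # flush threshold, as in the answer above
        self.buffer = []

    def _flush(self):
        # Append the buffered items, one JSON object per line, then empty the buffer.
        if not self.buffer:
            return
        with open(self.path, "a", encoding="utf-8") as f:
            for item in self.buffer:
                f.write(json.dumps(dict(item)) + "\n")
        self.buffer.clear()

    def process_item(self, item, spider):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self._flush()
        return item

    def close_spider(self, spider):
        # Write whatever is left when the crawl ends.
        self._flush()
```

A smaller `batch_size` trades more frequent disk writes for less data lost if the connection drops mid-crawl; `batch_size=1` writes every item as soon as it is processed.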
    

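    【Discussion】:

    • Thanks! I only started using Scrapy/Python a week ago, so I'm still figuring out how and where to add things in the process.
    • @AJ2 No problem, I edited the answer to make my example clearer. In principle, you can (and are strongly advised to) use feed exports. See more here: docs.scrapy.org/en/stable/topics/feed-exports.html
    • This is perfect! Thanks! I only had to make a few adjustments to my current files and it works perfectly.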
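
The feed exports mentioned in the comments can be configured in settings.py without a custom pipeline. The following is a rough sketch, assuming a Scrapy version recent enough to support the `FEEDS` setting and `FEED_EXPORT_BATCH_ITEM_COUNT` (batching was added around Scrapy 2.3, so check the linked documentation for your version). The output file name and the batch size of 1000 are illustrative assumptions.

```python
# settings.py (sketch): export yielded items through Scrapy's feed exports.

FEEDS = {
    # %(batch_id)d is replaced with the batch number; a batch placeholder in the
    # URI is required when FEED_EXPORT_BATCH_ITEM_COUNT is set.
    "results-%(batch_id)d.jl": {
        "format": "jsonlines",  # one JSON object per line
        "encoding": "utf8",
    },
}

# Start a new output file after this many items (assumed value).
FEED_EXPORT_BATCH_ITEM_COUNT = 1000
```

With batching, each completed file is a finished unit of results, so an interrupted crawl should lose at most the batch that was in progress.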