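【Question title】: Scrapy with APScheduler only works half the time
【Posted】: 2020-09-24 10:55:23
【Question】:

The following code runs only once every 2 hours instead of every hour. I store the data in MongoDB through a pipeline, which is how I can see that the `_id` advances by 2 between stored runs instead of by 1.

The goal is to scrape, every hour, the number of online users for the 100 subreddits listed in data.csv and push the data to a MongoDB cloud server. Everything works, except that it scrapes only every two hours rather than every hour.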

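import csv
import json
from datetime import datetime

import scrapy
from apscheduler.schedulers.twisted import TwistedScheduler
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SubredditSpider(scrapy.Spider):
    name = 'subreddit'
    sub_list = []  # list of subreddit paths read from data.csv
    count = 0  # run counter (1-24), used as the MongoDB _id

    def start_requests(self):
        # Advance the run counter, wrapping around after 24 runs.
        SubredditSpider.count += 1
        if SubredditSpider.count > 24:
            SubredditSpider.count = 1
        # The first column of each row holds a path such as /r/Music.
        with open('data.csv', 'r') as file:
            csv_reader = csv.reader(file)
            for row in csv_reader:
                self.sub_list.append(row[0])

        for sub in self.sub_list:
            yield scrapy.Request(f'https://www.reddit.com{sub}/about.json', self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        subreddit = data['data']['display_name']
        active_users = data['data']['active_user_count']

        now = datetime.now()
        current_time = now.strftime("%H:%M")
        current_date = now.strftime("%d:%m:%Y")

        yield {
            '_id': SubredditSpider.count,
            'subreddit': subreddit,
            'active_users': active_users,
            'time': current_time,
            'date': current_date
        }


def main():
    process = CrawlerProcess(get_project_settings())
    scheduler = TwistedScheduler()
    # Schedule a crawl at the top of every hour.
    scheduler.add_job(process.crawl, 'cron', args=[
                      SubredditSpider], hour='*')
    scheduler.start()
    # stop_after_crawl=False keeps the reactor alive between scheduled crawls.
    process.start(False)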
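
A side note on the spider above, separate from the scheduling question: `sub_list` is declared on the class, so every crawl started in the same process appends to the same list. The sketch below reproduces that pattern with plain Python and no Scrapy. `SharedListLoader` and `LocalListLoader` are hypothetical names used only for illustration, and nothing in the question establishes that this is what causes the skipped runs.

```python
class SharedListLoader:
    """Mimics the spider: the list lives on the class, not on the instance."""
    sub_list = []

    def load(self, rows):
        for row in rows:
            # self.sub_list resolves to the class attribute, so this
            # mutates one list shared by every instance in the process.
            self.sub_list.append(row)
        return list(self.sub_list)


class LocalListLoader:
    """Alternative: build a fresh list on every call."""

    def load(self, rows):
        return [row for row in rows]


rows = ["/r/Music", "/r/Python"]

# Two "runs", each with a new instance, as a scheduler would create them.
shared_first = SharedListLoader().load(rows)
shared_second = SharedListLoader().load(rows)

local_first = LocalListLoader().load(rows)
local_second = LocalListLoader().load(rows)

print(len(shared_first), len(shared_second))  # the shared list keeps growing
print(len(local_first), len(local_second))    # the local list stays the same size
```

With the shared list, the second run sees every path twice. Scrapy's default duplicate filter would likely drop the repeated requests, so the visible effect may be small, but reading the CSV into a local list avoids the growth altogether.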

Here is the log for an hour in which the crawl does not run:

2020-09-24 10:00:00 [apscheduler.scheduler] DEBUG: Looking for jobs to run
2020-09-24 10:00:00 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2020-09-24 11:00:00+00:00 (in 3599.898356 seconds)
2020-09-24 10:00:00 [apscheduler.executors.default] INFO: Running job "CrawlerRunner.crawl (trigger: cron[hour='*'], next run at: 2020-09-24 11:00:00 UTC)" (scheduled at 2020-09-24 10:00:00+00:00)
2020-09-24 10:00:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'reddit',
 'NEWSPIDER_MODULE': 'reddit.spiders',
 'SPIDER_MODULES': ['reddit.spiders']}
2020-09-24 10:00:00 [scrapy.extensions.telnet] INFO: Telnet Password: telnet_password
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled item pipelines:
['reddit.pipelines.RedditPipeline']
2020-09-24 10:00:00 [scrapy.core.engine] INFO: Spider opened
2020-09-24 10:00:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-24 10:00:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-24 10:00:00 [apscheduler.executors.default] INFO: Job "CrawlerRunner.crawl (trigger: cron[hour='*'], next run at: 2020-09-24 11:00:00 UTC)" executed successfully

【Discussion】:

    Tags: python python-3.x web-scraping scrapy apscheduler


    【Solution 1】:

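    I modified your code so that it runs the spider every minute and requests the same URL 100 times. It worked for me; I let it run for 10 minutes.

    I also tried the requests library, making one request per second, and everything worked.

    I googled the Reddit API rate limit. Some posts say you can make 100 requests, but the actual documentation limits it to 60.

    https://github.com/reddit-archive/reddit/wiki/API

    They allow you to scrape everything within the limits described at the link I shared, but you have to authenticate yourself.

    My only theory is that they are blocking your second crawl because it exceeds their hourly rate limit. Maybe you could try a proxy or authentication.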
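
If rate limiting is the suspect, one cheap experiment is to slow the spider down before reaching for proxies. The sketch below is a settings dictionary that could be assigned to the spider's `custom_settings`. The setting names are standard Scrapy settings, but the values, the user-agent string, and the reading of "60" as 60 requests per minute are assumptions made for illustration. `max_requests_per_minute` is a hypothetical helper that only does the arithmetic.

```python
# Hypothetical throttling values; tune them against the limits you actually observe.
THROTTLE_SETTINGS = {
    # Wait at least one second between requests to the same site.
    "DOWNLOAD_DELAY": 1.0,
    # Use the delay as-is instead of Scrapy's default random jitter.
    "RANDOMIZE_DOWNLOAD_DELAY": False,
    # Send one request at a time to each domain.
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
    # Placeholder identifier; replace with something that describes your client.
    "USER_AGENT": "subreddit-activity-tracker/0.1 (hypothetical)",
}


def max_requests_per_minute(settings):
    """Upper bound on the request rate implied by a fixed download delay."""
    return 60.0 / settings["DOWNLOAD_DELAY"]


print(max_requests_per_minute(THROTTLE_SETTINGS))
```

With one request in flight at a time and a fixed one-second delay, 100 subreddits take a little under two minutes per run and the rate stays at or below 60 requests per minute.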

    Also, if you like, send me your list of URLs and I can rerun my test. Maybe I did not exceed their limit because I was requesting the same thing over and over.

    import scrapy
    import json
    from datetime import datetime
    import requests
    from apscheduler.schedulers.twisted import TwistedScheduler
    from apscheduler.schedulers.blocking import BlockingScheduler
    from scrapy.crawler import CrawlerProcess
    
    
    class SubredditSpider(scrapy.Spider):
        name = 'subreddit'
        sub_list = []  # list of subreddits
        count = 0
        custom_settings = {}
    
        def start_requests(self):
            SubredditSpider.count += 1
            if SubredditSpider.count > 24:
                SubredditSpider.count = 1
            for _ in range(100):
                yield scrapy.Request('https://www.reddit.com/r/Music/about.json', self.parse, dont_filter=True)
    
        def parse(self, response, *args):
            data = json.loads(response.body)
            subreddit = data['data']['display_name']
            active_users = data['data']['active_user_count']
    
            now = datetime.now()
            current_time = now.strftime("%H:%M")
            current_date = now.strftime("%d:%m:%Y")
    
            yield {
                '_id': SubredditSpider.count,
                'subreddit': subreddit,
                'active_users': active_users,
                'time': current_time,
                'date': current_date
            }
    
    
    def main():
        process = CrawlerProcess({'BOT_NAME': 'reddit'})
        scheduler = TwistedScheduler()
        scheduler.add_job(process.crawl, 'interval', args=[
            SubredditSpider], minutes=1)
        scheduler.start()
        process.start(False)
    
    
    def get_active_users():
        url = "https://www.reddit.com/r/Music/about.json"
    
        payload = {}
        headers = {
            'User-Agent': 'PostmanRuntime/7.26.3',
            'Accept': '*/*',
            'Accept-Encoding': 'gzip, deflate, br',
            'Connection': 'keep-alive'
        }
    
        response = requests.request("GET", url, headers=headers, data=payload)
        if response.status_code == 200:
            data = response.json()
            subreddit = data['data']['display_name']
            active_users = data['data']['active_user_count']
    
            now = datetime.now()
            current_time = now.strftime("%H:%M")
            current_date = now.strftime("%d:%m:%Y")
    
            print({
                '_id': SubredditSpider.count,
                'subreddit': subreddit,
                'active_users': active_users,
                'time': current_time,
                'date': current_date
            })
            SubredditSpider.count += 1
        else:
            print(response)
    
    
    if __name__ == '__main__':
        main()
        # scheduler = BlockingScheduler()
        # scheduler.add_job(get_active_users, 'interval', seconds=1)
        # scheduler.start()
    

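    【Comments】:

    • Hi, thanks for replying again. I reran the code, and apparently it fails to run on the very first trigger, so it is definitely not a rate issue. As for the URLs, I don't have them. I only have the paths that you need to append after reddit.com, at pastebin.com/mNgCrfWR. Also, I think there is a difference between 'cron' and 'interval'. I used 'cron'; I think 'interval' would work fine for me too, but I want to use 'cron' because only it runs on the hour.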
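
The difference between the two trigger types mentioned in the comment can be sketched with the standard library alone. The two functions below are hypothetical helpers, not APScheduler code: `next_cron_hourly` approximates a cron-style trigger with `hour='*'` (which, as the log in the question shows, fires at minute 0 of each hour), and `next_interval` approximates an interval-style trigger whose runs are spaced from a chosen first run. The first run being one hour after the scheduler starts is an assumption of this example.

```python
from datetime import datetime, timedelta


def next_cron_hourly(now):
    """First top-of-the-hour instant strictly after `now`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


def next_interval(first_run, now, every):
    """Next fire time for runs at first_run, first_run + every, and so on."""
    if now < first_run:
        return first_run
    # Whole periods already elapsed since the first run, plus one more.
    periods = (now - first_run) // every + 1
    return first_run + periods * every


started = datetime(2020, 9, 24, 10, 17, 30)
one_hour = timedelta(hours=1)
first_interval_run = started + one_hour  # assumption: first run one interval after start

cron_first = next_cron_hourly(started)
interval_first = next_interval(first_interval_run, started, one_hour)

later = datetime(2020, 9, 24, 11, 20, 0)
cron_later = next_cron_hourly(later)
interval_later = next_interval(first_interval_run, later, one_hour)

print(cron_first, interval_first)
print(cron_later, interval_later)
```

The cron-style times stay aligned to the clock (11:00, 12:00), while the interval-style times keep the offset of whenever the process happened to start (11:17:30, 12:17:30). That alignment is the reason to prefer 'cron' when the records should be stamped on the hour.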