【Posted】: 2020-09-24 10:55:23
【Question】:
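The following code runs every 2 hours instead of every hour. I store the data in MongoDB through a pipeline, and I can see the `_id` advancing by 2 over time instead of by 1.
The goal is to scrape, once an hour, the number of online users for the 100 subreddits listed in data.csv and push the data to a MongoDB cloud server. Everything works, except that it scrapes only every two hours instead of every hour.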
import csv
import json
from datetime import datetime

import scrapy
from apscheduler.schedulers.twisted import TwistedScheduler
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


class SubredditSpider(scrapy.Spider):
    name = 'subreddit'
    sub_list = []  # list of subreddits (class attribute)
    count = 0      # run counter (1..24), used as the MongoDB _id

    def start_requests(self):
        SubredditSpider.count += 1
        if SubredditSpider.count > 24:
            SubredditSpider.count = 1
        # Each row of data.csv holds a subreddit path in its first column.
        with open('data.csv', 'r') as file:
            csv_reader = csv.reader(file)
            for row in csv_reader:
                self.sub_list.append(row[0])
        for sub in self.sub_list:
            yield scrapy.Request(f'https://www.reddit.com{sub}/about.json', self.parse)

    def parse(self, response):
        data = json.loads(response.body)
        subreddit = data['data']['display_name']
        active_users = data['data']['active_user_count']
        now = datetime.now()
        current_time = now.strftime("%H:%M")
        current_date = now.strftime("%d:%m:%Y")
        yield {
            '_id': SubredditSpider.count,
            'subreddit': subreddit,
            'active_users': active_users,
            'time': current_time,
            'date': current_date
        }


def main():
    process = CrawlerProcess(get_project_settings())
    scheduler = TwistedScheduler()
    # Schedule a crawl at the top of every hour.
    scheduler.add_job(process.crawl, 'cron', args=[
        SubredditSpider], hour='*')
    scheduler.start()
    process.start(False)  # stop_after_crawl=False: keep the reactor running between crawls


if __name__ == '__main__':
    main()
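One detail of the spider above that may be worth checking is that `sub_list` and `count` are class attributes, so their state persists across crawls within the same process. The following is a minimal sketch of that Python behavior only, with no Scrapy involved; the `FakeSpider` class, its `load` method, and the two sample rows are hypothetical stand-ins. Whether this has anything to do with the two-hour gap is not established by the log.

```python
class FakeSpider:
    """Hypothetical stand-in that mirrors the spider's class-level state."""
    sub_list = []  # class attribute: one list shared by every instance
    count = 0      # class attribute: shared run counter

    def load(self, rows):
        # Same counter logic as start_requests: wrap back to 1 after 24.
        FakeSpider.count += 1
        if FakeSpider.count > 24:
            FakeSpider.count = 1
        for row in rows:
            # self.sub_list resolves to the shared class-level list.
            self.sub_list.append(row)
        return list(self.sub_list)


rows = ['/r/python', '/r/scrapy']  # sample data, not from data.csv
first = FakeSpider().load(rows)    # simulates the first scheduled crawl
second = FakeSpider().load(rows)   # simulates the next crawl, new instance
print(len(first), len(second))     # 2 4
```

Because the second instance appends to the same list, every subreddit appears twice on the second run. Assigning `self.sub_list = []` at the start of `start_requests` would give each crawl its own list.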
Here is the log from an hour in which it does not run:
2020-09-24 10:00:00 [apscheduler.scheduler] DEBUG: Looking for jobs to run
2020-09-24 10:00:00 [apscheduler.scheduler] DEBUG: Next wakeup is due at 2020-09-24 11:00:00+00:00 (in 3599.898356 seconds)
2020-09-24 10:00:00 [apscheduler.executors.default] INFO: Running job "CrawlerRunner.crawl (trigger: cron[hour='*'], next run at: 2020-09-24 11:00:00 UTC)" (scheduled at 2020-09-24 10:00:00+00:00)
2020-09-24 10:00:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'reddit',
'NEWSPIDER_MODULE': 'reddit.spiders',
'SPIDER_MODULES': ['reddit.spiders']}
2020-09-24 10:00:00 [scrapy.extensions.telnet] INFO: Telnet Password: telnet_password
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.logstats.LogStats']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-24 10:00:00 [scrapy.middleware] INFO: Enabled item pipelines:
['reddit.pipelines.RedditPipeline']
2020-09-24 10:00:00 [scrapy.core.engine] INFO: Spider opened
2020-09-24 10:00:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-24 10:00:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-09-24 10:00:00 [apscheduler.executors.default] INFO: Job "CrawlerRunner.crawl (trigger: cron[hour='*'], next run at: 2020-09-24 11:00:00 UTC)" executed successfully
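The "Next wakeup" line in the log above is consistent with a trigger that fires at the top of every hour. As a rough check of that arithmetic, here is a small standard-library sketch; it is a simplification for illustration, not APScheduler's actual implementation. The helper name `next_hourly_fire` is hypothetical, and the 101644-microsecond offset is inferred from the log (3600 minus 3599.898356 seconds), not stated in it.

```python
from datetime import datetime, timedelta, timezone


def next_hourly_fire(now):
    """Return the next top-of-hour strictly after `now` (hypothetical helper)."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    return top_of_hour + timedelta(hours=1)


# Assumed wall-clock time at which the scheduler logged its wakeup.
now = datetime(2020, 9, 24, 10, 0, 0, 101644, tzinfo=timezone.utc)
next_fire = next_hourly_fire(now)
seconds_until = (next_fire - now).total_seconds()

print(next_fire.isoformat())
print(seconds_until)
```

Under these assumptions the next fire time is 11:00:00 UTC, roughly 3599.9 seconds away, which matches the log. This suggests the scheduler itself is firing hourly, so the missing data would have to come from somewhere after the job is launched.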
【Discussion】:
Tags: python python-3.x web-scraping scrapy apscheduler