【Question title】: Running Splash server and Scrapy spiders on the same EC2 instance
【Posted】: 2018-04-26 19:55:33
【Question】:
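I am deploying a web-scraping application made up of Scrapy spiders that use the Splash JavaScript rendering service to scrape content from websites and to take screenshots of web pages. I want to deploy the whole application to a single EC2 instance. For the application to work, however, I have to run the Splash server from a Docker image at the same time as the spiders. How can I run multiple processes on an EC2 instance? Any advice on best practices would be greatly appreciated.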
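【Comments】:
- This thread suggests using AWS Elastic Container Service (ECS) as a solution for running multiple tasks on the same EC2 instance. Any tips on applying that solution to my case? It is also important that my Scrapy spiders can communicate with the Splash server. Thanks in advance!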
Tags:
amazon-web-services
amazon-ec2
scrapy
splash-screen
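【Answer 1】:
Total noob question. After configuring the instance, the best way I found to run the Splash server and the Scrapy spiders on an EC2 instance is a bash script scheduled to run as a cron job. Here is the bash script I came up with: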
#!/bin/bash
# Change to the directory the Scrapy spiders run from; stop if it is missing.
cd /home/ec2-user/project_spider/project_spider || exit 1
# Activate my virtual environment.
source /home/ec2-user/venv/python36/bin/activate
# Store the date at runtime in a shell variable, used to name the log files.
LOGDATE=$(date +%Y%m%dT%H%M%S)
# Spin up splash instance from docker image.
sudo docker run -d -p 8050:8050 -p 5023:5023 scrapinghub/splash --max-timeout 3600
# Scrape the first site and store a dated log file in the logs directory.
scrapy crawl anhui --logfile "/home/ec2-user/project_spider/project_spider/logs/anhui_spider/anhui_spider_${LOGDATE}.log"
...
# Stop and remove the Splash container(s) started from the docker image.
sudo docker rm $(sudo docker stop $(sudo docker ps -a -q --filter ancestor=scrapinghub/splash --format="{{.ID}}"))
# Exit virtual environment.
deactivate
# Send an email to confirm the cron job finished.
# Note that sending email from EC2 is difficult and you cannot use 'MAILTO'
# in your crontab without setting up something like postfix or sendmail.
# Using Mailgun is an easy way around that.
curl -s --user 'api:<YOURAPIHERE>' \
    'https://api.mailgun.net/v3/<YOURDOMAINHERE>/messages' \
    -F from='<YOURDOMAINADDRESS>' \
    -F to='<RECIPIENT>' \
    -F subject='Cronjob Run Successfully' \
    -F text='Cronjob completed.'
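
One thing the script above does not do is wait for Splash to finish starting: `docker run -d` returns as soon as the container is created, so the first `scrapy crawl` may begin before Splash is accepting requests. Below is a minimal sketch of a readiness check, not something from the original answer. The helper name `wait_for_splash` and the `SPLASH_URL` variable are hypothetical, and it assumes Splash is reachable on localhost:8050 (the port published above) and answers on its `/_ping` endpoint once it is up.

```shell
#!/bin/bash
# Sketch only. wait_for_splash is a hypothetical helper, not part of the script above.
# Assumption: Splash is published on localhost:8050 (as in `docker run -p 8050:8050`)
# and responds to GET /_ping once it is ready to accept requests.

SPLASH_URL="${SPLASH_URL:-http://localhost:8050}"

# Poll Splash until it responds, or give up after a number of attempts.
# Usage: wait_for_splash [max_attempts] [delay_seconds]
wait_for_splash() {
    local attempts="${1:-30}"
    local delay="${2:-2}"
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # -f: non-zero exit on HTTP errors; -s: keep the cron log quiet;
        # --max-time: do not hang forever on a single request.
        if curl -sf --max-time 5 -o /dev/null "${SPLASH_URL}/_ping"; then
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    return 1
}
```

If you try this, it would probably go right after the `docker run` line, for example `wait_for_splash || exit 1`, so that the spiders only start once Splash answers. The default of 30 attempts with a 2 second delay is an arbitrary choice; adjust it to however long the container takes to start on your instance.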