How to run a Scrapy spider from AWS Lambda? Answers

【Question title】: How to run a Scrapy spider from AWS Lambda?
【Posted】: 2018-12-23 06:24:01
【Question description】:

I'm trying to run a Scrapy spider from within AWS Lambda. Here is what my current script looks like; it scrapes test data.

import boto3
import scrapy
from scrapy.crawler import CrawlerProcess

s3 = boto3.client('s3')
BUCKET = 'sample-bucket'

class BookSpider(scrapy.Spider):
    name = 'bookspider'
    start_urls = [
        'http://books.toscrape.com/'
    ]

    def parse(self, response):
        for link in response.xpath('//article[@class="product_pod"]/div/a/@href').extract():
            yield response.follow(link, callback=self.parse_detail)
        next_page = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

    def parse_detail(self, response):
        title = response.xpath('//div[contains(@class, "product_main")]/h1/text()').extract_first()
        price = response.xpath('//div[contains(@class, "product_main")]/'
                               'p[@class="price_color"]/text()').extract_first()
        availability = response.xpath('//div[contains(@class, "product_main")]/'
                                      'p[contains(@class, "availability")]/text()').extract()
        availability = ''.join(availability).strip()
        upc = response.xpath('//th[contains(text(), "UPC")]/'
                             'following-sibling::td/text()').extract_first()
        yield {
            'title': title,
            'price': price,
            'availability': availability,
            'upc': upc
        }

def main(event, context):
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': 'result.json'
    })

    process.crawl(BookSpider)
    process.start() # the script will block here until the crawling is finished

    data = open('result.json', 'rb')
    s3.put_object(Bucket = BUCKET, Key='result.json', Body=data)
    print('All done.')

if __name__ == "__main__":
    main('', '')

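I first tested this script locally, where it ran as expected: it scraped the data, saved it to "result.json", and then uploaded that file to my S3 bucket.

Then I configured my AWS Lambda function by following the guide here: https://serverless.com/blog/serverless-python-packaging/, and the Scrapy library is imported successfully within AWS Lambda for execution.

However, when the script runs on AWS Lambda, it does not scrape any data and only throws an error saying that result.json does not exist.

Anyone who has configured Scrapy to run on Lambda, has a workaround, or can point me in the right direction would be highly appreciated.

Thanks.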

【Question comments】:

Tags: python-3.x amazon-web-services scrapy aws-lambda

【Solution 1】:

Just came across this while looking for something else, but off the top of my head...

Lambdas provide temporary storage in /tmp, so I'd suggest setting

'FEED_URI': '/tmp/result.json'

and then

data = open('/tmp/result.json', 'rb')

There are probably various best practices around using temporary storage in Lambdas, so I'd recommend spending some time reading up on those.
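Putting the two changes together, a minimal sketch might look like the following. It assumes the `BookSpider` class and the `sample-bucket` name from the question; the helper names `feed_settings`, `upload_feed`, and `crawl_and_upload` are made up for illustration and are not part of Scrapy or boto3. The existence check is an addition that is not in the original script: it reports a clearer error when the crawl writes nothing.

```python
import os

# On Lambda, /tmp is the writable location; writing next to the code fails.
FEED_PATH = os.path.join('/tmp', 'result.json')


def feed_settings(feed_path=FEED_PATH):
    """Settings for CrawlerProcess that send the JSON feed to a writable path."""
    return {
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)',
        'FEED_FORMAT': 'json',
        'FEED_URI': feed_path,
    }


def upload_feed(s3_client, bucket, feed_path=FEED_PATH, key='result.json'):
    """Upload the feed file to S3, failing clearly if the crawl wrote nothing."""
    if not os.path.exists(feed_path):
        raise FileNotFoundError(f'{feed_path} was not written; check the crawl logs')
    # The with-block closes the file handle once the upload call returns.
    with open(feed_path, 'rb') as data:
        s3_client.put_object(Bucket=bucket, Key=key, Body=data)


def crawl_and_upload(spider_cls, bucket):
    """Run one crawl and upload its feed (hypothetical helper)."""
    # Imported lazily so the helpers above can be used without these packages.
    import boto3
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(feed_settings())
    process.crawl(spider_cls)
    process.start()  # blocks until the crawl is finished
    upload_feed(boto3.client('s3'), bucket)
```

The Lambda handler from the question would then reduce to calling `crawl_and_upload(BookSpider, BUCKET)` inside `main(event, context)`.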

【Comments】: