【Question title】: Giving a list as an argument to a Scrapy scraper
【Posted】: 2021-03-05 09:52:22
【Question description】:

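I want to be able to pass a list of URLs as an argument to my Scrapy scraper, so that I can iterate over it periodically and avoid 403 errors. As far as I can tell, Scrapy does not currently let me do this: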

scrapy crawl nosetime -o results.jl ['/pinpai/10036120-yuguoboshi-hugo-boss.html', '/pinpai/10094164-kedi-coty.html', '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']

Alternatively, I would pass a file of URLs.

At the moment the URLs are hard-coded in my spider:

import scrapy
from ..pipelines import NosetimeScraperPipeline
import time

headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; TencentTraveler 4.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30618)'}
base_url = 'https://www.nosetime.com'

class NosetimeScraper(scrapy.Spider):
    name = "nosetime"

    urls = ['/pinpai/10036120-yuguoboshi-hugo-boss.html', # I want to get rid of this
            '/pinpai/10094164-kedi-coty.html',            # unless I can use something like time.sleep(12*60*60)
            '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', # for each before being taken as argument
            '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']

    start_urls = ['https://www.nosetime.com' + url for url in urls]
    base_url = 'https://www.nosetime.com'

    def parse(self, response):
        # proceed to other pages of the listings
        urls = response.css('a.imgborder::attr(href)').getall()
        for url in urls:
            print("url: ", url)
            yield scrapy.Request(url=base_url + url, callback=self.parse)

        # now that the links are queued, check whether this page itself is something we can scrape
        pipeline = NosetimeScraperPipeline()
        perfume = pipeline.process_response(response)
        try:
            if perfume['enname']:
                print("Finally are going to store: ", perfume['enname'])
                pipeline.save_in_mongo(perfume)
        except KeyError:
            pass

【Comments】:

    Tags: python-3.x scrapy arguments parameter-passing


    【Solution 1】:

    The Scrapy documentation has a very simple example of spider arguments, which you can adapt to take the name of a file containing the list of URLs:

    scrapy crawl myspider -a urls_file=URLs.txt
    
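       import scrapy
       class MySpider(scrapy.Spider):
           name = "myspider"
           def __init__(self, urls_file=None, *args, **kwargs):
               super().__init__(*args, **kwargs)
               self.urls_file = urls_file  # path given with -a urls_file=...
               # ...

           def start_requests(self):
               # assumes one absolute URL per line; blank lines are skipped
               with open(self.urls_file, 'r') as f:
                   for line in f:
                       if line.strip():
                           yield scrapy.Request(line.strip())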
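
The "read your URLs" step can be sketched as a plain helper that works without Scrapy. This is a minimal sketch under some assumptions: the file holds one entry per line, entries may be site-relative paths like the ones in the question, and lines starting with `#` are treated as comments. The names `load_urls`, `split_urls_argument` and the `-a urls=...` argument are hypothetical and not part of Scrapy. Spider arguments passed with `-a` arrive as strings, so a list has to be encoded somehow, for example as a comma-separated value.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nosetime.com"  # base URL taken from the question's spider


def load_urls(path, base_url=BASE_URL):
    """Read one URL or site-relative path per line and return absolute URLs."""
    urls = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            entry = line.strip()
            # Skip blank lines and '#' comment lines (a convention assumed here).
            if not entry or entry.startswith("#"):
                continue
            # urljoin keeps absolute URLs as they are and resolves relative paths.
            urls.append(urljoin(base_url, entry))
    return urls


def split_urls_argument(value, base_url=BASE_URL):
    """Turn a comma-separated string (e.g. from a hypothetical `-a urls=a,b`) into absolute URLs."""
    parts = (part.strip() for part in value.split(","))
    return [urljoin(base_url, part) for part in parts if part]
```

Inside `start_requests`, the spider could then loop over `load_urls(self.urls_file)` and yield one `scrapy.Request` per URL, which keeps the file parsing easy to test on its own.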
           
    
