【Posted】: 2021-03-05 09:52:22
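【Question】:
I would like to be able to pass a list of URLs as an argument to my Scrapy scraper, so that I can iterate over them periodically and avoid 403 errors. As far as I can tell, Scrapy does not currently let me do this.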
scrapy crawl nosetime -o results.jl ['/pinpai/10036120-yuguoboshi-hugo-boss.html', '/pinpai/10094164-kedi-coty.html', '/pinpai/10021965-gaotiye-jean-paul-gaultier.html', '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']
Or, alternatively, a file of URLs.
At the moment these URLs are hard-coded in my spider:
import scrapy
from ..pipelines import NosetimeScraperPipeline
import time

headers = {'User-Agent':'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; TencentTraveler 4.0; Trident/4.0; SLCC1; Media Center PC 5.0; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30618)'}
base_url = 'https://www.nosetime.com'


class NosetimeScraper(scrapy.Spider):
    name = "nosetime"
    urls = ['/pinpai/10036120-yuguoboshi-hugo-boss.html',  # I want to get rid of this
            '/pinpai/10094164-kedi-coty.html',  # unless I can use something like time.sleep(12*60*60)
            '/pinpai/10021965-gaotiye-jean-paul-gaultier.html',  # for each before being taken as argument
            '/pinpai/10088596-laerfu-laolun-ralph-lauren.html']
    start_urls = ['https://www.nosetime.com' + url for url in urls]
    base_url = 'https://www.nosetime.com'

    def parse(self, response):
        # proceed to other pages of the listings
        urls = response.css('a.imgborder::attr(href)').getall()
        for url in urls:
            print("url: ", url)
            yield scrapy.Request(url=base_url + url, callback=self.parse)
        # now that we have the urls, check whether this page has anything we can scrape
        pipeline = NosetimeScraperPipeline()
        perfume = pipeline.process_response(response)
        try:
            if perfume['enname']:
                print("Finally are going to store: ", perfume['enname'])
                pipeline.save_in_mongo(perfume)
        except KeyError:
            pass
【Discussion】:
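One possible direction, offered as a minimal sketch rather than a definitive answer: Scrapy passes every `-a name=value` pair on the command line to the spider's constructor as a keyword argument, so the hard-coded list could be replaced by a comma-separated string or by the path to a text file. The helper below uses only the standard library and turns either input into absolute start URLs. The names `build_start_urls`, `urls` and `url_file` are hypothetical and chosen for illustration; the file is assumed to hold one path or URL per line.

```python
from pathlib import Path
from urllib.parse import urljoin

BASE_URL = "https://www.nosetime.com"  # same site as in the question


def build_start_urls(urls=None, url_file=None, base_url=BASE_URL):
    """Return absolute start URLs from a comma-separated string and/or a file.

    `urls` and `url_file` are hypothetical argument names for this sketch.
    """
    paths = []
    if urls:
        # e.g. -a urls="/pinpai/a.html,/pinpai/b.html"
        paths.extend(urls.split(","))
    if url_file:
        # assumed format: one path or URL per line
        paths.extend(Path(url_file).read_text(encoding="utf-8").splitlines())
    # Drop blank entries and resolve relative paths against the site root.
    return [urljoin(base_url, p.strip()) for p in paths if p.strip()]
```

Inside the spider, an `__init__(self, urls=None, url_file=None, *args, **kwargs)` could call `super().__init__(*args, **kwargs)` and then set `self.start_urls = build_start_urls(urls, url_file)`. The crawl could then be started with something like `scrapy crawl nosetime -a url_file=urls.txt -o results.jl`, where `urls.txt` is an assumed file name.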
Tags: python-3.x scrapy arguments parameter-passing