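[Posted]: 2019-09-25 11:12:21
[Question]:
Suppose I am trying to scrape a website that is designed so that every request it receives must include a valid string key issued by a third party. If you send a request without a valid key, the site replies with an empty string. This is what I have so far: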
from urllib.parse import urlencode

from scrapy import Request, Spider

# getMyCredentials() and requestToAThirdPartyService() are my own helpers (not shown here).

class mySpider(Spider):
    name = 'mySpider'
    # nicesite.com contains a list of items that are stored in my problematic website. It can be accessed without any key
    start_urls = ['http://www.nicesite.com']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Let's say that every time I get new credentials I'm billed $1. Also assume that getMyCredentials() will generate new credentials every time it is called
        self.credentials = getMyCredentials()

    # parsing nicesite.com
    def parse(self, response):
        # imagine that myList contains 50000 items --> I can't get new credentials for each item. That would be very expensive
        myList = response.selector.xpath('xpath_that_yields_the_items_Im_interested').getall()
        for i in myList:
            myKey = requestToAThirdPartyService(self.credentials)
            yield Request('http://naughtysite.com/items/' + i + '/?' + urlencode(myKey), callback=self.parseItem)

    # parsing naughtysite.com
    def parseItem(self, response):
        # response.body is bytes, so test for emptiness instead of comparing with ''
        if not response.body:
            print('Dang! We lost an item because our key isnt valid anymore.')
            # update our credentials so the next items wont be lost as well
            self.credentials = getMyCredentials()
        else:
            # collect the relevant data and yield item:
            item = {'data': response.selector.xpath('relevant_xpath').getall()}
            yield item
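The problem is fairly obvious: parseItem is not called right after each request is yielded, but only after all the requests have been yielded. That is why the first n items are scraped successfully and all the remaining ones are not. Once naughtysite.com starts rejecting my key, the key is never updated and keeps being rejected.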
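To make that failure mode concrete, here is a toy model in plain Python. It is not Scrapy itself, only a sketch under assumptions: `Site`, `Credentials`, `crawl` and the "a credential works for `limit` requests" rule are all hypothetical stand-ins. It compares attaching the key when a request is queued (as in the spider above) with attaching it when the request is actually sent.

```python
from collections import deque


class Site:
    """Toy stand-in for naughtysite.com: each credential works for `limit` requests."""

    def __init__(self, limit):
        self.limit = limit
        self.uses = {}  # credential id -> number of requests seen

    def fetch(self, item, cred):
        self.uses[cred] = self.uses.get(cred, 0) + 1
        return "" if self.uses[cred] > self.limit else f"data:{item}"


class Credentials:
    """Toy stand-in for getMyCredentials(): every refresh is one paid purchase."""

    def __init__(self):
        self.purchases = 0
        self.current = None
        self.refresh()

    def refresh(self):
        self.purchases += 1
        self.current = self.purchases


def crawl(items, limit, sign_when):
    site, creds = Site(limit), Credentials()
    if sign_when == "queue":
        # Key is baked into every request up front, before any response is seen.
        queue = deque((item, creds.current) for item in items)
    else:
        # Key is left empty and filled in only when the request is sent.
        queue = deque((item, None) for item in items)

    scraped = []
    while queue:
        item, cred = queue.popleft()
        if cred is None:
            cred = creds.current
        if site.fetch(item, cred) == "":
            creds.refresh()  # what parseItem does on an empty body
        else:
            scraped.append(item)
    return scraped, creds.purchases


print(crawl(range(10), 3, "queue"))  # ([0, 1, 2], 8)
print(crawl(range(10), 3, "send"))   # ([0, 1, 2, 4, 5, 6, 8, 9], 3)
```

In this model, refreshing the credentials in the callback does not help requests that were already built with the old key, and every one of those failures pays for a new credential. Signing at send time loses only the request that discovers the expiry. In Scrapy, a downloader middleware's `process_request` hook may be a suitable place to attach the key late, though that depends on how the real key service behaves.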
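What I would like to do is have parseItem called immediately after each request is yielded, so that I can tell whether the response is empty and, if it is, update my credentials. With updated credentials, the subsequent requests would go through without any problem. Can someone help me accomplish this? Thanks.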
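One possible way to keep the credential bill down while reacting to empty responses is to tag each request with the "generation" of the credentials it was signed with, and to buy new credentials only when the rejected generation is still the current one. The sketch below is a minimal illustration in plain Python; `CredentialStore`, `snapshot`, `renew` and `fake_get_credentials` are hypothetical names, and the real `getMyCredentials()` is assumed to be passed in as `fetch_credentials`.

```python
class CredentialStore:
    """Holds the current credentials and renews them at most once per generation."""

    def __init__(self, fetch_credentials):
        self._fetch = fetch_credentials  # e.g. the question's getMyCredentials
        self.generation = 0
        self.credentials = None
        self.renew(stale_generation=0)   # initial purchase

    def snapshot(self):
        """Return (generation, credentials) to attach to an outgoing request."""
        return self.generation, self.credentials

    def renew(self, stale_generation):
        """Buy new credentials only if `stale_generation` is still the current one."""
        if stale_generation != self.generation:
            return False  # an earlier failure already triggered a renewal
        self.credentials = self._fetch()
        self.generation += 1
        return True


# Hypothetical billing counter standing in for the paid service.
purchases = []

def fake_get_credentials():
    purchases.append(1)
    return f"cred-{len(purchases)}"


store = CredentialStore(fake_get_credentials)
generation, credentials = store.snapshot()

# Three in-flight requests signed with the same generation all come back empty:
renewed = [store.renew(generation) for _ in range(3)]
print(renewed, len(purchases), store.snapshot())
# [True, False, False] 2 (2, 'cred-2')
```

In a spider, the generation could travel with the request through `Request.meta`, and an empty response could call `renew(response.meta[...])` and then yield the same URL again, re-signed and with `dont_filter=True` so the duplicate filter does not drop the retry. This is only a sketch of the idea: it assumes callbacks run one at a time, and a retry limit would be needed to avoid looping on items that are genuinely empty.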
[Comments]:
Tags: python web-scraping scrapy web-crawler