【Posted】: 2018-11-18 17:34:53
【Problem description】:
I'm very new to Scrapy. I am trying to extract 100 quotes from this website (http://quotes.toscrape.com/random), and for that I wrote the following spider:
# -*- coding: utf-8 -*-
import scrapy


class QuotesProjectSpider(scrapy.Spider):
    name = 'quotes_project'
    allowed_domains = ['toscrape.com']
    start_urls = ['http://quotes.toscrape.com/random']

    def parse(self, response):
        self.log('i gonna scrape : '+response.url)
        #self.log('the whole page : '+response.text)
        i=1
        tempQuotes = {}
        quotesArray = [ {
            'author' : response.css('div.quote small.author::text')[0].extract(),
            'quote' : response.css('div.quote span.text::text')[0].extract(),
            'tags' : response.css('div.quote div.tags a.tag::text').extract()
        }]
        flag = False
        while i < 100:
            tempQuotes = {
                'author' : response.css('div.quote small.author::text')[0].extract(),
                'quote' : response.css('div.quote span.text::text')[0].extract(),
                'tags' : response.css('div.quote div.tags a.tag::text').extract()
            }
            flag = False
            j = 0
            n = len(quotesArray)
            while not flag and j < n :
                if tempQuotes['quote'] == quotesArray[j]['quote'] :
                    flag = True
                j+=1
            if not flag :
                quotesArray.append(tempQuotes)
                i+=1
            print("i = " + str(i))
            print("quote : "+tempQuotes['quote'])
            print("condition : " + str(tempQuotes['quote'] == quotesArray[0]['quote']))
        yield quotesArray
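The problem is with this line:
    print("condition : " + str(tempQuotes['quote'] == quotesArray[0]['quote']))
It prints True in an infinite loop, which means the response is never updated, even though the website shows a new quote every time the page is refreshed. So how can I update the response of the parse function on each iteration of the loop? Can someone help me?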
【Discussion】:
Tags: python-3.x web-scraping scrapy