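【Posted】: 2018-11-09 10:27:53
【Question】:
I want to scrape every page. I found a way to do it in the scrapy shell, but I don't know whether my spider will go through every page or only the next one, and I'm not sure how to implement this.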
import string

import scrapy

# "." followed by A-Z: one listing URL per entry.
alphabet = string.ascii_uppercase
each_link = '.' + alphabet
each_url = ["https://myanimelist.net/anime.php?letter={0}".format(i) for i in each_link]
#sub_page_of_url = [[str(url)+"&show={0}".format(i) for i in range(50, 2000, 50)] for url in each_url] #start/stop/steps
#full_url = each_url + sub_page_of_url


class AnimeScraper_Spider(scrapy.Spider):
    name = "Anime"

    def start_requests(self):
        for url in each_url:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        next_page_url = response.xpath(
            "//div[@class='bgColor1']//a[text()='Next']/@href").extract_first()
        # Select the href attribute itself, not the whole <a> element.
        for href in response.css('#content > div.normal_header.clearfix.pt16 > div > div > span > a:nth-child(1)::attr(href)'):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.parse_anime)
        # The last page has no "Next" link, so only follow it when present.
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_anime(self, response):
        for tr_sel in response.css('div.js-categories-seasonal tr ~ tr'):
            # yield (not return) so every row is emitted, not just the first.
            yield {
                "title": tr_sel.css('a[id] strong::text').extract_first(default="").strip(),
                "synopsis": tr_sel.css("div.pt4::text").extract_first(),
                "type_": tr_sel.css('td:nth-child(3)::text').extract_first(default="").strip(),
                "episodes": tr_sel.css('td:nth-child(4)::text').extract_first(default="").strip(),
                "rating": tr_sel.css('td:nth-child(5)::text').extract_first(default="").strip(),
            }
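On whether the spider visits every page or only the next one: a request yielded with callback=self.parse is handled by parse again, which looks up that page's own "Next" link, so the chain continues until a page has no such link. Below is a toy model of that control flow, not Scrapy itself. The PAGES dict, its hrefs, and the crawl helper are hypothetical stand-ins for the site and the engine.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical site: each page maps to the href of its "Next" link, or None.
PAGES: Dict[str, Optional[str]] = {
    "/anime.php?letter=A": "/anime.php?letter=A&show=50",
    "/anime.php?letter=A&show=50": "/anime.php?letter=A&show=100",
    "/anime.php?letter=A&show=100": None,  # last page: no "Next" link
}


def parse(url: str) -> List[str]:
    """Mimic a parse callback: return the follow-up requests for this page."""
    next_href = PAGES[url]
    return [next_href] if next_href else []


def crawl(start_urls: List[str]) -> List[str]:
    """Mimic the engine: every request a callback emits is scheduled too."""
    queue = deque(start_urls)
    visited: List[str] = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        queue.extend(parse(url))  # the new request is parsed in turn
    return visited


print(crawl(["/anime.php?letter=A"]))
```

Real Scrapy schedules requests asynchronously and filters duplicate URLs by default; the model ignores both, since neither changes the answer.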
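【Discussion】:
- Try creating counter = 0 and incrementing it with counter += 50 on each iteration of a while True loop. The break condition should be if response.status == 404: break
- Could you give an example? I don't know how to do this; presumably I need to hit page 3 to get the href for page 4.
- You should add the counter to your URL as an additional parameter. For example, if url = "https://myanimelist.net/anime.php?letter=a", then the URL of each page should be url + "&show={}".format(counter). The counter changes on every iteration: 0 for page 1, 50 for page 2, 100 for page 3, and so on.
- Counter = 0 while True : Counter += 50 for href in response.css('#content > div.normal_header.clearfix.pt16 > div > div > span > a') : url = response.urljoin(href.extract()) url + "&show={}".format(Counter) if response.status == 404 : break Is anything missing? It takes a while to run.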
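The loop in the last comment never issues a new request, so response.status never changes and the loop cannot end; each offset needs a request of its own. A minimal sketch of the counter idea as plain URL construction follows. Here fetch_status is a hypothetical stand-in for whatever performs the request, and the step of 50 and the 404 stop condition come from the comments above rather than from anything verified against the site.

```python
from typing import Callable, Iterator

BASE_URL = "https://myanimelist.net/anime.php?letter={letter}"
PAGE_STEP = 50  # offset increment suggested in the comments (assumption)


def page_url(letter: str, offset: int) -> str:
    """Build the listing URL for one letter at a given offset."""
    return BASE_URL.format(letter=letter) + "&show={}".format(offset)


def iter_pages(letter: str, fetch_status: Callable[[str], int]) -> Iterator[str]:
    """Yield page URLs for a letter until a request comes back 404."""
    offset = 0
    while True:
        url = page_url(letter, offset)
        if fetch_status(url) == 404:  # one request per offset
            break
        yield url
        offset += PAGE_STEP


if __name__ == "__main__":
    # Fake fetcher for demonstration: pretend offsets above 100 do not exist.
    def fake_status(url: str) -> int:
        return 200 if int(url.split("&show=")[1]) <= 100 else 404

    for u in iter_pages("A", fake_status):
        print(u)
```

In a Scrapy spider the same idea would mean yielding one Request per offset and checking the status in the callback. Note that Scrapy's HttpErrorMiddleware drops non-2xx responses by default, so the callback only sees a 404 if the spider lists it in handle_httpstatus_list.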
Tags: python web-scraping scrapy