Today, while scraping a North Korean website, http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2, I found that it redirects several times and eventually lands back on the original URL. Since Scrapy filters duplicate URLs by default, the page could not be crawled.

After some research, I found that Scrapy can be told to crawl a URL repeatedly, and the setting is quite simple: pass dont_filter=True to the Request.

Reference:

https://blog.csdn.net/huyoo/article/details/75570668

The actual code is as follows:

def parse(self, response):
    meta = response.meta
    meta["website"] = "http://www.rodong.rep.kp/ko/"
    meta['area'] = 'xj_rodong_rep_kp'

    start_url_list = [
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=3",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=5",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=6",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=7",
        # "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=1&iSubMenuID=1",
        "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2"
    ]
    for url in start_url_list:
        # dont_filter=True bypasses Scrapy's duplicate filter, so the
        # redirect back to the start URL is still scheduled and crawled
        yield Request(url, meta=meta, callback=self.parse_list, dont_filter=True)
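To see why dont_filter=True is needed here, it helps to look at what Scrapy's default duplicate filter (RFPDupeFilter) does: it fingerprints each request and silently drops any request whose fingerprint it has already seen, which is exactly what happens when the site redirects back to the start URL. Below is a simplified stand-alone sketch of that mechanism (the SketchDupeFilter class and its URL-only fingerprint are my own illustration, not Scrapy's actual implementation, which also hashes the method and body):

```python
import hashlib

# Simplified sketch of Scrapy's default dupefilter behavior: each request
# URL is reduced to a fingerprint, and a request whose fingerprint was
# already seen is dropped by the scheduler.
class SketchDupeFilter:
    def __init__(self):
        self.fingerprints = set()

    def fingerprint(self, url: str) -> str:
        # Scrapy's real fingerprint also covers method/body; URL alone
        # is enough to illustrate the redirect-loop problem.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def request_seen(self, url: str) -> bool:
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True   # duplicate: Scrapy would drop this request
        self.fingerprints.add(fp)
        return False      # first sighting: request goes through

df = SketchDupeFilter()
url = "http://www.rodong.rep.kp/cn/index.php?strPageID=SF01_01_02&iMenuID=2"
print(df.request_seen(url))  # False: first crawl of the start URL
print(df.request_seen(url))  # True: redirect back to the same URL is a duplicate
```

Passing dont_filter=True on a Request tells the scheduler to skip this check for that one request, which is why the redirect loop above can still be crawled without disabling the filter globally.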
