【Posted】: 2020-09-29 15:45:20
【Question】:
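Forgive me if this is a silly question; I'm still new to Python and web scraping.
I want to scrape all text elements from multiple sites with different structures, so as a first step I want to crawl each site and retrieve the URLs of all sub-pages for each domain.
But first of all, my code doesn't work for every link I pass in. I get this message: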
2020-09-29 17:24:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://markus-pieper.eu/> (referer: None)
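Finally, how can I restart the process once one link has finished? My idea is to do this for each link in a for loop, so that I get a list of sub-page URLs for each link, but I can't restart the crawler with a new URL.
Can anyone help? Many thanks in advance.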
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
import scrapy
from scrapy.crawler import CrawlerProcess
import re

global base_links, link_list, links
link_list = []
base_links = []

# assign list of urls to crawl
links = ['https://bernd-lange.de/',
         'https://markus-pieper.eu/']

# strips urls in order to get base-urls
for link in links:
    base = re.sub('/$', '', link)
    base = re.sub('^https:\/\/', '', base)
    base = re.sub('^www.', '', base)
    base_links.append(base)


class SpiderSpider(CrawlSpider):
    name = "sites"
    #allowed_domains = base_links
    le = LinkExtractor(allow_domains=base_links, unique=True)
    #rules = [Rule(le, callback='parse_all_subsites', follow=True)]
    rules = [Rule(le, callback='parse_all_subsites', follow=False)]

    def parse_all_subsites(self, response):
        #for link in response.css('a::attr(href)'):
        extracted_links = self.le.extract_links(response)
        pages = set()
        for link in extracted_links:
            pages.add(link.url)
        link_list.append(pages)


process = CrawlerProcess()

#iterates over every link and adds list of links of every sub-site to link_list
for link in links:
    process.crawl(SpiderSpider, start_urls=link)
process.start()
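
For reference, here is a minimal sketch of one way the per-link crawls could be set up. It is not a confirmed fix. It rests on two points about Scrapy: `start_urls` is expected to be a list rather than a single string, and `process.start()` can only be called once per Python process because the underlying Twisted reactor cannot be restarted. The sketch therefore schedules every crawl first and starts the process once. The helper names `base_domain`, `build_crawl_kwargs`, and `run_all` are hypothetical and do not appear in the code above. Passing `allowed_domains` per spider is an assumption about how the crawls should be kept apart.

```python
from urllib.parse import urlparse


def base_domain(url):
    """Return the host of `url` without a leading 'www.' (hypothetical helper)."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def build_crawl_kwargs(links):
    """Build one set of spider keyword arguments per start URL."""
    return [
        # start_urls must be a list, even for a single URL
        {"start_urls": [link], "allowed_domains": [base_domain(link)]}
        for link in links
    ]


def run_all(spider_cls, links):
    """Schedule one crawl per link, then start the crawler process once."""
    # Imported lazily so the helpers above can be used without Scrapy installed.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess()
    for kwargs in build_crawl_kwargs(links):
        # Keyword arguments are passed on to the spider's constructor.
        process.crawl(spider_cls, **kwargs)
    process.start()  # blocks until every scheduled crawl has finished


links = ["https://bernd-lange.de/", "https://markus-pieper.eu/"]
crawl_kwargs = build_crawl_kwargs(links)
```

With the spider class from the question, this would be invoked as `run_all(SpiderSpider, links)`. Because every spider still appends to the shared `link_list`, separating the results per link would need an extra step, for example grouping the collected URLs with `base_domain`.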
【Comments】:
Tags: python scrapy web-crawler screen-scraping