【Posted】: 2017-11-14 10:42:29
【Question】:
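I am crawling a website and using Scrapy's LinkExtractor to follow links and check their response status.
I would also like to use the link extractor to get the image src URLs from the site. My code works for the site's page URLs, but I can't seem to get the images: nothing is logged to the console for them.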
handle_httpstatus_list = [404, 502]
# allowed_domains = ['mydomain']
start_urls = ['somedomain.com/']
http_user = '###'
http_pass = '#####'

rules = (
    # Page links: check the status of every link matching domain.com.
    Rule(LinkExtractor(allow=('domain.com',), canonicalize=True, unique=True),
         process_links='filter_links', follow=False, callback='parse_local_link'),
    # Image links: <img src="..."> URLs served from cdn.domain.com.
    Rule(LinkExtractor(allow=('cdn.domain.com',), tags=('img',), attrs=('src',),
                       canonicalize=True, unique=True),
         follow=False, callback='parse_image_link'),
)

def filter_links(self, links):
    for link in

def parse_local_link(self, response):
    # Report only responses that did not come back with 200.
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'local'
        item['referer'] = response.request.headers.get('Referer', None)
        yield item

def parse_image_link(self, response):
    print("Got image link")
    if response.status != 200:
        item = LinkcheckerItem()
        item['url'] = response.url
        item['status'] = response.status
        item['link_type'] = 'img'
        item['referer'] = response.request.headers.get('Referer', None)
        yield item
【Discussion】: