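[Posted]: 2015-11-20 08:30:41
[Question]:
I built a Scrapy spider that crawls all of the internal links within a website. However, when I run it, some sites turn out to have large sections that have nothing to do with the site's actual content. For example, one website hosts a Jenkins instance, and my spider spends a lot of time exploring those pages even though they are completely unrelated to the site itself.
One approach is to create a blacklist and add paths such as the Jenkins one to it, but I would like to know whether there is a better way to handle this.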
import csv
from urllib.parse import urlparse  # Python 2: from urlparse import urlparse

import scrapy
from scrapy import Request
from scrapy.exceptions import CloseSpider
from scrapy.item import BaseItem  # available in the Scrapy releases of that time
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader


class MappingItem(dict, BaseItem):
    pass


class WebsiteSpider(scrapy.Spider):
    name = "Website"

    def __init__(self, *args, **kwargs):
        super(WebsiteSpider, self).__init__(*args, **kwargs)
        item = MappingItem()
        self.loader = ItemLoader(item)
        # Domains of all seed sites; used to spot links between them.
        self.filter_urls = list()

    def start_requests(self):
        filename = "filename.csv"
        try:
            with open(filename, 'r') as csv_file:
                reader = csv.reader(csv_file)
                header = next(reader)  # skip the header row
                for row in reader:
                    seed_url = row[1].strip()
                    base_url = urlparse(seed_url).netloc
                    self.filter_urls.append(base_url)
                    request = Request(seed_url, callback=self.parse_seed)
                    request.meta['base_url'] = base_url
                    yield request
        except IOError:
            raise CloseSpider("A list of websites is needed")

    def parse_seed(self, response):
        base_url = response.meta['base_url']
        # handle external redirect while still allowing internal redirect
        if urlparse(response.url).netloc != base_url:
            return
        # Record links that point from this site to another seed site.
        external_le = LinkExtractor(deny_domains=base_url)
        external_links = external_le.extract_links(response)
        for external_link in external_links:
            if urlparse(external_link.url).netloc in self.filter_urls:
                self.loader.add_value(base_url, external_link.url)
        # Follow every internal link, staying on the same domain.
        internal_le = LinkExtractor(allow_domains=base_url)
        internal_links = internal_le.extract_links(response)
        for internal_link in internal_links:
            request = Request(internal_link.url, callback=self.parse_seed)
            request.meta['base_url'] = base_url
            request.meta['dont_redirect'] = True
            yield request
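For reference, the blacklist idea mentioned above could look roughly like the sketch below. It is a minimal, standard-library-only illustration, not part of the spider: the names `DENY_PATH_PATTERNS`, `is_blacklisted` and `filter_links` are hypothetical, and the Jenkins path pattern is an assumption about where such pages live on a given site.

```python
import re
from urllib.parse import urlparse

# Hypothetical deny patterns, matched against the URL path only.
# The Jenkins prefix is an assumption; adjust it per site.
DENY_PATH_PATTERNS = [
    r"^/jenkins(/|$)",
]

_COMPILED = [re.compile(p, re.IGNORECASE) for p in DENY_PATH_PATTERNS]


def is_blacklisted(url):
    """Return True if the path of `url` matches any deny pattern."""
    path = urlparse(url).path
    return any(rx.search(path) for rx in _COMPILED)


def filter_links(urls):
    """Keep only the URLs whose path is not blacklisted."""
    return [u for u in urls if not is_blacklisted(u)]
```

Inside `parse_seed`, a check like `is_blacklisted(internal_link.url)` could be applied before yielding each request. Scrapy's `LinkExtractor` also accepts a `deny` argument with regular expressions, but those are matched against the whole URL rather than just the path, so the patterns would have to be written accordingly.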
[Comments]:
-
Are you using link extractors? Showing the relevant parts of the spider code might help. Thanks!
Tags: python web-scraping scrapy