【Posted】: 2018-09-22 16:33:09
【Question】:
This is my spider:
from scrapy import Selector
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from Diplom.items import QuestionItem


class ConsultSpider(CrawlSpider):
    name = "consultation"
    allowed_domains = ['health.mail.ru']
    start_urls = ['https://health.mail.ru/consultation/1579497']
    rules = {
        Rule(LinkExtractor(allow=('.*\/consultation\/\d+'),), callback="parse_item", follow=True),
    }

    def parse_item(self, response):
        items = []
        root = Selector(response)
        posts = root.xpath('/html/body/div[2]/div[1]/div[5]/div/div[1]/div[1]/div[2]')
        for post in posts:
            item = QuestionItem()
            item['question'] = post.xpath(
                '//div[1]/div/div/div[2]/div[2]').extract()
            item['answer'] = post.xpath('//div[3]/div[2]/div[2]').extract()
            items.append(item)
        return items
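The problem is that the spider does follow the links described in the rules:
INFO: Crawled 8 pages (at 8 pages/min), scraped 0 items (at 0 items/min)
but it does not return any items. If I change the base class and write it like this, my code works: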
class ConsultSpider(scrapy.Spider):
    ....
But then the Rules are not applied.
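One thing that can be checked without running the crawl is which URLs the rule's `allow` pattern accepts. The snippet below is a minimal sketch using only the standard library. It assumes that `LinkExtractor` applies `allow` patterns with `re.search` semantics against the absolute URL (my understanding of Scrapy, not something stated in the post). The helper `rule_would_follow` and every sample URL except the start URL are hypothetical.

```python
import re

# The pattern from the rule above, written as a raw string so the
# backslashes reach the regex engine unchanged.
ALLOW_PATTERN = re.compile(r'.*\/consultation\/\d+')


def rule_would_follow(url):
    """Return True if the pattern matches anywhere in the URL.

    Assumption: this mirrors how LinkExtractor tests `allow` patterns.
    """
    return ALLOW_PATTERN.search(url) is not None


urls = [
    'https://health.mail.ru/consultation/1579497',            # start URL from the post
    'https://health.mail.ru/consultation/1579497/comments/',  # hypothetical sub-page
    'https://health.mail.ru/consultation/',                   # hypothetical listing page
]

for url in urls:
    print(url, rule_would_follow(url))
```

Because the pattern is not anchored at the end, any URL that merely contains `/consultation/<digits>` is accepted. If some of those pages have a different layout, the absolute XPath in `parse_item` may select nothing there, which would be consistent with pages being crawled while no items are scraped.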
【Comments】:
Tags: python web-scraping scrapy web-crawler