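【Posted】: 2015-06-18 21:30:26
【Question】:
I am trying to scrape the title and URL of every Khan Academy page under the math, science, and economics sections. However, at the moment the spider only outputs a left bracket, and before that it would only scrape the start URL.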
from openbar_index.items import OpenBarIndexItem
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class OpenBarSpider(CrawlSpider):
    """
    scrapes website URLs from educational websites and commits urls/webpage names/text to a document
    """
    name = 'openbar'
    allowed_domains = 'khanacademy.org'
    start_urls = [
        "https://www.khanacademy.org"
    ]
    rules = [
        Rule(SgmlLinkExtractor(allow=['/math/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/science/']), callback='parse_item', follow=True),
        Rule(SgmlLinkExtractor(allow=['/economics-finance-domain/']), callback='parse_item', follow=True)
    ]

    def parse_item(self, response):
        item = OpenBarIndexItem()
        url = response.url
        item['url'] = url
        item['title'] = response.xpath('/html/head/title/text()').extract()
        yield item
Does anyone know why this happens, or have any tips on how to fix it?
【Discussion】:
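One possible cause, offered as a guess rather than a confirmed diagnosis: `allowed_domains` in the spider is a plain string, while Scrapy documents it as a list of domain strings. If the offsite filter iterates over that value, a string yields single characters, so no real host can match and every followed link is dropped; only the start URL is fetched and the output file ends up with nothing but the opening bracket. The sketch below uses only the standard library to illustrate the effect. `build_host_regex` and `is_allowed` are hypothetical names, and the pattern is a simplified imitation of such a filter, not Scrapy's actual source.

```python
import re
from urllib.parse import urlparse


def build_host_regex(allowed_domains):
    """Build a host pattern from allowed_domains (simplified, assumed logic)."""
    # Iterating over a str yields one character at a time, so a bare string
    # produces one alternative per character instead of one per domain.
    alternatives = "|".join(re.escape(d) for d in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % alternatives)


def is_allowed(url, allowed_domains):
    """Return True if the URL's host matches the pattern built above."""
    host = urlparse(url).hostname or ""
    return bool(build_host_regex(allowed_domains).search(host))


url = "https://www.khanacademy.org/math/algebra"

# As written in the question: a bare string.
as_string = is_allowed(url, "khanacademy.org")

# As a list of domains, which is the documented form.
as_list = is_allowed(url, ["khanacademy.org"])

print(as_string, as_list)
```

If this is indeed the cause, changing the attribute in the spider to `allowed_domains = ['khanacademy.org']` would be the first thing to try.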
Tags: python url web-crawler scrapy scrape