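[Posted]: 2014-09-10 13:18:39
[Question]:
I have the following Scrapy code, which I am using to try to scrape only the Premier League data from the website named in the code: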
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log
from scrapy.cmdline import execute
from scrapy.utils.markup import remove_tags
import time

class ExampleSpider(CrawlSpider):
    name = "goal3"
    allowed_domains = ["whoscored.com"]
    start_urls = ["http://www.whoscored.com/Regions/252/Tournaments/2/Seasons/3853/Stages/7794/PlayerStatistics/England-Premier-League-2013-2014"]
    download_delay = 1

    rules = [Rule(SgmlLinkExtractor(allow=()),
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]

    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
        scripts = response.selector.xpath("normalize-space(//title)")
        for scripts in scripts:
            body = response.xpath('//p').extract()
            body2 = "".join(body)
            print remove_tags(body2).encode('utf-8')

execute(['scrapy','crawl','goal3'])
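What the code appears to do is take the link to the Premier League data as its starting point and then crawl every link that page contains, even when a link points to a part of the site unrelated to the Premier League data. In effect, it ends up crawling the whole site, just not starting from the home page.
Is there a way to make Scrapy crawl only the links that depend on your starting point?
Thanks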
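For reference, an empty `allow=()` on a Scrapy link extractor matches every link, which is consistent with the site-wide crawl described above. A minimal sketch of the kind of URL filter that would narrow the crawl, written with the standard library only so it can be run without Scrapy: the scope pattern `/Regions/252/Tournaments/2/` is an assumption taken from the start URL, the helper name `in_scope` is hypothetical, and the second and third candidate URLs are illustrative rather than checked against the site.

```python
import re
from urllib.parse import urlparse

# Assumed scope: any path under the tournament prefix of the start URL.
SCOPE = re.compile(r"/Regions/252/Tournaments/2/")


def in_scope(url, allowed_domain="whoscored.com"):
    """Return True if url is on the allowed domain and under the assumed scope."""
    parts = urlparse(url)
    host = parts.hostname or ""
    on_domain = host == allowed_domain or host.endswith("." + allowed_domain)
    return on_domain and SCOPE.search(parts.path) is not None


# The first URL is the start URL from the question; the others are
# illustrative examples of links that fall outside the assumed scope.
candidates = [
    "http://www.whoscored.com/Regions/252/Tournaments/2/Seasons/3853/Stages/7794/PlayerStatistics/England-Premier-League-2013-2014",
    "http://www.whoscored.com/Regions/206/Tournaments/4/Spain-La-Liga",
    "http://www.whoscored.com/",
]

# Keep only the links that the filter accepts.
kept = [u for u in candidates if in_scope(u)]
print(kept)
```

In the spider itself, the same regular expression could presumably be passed as the `allow` argument of the link extractor in `rules`, so that only matching links are followed. Whether this prefix covers every page of interest on the site is not something the sketch verifies.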
[Comments]:
Tags: python web-scraping scrapy