[Posted]: 2015-02-04 12:22:18
[Question]:
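I am trying to build a simple crawler with Scrapy (scrapy.org). The example code, for instance, allows item.php. How can I write a rule that allows URLs which always start with http://example.com/category/ and whose query string contains the GET parameter page, with a value of any number of digits, possibly alongside other parameters? The parameters can appear in any order.
How should I write such a rule?
A few valid URLs are: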
- http://example.com/category/?page=1&sort=a-z&cache=1
- http://example.com/category/?page=1&sort=a-z#
- http://example.com/category/?sort=a-z&page=1
Here is the code:
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item
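
To make the requirement concrete, here is a minimal sketch of a regular expression that accepts the three valid URLs listed above and rejects URLs without a numeric page parameter. The names CATEGORY_PAGE_RE and is_category_page are hypothetical, and the pattern is only one possible reading of the requirement, checked here with the standard re module rather than inside a running spider.

```python
import re

# Hypothetical name; the pattern is an illustrative sketch, not taken from the question.
CATEGORY_PAGE_RE = re.compile(
    r'^http://example\.com/category/\?'  # fixed prefix, then the start of the query string
    r'(?:[^#]*&)?'                        # optional earlier parameters, the last one ending in '&'
    r'page=\d+'                           # the required page parameter with one or more digits
    r'(?:[&#]|$)'                         # then another parameter, a fragment, or the end of the URL
)


def is_category_page(url):
    """Return True if url starts with the category prefix and has a numeric page parameter."""
    return CATEGORY_PAGE_RE.search(url) is not None


if __name__ == '__main__':
    for candidate in (
        'http://example.com/category/?page=1&sort=a-z&cache=1',
        'http://example.com/category/?page=1&sort=a-z#',
        'http://example.com/category/?sort=a-z&page=1',
        'http://example.com/category/?sort=a-z',
    ):
        print(candidate, is_category_page(candidate))
```

Since LinkExtractor's allow argument takes regular expressions that are matched against absolute URLs, a pattern string like this one could presumably be passed there in place of item\.php. Note that the pattern requires the host example.com exactly as stated in the question; if links use www.example.com, as start_urls does, the prefix would need adjusting.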
[Discussion]:
Tags: python regex web-scraping scrapy