【Question title】: How to create a LinkExtractor rule based on href in Scrapy
【Posted】: 2015-02-04 12:22:18
【Question】:

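I am trying to create a simple crawler with Scrapy (scrapy.org). In the example below, item.php is allowed. How do I write a rule that allows only URLs that start with http://example.com/category/ and whose GET parameters include page with a value of any number of digits, possibly alongside other parameters? The order of the parameters is random. How can I write such a rule?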

A few valid values are:

http://example.com/category/?sort=a-z&page=1
http://example.com/category/?page=1&sort=a-z&cache=1
http://example.com/category/?page=1&sort=a-z#

Here is the code:

import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor
class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/category/']

    rules = (
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

【Comments】:

    Tags: python regex web-scraping scrapy


    【Solution 1】:

    Test for http://example.com/category/ at the start of the string and for a page parameter whose value contains one or more digits:

    Rule(LinkExtractor(allow=(r'^http://example\.com/category/\?.*?(?=page=\d+)', )), callback='parse_item'),
    

    Demo (using your example URLs):

    >>> import re
    >>> pattern = re.compile(r'^http://example\.com/category/\?.*?(?=page=\d+)')
    >>> should_match = [
    ...     'http://example.com/category/?sort=a-z&page=1',
    ...     'http://example.com/category/?page=1&sort=a-z&cache=1',
    ...     'http://example.com/category/?page=1&sort=a-z#'
    ... ]
    >>> for url in should_match:
    ...     print("Matches" if pattern.search(url) else "Doesn't match")
    ... 
    Matches
    Matches
    Matches
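
The lookahead above accepts `page=` anywhere after the `?`, so a URL such as http://example.com/category/?mypage=1 would also pass. Below is a minimal sketch of a stricter variant, assuming `page` has to be a complete parameter name with a digits-only value. The `STRICT` pattern, the `is_category_page` helper, and the `should_not_match` URLs are hypothetical and not part of the original answer:

```python
import re

# Hypothetical stricter pattern (an assumption, not the original answer's):
# `page` must directly follow `?` or `&`, and its digits must be followed
# by `&`, `#`, or the end of the URL.
STRICT = re.compile(
    r'^http://example\.com/category/\?(?:[^#]*&)?page=\d+(?=[&#]|$)'
)


def is_category_page(url):
    """Return True if url is a category URL with a numeric page parameter."""
    return STRICT.search(url) is not None


should_match = [
    'http://example.com/category/?sort=a-z&page=1',
    'http://example.com/category/?page=1&sort=a-z&cache=1',
    'http://example.com/category/?page=1&sort=a-z#',
]
# Illustrative negative cases, made up for this sketch.
should_not_match = [
    'http://example.com/category/?mypage=1',   # different parameter name
    'http://example.com/category/?page=abc',   # value is not numeric
    'http://example.com/category/?sort=a-z',   # no page parameter
    'http://example.com/other/?page=1',        # wrong path
]

for url in should_match + should_not_match:
    print(url, "->", "Matches" if is_category_page(url) else "Doesn't match")
```

Since `allow` takes regular expressions that are matched against the absolute URL, the same pattern string could presumably be passed to `LinkExtractor(allow=...)` in place of the one above.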
    

    【Comments】:

      【Solution 2】:

      Try it like this:

      import re

      # Capture the full href, the fixed prefix, and the remainder of the URL
      p = re.compile(r'<[^>]+href="((http://example\.com/category/)([^"]+))"', re.MULTILINE)
      test_str = "<a class=\"youarehere\" href=\"http://example.com/category/?sort=newest\">newest</a>\n \n<a href=\"http://example.com/category/?sot=frequent\">frequent</a>"

      print(p.findall(test_str))
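
The regex above captures every href under the category prefix, whether or not it carries a page parameter. As a rough alternative sketch, assuming only the standard library is available, the hrefs can be collected with `html.parser` and then filtered by parsing the query string. The names `HrefCollector`, `has_numeric_page`, and `PREFIX`, as well as the sample HTML, are hypothetical and introduced here for illustration:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit, parse_qs

PREFIX = "http://example.com/category/"  # prefix taken from the question


class HrefCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def has_numeric_page(url):
    """True if url starts with PREFIX and has a page parameter made of digits."""
    if not url.startswith(PREFIX):
        return False
    values = parse_qs(urlsplit(url).query).get("page", [])
    return any(v.isdigit() for v in values)


# Made-up sample markup; `&amp;` is the HTML-escaped form of `&`.
html = (
    '<a class="youarehere" '
    'href="http://example.com/category/?page=2&amp;sort=newest">newest</a>\n'
    '<a href="http://example.com/category/?sot=frequent">frequent</a>'
)

collector = HrefCollector()
collector.feed(html)
category_pages = [u for u in collector.hrefs if has_numeric_page(u)]
print(category_pages)
```

Parsing the query string makes the check independent of parameter order, which is the property the question asks for.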
      


      【Comments】:
