【Question title】: Trying to scrape data from the website dicksmith.com.au
【Posted】: 2015-05-21 10:29:48
【Question description】:

Python 2.7.6, Scrapy 0.24.6, website: dicksmith.com.au, OS: Linux (Ubuntu). URL (the mobile phone listing, which is simple): http://search.dicksmith.com.au/search?w=mobile+phone&ts=m

Sorry guys, I am new to Scrapy. Thanks in advance.

Code:

import scrapy

class PriceWatchItem( scrapy.Item ):
    name = scrapy.Field()
    price = scrapy.Field()

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class PriceWatchSpider( CrawlSpider ):
    name = 'dicksmith'
    allowed_domains = ['dicksmith.com.au']
    start_urls = ['http://search.dicksmith.com.au/search']
    rules = [ Rule ( LinkExtractor( allow = ['?w=mobile+phone&ts=m']      ), 'parse_dickSmith' ) ]

    def parse_dickSmith( self, response ):
        dickSmith = PriceWatchItem()
        dickSmith['name'] = response.xpath("//h1/text()").extract()
        return dickSmith
# Run with: scrapy crawl dicksmith -o scraped_data.json

Error:

File "pricewatch.py", line 10, in <module>
    class PriceWatchSpider( CrawlSpider ):
  File "pricewatch.py", line 14, in PriceWatchSpider
    rules = [ Rule ( LinkExtractor( allow = ['?w=mobile+phone&ts=m'] ), 'parse_dickSmith' ) ]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/contrib/linkextractors/lxmlhtml.py", line 94, in __init__
    deny_extensions)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/linkextractor.py", line 46, in __init__
    self.allow_res = [x if isinstance(x, _re_type) else re.compile(x) for x in arg_to_iter(allow)]
  File "/usr/lib/python2.7/re.py", line 190, in compile
    return _compile(pattern, flags)
  File "/usr/lib/python2.7/re.py", line 244, in _compile
    raise error, v # invalid expression
sre_constants.error: nothing to repeat

【Question discussion】:

    Tags: python web-scraping scrapy


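    【Solution 1】:

    You need to escape ? and +. LinkExtractor compiles each allow string as a regular expression, where ? and + are quantifiers, so a pattern that starts with ? has nothing to repeat.

    Try this: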

    import re
    # Escape ? and + so they match literally instead of acting as quantifiers.
    reg = re.compile(r'\?w=mobile\+phone&ts=m')
    rules = [ Rule( LinkExtractor( allow = reg ), 'parse_dickSmith' ) ]
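
The escaping issue can be reproduced without Scrapy, using only the standard re module. The following is a minimal sketch, not part of the original answer: the variable names (url, raw_fragment, manual, escaped) are illustrative, and re.escape() is shown as an assumed alternative to escaping by hand when the fragment is meant to be matched literally.

```python
import re

# The search URL from the question and the literal fragment used in the rule.
url = "http://search.dicksmith.com.au/search?w=mobile+phone&ts=m"
raw_fragment = "?w=mobile+phone&ts=m"

# Compiling the unescaped fragment fails: the leading "?" is a quantifier
# with nothing in front of it to repeat.
try:
    re.compile(raw_fragment)
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

# Option 1: escape the metacharacters by hand, using a raw string.
manual = re.compile(r"\?w=mobile\+phone&ts=m")

# Option 2 (an alternative, not from the answer): let re.escape() do it.
escaped = re.compile(re.escape(raw_fragment))

print("compile error:", compile_error)
print("manual pattern matches:", manual.search(url) is not None)
print("re.escape pattern matches:", escaped.search(url) is not None)
```

Either compiled pattern could then be passed as the allow argument, as in the answer above. The raw-string prefix keeps the backslashes from being interpreted by Python before the regex engine sees them.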
    

    【Discussion】:
