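An alternative to the regular-expression approach is to use a JavaScript parser, convert the parser's output into an XML document, and then query that document with XPath.
This is what js2xml does, using slimit and lxml.
(Disclaimer: I wrote js2xml. Warning: it is unstable.)
In your case, have a look at this sample scrapy shell session, which uses js2xml.jsonlike.getall():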
paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines:
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-19 16:12:00+0200 [default] INFO: Spider opened
2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
[s] Available Scrapy objects:
[s] crawler <scrapy.crawler.Crawler object at 0x7f8552946610>
[s] item {}
[s] request <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s] response <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
[s] settings <CrawlerSettings module=None>
[s] spider <Spider 'default' at 0x7f8552384b90>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
/usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
warn("The top-level `frontend` package has been deprecated. "
In [1]: scripts = response.selector.xpath('//script/text()').extract()
In [2]: import js2xml, js2xml.jsonlike
In [3]: js = js2xml.parse(scripts[-1])
In [4]: js2xml.jsonlike.getall(js)
Out[4]:
[{'onVariantSelected': 'selectCallback',
'product': {'available': True,
'compare_at_price': None,
'compare_at_price_max': 0,
'compare_at_price_min': 0,
'compare_at_price_varies': False,
'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
'created_at': '2013-11-29T13:37:11+02:00',
'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
'handle': '2loom-design-siyah-beyaz-kalpli',
'id': 185310341,
'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
'//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
'//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
'//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
'options': ['Size'],
'price': 15900,
'price_max': 15900,
'price_min': 15900,
'price_varies': False,
'published_at': '2013-11-29T13:34:20+02:00',
'tags': [u'2\xb7Loom',
'Beyaz',
'Design',
'Ekrek',
u'Kad\u0131n',
'Kalpli',
'Lacivert'],
'title': '10. Design | Siyah & beyaz kalpli',
'type': '2 Loom Limiteds',
'variants': [{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424584985,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 3,
'option1': 'XS (34-36: 1.60m-1.70m)',
'option2': None,
'option3': None,
'options': ['XS (34-36: 1.60m-1.70m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-XS',
'taxable': True,
'title': 'XS (34-36: 1.60m-1.70m)',
'weight': 0},
{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424584989,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 3,
'option1': 'S (36-38: 1.65m-1.75m)',
'option2': None,
'option3': None,
'options': ['S (36-38: 1.65m-1.75m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-S',
'taxable': True,
'title': 'S (36-38: 1.65m-1.75m)',
'weight': 0},
{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424584997,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 7,
'option1': 'M (38-40: 1.70m-1.80m)',
'option2': None,
'option3': None,
'options': ['M (38-40: 1.70m-1.80m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-M',
'taxable': True,
'title': 'M (38-40: 1.70m-1.80m)',
'weight': 0},
{'available': True,
'barcode': None,
'compare_at_price': None,
'id': 424585001,
'inventory_management': 'shopify',
'inventory_policy': 'deny',
'inventory_quantity': 7,
'option1': 'L (40-42: 1.75m-1.85m)',
'option2': None,
'option3': None,
'options': ['L (40-42: 1.75m-1.85m)'],
'price': 15900,
'requires_shipping': True,
'sku': 'T01-BLWH-1-L',
'taxable': True,
'title': 'L (40-42: 1.75m-1.85m)',
'weight': 0}],
'vendor': u'2\xb7Loom'}}]
In [5]:
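Once getall() has handed back plain Python lists and dicts, pulling out individual fields is ordinary dictionary work. Here is a minimal sketch of that step, assuming Python 3 and a hand-trimmed copy of the output above (a few keys and two of the four variants). The helper names find_product and summarize_variants are hypothetical and are not part of js2xml.

```python
# Hand-trimmed copy of the list returned by js2xml.jsonlike.getall(js) in the
# session above. Only some keys and two of the four variants are kept.
results = [{
    'onVariantSelected': 'selectCallback',
    'product': {
        'id': 185310341,
        'handle': '2loom-design-siyah-beyaz-kalpli',
        'price': 15900,
        'variants': [
            {'id': 424584985, 'sku': 'T01-BLWH-1-XS', 'available': True,
             'inventory_quantity': 3, 'price': 15900},
            {'id': 424584997, 'sku': 'T01-BLWH-1-M', 'available': True,
             'inventory_quantity': 7, 'price': 15900},
        ],
    },
}]


def find_product(objects):
    """Return the first 'product' dict among the extracted objects, or None."""
    for obj in objects:
        if isinstance(obj, dict) and isinstance(obj.get('product'), dict):
            return obj['product']
    return None


def summarize_variants(product):
    """Return (sku, inventory_quantity) pairs for variants marked available."""
    return [(v['sku'], v['inventory_quantity'])
            for v in product.get('variants', [])
            if v.get('available')]


product = find_product(results)
if product is not None:
    print(product['handle'], summarize_variants(product))
```

In a spider you would pass the real return value of js2xml.jsonlike.getall(js) instead of the literal above. The isinstance checks are there because a script may contain several object literals, and not all of them will have a 'product' key.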