【Question title】: Scrapy parse javascript
【Posted】: 2014-07-02 22:42:40
【Question】:

I have some JavaScript on a page that looks like this:

new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",

I want to get "185310341". I searched Google for several hours but could not find anything, so I hope you can help. How can I scrape that JavaScript and get that id?

I tried this code:

id = sel.search('"id":(.*?),',text).group(1)
print id

But I got:

exceptions.AttributeError: 'Selector' object has no attribute 'search'

【Question comments】:

    Tags: python regex web-scraping scrapy web-crawler


    【Solution 1】:

    Scrapy selectors have built-in support for regular expressions:

    sel.xpath('<xpath_to_find_the_element_text>').re(r'"id":(\d+)')
    

    A demo showing that this particular regular expression works:

    >>> import re
    >>> s = 'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design | Siyah \u0026 beyaz kalpli",'
    >>> re.search(r'"id":(\d+)', s).group(1)  # raw string keeps \d intact for the regex engine
    '185310341' 
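
The same pattern can be wrapped in a small helper that also copes with text containing no match, where calling `.group(1)` directly on the result of `re.search` would raise an `AttributeError`. This is only a sketch using the standard library; `extract_product_id` and `snippet` are names made up for illustration, and it assumes the product id is the first `"id"` in the text (the variant ids come later in the same object).

```python
import re

# Raw string so that \d reaches the regex engine unchanged.
ID_PATTERN = re.compile(r'"id":(\d+)')


def extract_product_id(text):
    """Return the first "id" value found in text as an int, or None if absent."""
    match = ID_PATTERN.search(text)
    return int(match.group(1)) if match else None


# Shortened copy of the JavaScript fragment from the question.
snippet = 'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,"title":"10. Design",'

print(extract_product_id(snippet))       # prints 185310341
print(extract_product_id("no id here"))  # prints None
```

Inside a spider, the string passed in would be the script text taken from the response rather than a literal.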
    

    【Comments】:

    • For the script on my page, would I use id = sel.xpath('//body').re(r'"id":(\d+)')? Is that right?
    • I get that script from the body element. I used: id = re.search('"id":(\d+)', sel.xpath("//body/text()").extract()).group(1) but it gives an error.
    • @MuhammetArslan As I pointed out in the answer, use sel.xpath('//body/text()').re(r'"id":(\d+)')
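
The error mentioned in the second comment is most likely a `TypeError`: `.extract()` returns a list of strings, while `re.search` expects a single string. The sketch below reproduces this with a plain list standing in for the extracted text nodes, so it runs without Scrapy. `text_nodes` and `first_id` are hypothetical names, and the list contents are made-up sample data based on the fragment in the question.

```python
import re

ID_PATTERN = re.compile(r'"id":(\d+)')

# Hypothetical stand-in for sel.xpath('//body/text()').extract(),
# which returns a list of strings rather than one string.
text_nodes = [
    "\n  ",
    'new Shopify.OptionSelectors("product-select", { product: {"id":185310341,',
]


def first_id(fragments):
    """Search each fragment in turn and return the first captured id, or None."""
    for fragment in fragments:
        match = ID_PATTERN.search(fragment)
        if match:
            return match.group(1)
    return None


try:
    ID_PATTERN.search(text_nodes)  # passing the list itself fails
except TypeError as exc:
    print("TypeError:", exc)

print(first_id(text_nodes))  # prints 185310341
```

The selector's own `.re()` method, as recommended in the answer, applies the pattern to each selected node, which is why it avoids this problem.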
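    【Solution 2】:

    An alternative to the regular-expression approach is to use a JavaScript parser, convert the parser's output into an XML document, and then query it with XPath.

    This is what js2xml implements, using slimit and lxml (disclaimer: I wrote js2xml; warning: it is unstable).

    In your case, see this sample scrapy shell session, which uses js2xml.jsonlike.getall():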

    paul:~$ scrapy shell http://2loom.com/products/2loom-design-siyah-beyaz-kalpli
    2014-05-19 16:12:00+0200 [scrapy] INFO: Scrapy 0.23.0 started (bot: scrapybot)
    2014-05-19 16:12:00+0200 [scrapy] INFO: Optional features available: ssl, http11
    2014-05-19 16:12:00+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-05-19 16:12:00+0200 [scrapy] INFO: Enabled item pipelines: 
    2014-05-19 16:12:00+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
    2014-05-19 16:12:00+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
    2014-05-19 16:12:00+0200 [default] INFO: Spider opened
    2014-05-19 16:12:01+0200 [default] DEBUG: Crawled (200) <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0x7f8552946610>
    [s]   item       {}
    [s]   request    <GET http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
    [s]   response   <200 http://2loom.com/products/2loom-design-siyah-beyaz-kalpli>
    [s]   settings   <CrawlerSettings module=None>
    [s]   spider     <Spider 'default' at 0x7f8552384b90>
    [s] Useful shortcuts:
    [s]   shelp()           Shell help (print this help)
    [s]   fetch(req_or_url) Fetch request (or URL) and update local objects
    [s]   view(response)    View response in a browser
    /usr/local/lib/python2.7/dist-packages/IPython/frontend.py:30: UserWarning: The top-level `frontend` package has been deprecated. All its subpackages have been moved to the top `IPython` level.
      warn("The top-level `frontend` package has been deprecated. "
    
    In [1]: scripts = response.selector.xpath('//script/text()').extract()
    
    In [2]: import js2xml, js2xml.jsonlike
    
    In [3]: js = js2xml.parse(scripts[-1])
    
    In [4]: js2xml.jsonlike.getall(js)
    Out[4]: 
    [{'onVariantSelected': 'selectCallback',
      'product': {'available': True,
       'compare_at_price': None,
       'compare_at_price_max': 0,
       'compare_at_price_min': 0,
       'compare_at_price_varies': False,
       'content': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
       'created_at': '2013-11-29T13:37:11+02:00',
       'description': u'<blockquote>Siyah-beyaz kalpli tulumlarimiz 100% polyester olup kap\u015fonun i\xe7i ve ribanas\u0131 lacivertir. Fermuar\u0131 iki tarafl\u0131 a\xe7\u0131l\u0131r kapan\u0131r olup kap\u015fonun tamam\u0131n\u0131 kapsar ve beyaz renklidir. Tulumlar\u0131n her iki taraf\u0131ndaki cepler\xa0 beyaz fermuarl\u0131 ve elcikler siyaht\u0131r. Ayr\u0131ca kar\u0131n bolgesinde cepler vard\u0131r Tulumlardaki logolar beyazd\u0131r. Kad\u0131nlar ve erkekler i\xe7in tasarlanm\u0131\u015ft\u0131r.</blockquote>',
       'featured_image': '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
       'handle': '2loom-design-siyah-beyaz-kalpli',
       'id': 185310341,
       'images': ['//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_girls.jpg?v=1389259261',
        '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwarte_hartjes_ak_girls.jpg?v=1389259259',
        '//cdn.shopify.com/s/files/1/0305/9953/products/11._Zwarte_hartjes_vk_boys.jpg?v=1389259264',
        '//cdn.shopify.com/s/files/1/0305/9953/products/6._Zwartje_hartjes_ak_boys.jpg?v=1389259264'],
       'options': ['Size'],
       'price': 15900,
       'price_max': 15900,
       'price_min': 15900,
       'price_varies': False,
       'published_at': '2013-11-29T13:34:20+02:00',
       'tags': [u'2\xb7Loom',
        'Beyaz',
        'Design',
        'Ekrek',
        u'Kad\u0131n',
        'Kalpli',
        'Lacivert'],
       'title': '10. Design | Siyah & beyaz kalpli',
       'type': '2 Loom Limiteds',
       'variants': [{'available': True,
         'barcode': None,
         'compare_at_price': None,
         'id': 424584985,
         'inventory_management': 'shopify',
         'inventory_policy': 'deny',
         'inventory_quantity': 3,
         'option1': 'XS (34-36: 1.60m-1.70m)',
         'option2': None,
         'option3': None,
         'options': ['XS (34-36: 1.60m-1.70m)'],
         'price': 15900,
         'requires_shipping': True,
         'sku': 'T01-BLWH-1-XS',
         'taxable': True,
         'title': 'XS (34-36: 1.60m-1.70m)',
         'weight': 0},
        {'available': True,
         'barcode': None,
         'compare_at_price': None,
         'id': 424584989,
         'inventory_management': 'shopify',
         'inventory_policy': 'deny',
         'inventory_quantity': 3,
         'option1': 'S (36-38: 1.65m-1.75m)',
         'option2': None,
         'option3': None,
         'options': ['S (36-38: 1.65m-1.75m)'],
         'price': 15900,
         'requires_shipping': True,
         'sku': 'T01-BLWH-1-S',
         'taxable': True,
         'title': 'S (36-38: 1.65m-1.75m)',
         'weight': 0},
        {'available': True,
         'barcode': None,
         'compare_at_price': None,
         'id': 424584997,
         'inventory_management': 'shopify',
         'inventory_policy': 'deny',
         'inventory_quantity': 7,
         'option1': 'M (38-40: 1.70m-1.80m)',
         'option2': None,
         'option3': None,
         'options': ['M (38-40: 1.70m-1.80m)'],
         'price': 15900,
         'requires_shipping': True,
         'sku': 'T01-BLWH-1-M',
         'taxable': True,
         'title': 'M (38-40: 1.70m-1.80m)',
         'weight': 0},
        {'available': True,
         'barcode': None,
         'compare_at_price': None,
         'id': 424585001,
         'inventory_management': 'shopify',
         'inventory_policy': 'deny',
         'inventory_quantity': 7,
         'option1': 'L (40-42: 1.75m-1.85m)',
         'option2': None,
         'option3': None,
         'options': ['L (40-42: 1.75m-1.85m)'],
         'price': 15900,
         'requires_shipping': True,
         'sku': 'T01-BLWH-1-L',
         'taxable': True,
         'title': 'L (40-42: 1.75m-1.85m)',
         'weight': 0}],
       'vendor': u'2\xb7Loom'}}]
    
    In [5]: 
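
The session stops at the full dump, so here is a short sketch of pulling the requested id out of a result shaped like `Out[4]`. The `data` literal is a hand-trimmed copy of that output (only a few keys are kept) and `product_ids` is a hypothetical helper, so nothing here needs js2xml or Scrapy to run.

```python
# Trimmed, hand-copied subset of the getall() output shown in the session;
# only the keys needed for this illustration are kept.
data = [{
    "onVariantSelected": "selectCallback",
    "product": {
        "id": 185310341,
        "title": "10. Design | Siyah & beyaz kalpli",
        "variants": [
            {"id": 424584985, "sku": "T01-BLWH-1-XS"},
            {"id": 424584989, "sku": "T01-BLWH-1-S"},
            {"id": 424584997, "sku": "T01-BLWH-1-M"},
            {"id": 424585001, "sku": "T01-BLWH-1-L"},
        ],
    },
}]


def product_ids(objects):
    """Return (product id, variant ids) from the first dict that has a 'product' key."""
    for obj in objects:
        product = obj.get("product") if isinstance(obj, dict) else None
        if product:
            return product["id"], [v["id"] for v in product.get("variants", [])]
    return None, []


pid, variant_ids = product_ids(data)
print(pid)          # prints 185310341
print(variant_ids)  # prints [424584985, 424584989, 424584997, 424585001]
```

Compared with the regular expression, this keeps the product id and the variant ids apart, since every variant carries its own `"id"` key.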
    
