【Question title】: Extract Outer Text with Scrapy
【Posted】: 2016-02-24 14:01:38
【Question】:

I need to parse the following snippets:

<span>    Lekhwiya&nbsp;v&nbsp;<strong class="winner-strong">Zobahan</strong></span>

<span>    <strong class="winner-strong">Sepahan</strong>&nbsp;v&nbsp;Al&nbsp;Nasr&nbsp;(UAE)</span>

correctly, as "Lekhwiya v Zobahan" and "Sepahan v Al Nasr (UAE)".

I tried parsing it like this:

team_1 = block.xpath('.//span/text()').extract()[:2]
team_1 = team_1[0].strip() + team_1[1].strip() 
team_2 = block.xpath('.//span/strong/text()').extract()[0]

item['match'] = team_2.strip() + ' ' + team_1 if team_1[0] == 'v' else team_1 + ' ' + team_2.strip()

This looks like an ugly solution to me. What is the best way to do it?

【Discussion】:

    Tags: python python-2.7 parsing xpath scrapy


    【Solution 1】:

    You can use XPath's string() function, or even normalize-space():

    In [1]: text = '''
       ...: <span>    Lekhwiya&nbsp;v&nbsp;<strong class="winner-strong">Zobahan</strong></span>
       ...: <span>    <strong class="winner-strong">Sepahan</strong>&nbsp;v&nbsp;Al&nbsp;Nasr&nbsp;(UAE)</span>
       ...: '''
    
    In [2]: import scrapy
    
    In [3]: selector = scrapy.Selector(text=text, type="html")
    
    In [4]: for span in selector.xpath('//span'):
       ...:     print(span.xpath('string(.)').extract_first())
       ...:     
        Lekhwiya v Zobahan
        Sepahan v Al Nasr (UAE)
    
    In [5]: for span in selector.xpath('//span'):
       ...:     print(span.xpath('normalize-space(.)').extract_first())
       ...:     
    Lekhwiya v Zobahan
    Sepahan v Al Nasr (UAE)
    
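One caveat about the output above: each `&nbsp;` in the markup decodes to U+00A0 (a non-breaking space), and XPath 1.0's normalize-space() only collapses space, tab, carriage return and line feed. The extracted strings therefore probably still contain U+00A0 between the words, even though they print like ordinary spaces. A minimal sketch of a post-processing step follows. The helper name `clean_text` and the sample strings are hypothetical, and the samples assume that string(.) returned the span text with the entities already decoded.

```python
# -*- coding: utf-8 -*-

def clean_text(raw):
    """Collapse every whitespace run, including U+00A0, into one plain space."""
    # str.split() with no arguments splits on any Unicode whitespace
    # (U+00A0 included) and discards leading and trailing runs.
    return u" ".join(raw.split())


# Hypothetical samples: what string(.) is assumed to return for the two
# <span> elements once each &nbsp; has been decoded to U+00A0.
samples = [
    u"    Lekhwiya\xa0v\xa0Zobahan",
    u"    Sepahan\xa0v\xa0Al\xa0Nasr\xa0(UAE)",
]

for raw in samples:
    # In a spider this would be, for example:
    #   clean_text(span.xpath('string(.)').extract_first())
    print(repr(clean_text(raw)))
```

With this step the result compares equal to a string typed with ordinary spaces, which matters if the match name is later used as a lookup key or compared against other data.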

    【Discussion】:
