【Question title】: Parsing stray text with Scrapy
【Posted】: 2018-02-22 04:58:52
【Question】:

Does anyone know how to extract "TEXT TO GRAB" from this markup?

<span class="navigation_page">
    <span>
        <a itemprop="url" href="http://www.example.com">
            <span itemprop="title">LINK</span>
        </a>
    </span>
    <span class="navigation-pipe">&gt;</span>
    TEXT TO GRAB
</span>

【Comments】:

  • Try response.css('span.navigation_page::text').extract_first()

Tags: python web-scraping scrapy scrapy-spider


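【Solution 1】:

Not ideal, but it works (the returned string keeps its surrounding whitespace, so strip it before use):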

text_to_grab = response.xpath('//span[@class="navigation-pipe"]/following-sibling::text()[1]').extract_first()

【Comments】:

    【Solution 2】:

    This is not an ideal solution, but it should do the job:

    from scrapy import Selector
    
    content = """
    <span class="navigation_page">
        <span>
            <a itemprop="url" href="http://www.example.com">
                <span itemprop="title">LINK</span>
            </a>
        </span>
        <span class="navigation-pipe">&gt;</span>
        TEXT TO GRAB
    </span>
    """
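    sel = Selector(text=content)
    # ::text selects only the direct text nodes of the outer span
    item = sel.css(".navigation_page::text")
    # The wanted text is the last of those nodes; strip the surrounding whitespace
    print(item.extract()[-1].strip())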
    

    Or like this:

    sel = Selector(text=content)
    # Collapse whitespace inside each text node, then join; whitespace-only nodes become ''
    item = ''.join(' '.join(text.split()) for text in sel.css("span.navigation_page::text").extract())
    print(item)
    

    Output:

    TEXT TO GRAB
    

    【Comments】:
