【Question title】: Scrape paragraph with href inside
【Posted】: 2019-08-29 07:39:04
【Question】:

This is the HTML:

<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="tco-ellipsis"></span>
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
    <span class="invisible">lpage/events/?ref=page_internal&amp;mt_nav=0&amp;locale2=it_IT</span>
    <span class="tco-ellipsis">
      <span class="invisible">&nbsp;</span>…
    </span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>

If I use tree.xpath('//p/text()'), it only returns

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo

instead of

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium

I also tried tree.xpath('string(//p)'). How can I get both the full paragraph and the href? The paragraph does not always contain an a element.

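【Comments】:

    Tags: xpath web-scraping lxml


    【Solution 1】:

    xpath('//p/text()') returns a list of strings, one for each text node that is a direct child of the paragraph. Join those strings to get the desired result.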

    from lxml import html
    
    doc = """<p class="myParagraph">
      Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
      <a href="http://google.it" class="small-link" target="_blank">
        <span class="tco-ellipsis"></span>
        <span class="invisible">https://</span>
        <span class="js-display-url">google.it</span>
        <span class="invisible">lpage/events/?ref=page_internal&amp;mt_nav=0&amp;locale2=it_IT</span>
        <span class="tco-ellipsis">
          <span class="invisible">&nbsp;</span>…
        </span>
      </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
    </p>"""
    
    # Parse the HTML fragment
    root = html.fromstring(doc)
    # text() yields only the paragraph's direct text nodes (before and after the <a>)
    print("".join(root.xpath("//p/text()")))
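
The join above recovers the paragraph text but not the link. What follows is a minimal sketch of one way to get both, under the assumption that each paragraph holds at most one link of interest. The helper name `scrape_paragraphs`, the wrapping `<div>`, the shortened link markup, and the second, link-free paragraph are illustrative and not part of the original post.

```python
from lxml import html

# Sample markup: the first paragraph has a link, the second does not.
doc = """<div>
<p class="myParagraph">
  Lorem ipsum dolor sit amet, consectetur adipiscing elit. Vivamus vel justo
  <a href="http://google.it" class="small-link" target="_blank">
    <span class="invisible">https://</span>
    <span class="js-display-url">google.it</span>
  </a> ornare, suscipit nisl eget, aliquam augue. Aenean quis pretium
</p>
<p class="myParagraph">No link in this one.</p>
</div>"""


def scrape_paragraphs(markup):
    """Return a list of (text, href) pairs, one per <p>; href is None if absent."""
    root = html.fromstring(markup)
    results = []
    for p in root.xpath("//p"):
        # ./text() selects only the paragraph's own text nodes, so the text
        # inside the link's <span> elements is left out.
        raw = "".join(p.xpath("./text()"))
        # Collapse the newlines and indentation left over from the markup.
        text = " ".join(raw.split())
        # The attribute query returns an empty list when there is no <a>.
        hrefs = p.xpath(".//a/@href")
        results.append((text, hrefs[0] if hrefs else None))
    return results


for text, href in scrape_paragraphs(doc):
    print(href, "|", text)
```

Querying each paragraph separately keeps the text and the href paired, and the empty-list check covers paragraphs without an a element. By contrast, string(//p) concatenates every descendant text node of the first paragraph, so it would also pull in the text of the spans inside the link.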
    

    【Comments】:
