[Question title]: Nokogiri & returning all data between two tags
[Posted]: 2021-10-26 02:41:18
[Question]:

I'm working on a project that scrapes items from https://platinumgod.co.uk/, and I'm having trouble accessing all of the <p> tags between two elements.

Here is the HTML:

<li class="textbox" data-tid="42.5" data-cid="42" data-sid="263" style="display: inline-block;">
    <a>
        <div onclick="" class="item reb-itm-new re-itm263"></div>
        <span>
            <p class="item-title">Clear Rune</p>
            <p class="r-itemid">ItemID: 263</p>
            <p class="pickup">"Rune mimic"</p>
            <p class="quality">Quality: 2</p>
            <p>When used, copies the effect of the Rune or Soul stone you are holding (like the Blank Card)</p>
            <p>Drops a random rune on the floor when picked up</p>
            <p>The recharge time of this item depends on the Rune/Soul Stone held:</p>
            <p>1 room: Soul of Lazarus</p>
            <p>2 rooms: Rune of Ansuz, Rune of Berkano, Rune of Hagalaz, Soul of Cain</p>
            <p>3 rooms: Rune of Algiz, Blank Rune, Soul of Magdalene, Soul of Judas, Soul of ???, Soul of the Lost</p>
            <p>4 rooms: Rune of Ehwaz, Rune of Perthro, Black Rune, Soul of Isaac, Soul of Eve, Soul of Eden, Soul of the Forgotten, Soul of Jacob and Esau</p>
            <p>6 rooms: Rune of Dagaz, Soul of Samson, Soul of Azazel, Soul of Apollyon, Soul of Bethany</p>
            <p>12 rooms: Rune of Jera, Soul of Lilith, Soul of the Keeper</p>
            <ul>
                <p>Type: Active</p>
                <p>Recharge time: Varies</p>
                <p>Item Pool: Secret Room, Crane Game</p>
            </ul>
            <p class="tags">* Secret Room</p>
        </span>
    </a>
</li>

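What I'm trying to do is return all of the <p> tags between <p class="quality"> (excluding that tag itself) and the first <ul>.

I've tried several solutions I found on the forums and have had only partial success with the following code, which I found in one of the answers (not going to lie, I'm having a hard time understanding what is going on here). The reason I'm iterating is that there are several items in the HTML that need to be scraped: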

items = html.at(".repentanceitems-container").css("li.textbox").each do |item|
  use = item.xpath(".//a/span/p[5]/following-sibling::p[count(.//a/span/p[6]/preceding-sibling::p)= 
        count(.//a/span/p[6]/preceding-sibling::p)]")
end

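However, this only returns the first <p> tag after <p class="quality">. I'm sure it's probably something simple, since I don't understand the code. I also have access to the first <p> element I want to include and the <ul> where it needs to end, but I'm not sure how to use this information: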

# First line of item use
start = item.xpath('.//a/span/p[5]')
# ul tag
ending = item.xpath('.//a/span/ul[1]')

Any help with this would be greatly appreciated!

[Question comments]:

    Tags: ruby web-scraping nokogiri


    [Answer 1]:

    How about:

    require "nokogiri"
    
    html = '<li class="textbox" data-tid="42.5" data-cid="42" data-sid="263" style="display: inline-block;"> <a> <div onclick="" class="item reb-itm-new re-itm263"></div> <span> <p class="item-title">Clear Rune</p> <p class="r-itemid">ItemID: 263</p> <p class="pickup">"Rune mimic"</p> <p class="quality">Quality: 2</p> <p>When used, copies the effect of the Rune or Soul stone you are holding (like the Blank Card)</p> <p>Drops a random rune on the floor when picked up</p> <p>The recharge time of this item depends on the Rune/Soul Stone held:</p> <p>1 room: Soul of Lazarus</p> <p>2 rooms: Rune of Ansuz, Rune of Berkano, Rune of Hagalaz, Soul of Cain</p> <p>3 rooms: Rune of Algiz, Blank Rune, Soul of Magdalene, Soul of Judas, Soul of ???, Soul of the Lost</p> <p>4 rooms: Rune of Ehwaz, Rune of Perthro, Black Rune, Soul of Isaac, Soul of Eve, Soul of Eden, Soul of the Forgotten, Soul of Jacob and Esau</p> <p>6 rooms: Rune of Dagaz, Soul of Samson, Soul of Azazel, Soul of Apollyon, Soul of Bethany</p> <p>12 rooms: Rune of Jera, Soul of Lilith, Soul of the Keeper</p> <ul> <p>Type: Active</p> <p>Recharge time: Varies</p> <p>Item Pool: Secret Room, Crane Game</p> </ul> <p class="tags">* Secret Room</p> </span> </a> </li>'
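    puts Nokogiri::HTML(html).css(".quality ~ p:not(.tags)").map { |e| e.text }
    

    The ~ combinator selects only the siblings that follow .quality, not .quality itself, so the result already starts at the first description paragraph. I'm assuming .tags is the only other class that needs to be left out after .quality; if there are others, you'd need to :not them as well, or detect and skip them manually in an .each loop, unless someone knows a cleverer trick.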

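    [Comments]:

    • Just ran it and it works great, exactly what I needed! I hadn't even thought of simply excluding p.tags and using map. Thanks for all your help!
    [Answer 2]:

    You may want to look at this draft tutorial for nokogiri.org, which explains a few ways to do this.

    Taking the third (and most general) approach, here is some code that does what you need: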

    require "nokogiri"

    class CSSSection
      def self.item_section(item)
        document = item.document
        start_tag = item.at_css("p.quality")
        end_tag = item.at_css("ul")
    
        # grab siblings that follow the start tag
        following_siblings_query = "#{start_tag.path}/following-sibling::*"
        following_siblings = document.xpath(following_siblings_query)
    
        # grab siblings that precede the end tag
        preceding_siblings_query = "#{end_tag.path}/preceding-sibling::*"
        preceding_siblings = document.xpath(preceding_siblings_query)
    
        following_siblings & preceding_siblings # xpath intersection
      end
    end
    
    # `html` holds the markup shown in the question
    doc = Nokogiri::HTML4(html)
    li_nodes = doc.css("li") # whatever the query is to get the relevant "li" elements
    
    data = li_nodes.map do |li_node|
      CSSSection.item_section(li_node)
    end
    
    puts data.first
    # => <p>When used, copies the effect of the Rune or Soul stone you are holding (like the Blank Card)</p>
    #    <p>Drops a random rune on the floor when picked up</p>
    #    <p>The recharge time of this item depends on the Rune/Soul Stone held:</p>
    #    <p>1 room: Soul of Lazarus</p>
    #    <p>2 rooms: Rune of Ansuz, Rune of Berkano, Rune of Hagalaz, Soul of Cain</p>
    #    <p>3 rooms: Rune of Algiz, Blank Rune, Soul of Magdalene, Soul of Judas, Soul of ???, Soul of the Lost</p>
    #    <p>4 rooms: Rune of Ehwaz, Rune of Perthro, Black Rune, Soul of Isaac, Soul of Eve, Soul of Eden, Soul of the Forgotten, Soul of Jacob and Esau</p>
    #    <p>6 rooms: Rune of Dagaz, Soul of Samson, Soul of Azazel, Soul of Apollyon, Soul of Bethany</p>
    #    <p>12 rooms: Rune of Jera, Soul of Lilith, Soul of the Keeper</p>
    
