Ruby, Nokogiri: remove <ul> element selected by <class="foo">

【Question title】: Ruby, Nokogiri Remove <ul> element selected by <class="foo">
【Posted】: 2020-10-26 01:57:28
【Question】:

I want to remove a Nokogiri node, but I can't figure out how.

I have HTML like this:

<div class="metis manual-toogle" id="tocList">...
  <li id="tocElement-ebook_cs_1111111_11">...
    <a data-content href="url" class=" "></a> <!-- only this urls I want -->
      <ul class="foo">
        <!-- the following content and urls I want to remove -->
        <li class id="tocElement-ebook_cs_1111111_cs12">
          <a data-content href="url" class=" "></a>
          ...
          <a data-content href="url" class=" "></a>
        </li>
      </ul>
  </li>
</div>

So far I have tried:

document = Nokogiri::HTML.parse(html_input)
document.xpath('//ul[@class="foo"]').each {|x| x.remove}

document.xpath('//ul[@class="foo"]').children.map(:&remove)

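What am I doing wrong?

Edit:

I want to parse some URLs. I have the HTML structure shown above. The URLs I want are in the <li></li> blocks, nested like <a data-content href="url"></a>. The problem is that the <ul></ul> also contains <a data-content href="url"></a> elements. I can extract every URL, but I only need the main ones.

It is a book with several chapters, and I can download a whole chapter from its first link. Each subchapter (inside the <ul>) has its own PDF.

I can't use a regular expression, because the links are not built consistently. For example, in one book:

Chapter 1 PDF: ...-ch1.pdf (contains all subchapters)
- Chapter 1-1 PDF: ...-ch1-1.pdf
Chapter 2 PDF: ...-923df2.pdf
Chapter 3 PDF: ...-ch3.pdf

The HTML is a mess. The easiest approach would be to remove the <ul> block itself.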

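【Comments】:

I don't think you can edit HTML with Nokogiri; it is a parsing gem. But I'm not sure. Does your code raise any errors?
"I don't get it" is not a precise enough error description for us to be able to help you. What doesn't work? How doesn't it work? What trouble do you have with your code? Do you get an error message? What is the error message? Is the result you are getting not the result you are expecting? What result do you expect and why, what is the result you are getting, and how do the two differ? Is the behavior you are observing not the desired behavior? What is the desired behavior and why, what is the observed behavior, and in what way do they differ?

Tags: html ruby rubygems nokogiri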

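【Solution 1】:

You haven't provided much context or detail here. However, as long as you select it correctly, the code below should remove the items you want. Please provide more details, such as the output you get and the output you expect.

Given the limited information, you can try this:

Update: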

html.html

<div class="metis manual-toogle" id="tocList">...
  <li id="tocElement-ebook_cs_1111111_11">...
    <a data-content href="url" class=" "></a> <!-- only this urls I want -->
      <ul class="foo">
        <!-- the following content and urls I want to remove -->
        <li class id="tocElement-ebook_cs_1111111_cs12">
          <a data-content href="url" class=" "></a>
          ...
          <a data-content href="url" class=" "></a>
        </li>
      </ul>
  </li>
</div>

main.rb

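require 'nokogiri'

# Parse the local HTML file.
doc = Nokogiri::HTML(File.read('html.html'))

# Remove every <ul class="foo"> together with everything inside it.
doc.xpath('//ul[@class="foo"]').remove

# Print the links that are left in the document.
doc.xpath('//a').each do |item|
  puts item
end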

Output:

~/code/projects/test ⌚ 8:28:32
$ ruby main.rb    ‹2.6.1›
<a data-content href="urliwant" class=" "></a>

We worked this out in chat. The example above works. However, for his specific case, because the HTML is so messy, we needed to do this:

document = Nokogiri::HTML(open('html.html'))

document.css('//ul//ul//ul').remove
document.css('ul .collapse').remove

links = document.xpath('//*[@id="toc"]//ul')

File.open("input.html", "a") do |output_txt|
  links.each do |item|
    output_txt.write(item)
  end
end

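【Comments】:

Unfortunately, this does not work. syntax error, unexpected tIDENTIFIER, expecting ')' ul.children.map(:&remove)
@Patrick You updated the OP. Give me a few minutes and I'll update this answer.
Thanks for your help, I appreciate it. But the code does not remove the ul blocks with the class. If I put puts item inside the .each loop, there is no output.
@Patrick To get the links inside the ul with class foo, you can do doc.xpath('//ul[@class="foo"]//a')
@Patrick Updated again for you with information about getting the links inside the UL.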