How to search within a nodeset and delete a node from that same nodeset: answers

【Question title】: How to search within a nodeset and delete a node from that same nodeset
【Posted】: 2016-09-30 07:50:48
【Question】:

I have the following XML:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <w:document mc:Ignorable="w14 w15 wp14" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:mo="http://schemas.microsoft.com/office/mac/office/2008/main" xmlns:mv="urn:schemas-microsoft-com:mac:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
        <w:body>
            <w:p w14:paraId="56037BEC" w14:textId="1188FA30" w:rsidR="001665B3" w:rsidRDefault="008B4AC6">
                <w:r>
                    <w:t xml:space="preserve">This is the story of a man who </w:t>
                </w:r>
                <w:ins w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="0">
                    <w:r w:rsidR="003566BF">
                        <w:t>went</w:t>
                    </w:r>
                </w:ins>
                <w:del w:author="Mitchell Gould" w:date="2016-09-28T09:15:00Z" w:id="1">
                    <w:r w:rsidDel="003566BF">
                        <w:delText>goes</w:delText>
                    </w:r>
                </w:del>
...

I parse the XML with Nokogiri as follows:

zip = Zip::File.open("test.docx")
doc = zip.find_entry("word/document.xml")
file = Nokogiri::XML.parse(doc.get_input_stream)

I have a "deletions" node set that contains all of the w:del elements:

@deletions = file.xpath("//w:del")

I search this node set to see whether a particular element exists, like this:

 my_node_set = @deletions.search("//w:del[@w:id='1']" && "//w:del/w:r[@w:rsidDel='003566BF']")

If it exists, I want to delete it from the deletions node set. I do that with:

deletions.delete(my_node_set.first)

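This seems to work: no error is raised, and the deleted node is displayed in the terminal.

However, when I check my @deletions node set, the element still appears to be there: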

@deletions.search("//w:del[@w:id='1']" && "//w:del/w:r[@w:rsidDel='003566BF']")

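I'm just getting to grips with Nokogiri, so I'm evidently not searching for the element within my @deletions node set correctly and am searching the whole document instead.

How do I search for an element within the @deletions node set and then delete it from that node set?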

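【Comments】:

Please read "minimal reproducible example". We need a syntactically correct XML sample that is the bare minimum needed to demonstrate the problem. I'd also suggest removing the namespaces, since they aren't germane to the question.
It's not clear why you want to selectively delete from a NodeSet. A NodeSet is like an array of pointers to nodes in the document. Remove one of those nodes and what you have really done is remove that particular branch from the tree; in other words, you are removing that tag from the document. If you are collecting a bunch of nodes and then only want to delete one, search for just that one in the first place and remove it. Don't waste time and memory gathering a NodeSet.

Tags: ruby xml nokogiri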

【Solution 1】:

Consider this:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo"><p>foo</p></div>
    <div id="bar"><p>bar</p></div>
  </body>
</html>
EOT

divs holds the div tags, and is a NodeSet:

divs = doc.css('div')
divs.class  # => Nokogiri::XML::NodeSet

and contains:

divs.to_html # => "<div id=\"foo\"><p>foo</p></div><div id=\"bar\"><p>bar</p></div>"

You can search the NodeSet with at to find the first match:

divs.at('#foo').to_html # => "<div id=\"foo\"><p>foo</p></div>"

You can just as easily remove that first match:

divs.at('#foo').remove

which removes it from the document itself:

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     
# >>     <div id="bar"><p>bar</p></div>
# >>   </body>
# >> </html>

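Calling remove does not take the node out of the NodeSet, but we don't care about that: the NodeSet is just a list of pointers to nodes in the document itself, used as a list of what to delete.

If you want an up-to-date NodeSet after removing some nodes, rescan the document and rebuild the NodeSet: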

divs = doc.css('div')
divs.to_html # => "<div id=\"bar\"><p>bar</p></div>"

If your goal is to remove every node in the NodeSet, rather than searching that list, you can simply use:

divs.remove
puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     
# >>     
# >>   </body>
# >> </html>

When I delete nodes I don't collect an intermediate NodeSet; instead I do it on the fly, using something like:

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo"><p>foo</p></div>
    <div id="bar"><p>bar</p></div>
  </body>
</html>
EOT

doc.at('div#bar p').remove

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     <div id="foo"><p>foo</p></div>
# >>     <div id="bar"></div>
# >>   </body>
# >> </html>

which removes the <p> tag embedded in #bar. By relaxing the selector and changing at to search I can remove them all:

doc.search('div p').remove

puts doc.to_html

# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html>
# >>   <body>
# >>     <div id="foo"></div>
# >>     <div id="bar"></div>
# >>   </body>
# >> </html>

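If you insist on walking through a NodeSet, remember that they behave like arrays and you can treat them as such. Here's an example that uses reject to skip a particular node: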

doc = Nokogiri::HTML(<<EOT)
<html>
  <body>
    <div id="foo"><p>foo</p></div>
    <div id="bar"><p>bar</p></div>
  </body>
</html>
EOT

divs = doc.search('div').reject{ |d| d['id'] == 'foo' }
divs.map(&:to_html) # => ["<div id=\"bar\"><p>bar</p></div>"]

You don't get a NodeSet back, you get an Array:

divs.class # => Array

While you can do that, it's better to use a specific selector to reduce the set than to rely on Ruby to select or reject elements.

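【Comments】:

I really appreciate your explanation and guidance. I mistakenly thought a NodeSet was like a separate array that I could delete items from without affecting the document. I have a much better understanding now.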