XPath Query to Find All Untagged Text Throughout HTML Document

【Question Title】: XPath Query to Find All Untagged Text Throughout HTML Document
【Posted】: 2016-05-14 07:26:05
【Question】:

Given the following HTML, is there an XPath query that extracts all tagged and untagged text between the two <h2> tags? (I am using the RSelenium package in RStudio.)

<html>
    <h2 id="section1" class="article">Heading 1</h2>
    <h3 id="section1.1" class="article">Subheading 1</h3>
    <p id="para001"  class="article section clear">
           Paragraph text 1.</p> 
    <div id="formula1" class="formula">...<img />...</div>
           Untagged text 1.
    <sub>  Subscripted text. </sub>
           Untagged text 2. 
    <em>   Emphasized text. </em>
           Untagged text 3.
    <span id="bib"> Bibliography text. </span>
           Untagged text 4.
    <p id="para002" class="article section clear">
           Paragraph text 2.</p>
    <h3 id="section1.2" class="article">Subheading 2</h3>
    <p id="para003" class="article section clear">
           Paragraph text 3.</p>
    <h3 id="section1.3" class="article">Subheading 3</h3>
    <p id="para004" class="article section clear">
           Paragraph text 4.</p>
    <h2 id="section2" class="article">Heading 2</h2>       
</html>

I am trying to come up with a query that returns:

Paragraph text 1.
Untagged text 1.
Subscripted text.
Untagged text 2. 
Emphasized text.
Untagged text 3.
Bibliography text.
Untagged text 4.
Paragraph text 2.
Paragraph text 3.
Paragraph text 4.

Here is what I have tried so far:

//p[preceding-sibling::h2[@id='section1'] 
    and following-sibling::h2[@id='section2'] 
    and descendant::node()]

which returns:

Paragraph text 1.
Paragraph text 2.
Paragraph text 3.
Paragraph text 4.

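I tried using the solution from this question, but my problem is a bit more complicated. I also tried adding following-sibling::text()[1], but it does not extract the untagged text. If there is no good XPath solution, I would welcome an alternative approach such as CSS selectors.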
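
To see why a query restricted to p elements cannot reach the untagged text, it may help to inspect how a parser stores that text. The following is a minimal sketch and not part of the original attempt: it uses Python's standard-library xml.etree.ElementTree on a trimmed, hypothetical copy of the markup, and it assumes the markup is well-formed XML (which the example above happens to be). ElementTree exposes the text that follows an element as that element's .tail; in the XPath data model this corresponds to a sibling text node, not to anything inside a p.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, well-formed version of the markup from the question
# (hypothetical abridgement, for illustration only).
SNIPPET = """<html>
<h2 id="section1">Heading 1</h2>
<p id="para001">Paragraph text 1.</p>
<div id="formula1">...</div>
Untagged text 1.
<sub>Subscripted text.</sub>
Untagged text 2.
<h2 id="section2">Heading 2</h2>
</html>"""

root = ET.fromstring(SNIPPET)

# For every direct child of <html>, record the text inside the element
# and the text that trails it up to the next sibling element.
rows = [(child.tag, (child.text or "").strip(), (child.tail or "").strip())
        for child in root]

for tag, text, tail in rows:
    print(f"{tag:4} text={text!r:22} tail={tail!r}")
```

In this sketch the untagged strings show up only in the tail column (after the div and the sub), never as text of a p, which is consistent with the p-only query returning just the paragraphs.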

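【Question Discussion】:

There are some obvious typos in your example (the p tag on line 4, the id attribute of the last h2 tag, and your XPath referring to h3 instead of h2). You should fix these so that the code can be run as is.
Thanks for pointing that out. I have made those corrections.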

Tags: xml xpath css-selectors

【Solution 1】:

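Well, first, you don't want to filter only p tags (which is what the p as the third character of your query does); you want all tags after section 1 and before section 2. Second, you are looking for the descendants of all the tags between those two that are text nodes.

So: find all tags that have preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']: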

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]

Then look for all text() nodes beneath any of them:

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]//text()
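
Since the question also invites alternatives to XPath, here is a minimal sketch in Python using only the standard library's xml.etree.ElementTree. It is an illustration under stated assumptions, not the answer's method: it assumes the page source is available as a well-formed string (the sample markup parses as XML, real-world HTML often will not), and the helper name text_between and its skip_tags parameter are hypothetical. One thing worth checking in the sample markup is that the untagged text sits directly under <html>, beside the elements and not inside them, so descending into the elements alone may not reach it. The sketch therefore collects both the text inside each sibling and the text trailing it, which ElementTree exposes as .tail.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the markup from the question (sections 1.2 and 1.3 omitted).
PAGE = """<html>
    <h2 id="section1" class="article">Heading 1</h2>
    <h3 id="section1.1" class="article">Subheading 1</h3>
    <p id="para001" class="article section clear">
           Paragraph text 1.</p>
    <div id="formula1" class="formula">...<img />...</div>
           Untagged text 1.
    <sub>  Subscripted text. </sub>
           Untagged text 2.
    <em>   Emphasized text. </em>
           Untagged text 3.
    <span id="bib"> Bibliography text. </span>
           Untagged text 4.
    <p id="para002" class="article section clear">
           Paragraph text 2.</p>
    <h2 id="section2" class="article">Heading 2</h2>
</html>"""


def text_between(root, start_id, end_id, skip_tags=("h3", "div")):
    """Collect text between two sibling h2 elements, untagged text included.

    skip_tags is a hypothetical knob: elements listed there contribute no
    text of their own, but the untagged text that follows them is kept.
    The default is an assumption chosen to mirror the desired output in
    the question, which omits subheadings and the formula block.
    """
    pieces = []
    inside = False
    for child in root:
        if child.tag == "h2" and child.get("id") == end_id:
            break
        if inside and child.tag not in skip_tags:
            pieces.extend(child.itertext())   # text inside the element
        if child.tag == "h2" and child.get("id") == start_id:
            inside = True
        if inside and child.tail:
            pieces.append(child.tail)         # untagged text after the element
    return [p.strip() for p in pieces if p.strip()]


result = text_between(ET.fromstring(PAGE), "section1", "section2")
print("\n".join(result))
```

Passing skip_tags=() would keep the subheading and formula text as well. The walk only looks at direct children of the root, which matches the flat layout of the sample; a more deeply nested page would need a different traversal.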

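【Discussion】:

Thank you for your answer. Your XPath is correct, but the RSelenium package returns an error stating "Argument was an invalid selector (e.g. XPath/CSS)." I need to spend some time figuring out why it misinterprets the XPath query.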