Direct text contents via XPath?

【Question title】: Direct text contents via XPath?
【Posted】: 2016-12-10 12:04:18
【Question】:

//*/text()[string-length() > 100]

...almost works, except that it also selects script and style tags in the HTML document, and the text selection stops when it hits a <br> or another tag.

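I want to find elements that directly contain text, where the text is longer than 140 characters, and the whole element's text should be selected (sometimes the text is nested deeper, inside a span).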

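【Comments】:

"the text selection stops when it hits a <br> or another tag" - should the tags also be captured in the text?
Yes, or the content of those tags without the tags.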

Tags: php html xml xpath

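【Solution 1】:

You need to understand the difference between text() nodes and string values in XPath.

text() selects text nodes in XPath. The br elements get in the way because you are selecting inside a parent element with mixed content: text() nodes and elements interleaved.
string() is an XPath function that returns the string value of its argument. To get a string that ignores the br elements, select the parent div and take its string value, either directly via string() or implicitly by using the expression in a context where it is converted to a string.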
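
To make the distinction concrete, here is a minimal Python sketch using only the standard library. The div fragment and the two helper names are made up for illustration, and ElementTree has no text-node objects, so the helpers only approximate what text() and string(.) would return on that element.

```python
import xml.etree.ElementTree as ET

# Hypothetical mixed-content fragment (not from the question).
div = ET.fromstring(
    "<div>First line<br/>second line <span>nested</span> tail</div>"
)

def direct_text_nodes(elem):
    """Approximates text(): text that is a direct child of elem."""
    # ElementTree stores the text before the first child in .text and the
    # text following each child in that child's .tail.
    pieces = [elem.text] + [child.tail for child in elem]
    return [p for p in pieces if p]

def string_value(elem):
    """Approximates string(.): all descendant text, concatenated."""
    return "".join(elem.itertext())

print(direct_text_nodes(div))  # three separate pieces, split at br and span
print(string_value(div))       # one string, including the span's text
```

The first call yields the fragments on either side of the child elements, which is why text() appears to "stop" at a br. The second yields the whole element's text in one piece.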

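With that background, your statement,

I want to find elements that directly contain text, where the text is longer than 140 characters, and the whole element's text should be selected (sometimes the text is nested deeper, inside a span).

can be rephrased as

I want to find elements that have a text() child node and whose string value is longer than 140 characters.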

Let's look at some sample XML,

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

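and let's reduce 140 to 8 to make it more manageable (the expression below tests for more than 7 characters, that is, at least 8). Then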

//*[text()][string-length() > 7]

captures the rephrased requirement and selects four elements:

<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>

<a>This is a <b>test</b> of mixed content.</a>

<c>asdf asdf asdf asdf</c>

<d>asdf asdf</d>

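Note that it does not select b, because the length of its string value ("test", 4 characters) is not greater than 7.

Also note that r is selected because of the whitespace-only text() nodes between its child elements. To eliminate such elements, add an extra predicate to text():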

//*[text()[normalize-space()]][string-length() > 7]

Then only a, c, and d are selected.
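
As a rough cross-check of the two expressions, the following Python sketch emulates them with the standard library. It does not evaluate XPath: the helper names are hypothetical, and ElementTree's own path syntax does not support these predicates, so the logic is written out by hand.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<r>
  <a>This is a <b>test</b> of mixed content.</a>
  <c>asdf asdf asdf asdf</c>
  <d>asdf asdf</d>
</r>"""

def direct_text(elem):
    # Direct text children: .text plus the .tail of each child element.
    return [t for t in [elem.text] + [child.tail for child in elem] if t]

def select(root, require_nonblank, min_exclusive=7):
    """Tags of elements with a direct text child and a long string value."""
    hits = []
    for elem in root.iter():  # document order, root included
        texts = direct_text(elem)
        if require_nonblank:
            # Mirrors text()[normalize-space()]: drop whitespace-only text.
            texts = [t for t in texts if t.strip(" \t\r\n")]
        if texts and len("".join(elem.itertext())) > min_exclusive:
            hits.append(elem.tag)
    return hits

root = ET.fromstring(SAMPLE)
print(select(root, require_nonblank=False))  # like //*[text()][string-length() > 7]
print(select(root, require_nonblank=True))   # with the normalize-space() predicate
```

Without the whitespace filter, r shows up because of the indentation between its children. With the filter, only a, c, and d remain.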

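If you only want the text, in XPath 1.0 you can take the string value of the selection. Note that string() applied to a node-set returns the string value of only the first node in document order: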

string(//*[text()[normalize-space()]][string-length() > 7])

If you want a collection of strings, in XPath 1.0 you need to iterate over the selected elements in the language that calls XPath, but in XPath 2.0 you can add a string() step at the end:

//*[text()[normalize-space()]][string-length() > 7]/string()

to get a sequence of three separate strings:

This is a test of mixed content.
asdf asdf asdf asdf
asdf asdf
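
For the XPath 1.0 route, where the host language does the iteration, a sketch might look like the one below. It is again plain Python over ElementTree rather than a real XPath engine, and the `matches` helper is a hypothetical stand-in for the predicate pair, so treat it as an illustration of the loop, not of any particular library's API.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<r>\n"
    "  <a>This is a <b>test</b> of mixed content.</a>\n"
    "  <c>asdf asdf asdf asdf</c>\n"
    "  <d>asdf asdf</d>\n"
    "</r>"
)

def matches(elem, min_exclusive=7):
    """Stand-in for [text()[normalize-space()]][string-length() > 7]."""
    direct = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip(" \t\r\n") for t in direct)
    return has_text and len("".join(elem.itertext())) > min_exclusive

# One string value per selected element, in document order.
strings = ["".join(e.itertext()) for e in root.iter() if matches(e)]

# string() on a node-set in XPath 1.0 would give only the first of these.
first_only = strings[0] if strings else ""

for s in strings:
    print(s)
```

In PHP the same loop would run over the node list returned by the XPath query and read each node's text content. That part is left out here because it depends on the API in use.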

【Comments】: