XPath: handling double-quoted text mixed with other tags (answers)

【Question title】: xpath handle double quotes with some other tags
【Posted】: 2021-08-20 02:06:53
【Question】:

I have this HTML sample:

<html>
<body>
  ....
  <p id="book-1" class="abc">
    <b>
      <a href="xxx.html">book-1</a>
      <a href="xxx.html">section</a>
    </b>
       "I have a lot of "
        <i>different</i> 
       "text, and I want "
       <i>all</i>
       " text and we may or may not have italic surrounded text."
  </p>
  ....

The XPath I currently have is:

@"/html[1]/body[1]/p[1]/text()"

This gives the following result:

I have a lot of

But I want this result:

I have a lot of different text, and I want all text and we may or may not have italic surrounded text.

Thanks for your help.

【Comments on the question】:

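I don't think you can do this with XPath alone, since you can't select a node without its children. To do what you want, you'll have to use the host language, for example BeautifulSoup in Python.
If I really can't find a way, I'll have to extract the p element and then do some regex processing in my own code. Thanks.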

Tags: xpath

【Solution 1】:

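I think in XPath 2 and later you can use string-join(/html[1]/body[1]/p[1]/b/following-sibling::node(), ''). It isn't entirely clear which nodes you want, but this selects all sibling nodes that follow the b child of p and then joins their string values into a single string.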
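
If no XPath 2.0 processor is available, the same idea can be approximated in the host language. The following is only a minimal sketch in Python using the standard library's xml.etree.ElementTree (which does not support XPath 2.0), not the answerer's own code. It assumes the markup is well-formed, that the double quotes shown in the question are just how a browser inspector displays text nodes rather than real characters, and that collapsing whitespace is acceptable. The names SAMPLE and text_after_first_b are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed fragment modeled on the question's <p>. Assumption: the double
# quotes in the question are inspector rendering, so they are omitted here.
SAMPLE = (
    '<p id="book-1" class="abc">'
    '<b><a href="xxx.html">book-1</a><a href="xxx.html">section</a></b>'
    'I have a lot of <i>different</i> text, and I want <i>all</i>'
    ' text and we may or may not have italic surrounded text.'
    '</p>'
)


def text_after_first_b(p):
    """Join the string values of every node that follows p's first <b> child."""
    children = list(p)
    b_index = next(i for i, child in enumerate(children) if child.tag == "b")
    # ElementTree keeps the text that follows an element in that element's .tail.
    parts = [children[b_index].tail or ""]
    for sibling in children[b_index + 1:]:
        parts.append("".join(sibling.itertext()))  # text inside e.g. <i>
        parts.append(sibling.tail or "")           # text right after it
    # Collapse whitespace runs, roughly what XPath's normalize-space() does.
    return " ".join("".join(parts).split())


p = ET.fromstring(SAMPLE)
result = text_after_first_b(p)
print(result)
```

Real-world HTML is often not well-formed XML, in which case an HTML-aware parser would be needed before a walk like this one.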

【Comments】:

The text after b contains no other nodes apart from the occasional italic node, so all of this text sits directly under the p node.