XPath to get a tag with a specific string and all of its following siblings until a tag contains another specific string

【Question title】: XPath to get a tag with a specific string and all of its following siblings until another specific string appears in a tag
【Posted】: 2019-06-09 20:30:21
【Question description】:

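I am very new to XPath. I am trying to extract some information from a website of legal regulations, and for now I just want to:

1. Find the tag that contains the string "Article 1".
2. Starting from the tag found in (1), get it and everything that follows it, until one of the tags contains another string, "PRIME MINISTER", inside a <b> tag.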

<p>
  <b> <span> Article 1. </span> </b> 
  <span> 
     To approve the master plan on development 
     of tourism in Northern Central Vietnam 
     with the following principal contents: 
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b> 
  <span> 
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>

As the expected output, I should get a list like this:

[ 
'Article 1.' , 
  'To approve the master plan on development of tourism in Northern 
   Central Vietnam with the following principal contents: ',
  '1. Development viewpoints' ,
  'To realize general viewpoints of the strategy for and master plan on 
   development of Vietnam’s tourism through 2020.' ,
  'PRIME MINISTER: Nguyen Tan Dung',
  'PRIME MINISTER'
]

The first item in the list is "Article 1." and the last item is the "PRIME MINISTER" that sits inside the <b> tag.

【Question comments】:

Tags: python xpath web-scraping scrapy

【Solution 1】:

"Until" and "between" queries are very difficult in XPath, even in versions later than XPath 1.0.

Working backwards from the later versions, in XPath 3.1 you can do the following:

let $first := p[contains(., 'Article 1')],
    $last := p[contains(., 'PRIME MINISTER')]
return ($first, p[. >> $first and . << $last], $last)

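In XPath 2.0 there is no "let", but "for" works just as well; it only reads a little oddly.

In 1.0, however, (a) we cannot bind variables and (b) we do not have the << and >> operators, which makes this considerably harder.

The simplest way to express it is probably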

p[(.|preceding-sibling::p)[contains(., 'Article 1')] and 
  (.|following-sibling::p)[contains(., 'PRIME MINISTER')]]

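Unfortunately, without a very smart optimizer this can be very inefficient on large input documents: each of the two contains() tests will be executed about (N^2)/2 times, where N is the number of paragraphs. If you are restricted to XPath 1.0, you are probably better off using XPath to find the "start" and "end" nodes and then using the host language to find all the nodes in between.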
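The host-language approach could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree (whose XPath subset has no contains(), so the string tests are done in Python), the sample markup is a shortened copy of the question's HTML wrapped in a root element, and the helper names text_of and paragraphs_between are made up for this example. With lxml or Scrapy the start and end nodes could be located with real XPath instead.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup, wrapped in a root element (assumption).
SAMPLE = """<html>
<p><b><span> Article 1. </span></b><span> To approve the master plan: </span></p>
<p><span> 1. Development viewpoints </span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span><b> PRIME MINISTER </b></span></p>
<p><b><span> Article 2. </span></b><span> ... </span></p>
<p><span> PRIME MINISTER: Nguyen Tan Dung</span></p>
</html>"""


def text_of(elem):
    """Whitespace-normalized string value of an element, similar to normalize-space(.)."""
    return " ".join("".join(elem.itertext()).split())


def paragraphs_between(root, start_text, end_text):
    """Collect sibling <p> elements from the first one containing start_text
    through the first later one that has a <b> descendant containing end_text.
    If no end marker is found, everything after the start is returned."""
    selected = []
    started = False
    for p in root.findall("p"):  # direct children only, so all are siblings
        if not started:
            if start_text not in text_of(p):
                continue
            started = True
        selected.append(p)
        # Stop once this paragraph carries the end marker inside a <b>.
        if any(end_text in text_of(b) for b in p.iter("b")):
            break
    return selected


root = ET.fromstring(SAMPLE)
result = [text_of(p) for p in paragraphs_between(root, "Article 1", "PRIME MINISTER")]
print(result)
```

Each paragraph is visited once, so the cost grows linearly with the number of paragraphs rather than quadratically. Note that this returns one string per <p>; splitting the first paragraph into separate "Article 1." and body strings, as in the question's expected list, would need an extra step over the child elements.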

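【Comments】:

Someone "helpfully" edited the answer to change "p" to "//p". I should perhaps have made it clearer that the solutions (especially the 1.0 solution) assume the "p" elements are all siblings of each other, because they rely on preceding-sibling and following-sibling to navigate the sequence. //p can find elements that are not siblings, so that was not a correct change.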
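To illustrate the difference between the two selections, here is a small Python sketch using the standard-library xml.etree.ElementTree. The nested document is invented for the example; "p" selects only child paragraphs of the context node (which are therefore siblings), while ".//p" also reaches paragraphs nested deeper.

```python
import xml.etree.ElementTree as ET

# Invented document: one <p> is nested inside a <div>, so it is not a
# sibling of the other two paragraphs.
NESTED = """<html>
<p>Article 1.</p>
<div><p>nested paragraph</p></div>
<p>PRIME MINISTER</p>
</html>"""

root = ET.fromstring(NESTED)

# Child paragraphs of the root only: all of these are siblings of each other.
siblings = [p.text for p in root.findall("p")]

# Paragraphs at any depth: includes the nested one, which sibling axes
# starting from the other paragraphs would never reach.
anywhere = [p.text for p in root.findall(".//p")]

print(siblings)
print(anywhere)
```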

【Solution 2】:

This XPath expression:

//p[descendant-or-self::p and (following-sibling::p/descendant::b)]

should get you the expected output, at least on the HTML code you posted.

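【Comments】:

Yes, but since it mentions neither "Article 1" nor "PRIME MINISTER", it depends on knowing exactly what is in the source document, and if you know that in great detail, you already know the answer and do not need XPath to find it.
@MichaelKay - You are right in principle, but first, the code snippet is all the code the OP gave us; second, and more importantly, "Article 1" may be relevant here but not for documents that use "Section 1" and so on.
I am afraid this is a very common problem: people show us a source document without telling us which aspects of it are fixed and which are variable. In that situation we can only guess.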

【Solution 3】:

Here is the XPath that matches the exact requirement in the OP.

//span[normalize-space(.)='Article 1.']/ancestor::p|//p[//span[normalize-space(.)='Article 1.']]/following::*[count(following-sibling::p/span/b[normalize-space(.)='PRIME MINISTER'])=1]

【Comments】:

【Solution 4】:

A simple XPath 1.0 expression:

 /*/p[starts-with(normalize-space(), 'Article 1.')]
     [1]
    | /*/p[starts-with(normalize-space(), 'Article 1.')]
          [1]/following-sibling::p
             [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
             and
               following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
             and not(starts-with(normalize-space(), 'PRIME MINISTER'))
             ]

When evaluated against this XML document:

<html>
<p>
  <b> <span> Article 1. </span> </b>
  <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>

<p>
  <span>
    1. Development viewpoints
  </span>
</p>

<p>
  <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

<p>
  <span>PRIME MINISTER: Nguyen Tan Dung</span>
</p>

<p>
  <span>
    <b> PRIME MINISTER </b>
  </span>
</p>

<p>
  <b> <span> Article 2. </span> </b>
  <span>
     .................
  </span>
</p>

<p>
  <span> PRIME MINISTER: Nguyen Tan Dung</span>
</p>
</html>

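it selects exactly the wanted <p> elements.

Verification:

This XSLT transformation evaluates the XPath expression and outputs all the nodes selected by that evaluation: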

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>

  <xsl:template match="/">
    <xsl:copy-of select=
    "/*/p[starts-with(normalize-space(), 'Article 1.')]
         [1]
        | /*/p[starts-with(normalize-space(), 'Article 1.')]
              [1]/following-sibling::p
                 [not(preceding-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')])
                 and
                   following-sibling::p[starts-with(normalize-space(), 'PRIME MINISTER')]
                 and not(starts-with(normalize-space(), 'PRIME MINISTER'))
                 ]
    "/>
  </xsl:template>
</xsl:stylesheet>

When applied to the same XML document (above), the wanted result is produced:

<p>
   <b>
      <span> Article 1. </span>
   </b>
   <span>
     To approve the master plan on development
     of tourism in Northern Central Vietnam
     with the following principal contents:
  </span>
</p>
<p>
   <span>
    1. Development viewpoints
  </span>
</p>
<p>
   <span>To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.
  </span>
</p>

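and it is displayed by the browser as expected:

Article 1. To approve the master plan on development of tourism in Northern Central Vietnam with the following principal contents:

1. Development viewpoints

To realize general viewpoints of the strategy for and master plan on development of Vietnam’s tourism through 2020.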
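For comparison, the same selection can be approximated in Python with the standard library. This is only a rough sketch, not an exact equivalent of the XPath above: it takes the first <p> whose normalized text starts with "Article 1." and the paragraphs after it, stopping before the first one that starts with "PRIME MINISTER". The shortened sample document and the helper names norm, is_start and is_end are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from itertools import dropwhile, takewhile

# Shortened version of the document above (assumption for the example).
DOC = """<html>
<p><b><span> Article 1. </span></b><span> To approve the master plan </span></p>
<p><span> 1. Development viewpoints </span></p>
<p><span>To realize general viewpoints.</span></p>
<p><span>PRIME MINISTER: Nguyen Tan Dung</span></p>
<p><span><b> PRIME MINISTER </b></span></p>
<p><b><span> Article 2. </span></b><span> ... </span></p>
</html>"""


def norm(elem):
    """Whitespace-normalized string value, similar to normalize-space()."""
    return " ".join("".join(elem.itertext()).split())


def is_start(p):
    return norm(p).startswith("Article 1.")


def is_end(p):
    return norm(p).startswith("PRIME MINISTER")


paragraphs = ET.fromstring(DOC).findall("p")  # sibling <p> children of the root

# Skip everything before the start paragraph, then keep paragraphs
# until (but not including) the first end paragraph.
from_start = dropwhile(lambda p: not is_start(p), paragraphs)
selected = list(takewhile(lambda p: not is_end(p), from_start))

texts = [norm(p) for p in selected]
print(texts)
```

As with the XPath expression, the "PRIME MINISTER" paragraphs themselves are left out of the result.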

【Comments】: