xPath expression: Getting elements even if they don't exist

【Question title】: xPath expression: Getting elements even if they don't exist
【Posted】: 2012-01-23 22:45:39
【Question】:

I pass this xPath expression to htmlCleaner:

 //table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img

Now, my problem is that the page varies: sometimes the /a/img element doesn't exist. So I want a single expression that gets all elements matching

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img

when /a/img exists, and

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]

when /a/img doesn't exist.

Does anyone know how to do this? In another question I found something that looks like it might help me

descendant-or-self::*[self::body or self::span/parent::body]

but I don't understand it.

【Question comments】:

Tags: java xml xpath htmlcleaner

【Solution 1】:

Use:

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]
                       [not(a/img)] 

|

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]
                      /a/img

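In general, if we want to select one node-set ($ns1) when some condition $cond is true, and another node-set ($ns2) otherwise, this can be done with the following single XPath expression: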

$ns1[$cond] | $ns2[not($cond)]

In this particular case, $ns1 is:

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]
                      /a/img

and $ns2 is:

 (//table[@class='StandardTable']
     /tbody/tr)
         [position()>1]
                   /td[2]

and $cond is:

boolean( (//table[@class='StandardTable']
         /tbody/tr)
             [position()>1]
                       /td[2]
                          /a/img
        )
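
As a rough illustration of the general pattern, here is a minimal Java sketch that inlines $ns1, $ns2 and $cond as text and evaluates the result with the JDK's JAXP XPath engine. The class name, the `table` helper and the sample markup are assumptions made up for this example, and the input is well-formed XML rather than HtmlCleaner output. Note that $cond, as written above, is evaluated once for the whole document, whereas the expression at the top of this answer tests each td[2] on its own with [not(a/img)].

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class ConditionalNodeSetDemo {
    // The building blocks named in the answer, written as plain strings.
    static final String ROWS = "(//table[@class='StandardTable']/tbody/tr)[position()>1]";
    static final String NS1 = ROWS + "/td[2]/a/img";
    static final String NS2 = ROWS + "/td[2]";
    static final String COND = "boolean(" + NS1 + ")";

    // $ns1[$cond] | $ns2[not($cond)] with the variables inlined as text.
    static final String EXPR =
        "(" + NS1 + ")[" + COND + "] | (" + NS2 + ")[not(" + COND + ")]";

    // Hypothetical helper: a header row plus one data row per given second-cell body.
    static String table(String... secondCells) {
        StringBuilder sb = new StringBuilder(
            "<html><body><table class='StandardTable'><tbody>"
            + "<tr><td>h1</td><td>h2</td></tr>");
        for (String cell : secondCells) {
            sb.append("<tr><td>x</td><td>").append(cell).append("</td></tr>");
        }
        return sb.append("</tbody></table></body></html>").toString();
    }

    // Returns the element names of the nodes selected by EXPR.
    static List<String> selectedNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPR, doc, XPathConstants.NODESET);
            List<String> names = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                names.add(nodes.item(i).getNodeName());
            }
            return names;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One row with an image, one without: $cond is true for the whole document.
        System.out.println(selectedNames(table("<a href='#'><img src='a.png'/></a>", "text")));
        // No images anywhere: $cond is false, so the td[2] cells are selected.
        System.out.println(selectedNames(table("text", "more text")));
    }
}
```

With the mixed table, only the img element comes back, because the document-wide $cond is true and $ns2 is filtered out entirely. With no images at all, both td cells come back. If rows with and without images have to be handled together, the per-cell test [not(a/img)] from the first expression in this answer is the form to use.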

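【Comments】:

It keeps giving me an XPatherException: Unknown function not
@Nacht: "It" is not a compliant XPath implementation. not() is a standard XPath function: w3.org/TR/1999/REC-xpath-19991116/#function-not
Yeah, I just found out that htmlCleaner doesn't evaluate boolean operations directly; you have to call another function named "evaluateFunction". And, as usual with htmlCleaner, there is no documentation. -_-
@Nacht: Then just replace not() in my solution with whatever htmlCleaner accepts. Please let me know whether this final solution works for you.
@Nacht - Better yet, just convert HTMLCleaner's output to a W3C Document and treat it as XML from the start. See my updated answer. Also, Dimitre beat me to the XPath solution, +1.

【Solution 2】:

You can select the union of two mutually exclusive expressions (note the | union operator):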

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]/a/img|
//table[@class='StandardTable']/tbody/tr[position()>1]/td[2][not(a/img)]

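For any given cell, when the first expression returns a node, the second one does not (and vice versa), which means you always get just the node you need.

From the comments on @Dimitre's answer, I see that HTMLCleaner doesn't fully support XPath 1.0. You don't really need it to. You only need HTMLCleaner to parse the malformed input. Once that job is done, convert its output to a standard org.w3c.dom.Document and treat it as XML.

Here's a conversion example: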

TagNode tagNode = new HtmlCleaner().clean("<html><div><p>test");
Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

From here on, just use JAXP with whatever implementation you like:

XPath xpath = XPathFactory.newInstance().newXPath();
Node node = (Node) xpath.evaluate("/html/body/div/p[not(child::*)]", 
                       doc, XPathConstants.NODE);
System.out.println(node.getTextContent());

Output:

test
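
The union from the top of this answer can be run through the same JAXP API by requesting XPathConstants.NODESET instead of a single node. The following is a minimal sketch, not tested against HtmlCleaner itself: the sample table, the class name and the `select` helper are assumptions for illustration, and the document is parsed from well-formed XML with the JDK's DocumentBuilder in place of the DomSerializer step shown above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class PerRowUnionDemo {
    static final String CELL =
        "//table[@class='StandardTable']/tbody/tr[position()>1]/td[2]";

    // img where the cell contains a/img, otherwise the cell itself.
    static final String EXPR = CELL + "/a/img | " + CELL + "[not(a/img)]";

    // Made-up sample: a header row, then rows with and without an image.
    static final String SAMPLE =
        "<html><body><table class='StandardTable'><tbody>"
        + "<tr><td>h1</td><td>h2</td></tr>"
        + "<tr><td>r1</td><td><a href='#'><img src='a.png'/></a></td></tr>"
        + "<tr><td>r2</td><td>plain text</td></tr>"
        + "<tr><td>r3</td><td><a href='#'><img src='b.png'/></a></td></tr>"
        + "</tbody></table></body></html>";

    // Describes each selected node as "img:<src>" or "td:<text>".
    static List<String> select(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath()
                .evaluate(EXPR, doc, XPathConstants.NODESET);
            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                Element e = (Element) nodes.item(i);
                if ("img".equals(e.getNodeName())) {
                    out.add("img:" + e.getAttribute("src"));
                } else {
                    out.add("td:" + e.getTextContent().trim());
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // One entry per data row: an img for rows 1 and 3, the td for row 2.
        System.out.println(select(SAMPLE));
    }
}
```

Checking the node name of each result is one simple way to tell the two cases apart in the calling code.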

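【Comments】:

Thanks. I can't confirm whether this works, because I'm no longer working on this (the rest of the program works and I was told to move on to other things). When I have time, I'll come back and add this. Until then, thanks!

【Solution 3】:

This is ugly and might not even work, but the principle should be sound: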

//table[@class='StandardTable']/tbody/tr[position()>1]/td[2][exists( /a/img )]/a/img | //table[@class='StandardTable']/tbody/tr[position()>1]/td[2][not( exists( /a/img ) )]

【Comments】: