Getting an exception when evaluating an XPath expression in Java

【Question title】: Getting Exception on evaluating an XPath expression in Java
【Posted】: 2019-04-08 03:15:30
【Question】:

I am trying to learn how to use XPath expressions in Java. I am using JTidy to convert an HTML page to XHTML so that I can parse it easily with XPath expressions. I have the following code:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
factory.setNamespaceAware(true);
DocumentBuilder builder = factory.newDocumentBuilder();

// Fetch the page and convert it to an XHTML DOM with JTidy
Document doc = ConvertXHTML("https://twitter.com/?lang=fr");

// Create the XPath evaluator
XPathFactory xpathfactory = XPathFactory.newInstance();
XPath Inst = xpathfactory.newXPath();
NodeList nodes = (NodeList) Inst.evaluate("//p/@align", doc, XPathConstants.NODESET);
for (int i = 0; i < nodes.getLength(); ++i) {
    // "//p/@align" selects attribute nodes, so they are read as Node;
    // casting them to Element would fail
    Node n = nodes.item(i);
    System.out.println(n.getNodeValue());
}

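public Document ConvertXHTML(String link) {
    try {
        URL u = new URL(link);

        BufferedInputStream instream = new BufferedInputStream(u.openStream());
        FileOutputStream outstream = new FileOutputStream("out.xhtml");

        Tidy c = new Tidy();
        c.setShowWarnings(false);
        c.setInputEncoding("UTF-8");
        c.setOutputEncoding("UTF-8");
        c.setXHTML(true);

        // Writes the tidied XHTML to out.xhtml and returns JTidy's own DOM
        return c.parseDOM(instream, outstream);
    } catch (IOException e) {
        // The posted snippet was cut off here; rethrowing is an assumed completion
        throw new UncheckedIOException(e);
    }
}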

It works for most URLs, but not for this one:

https://twitter.com/?lang=fr

For that URL I get this exception:

javax.xml.transform.TransformerException: Index -1 out of bounds.....

Here is part of the stack trace I get:

javax.xml.transform.TransformerException: Index -1 out of bounds for length 128
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:366)
at java.xml/com.sun.org.apache.xpath.internal.XPath.execute(XPath.java:303)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathImplUtil.eval(XPathImplUtil.java:101)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.eval(XPathExpressionImpl.java:80)
at java.xml/com.sun.org.apache.xpath.internal.jaxp.XPathExpressionImpl.evaluate(XPathExpressionImpl.java:89)
at files.ExampleCode.GetThoselinks(ExampleCode.java:50)
at files.ExampleCode.DoSomething(ExampleCode.java:113)
at files.ExampleCode.GetThoselinks(ExampleCode.java:81)
at files.ExampleCode.DoSomething(ExampleCode.java:113)

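I am not sure whether the problem lies in the XHTML produced for this site or somewhere else. Can anyone tell what is wrong with the code? Any suggested changes would help.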

【Comments】:

Which method throws the exception? Can you show us a stack trace?
@MichaelKay I have added the stack trace.

Tags: java xpath xhtml jtidy

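【Solution 1】:

I ran into a similar problem when evaluating XPath against a document produced by JTidy. I worked around it by having JTidy serialize the DOM it generated to a file, and then parsing that XML file with javax.xml.parsers.DocumentBuilder to obtain a second version of the DOM. Strange as it seems, using the second DOM avoids the out-of-bounds exception and works. Use code like the following: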

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
// If you don't do the following, it will take a full minute to parse the XML document (presumably the time-out
// period for trying to load the DTD). See https://stackoverflow.com/questions/6204827/xml-parsing-too-slow.
documentBuilderFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();

// tidy is a configured org.w3c.tidy.Tidy instance; input is the InputStream of the HTML page
Document doc = tidy.parseDOM(input, null);
// Serialize JTidy's DOM to a file; try-with-resources closes the stream even on failure
try (FileOutputStream fos = new FileOutputStream("temp.xml")) {
    tidy.pprint(doc, fos);
}
// Re-parse the file with the standard parser to get the second DOM
doc = documentBuilder.parse("temp.xml");
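
The second half of that round trip (re-parse, then query) can also be done in memory with JDK classes only. The following is a rough sketch rather than the code from this answer: `reparse` and `alignValues` are hypothetical helper names, and the byte array stands in for whatever `tidy.pprint(doc, out)` would write into a `ByteArrayOutputStream`. It also assumes that the serialized XHTML declares the XHTML default namespace. With a namespace-aware parser a plain `//p` then no longer matches, because unprefixed names in XPath 1.0 refer to no namespace, so the sketch matches on `local-name()` instead.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class ReparseSketch {

    // Hypothetical helper: parse serialized XHTML bytes with the JDK's own parser.
    static Document reparse(byte[] xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Do not fetch the external XHTML DTD, as in the answer above.
            factory.setFeature(
                "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new ByteArrayInputStream(xhtml));
        } catch (Exception e) {
            throw new IllegalStateException("Could not re-parse the XHTML", e);
        }
    }

    // Hypothetical helper: collect the align attribute values of all p elements.
    static List<String> alignValues(Document doc) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // local-name() matches p elements whether or not they are in a namespace.
            NodeList attrs = (NodeList) xpath.evaluate(
                "//*[local-name()='p']/@align", doc, XPathConstants.NODESET);
            List<String> values = new ArrayList<>();
            for (int i = 0; i < attrs.getLength(); i++) {
                // The result holds attribute nodes, so read the value; do not cast to Element.
                values.add(attrs.item(i).getNodeValue());
            }
            return values;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes JTidy would serialize; not real output from the site.
        String sample = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
            + "<p align=\"center\">a</p><p>b</p><p align=\"left\">c</p></body></html>";
        Document doc = reparse(sample.getBytes(StandardCharsets.UTF_8));
        System.out.println(alignValues(doc));
    }
}
```

Skipping the temporary file removes the need to clean up `temp.xml`, at the cost of holding the serialized page in memory.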

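【Comments】:

【Solution 2】:

I would normally say that an index-out-of-bounds exception coming from deep inside the XPath engine is a bug in the XPath engine. The only caveat is if there is something structurally wrong with the DOM that the XPath engine is searching; an XPath processor is entitled to make the reasonable assumption that the DOM is valid, and to crash if it is not. In that case it would be a bug in Tidy, which created the DOM.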

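【Comments】:

Suppose the problem is in Tidy and it is not giving me correct XHTML. Is there any way to add a check in XPath so that evaluating empty nodes is avoided?
I think my strategy at this stage would be either (a) build a project containing the source code of both Apache XPath and Tidy, try to reproduce the crash, and debug it, or (b) switch to other libraries, e.g. validator.nu instead of Tidy, and Saxon or Jaxen instead of Apache XPath. You could also try (c) getting help from the people who support these libraries, but for the XPath library I wouldn't hold my breath.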
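
On guarding against the failure from the calling side: XPath 1.0 has no error-handling construct of its own, but the caller can contain the exception. The sketch below assumes it is acceptable to treat a failed evaluation as "no matches". `evaluateOrEmpty` and `sampleDocument` are hypothetical helper names, not part of any library. This only keeps the crash from propagating; it does not repair a structurally broken DOM.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class SafeXPathSketch {

    // Hypothetical helper: returns the matching nodes, or an empty list if evaluation fails.
    static List<Node> evaluateOrEmpty(String expression, Node context) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            NodeList result =
                (NodeList) xpath.evaluate(expression, context, XPathConstants.NODESET);
            List<Node> nodes = new ArrayList<>();
            for (int i = 0; i < result.getLength(); i++) {
                nodes.add(result.item(i));
            }
            return nodes;
        } catch (XPathExpressionException | RuntimeException e) {
            // RuntimeException is caught too, in case the engine lets an unchecked one escape.
            System.err.println("XPath evaluation failed for " + expression + ": " + e);
            return Collections.emptyList();
        }
    }

    // Builds a tiny stand-in document; a real caller would pass the JTidy or re-parsed DOM.
    static Document sampleDocument() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            String xml = "<html><body><p align=\"center\">x</p></body></html>";
            return factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        // A valid expression returns its matches.
        System.out.println(evaluateOrEmpty("//p/@align", doc).size());
        // A malformed expression is used here only to exercise the failure path.
        System.out.println(evaluateOrEmpty("//p[", doc).size());
    }
}
```

Returning an empty list hides the difference between "nothing matched" and "evaluation failed", so the failure is logged to stderr; code that needs to tell the two apart should return a richer result type.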