Horrible Performance Parsing XHTML File with Doctype as XML Document: Answers

【Question title】: Horrible Performance Parsing XHTML File with Doctype as XML Document
【Posted】: 2012-03-10 07:43:21
【Question】:

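When I parse this XHTML file as XML, it takes about 2 minutes, which is far too long for such a simple file. I have found that if I remove the doctype declaration, it parses instantly. What is causing this file to take so long to parse?

Java example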

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );

XHTML example

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml"
    xmlns:ex="http://www.example.com/schema/v1_0_0">
    <head><title>Test</title></head>
    <body>
        <h1>Test</h1>
        <p>Hello, World!</p>
        <p><ex:test>Text</ex:test></p>
    </body>
</html>

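Thanks

Edit: Solution

To actually fix the problem, based on the information provided about why it happens, I took the following basic steps:

Downloaded the DTD and its related files into the src/main/resources folder
Created a custom EntityResolver that reads those files from the classpath
Told my DocumentBuilder to use the new EntityResolver

I referenced this SO answer while doing so: how to validate XML using java?

New entity resolver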

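import java.io.IOException;
import java.io.InputStream;

import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class LocalXhtmlDtdEntityResolver implements EntityResolver {

    /* (non-Javadoc)
     * @see org.xml.sax.EntityResolver#resolveEntity(java.lang.String, java.lang.String)
     */
    @Override
    public InputSource resolveEntity( String publicId, String systemId )
            throws SAXException, IOException {
        // Use the last path segment of the system ID as the classpath resource name.
        String fileName = systemId.substring( systemId.lastIndexOf( "/" ) + 1 );
        InputStream stream = getClass().getClassLoader().getResourceAsStream( fileName );
        if ( stream == null ) {
            // No local copy: returning null lets the parser fall back to its default resolution.
            return null;
        }
        InputSource source = new InputSource( stream );
        // Keep the original IDs on the source so the parser still knows where the entity came from.
        source.setPublicId( publicId );
        source.setSystemId( systemId );
        return source;
    }

}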

How to use the new EntityResolver:

DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
dbf.setNamespaceAware( true );
DocumentBuilder bob = dbf.newDocumentBuilder();
bob.setEntityResolver( new LocalXhtmlDtdEntityResolver() );
Document template = bob.parse( new InputSource( new FileReader( xmlFile ) ) );

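【Question comments】:

As others have pointed out, the parser is trying to download resources from the internet; you need to resolve these entities yourself.
For some reason this solution did not work for me, so I just installed squid and added -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128

Tags: java xml xhtml

【Solution 1】:

Java is downloading the specified DTD and the files it includes in order to check that your XHTML file adheres to that DTD. Using Charles proxy, I recorded the following requests, each taking the time shown to load: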

http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd - 30.4 seconds
http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent - 30.19 seconds
http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent - 30.23 seconds
http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent - 30.20 seconds
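
If you do not have a proxy such as Charles at hand, a rough way to see which external resources the parser asks for is to install an EntityResolver that records every system ID it is handed. The following is only a minimal sketch: the class name ExternalEntityLogger and the method requestedSystemIds are made up for illustration, and the resolver answers each request with an empty stand-in so nothing is fetched from the network. Because the stand-in DTD is empty, the files that the real DTD pulls in (the .ent files listed above) will not appear in the log.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.xml.sax.InputSource;

// Hypothetical helper, not part of the original question or answers.
public class ExternalEntityLogger {

    /** Parses the XML text and returns the system IDs the parser asked to resolve. */
    public static List<String> requestedSystemIds(String xml) throws Exception {
        List<String> requested = new ArrayList<>();

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();

        // EntityResolver has a single method, so a lambda is enough here.
        builder.setEntityResolver((publicId, systemId) -> {
            requested.add(systemId);
            // Empty stand-in content: the parser reads this instead of opening the URL.
            return new InputSource(new StringReader(""));
        });

        builder.parse(new InputSource(new StringReader(xml)));
        return requested;
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<?xml version=\"1.0\"?>\n"
              + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\" "
              + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
              + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
              + "<head><title>Test</title></head><body><p>Hello, World!</p></body></html>";

        // Print every external resource the parser requested while parsing.
        for (String systemId : requestedSystemIds(xml)) {
            System.out.println(systemId);
        }
    }
}
```

Note that a document using named entities such as &nbsp; would fail to parse against the empty stand-in, so this is a diagnostic aid rather than a fix.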

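【Comments】:

【Solution 2】:

In fact, you are lucky that you got the files at all. W3C is deliberately not responding to requests for these documents, because it cannot handle the volume of requests. You need to give the parser a local copy.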

The usual way to do this in the Java world is to use the Apache/Oasis catalog resolver.
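
As an illustration of the catalog idea, here is a minimal sketch that uses the javax.xml.catalog API bundled with the JDK since Java 9, rather than the Apache resolver library named above; that substitution is mine. The class name CatalogResolverSketch, the temporary directory, and the one-line stub.dtd are assumptions for the demo only. In practice the catalog would point at your downloaded copies of xhtml1-transitional.dtd and its .ent files.

```java
import java.io.StringReader;
import java.nio.file.Files;
import java.nio.file.Path;

import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;
import javax.xml.catalog.CatalogResolver;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; requires Java 11+ for Files.writeString.
public class CatalogResolverSketch {

    static final String PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Transitional//EN";
    static final String SYSTEM_ID = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd";

    /** Parses the XML text, resolving external entities through the given OASIS catalog file. */
    public static Document parseWithCatalog(String xml, Path catalogFile) throws Exception {
        // CatalogResolver implements org.xml.sax.EntityResolver, so it plugs straight in.
        CatalogResolver resolver =
                CatalogManager.catalogResolver(CatalogFeatures.defaults(), catalogFile.toUri());

        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        dbf.setNamespaceAware(true);
        DocumentBuilder builder = dbf.newDocumentBuilder();
        builder.setEntityResolver(resolver);
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    /** Writes a stand-in DTD plus a catalog that maps the XHTML IDs to it; returns the catalog path. */
    public static Path writeDemoCatalog() throws Exception {
        Path dir = Files.createTempDirectory("xhtml-catalog");
        // Stand-in for the real DTD: declares only the nbsp entity.
        Files.writeString(dir.resolve("stub.dtd"), "<!ENTITY nbsp \"&#160;\">\n");

        Path catalog = dir.resolve("catalog.xml");
        Files.writeString(catalog,
                "<?xml version=\"1.0\"?>\n"
              + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
              + "  <public publicId=\"" + PUBLIC_ID + "\" uri=\"stub.dtd\"/>\n"
              + "  <system systemId=\"" + SYSTEM_ID + "\" uri=\"stub.dtd\"/>\n"
              + "</catalog>\n");
        return catalog;
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<!DOCTYPE html PUBLIC \"" + PUBLIC_ID + "\" \"" + SYSTEM_ID + "\">\n"
              + "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body><p>a&nbsp;b</p></body></html>";
        Document doc = parseWithCatalog(xml, writeDemoCatalog());
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

With the default CatalogFeatures, an identifier that has no catalog entry is treated as an error rather than silently fetched, so the catalog needs an entry for every DTD and entity file the documents reference.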

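Recent versions of Saxon have built-in knowledge of these commonly used DTDs and entity files, and if you allow Saxon to supply the XML parser, it will automatically be configured to use the local copies. You can take advantage of this even if you are not using XSLT or XQuery to process the data: just create a Saxon Configuration object and call its getSourceParser() method to get your XMLReader.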

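(This could also be a good time to get yourself away from DOM. Of all the options for processing XML in Java, DOM is probably the worst. If you must use a low-level navigational API, choose a decent one like JDOM or XOM.)

【Comments】: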