【Question Title】: How to preserve DOCTYPE declarations when manipulating XML with Jsoup
【Posted】: 2015-09-18 19:02:04
【Question】:

I have an XML document that begins like this:

<?xml version="1.0"?>
<!DOCTYPE  viewdef [
<!ENTITY nbsp   "&#160;"> <!-- no-break space = non-breaking space U+00A0 ISOnum -->
<!ENTITY copy   "&#169;"> <!-- copyright sign, U+00A9 ISOnum -->
<!ENTITY amp    "&#038;"> <!-- ampersand -->
<!ENTITY shy    "&#173;"> <!-- soft hyphen -->
]>

I am using Jsoup 1.8.2 to parse the document as follows:

public static void convertXml(String inFile, String outFile) throws Exception {
    String xmlString = FileUtils.readFileToString(new File(inFile), Charset.forName("UTF-8")); 
    Document document = Jsoup.parse(xmlString, "UTF-8", Parser.xmlParser());
    FileUtils.writeStringToFile(new File(outFile), document.html(), "UTF-8");           
}

In this case I expect the output file to be identical to the input file, but Jsoup produces this instead:

<?xml version="1.0"?> <!DOCTYPE viewdef> 
<!-- no-break space = non-breaking space U+00A0 ISOnum --> 
<!--ENTITY copy   "&#169;"--> 
<!-- copyright sign, U+00A9 ISOnum --> 
<!--ENTITY amp    "&#038;"--> 
<!-- ampersand --> 
<!--ENTITY shy    "&#173;"--> 
<!-- soft hyphen --> ]&gt;

Is this a bug, or is there a way to preserve the original DOCTYPE declaration?

【Comments】:

    Tags: java xml jsoup doctype


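    【Solution 1】:

    Before parsing xmlString with Jsoup, manually replace the DOCTYPE sequence with a placeholder, then put the original DOCTYPE back into the final document.

    Sample code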

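    import java.io.File;
    import java.nio.charset.Charset;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    import org.apache.commons.io.FileUtils;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.parser.Parser;

    private final static String DOCTYPE_SEQUENCE = "<doctype-sequence/>";
    // Backslashes must be doubled in a Java string literal. The reluctant +? stops at the
    // first "]>" so that a later "]]>" (end of a CDATA section) is not swallowed.
    private final static Pattern pattern = Pattern.compile("(?i)<!DOCTYPE[\\s\\S]+?]>");

    public static void convertXml(String inFile, String outFile) throws Exception {
        String xmlString = FileUtils.readFileToString(new File(inFile), Charset.forName("UTF-8"));

        // Swap the DOCTYPE declaration (if one is found) for a placeholder element
        String doctype = "";
        Matcher matcher = pattern.matcher(xmlString);
        if (matcher.find()) {
            doctype = matcher.group(0);
            xmlString = xmlString.replace(doctype, DOCTYPE_SEQUENCE);
        }

        // Parse as XML (the second argument is the base URI, not a charset), then restore the DOCTYPE.
        // If Jsoup serializes the placeholder differently (e.g. "<doctype-sequence />"), match that form here.
        Document document = Jsoup.parse(xmlString, "", Parser.xmlParser());
        FileUtils.writeStringToFile(new File(outFile), document.html().replace(DOCTYPE_SEQUENCE, doctype), "UTF-8");
    }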
    

    The pattern variable is declared outside convertXml so that the regular expression is compiled only once rather than on every call.
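
The stash-and-restore step can be tried on its own, without Jsoup, to check that the regular expression captures exactly the DOCTYPE block. The following is a minimal sketch using only `java.util.regex`. The class name `DoctypeStash`, its helper methods, and the comment-style placeholder are illustrative assumptions, not part of the original answer or of the Jsoup API. The sample input is a shortened version of the document from the question, with a CDATA section added to show why a reluctant quantifier is used.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoctypeStash {
    // Hypothetical placeholder. An XML comment is used on the assumption that
    // a parser/serializer round trip leaves comment text unchanged.
    static final String PLACEHOLDER = "<!--doctype-placeholder-->";

    // Compiled once. Matches from "<!DOCTYPE" up to the first "]>".
    static final Pattern DOCTYPE = Pattern.compile("(?i)<!DOCTYPE[\\s\\S]+?\\]>");

    // Returns the DOCTYPE declaration including its internal subset, or "" if none is found.
    static String findDoctype(String xml) {
        Matcher m = DOCTYPE.matcher(xml);
        return m.find() ? m.group() : "";
    }

    // Replaces the DOCTYPE text with the placeholder before parsing.
    static String stash(String xml, String doctype) {
        return doctype.isEmpty() ? xml : xml.replace(doctype, PLACEHOLDER);
    }

    // Puts the original DOCTYPE text back after serialization.
    static String restore(String xml, String doctype) {
        return doctype.isEmpty() ? xml : xml.replace(PLACEHOLDER, doctype);
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE viewdef [\n"
                + "<!ENTITY nbsp \"&#160;\">\n"
                + "]>\n"
                + "<viewdef><![CDATA[a]]></viewdef>";

        String doctype = findDoctype(xml);
        String stashed = stash(xml, doctype);
        // A real program would parse and re-serialize `stashed` here.
        String restored = restore(stashed, doctype);

        System.out.println(doctype);
        System.out.println(restored.equals(xml));
    }
}
```

With a greedy `+` the match would run on to the `]]>` that closes the CDATA section, so the reluctant `+?` is the safer choice when the document body may contain CDATA. Note that this pattern only matches a DOCTYPE that has an internal subset (one ending in `]>`).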

    【Comments】:
