【Question title】: java: remove cdata tag from xml
【Posted】: 2023-03-06 14:36:02
【Question description】:

XPath works very well for parsing XML files, but it does not work on data inside a CDATA tag:

<![CDATA[ Some Text <p>more text and tags</p>... ]]>

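My solution: first get the content of the XML, then remove

"<![CDATA["  and  "]]>".

After that, I would run XPath on the XML file "to reach everything". Is there a better solution? If not, how can I do this with a regular expression?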

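【Question comments】:

Removing the CDATA markers may make your XML invalid (and possibly useless for processing purposes).
Regular expressions and XML don't mix. Please read stackoverflow.com/questions/1732348
So what is the solution for getting the title, description, publication time and so on from an RSS XML file, and getting the CDATA content at the same time? What I actually need from the CDATA is the image link.

Tags: java regex xslt xpath cdata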

【Solution 1】:

You can certainly remove CDATA from the XML by using regular expressions to strip the unwanted parts.

For example:

String s = "<sn><![CDATA[poctest]]></sn>";
s = s.replaceAll("<!\\[CDATA\\[", ""); // remove the opening "<![CDATA["
s = s.replaceAll("]]>", "");           // remove the closing "]]>"

The result will be:

<sn>poctest</sn>

Please check whether this solves your problem.

【Comments】:

【Solution 2】:

Try this:

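import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Regex kept as posted: the first alternative matches the empty string at any
// position not directly preceded by "<![CDATA["; the second greedily matches up
// to the last word character that is followed by a non-word character.
public static String removeCDATA(String text) {
    StringBuilder result = new StringBuilder();
    Pattern regex = Pattern.compile("(?<!(<!\\[CDATA\\[))|((.*)\\w+\\W)");
    Matcher regexMatcher = regex.matcher(text);
    while (regexMatcher.find()) {
        result.append(regexMatcher.group());
    }
    return result.toString();
}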

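When I call this method with your test input <![CDATA[ Some Text <p>more text and tags</p>... ]]>, it returns Some Text <p>more text and tags</p>

But I think an approach without regular expressions would be more reliable. Something like this: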

public static String removeCDATA(String text) {
    String s = text.trim();
    if (s.startsWith("<![CDATA[")) {
        s = s.substring(9); // skip the 9 characters of "<![CDATA["
        int i = s.indexOf("]]>");
        if (i == -1) throw new IllegalStateException("argument starts with <![CDATA[ but cannot find pairing ]]>");
        s = s.substring(0, i);
    }
    return s;
}

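【Comments】:

【Solution 3】:

CDATA sections exist because everything inside them is plain text that should not be interpreted as XML directly. You could equally write the document fragment from the question as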

 Some Text &lt;p&gt;more text and tags&lt;/p&gt;...

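(with the leading and trailing spaces).

If you really want to interpret it as XML, extract the text from the document and then submit it to an XML parser again.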
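The extract-then-reparse idea can be sketched in Java roughly as follows. This is only a sketch under assumptions: the element names item, description and wrapper are hypothetical, and the CDATA text is assumed to be a well-formed XML fragment once it is wrapped in a single root element. It also shows that XPath does return the CDATA text as the string value of the enclosing element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class CdataReparse {

    // Parses an XML string into a DOM document.
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Reads the text of /item/description (hypothetical names), parses that
    // text as XML, and returns the text of the <p> element inside it.
    static String firstParagraph(String xml) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            // The string value of the element is the raw CDATA text, markup included.
            String cdataText = xpath.evaluate("/item/description", parse(xml));
            // Wrap the fragment in a hypothetical <wrapper> root so it has one root element.
            Document inner = parse("<wrapper>" + cdataText + "</wrapper>");
            return xpath.evaluate("/wrapper/p", inner);
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<item><description>"
                + "<![CDATA[ Some Text <p>more text and tags</p>... ]]>"
                + "</description></item>";
        System.out.println(firstParagraph(xml)); // prints: more text and tags
    }
}
```

If the CDATA content is HTML rather than well-formed XML, the second parse will fail, and an HTML parser would be the more appropriate tool for that step.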

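【Comments】:

I'm curious whether you are suggesting something simpler than the answer I proposed?
Not really... I'm just saying this problem should not normally exist, because the content of a CDATA section is not meant to be interpreted as XML.

【Solution 4】:

I needed to accomplish the same task. I solved it with two XSLT transformations.

Let me stress that this only works if the CDATA content is well-formed XML.

For completeness, let me add a root element to your example XML: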

<root>
   <well-formed-content><![CDATA[ Some Text <p>more text and tags</p>]]>
   </well-formed-content>
</root>

Figure 1.- Starting XML

First step

In the first transformation step, I wrap all text nodes in a newly introduced XML element, old_text:

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (Note precondition: the CDATA content must be well-formed XML.) -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

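Figure 2.- First XSLT (wraps the CDATA in an "old_text" element)

If you apply this transformation to the starting XML, this is what you get (I have not reformatted it, to avoid confusion about who did what):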

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text> Some Text <p>more text and tags</p>
    </old_text></well-formed-content><old_text>
</old_text></root>

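Figure 3.- Transformed XML (first step)

Second step

You now need to clean up the introduced old_text elements and re-escape the text that did not create new nodes: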

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
    encoding="UTF-8" standalone="yes" />

    <!-- Element-nodes: Process nodes and their children -->
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!--
        'Wrapper'-node: remove the wrapper element but process its children.
        With this matcher, the "old_text" is cleaned, but the originally CDATA
        well-formed nodes surface in the resulting xml.
    -->
    <xsl:template match="old_text">
        <xsl:apply-templates select="*|text()" />
    </xsl:template>

    <!--
        Text-nodes: Text here comes from original CDATA and must be now
        escaped. Note that the previous rule has extracted all the existing
        nodes in the CDATA. -->
    <xsl:template match="text()">
        <xsl:value-of select="." disable-output-escaping="no" />
    </xsl:template>

</xsl:stylesheet>

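Figure 4.- Second XSLT (cleans up the artificially introduced elements)

Result

This is the final result, with the nodes that were originally inside the CDATA expanded in your new XML file: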

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content> Some Text <p>more text and tags</p>
    </well-formed-content>
</root>

Figure 5.- Final XML
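Since the question is about Java, here is a rough sketch of how the two steps might be chained with the standard javax.xml.transform API. The stylesheets are condensed versions of Figures 2 and 4 (processing instructions are not handled, and step 2 relies on the built-in rule for text nodes), and the class and method names are made up for illustration. It assumes the XSLT processor honours disable-output-escaping when serializing to a stream, which is optional behaviour in XSLT 1.0.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class TwoStepCdataUnwrap {

    // Templates shared by both steps: copy elements, attributes and comments.
    static final String COMMON =
          "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
        + "<xsl:output method=\"xml\" indent=\"no\" encoding=\"UTF-8\"/>"
        + "<xsl:template match=\"*\"><xsl:copy>"
        + "<xsl:apply-templates select=\"*|text()|@*|comment()\"/>"
        + "</xsl:copy></xsl:template>"
        + "<xsl:template match=\"@*|comment()\"><xsl:copy-of select=\".\"/></xsl:template>";

    // Step 1 (condensed from Figure 2): wrap each text node in old_text, unescaped.
    static final String STEP1 = COMMON
        + "<xsl:template match=\"text()\"><xsl:element name=\"old_text\">"
        + "<xsl:value-of select=\".\" disable-output-escaping=\"yes\"/>"
        + "</xsl:element></xsl:template></xsl:stylesheet>";

    // Step 2 (condensed from Figure 4): drop the old_text wrappers, keep their children.
    static final String STEP2 = COMMON
        + "<xsl:template match=\"old_text\">"
        + "<xsl:apply-templates select=\"*|text()\"/>"
        + "</xsl:template></xsl:stylesheet>";

    // Applies one stylesheet to one XML string and returns the serialized result.
    static String transform(String xslt, String xml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT step failed", e);
        }
    }

    // Runs step 1, then feeds its serialized output into step 2.
    static String unwrapCdata(String xml) {
        return transform(STEP2, transform(STEP1, xml));
    }

    public static void main(String[] args) {
        String xml = "<root><well-formed-content>"
                + "<![CDATA[ Some Text <p>more text and tags</p>]]>"
                + "</well-formed-content></root>";
        System.out.println(unwrapCdata(xml));
    }
}
```

The intermediate result has to be serialized and parsed again between the two steps, because disable-output-escaping only takes effect at serialization time; passing a DOM tree from step 1 to step 2 would leave the CDATA content as plain text.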

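Warning

If your CDATA contains HTML character entities that are not supported in XML (see the examples in the Wikipedia article about character entities), you need to add those references to your intermediate XML. Let me illustrate this with an example: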

<root>
    <well-formed-content>
        <![CDATA[ Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.]]>
    </well-formed-content>
</root>

Figure 6.- The XML of Figure 1 with the character entity &nbsp; added

The original XSLT in Figure 2 would transform this XML into:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

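Figure 7.- Result of the first attempt to transform the XML in Figure 6 (not well-formed!)

The problem with this file is that it is not well-formed, so it cannot be processed any further by an XSLT processor:

The entity "nbsp" was referenced, but not declared. XML checking finished.

Figure 8.- Result of the well-formedness check for the XML in Figure 7

The following workaround solves the problem (the match="/" template adds the &nbsp; entity):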

<?xml version="1.0" encoding="UTF-8" ?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="no" version="1.0"
                encoding="UTF-8" standalone="yes" />

    <!-- Add an html entity to the xml character entities declaration. -->
    <xsl:template match="/">
        <xsl:text disable-output-escaping="yes"><![CDATA[<!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>
]]>
        </xsl:text>
        <xsl:apply-templates select="*" />
    </xsl:template>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*|text()|@*|comment()|processing-instruction()" />
        </xsl:copy>
    </xsl:template>

    <!-- Attribute-nodes and comment-nodes: Pass through without modifying -->
    <xsl:template match="@*|comment()|processing-instruction()">
        <xsl:copy-of select="." />
    </xsl:template>

    <!-- Text-nodes: Wrap them in a new node without escaping it. -->
    <!-- (Note precondition: the CDATA content must be well-formed XML.) -->
    <xsl:template match="text()">
        <xsl:element name="old_text">
            <xsl:value-of select="." disable-output-escaping="yes" />
        </xsl:element>
    </xsl:template>

</xsl:stylesheet>

Figure 9.- XSLT that creates the entity declaration

Now, after applying this XSLT to the source XML of Figure 6, this is the intermediate XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><!DOCTYPE root
[
    <!ENTITY nbsp "&#160;">
]>

        <root><old_text>
    </old_text><well-formed-content><old_text>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop:&nbsp;.
    </old_text></well-formed-content><old_text>
</old_text></root>

Figure 10.- Intermediate XML (the XML of Figure 3 plus the entity declaration)
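To see from Java why the declaration matters, a small check like the following could be used. It is a sketch with a shortened document and hypothetical class and method names: without the DOCTYPE the standard parser rejects &nbsp;, and with it the entity resolves to the character U+00A0.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class EntityCheck {

    // Returns the text content of the root element, or null if the XML cannot be parsed.
    static String textOrNull(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            return null; // for example a SAXParseException for an undeclared entity
        }
    }

    public static void main(String[] args) {
        String body = "<root><old_text>stop:&nbsp;.</old_text></root>";
        String doctype = "<!DOCTYPE root [ <!ENTITY nbsp \"&#160;\"> ]>";

        // Undeclared entity: the document is not well-formed, so parsing fails.
        System.out.println(textOrNull(body) == null); // prints: true

        // Declared entity: the reference is replaced by a non-breaking space.
        String text = textOrNull(doctype + body);
        System.out.println(text.indexOf('\u00A0') >= 0); // prints: true
    }
}
```

The default error handler may also print the parser's fatal error message to standard error for the first document.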

You can then use the XSLT transformation in Figure 4 to produce the final XML:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?><root>
    <well-formed-content>
        Some Text <p>more text and tags</p>,
        now with a non-breaking-space before the stop: .
    </well-formed-content>
</root>

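Figure 11.- Final XML with the HTML entity converted to UTF-8

Notes

For these examples I used the NetBeans 7.1.2 built-in XSLT processor (com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl - default JRE XSLT processor)

Disclaimer: I am not an XML expert. I feel this should be easier...

【Comments】:

【Solution 5】:

To strip the CDATA and keep the tags as tags, you can use XSLT.

Given this XML input: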

<?xml version="1.0" encoding="ISO-8859-1"?>
<root>
    <child>Here is some text.</child>
    <child><![CDATA[Here is more text <p>with tags</p>.]]></child>
</root>

Using this XSLT:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs"
    version="2.0">

    <xsl:output method="xml" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="*" />
            <xsl:value-of select="text()" disable-output-escaping="yes"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

returns the following XML:

<?xml version="1.0" encoding="UTF-8"?>
<root>
   <child>Here is some text.</child>
   <child>Here is more text <p>with tags</p>.</child>
</root>

(Tested with Saxon HE 9.3.0.5 in oXygen 12.2)

You can then use XPath to extract the content of the p element:

/root/child/p
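From Java, that expression could be evaluated against the transformed output along these lines. This is a minimal sketch using the standard javax.xml.xpath API; the class and method names are made up for illustration, and the input string is the transformed XML shown above without its indentation.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ParagraphXPath {

    // Returns the text content of every node selected by /root/child/p.
    static List<String> paragraphs(String xml) {
        try {
            NodeList nodes = (NodeList) XPathFactory.newInstance().newXPath().evaluate(
                    "/root/child/p",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String transformed = "<root>"
                + "<child>Here is some text.</child>"
                + "<child>Here is more text <p>with tags</p>.</child>"
                + "</root>";
        System.out.println(paragraphs(transformed)); // prints: [with tags]
    }
}
```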

【Comments】: