Remove all HTML tags except allowed tags using an XSLT function: answers

【Question title】: Remove all html tags except allowed tags using XSLT function
【Posted】: 2017-03-10 11:02:53
【Question description】:

I am trying to use XSLT to clean up some data that we get from an RSS feed. I want to remove all tags except the p tag.

 Cows are kool.<p>The <i>milk</i> <b>costs</b> $1.99.</p>

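I have a few questions about how to solve this with XSLT 1.0 or 2.0.

1) I have seen this example: https://maulikdhorajia.blogspot.in/2011/06/removing-html-tags-using-xslt.html

But I need the p tags to remain, and I need to use a regular expression. Could we use a string-before-match function and proceed in a similar way? I don't think such a function exists in XPath.

2) I know the replace function cannot be used for this, because it expects a string: if we pass it a node, the node's text content is extracted and passed to the function, which defeats the purpose of removing tags.

I am a bit confused by this answer, which does use replace: https://stackoverflow.com/a/18528749/745018.

3) I am doing this with XSLT in an nginx server.

Please find below a sample of the input that we get in the body tag of the RSS feed.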

<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>

Update: I am also looking for an XSLT function for this.

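【Comments on the question】:

Please provide a minimal but complete sample of the XML input together with the corresponding result you want. We need to see whether the HTML in the RSS feed is included as markup or as text (inside a CDATA section). We also need to know whether you expect the HTML to be parseable as XML or only as HTML.
@MartinHonnen Updated with a sample input. I need to return the content in CDATA without any HTML tags except the p tag.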

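Tags: xml xslt replace strip-tags

【Solution 1】:

Assuming you can use XSLT 2.0, you can apply David Carlisle's HTML parser (https://github.com/davidcarlisle/web-xslt/blob/master/htmlparse/htmlparse.xsl) to the content of the body element and then process the resulting nodes in a mode that strips all elements except p elements: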

<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:d="data:,dpc"
    xmlns:xhtml="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="d xhtml">

    <xsl:import href="htmlparse-by-dcarlisle.xsl"/>

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="d:htmlparse(., '', true())" mode="strip"/>
        </xsl:copy>
    </xsl:template>

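    <!-- Drop any element other than p, but keep processing its content in the same mode so nested tags are stripped too -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>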

</xsl:transform>

For the input

<rss>
    <entry>
        <body><![CDATA[<p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on <h2>March 31</h2> because the judge ignored an earlier court order summoning him.<i>Justice Karnan</i> had to appear</p>]]></body>
    </entry>
</rss>

this produces

<rss>
    <entry>
        <body><p>The Supreme Court issued on Friday a bailable warrant against sitting Calcutta high court justice CS Karnan, an unprecedented order in a bitter confrontation between the judge and the top court.</p><p>A seven-judge bench headed by Chief Justice of India JS Khehar issued the order directing Karnan’s presence on March 31 because the judge ignored an earlier court order summoning him.Justice Karnan had to appear</p></body>
    </entry>
</rss>

If the HTML is not escaped but is included as XML markup in the input, you do not need to parse it; just apply the mode to the content:

<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:template match="@*|node()" mode="#default strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="#current"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

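    <!-- Drop any element other than p, but keep processing its content in the same mode so nested tags are stripped too -->
    <xsl:template match="*[not(self::p)]" mode="strip">
        <xsl:apply-templates mode="#current"/>
    </xsl:template>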

</xsl:transform>

http://xsltransform.net/gWEamMc/1
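
The stylesheets above rely on XSLT 2.0 features (#default, #current, and a list of modes on one template). The question mentions running the transformation inside nginx, whose XSLT module is built on libxslt, an XSLT 1.0 processor. Below is a minimal sketch of the same stripping idea in XSLT 1.0. It assumes the HTML is present as real markup inside body rather than as escaped text, because the HTML parser used above needs XSLT 2.0. The mode name strip is carried over from the answer; everything else is an illustrative rewrite, not the answerer's code.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

    <!-- Identity transformation for everything outside body -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Copy body itself, then switch to the strip mode for its content -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:apply-templates select="node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- Any element: drop the tag, keep processing its content -->
    <xsl:template match="*" mode="strip">
        <xsl:apply-templates mode="strip"/>
    </xsl:template>

    <!-- p is the one allowed element: copy it and keep stripping inside it -->
    <xsl:template match="p" mode="strip">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()" mode="strip"/>
        </xsl:copy>
    </xsl:template>

    <!-- Attributes (of allowed elements) and text are copied unchanged -->
    <xsl:template match="@*|text()" mode="strip">
        <xsl:copy/>
    </xsl:template>

</xsl:stylesheet>
```

The p template wins over the generic * template through the default priority rules (a name test outranks a wildcard), so no explicit priority attribute is needed. Comments and processing instructions inside body are dropped in this sketch, because the built-in templates ignore them.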

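【Discussion】:

Thanks. Assume the HTML has already been parsed. Only
xsltransform.net/ejivdJa works fine for me. If you need further help, you will have to edit your question and provide a minimal but complete sample of the XML and XSLT, the output you want, and the output you get or the exact error message.
Yes, it works. I just wanted to see how to make it work for this input: xsltransform.net/gWEamMc. Updated the question.
@Mortan. Thanks, I removed the strip mode and it worked. The HTML parsing is a nice addition that I needed.
Also, can we create an xsl function that does this?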
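
One way such a function could look, as a sketch only: the XSLT 2.0 stylesheet below wraps the strip mode in an xsl:function. The function name my:strip-tags, its namespace URI http://example.com/my-functions, and the allowed parameter are made up for illustration; they are not part of the answer above or of any library. An XSLT 2.0 processor is assumed, since xsl:function is not available in XSLT 1.0.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my-functions"
    exclude-result-prefixes="xs my">

    <!-- Hypothetical helper: returns $nodes with every element removed
         whose local name is not listed in $allowed (content is kept) -->
    <xsl:function name="my:strip-tags" as="node()*">
        <xsl:param name="nodes" as="node()*"/>
        <xsl:param name="allowed" as="xs:string*"/>
        <xsl:apply-templates select="$nodes" mode="strip">
            <xsl:with-param name="allowed" select="$allowed" tunnel="yes"/>
        </xsl:apply-templates>
    </xsl:function>

    <!-- Identity transformation for everything outside body -->
    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- Example call: keep only p elements inside body -->
    <xsl:template match="body">
        <xsl:copy>
            <xsl:sequence select="my:strip-tags(node(), 'p')"/>
        </xsl:copy>
    </xsl:template>

    <!-- Elements: copy if allowed, otherwise keep only the content -->
    <xsl:template match="*" mode="strip">
        <xsl:param name="allowed" as="xs:string*" tunnel="yes"/>
        <xsl:choose>
            <xsl:when test="local-name() = $allowed">
                <xsl:copy>
                    <xsl:apply-templates select="@*|node()" mode="#current"/>
                </xsl:copy>
            </xsl:when>
            <xsl:otherwise>
                <xsl:apply-templates mode="#current"/>
            </xsl:otherwise>
        </xsl:choose>
    </xsl:template>

    <!-- Attributes (of allowed elements) and text are copied unchanged -->
    <xsl:template match="@*|text()" mode="strip">
        <xsl:copy/>
    </xsl:template>

</xsl:transform>
```

The list of allowed names travels as a tunnel parameter, so it reaches every nested template call without being passed on by hand; allowing more tags would be a call such as my:strip-tags(node(), ('p', 'b')). For escaped (CDATA) content, the same function could presumably be applied to the parsed nodes, for example my:strip-tags(d:htmlparse(., '', true()), 'p'), after importing the parser as in the first stylesheet of the answer.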