PHP regex to remove all HTML attributes except href: answers

【Question title】: php regex remove all attributes except href in html
【Posted】: 2016-01-10 15:55:04
【Question description】:

I want to use a PHP regex to remove all attributes in HTML, for example title="..." id="..." class="...", except href. I used $result = preg_replace('#[^(href)]="(.*?)"#is', '', $result); but it is wrong. Online test: http://www.phpliveregex.com/p/dcn

【Question comments】:

Tags: php regex preg-replace

【Solution 1】:

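You really should consider using an SGML parser for this kind of work. Regular expressions are not well suited to HTML processing. However, if they are the only thing available to you, you need to understand more about the syntax. At least one of your problems is the subexpression [^(href)], which is a character class. It matches a single character that is not one of (, h, r, e, f, or ). That is probably not what you want.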
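To make the character-class point concrete, here is a minimal sketch that runs the pattern from the question on a small input. The function name and the sample markup are made up for illustration; only the pattern itself comes from the question.

```php
<?php
// [^(href)] is a negated character class: it matches exactly ONE character
// that is not '(', 'h', 'r', 'e', 'f' or ')'. It does not mean
// "anything other than the word href".
function applyOriginalPattern(string $html): string
{
    return preg_replace('#[^(href)]="(.*?)"#is', '', $html);
}

// Hypothetical sample input, chosen so that the effect is easy to see.
$html = '<a title="t" id="i" class="c" href="u">x</a>';

echo applyOriginalPattern($html), "\n";
// prints: <a title="t" i clas href="u">x</a>
```

title="t" survives only because the character before = is an e, which the class excludes, while id and class each lose their last letter together with their value.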

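You could try a negative look-ahead with backreferences, but you may end up chewing up things you did not intend to, or missing things you wanted. Consider the following HTML-ish snippet: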

<p class="...">Properties like <a class="..." href="..."
name="...">href="..."</a> and <a href="..."
name="...">name="..."</a> should come after the &lt;a
and before the &gt;.</p>

<p class="..."><a name="..." href="..."><img
src="..." /></a><br class="..." />Fig. 1</p>

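You need to be able to tell when you are inside a tag (which is why I suggest an SGML parser), and it is not obvious how to make sure that only the right strings are replaced using negative look-ahead alone.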
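As a rough illustration of that risk, the sketch below uses a negative look-ahead that skips href but has no idea whether it is inside a tag. The pattern, the function name and the sample input are assumptions made for this example and are not taken from the question.

```php
<?php
// Hypothetical attempt: remove every name="value" pair whose name is not href.
// The pattern cannot tell a real attribute from look-alike text between tags.
function stripWithLookahead(string $html): string
{
    return preg_replace('#\s+(?!href\b)[\w-]+="[^"]*"#i', '', $html);
}

// Made-up input: the visible text also contains id="main".
$html = '<p class="x">Write id="main" on the <a href="u" title="t">element</a></p>';

echo stripWithLookahead($html), "\n";
// prints: <p>Write on the <a href="u">element</a></p>
```

The class and title attributes are removed as intended, but the id="main" that was part of the visible text disappears too.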

preg_replace_callback may be better suited to your use case (i.e., use your $callback to keep your href attributes, but filter out everything else):

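// Outer pass: match each opening or self-closing tag.
//   [1] = tag name, [2] = its quoted attributes, [3] = optional " /" before ">".
// The tag name stops at ">" so that it cannot run on into the following text.
$filtered = preg_replace_callback('#<([^/\s>][^\s>]*)((?:\s+[^>=]+=(?:\'[^\']*\'|"[^"]*"))*)(\s*/?)>#is',
    function ($tag) {
        // Inner pass: visit every name="value" pair and drop all but href.
        $attributes = preg_replace_callback('#\s+([^=]+)=(?:\'[^\']*\'|"[^"]*")#is',
            function ($attr) {
                // Compare case-insensitively so that HREF="..." is kept as well.
                return (strcasecmp($attr[1], 'href') !== 0
                    ? ''
                    : $attr[0]);
            }, $tag[2]);

        return ('<' . $tag[1] . $attributes . $tag[3] . '>');
    }, $subject);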

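There may be simpler ways to achieve the same goal than the above, but you should get the idea. By the way, running the HTML-ish snippet above through the code above gives you: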

<p>Properties like <a href="...">href="..."</a> and <a href="...">name="..."</a> should come after the &lt;a
and before the &gt;.</p>

<p><a href="..."><img /></a><br />Fig. 1</p>
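For comparison, here is a minimal sketch of the parser-based route, using PHP's bundled DOM extension (DOMDocument and DOMXPath) rather than a generic SGML parser. The function name keepOnlyHref, the wrapper div and the sample markup are assumptions made for this example. It assumes the DOM extension is enabled and that the input is plain ASCII; other encodings may need extra handling.

```php
<?php
// Sketch only: strip every attribute except href by walking a parsed DOM tree.
function keepOnlyHref(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml's complaints about imperfect HTML instead of emitting warnings.
    $previous = libxml_use_internal_errors(true);
    // Wrap the fragment in a single root so that it can be read back out later.
    $doc->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//*') as $element) {
        // Copy the names first: removing attributes while iterating the live
        // attribute map can skip entries.
        $names = [];
        foreach ($element->attributes as $attribute) {
            $names[] = $attribute->nodeName;
        }
        foreach ($names as $name) {
            if (strcasecmp($name, 'href') !== 0) {
                $element->removeAttribute($name);
            }
        }
    }

    // Serialize the children of the wrapper, leaving the wrapper itself out.
    $out = '';
    foreach ($doc->documentElement->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}

// Made-up sample input.
echo keepOnlyHref('<p class="c"><a name="n" href="u" title="t">link</a><br class="b">Fig. 1</p>'), "\n";
```

Because the parser knows where tags begin and end, attribute-like text between tags is left alone. The serialized markup may differ slightly from the input, for example in how void elements such as br are written.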

One or more of these tutorials may help, depending on how you learn:

【Comments】: