HTML Purifier: Removing an element conditionally based on its attributes

【Question title】: HTML Purifier: Removing an element conditionally based on its attributes
【Posted】: 2011-02-07 23:31:12
【Question】:

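According to the HTML Purifier smoketest, "malformed" URIs are occasionally discarded, leaving behind an attribute-less anchor tag, e.g.

<a href="javascript:document.location='http://www.google.com/'">XSS</a> becomes <a>XSS</a>

...and are occasionally stripped down to the protocol, e.g.

<a href="http://1113982867/">XSS</a> becomes <a href="http:/">XSS</a>

While that's unproblematic in itself, it's a bit ugly. Instead of trying to strip these out with regular expressions, I was hoping to use HTML Purifier's own library capabilities / injectors / plug-ins / whathaveyou.

Point of reference: Handling attributes

Conditionally removing an attribute in HTML Purifier is easy. The library offers the class HTMLPurifier_AttrTransform with the method confiscateAttr().

While I don't personally use confiscateAttr(), I do use an HTMLPurifier_AttrTransform, as per this thread, to add target="_blank" to all anchors.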

// more configuration stuff up here
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_Target();
// purify down here

HTMLPurifier_AttrTransform_Target is, of course, a very simple class.

class HTMLPurifier_AttrTransform_Target extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context) {
        // I could call $this->confiscateAttr() here to throw away an
        // undesired attribute
        $attr['target'] = '_blank';
        return $attr;
    }
}

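This part works like a charm, naturally.

Handling elements

Perhaps I'm not squinting hard enough at HTMLPurifier_TagTransform, or I'm looking in the wrong place, or I generally don't understand it, but I can't seem to figure out a way to conditionally remove elements.

Say, something to the effect of: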

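// wished-for, hypothetical API: addElementHandler() and the Cull class
// below are not part of HTML Purifier
// more configuration stuff up here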
$htmlDef = $htmlPurifierConfiguration->getHTMLDefinition(true);
$anchor  = $htmlDef->addElementHandler('a');
$anchor->elem_transform_post[] = new HTMLPurifier_ElementTransform_Cull();
// add target as per 'point of reference' here
// purify down here

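The Cull class would extend something that has a confiscateElement() ability, or comparable, in which I could check for a missing href attribute or an href attribute with the content http:/.

HTMLPurifier_Filter

I understand I could create a filter, but the examples (Youtube.php and ExtractStyleBlocks.php) suggest I'd be using regular expressions in it, which I'd really rather avoid if at all possible. I'm hoping for an onboard or quasi-onboard solution that makes use of HTML Purifier's excellent parsing capabilities.

Returning null in a child class of HTMLPurifier_AttrTransform unfortunately doesn't cut it.

Anyone have any smart ideas, or am I stuck with regexes? :)

【Question comments】:

I think I'm looking for the same thing? Take a look at my post stackoverflow.com/questions/2646240/… Did you figure this out?

Tags: php html-parsing htmlpurifier html

【Solution 1】:

Success! Thanks to Ambush Commander and mcgrailm in another question, I am now using a hilariously simple solution: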

// a bit of context
$htmlDef = $this->configuration->getHTMLDefinition(true);
$anchor  = $htmlDef->addBlankElement('a');

// HTMLPurifier_AttrTransform_RemoveLoneHttp strips 'href="http:/"' from
// all anchor tags (see first post for class detail)
$anchor->attr_transform_post[] = new HTMLPurifier_AttrTransform_RemoveLoneHttp();

// this is the magic! We're making 'href' a required attribute (note the
// asterisk) - now HTML Purifier removes <a></a>, as well as
// <a href="http:/"></a> after HTMLPurifier_AttrTransform_RemoveLoneHttp
// is through with it!
$htmlDef->addAttribute('a', 'href*', new HTMLPurifier_AttrDef_URI());

It works, it works, bahahahahahahahanhͥͤͫğͮ͑̆ͦó̓̉ͬ͋hͧ̆̈̉ğ̈͐̈a̾̈̑ͨô̔̄̑̇ḡh̘̝͊̐ͩͥ̋ͤ͛gȱȱhgȱȱoȱgȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱȱ
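The answer refers to HTMLPurifier_AttrTransform_RemoveLoneHttp without showing it anywhere in the thread. A minimal sketch of what such a class might look like follows; the body is a guess based on the description ("strips href="http:/" from all anchor tags"), and the include path is an assumption about your installation.

```php
<?php
// Assumption: HTML Purifier is installed and this path resolves on your system.
require_once 'HTMLPurifier.auto.php';

/**
 * Hypothetical reconstruction: drops an href that was reduced to the bare
 * protocol stub "http:/" and leaves every other attribute untouched.
 */
class HTMLPurifier_AttrTransform_RemoveLoneHttp extends HTMLPurifier_AttrTransform
{
    public function transform($attr, $config, $context)
    {
        if (isset($attr['href']) && $attr['href'] === 'http:/') {
            // confiscateAttr() unsets the key and returns its previous value
            $this->confiscateAttr($attr, 'href');
        }
        return $attr;
    }
}
```

Once the href is gone, the required-attribute declaration ('href*') from the snippet above is what causes the bare anchor tag itself to be dropped.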

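【Comments】:

How did you make such crazy letters?
@MichalStefanow: eeemo.net - see also knowyourmeme.com/memes/zalgo ... and the famous Stack Overflow answer stackoverflow.com/a/1732454/245790

【Solution 2】:

The fact that you can't remove elements with a TagTransform appears to be an implementation detail. The classic mechanism for removing nodes (a smidge higher-level than just tags) is to use an Injector.

Anyway, the particular piece of functionality you're looking for is already implemented as %AutoFormat.RemoveEmpty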
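For reference, enabling that directive could look roughly like the sketch below. The include path and the sample input are assumptions, and the cache setting is only there so the snippet does not depend on a writable cache directory.

```php
<?php
// Assumption: HTML Purifier is installed and this path resolves on your system.
require_once 'HTMLPurifier.auto.php';

$config = HTMLPurifier_Config::createDefault();
// Keep the sketch independent of a writable definition cache directory.
$config->set('Cache.DefinitionImpl', null);
// Ask the purifier to drop elements that end up with no content.
$config->set('AutoFormat.RemoveEmpty', true);

$purifier = new HTMLPurifier($config);
// Sample input (made up): the anchor has no content, so it is a removal candidate.
$clean = $purifier->purify('<p>keep me</p><a href="javascript:void(0)"></a>');
echo $clean;
```

Note that this only helps when the anchor is truly empty; an anchor that still contains text is left in place.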

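【Comments】:

Ah, so close! I changed my HTMLPurifier_AttrTransform_Target class so it doesn't add target="_blank" in the cases I want removed (and, for now, for testing, the same class removes href="http:/" if it encounters it; I'll put that into its own class later), but AutoFormat.RemoveEmpty still doesn't trigger, since there's a text node in the anchor. If there's no text inside, it's golden and it works, so, ah, so close! Many thanks, though, that's definitely something I hadn't thought of. [I'll take a look at the injectors in a moment!]
Injectors loaded through AutoFormat.Custom seem to be called pre-purification, or at least pre-URI-purification - I'm not getting empty tags yet. Is there a way I can delay the calling of the injector until after URI purification?
What some of the other filters do is force attribute validation beforehand, and then arm the resulting token with $token->armor['ValidateAttributes'] = true
How do I force attribute validation beforehand? I mean, as far as I can tell, the instances of <a></a> and <a href="http:/"></a> are created by the core. Where would I tell it to do that before the injector?
Basically, you get an instance of HTMLPurifier_AttrValidator and then run $attr_validator->validateToken($token, $config, $context);.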
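The advice in these comments could be combined into an injector along the following lines. This is only a sketch: the class name HTMLPurifier_Injector_CullAnchors is made up, the include path is an assumption, and it relies on the default configuration discarding the closing </a> that is left without a partner once its start tag is deleted (with %Core.EscapeInvalidTags enabled that tag would be escaped instead).

```php
<?php
// Assumption: HTML Purifier is installed and this path resolves on your system.
require_once 'HTMLPurifier.auto.php';

/**
 * Hypothetical injector: deletes the start tag of anchors that have no
 * usable href once their attributes have been validated.
 */
class HTMLPurifier_Injector_CullAnchors extends HTMLPurifier_Injector
{
    public $name = 'CullAnchors';
    public $needed = array('a');

    private $cfg;
    private $ctx;
    private $attrValidator;

    public function prepare($config, $context)
    {
        $this->cfg = $config;
        $this->ctx = $context;
        $this->attrValidator = new HTMLPurifier_AttrValidator();
        return parent::prepare($config, $context);
    }

    public function handleElement(&$token)
    {
        if (!$token instanceof HTMLPurifier_Token_Start || $token->name !== 'a') {
            return;
        }
        // Force validation now, then arm the token so it is not validated again.
        $this->attrValidator->validateToken($token, $this->cfg, $this->ctx);
        $token->armor['ValidateAttributes'] = true;

        if (!isset($token->attr['href']) || $token->attr['href'] === 'http:/') {
            // false asks the strategy to delete the current token; the text
            // inside the anchor is left untouched.
            $token = false;
        }
    }
}

$config = HTMLPurifier_Config::createDefault();
$config->set('Cache.DefinitionImpl', null); // avoid needing a writable cache dir
$config->set('AutoFormat.Custom', array(new HTMLPurifier_Injector_CullAnchors()));

$purifier = new HTMLPurifier($config);
$clean = $purifier->purify(
    '<a href="javascript:alert(1)">XSS</a> <a href="http://example.com/">ok</a>'
);
echo $clean;
```

Validating inside the injector is what makes the href check meaningful, since custom injectors otherwise see the attributes before URI validation has run.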

【Solution 3】:

For perusal, here's my current solution. It works, but bypasses HTML Purifier entirely.

/**
 * Removes <a></a> and <a href="http:/"></a> tags from the purified
 * HTML.
 * @todo solve this with an injector?
 * @param string $purified The purified HTML
 * @return string The purified HTML, sans pointless anchors.
 */
private function anchorCull($purified)
{
    if (empty($purified)) return '';
    // re-parse HTML
    $domTree = new DOMDocument();
    $domTree->loadHTML($purified);
    // find all anchors (even good ones)
    $anchors = $domTree->getElementsByTagName('a');
    // collect bad anchors (destroying them in this loop breaks the DOM)
    $destroyNodes = array();
    for ($i = 0; ($i < $anchors->length); $i++) {
        $anchor = $anchors->item($i);
        $href   = $anchor->attributes->getNamedItem('href');
        // <a></a>
        if (is_null($href)) {
            $destroyNodes[] = $anchor;
        // <a href="http:/"></a>
        } else if ($href->nodeValue == 'http:/') {
            $destroyNodes[] = $anchor;
        }
    }
    // destroy the collected nodes
    foreach ($destroyNodes as $node) {
        // preserve content: childNodes is a live list, so keep moving the
        // first child until none are left (an index loop would skip nodes)
        while ($node->firstChild) {
            $node->parentNode->insertBefore($node->firstChild, $node);
        }
        // actually destroy the node
        $node->parentNode->removeChild($node);
    }
    // strip out HTML out of DOM structure string
    $html = $domTree->saveHTML();
    $begin = strpos($html, '<body>') + strlen('<body>');
    $end   = strpos($html, '</body>');
    return substr($html, $begin, $end - $begin);
}

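I'm still hoping for a good HTML Purifier solution to this problem, so, as a heads-up, this answer won't end up self-accepted. But in case no better answer comes along, at least it might help those with similar issues. :)

【Comments】: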