How would one use PHP preg_match_all to differentiate anchor elements identified by attribute of inner HTML element?

【Question title】: How would one use PHP preg_match_all to differentiate anchor elements identified by attribute of inner HTML element?
【Posted】: 2014-02-27 20:08:23
【Question】:

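I have sets of HTML anchor elements that contain image elements. For each set, I want to use PHP-CLI to extract the URLs and classify them by type. An anchor's type can only be determined from an attribute of its child image element. This would be easy if each set had only one anchor of each type. My problem arises when two anchor elements of one type are separated by one or more anchors of other types: my non-greedy parenthesized subpattern seems to become greedy and expands until it finds the second relevant child attribute. In my test script I am trying to extract the "Userlink" URLs from among the other types. Using a simple pattern like: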

#<a href="(.*?)" custattr="value1"><img alt="Userlink"#

on a set like:

<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic0.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" class="common_link_class" height="123" src="pic1.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.socnet2.com/username1" custattr="value1"><img alt="Socnet2" class="common_link_class" height="123" src="pic2.png" width="123" style="width: 123px;"></a></li><li><a href="mailto:useralias1@unlikely.zyx321.usermail.net" custattr="value1"><img alt="Usermail" class="common_link_class" height="123" src="pic3.png" width="123" style="width: 123px;"></a></li><li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" class="common_link_class" height="123" src="pic4.png" width="123" style="width: 123px;"></a></li>

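(Sorry, the actual HTML really is a single line like this.)

My subpattern captures from the start of the first "Userlink" URL to the end of the last URL.

I have tried a number of different lookahead approaches, but I am not sure whether I should list them all here. So far they have either returned no matches at all or returned the same match as above.

Here is my test script (run from a Bash shell):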

#!/usr/bin/php
<?php
    $lines = 0;
    $input = "";
    $matches = array();

    // Compare against false so a final line containing just "0" is not dropped.
    while (($line = fgets(STDIN)) !== false) {
        $input .= $line;
        $lines++;
    }
    fwrite(STDERR, "Processing $lines\n");

    $pcre = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';

    if (preg_match_all($pcre,$input,$matches)){
        fwrite(STDERR, "\$matches has " . count($matches) . " elements\n");
        foreach ($matches[1] as $match){
            fwrite(STDOUT, $match . "\n");
        }
    }
?>

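What PCRE pattern for PHP's preg_match_all() would return both "Userlink" URLs in the example above?

【Comments】:

Don't parse HTML with regex. Use a parser.
Instead of the non-greedy .*?, use the greedy character class [^"]*.
As Ed Cottrell's *^?!# link says, if you only want to find the href contents, using the DOM may be a good option.
Would an HTML parser still be better even though I don't need to identify or use the HTML elements themselves and will discard them all?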
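
For comparison, the DOM route suggested in the comments could look roughly like the sketch below. It is only an illustration under assumptions: it relies on PHP's bundled DOM extension being enabled, the variable names are made up here, and the sample is a shortened copy of the question's HTML. The XPath expression selects the href attribute of every anchor that has a child img whose alt is "Userlink".

```php
<?php
// Shortened copy of the question's sample, still on a single line.
$input = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
       . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
       . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

// Collect libxml warnings internally instead of printing them for the HTML fragment.
$previous = libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($input);
libxml_clear_errors();
libxml_use_internal_errors($previous);

// href attributes of anchors whose child <img> has alt="Userlink".
$xpath = new DOMXPath($doc);
$urls = [];
foreach ($xpath->query('//a[img[@alt="Userlink"]]/@href') as $attr) {
    $urls[] = $attr->nodeValue;
}

echo implode("\n", $urls) . "\n";
```

The elements themselves are still discarded at the end; the parser is only used to decide which href belongs to which img, which is the part the regex struggled with.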

Tags: php preg-match-all pcre

【Solution 1】:

I took the liberty of changing your variable names:

$pattern = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';

if ($nb = preg_match_all($pattern, $input, $matches)) {
    fwrite(STDERR, "\$matches has " . $nb . " elements\n");
    // $matches[1] holds the captured href values (group 1).
    fwrite(STDOUT, implode("\n", $matches[1]) . "\n");
}

Note that the preg_match_all function returns the number of full pattern matches (which may be zero), or false on failure.
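
A small side-by-side run may make the difference clearer. The sketch below uses a hypothetical, shortened sample (the `.example` hosts and the variable names are invented for illustration). A lazy `.*?` only tries the shortest match first. When the rest of the pattern fails at `alt="Socnet1"`, the engine backtracks by letting the dot consume more text, quotes and tags included, until it reaches a later `alt="Userlink"`. The possessive class `[^"]++` cannot cross the closing quote of the href, so that attempt simply fails and the search moves on to the next anchor.

```php
<?php
// Hypothetical sample on one line: Userlink, Socnet1, Userlink, Userlink.
$types = ['a' => 'Userlink', 'b' => 'Socnet1', 'c' => 'Userlink', 'd' => 'Userlink'];
$input = '';
foreach ($types as $host => $alt) {
    $input .= '<li><a href="http://' . $host . '.example/" custattr="value1">'
            . '<img alt="' . $alt . '" class="common_link_class"></a></li>';
}

// The question's pattern: the lazy dot may run across quotes and tags.
$lazy = '#<a href="(.*?)" custattr="value1"><img alt="Userlink"#';
// This answer's pattern: the capture cannot leave the href value.
$strict = '~<a href="([^"]++)" custattr="value1"><img alt="Userlink"~';

$nLazy = preg_match_all($lazy, $input, $lazyMatches);
$nStrict = preg_match_all($strict, $input, $strictMatches);

echo "lazy: $nLazy matches\n";
foreach ($lazyMatches[1] as $capture) {
    echo "  $capture\n";
}

echo "strict: $nStrict matches\n";
foreach ($strictMatches[1] as $capture) {
    echo "  $capture\n";
}

// count() on the result array counts groups (full match + group 1), not matches.
echo 'count($strictMatches) = ' . count($strictMatches) . "\n";
```

With this sample, both patterns report three matches, but the second lazy capture starts at the Socnet1 URL and runs through to the end of the next Userlink URL, while the possessive version captures only the three Userlink URLs. `count($strictMatches)` stays at 2 however many matches there are, which is why the return value is the better thing to report.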

【Comments】:

【Solution 2】:

This regex should work -

<a href="([^"]*?)"[^>]*\><img alt="Userlink"

You can see how it works here.

Testing it -

$pcre = '/<a href="([^"]*?)"[^>]*\><img alt="Userlink"/';
if (preg_match_all($pcre,$input,$matches)){
    var_dump($matches);
    //$matches[1] will be the array containing the urls.
}
/*
    OUTPUT- 
    array
      0 => 
        array
          0 => string '<a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
          1 => string '<a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink"' (length=85)
      1 => 
        array
          0 => string 'http://www.userlink1.com/my/page.html' (length=37)
          1 => string 'http://www.userlink2.com/my/page.html' (length=37)
*/
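
Since the original goal was to classify the URLs by type, the same idea can be extended with a second capture group for the alt value. This is only a sketch under assumptions: the img is taken to be the first child of the anchor with alt as its first attribute, as in the sample, and the names `$sets` and `$urlsByType` are made up for illustration. `PREG_SET_ORDER` makes each entry of the result one match with its own groups.

```php
<?php
// Shortened copy of the question's sample, still on a single line.
$input = '<li><a href="http://www.userlink1.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic0.png"></a></li>'
       . '<li><a href="http://www.socnet1.com/username1" custattr="value1"><img alt="Socnet1" src="pic1.png"></a></li>'
       . '<li><a href="mailto:useralias1@unlikely.zyx321.usermail.net" custattr="value1"><img alt="Usermail" src="pic3.png"></a></li>'
       . '<li><a href="http://www.userlink2.com/my/page.html" custattr="value1"><img alt="Userlink" src="pic4.png"></a></li>';

// Group 1: href value, group 2: alt value of the img that directly follows.
$pattern = '~<a href="([^"]*)"[^>]*><img alt="([^"]*)"~';

$urlsByType = [];
if (preg_match_all($pattern, $input, $sets, PREG_SET_ORDER)) {
    foreach ($sets as $set) {
        // $set[0] is the full match, $set[1] the URL, $set[2] the type.
        $urlsByType[$set[2]][] = $set[1];
    }
}

print_r($urlsByType);
```

Each key of `$urlsByType` is then one link type, with its URLs in document order, so the "Userlink" entries are available as `$urlsByType['Userlink']`.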

【Comments】: