Regex to match a full hyperlink only with a certain class: answers

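【Question title】: Regex match full hyperlink only with certain class
【Posted】: 2011-05-30 17:13:00
【Question】:

I have a string that contains several hyperlinks. I want a regular expression that matches only certain links among them. I don't know whether href or class comes first; the order may vary. For example, here is a string: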

<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>

From the above string I want to select only the link that has the class nextpostslink. So the match in this example should return this:

<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a>

This regex is the closest I could get:

/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/

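But the match starts at an earlier link instead of at the one I want. I think my problem is in the (.*) part, but I don't know how to change it so that only the desired link is selected.

Thanks for your help.

【Comments】:

Don't use regular expressions to parse HTML. What language are you programming in?

Tags: php regex dom hyperlink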

【Solution 1】:

It is much better to use a real HTML parser for this. Abandon all attempts to use regular expressions on HTML.

Use PHP's DOMDocument instead:

$dom = new DOMDocument;
$dom->loadHTML($yourHTML);

foreach ($dom->getElementsByTagName('a') as $link) {
    $classes = explode(' ', $link->getAttribute('class'));

    if (in_array('nextpostslink', $classes)) {
        // $link has the class "nextpostslink"
    }
}
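
To make the loop above concrete, here is a minimal, self-contained sketch that loads only the pager fragment rather than a whole page and collects the matching link. The shortened fragment, the $found array, and the class-first attribute order are illustrative assumptions, not part of the original answer.

```php
<?php
// Hypothetical, shortened fragment standing in for the pager markup.
$fragment = <<<'HTML'
<div class='wp-pagenavi'>
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>
<a class="nextpostslink" href="http://stv.localhost/channel/political/page/2">next</a>
<a class="cccc">xxx</a>
</div>
HTML;

$dom = new DOMDocument;
// Collect libxml warnings internally instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($fragment);
libxml_clear_errors();

$found = array();
foreach ($dom->getElementsByTagName('a') as $link) {
    // Split on any whitespace so multi-class attributes are handled.
    $classes = preg_split('/\s+/', trim($link->getAttribute('class')));

    if (in_array('nextpostslink', $classes, true)) {
        $found[] = array(
            'href' => $link->getAttribute('href'),
            'text' => $link->textContent,
            // Serialized <a> element (node argument needs PHP 5.3.6+).
            'html' => $dom->saveHTML($link),
        );
    }
}

print_r($found);
```

Because the attributes are read through the DOM, it makes no difference whether href or class comes first in the markup.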

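【Discussion】:

This will take a big hit on performance. I would only recommend it if speed really isn't an issue, or if further processing will be done.
@Nicklas Claptrap. (1) DOMDocument is amazingly fast. (2) Don't optimize prematurely. (3) DOMDocument will work, whereas a regex might work (occasionally).
I think instantiating a new object with the HTML of the whole page is overkill. The page has a huge amount of HTML, and this is only a small part of it.
@Maor A regex won't perform well on a long string either. Seriously, don't use a regex for this. It will be unreliable, it will break when the HTML changes even slightly, and it will give you a migraine. Use the tool designed for the job.
I realized I can load just this one string as a DOMDocument instance, so I don't have to load the HTML of the whole page. In that case your approach may really be better than a regex. Thanks!

【Solution 2】:

I don't know whether this applies to you, but in any case: parsing HTML with regular expressions is a bad idea. Use an XPath implementation to reach the desired elements. The following XPath expression gives you all 'a' elements whose class contains "nextpostslink":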

//a[contains(@class,"nextpostslink")]

There is plenty of XPath information around. Since you didn't mention your programming language, here is a quick XPath tutorial using Java: http://www.ibm.com/developerworks/library/x-javaxpathapi/index.html

Edit:

PHP + XPath + HTML: http://dev.juokaz.com/php/web-scraping-with-php-and-xpath
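
As a rough illustration of the XPath approach in PHP, the sketch below runs a query through DOMXPath. Note that contains(@class, "nextpostslink") also matches longer class names that merely contain that substring, so this sketch swaps in a stricter, space-padded variant of the expression. The fragment, the decoy class nextpostslink-old, and the $hrefs array are assumptions made for the example.

```php
<?php
// Hypothetical fragment; the second link is a decoy with a similar class name.
$fragment = <<<'HTML'
<div class='wp-pagenavi'>
<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>
<a href="http://stv.localhost/channel/political/page/1" class="nextpostslink-old">old</a>
<a class="cccc">xxx</a>
</div>
HTML;

$dom = new DOMDocument;
libxml_use_internal_errors(true);   // keep parser warnings quiet
$dom->loadHTML($fragment);
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// Pad the class list with spaces so that only the whole token matches.
$query = "//a[contains(concat(' ', normalize-space(@class), ' '), ' nextpostslink ')]";

$hrefs = array();
foreach ($xpath->query($query) as $link) {
    $hrefs[] = $link->getAttribute('href');
}

print_r($hrefs);
```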

【Discussion】:

【Solution 3】:

This will work in PHP:

/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m

This of course assumes that the class attribute always comes after the href attribute.

Here is a code snippet:

$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;

$regexp = "/<a[^>]+href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>/m";

$matches = array();
if(preg_match($regexp, $html, $matches)) {
    echo "URL: " . $matches[2] . "\n";
    echo "Text: " . $matches[6] . "\n";
}

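However, I would suggest matching the link first and then extracting the URL from it, so that the order of the attributes doesn't matter: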

<?php

$html = <<<EOD
<div class='wp-pagenavi'>
<span class='pages'>Page 1 of 8</span><span class='current'>1</span>
<a href='http://stv.localhost/channel/political/page/2' class='page'>2</a>     
<a href='http://stv.localhost/channel/political/page/3' class='page'>3</a>ccccc<a href='http://stv.localhost/channel/political/page/4' class='page'>4</a><a href='http://stv.localhost/channel/political/page/5' class='page'>5</a><a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">»eee</a><span class='extend'>...</span><a href='http://stv.localhost/channel/political/page/8' class='last'>lastן »</a>
<a class="cccc">xxx</a>
</div>
EOD;

$regexp = "/(<a[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*('|\")[^>]*>(.{1,6})<\/a>)/m";

$matches = array();
if(preg_match($regexp, $html, $matches)) {
    $link = $matches[0];
    $text = $matches[4];

    // Search only the matched link, not the whole document.
    $regexp = "/href=(\"|')([^'\"]*)(\"|')/";
    $matches = array();
    if(preg_match($regexp, $link, $matches)) {
        $url = $matches[2];

        echo "URL: $url\n";
        echo "Text: $text\n";
    }
}

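You could of course extend the regex to match either of the two variants (class first or href first), but it would become very long, and I don't think it would improve performance.

As a proof of concept, I created a regex that doesn't care about the order: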

/<a[^>]+(href=(\"|')([^\"']*)('|\")[^>]+class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')|class=(\"|')[^'\"]*nextpostslink[^'\"]*(\"|')[^>]+href=(\"|')([^\"']*)('|\"))[^>]*>(.{1,6})<\/a>/m

The text will be in group 12, and the URL will be in group 3 or group 10, depending on the order.
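
A short sketch of how those groups might be read with preg_match follows. The helper name extractNextLink and the two sample links are hypothetical. It relies on preg_match's default behavior of reporting a group that did not take part in the match as an empty string when a later group did match.

```php
<?php
// The order-independent pattern from above, kept verbatim in a nowdoc.
$pattern = <<<'REGEX'
/<a[^>]+(href=("|')([^"']*)('|")[^>]+class=("|')[^'"]*nextpostslink[^'"]*("|')|class=("|')[^'"]*nextpostslink[^'"]*("|')[^>]+href=("|')([^"']*)('|"))[^>]*>(.{1,6})<\/a>/m
REGEX;

// Hypothetical helper: returns the URL and text, or null if nothing matches.
function extractNextLink($pattern, $html)
{
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // Only one of groups 3 and 10 takes part in a match;
    // the other one is reported as an empty string.
    $url = $m[3] !== '' ? $m[3] : $m[10];

    return array('url' => $url, 'text' => $m[12]);
}

// Illustrative inputs covering both attribute orders.
$hrefFirst  = '<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>';
$classFirst = '<a class="nextpostslink" href="http://stv.localhost/channel/political/page/2">next</a>';

print_r(extractNextLink($pattern, $hrefFirst));
print_r(extractNextLink($pattern, $classFirst));
```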

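【Discussion】:

I use PHP and tried the preg_match function. But before that, I ran some tests with this great regex helper, gskinner.com/RegExr, to work out the right expression.
Great! This expression seems to work. However, since this is auto-generated HTML, I can't assume that href always comes before class. Is there a way to make this expression work in both cases (href before or after class)?
Thanks! It looks like it does what is needed.

【Solution 4】:

Since the question asks for a regex, here it is: <a\s[^>]*class=["|']nextpostslink["|'][^>]*>(.*)<\/a>

The order of the attributes doesn't matter, and it handles both single and double quotes.

See the regex online: https://regex101.com/r/DX03KD/1/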
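
One caveat that may be worth checking before relying on this pattern: (.*) is greedy, so when several links share a line, the capture can run on to the last </a> on that line. The sketch below compares the pattern with a lazy (.*?) variant on a hypothetical two-link line. It also writes the quote class as ["'], since a | inside a character class is a literal character rather than an alternation. Both variable names and the sample line are assumptions for illustration.

```php
<?php
// Hypothetical line with the wanted link followed by another link.
$line = '<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>'
      . '<a href="http://stv.localhost/channel/political/page/8" class="last">last</a>';

// Greedy capture, as in the answer (quote class written as ["']).
$greedy = '~<a\s[^>]*class=["\']nextpostslink["\'][^>]*>(.*)</a>~';
// Lazy capture: stops at the first closing </a>.
$lazy   = '~<a\s[^>]*class=["\']nextpostslink["\'][^>]*>(.*?)</a>~';

preg_match($greedy, $line, $g);
preg_match($lazy, $line, $l);

echo "greedy text: ", $g[1], "\n";
echo "lazy text:   ", $l[1], "\n";
echo "lazy match:  ", $l[0], "\n";
```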

【Discussion】:

【Solution 5】:

I replaced (.*) with [^'"]+ as follows:

<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}</a>

Note: I tried this in RegEx Buddy, so I didn't need to escape anything (such as /).
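
To see what the change does, the following sketch runs the question's pattern and the revised one through preg_match on a hypothetical line holding two links. The sample line and variable names are assumptions. Both patterns are stored in nowdoc strings so that the quotes need no PHP escaping, and / is escaped only because it is the delimiter.

```php
<?php
// Hypothetical line: an ordinary page link followed by the wanted link.
$line = "<a href='http://stv.localhost/channel/political/page/5' class='page'>5</a>"
      . '<a href="http://stv.localhost/channel/political/page/2" class="nextpostslink">next</a>';

// Pattern from the question: (.*) can run across earlier links.
$original = <<<'REGEX'
/<a\s?(href=)?('|")(.*)('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
REGEX;

// Revised pattern: [^'"]+ cannot cross the closing quote of the href value.
$revised = <<<'REGEX'
/<a\s*(href=)?('|")[^'"]+('|") class=('|")nextpostslink('|")>.{1,6}<\/a>/
REGEX;

preg_match($original, $line, $a);
preg_match($revised, $line, $b);

echo "original: ", $a[0], "\n";
echo "revised:  ", $b[0], "\n";
```

On this input the original pattern matches from the first link onward, while the revised one matches only the nextpostslink link. Like the original, it still assumes that href comes before class and that exactly one space separates them.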

【Discussion】:

I tried this, but now it matches nothing. I did escape all the ' " / characters, of course. Do you have any other suggestions? Thanks.