Writing multiple regex patterns to parse HTML [duplicate]

【Question title】: Writing multiple regex pattern to parse HTML [duplicate]
【Posted】: 2017-04-10 13:56:01
【Question description】:

I'm fetching an HTML web page with file_get_contents(), and I get a table like the following, with more than 150 rows:

<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">
            asdxxx
        </a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>

I want to get FIRST DATA, SECOND DATA, THIRD DATA and FOURTH DATA with a preg_match_all() call. I tried writing multiple patterns, but couldn't get them to work. Here is what I tried:

preg_match_all('/(<td class="tabcol  tdmin_2l">|title=")(.*?)(<\/td>|")/s', $raw, $matches, PREG_SET_ORDER);

What is the correct pattern?

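【Question comments】:

Don't parse HTML with regular expressions.
Use a DOM parser instead. Parsing HTML markup with regex is very unreliable: it breaks as soon as the markup changes slightly.

Tags: php regex html-parsing preg-match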

【Solution 1】:

Try this:

$str = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

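// Capture the inner HTML of every <td>; the "s" flag lets "." span line breaks,
// so cells that wrap over several lines (as in the question) still match.
preg_match_all('/<td[^>]*>(.*?)<\/td>/is', $str, $td_matches);
// The second cell holds the <a> element, so pull title and href out of it.
preg_match('/ title="([^"]*)"/i', $td_matches[1][1], $title);
preg_match('/ href="([^"]*)"/i', $td_matches[1][1], $href);

echo $td_matches[1][0] . "\n"; // FIRST+DATA
echo $title[1] . "\n";         // SECOND+DATA
echo $href[1] . "\n";          // THIRD+DATA
echo $td_matches[1][3];        // FOURTH+DATA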
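
Since the real page has more than 150 rows, the same idea can be applied row by row. The following is only a sketch under the assumption that every row has the four-cell layout shown in the question; the function name `extractRowsWithRegex` and the shortened sample markup (the `rel` attribute is left out) are made up for illustration.

```php
<?php
// Hypothetical helper: returns one [first, second, third, fourth] array per <tr>.
function extractRowsWithRegex(string $raw): array
{
    $rows = [];
    // 1. Split the document into rows first, so cells from different rows never mix.
    preg_match_all('/<tr\b[^>]*>(.*?)<\/tr>/is', $raw, $trMatches);
    foreach ($trMatches[1] as $rowHtml) {
        // 2. Grab the inner HTML of every cell in this row.
        preg_match_all('/<td\b[^>]*>(.*?)<\/td>/is', $rowHtml, $tdMatches);
        $cells = $tdMatches[1];
        if (count($cells) < 4) {
            continue; // skip rows that do not have the expected layout
        }
        // 3. title and href sit on the <a> inside the second cell.
        preg_match('/\btitle="([^"]*)"/i', $cells[1], $title);
        preg_match('/\bhref="([^"]*)"/i', $cells[1], $href);
        $rows[] = [trim($cells[0]), $title[1] ?? null, $href[1] ?? null, trim($cells[3])];
    }
    return $rows;
}

// Two sample rows modelled on the markup in the question.
$raw = <<<HTML
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA">asdxxx</a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">A</td>
    <td class="tabcol"><a class="modal-button" title="B" href="C">x</a></td>
    <td class="tabcol"></td>
    <td class="tabcol">D</td>
</tr>
HTML;

$rows = extractRowsWithRegex($raw);
print_r($rows);
```

This still relies on the markup staying exactly as shown: a nested table or an attribute value quoted with single quotes would break it.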

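【Comments】:

Thanks, this isn't bad, but it looks like the regex pattern needs some changes, because the second and third data are combined in this pattern.
I didn't understand at first. Is this better?
Thanks, I made a few changes and now it works great!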

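【Solution 2】:

This doesn't answer your question directly, but it is the right way to do it.

You should avoid parsing HTML/XML content with regular expressions. Wondering why?

Entire HTML parsing is not possible with regular expressions, since it depends on matching the opening and the closing tag, which is not possible with regexps.

Regular expressions can only match regular languages, but HTML is a context-free language. The only thing you can do with regexps on HTML is heuristics, but that will not work on every condition. It should be possible to present an HTML file that will be matched wrongly by any regular expression.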

——https://stackoverflow.com/a/590789/65732
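
A small illustration of the point in the quote, using a made-up nested `<div>` sample rather than the table from the question: neither a lazy nor a greedy quantifier can pair opening and closing tags correctly in general.

```php
<?php
// Nested elements: the lazy pattern stops at the FIRST </div>,
// so the outer element is cut short.
$nested = '<div>outer <div>inner</div> tail</div>';
preg_match('/<div>(.*?)<\/div>/s', $nested, $lazy);

// Sibling elements: the greedy pattern runs to the LAST </div>,
// so two separate elements are swallowed as one.
$siblings = '<div>a</div><div>b</div>';
preg_match('/<div>(.*)<\/div>/s', $siblings, $greedy);

echo $lazy[1], "\n";   // outer <div>inner
echo $greedy[1], "\n"; // a</div><div>b
```

A pattern can be tuned to fit one of these inputs, but not both, which is why such expressions remain heuristics.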

Use a DOM parser instead. Here is a quick look at one:

composer require symfony/dom-crawler symfony/css-selector

<?php

require 'vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<HTML
<tr class="tabrow ">
<td class="tabcol  tdmin_2l">FIRST+DATA</td>
<td class="tabcol"><a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA" rel="{handler: 'iframe', size: {x: 800, y: 640}, overlayOpacity: 0.9, classWindow: 'phocamaps-plugin-window', classOverlay: 'phocamaps-plugin-overlay'}">asdxxx</a></td>
<td class="tabcol"></td>
<td class="tabcol">FOURTH+DATA</td>
</tr>
HTML;

$crawler = new Crawler($html);

$first  = $crawler->filter('.tabcol.tdmin_2l')->text();
$second = $crawler->filter('.tabcol:nth-child(2) a')->attr('title');
$third  = $crawler->filter('.tabcol:nth-child(2) a')->attr('href');
$fourth = $crawler->filter('.tabcol:nth-child(4)')->text();

var_dump($first, $second, $third, $fourth);
// Outputs:
// string(10) "FIRST+DATA"
// string(11) "SECOND+DATA"
// string(10) "THIRD+DATA"
// string(11) "FOURTH+DATA"

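Simpler and cleaner, right?

With a parser like this, you can also extract elements using XPath.

【Comments】:

I like this solution, but are you sure it finds the second and third values in the "title" and "href" attributes?
@fafl: Now I'm sure it does.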
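
As a rough sketch of the XPath route without any Composer dependency, the example below uses PHP's bundled DOM extension (`DOMDocument` and `DOMXPath`) instead of the Symfony crawler. The function name `extractRows`, the array keys, and the `<table>` wrapper around the sample row are assumptions added for illustration; the `rel` attribute is left out for brevity.

```php
<?php
// Hypothetical helper: returns one associative array per <tr class="tabrow">.
function extractRows(string $html): array
{
    $previous = libxml_use_internal_errors(true); // keep markup warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $rows = [];
    // Match rows whose class list contains the token "tabrow".
    $query = '//tr[contains(concat(" ", normalize-space(@class), " "), " tabrow ")]';
    foreach ($xpath->query($query) as $tr) {
        $cells = $xpath->query('./td', $tr);
        if ($cells->length < 4) {
            continue; // not the four-cell layout from the question
        }
        // The link carrying title and href lives in the second cell.
        $link = $xpath->query('.//a', $cells->item(1))->item(0);
        $rows[] = [
            'first'  => trim($cells->item(0)->textContent),
            'second' => $link ? $link->getAttribute('title') : null,
            'third'  => $link ? $link->getAttribute('href') : null,
            'fourth' => trim($cells->item(3)->textContent),
        ];
    }
    return $rows;
}

// Sample row from the question, wrapped in <table> so the fragment is valid HTML.
$html = <<<HTML
<table>
<tr class="tabrow ">
    <td class="tabcol  tdmin_2l">FIRST+DATA</td>
    <td class="tabcol">
        <a class="modal-button" title="SECOND+DATA"  href="THIRD+DATA">asdxxx</a>
    </td>
    <td class="tabcol"></td>
    <td class="tabcol">FOURTH+DATA</td>
</tr>
</table>
HTML;

$rows = extractRows($html);
print_r($rows);
```

Because the loop visits every matching `<tr>`, the same function should handle the full table with more than 150 rows, as long as each row keeps this cell layout.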