Parse HTML and Get All h3's After an h2 Before the Next h2 Using PHP

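【Question title】: Parse HTML and Get All h3's After an h2 Before the Next h2 Using PHP
【Posted】: 2013-08-11 22:40:44
【Question】:

I am looking for the first h2 in an article. Once it is found, I want to find every h3 until the next h2 appears. Rinse and repeat until all headings and subheadings have been found.

Before you flag or close this as a duplicate parsing question, please note the question title: this is not about basic node retrieval. I already have that part down.

I am parsing the HTML with DOMDocument, using DOMDocument::loadHTML(), DOMDocument::getElementsByTagName() and DOMDocument::saveHTML() to retrieve the important headings of an article.

My code is as follows: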

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);
foreach($dom->getElementsByTagName('h2') as $node) {
    $matches['heading-two'][] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches['heading-three'][] = $dom->saveHtml($node);
}
if($matches){
    $this->key_points = $matches;
}

This gives me output like this:

array(
    'heading-two' => array(
        '<h2>Here is the first heading two</h2>',
        '<h2>Here is the SECOND heading two</h2>'
    ),
    'heading-three' => array(
        '<h3>Here is the first h3</h3>',
        '<h3>Here is the second h3</h3>',
        '<h3>Here is the third h3</h3>',
        '<h3>Here is the fourth h3</h3>',
    )
);

I would like to get something like this instead:

array(
    '<h2>Here is the first heading two</h2>' => array(
        '<h3>Here is an h3 under the first h2</h3>',
        '<h3>Here is another h3 found under first h2, but after the first h3</h3>'
    ),
    '<h2>Here is the SECOND heading two</h2>' => array(
        '<h3>Here is an h3 under the SECOND h2</h3>',
        '<h3>Here is another h3 found under SECOND h2, but after the first h3</h3>'
    )
);

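I am not necessarily looking for finished code (if you feel that would help others more, go ahead), but rather for guidance or a pointer in the right direction for building a nested array like the one above.

【Comments】:

+1 for knowing your stuff and for bolding your disclaimer before the flagging begins.

Tags: php parsing dom html-parsing domdocument

【Solution 1】: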

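This can also be done by taking the line number at which each node is found in the document and using it as the array key. Calling ksort($matches) afterwards puts every node back into the order in which it appears in the HTML document.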

$matches = array();
$dom = new DOMDocument;
$dom->loadHTML($content);

foreach($dom->getElementsByTagName('h2') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}
foreach($dom->getElementsByTagName('h3') as $node) {
    $matches[$node->getLineNo()] = $dom->saveHtml($node);
}

ksort($matches);

...or, more compactly:

foreach(array('h2', 'h3') as $tag) {
    foreach($dom->getElementsByTagName($tag) as $node) {
        $matches[$node->getLineNo()] = $dom->saveHtml($node);
    }
}

ksort($matches);
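
The sorted result is still a flat list, while the question asks for a nested array. A minimal sketch of one way to fold it, assuming $matches holds heading HTML strings in document order as produced above. The helper name groupHeadings and the sample data are hypothetical. Note also that because the line number is the array key, two headings on the same source line would collide and the later one would overwrite the earlier.

```php
<?php
// Hypothetical helper: folds a flat, document-ordered list of heading HTML
// strings into array('<h2>..</h2>' => array('<h3>..</h3>', ...)).
function groupHeadings(array $ordered)
{
    $nested = array();
    $current = null; // HTML of the most recent h2, if one has been seen

    foreach ($ordered as $html) {
        if (stripos($html, '<h2') === 0) {
            // A new h2 starts a new group.
            $current = $html;
            $nested[$current] = array();
        } elseif ($current !== null) {
            // Anything else (here: an h3) belongs to the latest h2.
            $nested[$current][] = $html;
        }
        // An h3 that appears before any h2 has no parent and is skipped.
    }

    return $nested;
}

// Stand-in for the sorted $matches; the keys mimic line numbers.
$matches = array(
    3  => '<h2>Here is the first heading two</h2>',
    5  => '<h3>Here is the first h3</h3>',
    8  => '<h3>Here is the second h3</h3>',
    12 => '<h2>Here is the SECOND heading two</h2>',
    14 => '<h3>Here is the third h3</h3>',
);

$nested = groupHeadings($matches);
print_r($nested);
```

Using the h2 markup as the key mirrors the shape requested in the question; identical h2 markup appearing twice would be merged into one group.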

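【Comments】:

【Solution 2】:

I am assuming that all headings sit at the same level in the DOM, so that every h3 is a sibling of an h2. Under that assumption you can iterate over the siblings of each h2 until you reach the next h2: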

foreach($dom->getElementsByTagName('h2') as $node) {
    $key = $dom->saveHtml($node);
    $matches[$key] = array();
    // Walk the following siblings until the next h2 (or the end of the parent).
    while(($node = $node->nextSibling) && $node->nodeName !== 'h2') {
        if($node->nodeName == 'h3') {
            $matches[$key][] = $dom->saveHtml($node);
        }
    }
}

【Comments】:

This is a bit foreign to me: how do I access the actual text (content) of the h2's and h3's from the $matches array? Thanks!
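
One possible way to get at the text, as a rough sketch: in Solution 2 the keys of $matches are the h2 markup and the values are arrays of h3 markup, so a nested foreach reaches both. If only the plain text is wanted, the textContent property of each node can be stored instead of the saveHTML() output. The sample markup below is an assumption for illustration and keeps every heading as a sibling, as Solution 2 requires.

```php
<?php
// Hypothetical sample markup; all headings are siblings, as Solution 2 assumes.
$content = '<h2>First heading</h2><p>Intro</p><h3>Sub A</h3><h3>Sub B</h3>'
         . '<h2>Second heading</h2><h3>Sub C</h3>';

$dom = new DOMDocument;
$dom->loadHTML($content);

$matches = array();
foreach ($dom->getElementsByTagName('h2') as $h2) {
    // textContent is the heading's text with the tags removed.
    $key = trim($h2->textContent);
    $matches[$key] = array();

    // Visit the following siblings until the next h2 or the end of the parent.
    for ($sibling = $h2->nextSibling;
         $sibling !== null && $sibling->nodeName !== 'h2';
         $sibling = $sibling->nextSibling) {
        if ($sibling->nodeName === 'h3') {
            $matches[$key][] = trim($sibling->textContent);
        }
    }
}

// Reading the result: the key is the h2 text, the value is a list of h3 texts.
foreach ($matches as $heading => $subheadings) {
    echo $heading, "\n";
    foreach ($subheadings as $subheading) {
        echo '  - ', $subheading, "\n";
    }
}
```

If the markup has already been stored, as in the original answer, applying strip_tags() to each key and value is another option. Keying by text means two h2 elements with identical text would share one entry.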