What is the best way to extract data from an HTML file using the PHP DOM functions? (Answers)

【Question title】: What is the best way to extract data from an HTML file using the PHP DOM functions?
【Posted】: 2011-03-22 23:30:50
【Question description】:

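I need to extract a lot of data from various HTML files, and I will have to write a separate script for each type of HTML file in order to parse out the data I need correctly.

The data will be in different parts of the document. For example, in document type 1 the data I need might sit neatly in a DIV with an ID, but in document type 2 the only way to locate the data might be to find a specific pattern of tags that contains it (such as <div><b>DATA</b></div>).

From the little I have been able to find so far, it looks like DOMXPath could help me with at least some of this extraction. What other functions could I use, specifically for the second example: locating an arbitrary pattern of tags and getting their contents?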

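【Comments】:

Simple HTML DOM parser may help you --> simplehtmldom.sourceforge.net
Is that any different from PHP's built-in DOM objects?
Remember that stealing/scraping content is not a nice thing to do :)
It is when your boss tells you to! :]

Tags: php dom screen-scraping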

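【Solution 1】:

If you are going to extract different kinds of data from various HTML files, you will quickly tire of using the DOMDocument API and XPath. Use one of the wrapper libraries listed in "How do you parse and process HTML/XML in PHP?". They provide a richer API and additional selectors.

I prefer phpQuery and QueryPath, which allow: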

print qp($url)->find("body p.article a")->attr("href");

print qp($html)->find("div b")->text();

The available functions are documented here: http://api.querypath.org/docs/class_query_path.html - it mostly resembles jQuery.
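
For comparison, here is a minimal sketch of what the plain DOMDocument/DOMXPath route might look like for the two cases in the question. The sample markup, the `price` ID, and the variable names are assumptions made up for illustration; they are not taken from the asker's files.

```php
<?php
// Hypothetical sample markup standing in for the two document types in the question.
$html = <<<'HTML'
<html><body>
  <div id="price">42.00</div>
  <div><b>DATA</b></div>
  <div><i>ignored</i></div>
</body></html>
HTML;

// Real-world HTML is rarely well-formed, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);

// Case 1: the data sits in a DIV with a known ID.
$byId  = $xpath->query('//div[@id="price"]');
$price = $byId->length > 0 ? trim($byId->item(0)->textContent) : null;

// Case 2: no ID, so match the tag pattern itself: a <b> that is a direct child of a <div>.
$bold = [];
foreach ($xpath->query('//div/b') as $node) {
    $bold[] = trim($node->textContent);
}

echo $price, "\n";
echo implode(', ', $bold), "\n";
```

Each document type would need its own set of XPath expressions like these, which is the repetition the wrapper libraries above try to reduce.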

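【Comments】:

【Solution 2】:

If you plan to parse many HTML files and need to select or modify many elements in them, consider using a library.

I would recommend PHPPowertools/DOM-Query, a library I wrote myself. It lets you (1) load an HTML file and then (2) select or change parts of the HTML much as you would with jQuery in a front-end application.

Usage example: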

// Select the body tag ($H is assumed to be the already-loaded document object; its setup is not shown here)
$body = $H->select('body');

// Combine different classes as one selector to get all site blocks
$siteblocks = $body->select('.site-header, .masthead, .site-body, .site-footer');

// Chain your methods just like you would with jQuery
$siteblocks->select('button')->add('span')->addClass('icon icon-printer');

// Use a lambda function to set the text of all site blocks
$siteblocks->text(function($i, $val) {
    return $i . " - " . $val->attr('class');
});

// Append the following HTML to all site blocks
$siteblocks->append('<div class="site-center"></div>');

// Use a child selector to select the site's footer
$sitefooter = $body->select('.site-footer > .site-center');

// Set some attributes for the site's footer
$sitefooter->attr(array('id' => 'aweeesome', 'data-val' => 'see'));

// Use a lambda function to set the attributes of all site blocks
$siteblocks->attr('data-val', function($i, $val) {
    return $i . " - " . $val->attr('class') . " - photo by Kelly Clark";
});

// Select the parent of the site's footer
$sitefooterparent = $sitefooter->parent();

// Remove the class of all i-tags within the site's footer's parent
$sitefooterparent->select('i')->removeAttr('class');

// Wrap the site's footer in two new elements
$sitefooter->wrap('<section><div class="footer-wrapper"></div></section>');

[...]

【Comments】: