Extract the content of an HTML page in PHP

【Question title】: Extract a content of a html page in php
【Posted】: 2012-02-11 07:09:11
【Question】:

Is there any way in PHP to extract the content of an HTML page, starting at <body> and ending at </body>? It would help if someone could post some sample code.

【Comments】:

See one of the many existing questions about website scraping.
Possible duplicate of How to parse and process HTML with PHP?

【Solution 1】:

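require 'simple_html_dom.php'; // third-party PHP Simple HTML DOM Parser, not part of PHP itself
$html = file_get_html('http://www.example.com/');
$body = $html->find('body', 0); // index 0 selects the first <body> element instead of an array
echo $body->innertext; // markup between <body> and </body>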

【Discussion】:

【Solution 2】:

You should take a look at the DOMDocument reference.

This example reads an HTML document, creates a DOMDocument, and gets the body tag:

libxml_use_internal_errors(true);
$dom = new DOMDocument;
$dom->loadHTMLFile('http://example.com');
libxml_use_internal_errors(false);

$body = $dom->getElementsByTagName('body')->item(0);

echo $body->textContent; // print all the text content in the body
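
The example above prints only the text of the body. If the markup between <body> and </body> is what is wanted, one possible sketch is to serialize each child node of the body with saveHTML(). The helper name bodyInnerHtml and the sample string are hypothetical and introduced here for illustration only; the input is loaded from a string rather than a URL so that the snippet runs offline.

```php
<?php
// Hypothetical helper (not from the original answer): returns the markup
// inside <body>...</body>, or null when no body element can be found.
function bodyInnerHtml(string $html): ?string
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }

    $previous = libxml_use_internal_errors(true); // collect parse warnings quietly
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);        // restore the earlier setting

    if (!$loaded) {
        return null;
    }

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return null;
    }

    // saveHTML($node) serializes a single node; concatenating the children
    // gives the body content without the <body> tags themselves.
    $inner = '';
    foreach ($body->childNodes as $child) {
        $inner .= $dom->saveHTML($child);
    }
    return $inner;
}

// Assumed sample input, used only to demonstrate the helper.
$sample = '<html><head><title>t</title></head><body class="main"><p>Hello</p></body></html>';
echo bodyInnerHtml($sample), "\n";
```

Because the document goes through the parser, the returned markup is the parser's normalized form and may differ slightly from the original source text.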

You should also look at the following resources:

DOM API Documentation
XPATH language specification
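
As a rough illustration of how XPath fits in with DOMDocument, the following sketch uses DOMXPath to collect the text of every paragraph inside the body. The sample markup and the expression //body//p are assumptions chosen for the demonstration; they do not come from the original answer.

```php
<?php
// Assumed sample input, used only for demonstration.
$sample = '<html><body><p>One</p><div><p>Two</p></div></body></html>';

$previous = libxml_use_internal_errors(true); // collect parse warnings quietly
$dom = new DOMDocument();
$dom->loadHTML($sample);
libxml_clear_errors();
libxml_use_internal_errors($previous);        // restore the earlier setting

// DOMXPath evaluates XPath expressions against an already loaded document.
$xpath = new DOMXPath($dom);

$paragraphs = [];
foreach ($xpath->query('//body//p') as $p) {
    $paragraphs[] = $p->textContent; // text of each <p> anywhere under <body>
}

print_r($paragraphs);
```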

【Discussion】:

【Solution 3】:

You can also try a non-DOM approach based on the strpos function:

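$url = 'http://www.example.com/'; // placeholder URL (assumption; the original left $url undefined)
$html = file_get_contents($url);
// Assumes a bare <body> tag with no attributes; +6 skips the length of '<body>'.
$html = substr($html, stripos($html, '<body>') + 6);
$html = substr($html, 0, strripos($html, '</body>'));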

stripos is the case-insensitive version of strpos, and strripos is the case-insensitive version that finds the last ("rightmost") occurrence.
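
The string-based approach above searches for the literal text '<body>', so it will not find an opening tag that carries attributes, such as <body class="main">, and it does not check whether either tag was found at all. A slightly more defensive sketch, still using only string functions, might look like this. The function name bodyBetweenTags is hypothetical, and the code assumes that no '>' character appears inside an attribute value of the body tag and that the first '<body' in the page is the real opening tag.

```php
<?php
// Hypothetical helper: string-based extraction that tolerates attributes
// on the opening <body> tag. Returns null when the tags cannot be located.
function bodyBetweenTags(string $html): ?string
{
    $open = stripos($html, '<body');        // start of the opening tag
    if ($open === false) {
        return null;
    }

    $start = strpos($html, '>', $open);     // end of the opening tag
    if ($start === false) {
        return null;
    }
    $start++;                               // first character after '>'

    $end = strripos($html, '</body>');      // last closing tag in the page
    if ($end === false || $end < $start) {
        return null;
    }

    return substr($html, $start, $end - $start);
}

// Assumed sample input, used only for demonstration.
echo bodyBetweenTags('<html><BODY class="x"><p>Hi</p></BODY></html>'), "\n";
```

Unlike the DOM-based approach, this returns the source text unchanged, but it can be misled by pages where '<body' also appears inside a comment or a script.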

Hope this helps!

【Discussion】: