Scraping all images from a website using DOMDocument

【Question title】: Scraping all images from a website using DOMDocument
【Posted】: 2013-03-31 12:27:30
【Question】:

I basically want to get ALL images from any website using DOMDocument. But for some reason I don't understand yet, I can't even get my HTML to load.

$url="http://<any_url_here>/";
$dom = new DOMDocument();
@$dom->loadHTML($url); // I have also tried removing the @
$dom->preserveWhiteSpace = false;
$dom->saveHTML();
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) 
{
echo $image->getAttribute('src');
}

Nothing is printed. Am I doing something wrong in my code?

【Comments】:

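You are probably not seeing an error message because of this line in your PHP: @$dom->loadHTML($url); The '@' suppresses all error messages from that call.
I removed it a long time ago, but I still get no result...
You get no results because $dom->loadHTML() expects HTML. You are passing it a URL; you first need to fetch the HTML of the page you want to parse. You can use file_get_contents() for that. (See the answer.)
I added $html = file_get_contents("sitehere/"); and then loaded the HTML with $dom->loadHTML($html); Now it gives me an error: DOMDocument::loadHTML(): Attribute class redefined in Entity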

Tags: php kohana-3.2

【Solution 1】:

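You get no results because $dom->loadHTML() expects HTML. You are passing it a URL; you first need to fetch the HTML of the page you want to parse. You can use file_get_contents() for that.

I use this in an image-scraping class, and it works for me.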

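// Fetch the page's HTML first; loadHTML() expects markup, not a URL.
$html = file_get_contents('http://www.google.com/');
$dom = new DOMDocument();
$dom->preserveWhiteSpace = false; // set before loading; changing it afterwards does not affect parsing
$dom->loadHTML($html);
$images = $dom->getElementsByTagName('img');
foreach ($images as $image) {
    echo $image->getAttribute('src'), PHP_EOL; // one src per line
}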
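
The same idea can be wrapped in a small function so it can be tried without a network connection. This is only a minimal sketch: the function name extractImageSources and the sample markup are made up for illustration and are not part of the answer above.

```php
<?php
// Hypothetical helper: returns the src attribute of every <img> in an HTML string.
function extractImageSources(string $html): array
{
    if (trim($html) === '') {
        return []; // nothing to parse; loadHTML() does not accept empty input
    }

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    $sources = [];
    foreach ($dom->getElementsByTagName('img') as $image) {
        $src = $image->getAttribute('src');
        if ($src !== '') {
            $sources[] = $src; // skip <img> tags without a src
        }
    }
    return $sources;
}

// Illustrative markup standing in for a downloaded page.
$sample = '<html><body><img src="a.png"><p>text</p><img src="/b.jpg" alt="b"></body></html>';
print_r(extractImageSources($sample));
```

For a real page, pass in the string returned by file_get_contents($url), after checking that the call did not return false.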

【Comments】:

I now get an "Attribute class redefined in Entity" error. $dom = new DOMDocument; $htmls = file_get_contents("http://philcooke.com/inspiration-happens-but-the-best-ideas-take-time/"); $dom->loadHTML($htmls);
Your answer is almost correct. Just add an "@" character before $dom->loadHTML($html).
As an alternative to prepending "@" to $dom->loadHTML($html) to suppress the errors, you can clean up the HTML with tidy first. $tidy = tidy_parse_string($html); $html = $tidy->html()->value; But maybe that is overkill.
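
A third option, not mentioned in the comments above, is PHP's libxml_use_internal_errors(), which buffers parser warnings instead of printing them. The sketch below assumes non-empty input; the helper name loadHtmlQuietly and the sample markup (with a deliberately duplicated class attribute) are invented for illustration.

```php
<?php
// Hypothetical helper: parses HTML while collecting libxml warnings instead of printing them.
function loadHtmlQuietly(string $html, array &$warnings = []): DOMDocument
{
    $previous = libxml_use_internal_errors(true); // buffer parse errors internally

    $dom = new DOMDocument();
    $dom->loadHTML($html);

    foreach (libxml_get_errors() as $error) {
        $warnings[] = trim($error->message); // keep the messages for optional logging
    }
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting

    return $dom;
}

// Illustrative markup with a duplicated class attribute, the kind of input that triggers the warning.
$sample = '<html><body><div class="a" class="b"><img src="photo.png"></div></body></html>';
$warnings = [];
$dom = loadHtmlQuietly($sample, $warnings);

foreach ($dom->getElementsByTagName('img') as $image) {
    echo $image->getAttribute('src'), PHP_EOL;
}
```

Unlike "@", this approach hides only the libxml parser messages and still lets you inspect them in $warnings if something looks wrong.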