Converting html to url scraper: answers

【Question title】: Converting html to url scraper
【Posted】: 2016-02-21 22:43:58
【Question】:

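A very helpful person on Stack Overflow helped me get this far, but I need to convert his code so that it scrapes a URL instead of an HTML string. I have tried over and over and keep running into errors. Any ideas?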

function getElementByIdAsString($html, $id, $pretty = true) {
    $doc = new DOMDocument();
    @$doc->loadHTML($html);

    if(!$doc) {
        throw new Exception("Failed to load $url");
    }
    $element = $doc->getElementById($id);
    if(!$element) {
        throw new Exception("An element with id $id was not found");
    }

    // get all object tags
    $objects = $element->getElementsByTagName('object'); // return node list

    // take the value of the data attribute from the first object tag
    $data = $objects->item(0)->getAttributeNode('data')->value;

    // cut away the unnecessary parts and return the info
    return substr($data, strpos($data, '=')+1);

}

// call it:
$finalcontent = getElementByIdAsString($html, 'mainclass');

print_r ($finalcontent);

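【Comments】:

You mention errors... what are they?
It's just blank. Is there a better way for me to see the errors? I'm new to all of this.
I just want to supply a URL to scrape, instead of the $html sample that the person on Stack Overflow used.
First, remove the @, since it suppresses errors (avoid using it, really). Then add error_reporting(E_ALL); to report all errors.
The only error I get is in the Chrome console: "Failed to load resource: the server responded with a status of 500 (Internal Server Error)". It isn't loading my WordPress footer, so I assume the error happens during the scrape.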
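
To make the advice about dropping the @ concrete, here is a minimal sketch of one way to surface parser problems instead of silencing them. It assumes a development setup where displaying errors is acceptable, and loadHtmlWithWarnings is a hypothetical helper name, not something from the original code.

```php
<?php
// Assumption: a development setup; do not display errors on a production site.
error_reporting(E_ALL);
ini_set('display_errors', '1');

// Hypothetical helper: parse HTML and return libxml's messages
// instead of hiding them with the @ operator.
function loadHtmlWithWarnings(string $html): array
{
    // Collect parse errors internally rather than emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    // Copy the collected messages into a plain array of strings.
    $warnings = [];
    foreach (libxml_get_errors() as $error) {
        $warnings[] = trim($error->message);
    }

    // Reset libxml's error buffer and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return [$doc, $warnings];
}
```

Calling [$doc, $warnings] = loadHtmlWithWarnings($html); and then print_r($warnings); would list whatever the parser complained about, which is usually more useful than a blank page.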

Tags: php html dom scraper

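【Solution 1】:

Remember to wrap the call in try/catch when you use the function, because it may throw Exceptions, and an uncaught exception results in a 500 server error.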

$finalcontent = getElementByIdAsString($html, 'mainclass');

should become

try {
    $finalcontent = getElementByIdAsString($html, 'mainclass');
} catch (Exception $e) {
    echo $e->getMessage();
}

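【Comments】:

Thanks a lot, that got rid of the error! Now the main problem: I need to scrape from a URL. How do I convert this code to read a URL instead of the $html it currently uses?
Depending on your host, you should be able to call $html = file_get_contents($url);, which takes the URL you provide and tries to fetch that document's HTML. If that doesn't work, you may have to look into cURL; you can get the page's HTML that way too!
I assume it's a white screen now because this doesn't work with WordPress on a custom Linode?
Oddly, I removed the whole script, simply put in $html = file_get_contents('url.com'); and echoed it, and that worked fine, but with the whole function it causes an error.
I got output!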
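
For anyone landing here later, the suggestion above (file_get_contents first, cURL as a fallback) could be sketched roughly as follows. This is only an illustration under some assumptions: fetchHtml and extractObjectData are hypothetical names, the 15-second timeout is arbitrary, and unlike the question's code this version returns the whole data attribute when it contains no '='.

```php
<?php
// Hypothetical helper: download a page's HTML, trying file_get_contents
// first and falling back to cURL when that is unavailable or fails.
function fetchHtml(string $url): string
{
    // file_get_contents can only open URLs when allow_url_fopen is enabled.
    if (ini_get('allow_url_fopen')) {
        $html = file_get_contents($url);
        if ($html !== false && $html !== '') {
            return $html;
        }
    }

    // Fall back to the cURL extension if it is installed.
    if (function_exists('curl_init')) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body as a string
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
        curl_setopt($ch, CURLOPT_TIMEOUT, 15);          // arbitrary timeout in seconds
        $html = curl_exec($ch);
        if (is_string($html) && $html !== '') {
            return $html;
        }
    }

    throw new RuntimeException("Failed to load $url");
}

// Hypothetical rewrite of the question's function: find the element with the
// given id, then return the part of its first <object> data attribute after '='.
function extractObjectData(string $html, string $id): string
{
    $previous = libxml_use_internal_errors(true); // avoid the @ operator
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        throw new RuntimeException('The HTML could not be parsed');
    }

    $element = $doc->getElementById($id);
    if ($element === null) {
        throw new RuntimeException("An element with id $id was not found");
    }

    $object = $element->getElementsByTagName('object')->item(0);
    if ($object === null) {
        throw new RuntimeException("No object tag found inside #$id");
    }

    $data = $object->getAttribute('data');
    $pos = strpos($data, '=');

    // Assumption: when there is no '=', return the attribute unchanged.
    return $pos === false ? $data : substr($data, $pos + 1);
}
```

Usage would then look like $finalcontent = extractObjectData(fetchHtml($url), 'mainclass'); wrapped in the try/catch from the answer, so a failed download or a missing element prints a message instead of causing a 500.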