Even CURL function can't scrape some URLs

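【Question title】: Even CURL function can't scrape some urls
【Posted】: 2012-07-11 20:14:57
【Question】:

I'm using CURL to scrape HTML from URLs. It works for about 80% of the URLs I use, but some URLs don't seem to be "scrapable". For example, when I try to scrape http://www.thefancy.com it doesn't work: the request keeps loading and in the end returns no result. The problem can be tested at http://www.itemmized.com/test/test/. Here is my code: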

if (!empty($_POST['submit'])) {

    // Runs the request on $ch, following up to $maxredirect redirects (default 5).
    function curl_exec_follow($ch, &$maxredirect = null) {

        $mr = $maxredirect === null ? 5 : intval($maxredirect);

        // Intended: let cURL follow redirects itself when open_basedir and safe_mode are off
        if (ini_get('open_basedir') == '' && ini_get('safe_mode' == 'Off')) {

            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, $mr > 0);
            curl_setopt($ch, CURLOPT_MAXREDIRS, $mr);
            curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false);

        } else {

            // Otherwise follow redirects by hand, using header-only requests on a copy of the handle
            curl_setopt($ch, CURLOPT_FOLLOWLOCATION, false);

            if ($mr > 0) {
                $original_url = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
                $newurl = $original_url;

                $rch = curl_copy_handle($ch);

                curl_setopt($rch, CURLOPT_HEADER, true);
                curl_setopt($rch, CURLOPT_NOBODY, true);
                curl_setopt($rch, CURLOPT_FORBID_REUSE, false);
                do {
                    curl_setopt($rch, CURLOPT_URL, $newurl);
                    $header = curl_exec($rch);
                    if (curl_errno($rch)) {
                        $code = 0;
                    } else {
                        $code = curl_getinfo($rch, CURLINFO_HTTP_CODE);
                        if ($code == 301 || $code == 302) {
                            preg_match('/Location:(.*?)\n/', $header, $matches);
                            $newurl = trim(array_pop($matches));

                            // if no scheme is present then the new url is a
                            // relative path and thus needs some extra care
                            if (!preg_match("/^https?:/i", $newurl)) {
                                $newurl = $original_url . $newurl;
                            }
                        } else {
                            $code = 0;
                        }
                    }
                } while ($code && --$mr);

                curl_close($rch);

                // The redirect budget ran out before a final page was reached
                if (!$mr) {
                    if ($maxredirect === null) {
                        trigger_error('Too many redirects.', E_USER_WARNING);
                    } else {
                        $maxredirect = 0;
                    }

                    return false;
                }
                curl_setopt($ch, CURLOPT_URL, $newurl);
            }
        }
        return curl_exec($ch);
    }

    $ch = curl_init($_POST['form_url']);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    $data = curl_exec_follow($ch);
    curl_close($ch);

    echo $data;
}

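【Comments on the question】:

Tags: php curl web-scraping

【Solution 1】:

You are probably unable to scrape http://www.thefancy.com because new content is loaded every time you reach the bottom of the page, so you are in effect trying to fetch a large amount of data through cUrl, and that may be the problem. You are simply hitting a timeout. Try setting a larger timeout in php.ini and run it again. It may take a while to load, but I think it will work that way.

【Comments】: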

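That is not how infinite scrolling works. Infinite scrolling is implemented in JavaScript. The page itself is still a normal-sized page; as you scroll down, JS keeps adding more content fetched through new XHR requests. cUrl is not a browser; it does not "scroll", run JS, or make XHR requests.
Regarding JS execution, it depends on the complexity; some JS scripts run through cUrl without any problem. I was able to perform ajax requests with jQuery calls through cUrl, so it is not impossible. Anyway, you may have a point. I'm not saying I'm absolutely right, but I think it is worth a try to see the result.
Either you don't understand what cUrl is, or you don't understand how infinite scrolling works. Infinite scrolling is implemented in JavaScript (client-side, in case that wasn't clear before). cUrl is a tool/library for making HTTP requests. Simply fetching a page with cUrl returns only the data the web server serves; infinite scrolling never comes into play. And running curl http://www.thefancy.com>thefancy.htm returns a file of only 46KB and takes less than 275 ms to download. That is not enough to make most PHP/Apache setups time out.
I don't know how infinite scrolling works, but I do know how cUrl works. I think you are right, but that does not solve this person's problem with cUrl and the page. Let's consider a different cause.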
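
One way to look for a different cause is to put explicit time limits on the request and print cURL's own error message, so a hanging request fails with a reason instead of loading forever. The following is only a sketch: the helper name `fetchWithTimeout` and the 5 and 15 second limits are assumptions for illustration, not something taken from the thread.

```php
<?php
// Hypothetical helper (not from the thread): fetch a URL with explicit time
// limits and return cURL's diagnostics alongside the body.
function fetchWithTimeout($url, $connectTimeout = 5, $totalTimeout = 15)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_MAXREDIRS, 5);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, $connectTimeout); // seconds allowed for connecting
    curl_setopt($ch, CURLOPT_TIMEOUT, $totalTimeout);          // seconds allowed for the whole transfer

    $body = curl_exec($ch); // false on failure because RETURNTRANSFER is set

    $result = [
        'body'  => $body === false ? null : $body,
        'error' => curl_error($ch),                        // empty string when nothing went wrong
        'code'  => curl_getinfo($ch, CURLINFO_HTTP_CODE),  // last HTTP status, 0 if none was received
        'time'  => curl_getinfo($ch, CURLINFO_TOTAL_TIME), // seconds spent on the transfer
    ];
    curl_close($ch);

    return $result;
}
```

Calling `fetchWithTimeout('http://www.thefancy.com')` and printing `$result['error']`, `$result['code']` and `$result['time']` should show whether the request times out, is refused, or gets stuck in redirects.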

【Solution 2】:

Try this... hope it helps.

<?php

// Minimal cURL wrapper for GET and POST requests with browser-like headers.
class Curl
{
    public $cookieJar = "";
    public $curl; // cURL handle of the most recent request

    public function __construct($cookieJarFile = 'cookies.txt') {
        $this->cookieJar = $cookieJarFile;
    }

    // Applies the shared options to the current handle.
    function setup()
    {
        $header = array();
        $header[0] = "Accept: text/xml,application/xml,application/xhtml+xml,";
        $header[0] .= "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        $header[] = "Cache-Control: max-age=0";
        $header[] = "Connection: keep-alive";
        $header[] = "Keep-Alive: 300";
        $header[] = "Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7";
        $header[] = "Accept-Language: en-us,en;q=0.5";
        $header[] = "Pragma: "; // browsers keep this blank.

        curl_setopt($this->curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7');
        curl_setopt($this->curl, CURLOPT_HTTPHEADER, $header);
        // Read and write cookies in the same file
        curl_setopt($this->curl, CURLOPT_COOKIEJAR, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_COOKIEFILE, $this->cookieJar);
        curl_setopt($this->curl, CURLOPT_AUTOREFERER, true);
        curl_setopt($this->curl, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($this->curl, CURLOPT_RETURNTRANSFER, true);
    }

    // Fetches $url with GET and returns the response body.
    function get($url)
    {
        $this->curl = curl_init($url);
        $this->setup();

        return $this->request();
    }

    // Returns the first capture group of every match of $reg in $str.
    function getAll($reg, $str)
    {
        preg_match_all($reg, $str, $matches);
        return $matches[1];
    }

    // Sends $fields to $url with POST and returns the response body.
    function postForm($url, $fields, $referer = '')
    {
        $this->curl = curl_init($url);
        $this->setup();
        curl_setopt($this->curl, CURLOPT_URL, $url);
        curl_setopt($this->curl, CURLOPT_POST, 1);
        curl_setopt($this->curl, CURLOPT_REFERER, $referer);
        curl_setopt($this->curl, CURLOPT_POSTFIELDS, $fields);
        return $this->request();
    }

    // 'lasturl' is shorthand for the effective URL after redirects.
    function getInfo($info)
    {
        $info = ($info == 'lasturl') ? curl_getinfo($this->curl, CURLINFO_EFFECTIVE_URL) : curl_getinfo($this->curl, $info);
        return $info;
    }

    function request()
    {
        return curl_exec($this->curl);
    }
}

$curl = new Curl();
$html = $curl->get("http://www.thefancy.com");
echo $html;

?>

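【Comments】:

Scraping usually involves 3 steps: first you send a GET or POST request to the given URL, then you receive the HTML returned as the response, and finally you parse the text you want to scrape out of that HTML. To cover steps 1 and 2, the above is a simple PHP class that uses Curl to fetch web pages with GET or POST. Once you have the HTML, you just use regular expressions to complete step 3 by parsing out the text you want to scrape.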
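
Step 3 can be sketched with a small regular-expression helper in the spirit of the class's `getAll()` method. The function name `extractAll`, the sample HTML string and the `<h2>` pattern are made up for illustration; a real page would need its own pattern, and the pattern must contain a capture group.

```php
<?php
// Step 3 only: pull text out of HTML that has already been fetched.
// Hypothetical helper; it mirrors what getAll() in the class above does.
function extractAll($pattern, $html)
{
    // preg_match_all stores the first capture group of each match in $matches[1].
    preg_match_all($pattern, $html, $matches);
    return $matches[1];
}

// Hard-coded stand-in for the HTML returned by steps 1 and 2.
$html = '<h2>First item</h2><p>text</p><h2>Second item</h2>';

// Non-greedy match of everything between <h2> and </h2>.
$titles = extractAll('/<h2>(.*?)<\/h2>/s', $html);
print_r($titles);
```

Regular expressions are fine for simple, regular markup like this; for nested or irregular HTML a DOM parser is usually less fragile.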
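Dear Naren, this class gives me the same result. The script keeps loading and in the end I get no result. Step 2 goes wrong with the thefancy.com URL: it does not seem to receive the HTML. If I use another URL such as froot.nl, for example, it scrapes it perfectly. Could it be a server problem? It can be tested at itemmized.com/test/test/test.php
Thanks for the code snippet. The "FOLLOWLOCATION, true" option solved my scraping problem.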
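
A possible reason why setting FOLLOWLOCATION directly made the difference, offered as a guess rather than something confirmed in the thread: in the question's code the closing parenthesis in `ini_get('safe_mode' == 'Off')` is misplaced, so the comparison runs first and `ini_get()` receives `false`. The condition is therefore never true, and the manual redirect loop with header-only requests always runs instead of cURL's own redirect handling. The sketch below shows the evaluation and one way the check could be written. `canFollowNatively` is a hypothetical name, and treating `''`, `'0'` and `'off'` as "disabled" is an assumption about how the setting is reported.

```php
<?php
// The inner comparison is evaluated first, so ini_get() never sees 'safe_mode'.
var_dump('safe_mode' == 'Off'); // bool(false)

// Hypothetical replacement check. The ini values are passed in as arguments
// so the logic can be exercised without changing php.ini.
function canFollowNatively($openBasedir, $safeMode)
{
    // ini_get() returns false for settings that do not exist (safe_mode was
    // removed in PHP 5.4); the cast turns that into an empty string.
    $safeModeOff = in_array(strtolower((string) $safeMode), ['', '0', 'off'], true);

    return (string) $openBasedir === '' && $safeModeOff;
}

var_dump(canFollowNatively(ini_get('open_basedir'), ini_get('safe_mode')));
```

When the function returns true, `CURLOPT_FOLLOWLOCATION` can be set to true directly, as in Solution 2.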