It doesn't work. The website keeps showing blanks: answers

【Question title】: It doesn't work. The website keeps showing blanks
【Posted】: 2018-04-12 11:16:17
【Question】:

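I can't scrape data from a few websites using cURL.
I am using cURL to scrape websites from their URLs. It works for about 80% of the URLs I use, but some URLs don't seem to be "scrapable". For example, when I try to scrape https://www.nextdoorhub.com/ and https://www.atknsn.com/, it doesn't work. The page stays blank, and in the end no result is returned.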

Here is my code:

<center>
<br/>
    <form method="post" name="scrap_form" id="scrap_form" action="scrape_data.php">
         <b>Enter Website URL To Scrape Data:</b>
        <input type="text" name="website_url" id="website_url">
        <input type="submit" name="submit" value="Submit" >
    </form>
</center>
<?php
error_reporting(E_ALL ^ E_NOTICE );
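  // Read the submitted URL; it is empty until the form has been posted.
  $website_url = isset($_POST['website_url']) ? trim($_POST['website_url']) : '';
  // Only fetch when a URL was actually submitted.
  $result = $website_url !== '' ? scrapeWebsiteData($website_url) : '';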

 function scrapeWebsiteData($website_url){

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $website_url);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_BINARYTRANSFER,1);
    $result = curl_exec($curl);
    curl_close($curl);
    return $result;
 }
  // The pattern needs an opening "/" delimiter to match the closing "/s".
  $regextit = '/<div id="case_textlist">(.*?)<\/div>/s';
   preg_match_all($regextit, $result, $list);
  /* echo "<pre>";
  print_r($list[1]); die; */
  $regex = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i'; 
  preg_match_all($regex, $result, $url_matches);
  $count = count($url_matches[1]);
  // set the local path of image 
  $local_path = 'C:\udeytech\htdocs\tests\images\\'; 
   for($i=0; $i<$count; $i++)
    {
     preg_match_all('!.*?/!', $url_matches[1][$i], $matches);
     $last_part = end($matches[0]); 
     ////match image name last part of anything .jpg|jpeg|gif|png
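     // Quote the path segment, escape the dot, and match extensions case-insensitively.
     preg_match('!' . preg_quote((string) $last_part, '!') . '(.*?\.(jpe?g|bmp|gif|png))!i', $url_matches[1][$i], $matche);
     $second_part = $matche[0];
     $info = pathinfo($second_part);
     $image_name = $info['basename'];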
    //save image url in a variable
    $image_url = $url_matches[1][$i];
    $image_path = scrapeWebsiteData($image_url);

    $file_open = fopen($local_path.$image_name, 'w');
    fwrite($file_open, $image_path);
    fclose($file_open);      
   }

?>

【Comments】:

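Check the output of curl_error for starters. php.net/manual/en/function.curl-error.php
Both websites use JavaScript to render their content, which cannot be scraped with cURL. You need a headless browser to do this; this answer will solve your problem: stackoverflow.com/questions/49049382/…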
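
The first comment's suggestion can be sketched roughly as follows. This is a minimal sketch, not the asker's code: `fetchWithDiagnostics` is a hypothetical helper name, and the redirect and timeout settings are assumptions added for illustration. It only shows how `curl_errno` and `curl_error` expose a transport failure that would otherwise look like a blank page.

```php
<?php
// Hypothetical helper (not from the original post): fetch a URL and
// return the cURL error details alongside the body.
function fetchWithDiagnostics(string $url): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // assumption: follow redirects
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 10);   // assumption: 10 s to connect
    curl_setopt($curl, CURLOPT_TIMEOUT, 30);          // assumption: 30 s in total

    $body = curl_exec($curl); // false when the transfer failed

    // The handle is released when $curl goes out of scope.
    return [
        'body'   => $body === false ? null : $body,
        'errno'  => curl_errno($curl),  // 0 means no transport error
        'error'  => curl_error($curl),  // empty string when errno is 0
        'status' => curl_getinfo($curl, CURLINFO_HTTP_CODE),
    ];
}

// Demo: php fetch.php https://www.nextdoorhub.com/
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $response = fetchWithDiagnostics($argv[1]);
    if ($response['errno'] !== 0) {
        echo 'cURL error ' . $response['errno'] . ': ' . $response['error'] . "\n";
    } else {
        echo 'HTTP ' . $response['status'] . ', ' . strlen($response['body']) . " bytes\n";
    }
}
```

If this reports no error and a normal HTTP status, yet the body contains almost none of the expected content, the cause is more likely client-side rendering, as the second comment suggests.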

Tags: php web-scraping

【Solution 1】:

Have you tried loading these websites in a browser and looking at the response?

nextdoorhub is using Angular, and atknsn looks heavy on jQuery. Long story short, these sites need to run JavaScript to render the full HTML you want to scrape.
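
One rough way to see this from PHP is to measure how much visible text the raw response contains once scripts and styles are removed. The sketch below is a heuristic under stated assumptions, not a reliable detector: the function names and the 200-character threshold are hypothetical, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical heuristic (an assumption, not from the thread): count the
// visible text left in raw HTML after scripts and styles are removed.
function visibleTextLength(string $html): int
{
    // Drop <script> and <style> blocks, including their contents.
    $withoutCode = (string) preg_replace('#<(script|style)\b[^>]*>.*?</\1>#is', '', $html);
    // Strip the remaining tags, decode entities, and collapse whitespace.
    $text = html_entity_decode(strip_tags($withoutCode));
    $text = trim((string) preg_replace('/\s+/', ' ', $text));
    return strlen($text);
}

// The 200-character threshold is an arbitrary assumption; tune it per site.
function looksClientRendered(string $html, int $minTextLength = 200): bool
{
    return visibleTextLength($html) < $minTextLength;
}

// Made-up example of an application shell that is filled in by JavaScript.
$shell = '<html><head><title>App</title></head><body><app-root></app-root>'
       . '<script src="main.js"></script></body></html>';

var_dump(looksClientRendered($shell)); // bool(true): only "App" is visible
```

A page that trips this check when fetched with cURL but shows full content in a browser is a strong hint that the content is rendered client-side.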

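PHP + cURL alone won't do the trick. Take a look at threads that discuss scraping Angular; they will point you in the right direction. (Hint: you'll need to scrape these sites with Node.js.)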
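
For readers who want to stay in PHP, one possible alternative to Node.js is to shell out to a headless browser and read back the rendered DOM. The sketch below assumes a Chrome or Chromium binary is installed and reachable as `google-chrome` (an assumption; the name differs by platform), and both function names are hypothetical. It uses Chrome's `--headless` and `--dump-dom` command-line flags, which print the DOM after the page has loaded.

```php
<?php
// Build the command for a headless Chrome run that prints the rendered DOM.
// The default binary name is an assumption; pass the real path on your system.
function buildDumpDomCommand(string $url, string $chromeBinary = 'google-chrome'): string
{
    return escapeshellcmd($chromeBinary)
        . ' --headless --disable-gpu --dump-dom '
        . escapeshellarg($url);
}

// Run the command and return the rendered HTML, or null if nothing came back
// (for example when the binary is missing or shell_exec is disabled).
function fetchRenderedHtml(string $url, string $chromeBinary = 'google-chrome'): ?string
{
    $output = shell_exec(buildDumpDomCommand($url, $chromeBinary));
    return is_string($output) && $output !== '' ? $output : null;
}

// Demo: php render.php https://www.nextdoorhub.com/
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $html = fetchRenderedHtml($argv[1]);
    echo $html === null
        ? "No output; check that the Chrome binary is installed.\n"
        : strlen($html) . " bytes of rendered HTML\n";
}
```

Content that is loaded asynchronously after the initial page load may still be missing from this output; a scriptable headless browser gives finer control over when the DOM is captured.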

【Discussion】:

Please give me an example using Node.js and AngularJS that helps me scrape data from the nextdoorhub website.