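【Posted】: 2018-04-12 11:16:17
【Problem description】:
- I can't scrape data from several websites using cURL.
- I am using cURL to scrape websites from their URLs. It works for 80% of the URLs I use, but some URLs don't seem to be "scrapable". For example, when I try to scrape https://www.nextdoorhub.com/ and https://www.atknsn.com/, it doesn't work. The page stays blank, and in the end no result is returned.
Here is my code: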
<center>
<br/>
<form method="post" name="scrap_form" id="scrap_form" action="scrape_data.php">
<b>Enter Website URL To Scrape Data:</b>
<input type="text" name="website_url" id="website_url">
<input type="submit" name="submit" value="Submit" >
</form>
</center>
<?php
error_reporting(E_ALL ^ E_NOTICE);

// URL submitted through the form above
$website_url = $_POST['website_url'];
$result = scrapeWebsiteData($website_url);

// Fetch a URL with cURL and return the response body
function scrapeWebsiteData($website_url) {
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $website_url);
    curl_setopt($curl, CURLOPT_HEADER, 0);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, TRUE);
    curl_setopt($curl, CURLOPT_BINARYTRANSFER, 1);
    $result = curl_exec($curl);
    curl_close($curl);
    return $result;
}

// Extract the contents of the case_textlist div
$regextit = '/<div id="case_textlist">(.*?)<\/div>/s';
preg_match_all($regextit, $result, $list);
/* echo "<pre>";
print_r($list[1]); die; */

// Collect every image URL (jpg, jpeg, bmp, gif, png) found in the page
$regex = '/[\'" >\t^]([^\'" \n\r\t]+\.(jpe?g|bmp|gif|png))[\'" <\n\r\t]/i';
preg_match_all($regex, $result, $url_matches);
$count = count($url_matches[1]);

// set the local path of image
$local_path = 'C:\udeytech\htdocs\tests\images\\';

for ($i = 0; $i < $count; $i++) {
    // split the URL into its "segment/" parts and keep the last one
    preg_match_all('!.*?/!', $url_matches[1][$i], $matches);
    $last_part = end($matches[0]);

    // match the image name that follows the last part (same extensions as $regex)
    preg_match('!' . preg_quote($last_part, '!') . '(.*?\.(jpe?g|bmp|gif|png))!i', $url_matches[1][$i], $matche);
    $second_part = $matche[0];
    $info = pathinfo($second_part);
    $image_name = $info['basename'];

    // save image url in a variable
    $image_url = $url_matches[1][$i];

    // download the image and write it to the local path
    $image_path = scrapeWebsiteData($image_url);
    $file_open = fopen($local_path . $image_name, 'w');
    fwrite($file_open, $image_path);
    fclose($file_open);
}
?>
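The `scrapeWebsiteData()` function above returns whatever `curl_exec()` hands back, so a failed transfer (where `curl_exec()` returns `false`) looks the same as an empty page. Below is a minimal sketch of a diagnostic variant. The function name `fetchWithDiagnostics`, the array keys, the 15-second timeout, and the user-agent string are assumptions chosen for illustration and are not part of the original code.

```php
<?php
// Hypothetical helper: fetches a URL like scrapeWebsiteData(), but also
// reports the cURL error text, the cURL error number, and the HTTP status.
function fetchWithDiagnostics(string $url): array
{
    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $url);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // follow HTTP redirects
    curl_setopt($curl, CURLOPT_TIMEOUT, 15);          // assumed limit, in seconds
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (compatible; example-scraper)'); // assumed value

    $body = curl_exec($curl);

    // The handle is released when $curl goes out of scope.
    return [
        'body'   => $body === false ? null : $body,
        'error'  => curl_error($curl),   // empty string when no error occurred
        'errno'  => curl_errno($curl),   // 0 when no error occurred
        'status' => curl_getinfo($curl, CURLINFO_HTTP_CODE), // 0 if no HTTP response arrived
    ];
}
```

Calling `fetchWithDiagnostics($website_url)` and printing the `error` and `status` entries should show whether a blank result comes from a transport failure (for example a TLS or timeout error), from an HTTP error status, or from a successful response whose HTML simply contains no server-rendered content.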
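【Comments】:
- For starters, check the output of curl_error. php.net/manual/en/function.curl-error.php
- Both websites use JavaScript to render their content, which cannot be scraped with cURL. You need a headless browser to do this. This answer will solve your problem: stackoverflow.com/questions/49049382/…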
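The headless-browser suggestion can be sketched from PHP by shelling out to Chrome. This assumes a Chrome or Chromium build is installed that accepts the `--headless` and `--dump-dom` command-line switches; the binary name `google-chrome` and both function names are assumptions, and this is not necessarily the approach taken in the linked answer.

```php
<?php
// Build the shell command that asks headless Chrome to print the page's DOM
// after its scripts have run. $binary is an assumed executable name.
function buildDumpDomCommand(string $url, string $binary = 'google-chrome'): string
{
    return escapeshellcmd($binary)
        . ' --headless --dump-dom '
        . escapeshellarg($url); // quote the URL so it is passed as one argument
}

// Run the command and return the rendered HTML, or null if nothing came back
// (for example when the binary is missing or shell_exec() is disabled).
function fetchRenderedHtml(string $url, string $binary = 'google-chrome'): ?string
{
    $output = shell_exec(buildDumpDomCommand($url, $binary));
    return is_string($output) && $output !== '' ? $output : null;
}
```

The string returned by `fetchRenderedHtml('https://www.nextdoorhub.com/')` could then be passed to the same `preg_match_all()` calls used in the question. Content that a page loads well after its initial load may still be missing from the dump, in which case a browser-automation library with explicit waits would be a better fit.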
Tags: php web-scraping