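【Posted】: 2015-11-06 09:52:24
【Question】:
I have a very simple scraper that does what I need, but it is very slow: it scrapes 2 pictures in 3 seconds, and I need it to handle at least 1000 pictures in a few seconds.
This is the code I am using right now: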
<?php
require_once('config.php');
// Load the PHasher class file.
include_once('classes/phasher.class.php');
$I = PHasher::Instance();
// Prevent execution timeout.
set_time_limit(0);
// Work around the SSL problem by disabling peer verification.
$arrContextOptions = array(
    "ssl" => array(
        "verify_peer" => false,
        "verify_peer_name" => false,
    ),
);
// Check whether the database already contains hashed pictures.
// If so, continue after the latest hashed picture; otherwise start from 4.
$check = mysqli_query($con, "SELECT fid FROM images ORDER BY fid DESC LIMIT 1;");
if (mysqli_num_rows($check) > 0) {
    $max_fid = mysqli_fetch_row($check);
    $fid = $max_fid[0] + 1;
} else {
    $fid = 4;
}
$deletedProfile = "https://z-1-static.xx.fbcdn.net/rsrc.php/v2/yo/r/UlIqmHJn-SK.gif";
// Infinite while loop to fetch profile pictures and save them inside the avatar folder.
$initial = $fid;
while ($fid = $initial) {
    $url = 'https://graph.facebook.com/'.$fid.'/picture?width=378&height=378';
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the redirects
    curl_setopt($ch, CURLOPT_HEADER, false); // no need to pass the headers to the data stream
    curl_setopt($ch, CURLOPT_NOBODY, true); // get the resource without a body
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, false); // accept any server certificate
    curl_exec($ch);
    // Get the last used URL.
    $lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
    curl_close($ch);
    if ($lastUrl == $deletedProfile) {
        $initial++;
    } else {
        $imageUrl = file_get_contents($url, false, stream_context_create($arrContextOptions));
        $savedImage = dirname(__FILE__).'/avatar/image.jpg';
        file_put_contents($savedImage, $imageUrl);
        // Exclude deleted profiles or corrupted pictures.
        if (getimagesize($savedImage) > 0) {
            // Call PHasher to hash the image to hexadecimal or binary values.
            $hash = $I->FastHashImage($savedImage);
            $hex = $I->HashAsString($hash);
            // Store the Facebook id and the image hash as a hex value.
            mysqli_query($con, "INSERT INTO images(fid, hash) VALUES ('$fid', '$hex')");
            $initial++;
        } else {
            $initial++;
        }
    }
}
?>
I don't know how to do this, but here is what I have in mind so far:
1- Split the loop into batches of 1000 profiles and store them in an array.
$items = array();
for($i=$fid; $i <= $fid+1000; $i++){
    $url = 'https://graph.facebook.com/'.$i.'/picture?width=378&height=378';
    $items[$i] = array($url);
}
But the result is not what I expected, and I would like to know how to fix the array output:
Array ( [28990] => Array ( [0] => https://graph.facebook.com/28990/picture?width=378&height=378 )
[28991] => Array ( [0] => https://graph.facebook.com/28991/picture?width=378&height=378 )
[28992] => Array ( [0] => https://graph.facebook.com/28992/picture?width=378&height=378 )
[28993] => Array ( [0] => https://graph.facebook.com/28993/picture?width=378&height=378 )
[28994] => Array ( [0] => https://graph.facebook.com/28994/picture?width=378&height=378 )
[28995] => Array ( [0] => https://graph.facebook.com/28995/picture?width=378&height=378 )
[28996] => Array ( [0] => https://graph.facebook.com/28996/picture?width=378&height=378 )
[28997] => Array ( [0] => https://graph.facebook.com/28997/picture?width=378&height=378 )
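2- Then I want to feed the output array to multi cURL, which allows multiple cURL handles to be processed asynchronously.
3- Check whether the output URL equals the deleted-profile URL; if it does not, pass the picture to PHasher to convert it to a hash value and store that in the database.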
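A rough sketch of how steps 1 to 3 could fit together, offered as one possible approach rather than a tested solution. The helper names `buildPictureUrls` and `fetchAll` are made up for illustration and the 30-second timeout is an arbitrary assumption. Two things about step 1: `array($url)` wraps every URL in a one-element array, which is where the extra `[0]` level in the output comes from, and `$i <= $fid+1000` produces 1001 entries instead of 1000. The sketch also issues a single GET per profile that follows redirects, so the effective URL and the image body come back from the same request.

```php
<?php
// Hypothetical helper: build a flat [fid => url] map for one batch of profiles.
function buildPictureUrls(int $startFid, int $count): array
{
    $urls = [];
    // "<" rather than "<=" so that exactly $count URLs are produced.
    for ($i = $startFid; $i < $startFid + $count; $i++) {
        // Store the string itself; array($url) is what added the nested [0] level.
        $urls[$i] = 'https://graph.facebook.com/' . $i . '/picture?width=378&height=378';
    }
    return $urls;
}

// Hypothetical helper: download every URL concurrently through curl_multi.
// Returns [fid => ['effective_url' => string, 'body' => string]].
function fetchAll(array $urls): array
{
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $fid => $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow the redirects
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // keep the body in memory
        curl_setopt($ch, CURLOPT_TIMEOUT, 30);          // assumed limit, tune as needed
        curl_multi_add_handle($mh, $ch);
        $handles[$fid] = $ch;
    }

    // Drive all transfers until none is still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000); // select can fail on some systems; avoid a busy loop
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $fid => $ch) {
        $results[$fid] = [
            'effective_url' => curl_getinfo($ch, CURLINFO_EFFECTIVE_URL),
            'body'          => (string) curl_multi_getcontent($ch),
        ];
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $results;
}
```

For step 3, each entry of `fetchAll(buildPictureUrls($fid, 1000))` could be skipped when its `effective_url` equals `$deletedProfile` or its `body` is empty; otherwise the body would be written to a file (ideally one file name per `fid` instead of a shared `image.jpg`) and passed to `FastHashImage` as in the original loop. Whether the remote server tolerates 1000 simultaneous connections is an open question, so smaller batches may be needed in practice.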
【Discussion】:
Tags: php multithreading ssl curl scraper