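I Library used: QueryList
II Installation
Make sure Composer is installed. Downloads from the default Packagist repository can be slow, so you may want to switch to a mirror in China:
composer config -g repo.packagist composer https://packagist.phpcomposer.com
Install QueryList:
composer require jaeger/querylist
See the QueryList documentation for details on the library.
III Requirements
Given a Taobao or Tmall product link, scrape the product title, the main product image, the shop name, and the seller's Wangwang name for that product.
IV The current demo works for all Tmall products, and for Taobao products whose shop name appears on the right side or at the top of the page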
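The demo has to tell Tmall pages from Taobao pages before it can pick selectors. Here is a minimal sketch of that decision, assuming the host name alone is enough to classify a link. The helper name `detect_page_type` and the sample Tmall URL are hypothetical and not part of the original demo. Whether a Taobao page shows the shop name on the right or at the top cannot be read from the URL, so that case still has to be detected from the page content.

```php
<?php
// Hypothetical helper (not in the original demo): classify a product link
// by its host so the matching set of selectors can be chosen.
function detect_page_type(string $url): string
{
    // parse_url() may return null or false; the cast turns both into ''.
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    if ($host === 'detail.tmall.com') {
        return 'tmall';
    }
    if ($host === 'item.taobao.com') {
        return 'taobao';
    }
    return 'unknown';
}

// The Tmall URL below is a made-up example.
echo detect_page_type('https://detail.tmall.com/item.htm?id=1'), "\n";
echo detect_page_type('https://item.taobao.com/item.htm?id=564043247193'), "\n";
```

Comparing the parsed host is a little more tolerant than comparing a fixed-length URL prefix, for example when the host is written in upper case.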
V Code
<?php
include "vendor/autoload.php";
use QL\QueryList;
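// Decode shop names that some Taobao product pages store as Unicode escapes
function uni_decode($s) {
    // Decimal HTML entities, e.g. &#24215;
    preg_match_all('/&#([0-9]{2,5});/', $s, $html_uni);
    // JavaScript-style escapes, e.g. \u5e97 or %u5e97.
    // The "e" modifier was dropped: it is not valid for preg_match_all().
    preg_match_all('/[\\\\%]u([0-9a-f]{4})/i', $s, $js_uni);
    $source = array_merge($html_uni[0], $js_uni[0]);
    $js = array();
    for ($i = 0; $i < count($js_uni[1]); $i++) {
        $js[] = hexdec($js_uni[1][$i]);
    }
    // Code points in the same order as the matched source strings
    $utf8 = array_merge($html_uni[1], $js);
    $code = $s;
    for ($j = 0; $j < count($utf8); $j++) {
        $code = str_replace($source[$j], unicode2utf8($utf8[$j]), $code);
    }
    return $code;
}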
// Encode a single Unicode code point as a UTF-8 byte sequence
function unicode2utf8($c) {
    $str = "";
    if ($c < 0x80) {
        $str .= chr($c);
    } else if ($c < 0x800) {
        $str .= chr(0xc0 | ($c >> 6));
        $str .= chr(0x80 | ($c & 0x3f));
    } else if ($c < 0x10000) {
        $str .= chr(0xe0 | ($c >> 12));
        $str .= chr(0x80 | (($c >> 6) & 0x3f));
        $str .= chr(0x80 | ($c & 0x3f));
    } else if ($c < 0x200000) {
        $str .= chr(0xf0 | ($c >> 18));
        $str .= chr(0x80 | (($c >> 12) & 0x3f));
        $str .= chr(0x80 | (($c >> 6) & 0x3f));
        $str .= chr(0x80 | ($c & 0x3f));
    }
    return $str;
}
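// Return the text between the first occurrence of $start and the first occurrence of $end.
// Both delimiters are assumed to be present in $input.
function get_between($input, $start, $end) {
    return substr($input, strlen($start) + strpos($input, $start), (strlen($input) - strpos($input, $end)) * (-1));
}
// Remove all whitespace: ASCII space, full-width space, tab, and line breaks
function trimall($str)
{
    $qian = array(" ", "　", "\t", "\n", "\r");
    $hou = array("", "", "", "", "");
    return str_replace($qian, $hou, $str);
}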
$url = 'https://item.taobao.com/item.htm?spm=a230r.1.14.34.47cd6ace3iAnm0&id=564043247193&ns=1&abbucket=19#detail';
$ql = QueryList::get($url)->encoding('UTF-8', 'GBK'); // convert GBK to UTF-8 to avoid garbled text
// Three cases are handled differently:
// 1. Tmall product link  2. Taobao page with the shop name on the right  3. Taobao page with the shop name at the top
if (substr($url, 0, 24) == 'https://detail.tmall.com') {
    $rt = [
        'img' => $ql->find('#J_ImgBooth')->attr('src'),
        'title' => $ql->find(':input[name="title"]')->attr('value'),
        'shop_name' => $ql->find('.slogo-shopname')->text()
    ];
    // On Tmall the shop name is reused as the seller name
    $rt['seller_name'] = $rt['shop_name'];
} else {
    $rt = [
        'img' => $ql->find('#J_ImgBooth')->attr('src'),
        'title' => $ql->find('.tb-main-title')->text(),
        'shop_name' => $ql->find('.tb-shop-name>dl>dd>strong>a')->text(),
        'seller_name' => $ql->find('.tb-seller-name')->text()
    ];
    // Case 3: the shop name is not in the DOM, so read it from the first inline script
    if (!$rt['shop_name']) {
        $config = substr(trimall($ql->find('script')->eq(0)->text()), 100, 150);
        $shop_name = get_between($config, "shopName:'", "',sellerId");
        $rt['shop_name'] = uni_decode($shop_name);
        $rt['seller_name'] = get_between($config, "sellerNick:'", "',sibUrl");
    }
}
var_dump($rt['shop_name']);
echo '<hr />';
?>
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Taobao Product Scraping Demo</title>
</head>
<body>
<h4>Title: <?php echo $rt['title']; ?></h4>
<h4>Shop: <?php echo $rt['shop_name']; ?></h4>
<h4>Wangwang: <?php echo $rt['seller_name']; ?></h4>
<h4>Image:</h4>
<img src="<?php echo $rt['img'] ?>" alt="">
</body>
</html>
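The `get_between` helper in the listing assumes both delimiters are present. If either is missing, `strpos` returns `false` and the result is meaningless. The following is a sketch of a more defensive variant, under the assumption that returning `null` on a miss is acceptable to the caller. The name `get_between_safe` and the sample string are hypothetical and only mimic the shape of the inline config, not real page data. Unlike the original, it looks for the end delimiter only after the start delimiter.

```php
<?php
// Hypothetical variant of get_between(): returns null when a delimiter is missing.
function get_between_safe(string $input, string $start, string $end): ?string
{
    $from = strpos($input, $start);
    if ($from === false) {
        return null;
    }
    $from += strlen($start);
    // Search for the end delimiter only after the start delimiter.
    $to = strpos($input, $end, $from);
    if ($to === false) {
        return null;
    }
    return substr($input, $from, $to - $from);
}

// Made-up sample resembling the whitespace-stripped inline config.
$config = "shopName:'DemoShop',sellerId:'1',sellerNick:'demo_nick',sibUrl:'x'";
var_dump(get_between_safe($config, "shopName:'", "',sellerId"));
var_dump(get_between_safe($config, "sellerNick:'", "',sibUrl"));
var_dump(get_between_safe($config, "missing:'", "',sibUrl"));
```

With a `null` result the caller can skip `uni_decode` and leave the field empty instead of showing a truncated fragment of the script.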
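VI Results
1 Tmall product link
Scraped result:
2 Taobao product link with the shop name on the right
Scraped result:
3 Taobao product link with the shop name at the top (this case is a bit more work, because both the seller's Wangwang name and the shop name live in JavaScript on the page, and the shop name is additionally encoded)
Scraped result: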
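The "encoded" shop name in case 3 is a string of Unicode escapes. As an alternative to the hand-written `uni_decode` and `unicode2utf8` pair, the same decoding can be sketched with `preg_replace_callback` and `mb_chr`, assuming PHP 7.2 or later with the mbstring extension. The function name `decode_unicode_escapes` is hypothetical, and the sample input is illustrative rather than taken from a real page.

```php
<?php
// Hypothetical alternative to uni_decode(); requires PHP 7.2+ with mbstring.
function decode_unicode_escapes(string $s): string
{
    // Turn a code point into UTF-8, keeping the original text if it is invalid.
    $toChar = function (int $codePoint, string $fallback): string {
        $ch = mb_chr($codePoint, 'UTF-8');
        return $ch === false ? $fallback : $ch;
    };

    // Decimal HTML entities such as &#24215;
    $s = preg_replace_callback('/&#([0-9]{2,5});/', function (array $m) use ($toChar): string {
        return $toChar((int) $m[1], $m[0]);
    }, $s) ?? $s;

    // JavaScript-style escapes such as \u5e97 or %u5e97
    return preg_replace_callback('/[\\\\%]u([0-9a-f]{4})/i', function (array $m) use ($toChar): string {
        return $toChar((int) hexdec($m[1]), $m[0]);
    }, $s) ?? $s;
}

// Single quotes keep the backslashes literal, as they appear in the page script.
echo decode_unicode_escapes('\u5e97\u94fa'), "\n";
echo decode_unicode_escapes('&#24215;'), "\n";
```

Replacing each match in a callback avoids the separate match-then-`str_replace` pass, at the cost of requiring mbstring.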
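VII This demo came out of a requirement in a recent project. If you need to scrape other fields, consult the QueryList manual and adapt the code to your own product and business requirements.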