kangbk

1. Comparison of several common PHP crawler frameworks

Original link: https://blog.csdn.net/future_todo/article/details/52804440

1.1 phpQuery

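Strength: jQuery-like power for searching the DOM.
pq() is a powerful method for searching the DOM and works just like jQuery's $(). Almost every jQuery selector also works in phpQuery; you only need to change "." to "->" when chaining method calls. The demo is below (Demo5 in my GitHub repo):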

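<?php
require('phpQuery/phpQuery.php');
// Load the remote page as the current document
phpQuery::newDocumentFile('http://www.baidu.com/');
$menu_a = pq("a"); // select every <a> element
foreach ($menu_a as $a) {
    echo pq($a)->html() . "<br>";       // inner HTML of each link
}
foreach ($menu_a as $a) {
    echo pq($a)->attr("href") . "<br>"; // href attribute of each link
}
?>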
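
To make the selector idea concrete without a network request, here is a minimal sketch of the same kind of lookup. It uses PHP's bundled DOM extension (DOMDocument and DOMXPath) as a stand-in, not phpQuery's own API, and the inline HTML string is an assumption made up for the illustration.

```php
<?php
// Inline sample markup (an assumption; the demo above fetches a live page instead).
$html = '<html><body><div id="menu">'
      . '<a href="/news">News</a>'
      . '<a href="/maps">Maps</a>'
      . '</div><a href="/about">About</a></body></html>';

$doc = new DOMDocument();
// Collect libxml parse warnings internally instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$links = [];
// XPath counterpart of the CSS selector "#menu a".
foreach ($xpath->query('//*[@id="menu"]//a') as $a) {
    $links[$a->textContent] = $a->getAttribute('href');
}
print_r($links);
```

With phpQuery the same lookup would presumably be written as pq("#menu a"), which is the convenience the library offers over raw XPath.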

1.2 PHPCrawl

Strength: relatively strong filtering capabilities.
The official demo is below (demo4 in my GitHub repo):

<?php 
    include("PHPCrawl/libs/PHPCrawler.class.php");
    class MyCrawler extends PHPCrawler 
    { 
      function handleDocumentInfo(PHPCrawlerDocumentInfo $PageInfo) 
      { // As example we just print out the URL of the document 
        echo $PageInfo->url."<br>"; 
      } 
    }
    $crawler = new MyCrawler(); 
    $crawler->setURL("www.baidu.com"); 
    $crawler->addURLFilterRule("#\.(jpg|gif)$# i");
    // The rule above makes the crawler skip URLs that end in these image extensions
    $crawler->go();
 ?>
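
To see which URLs a rule like the one above would exclude, the sketch below applies the same regular expression with plain preg_match(). The helper name applyUrlFilter and the sample URLs are hypothetical, and the pattern is written without the space before the "i" modifier; this only illustrates the matching, not how PHPCrawl applies its rules internally.

```php
<?php
// Hypothetical helper: keep only the URLs that do NOT match the filter pattern.
function applyUrlFilter(array $urls, string $pattern): array
{
    $kept = array_filter($urls, function (string $url) use ($pattern): bool {
        return preg_match($pattern, $url) !== 1;
    });
    return array_values($kept); // reindex from 0
}

// Sample URLs (assumptions for illustration).
$urls = [
    'http://www.example.com/index.html',
    'http://www.example.com/logo.GIF',
    'http://www.example.com/photo.jpg',
    'http://www.example.com/photo.jpg?size=large',
];

// Case-insensitive match on a trailing ".jpg" or ".gif".
$kept = applyUrlFilter($urls, '#\.(jpg|gif)$#i');
print_r($kept);
```

Because the pattern is anchored with "$", an image URL that carries a query string does not match and would still be followed; a looser pattern is needed if those should be skipped as well.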

1.3 snoopy

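Strengths: form submission, proxy settings, and more.
Snoopy is a PHP class that simulates a web browser: it can fetch page content and submit forms.
The demo is below (demo3 in the GitHub repo):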

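<?php
include 'Snoopy/Snoopy.class.php';
$snoopy = new Snoopy();
$url = "http://www.baidu.com";
// $snoopy->fetch($url);      // fetch the page content
// $snoopy->fetchtext($url);  // fetch the page with HTML tags and other irrelevant data stripped
$snoopy->fetchform($url);     // fetch only the forms
// Return only the links in the page. By default, relative links are expanded into full URLs.
// $snoopy->fetchlinks($url);
var_dump($snoopy->results);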

1.4 phpspider

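Strength: installs and configures against a database.
It ships with an installation and configuration setup, can connect directly to a MySQL database, and is fairly widely used. We will not cover it separately here.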

 

2. Simulating user behavior

2.1 file_get_contents

<?php
// Request options: method plus the extra headers to send
$opts = array(
  'http'=>array(
    'method'=>"GET",
    'header'=>"Accept-language: en\r\n" .
              "Cookie: foo=bar\r\n"
  )
);

$context = stream_context_create($opts);

/* Sends an http request to www.example.com
   with additional headers shown above.
   The same $context can also be passed as the third
   argument of file_get_contents(). */
$fp = fopen('http://www.example.com', 'r', false, $context);
fpassthru($fp);
fclose($fp);
?>
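
Since this section is titled file_get_contents, here is a minimal sketch of the same request made with that function. The helper names buildHttpContext and fetchPage, the 10-second timeout, and the command-line usage are assumptions added for illustration; only the headers come from the example above.

```php
<?php
// Hypothetical helper: turn a name => value map into an HTTP stream context.
function buildHttpContext(array $headers, int $timeoutSeconds = 10)
{
    $headerLines = '';
    foreach ($headers as $name => $value) {
        $headerLines .= $name . ': ' . $value . "\r\n";
    }
    return stream_context_create([
        'http' => [
            'method'  => 'GET',
            'header'  => $headerLines,
            'timeout' => $timeoutSeconds, // assumed value, not from the original
        ],
    ]);
}

// Hypothetical helper: fetch a page, returning null instead of false on failure.
function fetchPage(string $url, $context): ?string
{
    $body = @file_get_contents($url, false, $context);
    return $body === false ? null : $body;
}

$context = buildHttpContext(['Accept-language' => 'en', 'Cookie' => 'foo=bar']);

// Only touch the network when a URL is passed on the command line,
// e.g. `php fetch.php http://www.example.com` (file name is hypothetical).
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    $html = fetchPage($argv[1], $context);
    echo $html === null ? "request failed\n" : $html;
}
```

Compared with fopen() and fpassthru(), file_get_contents() returns the whole body as a string, which is usually more convenient when the page is going to be parsed afterwards.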

2.2 curl

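<?php
$url = "http://www.baidu.com/"; // example target (an assumption); it matches the Host header below
$ch = curl_init(); // initialize a cURL session
curl_setopt($ch, CURLOPT_URL, $url); // set the URL to fetch
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
// Send the headers a particular browser would send
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
  "Host: www.baidu.com",
  "Connection: keep-alive",
  "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
  "Upgrade-Insecure-Requests: 1",
  "DNT:1",
  "Accept-Language: zh-CN,zh;q=0.8,en-GB;q=0.6,en;q=0.4,en-US;q=0.2",
  "Cookie:_za=4540d427-eee1-435a-a533-66ecd8676d7d;"
));
$result = curl_exec($ch); // execute the cURL session
if ($result === false) {
    echo "cURL error: " . curl_error($ch);
}
curl_close($ch); // close the session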

2.3 snoopy

  • Form submission

Our example:
form-demo.html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>form-demo</title>
</head>
<body>
    <form action="./form-demo.php" method="post">
        Username: <input type="text" name="userName"><br>
        Password: <input type="password" name="password"><br>
        <input type="submit">
    </form>
</body>
</html>
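
When the form above is submitted, the browser sends the two fields as a URL-encoded POST body, and any crawler that imitates the submission has to produce the same body. The sketch below shows that encoding with built-in functions only; the password value containing special characters is an assumption chosen to make the encoding visible.

```php
<?php
// Field names match the name attributes in form-demo.html;
// the values are assumptions for illustration.
$formvars = ['userName' => 'admin', 'password' => 'p@ss word'];

// Encode the fields the way a browser encodes a POST form body.
$body = http_build_query($formvars);
echo $body, "\n";

// On the server, PHP decodes the body into $_POST; parse_str() mimics that step.
parse_str($body, $decoded);
var_dump($decoded === $formvars);
```

The keys in the encoded body are what the server reads, which is why the field names used by a crawler must match the form's name attributes exactly.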

form-demo.php

```php
<?php
    // Default to an empty string when a field is missing from the request
    $userName = $_POST['userName'] ?? '';
    $password = $_POST['password'] ?? '';
    if ($userName === "admin" && $password === "admin") {
        echo "hello admin";
    } else {
        echo "login error";
    }
?>
```
Submitting the form
```php
<?php
include 'Snoopy/Snoopy.class.php';
$snoopy = new Snoopy();
// The keys must match the name attributes in the form, which the server reads
$formvars["userName"] = "admin";
$formvars["password"] = "admin";
$action = "http://localhost:8000/spider/demo3/form-demo.php"; // form submission URL
$snoopy->submit($action, $formvars);
echo $snoopy->results;
?>
```



