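Crawler: a web crawler (also called a web spider or web robot, and more often called a Web scutter in the FOAF community) is a program or script that automatically fetches information from the World Wide Web according to a set of rules. Less common names include ant, automatic indexer, emulator, and worm.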

A simple crawler example
 
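# Import the requests library
import requests
 
# URL of the site to crawl (the original assignment is missing;
# this value is an assumption inferred from the output shown below)
url = 'http://www.baidu.com'
# Send a GET request to that URL
res = requests.get(url)
# Decode the body with the encoding detected from its content
res.encoding = res.apparent_encoding
# Print the response body as text
print(res.text)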
 
The printed result is:
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下，你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class="head_wrapper"> <div class="s_form"> <div class="s_form_wrapper"> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class="s_ipt" value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class="mnav">新闻</a> <a href=http://www.hao123.com name=tj_trhao123 class="mnav">hao123</a> <a href=http://map.baidu.com name=tj_trmap class="mnav">地图</a> <a href=http://v.baidu.com name=tj_trvideo class="mnav">视频</a> <a href=http://tieba.baidu.com name=tj_trtieba class="mnav">贴吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class="lb">登录</a> </noscript> <script>document.write('<a href="http://www.baidu.com/bdorz/login.gif?login&tpl=mn&u='+ encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" 
: "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">登录</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class="bri" >更多产品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>关于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必读</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意见反馈</a>&nbsp;京ICP证030173号&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>
 
Saved as an HTML file and opened in a browser, this markup renders as the Baidu home page.
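The encoding step in the example is easy to overlook. When a text response does not declare a charset in its Content-Type header, requests falls back to ISO-8859-1, so UTF-8 Chinese text comes out garbled unless the encoding is corrected (for example with `res.apparent_encoding`). The sketch below reproduces the effect offline using only byte strings; the sample title is taken from the page above, and the variable names are illustrative.

```python
# The page title as it travels over the wire: UTF-8 encoded bytes
raw = "百度一下，你就知道".encode("utf-8")

# Decoding those bytes with the wrong charset produces mojibake
garbled = raw.decode("iso-8859-1")

# Decoding with the correct charset recovers the original text
fixed = raw.decode("utf-8")

print(repr(garbled))
print(fixed)
```

Setting `res.encoding` before reading `res.text` plays the same role as choosing the right charset in the second `decode` call.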
Crawling the Chouti site
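# Import the requests library
import requests
# URL of the site to crawl (the original assignment is missing;
# this value is an assumption based on the site named above)
url = 'https://dig.chouti.com/'
# Send a GET request to that URL
res = requests.get(url)
# Print the response body as text
print(res.text)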
 
The content returned is:
<html xmlns="http://www.w3.org/1999/xhtml"><head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>网站防火墙</title>
<style>
p {
    line-height:20px;
}
ul{ list-style-type:none;}
li{ list-style-type:none;}
</style>
</head>
 
<body >
 
<div >
  
  
  <div >
    <div >网站防火墙 </div>
    <div >
      <p ><span >您的请求带有不合法参数,已被拦截!请勿在恶意提交。</span></p>
<p >可能原因:您提交的内容包含危险的攻击请求, 自动记录 ip 相关信息通知管理员</p>
<p >如何解决:</p>
<ul ><li >1)检查提交内容;</li>
<li >2)普通网站访客,请联系网站管理员;</li></ul>
    </div>
  </div>
</div>
</body></html>
 
Opened in a browser, this markup renders as a "网站防火墙" (website firewall) page saying that the request was intercepted.
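The site's detection mechanism has flagged our request as illegitimate access.
Here is how to get around it.
First open the site in a browser and open the developer tools (usually F12), then go to the Network tab, select a request, and copy the User-Agent value from its request headers.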
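One likely reason the first attempt was blocked is that requests identifies itself with its own User-Agent rather than a browser's. The sketch below only prepares requests without sending them, so the headers can be compared offline. The URL is a placeholder chosen for illustration, and the names `default_req` and `browser_req` are hypothetical.

```python
import requests

# Placeholder URL for illustration only; nothing is sent over the network
url = "https://example.com/"

# Browser User-Agent string, as copied from the developer tools
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"
}

session = requests.Session()

# Prepare (but do not send) a request that uses the library defaults
default_req = session.prepare_request(requests.Request("GET", url))

# Prepare the same request with the browser User-Agent supplied
browser_req = session.prepare_request(
    requests.Request("GET", url, headers=header)
)

print(default_req.headers["User-Agent"])
print(browser_req.headers["User-Agent"])
```

The first value starts with `python-requests/`, which is easy for a site to recognize; the second is the browser string passed in `headers`.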
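# Import the requests library
import requests
# Request headers; the User-Agent value is copied from the browser
header = {
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
}
# URL of the site to crawl (the original assignment is missing;
# this value is an assumption based on the site named above)
url = 'https://dig.chouti.com/'
# Send a GET request to that URL with the custom headers
res = requests.get(url,headers=header)
# Print the response body as text
print(res.text)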
This time the content returned is:
<!doctype html><html class="no-js" lang="en" xmlns="http://www.w3.org/1999/xhtml"><head><meta charset="utf-8"><title>抽屉新热榜-聚合每日热门、搞笑、有趣资讯</title><meta name="keywords" content="抽屉新热榜,资讯,段子,图片,公众场合不宜,科技,新闻,节操,搞笑"><meta name="description" content="抽屉新热榜,汇聚每日搞笑段子、热门图片、有趣新闻。它将微博、门户、社区、bbs、社交网站等海量内容聚合在一起,通过用户推荐生成最热榜单。看抽屉新热榜,每日热门、有趣资讯尽收眼底。"><meta name="author" content="北京格致璞科技有限公司"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="renderer" content="webkit"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,user-scalable=no" draggable="false"><meta name="format-detection" content="telephone=no"><meta name="apple-mobile-web-app-capable" content="yes"><link rel="apple-touch-icon" href="/images/apple-touch-icon.png"><link rel="shortcut icon" href="/images/favicon-d38b877458.png" type="image/vnd.microsoft.icon"><link rel="icon" href="/images/favicon-d38b877458.png" type="image/vnd.microsoft.icon"><link href="/images/favicon-d38b877458.png" mce_href="/images/favicon-d38b877458.png" rel="icon" type="image/x-icon"><meta name="robots" content="index,follow"><meta name="GOOGLEBOT" content="index,follow"><meta name="Author" content="搞笑"><link type="application/opensearchdescription+xml" href="/opensearch.xml" title="抽屉新热榜" rel="search"><script>history.scrollRestoration = "manual"</script><link type="text/css" rel="stylesheet" href="/styles/base-650bcbd75b.css?v=1562812941943"></head><body><div><div class="header-fix"><header class="outer-container"><div class="main-container clearfix"><a class="logo-con left" href="/"><img class="logo-icon" src="/images/logo-c30a1a3941.png"> <img class="logo-txt" src="/images/logo_txt-06bb1545d4.png"></a><nav class="left"><div class="nav-ul clearfix"><a class="nav-li left active" data-url="/" href="/"><span>首页</span> </a><a class="nav-li left zone-area-btn" href="javascript:;"><span>     <span>专区</span> </span><span class="trangle-icon"></span> </a><a class="nav-li left" data-url="/all/man" 
href="/all/man"><span>人类发布</span> </a><a class="nav-li left" data-url="/zone/video" href="/zone/video"><span>视频</span> </a><a class="nav-li left discovery-area-btn" href="javascript:;"><span>        <span class="discovery-area-name">发现</span> </span><span class="trangle-icon"></span> </a><a class="nav-li left" data-url="/page/section/attention" href="/page/section/recommend"><span>话题</span> </a></div></nav><a class="btn1 right login-btn" ;
        window.jid = "";
        window.loginedUser = "";
        window.followCount = "";
        window.action = "";
        window.phone = "";
        window.commentLimit = "";</script><script src="/vendor/base-2d7e5ec98b.js"></script><script src="/vendor/core-8a9c3d200a.js"></script><script src="//cstaticdun.126.net/load.min.js"></script><!--[if lt IE 8]>
    <p class="browserupgrade">您正在使用 <strong>过时</strong>的浏览器,该浏览器已经不保证完全兼容。请 <a
            href="http://browsehappy.com/">升级您的浏览器</a> 以提升您的用户体验.</p>
    <![endif]--><script type="text/javascript">let messageArr = [];
        let messageCount = 0;
        let requestUrl = "\/";
        let serverUrl = "https:\/\/io.chouti.com";
        let socket = io.connect(serverUrl);
        if (window.jid) {
            socket.emit('clientId',window.jid);//登陆状态下发送jid
        }
        //收到进入热榜消息
        if (requestUrl == '/' || requestUrl == '/all/hot') {
            socket.on('hot_updated', function(msg){
                // console.log(msg);
                const jsonObj = JSON.parse(msg);
                messageCount = messageCount + 1;
                messageArr.push(jsonObj.linkId);
                if (messageCount > 0) {
                    $('.msgAlert-place').show();
                } else {
                    $('.msgAlert-place').hide();
                }
                $('.msg-alert .num').text(messageCount);
            });
        }
 
        //收到顶部通知
        if (requestUrl != '/download') {
            socket.on('has_notification', function(msg){
                window.CT.fetchNotify();
            });
        }
 
 
        // 关闭黄条
        $('body').on('click', '.close-area', e => {
            e.stopPropagation();
            e.preventDefault();
            $('.msgAlert-place').hide();
        });
 
        // 查看新入热榜的新闻
        $('body').on('click', '#refreshLink', e => {
            const linkIds = messageArr.join(',');
            if (messageCount < 25) {
                $.ajax({
                    url: '/get/links/ajax',
                    type: "GET",
                    data: {
                        linkIds: linkIds
                    },
                    success: function (res) {
                        window.scrollTo(0, 0);
                        messageArr = [];
                        messageCount = 0;
                        $('.msgAlert-place').hide();
                        const linkList = res.linkList.length ? window.CT.preDealLinks(res.linkList,'index'):[];
                        if (linkList.length && res.refresh == false) {
                            const template = Handlebars.compile(window.CT.LinkTmpl);
                            const context = {links:linkList,isHot:true};
                            $('.link-con').prepend(template(context));
 
                            window.CT.generateQrcode();
                        } else {
                            window.location.href = window.location.href;
                        }
 
                    },
                    error: function () {}
                });
            } else {
                window.location.href = window.location.href;
            }
 
        });</script><script type="text/javascript">var _hmt = _hmt || [];
        (function() {
            var hm = document.createElement("script");
            var s = document.getElementsByTagName("script")[0];
            s.parentNode.insertBefore(hm, s);
        })();</script><script type="text/javascript">//获取二维码图片的地址
        $.ajax({
            url: '/download/code/img',
            type: 'GET',
            data: {},
            success: function (res) {
                $('.ct-code-img').attr('src', window.CT.replaceHttps(res.chouti));
            }
        });</script></div><script src="/scripts/links/linksRouter-6fbe9ba4ac.js"></script></body></html>
 
 
Opened in a browser, this markup shows that the site's data is now being retrieved normally.
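Rather than opening each response in a browser, the page title can be checked in code. The sketch below is one possible approach using the standard library's `html.parser`; the class `TitleParser` and the helpers `page_title` and `looks_blocked` are hypothetical names, and the firewall title is taken from the blocked response shown earlier. Other sites may signal a block differently.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is recorded
        if tag == "title" and self.title is None:
            self._in_title = True
            self.title = ""

    def handle_data(self, data):
        if self._in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False


def page_title(html):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return (parser.title or "").strip()


def looks_blocked(html):
    """Assumption: a blocked response carries the firewall page title."""
    return page_title(html) == "网站防火墙"
```

With the responses above, `looks_blocked(res.text)` would be true for the first Chouti request and false once the User-Agent header is supplied.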