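Holiday Study [6]: Python Web Crawlers, 2020.2.4

Today I watched a Python web crawler video course to review the crawling basics I first learned a while ago, and to learn the rules that web crawlers are expected to follow.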

Case study: JD.com's Robots protocol

https://www.jd.com/robots.txt

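This file states which parts of the site crawlers are allowed to access.

Syntax: # starts a comment, * means all, and / means the root directory.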
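To make the syntax concrete, here is a minimal sketch that parses a small robots.txt with the standard library's `urllib.robotparser`. The rules in `SAMPLE_ROBOTS`, the crawler names, and the `example.com` URLs are made up for illustration and are not copied from JD.com's actual file; `build_parser` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only (not a real site's file).
SAMPLE_ROBOTS = """\
# Lines starting with # are comments
User-agent: *
Disallow: /private/

User-agent: BadSpider
Disallow: /
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS)

# "*" applies to every crawler: only /private/ is off limits.
print(parser.can_fetch("MyCrawler", "https://example.com/index.html"))      # True
print(parser.can_fetch("MyCrawler", "https://example.com/private/a.html"))  # False

# BadSpider is disallowed from "/", the root directory, so from the whole site.
print(parser.can_fetch("BadSpider", "https://example.com/index.html"))      # False
```

Parsing from a string keeps the example offline; against a live site, `set_url()` followed by `read()` on the same class would download the file instead.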

Robots protocol files of other sites

http://www.baidu.com/robots.txt Baidu

http://news.sina.com.cn/robots.txt Sina News

http://www.qq.com/robots.txt Tencent

http://news.qq.com/robots.txt Tencent News

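If a site does not provide a Robots protocol (no robots.txt file), all of its content is treated as open to crawling.

Web crawlers: read robots.txt first, either automatically or by hand, and only then crawl the content.

Binding force: the Robots protocol is advisory rather than binding, but ignoring it may carry legal risk.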
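The "check robots.txt automatically, then crawl" step can be sketched roughly as follows. This is only an illustration under stated assumptions: `crawl_if_allowed`, `fake_fetch`, the `MyCrawler` user agent, and the sample rules are hypothetical names and data, and the fetch function is passed in as a parameter so the example runs without network access. In a real crawler the fetch function could be something like the `getHTMLText` function in these notes.

```python
from urllib.robotparser import RobotFileParser


def crawl_if_allowed(url, robots_lines, fetch, user_agent="MyCrawler"):
    """Return fetch(url) if the robots rules allow user_agent to access url, else None."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    if not parser.can_fetch(user_agent, url):
        # The site asks crawlers to stay out of this path, so skip it.
        return None
    return fetch(url)


def fake_fetch(url):
    """Stand-in for a real HTTP request (assumption for this offline example)."""
    return "<html>page at " + url + "</html>"


# Hypothetical rules: every crawler is asked to stay out of /cart/.
rules = ["User-agent: *", "Disallow: /cart/"]

print(crawl_if_allowed("https://example.com/item/1.html", rules, fake_fetch))
print(crawl_if_allowed("https://example.com/cart/list", rules, fake_fetch))  # None
```

Returning `None` for a disallowed URL lets the caller tell "skipped out of politeness" apart from an empty page.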

A general code framework for fetching web pages

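# General code framework for fetching a web page
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        # Raise requests.HTTPError if the status code is 4xx or 5xx
        r.raise_for_status()
        # Use the encoding guessed from the body instead of the header default
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        # Covers connection errors, timeouts, and bad HTTP status codes
        return "An exception occurred"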

if __name__=="__main__":
    url="http://www.baidu.com"
    print(getHTMLText(url))
