Python Web Crawlers: Basic Operations

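Websites link to one another, so the web as a whole resembles a spider's web. Put a spider on it, let it follow link after link, and it can eventually reach every page that is linked from somewhere. Crawling content on the Internet amounts to releasing such a spider.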

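Crawlers fall into two categories:

Directed crawlers: crawl only one kind of site, with a specific target (almost all crawlers written in practice are directed)
Non-directed crawlers: no specific target; every link found is crawled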
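The difference between the two categories comes down to which links the crawler agrees to follow. A minimal sketch, assuming the target is identified by its domain; the names should_follow and allowed_domain are made up for this illustration:

```python
from urllib.parse import urlparse


def should_follow(link, allowed_domain=None):
    """Decide whether a crawler should follow `link`.

    allowed_domain=None models a non-directed crawler (follow everything);
    a domain string models a directed crawler restricted to that site.
    """
    if allowed_domain is None:
        return True
    host = urlparse(link).hostname or ""
    # Accept the domain itself and any of its subdomains.
    return host == allowed_domain or host.endswith("." + allowed_domain)


print(should_follow("https://www.autohome.com.cn/news/", "autohome.com.cn"))
print(should_follow("http://www.baidu.com", "autohome.com.cn"))
print(should_follow("http://www.baidu.com"))
```

A real directed crawler would usually filter on more than the domain (URL path, page type), but the idea is the same.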

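A crawler, in short, goes to a URL and retrieves specific content from it:

Send an HTTP request, e.g. to http://www.baidu.com
Extract the wanted content from the response, for example with regular expressions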
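The extraction step can be sketched with the standard re module. To keep the example runnable offline, an inline HTML string (sample_html, invented for this illustration) stands in for the body returned by the HTTP request:

```python
import re

# Matches the contents of the first <title> element, across line breaks.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the page title found in `html`, or None if there is none."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


# Stand-in for the text of a downloaded page.
sample_html = "<html><head><title> Example page </title></head><body></body></html>"
print(extract_title(sample_html))  # prints: Example page
```

Regular expressions work for simple, well-known patterns like this one; for nested HTML a parser such as BeautifulSoup is usually less fragile.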

Python implementation (a small example that crawls Autohome and gets the headline of one news item):

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.autohome.com.cn/news/")
# print(response.content)  # the raw bytes
response.encoding = 'gbk'  # encoding used to decode response.text
# print(response.text)  # the decoded text
soup = BeautifulSoup(response.text, 'html.parser')  # 'html.parser' selects the HTML parser
tag = soup.find(id='auto-channel-lazyload-article')
h3 = tag.find(name="h3")  # name is the tag name
print(h3)


A second Autohome example: find all the news items (title, summary, URL, image)

import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Crawl Autohome and find all the news items (title, summary, URL, image)

base_url = "https://www.autohome.com.cn/news/"
response = requests.get(base_url)
response.encoding = 'gbk'  # encoding used to decode response.text
soup = BeautifulSoup(response.text, 'html.parser')  # 'html.parser' selects the HTML parser
li_list = soup.find(id='auto-channel-lazyload-article').find_all('li')
for li in li_list:
    title = li.find('h3')
    if not title:
        continue  # skip <li> elements that are not news items
    summary = li.find('p').text
    # li.find('a').attrs is a dictionary of the tag's attributes
    url = li.find('a').get('href')
    # src may be protocol-relative ('//host/path'); urljoin resolves it against base_url
    img_url = urljoin(base_url, li.find('img').get('src'))
    print(title.text)  # title is a Tag object; .text returns the text inside the tag
    print(summary)
    print(url)
    print(img_url)
    print("==============")
    # Download the image
    res = requests.get(img_url)
    # '/' cannot appear in a file name, so replace it
    file_name = "%s.jpg" % title.text.strip().replace('/', '_')
    with open(file_name, 'wb') as f:
        f.write(res.content)

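The requests module:

obj = requests.get('url')  sends the request
obj.content  the response body as bytes
obj.text  the response body as text (the HTML)
obj.encoding = 'gbk'  sets the encoding used to decode the text (so that Chinese displays correctly)
obj.apparent_encoding  the encoding detected from the content, so the line above can be replaced with obj.encoding = obj.apparent_encoding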
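Why the encoding has to be set can be shown without any network call. In this sketch the variable raw stands in for obj.content, and the two decode calls stand in for what obj.text yields under a wrong and a right obj.encoding (requests decodes the bytes with the configured encoding and replaces bytes it cannot decode). The variable names are made up for the illustration:

```python
# Bytes as a GBK-encoded page would deliver them (stand-in for obj.content).
raw = "汽车之家".encode("gbk")

# Decoding with the wrong encoding produces garbage instead of Chinese.
wrong = raw.decode("utf-8", errors="replace")

# Decoding with the page's real encoding recovers the text.
right = raw.decode("gbk")

print(wrong)
print(right)  # prints: 汽车之家
```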

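The BeautifulSoup module:

soup = BeautifulSoup(obj.text, 'html.parser')
tag = soup.find(name="tag name")  returns the first matching tag
[tag, ...] = soup.find_all()  returns a list of all matching tags
tag.text  the text inside the tag
tag.attrs  the tag's attributes as a dictionary; to get one specific attribute, look it up by name
tag.get('href')  returns the value of the named attribute, e.g. href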
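The calls listed above can be tried on a small inline document. The HTML below is invented for this illustration and only loosely imitates a news list; nothing is downloaded:

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for a downloaded page.
html = """
<div id="news">
  <ul>
    <li><a href="/a/1.html"><h3>First headline</h3></a><p>First summary</p></li>
    <li><a href="/a/2.html"><h3>Second headline</h3></a><p>Second summary</p></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find(id="news")        # first tag with this id
first_h3 = container.find(name="h3")    # first <h3> inside the container
links = container.find_all("a")         # list of every <a> inside it
hrefs = [a.get("href") for a in links]  # one attribute value per link

print(first_h3.text)   # prints: First headline
print(links[0].attrs)  # prints: {'href': '/a/1.html'}
print(hrefs)           # prints: ['/a/1.html', '/a/2.html']
```

Note that find returns None when nothing matches, so real pages need a check before chaining further calls.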

Logging in to GitHub with Python (using the POST method of requests):

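1. Send a GET request to the login page to obtain the csrf_token and the cookie (login flows differ from site to site)

2. Send a POST request containing the username, password, csrf_token and cookie. If the login succeeds, the server may return a cookie; later requests only need to carry that cookie to be treated as logged in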

import requests
from bs4 import BeautifulSoup

r1 = requests.get('https://github.com/login')
s1 = BeautifulSoup(r1.text, 'html.parser')
# Get the token that has to be sent with the login form
token = s1.find(name='input', attrs={'name': 'authenticity_token'}).get('value')
r1_cookie_dict = r1.cookies.get_dict()
# Send the username and password to the server; the form fields are:
'''
commit: Sign in
utf8: ✓
authenticity_token: AVkRqH1wYmS6BsmnR4FS1d+ng19SHJLgZhaY9SemGiHVIzZvKvzmLIIhQ6j5nsisaIXI+A9KLAslu7JoIvdxOg==
login: asdf
password: asdf
'''

r2 = requests.post('https://github.com/session',
                   data={
                       'commit': 'Sign in',
                       'utf8': '✓',
                       'authenticity_token': token,
                       'login': 'your-username-or-email',  # placeholder, use your own account
                       'password': 'your-password',  # placeholder, never publish a real password
                   },
                   cookies=r1_cookie_dict
                   )

# Merge the cookies from both responses and send them with later requests
r2_cookie_dict = r2.cookies.get_dict()
cookie_dic = dict()
cookie_dic.update(r1_cookie_dict)
cookie_dic.update(r2_cookie_dict)
r3 = requests.get(
    url='https://github.com/settings/emails',
    cookies=cookie_dic,
)
print(r3.text)
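The cookie bookkeeping above can also be left to requests.Session, which stores cookies from every response (including redirects) and sends them on later requests. The following is a sketch only: it reuses the URLs and form field names from the code above, which may no longer match GitHub's current login page, and the helper names extract_token and login are made up for this illustration.

```python
import requests
from bs4 import BeautifulSoup

LOGIN_URL = "https://github.com/login"
SESSION_URL = "https://github.com/session"


def extract_token(html):
    """Return the value of the hidden authenticity_token input, or None."""
    soup = BeautifulSoup(html, "html.parser")
    field = soup.find(name="input", attrs={"name": "authenticity_token"})
    return field.get("value") if field else None


def login(username, password):
    """Run the GET-then-POST login flow and return the Session holding the cookies."""
    session = requests.Session()
    r1 = session.get(LOGIN_URL, timeout=10)  # step 1: fetch token and cookie
    token = extract_token(r1.text)
    if token is None:
        raise RuntimeError("authenticity_token not found; the login page may have changed")
    session.post(                             # step 2: submit the form
        SESSION_URL,
        data={
            "commit": "Sign in",
            "utf8": "✓",
            "authenticity_token": token,
            "login": username,
            "password": password,
        },
        timeout=10,
    )
    return session
```

After session = login(user, password), a call such as session.get('https://github.com/settings/emails') carries the stored cookies automatically. Whether the login actually succeeded still has to be checked from the response, since a failed login also returns a page.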
