1. Scraping the page
First, open the Zhihu hot list page: https://www.zhihu.com/hot, and fetch it with the requests library. One thing to watch out for:
the User-Agent in the request headers should be set to Mozilla/5.0 to disguise the program as a browser; otherwise the server will recognize it as a Python crawler and the scrape will fail. Zhihu also expects a logged-in session, so the cookie below is copied from the author's own logged-in browser session;
url = "https://www.zhihu.com/hot"
headers = {'User-Agent': 'Mozilla/5.0',
           'cookie': '_xsrf=0NKUqgDc8ezRmsJGb1xC5ukDIxHhxeMq; _zap=bfe65d37-53d8-46d3-ac1e-784e06dcf8a9; d_c0="ALCgxU862A6PTvBEmeag_oGAglx-a-SfU-g=|1547808969"; z_c0="2|1:0|10:1547808985|4:z_c0|92:Mi4xNFpjWUJBQUFBQUFBc0tERlR6cllEaVlBQUFCZ0FsVk4yZjR1WFFCcy1xaEJMbWpyNGNUSkJSY1JacnlXYTJQUWhn|ec267d1d4420cb5fdcfbad75dcf91d0216c07ec185736bbb8595d4b82628cf41"; __utmv=51854390.100--|2=registration_date=20170209=1^3=entry_date=20170209=1; tst=r; __gads=ID=ac032a91f2a254f3:T=1553527557:S=ALNI_MahJ6DLYcUbqypiEeyeB2-kDGPIYg; q_c1=1d097810e84e467388c20b8f87e71621|1554087131000|1550025265000; __utmc=51854390; __utma=51854390.820497302.1550041062.1554087142.1554291501.8; __utmz=51854390.1554291501.8.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; tgw_l7_route=4860b599c6644634a0abcd4d10d37251'}
r = requests.get(url, headers=headers)
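Why the disguise matters can be seen from the User-Agent that requests sends by default, which makes a script trivially easy to identify. A quick check, assuming requests is installed:

```python
import requests

# requests identifies itself as "python-requests/x.y.z" unless overridden,
# which is exactly how the server spots a scripted crawler
default_ua = requests.utils.default_headers()['User-Agent']
print(default_ua)
```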
2. Parsing the page
Right-click and inspect the page elements; you can see that each title sits inside an h2 tag with the attribute class='HotItem-title';
soup = BeautifulSoup(r.text, 'html.parser')
# 'w' mode overwrites the file on each run so old titles don't accumulate
with open('hot.txt', mode='w', encoding='UTF-8') as f:
    for a in soup.find_all(class_='HotItem-title'):
        f.write('%s\n' % a.string)
Saving the extracted hot-list titles to a file makes the later word segmentation and word-cloud steps much easier;
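The extraction logic can be checked offline against a minimal HTML snippet shaped like the hot list (the snippet below is made up for illustration; only the h2 tag and HotItem-title class match the real page):

```python
from bs4 import BeautifulSoup

# a hypothetical fragment mimicking the hot-list markup
html = '''
<div>
  <h2 class="HotItem-title">示例标题一</h2>
  <h2 class="HotItem-title">示例标题二</h2>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
# find_all(class_=...) matches elements by CSS class, regardless of tag name
titles = [a.string for a in soup.find_all(class_='HotItem-title')]
print(titles)  # ['示例标题一', '示例标题二']
```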
3. Building the word cloud
Use the jieba library to segment the extracted titles into Chinese words, then generate the word cloud; this part is all the usual routine...
with open('hot.txt', mode='r', encoding='UTF-8') as m:
    t = m.read()
ls = jieba.lcut(t)
txt = ' '.join(ls)
w = wordcloud.WordCloud(background_color='white', font_path='msyh.ttc',
                        width=1000, height=700, max_words=25, random_state=30)
w.generate(txt)
plt.imshow(w)
# hide the x and y axes
plt.axis('off')
plt.show()
# save the word cloud image to a file
w.to_file('hot.png')
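max_words=25 means WordCloud keeps only the 25 most frequent words. The same frequency ranking can be sketched with the standard library's collections.Counter, assuming the segmentation step has already produced a word list (the words below are hypothetical jieba output):

```python
from collections import Counter

# hypothetical result of jieba.lcut over a few hot-list titles
words = ['如何', '评价', '如何', '看待', '评价', '如何']

# Counter tallies occurrences; most_common(n) returns the top-n words,
# which is the same ranking WordCloud uses to pick its max_words words
top = Counter(words).most_common(2)
print(top)  # [('如何', 3), ('评价', 2)]
```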
4. Results
Complete code
import requests
import jieba
import wordcloud
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
url="https://www.zhihu.com/hot"
headers={'User-Agent':'Mozilla/5.0',
'cookie':'_xsrf=0NKUqgDc8ezRmsJGb1xC5ukDIxHhxeMq; _zap=bfe65d37-53d8-46d3-ac1e-784e06dcf8a9; d_c0="ALCgxU862A6PTvBEmeag_oGAglx-a-SfU-g=|1547808969"; z_c0="2|1:0|10:1547808985|4:z_c0|92:Mi4xNFpjWUJBQUFBQUFBc0tERlR6cllEaVlBQUFCZ0FsVk4yZjR1WFFCcy1xaEJMbWpyNGNUSkJSY1JacnlXYTJQUWhn|ec267d1d4420cb5fdcfbad75dcf91d0216c07ec185736bbb8595d4b82628cf41"; __utmv=51854390.100--|2=registration_date=20170209=1^3=entry_date=20170209=1; tst=r; __gads=ID=ac032a91f2a254f3:T=1553527557:S=ALNI_MahJ6DLYcUbqypiEeyeB2-kDGPIYg; q_c1=1d097810e84e467388c20b8f87e71621|1554087131000|1550025265000; __utmc=51854390; __utma=51854390.820497302.1550041062.1554087142.1554291501.8; __utmz=51854390.1554291501.8.8.utmcsr=zhihu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; tgw_l7_route=4860b599c6644634a0abcd4d10d37251'}
r=requests.get(url,headers=headers)
soup=BeautifulSoup(r.text,'html.parser')
# 'w' mode overwrites the file on each run so old titles don't accumulate
with open('hot.txt', mode='w', encoding='UTF-8') as f:
    for a in soup.find_all(class_="HotItem-title"):
        f.write('%s\n' % a.string)
with open('hot.txt', mode='r', encoding='UTF-8') as m:
    t = m.read()
ls = jieba.lcut(t)
txt=" ".join(ls)
w=wordcloud.WordCloud(background_color='white',font_path='msyh.ttc',width=1000,height=700,max_words=25,random_state=30)
w.generate(txt)
plt.imshow(w)
# hide the x and y axes
plt.axis('off')
plt.show()
# save the word cloud image to a file
w.to_file('hot.png')