最近有爬虫相关的需求,所以上B站找了个视频(链接在文末)看了一下,做了一个小程序出来,大体上没有修改,只是在最后的存储上,由txt换成了excel。
- 简要需求:爬虫爬取 猫眼电影TOP100榜单 数据
- 使用语言:python
- 工具:PyCharm
- 涉及库:requests、re、openpyxl(高版本excel操作库)
实现代码
# -*- coding: utf-8 -*- # @Author : yocichen # @Email : yocichen@126.com # @File : maoyan100.py # @Software: PyCharm # @Time : 2019 # @UpdateTime : 2020/4/26 import requests from requests import RequestException import re import openpyxl import traceback # Get page\'s html by requests module def get_one_page(url): try: headers = { \'user-agent\': \'Mozilla / 5.0(Windows NT 10.0; WOW64) AppleWebKit / 537.36(KHTML, likeGecko) Chrome / 53.0.2785.104Safari / 537.36Core / 1.53.4882.400QQBrowser / 9.7.13059.400\' } # Sometimes, the proxies need to be replaced. # You can get them by accessing https://www.kuaidaili.com/free/inha/ proxies = { \'http\': \'60.190.250.120:8080\' } # use headers to avoid 403 Forbidden Error(reject spider) response = requests.get(url, headers=headers, proxies=proxies) if response.status_code == 200 : return response.text return None except RequestException: traceback.print_exc() return None # Get useful info from html of a page by re module def parse_one_page(html): try: pattern = re.compile(\'<dd>.*?board-index.*?>(\d+)<.*?<a.*?title="(.*?)"\' +\'.*?data-src="(.*?)".*?</a>.*?star">[\\s]*(.*?)[\\n][\\s]*</p>.*?\' +\'releasetime">(.*?)</p>.*?integer">(.*?)</i>.*?\' +\'fraction">(.*?)</i>.*?</dd>\', re.S) items = re.findall(pattern, html) return items except Exception: traceback.print_exc() return [] # Main call function def main(url): page_html = get_one_page(url) parse_res = parse_one_page(page_html) return parse_res # Write the useful info in excel(*.xlsx file) def write_excel_xlsx(items): wb = openpyxl.Workbook() ws = wb.active rows = len(items) cols = len(items[0]) # First, write col\'s title. ws.cell(1, 1).value = \'编号\' ws.cell(1, 2).value = \'片名\' ws.cell(1, 3).value = \'宣传图片\' ws.cell(1, 4).value = \'主演\' ws.cell(1, 5).value = \'上映时间\' ws.cell(1, 6).value = \'评分\' # Write film\'s info for i in range(0, rows): for j in range(0, cols): if j != 5: ws.cell(i+2, j+1).value = items[i][j] else: ws.cell(i+2, j+1).value = items[i][j]+items[i][j+1] break # Save the work book as *.xlsx wb.save(\'maoyan_top100.xlsx\') if __name__ == \'__main__\': print(\'spider working...\') res = [] url = \'https://maoyan.com/board/4?\' for i in range(0, 10): if i == 0: res = main(url) else: newUrl = url+\'offset=\'+str(i*10) res.extend(main(newUrl)) print(\'writing into excel...\') write_excel_xlsx(res) print(\'work done!\nNote: the data is in the current directory.\')
更新效果图:
后记
入门了一点后发现,如果使用正则表达式和requests库来实行进行数据爬取的话,分析HTML页面结构和正则表达式的构造是关键,剩下的工作不过是替换url罢了。
补充一个分析HTML构造正则的例子
审查元素我们会发现每一项都是<dd>****</dd>格式
我想要获取电影名称和评分,先拿出HTML代码看一看
试着构造正则
\'.*?<dd>.*?movie-item-title.*?title="(.*?)">.*?integer">(.*?)<.*?fraction">(.*?)<.*?</dd>\' (随手写的,未经验证)
参考资料
【B站视频 2018年最新Python3.6网络爬虫实战】https://www.bilibili.com/video/av19057145/?p=14
【猫眼电影robots】https://maoyan.com/robots.txt (最好爬之前去看一下,那些可爬那些不允许爬)