My Self-Taught Web Scraping Journey
E-book reference material: https://pan.baidu.com/s/15R08yEjLDj8FxrBwnUaTyA (Note: for online study and exchange only; if this infringes any rights, please contact me.)
Let's learn together ┏(^0^)┛
A little about me: I'm a small ant who got past the Python basics and is now wandering down the self-taught web-scraping road. Many technical blogs helped me through the long, dry stretches of learning to program, and out of gratitude I want to record my own experience too. This is my first blog post, so please forgive any rough edges, thank you~ By the way, if you're also self-taught and just starting out, give hackerrank.com a try. I just need a teammate~ you'll find it a refreshing experience ^_^
Installing third-party libraries often fails with: error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools
Download: https://download.microsoft.com/download/5/f/7/5f7acaeb-8363-451f-9425-68a90f98b238/visualcppbuildtools_full.exe?fixForIE=.exe. The install takes a while, but it fixes the problem once and for all, right? Haha.
To install selenium, the chromedriver.exe download page is: http://chromedriver.storage.googleapis.com/index.html?path=2.41/
I'm on Windows, and I put the file in the python/Scripts directory, so there's no need to configure environment variables. This article only scrapes with Chrome.
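To verify that setup actually works, here's a minimal sketch, assuming chromedriver.exe is reachable on PATH (python/Scripts normally is on a standard Windows Python install); the URL is just a placeholder page:

# Minimal check that selenium can drive Chrome.
from selenium import webdriver

driver = webdriver.Chrome()          # locates chromedriver.exe on PATH (e.g. python/Scripts)
driver.get('https://www.baidu.com')  # any page works; this is only a placeholder
print(driver.title)                  # prints the page title if everything is wired up
driver.quit()

If this prints a title and a Chrome window flashes by, the driver is installed correctly.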
After following a tutorial to scrape the Maoyan movie ranking, I still felt I understood nothing, and then I took on a daunting task from a friend: Zhilian Zhaopin job listings (cue the streaming tears).
I don't know many libraries yet, but at least this is a first step. I'm also puzzled by some of the code's output, and I'd love to hear feedback~
from urllib.parse import urlencode
import requests
import json
import csv
import time


def get_one_page(page):
    """Request one page of search results from Zhilian's JSON API."""
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
    }
    # lastUrlQuery mirrors the query string the site itself sends with each request
    last_url_query = {"p": page, "pageSize": "60", "jl": "489",
                      "kw": "数据分析师", "kt": "3"}
    params = {
        'pageSize': '60',
        'cityId': '489',
        'workExperience': '-1',
        'education': '-1',
        'companyType': '-1',
        'employmentType': '-1',
        'jobWelfareTag': '-1',
        'kw': '数据分析师',   # search keyword: "data analyst"
        'kt': '3',
    }
    if page == 0:
        # the first page carries neither a 'start' offset nor a page number
        last_url_query.pop('p')
    else:
        params['start'] = 60 * (page - 1)   # 60 results per page
    # serialize the nested dict so urlencode produces a clean query string
    params['lastUrlQuery'] = json.dumps(last_url_query)
    base_url = 'https://fe-api.zhaopin.com/c/i/sou?'
    url = base_url + urlencode(params)
    # print(url)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.RequestException as e:
        print('Error:', e)


def parse_page(result):
    """Pull the fields we care about out of one page of JSON."""
    if result and result.get('data'):
        data = result.get('data').get('results')
        data_list = []
        for item in data:
            job_name = item.get('jobName')
            salary = item.get('salary')
            # some nested fields can be missing, so guard each lookup
            company = (item.get('company') or {}).get('name')
            welfare = item.get('welfare')
            city = (item.get('city') or {}).get('name')
            work = (item.get('workingExp') or {}).get('name')
            edu_level = (item.get('eduLevel') or {}).get('name')
            data_list.append([job_name, company, welfare, salary,
                              city, work, edu_level])
        print(data_list)
        return data_list


def save_data(rows, write_header=False):
    # append mode, so each page adds to the file instead of overwriting it
    with open('data_zhilian_findjob.csv', 'a', newline='',
              encoding='utf-8-sig') as csvfile:
        writer = csv.writer(csvfile)
        if write_header:
            writer.writerow(['job_name', 'company', 'welfare', 'salary',
                             'city', 'workingExp', 'edu_level'])
        writer.writerows(rows)


def main():
    for page in range(20):
        result = get_one_page(page)
        data = parse_page(result)
        if data:
            save_data(data, write_header=(page == 0))
        time.sleep(0.8)   # be polite to the server


if __name__ == '__main__':
    main()
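If you want to sanity-check the request before running all 20 pages, you can fetch a single page and peek at the payload first. A quick sketch, noting that the endpoint's response shape may have changed since this was written:

# quick smoke test: fetch the first page and inspect what came back
result = get_one_page(0)
if result:
    print(list(result.keys()))                               # top-level keys of the JSON payload
    print(len(result.get('data', {}).get('results') or []))  # how many postings this page returned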