Web Scraping Case Study: 51job (前程无忧)

Development environment

Python 3.6
PyCharm

Modules used

　　requests, csv, parsel (the final script also imports re, sys, time, and lxml)

Target site

　　51job (前程无忧): https://www.51job.com/

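Analysis

First, enter a job title in the search box.

With the browser's developer tools, the response shows that the job information is all stored in window.__SEARCH_RESULT__ (searching the page source for a known value is a quick way to locate it).

We could extract the data right here, but the listing page carries only brief information. The full details live on each job's detail page, so we need to fetch every detail page.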
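As a minimal sketch of that extraction, the snippet below pulls the JSON assigned to window.__SEARCH_RESULT__ out of a page with a regular expression and parses it. The sample HTML, the "results" key, and the helper name extract_search_result are made up for illustration; only "jobid" appears in the real data discussed here, and the regex assumes the assignment is the last statement inside its script tag.

```python
import json
import re


def extract_search_result(html):
    """Return the object assigned to window.__SEARCH_RESULT__, or None if absent."""
    match = re.search(
        r'window\.__SEARCH_RESULT__\s*=\s*(\{.*?\})\s*</script>',
        html,
        re.S,  # let "." span line breaks inside the JSON
    )
    if match is None:
        return None
    return json.loads(match.group(1))


# Hypothetical page fragment; the real page contains many more fields.
sample_html = (
    '<script type="text/javascript">'
    'window.__SEARCH_RESULT__ = {"results": [{"jobid": "128178776"}]}'
    '</script>'
)

result = extract_search_result(sample_html)
print(result["results"][0]["jobid"])
```

Parsing the JSON gives structured access to every field, whereas a plain regex on one key (the approach used later in this article) is enough when only the ids are needed.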

Next, we try fetching the page with the requests library:

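import requests

# Send a browser-like User-Agent so the request looks like a normal page visit.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
}

# Listing page for the keyword "python"; the last number in the path is the page number.
res = requests.get('https://search.51job.com/list/090200,000000,0000,00,9,99,python,2,1.html', headers=headers)

# Let requests detect the encoding from the body so the Chinese text decodes correctly.
res.encoding = res.apparent_encoding
print(res.text)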

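Testing shows that the data comes back normally.

Let's look at the address of each detail page to see whether there is a pattern.

Addresses of the first three detail pages on page 1: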
https://jobs.51job.com/chengdu-gxq/128178776.html?s=01&t=0
https://jobs.51job.com/chengdu-chq/120260203.html?s=01&t=0
https://jobs.51job.com/chengdu-gxq/126058275.html?s=01&t=0

Addresses of the first two detail pages on page 2:
https://jobs.51job.com/chengdu-jnq/127709876.html?s=01&t=0
https://jobs.51job.com/chengdu-slq/123946763.html?s=01&t=0

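Apart from the district segment (for example chengdu-gxq), the only part of the address that changes is the number, which is probably the ID of the detail page. Back in window.__SEARCH_RESULT__ there is a "jobid" field. To test the guess, substitute a jobid value for the ID in any of the addresses above and check whether the corresponding page opens.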
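To make the comparison concrete, here is a small sketch that splits each of the five addresses listed above into its district segment and its numeric id. The helper name split_detail_url and the path pattern are assumptions inferred from these five samples only.

```python
import re
from urllib.parse import urlsplit

DETAIL_URLS = [
    "https://jobs.51job.com/chengdu-gxq/128178776.html?s=01&t=0",
    "https://jobs.51job.com/chengdu-chq/120260203.html?s=01&t=0",
    "https://jobs.51job.com/chengdu-gxq/126058275.html?s=01&t=0",
    "https://jobs.51job.com/chengdu-jnq/127709876.html?s=01&t=0",
    "https://jobs.51job.com/chengdu-slq/123946763.html?s=01&t=0",
]


def split_detail_url(url):
    """Return (district, job_id) for a detail-page URL, or None if it does not fit."""
    path = urlsplit(url).path  # e.g. "/chengdu-gxq/128178776.html"
    match = re.fullmatch(r"/([a-z-]+)/(\d+)\.html", path)
    if match is None:
        return None
    return match.group(1), match.group(2)


for url in DETAIL_URLS:
    print(split_detail_url(url))
```

Every address shares the same host and query string, so the id (plus the district segment) is all that distinguishes one detail page from another.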

Building on the previous code, we add a regular expression that finds the id values:

import requests
import re

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
}

res = requests.get('https://search.51job.com/list/090200,000000,0000,00,9,99,python,2,1.html', headers=headers)
res.encoding = res.apparent_encoding

# Raw string so that \d reaches the regex engine unchanged; the group captures the digits only.
p = r'"jobid":"(\d+)"'
ids = re.findall(pattern=p, string=res.text)
print(len(ids))
print(ids)

Running it finds 50 id values in total.

All that remains is to verify that a detail page can be fetched normally. The following code fetches one:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
}

# One of the detail-page addresses listed above.
res = requests.get('https://jobs.51job.com/chengdu-slq/123946763.html?s=01&t=0', headers=headers)
res.encoding = res.apparent_encoding

print(res.text)

The result contains the information we need, which can be extracted with XPath.
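The idea can be tried offline on a small fragment. The final script in this article uses parsel; to stay dependency-free, this sketch swaps in the standard library's xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup. The sample markup, its values, and the helper name first_text are invented for illustration and do not reproduce the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the header block of a detail page.
SAMPLE = """
<div class="cn">
  <h1 title="Python Developer">Python Developer</h1>
  <strong>10-15k/month</strong>
  <p class="msg" title="Chengdu | 3-4 years | Bachelor | 2 openings | 06-20"></p>
</div>
""".strip()


def first_text(root, path, default="no data"):
    """Return the stripped text of the first match, or a default when nothing matches."""
    node = root.find(path)
    if node is None or node.text is None:
        return default
    return node.text.strip()


root = ET.fromstring(SAMPLE)
title = first_text(root, "./h1")
salary = first_text(root, "./strong")
missing = first_text(root, "./h2")  # no such element, so the default is returned
summary = root.find("./p[@class='msg']").get("title")

print(title, salary, missing, summary, sep="\n")
```

Returning a default instead of indexing into the match list avoids an exception when a field is absent, which is the same reason the final script prefers parsel's extract_first().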

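Pagination is the last piece. Comparing the addresses of the first few pages reveals the pattern, with only the final number before .html (the page number) changing: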

https://search.51job.com/list/090200,000000,0000,00,9,99,python,2,2.html
https://search.51job.com/list/090200,000000,0000,00,9,99,python,2,1.html
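A short sketch of generating those listing addresses from a template follows. The helper name build_page_urls is made up here, and the keyword is inserted as-is, which is an assumption that works for a plain ASCII keyword such as "python".

```python
# Template taken from the addresses above: first slot is the keyword, second is the page number.
BASE_URL = "https://search.51job.com/list/090200,000000,0000,00,9,99,%s,2,%s.html"


def build_page_urls(keyword, pages):
    """Return the listing-page URLs for pages 1..pages (inclusive)."""
    return [BASE_URL % (keyword, number) for number in range(1, pages + 1)]


for page_url in build_page_urls("python", 2):
    print(page_url)
```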

That concludes the analysis.

Final code

import csv
import re
import sys
import time

import parsel
import requests
from lxml import etree

# Browser-like User-Agent sent with every request.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
}

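# 1. Open the output file. newline='' is required, otherwise blank rows appear between records.
f = open('前程无忧.csv', 'w', encoding="utf-8", newline='')

# 2. Build a csv writer on top of the file object.
csv_writer = csv.writer(f)

# 3. Write the header row.
csv_writer.writerow(["Job title", "Salary range", "Company", "Company profile URL", "Location", "Experience", "Education", "Openings", "Posted", "Job description"])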


def get_html(url):
    time.sleep(1)       # slow the crawler down a little to reduce the risk of being blocked
    response = requests.get(url, headers=headers)
    response.encoding = response.apparent_encoding
    return response


def get_data(url):
    """
    Request and parse the detail page at the given URL, then write the extracted row to the CSV file.

    :param url: detail-page URL
    :return: None
    """
    response = get_html(url)
    tree = etree.HTML(response.text)     # parsed tree, used only by the commented-out lxml variant below
    # # The lxml variant below locates the fields with XPath, but in practice a field is often
    # # missing, and indexing the empty result with [0] then raises an error.
    # title = tree.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/h1/text()')[0]       # job title
    # salary = tree.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/strong/text()')[0]      # salary range
    # company = tree.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/p[1]/a[1]/@title')[0]      # company
    # company_presentation = tree.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/p[1]/a[1]/@href')[0]      # company profile URL
    #
    # data = tree.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/p[2]/@title')[0].split("|")    # summary line, split into parts
    # workingPlace = data[0].strip() if data[0].strip() else "no data"  # location
    # workingTime = data[1].strip() if data[1].strip() else "no data"  # experience
    # degreeRequired = data[2].strip() if data[2].strip() else "no data"  # education
    # number = data[3].strip() if data[3].strip() else "no data"  # openings
    # releaseTime = data[4].strip() if data[4].strip() else "no data"  # posted date
    #
    # job_Requirements_base = tree.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div')[0]     # job description node
    # job_Requirements = etree.tostring(job_Requirements_base, encoding="utf-8").decode("utf-8")      # serialize the node to a string for storage
    sel = parsel.Selector(response.text)
    title = sel.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/h1/text()').extract_first()  # job title
    salary = sel.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/strong/text()').extract_first()  # salary range
    company = sel.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/p[1]/a[1]/@title').extract_first()  # company
    company_presentation = sel.xpath(
        '/html/body/div[3]/div[2]/div[2]/div/div[1]/p[1]/a[1]/@href').extract_first()  # company profile URL

    # extract_first() returns None when nothing matches, so fall back to "" before splitting.
    data = (sel.xpath('/html/body/div[3]/div[2]/div[2]/div/div[1]/p[2]/@title').extract_first() or "").split("|")

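    # Default every field to "no data", then overwrite the ones the page provides.
    workingPlace = "no data"  # location
    workingTime = "no data"  # experience
    degreeRequired = "no data"  # education
    number = "no data"  # openings
    releaseTime = "no data"  # posted date

    try:
        # "or <default>" keeps the default when a part is present but empty.
        workingPlace = data[0].strip() or workingPlace
        workingTime = data[1].strip() or workingTime
        degreeRequired = data[2].strip() or degreeRequired
        number = data[3].strip() or number
        releaseTime = data[4].strip() or releaseTime
    except IndexError:
        # Fewer than five "|"-separated parts: the remaining fields keep their defaults.
        pass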

    job_Requirements = sel.xpath('/html/body/div[3]/div[2]/div[3]/div[1]/div').extract_first()  # job description (raw HTML)

    # Print the row without the long description as progress feedback, then write the full row.
    print([title, salary, company, company_presentation, workingPlace, workingTime, degreeRequired, number, releaseTime])
    csv_writer.writerow([title, salary, company, company_presentation, workingPlace, workingTime, degreeRequired, number, releaseTime, job_Requirements])
    print("Saved!")



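def main(url):
    """
    Fetch the listing page at the given URL and scrape every detail page whose id it contains.

    :param url: listing-page URL
    :return: None
    """
    res = get_html(url)
    p = r'"jobid":"(\d+)"'
    ids = re.findall(pattern=p, string=res.text)        # pull the ids out with a regex
    # The district segment is hard-coded here; only the id is substituted.
    base_detail_url = "https://jobs.51job.com/chengdu-gxq/%s.html?s=01&t=0"
    for job_id in ids:      # named job_id to avoid shadowing the built-in id()
        get_data(base_detail_url % job_id)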



if __name__ == '__main__':
    position = input("Job title to search for: ")
    page = input("Number of pages to fetch: ")
    base_url = "https://search.51job.com/list/090200,000000,0000,00,9,99,%s,2,%s.html"

    try:
        page = int(page)
    except ValueError:
        f.close()
        sys.exit("The page count must be an integer.")      # stop the whole program

    for i in range(1, page+1):
        print("-----------Fetching page %s-----------" % i)
        main(base_url % (position, i))
    f.close()

Part of the scraped data turns out to be messy, so it needs cleaning before any visualization or analysis.
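One plausible source of the mess is the "|"-separated summary line: when a posting omits a part, splitting by position shifts the later values into the wrong columns. The sketch below assigns each part by its content instead. The marker strings it looks for ("经验", "招…人", "发布"), the sample line, and the function name classify_summary are assumptions about how the site words these parts, not something confirmed above.

```python
import re

DEFAULT = "no data"


def classify_summary(summary):
    """Map the parts of a "|"-separated summary line to named fields by content."""
    fields = {
        "place": DEFAULT,
        "experience": DEFAULT,
        "degree": DEFAULT,
        "number": DEFAULT,
        "released": DEFAULT,
    }
    parts = [part.strip() for part in summary.split("|") if part.strip()]
    if not parts:
        return fields
    fields["place"] = parts[0]  # assumed: the location always comes first
    for part in parts[1:]:
        if "经验" in part:                        # assumed experience marker
            fields["experience"] = part
        elif re.fullmatch(r"招.+人", part):       # assumed openings marker
            fields["number"] = part
        elif part.endswith("发布"):               # assumed posted-date marker
            fields["released"] = part
        elif fields["degree"] == DEFAULT:         # first unrecognized part is taken as education
            fields["degree"] = part
    return fields


# Hypothetical summary line with the education part missing.
row = classify_summary("成都-高新区 | 无需经验 | 招1人 | 06-20发布")
print(row)
```

With positional splitting this sample would put the openings value into the education column; classifying by content leaves education at its default instead.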