python+requests+mysql: scraping 双色球 (Double Color Ball) draw results from the official site

Analyzing how the page gets its data

First query method
Second query method

Complete code

Analyzing how the page gets its data

First query method

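The official site offers several ways to query the draw data. The first is to query by issue number: pass the issue as the code parameter and you get the data for that draw.
The query URL looks like this: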
Request URL: http://www.cwl.gov.cn/cwl_admin/kjxx/findKjxx/forIssue?name=ssq&code=2018126
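As a small illustration of this first method, here is a sketch that builds the single-issue URL from an issue code. The function name build_issue_url and the constant FOR_ISSUE_ENDPOINT are my own, not something the site defines; only the endpoint and the name/code parameters come from the URL above.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL shown above
FOR_ISSUE_ENDPOINT = "http://www.cwl.gov.cn/cwl_admin/kjxx/findKjxx/forIssue"


def build_issue_url(code, name="ssq"):
    """Return the single-issue query URL for an issue code such as '2018126'."""
    return FOR_ISSUE_ENDPOINT + "?" + urlencode({"name": name, "code": code})


print(build_issue_url("2018126"))
```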

Second query method

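The site also has ways to query data in bulk. Among the custom query options, querying by issue range is the most convenient.

With this method you can specify a range, for example from issue 2013001 to 2013101, or even from 2013001 to 2015001. If the range is large, though, the returned data is split into pages.
The URL looks like this: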
Request URL: http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice?name=ssq&issueCount=&issueStart=2013001&issueEnd=2015001&dayStart=&dayEnd=&pageNo=
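Since large ranges come back in pages, a loop over the pages is needed to collect everything. The sketch below is only an outline under assumptions: fetch_page is a hypothetical callable that takes a page number and returns the decoded JSON for that page, and I assume pages are numbered from 1 and that the response reports the total in a pageCount field (that field does appear in the responses I got, but I have not verified the paging behavior in detail).

```python
def fetch_all_pages(fetch_page):
    """Collect the 'result' rows from every page of a query.

    fetch_page is a hypothetical callable: page number -> decoded JSON dict.
    """
    rows = []
    page_no = 1  # assumption: page numbering starts at 1
    while True:
        payload = fetch_page(page_no)
        rows.extend(payload.get("result", []))
        # Stop once the page count reported by the server has been reached
        if page_no >= int(payload.get("pageCount", 1)):
            break
        page_no += 1
    return rows
```

In a real run, fetch_page would send the request with the pageNo parameter set and return the parsed JSON.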

Some of the parameters can be empty, so the URL can be trimmed down:
Url: http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice?name=ssq&issueStart=2018001&issueEnd=2018051
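The trimming can also be done in code instead of by hand. This is a minimal sketch, assuming that leaving a parameter out behaves the same as sending it empty (which is what the shortened URL above suggests). The helper name build_draw_notice_url is mine.

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL shown above
DRAW_NOTICE_ENDPOINT = "http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice"


def build_draw_notice_url(**params):
    """Build the batch-query URL, leaving out parameters whose value is empty."""
    kept = {key: value for key, value in params.items() if value not in ("", None)}
    return DRAW_NOTICE_ENDPOINT + "?" + urlencode(kept)


url = build_draw_notice_url(name="ssq", issueCount="", issueStart="2018001",
                            issueEnd="2018051", dayStart="", dayEnd="", pageNo="")
print(url)
```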

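Note: if you open the link directly you will not see anything. The server appears to check where the request comes from (some kind of Referer restriction).
This pitfall troubled me for a while. I had only just started with Python scraping and already ran into a tricky problem like this.
I did not start by practicing on a Douban scraper. Instead I picked a link with data I actually needed and started from there.
You hit more pitfalls that way, but you also learn more while solving them.
After comparing request headers over and over, I finally found the cause:
without Referer information in the request headers, no data comes back at all.
You can see it by comparing the headers of the two requests.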

Figure 1: the request headers of the request that returned data
Figure 2: the request that returned no data at all
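Comparing two sets of headers by eye is tedious, so here is a rough sketch of doing it in code. The two header dictionaries below are made-up examples, not copies of the screenshots; diff_headers is my own helper name. Header names are compared case-insensitively.

```python
def diff_headers(working, failing):
    """Return (missing, changed): header names absent from, or different in, the failing request."""
    working_norm = {name.lower(): value for name, value in working.items()}
    failing_norm = {name.lower(): value for name, value in failing.items()}
    missing = sorted(set(working_norm) - set(failing_norm))
    changed = sorted(
        name for name in set(working_norm) & set(failing_norm)
        if working_norm[name] != failing_norm[name]
    )
    return missing, changed


# Illustrative values only
browser_headers = {
    "Accept": "application/json",
    "Referer": "http://www.cwl.gov.cn/kjxx/ssq/kjgg/",
    "User-Agent": "Mozilla/5.0",
}
script_headers = {
    "accept": "application/json",
    "user-agent": "python-requests",
}

missing, changed = diff_headers(browser_headers, script_headers)
print("missing:", missing)
print("changed:", changed)
```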

First, fetch a small amount of data and take a look:

import requests

url = 'http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice?name=ssq&issueStart=2018001&issueEnd=2018003'
res = requests.get(url)
print(res.text)

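Code like this works for some URLs, but here it returned nothing at first, which was really confusing.
My first thought was: is it because there is no User-Agent?
Guessing is useless, so test it. Add a User-Agent to the request headers and try again.
The code: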

url = 'http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice?name=ssq&issueStart=2018001&issueEnd=2018003'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
}
res = requests.get(url, headers=headers)
print(res.text)

Still no data, so the User-Agent is not the only issue.
As mentioned above, repeatedly comparing the request headers finally solved the problem:

url = 'http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice?name=ssq&issueStart=2018001&issueEnd=2018003'
headers = {
    'Referer': 'http://www.cwl.gov.cn/kjxx/ssq/kjgg/',
}
res = requests.get(url, headers=headers)
print(res.text)

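Trying again, it turns out the User-Agent is not needed at all. The code above is the final version that gets the data.
Now the returned data can be fetched easily.

To make the analysis easier, I first pulled just two draws.
The data looks like this: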

{
	'state': 0,
	'message': '查询成功',
	'pageCount': 1,
	'countNum': 2,
	'Tflag': 1,
	'result': [{
		'name': '双色球',
		'code': '2014004',
		'detailsLink': '/c/2014-01-09/384956.shtml',
		'videoLink': '',
		'date': '2014-01-09(四)',
		'week': '四',
		'red': '01,04,19,22,24,25',
		'blue': '15',
		'blue2': '',
		'sales': '373155800',
		'poolmoney': '201488460',
		'content': '北京1注,安徽1注,山东2注,广东1注,共5注。',
		'addmoney': '',
		'addmoney2': '',
		'msg': '',
		'z2add': '',
		'm2add': '',
		'prizegrades': [{
			'type': 1,
			'typenum': '5',
			'typemoney': '9580091'
		}, {
			'type': 2,
			'typenum': '113',
			'typemoney': '303988'
		}, {
			'type': 3,
			'typenum': '1035',
			'typemoney': '3000'
		}, {
			'type': 4,
			'typenum': '56508',
			'typemoney': '200'
		}, {
			'type': 5,
			'typenum': '1166507',
			'typemoney': '10'
		}, {
			'type': 6,
			'typenum': '8454477',
			'typemoney': '5'
		}, {
			'type': 7,
			'typenum': '',
			'typemoney': ''
		}]
	}, {
		'name': '双色球',
		'code': '2014003',
		'detailsLink': '/c/2014-01-07/384933.shtml',
		'videoLink': '',
		'date': '2014-01-07(二)',
		'week': '二',
		'red': '06,10,11,28,30,33',
		'blue': '12',
		'blue2': '',
		'sales': '363993362',
		'poolmoney': '169237320',
		'content': '内蒙古1注,江苏2注,福建1注,四川1注,贵州1注,共6注。',
		'addmoney': '',
		'addmoney2': '',
		'msg': '',
		'z2add': '',
		'm2add': '',
		'prizegrades': [{
			'type': 1,
			'typenum': '6',
			'typemoney': '7804024'
		}, {
			'type': 2,
			'typenum': '147',
			'typemoney': '171674'
		}, {
			'type': 3,
			'typenum': '1778',
			'typemoney': '3000'
		}, {
			'type': 4,
			'typenum': '85239',
			'typemoney': '200'
		}, {
			'type': 5,
			'typenum': '1536591',
			'typemoney': '10'
		}, {
			'type': 6,
			'typenum': '11297659',
			'typemoney': '5'
		}, {
			'type': 7,
			'typenum': '',
			'typemoney': ''
		}]
	}]
}

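The data comes back as JSON, which saves the work of matching the values out of an HTML page.
All that is left is to pick the fields we need out of the JSON. Let's first look at what comes back: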

'state': 0,
	'message': '查询成功',
	'pageCount': 1,
	'countNum': 2,
	'Tflag': 1,
	'result': [{

The state field is the query status; 0 means the returned JSON contains data.
The result field holds the data we need.
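Put into code, the check on state and the read of result might look like the sketch below. The function name extract_results is mine, and treating any non-zero state as "no data" is my assumption; I only know for sure that 0 means success.

```python
import json


def extract_results(text):
    """Parse the response body and return the list under 'result', or [] when state is not 0."""
    payload = json.loads(text)
    # Assumption: any state other than 0 means there is nothing usable
    if int(payload.get("state", -1)) != 0:
        return []
    return payload.get("result", [])


sample = '{"state": 0, "message": "ok", "pageCount": 1, "countNum": 1, "result": [{"code": "2014004", "red": "01,04,19,22,24,25", "blue": "15"}]}'
for draw in extract_results(sample):
    print(draw["code"], draw["red"], draw["blue"])
```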

'result': [{
		'name': '双色球',
		'code': '2014004',
		'detailsLink': '/c/2014-01-09/384956.shtml',
		'videoLink': '',
		'date': '2014-01-09(四)',
		'week': '四',
		'red': '01,04,19,22,24,25',
		'blue': '15',
		'blue2': '',
		'sales': '373155800',
		'poolmoney': '201488460',
		'content': '北京1注,安徽1注,山东2注,广东1注,共5注。',
		'addmoney': '',
		'addmoney2': '',
		'msg': '',
		'z2add': '',
		'm2add': '',
	'prizegrades': [{
			'type': 1,
			'typenum': '5',
			'typemoney': '9580091'
		}, {
			'type': 2,
			'typenum': '113',
			'typemoney': '303988'
		}, {
			'type': 3,
			'typenum': '1035',
			'typemoney': '3000'
		}, {
			'type': 4,
			'typenum': '56508',
			'typemoney': '200'
		}, {
			'type': 5,
			'typenum': '1166507',
			'typemoney': '10'
		}, {
			'type': 6,
			'typenum': '8454477',
			'typemoney': '5'
		}, {
			'type': 7,
			'typenum': '',
			'typemoney': ''
		}]

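The prizegrades field holds the prize grade, the number of winning tickets, and the prize amount (the content field is the regional distribution of the first-prize winners).

Also note that there is no seventh prize grade, so that entry is empty. Watch out for it when inserting into the database.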

 {
			'type': 7,
			'typenum': '',
			'typemoney': ''
		}]
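One way to deal with the empty entry is to skip any grade whose fields are blank rather than hard-coding grade 7. This is just a sketch; valid_prize_grades is a name I made up, and the sample list is cut down from the response shown earlier.

```python
def valid_prize_grades(prizegrades):
    """Return (grade, winners, prize) tuples, skipping entries with empty fields."""
    rows = []
    for grade in prizegrades:
        if grade["typenum"] == "" or grade["typemoney"] == "":
            continue  # e.g. grade 7, which does not exist
        rows.append((int(grade["type"]), int(grade["typenum"]), int(grade["typemoney"])))
    return rows


sample_grades = [
    {"type": 1, "typenum": "5", "typemoney": "9580091"},
    {"type": 6, "typenum": "8454477", "typemoney": "5"},
    {"type": 7, "typenum": "", "typemoney": ""},
]
print(valid_prize_grades(sample_grades))
```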

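The data I need: the red and blue ball numbers, the draw date, the issue number, the sales amount, the prize pool amount, and the detailed prize breakdown.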
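To show how those fields could be pulled out of one result entry, here is a small sketch that also converts the strings into more useful types. The name to_base_record is mine, and it assumes the date field always starts with a YYYY-MM-DD date followed by the weekday in parentheses, as in the sample data.

```python
from datetime import date


def to_base_record(result):
    """Convert one entry of 'result' into typed values."""
    return {
        "code": result["code"],
        # '2014-01-09(四)' -> date(2014, 1, 9); assumes the first 10 characters are the date
        "date": date.fromisoformat(result["date"][:10]),
        "red": [int(n) for n in result["red"].split(",")],
        "blue": int(result["blue"]),
        "sales": int(result["sales"]),
        "poolmoney": int(result["poolmoney"]),
    }


sample_result = {
    "code": "2014004",
    "date": "2014-01-09(四)",
    "red": "01,04,19,22,24,25",
    "blue": "15",
    "sales": "373155800",
    "poolmoney": "201488460",
}
print(to_base_record(sample_result))
```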

Complete code

With the analysis done, time to write the code.
The database is MySQL.

import requests
import json
import time
import mysql.connector

class Lottery(object):

    def __init__(self):
        self.db = mysql.connector.connect(
            host='localhost',
            user='root',  # your MySQL user name
            passwd='',  # your MySQL password
            database='lottery'
        )
        self.cursor = self.db.cursor()
        self.baseUrl = "http://www.cwl.gov.cn/cwl_admin/kjxx/findDrawNotice?name=ssq"
        self.headers = {
            'Referer': 'http://www.cwl.gov.cn/kjxx/ssq/kjgg/',
        }
        self.session = requests.Session()
        # First and last issue number within a year, and the issue interval per request
        self.lastIssue = 157
        self.firstIssue = 1
        self.page = 50
   

    # Take an issue number (int) and return it as a zero-padded three-digit string
    def getIssueStr(self, issue):
        issue = int(issue)
        # Clamp to the valid range: 0 becomes '001', anything past lastIssue becomes lastIssue.
        # (The earlier chain of if-statements returned '' when issue == lastIssue.)
        issue = max(1, min(issue, self.lastIssue))
        return str(issue).zfill(3)

    # Build the target URL
    def getUrl(self, year, startIssue, lastIssue):
        endIssue = self.getIssueStr(lastIssue)
        Url = self.baseUrl + '&issueStart=' + str(year) + \
              self.getIssueStr(startIssue) + '&issueEnd=' + str(year) + endIssue
        print("Url:", Url)
        return Url

    def getResponse(self, url):
        # Use the session created in __init__ so the connection can be reused
        response = self.session.get(url, headers=self.headers)
        if response.status_code != 200:
            return 'error'
        else:
            return response

    def run(self):
        # Years to fetch. The official site only has data from 2013 onward.
        # range() excludes its end value: range(2013, 2015) covers 2013 and 2014,
        # so to include 2018 the end has to be 2019. A span of about three years works fine.
        years = range(2013, 2015)
        issueList = range(self.firstIssue, self.lastIssue, self.page + 1)  # first issue of each batch
        for year in years:
            for issue in issueList:
                data = self.getResponse(self.getUrl(year, issue, issue + self.page))
                if data == 'error':
                    print("response error")
                    continue
                else:
                    self.saveData(data)
                    time.sleep(5)




    def saveData(self, response):
        res = json.loads(response.text)
        state = res['state']
        if int(state) != 0:
            print("no data")
            return
        resultList = res['result']
        for result in resultList:
            code = result['code']
            date = result['date']
            redballs = result['red']
            blueball = result['blue']
            sales = result['sales']
            poolmoney = result['poolmoney']
            content = result['content']
            prizegrades = result['prizegrades']  # list of prize details
            for pri in prizegrades:
                level = pri['type']
                typenum = pri['typenum']
                typemoney = pri['typemoney']
                if int(level) == 7:  # this grade does not exist and its fields are empty, so skip it
                    continue
                self.cursor.execute(
                    'insert into bingo (code,level,num,money) VALUES (%s,%s,%s,%s)',
                    (code, int(level), int(typenum), typemoney))
            self.cursor.execute(
                'insert into base (code,redballs,blueball,date,sales,poolmoney,content) VALUES (%s,%s,%s,%s,%s,%s,%s)',
                (code, redballs, blueball, date, int(sales), int(poolmoney), content))
            self.db.commit()


if __name__ == '__main__':
    lottery = Lottery()
    lottery.run()
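To make the batching in run() easier to see, here is a standalone sketch that produces the same kind of issue windows without touching the network or the database. issue_windows is a hypothetical helper, not part of the class above; with a window of 51 issues it splits issues 1 to 157 into four requests per year.

```python
def issue_windows(first, last, size):
    """Yield (start, end) three-digit issue strings covering first..last in windows of `size` issues."""
    start = first
    while start <= last:
        end = min(start + size - 1, last)  # the final window is cut off at `last`
        yield str(start).zfill(3), str(end).zfill(3)
        start = end + 1


for start, end in issue_windows(1, 157, 51):
    print("issueStart=2013" + start, "issueEnd=2013" + end)
```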

Database script

/*
Navicat MySQL Data Transfer

Source Server         : 127.0.0.1_3306
Source Server Version : 80011
Source Host           : 127.0.0.1:3306
Source Database       : lottery

Target Server Type    : MYSQL
Target Server Version : 80011
File Encoding         : 65001

Date: 2018-11-01 16:44:33
*/

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for base
-- ----------------------------
DROP TABLE IF EXISTS `base`;
CREATE TABLE `base` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `code` varchar(50) NOT NULL,
  `redballs` varchar(50) NOT NULL,
  `blueball` varchar(5) NOT NULL,
  `date` varchar(50) NOT NULL,
  `sales` int(11) NOT NULL COMMENT 'total sales',
  `poolmoney` int(11) NOT NULL COMMENT 'prize pool',
  `content` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=897 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

-- ----------------------------
-- Table structure for bingo
-- ----------------------------
DROP TABLE IF EXISTS `bingo`;
CREATE TABLE `bingo` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `code` varchar(255) NOT NULL,
  `level` tinyint(4) NOT NULL,
  `num` int(11) NOT NULL DEFAULT '0',
  `money` varchar(50) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5377 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

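There is one more problem: with too many requests, the server forcibly drops the connection. For now, keep the year span small
so that fewer requests are made, and scrape in batches.
The code calls time.sleep() to try to avoid the dropped connections. The interval may be too short, because it did not have the intended effect. You can remove it, or try a longer interval.
This is my first attempt at a Python scraper, so please point out any mistakes in the article.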
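Another thing that might help with dropped connections is retrying with a growing delay instead of a fixed sleep. I have not tested this against the site, so treat it as a sketch only. call_with_retries is a made-up helper; it catches OSError, which covers Python's built-in connection errors and, as far as I know, the exceptions raised by requests as well.

```python
import time


def call_with_retries(action, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call action(); on OSError wait base_delay, then 2x, 4x, ... and try again."""
    for attempt in range(attempts):
        try:
            return action()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts, let the error through
            sleep(base_delay * 2 ** attempt)
```

It could wrap a single request, for example call_with_retries(lambda: lottery.getResponse(url)), so that one dropped connection does not lose the whole batch.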