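Hive analytic window functions SUM, AVG, MIN, and MAX: computing, over a period of historical data, the cumulative, average, minimum, and maximum visit counts up to each day
Hive provides many analytic functions for complex statistical analysis.
This article covers four of them first: SUM, AVG, MIN, and MAX.
1. Create the data file data.txt with the following content: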
P0887,2016-02-10,1
P0887,2016-02-11,3
P0887,2016-02-12,1
P0887,2016-02-13,9
P0887,2016-02-14,3
P0887,2016-02-15,12
P0889,2016-02-16,2
P0889,2016-02-14,3
P0889,2016-02-15,10
P0889,2016-02-16,6
P0890,2016-02-14,1
P0890,2016-02-15,19
P0890,2016-02-16,30
2. Create the Hive table and load the data:
CREATE TABLE `yyz_func`(
`polno` string COMMENT 'polno',
`createtime` string COMMENT 'createtime',
`pnum` int COMMENT 'pnum')
COMMENT 'yyz_func'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
load data local inpath '/home/workdir/yyz/data/data' into table yyz_func;
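As a rough illustration of what ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' does with each line of the data file, here is a minimal Python sketch. The `Row` type and `parse_line` helper are hypothetical names used only for illustration, not part of Hive; unlike Hive, which would yield NULL for an unparsable integer, this sketch simply raises an error.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """Hypothetical mirror of the yyz_func columns."""
    polno: str
    createtime: str
    pnum: int


def parse_line(line: str) -> Row:
    # Split on the field delimiter and convert the third field to int,
    # matching the (string, string, int) column types of the table.
    polno, createtime, pnum = line.strip().split(",")
    return Row(polno, createtime, int(pnum))


# Three lines taken from data.txt
sample = ["P0887,2016-02-10,1", "P0889,2016-02-16,2", "P0890,2016-02-16,30"]
rows: List[Row] = [parse_line(s) for s in sample]
for r in rows:
    print(r.polno, r.createtime, r.pnum)
```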
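3. To make the results easier to read, set the relevant options so that column names are displayed. Without them, the output has no header row: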
spark-hive> select * from yyz_func;
19/12/17 11:18:42 INFO SparkHiveShell: current SQL: select * from yyz_func
P0887 2016-02-10 1
P0887 2016-02-11 3
P0887 2016-02-12 1
P0887 2016-02-13 9
P0887 2016-02-14 3
P0887 2016-02-15 12
P0889 2016-02-16 2
P0889 2016-02-14 3
P0889 2016-02-15 10
P0889 2016-02-16 6
P0890 2016-02-14 1
P0890 2016-02-15 19
P0890 2016-02-16 30
Time taken: 1.874 s
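When testing on the company cluster's CLI, the Hive command line either printed no column names or printed them prefixed with the table name, both of which are hard to read.
1) To display column names, run the following after entering the CLI: set hive.cli.print.header=true;
In some versions of the Hive CLI, this command alone prints headers as table name + column name (spark-hive prints only the column name), which is still hard to read.
To display column names without the table name: set hive.resultset.use.unique.column.names=false;
2) A property set in the CLI lasts only for the current session. To make it permanent, put the commands above into the configuration file under hive/conf or into a .hiverc file. Each time the CLI starts, before the prompt appears, Hive looks for a file named .hiverc in the HOME directory and runs every command in it, so the .hiverc file in the home directory is a good place for Hive initialization settings.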
spark-hive> set hive.cli.print.header=true;
19/12/17 11:19:34 INFO SparkHiveShell: current SQL: set hive.cli.print.header=true
Time taken: 4.205 s
spark-hive> select * from yyz_func;
19/12/17 11:19:40 INFO SparkHiveShell: current SQL: select * from yyz_func
polno createtime pnum
P0887 2016-02-10 1
P0887 2016-02-11 3
P0887 2016-02-12 1
P0887 2016-02-13 9
P0887 2016-02-14 3
P0887 2016-02-15 12
P0889 2016-02-16 2
P0889 2016-02-14 3
P0889 2016-02-15 10
P0889 2016-02-16 6
P0890 2016-02-14 1
P0890 2016-02-15 19
P0890 2016-02-16 30
Time taken: 4.09 s
4. SUM example: the cumulative visit count up to each day within a period of historical data:
SELECT polno,
createtime,
pnum,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, -- default frame: start of partition through the current row
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, -- start of partition through the current row
SUM(pnum) OVER(PARTITION BY polno) AS pnum3, -- all rows in the partition
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, -- current row + the 3 rows before it
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, -- current row + the 3 rows before it + the 1 row after it
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 -- current row + all rows after it
FROM yyz_func;
Concrete example and results:
SELECT polno,
createtime,
pnum,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2,
SUM(pnum) OVER(PARTITION BY polno) AS pnum3,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5,
SUM(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6
FROM yyz_func;
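To make the six frames concrete, the following Python sketch emulates them for the P0887 partition (pnum values 1, 3, 1, 9, 3, 12, ordered by createtime). It is an illustration of the frame arithmetic only, not Hive output; `frame_sum` is a hypothetical helper, and pnum1 is omitted because it equals pnum2 here, since P0887 has no duplicate dates.

```python
from typing import List, Optional


def frame_sum(values: List[int], i: int,
              preceding: Optional[int], following: Optional[int]) -> int:
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING
    for the row at index i. None stands for UNBOUNDED, 0 for CURRENT ROW."""
    start = 0 if preceding is None else max(0, i - preceding)
    end = len(values) if following is None else min(len(values), i + following + 1)
    return sum(values[start:end])


# pnum values of partition P0887, already ordered by createtime
p0887 = [1, 3, 1, 9, 3, 12]
idx = range(len(p0887))

pnum2 = [frame_sum(p0887, i, None, 0) for i in idx]  # UNBOUNDED PRECEDING .. CURRENT ROW
pnum3 = [sum(p0887)] * len(p0887)                    # whole partition
pnum4 = [frame_sum(p0887, i, 3, 0) for i in idx]     # 3 PRECEDING .. CURRENT ROW
pnum5 = [frame_sum(p0887, i, 3, 1) for i in idx]     # 3 PRECEDING .. 1 FOLLOWING
pnum6 = [frame_sum(p0887, i, 0, None) for i in idx]  # CURRENT ROW .. UNBOUNDED FOLLOWING

print(pnum2)  # [1, 4, 5, 14, 17, 29]
print(pnum3)  # [29, 29, 29, 29, 29, 29]
print(pnum4)  # [1, 4, 5, 14, 16, 25]
print(pnum5)  # [4, 5, 14, 17, 28, 25]
print(pnum6)  # [29, 28, 25, 24, 15, 12]
```

Near the partition edges the frame is simply cut off: the first row of pnum4 has no preceding rows, so its sum is just its own value.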
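Notes:
1) With ORDER BY but no ROWS BETWEEN, the default frame runs from the start of the partition to the current row. Strictly, this default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that share the current row's createtime are also included;
2) Without ORDER BY, all values in the partition are summed.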
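One subtlety behind note 1: the default frame is RANGE-based rather than ROWS-based, so pnum1 and pnum2 can differ when several rows share the same ORDER BY value. In the sample data, P0889 has two rows dated 2016-02-16. The Python sketch below emulates both frames for that partition; it illustrates the semantics and is not Hive output, and the order of the two tied rows is an assumption, since Hive does not guarantee it.

```python
# Rows of partition P0889 as (createtime, pnum); 2016-02-16 appears twice.
p0889 = [("2016-02-16", 2), ("2016-02-14", 3), ("2016-02-15", 10), ("2016-02-16", 6)]

# ORDER BY createtime. Python's sort is stable, so the tied rows keep their
# input order; in Hive the order among ties is arbitrary.
ordered = sorted(p0889, key=lambda r: r[0])

# ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: counts physical rows.
rows_sum = [sum(p for _, p in ordered[: i + 1]) for i in range(len(ordered))]

# RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: includes every row whose
# createtime is less than or equal to the current row's, peers included.
range_sum = [sum(p for d, p in ordered if d <= cur) for cur, _ in ordered]

print(rows_sum)   # [3, 13, 15, 21]
print(range_sum)  # [3, 13, 21, 21]
```

Both 2016-02-16 rows get 21 under the RANGE frame, whereas the ROWS frame gives them different running totals.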
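The ROWS BETWEEN clause, also called the WINDOW clause, uses these keywords:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no limit. UNBOUNDED PRECEDING means starting from the first row of the partition; UNBOUNDED FOLLOWING means ending at the last row of the partition
AVG, MIN, and MAX are used the same way as SUM.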
5. AVG example: the average visit count up to each day within a period of historical data:
SELECT polno,
createtime,
pnum,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1, -- default frame: start of partition through the current row
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2, -- start of partition through the current row
AVG(pnum) OVER(PARTITION BY polno) AS pnum3, -- all rows in the partition
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4, -- current row + the 3 rows before it
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5, -- current row + the 3 rows before it + the 1 row after it
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6 -- current row + all rows after it
FROM yyz_func;
-- Same query, with each average kept to two decimal places
SELECT polno,
createtime,
pnum,
CAST(pnum1 AS decimal(10,2)) AS pnum1,
CAST(pnum2 AS decimal(10,2)) AS pnum2,
CAST(pnum3 AS decimal(10,2)) AS pnum3,
CAST(pnum4 AS decimal(10,2)) AS pnum4,
CAST(pnum5 AS decimal(10,2)) AS pnum5,
CAST(pnum6 AS decimal(10,2)) AS pnum6
FROM
(
SELECT polno,
createtime,
pnum,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2,
AVG(pnum) OVER(PARTITION BY polno) AS pnum3,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5,
AVG(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6
FROM yyz_func
) aa;
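As a cross-check of what the cumulative average kept to two decimal places should look like, this Python sketch computes it for the P0887 partition. It is an emulation, not Hive output; `running_avg` is a hypothetical helper, and the half-up rounding is an assumption about how the cast to decimal(10,2) behaves.

```python
from decimal import Decimal, ROUND_HALF_UP
from typing import List


def running_avg(values: List[int]) -> List[Decimal]:
    """Average from the first row through the current row, kept to 2 decimals."""
    total = 0
    out = []
    for count, v in enumerate(values, start=1):
        total += v
        avg = Decimal(total) / Decimal(count)
        # Assumption: two decimal places with half-up rounding.
        out.append(avg.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP))
    return out


# pnum values of partition P0887, already ordered by createtime
p0887 = [1, 3, 1, 9, 3, 12]
avg_to_date = [str(x) for x in running_avg(p0887)]
print(avg_to_date)  # ['1.00', '2.00', '1.67', '3.50', '3.40', '4.83']
```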
P.S. For ways to keep a fixed number of decimal places in Hive, see https://blog.csdn.net/helloxiaozhe/article/details/103578666
6. MAX example: the largest single-day visit count within a period of historical data:
SELECT polno,
createtime,
pnum,
max(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1,
max(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2,
max(pnum) OVER(PARTITION BY polno) AS pnum3,
max(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4,
max(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5,
max(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6
FROM yyz_func;
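For the P0887 partition, the running maximum (pnum1/pnum2) and the sliding-window maximum (pnum5) can be emulated in a few lines of Python. This is a sketch of the frame semantics under the same ordering as the query, not Hive output.

```python
from itertools import accumulate

# pnum values of partition P0887, already ordered by createtime
p0887 = [1, 3, 1, 9, 3, 12]

# UNBOUNDED PRECEDING .. CURRENT ROW: largest value seen so far
running_max = list(accumulate(p0887, max))

# 3 PRECEDING .. 1 FOLLOWING: slice from i-3 (clamped at 0) up to and including i+1
window_max = [max(p0887[max(0, i - 3): i + 2]) for i in range(len(p0887))]

print(running_max)  # [1, 3, 3, 9, 9, 12]
print(window_max)   # [3, 3, 9, 9, 12, 12]
```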
7. MIN example: the smallest single-day visit count within a period of historical data:
SELECT polno,
createtime,
pnum,
min(pnum) OVER(PARTITION BY polno ORDER BY createtime) AS pnum1,
min(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pnum2,
min(pnum) OVER(PARTITION BY polno) AS pnum3,
min(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS pnum4,
min(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND 1 FOLLOWING) AS pnum5,
min(pnum) OVER(PARTITION BY polno ORDER BY createtime ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS pnum6
FROM yyz_func;
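With MIN, the forward-looking frame (pnum6) is the more informative one on this data, because the first P0887 value is already the partition minimum. A small Python emulation of both frames, again only a sketch and not Hive output:

```python
# pnum values of partition P0887, already ordered by createtime
p0887 = [1, 3, 1, 9, 3, 12]

# UNBOUNDED PRECEDING .. CURRENT ROW: smallest value seen so far (pnum2)
running_min = [min(p0887[: i + 1]) for i in range(len(p0887))]

# CURRENT ROW .. UNBOUNDED FOLLOWING: smallest value from here to the end (pnum6)
suffix_min = [min(p0887[i:]) for i in range(len(p0887))]

print(running_min)  # [1, 1, 1, 1, 1, 1]
print(suffix_min)   # [1, 1, 1, 3, 3, 12]
```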
Reference: https://blog.csdn.net/jiangshouzhuang/article/details/51057093