Common Basic Hive DML

DML:

Basic SELECT operations
Syntax: largely the same as standard SQL

SELECT [ALL | DISTINCT] select_expr, select_expr, ... 
FROM table_reference
[WHERE where_condition] 
[GROUP BY col_list [HAVING condition]] 
[CLUSTER BY col_list 
  | [DISTRIBUTE BY col_list] [SORT BY| ORDER BY col_list] 
] 
[LIMIT number]

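Notes: 1. ORDER BY performs a global sort of the input, so it runs with a single reducer; when the input is large, the computation can take a long time.
2. SORT BY is not a global sort; rows are sorted before they enter each reducer. So if you sort with SORT BY and set mapred.reduce.tasks > 1, each reducer's output is sorted, but the overall result is not globally ordered.
3. DISTRIBUTE BY (columns) sends rows to different reducers based on the given columns, using hash partitioning.
4. CLUSTER BY (columns) does what DISTRIBUTE BY does and also sorts by the same columns.
So when the distribution column and the sort column are the same, CLUSTER BY = DISTRIBUTE BY + SORT BY (ascending order only).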
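
The difference between these clauses is easier to see side by side. The following is a minimal sketch, assuming the emp table used later in this note also has a deptno column (that column is an assumption for illustration, not something stated above).

```sql
-- Use several reducers so the difference is visible.
SET mapred.reduce.tasks=3;

-- Global sort: all rows go through a single reducer.
SELECT ename, sal FROM emp ORDER BY sal DESC;

-- Rows with the same deptno (hypothetical column) go to the same reducer;
-- within each reducer the output is sorted by sal, but not globally.
SELECT ename, deptno, sal
FROM emp
DISTRIBUTE BY deptno
SORT BY sal DESC;

-- Same column for distribution and sorting: these two queries are equivalent.
SELECT ename, deptno, sal FROM emp DISTRIBUTE BY deptno SORT BY deptno;
SELECT ename, deptno, sal FROM emp CLUSTER BY deptno;
```

Use DISTRIBUTE BY with SORT BY when the sort column differs from the distribution column or when descending order is needed, since CLUSTER BY does not accept a sort direction.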

CASE WHEN ... THEN:

select ename, sal,
  case
    when sal > 1 and sal <= 1000 then 'lower'
    when sal > 1000 and sal <= 2000 then 'middle'
    when sal > 2000 and sal <= 3000 then 'high'
    else 'highest'
  end
from emp;

Hive built-in functions:

-- List all Hive built-in functions
show functions;
-- Show detailed information about a built-in function
desc function extended abs;

lower: converts a string to lowercase
upper: converts a string to uppercase

current_date: returns the current date
current_timestamp: returns the current timestamp (year, month, day, hour, minute, second)

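unix_timestamp(): returns the current Unix timestamp (seconds elapsed since 1970-01-01 00:00:00 UTC)
unix_timestamp(string date): converts a time string in yyyy-MM-dd HH:mm:ss format to a Unix timestamp in seconds
unix_timestamp(string date, string pattern): converts a time string in the given format to a Unix timestamp in seconds
pattern: the time format of the input string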
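
As a rough illustration of the three call forms, the queries below use made-up literal values. The results depend on the session time zone, so no expected output is shown.

```sql
-- No argument: current Unix timestamp in seconds.
SELECT unix_timestamp();

-- One argument: the string must be in yyyy-MM-dd HH:mm:ss format.
SELECT unix_timestamp('2019-01-01 00:00:00');

-- Two arguments: the pattern describes the format of the input string.
SELECT unix_timestamp('20190101', 'yyyyMMdd');
```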

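to_date(string timestamp): returns the date part, e.g. 2019-01-01
year(string timestamp): returns the year
month(string timestamp): returns the month
day(string timestamp): returns the day of the month
hour(string timestamp): returns the hour
minute(string timestamp): returns the minute
second(string timestamp): returns the second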

date_add(date/timestamp/string startdate, tinyint/smallint/int days): adds the given number of days to startdate
round(): rounds a number (half up)
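
The date and rounding functions above can be tried on a literal value without any table. This is only a sketch with sample inputs chosen for illustration.

```sql
SELECT
  to_date('2019-01-01 10:30:45'),   -- 2019-01-01
  year('2019-01-01 10:30:45'),      -- 2019
  month('2019-01-01 10:30:45'),     -- 1
  day('2019-01-01 10:30:45'),       -- 1
  hour('2019-01-01 10:30:45'),      -- 10
  minute('2019-01-01 10:30:45'),    -- 30
  second('2019-01-01 10:30:45'),    -- 45
  date_add('2019-01-01', 10),       -- 2019-01-11
  round(3.14159, 2);                -- 3.14
```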

substring: extracts a substring
concat: concatenates strings
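
A quick sketch of the string functions on sample literals. Note that substring positions start at 1, not 0.

```sql
SELECT
  lower('Hive'),                     -- hive
  upper('Hive'),                     -- HIVE
  substring('hello hive', 1, 5),     -- hello  (start position 1, length 5)
  concat('hello', '-', 'hive');      -- hello-hive
```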

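split(): splits a string by a delimiter and returns an array. The delimiter is a regular expression, so special characters must be escaped.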
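
The escaping issue shows up with delimiters such as `.` and `|`, which have a special meaning in regular expressions. The sample strings below are made up for illustration.

```sql
-- A plain delimiter needs no escaping.
SELECT split('a,b,c', ',');             -- ["a","b","c"]

-- '.' matches any character in a regex, so it must be escaped.
SELECT split('192.168.1.1', '\\.');     -- ["192","168","1","1"]

-- '|' means alternation in a regex, so it must be escaped as well.
SELECT split('a|b|c', '\\|');           -- ["a","b","c"]

-- The result is an array; elements are accessed by a 0-based index.
SELECT split('192.168.1.1', '\\.')[0];  -- 192
```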

explode(): expands an array (or map) into multiple rows, one row per element
Word count (wc) with explode:
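
One possible way to write it, assuming a hypothetical table docs with a single STRING column line that holds space-separated text (both names are assumptions for illustration):

```sql
-- docs(line STRING) is a hypothetical table; each row is one line of text.
SELECT word, count(1) AS cnt
FROM (
  -- split() turns each line into an array of words,
  -- explode() emits one row per array element.
  SELECT explode(split(line, ' ')) AS word
  FROM docs
) t
GROUP BY word
ORDER BY cnt DESC;
```

The subquery is needed because explode() cannot be combined with other select expressions in the same SELECT; a LATERAL VIEW is the usual alternative when other columns must be kept alongside the exploded values.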