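Advanced Hive Usage
1. Complex data types
array, map, struct
Hive supports traversing and querying these complex types, as well as regular expressions.
2. Views (SQL)
3. Functions
3.1 A rich set of built-in functions
3.2 Custom Java classes: package the class in a jar file, add the jar to Hive, define a temporary function bound to the class, and use it to process data in a custom way
3.3 Parsing and manipulating JSON data: get_json_object, json_tuple
3.4 Calling custom scripts (for example Python) from HQL through TRANSFORM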
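As a rough sketch of the TRANSFORM mechanism mentioned above: Hive streams each input row to the script's standard input as one tab-separated line and reads tab-separated lines back from its standard output. The script name upper_url.py, the helper transform_rows, and the choice of lowercasing a url column are all hypothetical and only serve as illustration.

```python
import sys


def transform_rows(lines):
    """Yield one tab-separated output line per tab-separated input line."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2:
            continue  # skip malformed rows instead of failing the whole job
        cookieid, url = fields
        yield "\t".join([cookieid, url.lower()])


# Inside Hive the rows would arrive on sys.stdin, for example:
#     for out in transform_rows(sys.stdin):
#         print(out)
# Here a small in-memory sample stands in for the rows Hive would send.
sample = ["c1\tHTTP://Example.com/A\n", "c2\tHTTP://Example.com/B\n"]
result = list(transform_rows(sample))
for out in result:
    print(out)
```

On the Hive side such a script would typically be registered with ADD FILE upper_url.py; and then called as select transform(cookieid, url) using 'python upper_url.py' as (cookieid, url) from cookie4;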
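3.5 Analytic window functions
a. sum, avg, min, max: aggregation within a window
over (partition by col1 order by col2 rows between (unbounded | n) preceding and (current row | n following))
If ROWS BETWEEN is omitted, the window runs from the start of the partition to the current row (Hive's default frame in this case is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so rows that tie with the current row on the ORDER BY key are included as well).
If ORDER BY is omitted, all values in the partition are aggregated.
The key is to understand ROWS BETWEEN, also called the WINDOW clause:
PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: the boundary of the partition
UNBOUNDED PRECEDING: starting from the first row of the partition
UNBOUNDED FOLLOWING: ending at the last row of the partition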
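To make the frame boundaries concrete, here is a small pure-Python emulation of sum(...) over (... rows between ...) for a single partition that is already sorted by the ORDER BY column. It is not Hive code; the function name window_sum and the sample pv values are made up for illustration.

```python
def window_sum(values, preceding=None, following=0):
    """Emulate SUM over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None stands for UNBOUNDED on that side. `values` must already be in
    ORDER BY order and belong to one partition.
    """
    n = len(values)
    out = []
    for i in range(n):
        start = 0 if preceding is None else max(0, i - preceding)
        end = n if following is None else min(n, i + following + 1)
        out.append(sum(values[start:end]))
    return out


pv = [1, 5, 7, 3, 2]
running = window_sum(pv)                              # unbounded preceding .. current row
last4 = window_sum(pv, preceding=3, following=0)      # 3 preceding .. current row
mixed = window_sum(pv, preceding=3, following=1)      # 3 preceding .. 1 following
to_end = window_sum(pv, preceding=0, following=None)  # current row .. unbounded following

print(running)  # [1, 6, 13, 16, 18]
print(last4)    # [1, 6, 13, 16, 17]
print(mixed)    # [6, 13, 16, 18, 17]
print(to_end)   # [18, 17, 12, 5, 2]
```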
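b. ntile, row_number, rank, dense_rank
NTILE(n): splits the ordered rows of a partition into n buckets and returns the bucket number of the current row
ROW_NUMBER(): numbers the rows of a partition in order, starting from 1, with no duplicates
RANK(): ranks the rows of a partition; tied rows share a rank and leave a gap after it (for example 3, 3, 5)
DENSE_RANK(): ranks the rows of a partition; tied rows share a rank and leave no gap (for example 3, 3, 4)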
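The difference between the four numbering functions is easiest to see side by side. The following pure-Python sketch emulates them for one partition whose values are already sorted; the function names mirror the Hive functions but the code itself is only an illustration, and the sample values are invented.

```python
def row_number(sorted_vals):
    """1, 2, 3, ... regardless of ties."""
    return list(range(1, len(sorted_vals) + 1))


def rank(sorted_vals):
    """Ties share a rank; the next distinct value skips ahead (gaps)."""
    ranks = []
    for i, v in enumerate(sorted_vals):
        if i > 0 and v == sorted_vals[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(i + 1)
    return ranks


def dense_rank(sorted_vals):
    """Ties share a rank; the next distinct value gets the next integer."""
    ranks = []
    for i, v in enumerate(sorted_vals):
        if i == 0:
            ranks.append(1)
        elif v == sorted_vals[i - 1]:
            ranks.append(ranks[-1])
        else:
            ranks.append(ranks[-1] + 1)
    return ranks


def ntile(n_rows, buckets):
    """Bucket numbers for n_rows ordered rows; earlier buckets take the extra rows."""
    base, extra = divmod(n_rows, buckets)
    out = []
    for b in range(1, buckets + 1):
        out.extend([b] * (base + (1 if b <= extra else 0)))
    return out


pv_desc = [9, 7, 5, 5, 2]      # already sorted as by "order by pv desc"
print(row_number(pv_desc))     # [1, 2, 3, 4, 5]
print(rank(pv_desc))           # [1, 2, 3, 3, 5]
print(dense_rank(pv_desc))     # [1, 2, 3, 3, 4]
print(ntile(len(pv_desc), 2))  # [1, 1, 1, 2, 2]
```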
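c. cume_dist, percent_rank
CUME_DIST: (number of rows whose value is less than or equal to the current value) / (total number of rows in the partition)
PERCENT_RANK: (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1)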
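The two formulas can be checked by hand with a short pure-Python sketch. It assumes a single partition sorted in ascending order; the function names mirror the Hive functions, and the salary figures are invented sample data.

```python
def cume_dist(sorted_vals):
    """Rows with value <= current value, divided by rows in the partition."""
    n = len(sorted_vals)
    return [sum(1 for x in sorted_vals if x <= v) / n for v in sorted_vals]


def percent_rank(sorted_vals):
    """(RANK - 1) / (rows in the partition - 1), for ascending order."""
    n = len(sorted_vals)
    if n == 1:
        return [0.0]  # a single row has nothing to be compared against
    out = []
    for v in sorted_vals:
        rank = 1 + sum(1 for x in sorted_vals if x < v)  # ties share a rank
        out.append((rank - 1) / (n - 1))
    return out


sal = [1000, 2000, 3000, 3000, 4000]
print(cume_dist(sal))     # [0.2, 0.4, 0.8, 0.8, 1.0]
print(percent_rank(sal))  # [0.0, 0.25, 0.5, 0.5, 1.0]
```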
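d. lag, lead, first_value, last_value
LAG(col, n, DEFAULT): the value of col n rows before the current row within the window; DEFAULT is returned when no such row exists
LEAD(col, n, DEFAULT): the value of col n rows after the current row within the window; DEFAULT is returned when no such row exists
first_value(col1) over (partition by col2 order by col3): the first value in the sorted partition, up to the current row
last_value(col1) over (partition by col2 order by col3): the last value in the sorted partition, up to the current row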
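A pure-Python sketch of these four functions for one sorted partition is shown below. It assumes the ORDER BY keys are distinct and that no explicit window frame is given, which is why last_value simply returns each row's own value; the url values are invented sample data.

```python
def lag(values, n=1, default=None):
    """Value n rows earlier, or `default` when the window has no such row."""
    return [values[i - n] if i - n >= 0 else default for i in range(len(values))]


def lead(values, n=1, default=None):
    """Value n rows later, or `default` when the window has no such row."""
    return [values[i + n] if i + n < len(values) else default
            for i in range(len(values))]


def first_value(values):
    """With a frame from the partition start to the current row: always row 0."""
    return [values[0] for _ in values]


def last_value(values):
    """With a frame ending at the current row: the current row itself."""
    return list(values)


urls = ["url1", "url2", "url3", "url4"]  # one cookieid, ordered by createtime
print(lag(urls, 1, "-"))   # ['-', 'url1', 'url2', 'url3']
print(lead(urls, 2, "-"))  # ['url3', 'url4', '-', '-']
print(first_value(urls))   # ['url1', 'url1', 'url1', 'url1']
print(last_value(urls))    # ['url1', 'url2', 'url3', 'url4']
```

To obtain the last value of the whole partition instead, one option is first_value with the sort order reversed; another is an explicit frame that ends at UNBOUNDED FOLLOWING.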
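e. grouping sets, grouping__id, cube, rollup: commonly used for OLAP
grouping sets, grouping__id: aggregate once for each listed grouping of the GROUP BY columns and combine the results into one result set
cube: aggregate over every combination of the GROUP BY columns
rollup: aggregate over the hierarchical combinations of the GROUP BY columns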
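One way to see how cube and rollup differ is to list the grouping sets each one expands to. The sketch below does this in plain Python with hypothetical helper names cube and rollup; it only enumerates the column combinations and does not attempt to reproduce the grouping__id values Hive assigns.

```python
from itertools import combinations


def cube(cols):
    """Every combination of the GROUP BY columns, down to the grand total ()."""
    sets = []
    for k in range(len(cols), -1, -1):
        sets.extend(combinations(cols, k))
    return sets


def rollup(cols):
    """Hierarchical prefixes of the GROUP BY columns, down to the grand total ()."""
    return [tuple(cols[:k]) for k in range(len(cols), -1, -1)]


dims = ["month", "day"]
print(cube(dims))    # [('month', 'day'), ('month',), ('day',), ()]
print(rollup(dims))  # [('month', 'day'), ('month',), ()]
```

With two dimensions, cube yields four grouping sets while rollup yields three, because rollup never groups by day alone.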
grouping sets (list of column sets): aggregates once for each listed combination of the GROUP BY columns
grouping__id: numbers the grouping set that each result row belongs to
eg:
    select month, day, count(distinct cookieid) as uv, grouping__id
    from cookie5
    group by month, day
    grouping sets (month, day, (month, day))  -- the (month, day) set is optional
    order by grouping__id;

cube: with cube aggregates over every combination of the GROUP BY dimensions
eg:
    select month, day, count(distinct cookieid) as uv, grouping__id
    from cookie5
    group by month, day with cube
    order by grouping__id;

rollup: with rollup aggregates level by level, following the order of the GROUP BY dimensions
eg:
    select month, day, count(distinct cookieid) as uv, grouping__id
    from cookie5
    group by month, day with rollup
    order by grouping__id;

lag(column, n, default): the value n rows earlier in the window, so the column appears shifted down by n rows
lead(column, n, default): the value n rows later in the window, so the column appears shifted up by n rows
eg:
    select cookieid, createtime, url,
        row_number() over (partition by cookieid order by createtime) as rn,
        lag(createtime, 1, '1970-01-01 00:00:00') over (partition by cookieid order by createtime) as front_1_time,
        lead(createtime, 2, '2018-12-24 00:00:00') over (partition by cookieid order by createtime) as behind_2_time
    from cookie4;

first_value(column): the first value in the sorted window (with descending order this is the last value)
last_value(column): the last value in the sorted window up to the current row, which is the current row's own value
eg:
    select cookieid, createtime, url,
        row_number() over (partition by cookieid order by createtime) as rn,
        first_value(url) over (partition by cookieid order by createtime) as first1,
        first_value(url) over (partition by cookieid order by createtime desc) as last1,
        last_value(url) over (partition by cookieid order by createtime) as last2
    from cookie4;

cume_dist(): (rows with a value less than or equal to the current value) / (total rows in the partition)
percent_rank(): (RANK of the current row in the partition - 1) / (total rows in the partition - 1)
eg:
    select dept, userid, sal,
        cume_dist() over (order by sal) as rn1,
        cume_dist() over (partition by dept order by sal) as rn2
    from cookie3;

    select dept, userid, sal,
        percent_rank() over (order by sal) as rn1,                    -- within the group
        rank() over (order by sal) as rn11,                           -- RANK value within the group
        sum(1) over (partition by null) as rn12,                      -- total rows in the group
        percent_rank() over (partition by dept order by sal) as rn2,
        rank() over (partition by dept order by sal) as rn21,
        sum(1) over (partition by dept) as rn22
    from cookie3;

ntile(n): splits the rows of the window into n buckets
row_number(): sequence number of the row within the window, starting from 1
rank(): rank of the row within the window, with gaps after ties (for example 3, 3, 5)
dense_rank(): rank of the row within the window, without gaps after ties (for example 3, 3, 4)
eg:
    select cookieid, createtime, pv,
        ntile(2) over (partition by cookieid order by createtime) as rn1,
        row_number() over (partition by cookieid order by pv desc) as rn2,
        rank() over (partition by cookieid order by pv desc) as rn3,
        dense_rank() over (partition by cookieid order by pv desc) as rn4
    from cookie2
    order by cookieid, createtime;

sum|avg|min|max(column) over (partition by col1 order by col2 rows between (n | unbounded) preceding | current row and current row | (n | unbounded) following): aggregates the rows of the window, with a freely defined frame
eg:
    select cookieid, createtime, pv,
        sum(pv) over (partition by cookieid order by createtime rows between unbounded preceding and current row) as pv1,  -- from the start of the partition to the current row
        avg(pv) over (partition by cookieid order by createtime) as pv2,  -- no frame given: defaults to start of partition through current row
        max(pv) over (partition by cookieid) as pv3,  -- all rows in the partition
        min(pv) over (partition by cookieid order by createtime rows between 3 preceding and current row) as pv4,  -- current row + 3 preceding rows
        sum(pv) over (partition by cookieid order by createtime rows between 3 preceding and 1 following) as pv5,  -- current row + 3 preceding rows + 1 following row
        avg(pv) over (partition by cookieid order by createtime rows between current row and unbounded following) as pv6  -- current row + all following rows
    from cookie1;
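4. Handling special delimiters: RegexSerDe regular-expression parsing, custom InputFormat
lateral view with explode
lateral view is used together with UDTFs such as explode (often combined with split). It can turn one row of data into multiple rows, and the resulting rows can then be aggregated. lateral view first calls the UDTF for each row of the source table, the UDTF expands that row into one or more rows, and lateral view then combines the results into a virtual table that can be given an alias.
The lateral clause acts as a virtual table that is joined to the source table explode_lateral_view in the manner of a Cartesian product between each source row and the rows generated from it.
explode cannot be nested inside another function.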
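The row expansion performed by lateral view with explode can be imitated in a few lines of plain Python. The helper lateral_view_explode, the output column name item, and the sample rows with id and tags columns are all hypothetical; they correspond roughly to a query such as select id, item from some_table lateral view explode(split(tags, ',')) v as item.

```python
def lateral_view_explode(rows, split_col, sep=","):
    """Pair every source row with each element produced from its split column."""
    out = []
    for row in rows:
        for item in row[split_col].split(sep):
            new_row = dict(row)     # keep all columns of the source row
            new_row["item"] = item  # add the exploded element as a new column
            out.append(new_row)
    return out


source = [
    {"id": 1, "tags": "a,b"},
    {"id": 2, "tags": "c"},
]
exploded = lateral_view_explode(source, "tags")
for r in exploded:
    print(r["id"], r["item"])
```

Because every output row still carries the source columns, the expanded rows can be grouped and aggregated like any other table.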