[Hive_12] Hive User-Defined Functions

Date: 2020-06-12

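0. Overview

　　UDF 　　// user-defined function
　　　　　　// one row in, one row out, similar to format_number(age,'000')

　　UDTF 　　// user-defined table-generating function
　　　　　　 // one row in, multiple rows out, similar to explode(array)

　　UDAF 　　// user-defined aggregate function
　　　　　　 // multiple rows in, one row out, similar to sum(xxx)

　　In the demo below, Hive uses a UDF to parse the JSON stored in the temptags table.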
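
To make the three shapes concrete, here is a minimal plain-Java sketch. It is not Hive API: the class FunctionShapes and its methods are hypothetical names, and pad3 assumes that the pattern form of format_number behaves like java.text.DecimalFormat.

```java
import java.text.DecimalFormat;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of the three function shapes; none of this is Hive API.
class FunctionShapes {
    // UDF shape: one value in, one value out (assumed to mirror format_number(age,'000')).
    static String pad3(long age) {
        return new DecimalFormat("000").format(age);
    }

    // UDTF shape: one value in, zero or more rows out, like explode(array).
    static List<String> explode(String[] items) {
        return Arrays.asList(items);
    }

    // UDAF shape: many rows in, one value out, like sum(xxx).
    static long sum(List<Integer> rows) {
        return rows.stream().mapToLong(Integer::longValue).sum();
    }

    public static void main(String[] args) {
        System.out.println(pad3(7));                          // 007
        System.out.println(explode(new String[] {"a", "b"})); // [a, b]
        System.out.println(sum(Arrays.asList(1, 2, 3)));      // 6
    }
}
```

The only difference between the three is the row cardinality of input and output, which is also what decides where each kind of function may appear in a query.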

1. UDF

　　1.1 Code example

　　[Code listing omitted in the source]

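　　1.2 Using a user-defined function

　　1. Package the Hive user-defined function as a jar and copy it to /soft/hive/lib
　　2. Restart Hive
　　3. Register the function

-- permanent function
create function myudf as 'com.share.udf.MyUDF';
-- temporary function
create temporary function myudf as 'com.share.udf.MyUDF';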

　　1.3 Demo

　　Hive uses a UDF to parse the JSON in temptags

　　0. Prepare the data

　　1. Create the table

create table temptags(id int,json string) row format delimited fields terminated by '\t';

　　2. Load the data

load data local inpath '/home/centos/files/temptags.txt' into table temptags;

　　3. Write the code

　　[Code listing omitted in the source]

　　4. Build the jar

　　5. Copy fastjson-1.2.47.jar and myhive-1.0-SNAPSHOT.jar into /soft/hive/lib

　　6. Restart Hive

　　7. Register a temporary function

create temporary function parsejson as 'com.share.udf.ParseJson';

　　8. Test

select id, parsejson(json) as tags from temptags;

-- Explode the tags: one row per (id, tag)
select id, tag from temptags lateral view explode(parsejson(json)) xx as tag;

-- Count each tag for each merchant
select id, tag, count(*) as count
from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag;

-- Rank the tags within each merchant by count
select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b;

-- Join each tag with its count and keep the top 10 tags per merchant
select id, concat(tag, '_', count)
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b) c
where rank <= 10;

-- Aggregate into one string per merchant
-- concat_ws(',', array<string>): joins the array elements with the separator
-- collect_set(name): gathers all values of the column into an array, removing duplicates
-- collect_list(name): gathers all values of the column into an array, keeping duplicates
select id, concat_ws(',', collect_set(concat(tag, '_', count))) as tags
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b) c
where rank <= 10
group by id;
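
The chain of queries above can be easier to follow as ordinary code. The sketch below is a rough plain-Java model of the same steps (explode, count per id and tag, rank within each id, keep the top n, join as tag_count). The class TopTags, the method topTags, and the sample tags are hypothetical; the input stands for rows whose tags were already extracted by parsejson. Ties are broken by tag name here, which the SQL does not promise.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java model of the query chain; all names here are hypothetical.
class TopTags {
    // rows: (id, tags of one record); the same id may appear in several rows.
    static Map<Integer, String> topTags(List<Map.Entry<Integer, List<String>>> rows, int n) {
        // Explode each tag list and count occurrences per (id, tag).
        Map<Integer, Map<String, Long>> counts = new TreeMap<>();
        for (Map.Entry<Integer, List<String>> row : rows) {
            Map<String, Long> perId = counts.computeIfAbsent(row.getKey(), k -> new TreeMap<>());
            for (String tag : row.getValue()) {
                perId.merge(tag, 1L, Long::sum);
            }
        }
        // Rank tags within each id by count, keep the top n, join as tag_count.
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, Map<String, Long>> e : counts.entrySet()) {
            String joined = e.getValue().entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .map(t -> t.getKey() + "_" + t.getValue())
                .collect(Collectors.joining(","));
            result.put(e.getKey(), joined);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, List<String>>> rows = new ArrayList<>();
        rows.add(new AbstractMap.SimpleEntry<Integer, List<String>>(1, Arrays.asList("clean", "tasty")));
        rows.add(new AbstractMap.SimpleEntry<Integer, List<String>>(1, Arrays.asList("tasty")));
        System.out.println(topTags(rows, 10)); // {1=tasty_2,clean_1}
    }
}
```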

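　　1.4 Virtual columns: lateral view

　　lateral view joins each source row with the rows that a UDTF such as explode generates from it, so a row holding an array becomes one row per element.

　　123456 味道好_10,环境卫生_9
　　(味道好 = "tastes good", 环境卫生 = "clean and hygienic")

　　id　　 tags
　　1 　　[味道好, 环境卫生]　　 =>　　 1 味道好
　　　　　　　　　　　　　　　　　　1 环境卫生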

select name, workplace from employee lateral view explode(work_place) xx as workplace;
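
As a rough illustration of what this query does to a single employee row, the plain-Java sketch below pairs the name with every element of the array column. It is only a model, not Hive API; the class LateralViewModel, the method name, and the sample values are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of "lateral view explode" for one source row; names are hypothetical.
class LateralViewModel {
    // Emits one (name, workplace) pair per array element.
    // An empty array emits nothing, as with lateral view without OUTER.
    static List<String[]> lateralViewExplode(String name, List<String> workPlaces) {
        List<String[]> out = new ArrayList<>();
        for (String place : workPlaces) {
            out.add(new String[] {name, place});
        }
        return out;
    }

    public static void main(String[] args) {
        for (String[] row : lateralViewExplode("tom", Arrays.asList("office_a", "office_b"))) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```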

　　1.5 ClassNotFoundException

　　How to fix a ClassNotFoundException caused by a missing jar

　　Problem

　　Caused by: java.lang.ClassNotFoundException: com.share.udf.ParseJson

　　Solution

　　1. Put the fastjson jar and the myhive jar in /soft/hadoop/share/hadoop/common/lib

　　cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
　　cp /soft/hive/lib/fastjson-1.2.47.jar /soft/hadoop/share/hadoop/common/lib/

　　2. Sync them to the other nodes

　　xsync.sh /soft/hadoop/share/hadoop/common/lib/fastjson-1.2.47.jar
　　xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

　　3. Restart Hadoop and Hive

　　stop-all.sh
　　start-all.sh
　　hive

2. UDTF

　　2.0 Overview

　　Word Count can be implemented in Hive in the following two ways

　　array => explode

　　string => split => explode

　　Here WordCount is implemented directly with a UDTF

　　string => myudtf

　　2.1 Write the code

　　[Code listing omitted in the source]

　　2.2 Build the jar

　　Copy myhive-1.0-SNAPSHOT.jar into /soft/hive/lib

　　2.3 Restart Hive

　　2.4 Register the function

　　create function myudtf as 'com.share.udtf.MyUDTF';

　　2.5 Test

select myudtf(line) from wc2;

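　　2.6 Flow analysis

　　1. In initialize, check the types and the number of the arguments passed to the function

　　2. initialize then returns the schema of the output table (field names + field types)

　　3. In process, read the argument values of the current row

　　4. After processing them, emit each output row with forward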
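
The four steps can be sketched without any Hive dependency. In Hive itself a UDTF extends org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and works with ObjectInspector types; the MiniUdtf class below is a hypothetical stand-in that only mirrors the initialize / process / forward flow, and SplitWordsUdtf is an assumed word-splitting example, not the author's MyUDTF.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDTF lifecycle; a real Hive UDTF extends GenericUDTF.
abstract class MiniUdtf {
    // Rows emitted through forward(); Hive would pass them on to the next operator.
    final List<Object[]> collected = new ArrayList<>();

    abstract List<String> initialize(List<String> argTypes);

    abstract void process(Object[] args);

    void forward(Object[] row) {
        collected.add(row);
    }
}

class SplitWordsUdtf extends MiniUdtf {
    @Override
    List<String> initialize(List<String> argTypes) {
        // Steps 1-2: check argument count and type, then describe the output columns.
        if (argTypes.size() != 1 || !"string".equals(argTypes.get(0))) {
            throw new IllegalArgumentException("expects exactly one string argument");
        }
        return Arrays.asList("word string");
    }

    @Override
    void process(Object[] args) {
        // Steps 3-4: read the argument value, then forward one row per word.
        String line = (String) args[0];
        if (line == null) {
            return;
        }
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                forward(new Object[] {word});
            }
        }
    }
}
```

Calling initialize once and then process for every input line leaves one single-column row per word in collected, which is the shape that a query such as select myudtf(line) from wc2 would then group and count.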