[Hive_12] Hive User-Defined Functions

 


 

0. Overview

  UDF   // user-defined function
        // one row in, one row out, similar to format_number(age,'000')

  UDTF   // user-defined table-generating function
         // one row in, multiple rows out, similar to explode(array)

  UDAF   // user-defined aggregate function
         // multiple rows in, one row out, similar to sum(xxx)
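The three shapes can be sketched in plain Java, independent of Hive's real base classes. The class and method names below are hypothetical and only mirror the row-level behaviour of each kind: one row to one row, one row to many rows, and many rows to one row.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain Java methods, not Hive classes.
final class FunctionShapes {
    // UDF shape: one input row -> one output row.
    static String udf(String value) {
        return value.toUpperCase();
    }

    // UDTF shape: one input row -> many output rows (like explode).
    static List<String> udtf(String line) {
        return Arrays.asList(line.split(" "));
    }

    // UDAF shape: many input rows -> one output row (like sum).
    static long udaf(List<Integer> values) {
        long total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(udf("hive"));
        System.out.println(udtf("hello hive udf"));
        System.out.println(udaf(Arrays.asList(1, 2, 3)));
    }
}
```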

 

  Hive uses a UDF to parse the JSON stored in temptags

 


1. UDF

  1.1 Code example

  Code
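The original MyUDF listing is not shown here, so the following is only a minimal sketch of what a simple UDF class tends to look like. In a real Hive project the class would extend org.apache.hadoop.hive.ql.exec.UDF from the hive-exec jar; that dependency is left out so the sketch compiles on its own. The class name MyUdfSketch and the addition logic are assumptions made for illustration.

```java
// Hypothetical sketch: not the author's MyUDF.
// A real Hive UDF would declare "extends org.apache.hadoop.hive.ql.exec.UDF".
final class MyUdfSketch {
    // Hive finds methods named evaluate by reflection; overloads are allowed.
    public int evaluate(int a, int b) {
        return a + b;
    }

    public int evaluate(int a, int b, int c) {
        return a + b + c;
    }

    public static void main(String[] args) {
        MyUdfSketch udf = new MyUdfSketch();
        System.out.println(udf.evaluate(1, 2));
        System.out.println(udf.evaluate(1, 2, 3));
    }
}
```

Once registered under a name such as myudf, a function like this would be called from HiveQL with either two or three integer arguments.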

 

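  1.2 Using a user-defined function

  1. Package the Hive user-defined function as a jar and copy it to /soft/hive/lib
  2. Restart Hive
  3. Register the function

-- Permanent function
create function myudf as 'com.share.udf.MyUDF';
-- Temporary function
create temporary function myudf as 'com.share.udf.MyUDF';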

 

  1.3 Demo

  Hive uses a UDF to parse the JSON stored in temptags

  0. Prepare the data

  1. Create the table

create table temptags(id int,json string) row format delimited fields terminated by '\t';

  2. Load the data

load data local inpath '/home/centos/files/temptags.txt' into table temptags;

  3. Write the code

  Code

  4. Build the jar

  5. Copy fastjson-1.2.47.jar and myhive-1.0-SNAPSHOT.jar into /soft/hive/lib

  6. Restart Hive

  7. Register a temporary function

create temporary function parsejson as 'com.share.udf.ParseJson';

  8. Test

select id ,parsejson(json) as tags from temptags;

 

-- Explode each id into one row per tag
select id, tag from temptags lateral view explode(parsejson(json)) xx as tag;

-- Count how often each tag occurs for each merchant
select id, tag, count(*) as count
from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag;

-- Rank the tags within each merchant by count
select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b;

-- Concatenate each tag with its count and keep the top 10 tags per merchant
select id, concat(tag, '_', count)
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b) c
where rank <= 10;

-- Aggregate into one string per merchant
-- concat_ws(',', array<string>) joins the array elements with a separator
-- collect_set(name) gathers the column values into an array, deduplicated
-- collect_list(name) gathers the column values into an array, keeping duplicates
select id, concat_ws(',', collect_set(concat(tag, '_', count))) as tags
from (select id, tag, count, row_number() over (partition by id order by count desc) as rank
from (select id, tag, count(*) as count from (select id, tag from temptags lateral view explode(parsejson(json)) xx as tag) a
group by id, tag) b) c
where rank <= 10
group by id;

 

  1.4 Virtual columns: lateral view

  123456 tastes good_10,clean environment_9

  id   tags
  1   [tastes good, clean environment]   =>   1 tastes good
                                              1 clean environment

 

select name, workplace from employee lateral view explode(work_place) xx as workplace;
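The row-multiplying effect of lateral view explode can be sketched in plain Java. This is only an illustration of the semantics, not how Hive implements it; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "lateral view explode": each input row is
// joined with every element of its own array column.
final class LateralViewSketch {
    // Returns one "id<TAB>tag" row per element of tags.
    // An empty array yields no rows, which matches lateral view without OUTER.
    static List<String> lateralViewExplode(int id, List<String> tags) {
        List<String> rows = new ArrayList<>();
        for (String tag : tags) {
            rows.add(id + "\t" + tag);
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("tastes good", "clean environment");
        for (String row : lateralViewExplode(1, tags)) {
            System.out.println(row);
        }
    }
}
```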

 

  1.5 ClassNotFoundException

  Resolving a ClassNotFoundException caused by a missing jar

  Problem

  Caused by: java.lang.ClassNotFoundException: com.share.udf.ParseJson

 

  Solution

  1. Copy fastjson and myhive.jar into /soft/hadoop/share/hadoop/common/lib

  cp /soft/hive/lib/myhive-1.0-SNAPSHOT.jar /soft/hadoop/share/hadoop/common/lib/
  cp /soft/hive/lib/fastjson-1.2.47.jar /soft/hadoop/share/hadoop/common/lib/

 

  2. Sync the jars to the other nodes

  xsync.sh /soft/hadoop/share/hadoop/common/lib/fastjson-1.2.47.jar
  xsync.sh /soft/hadoop/share/hadoop/common/lib/myhive-1.0-SNAPSHOT.jar

 

  3. Restart Hadoop and Hive

  stop-all.sh
  start-all.sh
  hive

 

 


2. UDTF

  2.0 Overview

  Hive can implement word count in the following two ways

  array => explode

  string => split => explode

 

  Now implement WordCount directly with a UDTF

  string => myudtf

 

  2.1 Write the code

  Code

  2.2 Build the jar

  Copy myhive-1.0-SNAPSHOT.jar into /soft/hive/lib

  2.3 Restart Hive

 

  2.4 Register the function

  create function myudtf as 'com.share.udtf.MyUDTF';

 

  2.5 Test

  

select myudtf(line) from wc2;

 

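  2.6 Flow analysis

  1. In initialize, validate the types or the number of the arguments passed to the function

  2. Return the schema of the output table (field names + field types)

  3. In process, read the argument values

  4. After processing them, emit the result rows through forward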
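The four steps above can be sketched in plain Java. This is not the author's MyUDTF, which is not shown; a real implementation would extend org.apache.hadoop.hive.ql.udf.generic.GenericUDTF and describe its schema with ObjectInspectors. Here the schema is simplified to an array of column names, forward just collects rows, and the single output column "word" is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a UDTF that splits a line into words.
final class WordSplitUdtfSketch {
    private final List<String[]> output = new ArrayList<>();

    // Steps 1 and 2: validate the argument count and types,
    // then return the output schema (simplified to column names).
    String[] initialize(Class<?>[] argTypes) {
        if (argTypes.length != 1 || argTypes[0] != String.class) {
            throw new IllegalArgumentException("expected exactly one string argument");
        }
        return new String[] {"word"};
    }

    // Steps 3 and 4: read the argument value, split it,
    // and forward one output row per word.
    void process(Object[] args) {
        String line = (String) args[0];
        if (line == null) {
            return;
        }
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                forward(new String[] {word});
            }
        }
    }

    // In Hive, forward hands the row to the next operator; here it only collects it.
    void forward(String[] row) {
        output.add(row);
    }

    List<String[]> rows() {
        return output;
    }

    public static void main(String[] args) {
        WordSplitUdtfSketch udtf = new WordSplitUdtfSketch();
        udtf.initialize(new Class<?>[] {String.class});
        udtf.process(new Object[] {"hello hive hello"});
        for (String[] row : udtf.rows()) {
            System.out.println(row[0]);
        }
    }
}
```

Counting the words would then be an ordinary group by over the rows the function emits.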
