Introduction to Hive Functions and Built-in Functions

Date: 2019-12-01


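1. Overview of Hive functions and how to inspect the built-in functions

There are too many functions to cover in full; see the official Hive documentation: cwiki.apache.org/confluence/…

1) List the built-in functions
hive> show functions;
2) Show the usage of a built-in function
hive> desc function upper;
3) Show the detailed usage of a built-in function
hive> desc function extended upper;

2. Commonly used functions

Relational operators

1. Equality comparison: =
Syntax: A = B
Operand types: all primitive types
Description: TRUE if expression A equals expression B; otherwise FALSE.
hive> select 1 from tableName where 1=1;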

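2. Inequality comparison: <>
Syntax: A <> B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is not equal to B; otherwise FALSE.
hive> select 1 from tableName where 1 <> 2;

3. Less-than comparison: <
Syntax: A < B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than B; otherwise FALSE.
hive> select 1 from tableName where 1 < 2;

4. Less-than-or-equal comparison: <=
Syntax: A <= B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is less than or equal to B; otherwise FALSE.
hive> select 1 from tableName where 1 <= 1;

5. Greater-than comparison: >
Syntax: A > B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than B; otherwise FALSE.
hive> select 1 from tableName where 2 > 1;

6. Greater-than-or-equal comparison: >=
Syntax: A >= B
Operand types: all primitive types
Description: NULL if expression A or expression B is NULL; TRUE if A is greater than or equal to B; otherwise FALSE.
hive> select 1 from tableName where 1 >= 1;
1

Note: be careful when comparing strings. Strings are compared character by character, so for the common case of comparing times it is safer to apply to_date first and compare the results.
hive> select * from tableName;
OK
2011111209 00:00:00    2011111209
hive> select a, b, a<b, a>b, a=b from tableName;
2011111209 00:00:00    2011111209    false    true    false

7. Null check: IS NULL
Syntax: A IS NULL
Operand types: all types
Description: TRUE if the value of expression A is NULL; otherwise FALSE.
hive> select 1 from tableName where null is null;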

8. Not-null check: IS NOT NULL
Syntax: A IS NOT NULL
Operand types: all types
Description: FALSE if the value of expression A is NULL; otherwise TRUE.
hive> select 1 from tableName where 1 is not null;

9. LIKE comparison: LIKE
Syntax: A LIKE B
Operand types: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the simple SQL pattern B; otherwise FALSE. In B, the character "_" matches any single character and the character "%" matches any number of characters.
hive> select 1 from tableName where 'football' like 'foot%';

hive> select 1 from tableName where 'football' like 'foot____';

Note: to negate the comparison, use NOT A LIKE B.
hive> select 1 from tableName where NOT 'football' like 'fff%';
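As a quick check of the wildcard rules above, the following sketch evaluates several LIKE expressions in a single statement. It assumes Hive 0.13 or later, where SELECT may omit the FROM clause; on older versions, append `from tableName` as in the examples above. The column aliases are illustrative only.

```sql
-- Each expression yields a boolean, or NULL when an operand is NULL.
SELECT
  'football' LIKE 'foot%'        AS prefix_match,     -- % matches any number of characters
  'football' LIKE 'foot____'     AS four_more_chars,  -- each _ matches exactly one character
  'football' NOT LIKE 'fff%'     AS negated,          -- same meaning as NOT ('football' LIKE 'fff%')
  CAST(NULL AS STRING) LIKE 'f%' AS null_operand;     -- a NULL operand gives NULL, not FALSE
```

Because a NULL operand produces NULL rather than FALSE, a WHERE clause built on LIKE silently drops rows whose column is NULL.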

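10. Java regular-expression match: RLIKE
Syntax: A RLIKE B
Operand types: strings
Description: NULL if string A or string B is NULL; TRUE if string A matches the Java regular expression B; otherwise FALSE.
hive> select 1 from tableName where 'footbar' rlike '^f.*r$';
1
Note: to test whether a string consists entirely of digits:
hive> select 1 from tableName where '123456' rlike '^\\d+$';
1
Without the trailing $ anchor, a string that merely starts with digits also matches:
hive> select 1 from tableName where '123456aa' rlike '^\\d+';
1

Arithmetic operators

1. Addition: +
Syntax: A + B
Operand types: all numeric types
Description: returns the sum of A and B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy). For example, int + int generally yields int, while int + double generally yields double.
hive> select 1 + 9 from tableName;
10
hive> create table tableName as select 1 + 1.2 from tableName;
hive> describe tableName;
_c0 double

2. Subtraction: -
Syntax: A - B
Operand types: all numeric types
Description: returns A minus B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy). For example, int - int generally yields int, while int - double generally yields double.
hive> select 10 - 5 from tableName;
5
hive> create table tableName as select 5.6 - 4 from tableName;
hive> describe tableName;
_c0 double

3. Multiplication: *
Syntax: A * B
Operand types: all numeric types
Description: returns the product of A and B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy). Note that if the product exceeds the range of the default result type, you need to cast an operand to a wider numeric type.
hive> select 40 * 5 from tableName;
200

4. Division: /
Syntax: A / B
Operand types: all numeric types
Description: returns A divided by B. The result type is double.
hive> select 40 / 5 from tableName;
8.0
Note: a double is accurate to only about 16 significant digits, so take particular care with division.
hive> select ceil(28.0/6.999999999999999999999) from tableName limit 1;
Result: 4
hive> select ceil(28.0/6.99999999999999) from tableName limit 1;
Result: 5

5. Modulo: %
Syntax: A % B
Operand types: all numeric types
Description: returns the remainder of A divided by B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy).
hive> select 41 % 5 from tableName;
1
hive> select 8.4 % 4 from tableName;
0.40000000000000036
Note: floating-point precision is a recurring problem in Hive; for operations like this it is best to fix the precision with round.
hive> select round(8.4 % 4 , 2) from tableName;
0.4

6. Bitwise AND: &
Syntax: A & B
Operand types: all numeric types
Description: returns the bitwise AND of A and B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy).
hive> select 4 & 8 from tableName;
0
hive> select 6 & 4 from tableName;
4

7. Bitwise OR: |
Syntax: A | B
Operand types: all numeric types
Description: returns the bitwise OR of A and B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy).
hive> select 4 | 8 from tableName;
12
hive> select 6 | 8 from tableName;
14

8. Bitwise XOR: ^
Syntax: A ^ B
Operand types: all numeric types
Description: returns the bitwise XOR of A and B. The result type is the smallest common parent type of the types of A and B (see the data type hierarchy).
hive> select 4 ^ 8 from tableName;
12
hive> select 6 ^ 4 from tableName;
2

9. Bitwise NOT: ~
Syntax: ~A
Operand types: all numeric types
Description: returns the bitwise complement of A. The result type is the type of A.
hive> select ~6 from tableName;
-7
hive> select ~4 from tableName;
-5

Logical operators

1. Logical AND: AND
Syntax: A AND B
Operand types: boolean
Description: TRUE if both A and B are TRUE; otherwise FALSE. NULL if A or B is NULL.
hive> select 1 from tableName where 1=1 and 2=2;
1

2. Logical OR: OR
Syntax: A OR B
Operand types: boolean
Description: TRUE if A is TRUE, or B is TRUE, or both are TRUE; otherwise FALSE.
hive> select 1 from tableName where 1=2 or 2=2;
1

3. Logical NOT: NOT
Syntax: NOT A
Operand types: boolean
Description: TRUE if A is FALSE; NULL if A is NULL; otherwise FALSE.
hive> select 1 from tableName where not 1=2;
1

Numeric functions

1. Rounding function: round *
Syntax: round(double a)
Returns: BIGINT
Description: returns a rounded to the nearest integer (half rounds up).
hive> select round(3.1415926) from tableName;
3
hive> select round(3.5) from tableName;
4
hive> create table tableName as select round(9542.158) from tableName;
hive> describe tableName;
_c0 bigint

2. Rounding to a given precision: round *
Syntax: round(double a, int d)
Returns: DOUBLE
Description: returns a rounded to d decimal places, as a double.
hive> select round(3.1415926,4) from tableName;
3.1416

3. Floor function: floor *
Syntax: floor(double a)
Returns: BIGINT
Description: returns the largest integer less than or equal to a.
hive> select floor(3.1415926) from tableName;
3
hive> select floor(25) from tableName;
25

4. Ceiling function: ceil *
Syntax: ceil(double a)
Returns: BIGINT
Description: returns the smallest integer greater than or equal to a.
hive> select ceil(3.1415926) from tableName;
4
hive> select ceil(46) from tableName;
46

5. Ceiling function: ceiling *
Syntax: ceiling(double a)
Returns: BIGINT
Description: same as ceil.
hive> select ceiling(3.1415926) from tableName;
4
hive> select ceiling(46) from tableName;
46

6. Random number function: rand *
Syntax: rand(), rand(int seed)
Returns: double
Description: returns a random number between 0 and 1. If a seed is given, the function produces a stable (repeatable) random sequence.
hive> select rand() from tableName;
0.5577432776034763
hive> select rand() from tableName;
0.6638336467363424
hive> select rand(100) from tableName;
0.7220096548596434
hive> select rand(100) from tableName;
0.7220096548596434

7. Natural exponential function: exp
Syntax: exp(double a)
Returns: double
Description: returns e (the base of the natural logarithm) raised to the power a.
hive> select exp(2) from tableName;
7.38905609893065

8. Natural logarithm function: ln
Syntax: ln(double a)
Returns: double
Description: returns the natural logarithm of a.
hive> select ln(7.38905609893065) from tableName;
2.0

9. Base-10 logarithm function: log10
Syntax: log10(double a)
Returns: double
Description: returns the base-10 logarithm of a.
hive> select log10(100) from tableName;
2.0

10. Base-2 logarithm function: log2
Syntax: log2(double a)
Returns: double
Description: returns the base-2 logarithm of a.
hive> select log2(8) from tableName;
3.0

11. Logarithm function: log
Syntax: log(double base, double a)
Returns: double
Description: returns the logarithm of a to the given base.
hive> select log(4,256) from tableName;
4.0

12. Power function: pow
Syntax: pow(double a, double p)
Returns: double
Description: returns a raised to the power p.
hive> select pow(2,4) from tableName;
16.0

13. Power function: power
Syntax: power(double a, double p)
Returns: double
Description: returns a raised to the power p; same as pow.
hive> select power(2,4) from tableName;
16.0

14. Square root function: sqrt
Syntax: sqrt(double a)
Returns: double
Description: returns the square root of a.
hive> select sqrt(16) from tableName;
4.0

15. Binary function: bin
Syntax: bin(BIGINT a)
Returns: string
Description: returns the binary representation of a.
hive> select bin(7) from tableName;
111

16. Hexadecimal function: hex
Syntax: hex(BIGINT a)
Returns: string
Description: if the argument is an integer, returns its hexadecimal representation; if the argument is a string, returns the hexadecimal representation of that string.
hive> select hex(17) from tableName;
11
hive> select hex('abc') from tableName;
616263

17. Reverse hexadecimal function: unhex
Syntax: unhex(string a)
Returns: string
Description: returns the string encoded by the hexadecimal string a.
hive> select unhex('616263') from tableName;
abc
hive> select unhex('11') from tableName;
(the result is a single non-printable character)
hive> select unhex(616263) from tableName;
abc

18. Base conversion function: conv
Syntax: conv(BIGINT num, int from_base, int to_base)
Returns: string
Description: converts the number num from base from_base to base to_base.
hive> select conv(17,10,16) from tableName;
11
hive> select conv(17,10,2) from tableName;
10001

19. Absolute value function: abs
Syntax: abs(double a), abs(int a)
Returns: double, int
Description: returns the absolute value of a.
hive> select abs(-3.9) from tableName;
3.9
hive> select abs(10.9) from tableName;
10.9

20. Positive modulo function: pmod
Syntax: pmod(int a, int b), pmod(double a, double b)
Returns: int, double
Description: returns the non-negative remainder of a divided by b.
hive> select pmod(9,4) from tableName;
1
hive> select pmod(-9,4) from tableName;
3

21. Sine function: sin
Syntax: sin(double a)
Returns: double
Description: returns the sine of a.
hive> select sin(0.8) from tableName;
0.7173560908995228

22. Arcsine function: asin
Syntax: asin(double a)
Returns: double
Description: returns the arcsine of a.
hive> select asin(0.7173560908995228) from tableName;
0.8

23. Cosine function: cos
Syntax: cos(double a)
Returns: double
Description: returns the cosine of a.
hive> select cos(0.9) from tableName;
0.6216099682706644

24. Arccosine function: acos
Syntax: acos(double a)
Returns: double
Description: returns the arccosine of a.
hive> select acos(0.6216099682706644) from tableName;
0.9

25. positive function: positive
Syntax: positive(int a), positive(double a)
Returns: int, double
Description: returns a unchanged.
hive> select positive(-10) from tableName;
-10
hive> select positive(12) from tableName;
12

26. negative function: negative
Syntax: negative(int a), negative(double a)
Returns: int, double
Description: returns -a.
hive> select negative(-5) from tableName;
5
hive> select negative(8) from tableName;
-8

Date functions

1. UNIX timestamp to date: from_unixtime ***
Syntax: from_unixtime(bigint unixtime[, string format])
Returns: string
Description: converts a UNIX timestamp (the number of seconds since 1970-01-01 00:00:00 UTC) to a time string in the current time zone.
hive> select from_unixtime(1323308943,'yyyyMMdd') from tableName;
20111208

2. Current UNIX timestamp: unix_timestamp ***
Syntax: unix_timestamp()
Returns: bigint
Description: returns the current UNIX timestamp.
hive> select unix_timestamp() from tableName;
1323309615

3. Date to UNIX timestamp: unix_timestamp ***
Syntax: unix_timestamp(string date)
Returns: bigint
Description: converts a date in "yyyy-MM-dd HH:mm:ss" format to a UNIX timestamp. Returns 0 if the conversion fails.
hive> select unix_timestamp('2011-12-07 13:01:03') from tableName;
1323234063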
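The two conversions above are inverses of each other, which the following sketch illustrates by parsing a timestamp string and formatting it back. It assumes Hive 0.13 or later (SELECT without FROM); the column aliases are illustrative. No output is shown because the epoch value depends on the session time zone.

```sql
-- Parse a 'yyyy-MM-dd HH:mm:ss' string into epoch seconds, then format it back.
-- Both calls are evaluated in the same session, so the date part is expected
-- to survive the round trip whatever the configured time zone is.
SELECT
  unix_timestamp('2011-12-07 13:01:03')                            AS epoch_seconds,
  from_unixtime(unix_timestamp('2011-12-07 13:01:03'), 'yyyyMMdd') AS compact_date;
```

This pattern is a common way to reformat date strings, since from_unixtime accepts any output pattern.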
4. Date in a given format to UNIX timestamp: unix_timestamp ***
Syntax: unix_timestamp(string date, string pattern)
Returns: bigint
Description: converts a date in the given pattern to a UNIX timestamp. Returns 0 if the conversion fails.
hive> select unix_timestamp('20111207 13:01:03','yyyyMMdd HH:mm:ss') from tableName;
1323234063

5. Datetime to date: to_date ***
Syntax: to_date(string timestamp)
Returns: string
Description: returns the date part of a datetime value.
hive> select to_date('2011-12-08 10:03:01') from tableName;
2011-12-08

6. Date to year: year ***
Syntax: year(string date)
Returns: int
Description: returns the year of the date.
hive> select year('2011-12-08 10:03:01') from tableName;
2011
hive> select year('2012-12-08') from tableName;
2012

7. Date to month: month ***
Syntax: month(string date)
Returns: int
Description: returns the month of the date.
hive> select month('2011-12-08 10:03:01') from tableName;
12
hive> select month('2011-08-08') from tableName;
8

8. Date to day: day ****
Syntax: day(string date)
Returns: int
Description: returns the day of the month of the date.
hive> select day('2011-12-08 10:03:01') from tableName;
8
hive> select day('2011-12-24') from tableName;
24

9. Date to hour: hour ***
Syntax: hour(string date)
Returns: int
Description: returns the hour of the datetime.
hive> select hour('2011-12-08 10:03:01') from tableName;
10

10. Date to minute: minute
Syntax: minute(string date)
Returns: int
Description: returns the minute of the datetime.
hive> select minute('2011-12-08 10:03:01') from tableName;
3

11. Date to second: second
Syntax: second(string date)
Returns: int
Description: returns the second of the datetime.
hive> select second('2011-12-08 10:03:01') from tableName;
1

12. Date to week: weekofyear
Syntax: weekofyear(string date)
Returns: int
Description: returns the week number of the year in which the date falls.
hive> select weekofyear('2011-12-08 10:03:01') from tableName;
49

13. Date difference: datediff ***
Syntax: datediff(string enddate, string startdate)
Returns: int
Description: returns the number of days from the start date to the end date.
hive> select datediff('2012-12-08','2012-05-09') from tableName;
213

14. Date addition: date_add ***
Syntax: date_add(string startdate, int days)
Returns: string
Description: returns the date that is days days after startdate.
hive> select date_add('2012-12-08',10) from tableName;
2012-12-18

15. Date subtraction: date_sub ***
Syntax: date_sub(string startdate, int days)
Returns: string
Description: returns the date that is days days before startdate.
hive> select date_sub('2012-12-08',10) from tableName;
2012-11-28

Conditional functions

1. If function: if ***
Syntax: if(boolean testCondition, T valueTrue, T valueFalseOrNull)
Returns: T
Description: returns valueTrue when testCondition is TRUE; otherwise returns valueFalseOrNull.
hive> select if(1=2,100,200) from tableName;
200
hive> select if(1=1,100,200) from tableName;
100

2. First non-null value: COALESCE
Syntax: COALESCE(T v1, T v2, …)
Returns: T
Description: returns the first non-NULL argument; returns NULL if all arguments are NULL.
hive> select COALESCE(null,'100','50') from tableName;
100

3. Conditional expression: CASE ***
Syntax: CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END
Returns: T
Description: if a equals b, returns c; if a equals d, returns e; otherwise returns f.
hive> Select case 100 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
mary
hive> Select case 200 when 50 then 'tom' when 100 then 'mary' else 'tim' end from tableName;
tim

4. Conditional expression: CASE ****
Syntax: CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END
Returns: T
Description: if a is TRUE, returns b; if c is TRUE, returns d; otherwise returns e.
hive> select case when 1=2 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
mary
hive> select case when 1=1 then 'tom' when 2=2 then 'mary' else 'tim' end from tableName;
tom

String functions

1. String length: length
Syntax: length(string A)
Returns: int
Description: returns the length of string A.
hive> select length('abcedfg') from tableName;
7

2. String reversal: reverse
Syntax: reverse(string A)
Returns: string
Description: returns string A reversed.
hive> select reverse('abcedfg') from tableName;
gfdecba

3. String concatenation: concat ***
Syntax: concat(string A, string B…)
Returns: string
Description: returns the concatenation of the input strings; any number of inputs is supported.
hive> select concat('abc','def','gh') from tableName;
abcdefgh

4. Concatenation with separator: concat_ws ***
Syntax: concat_ws(string SEP, string A, string B…)
Returns: string
Description: returns the concatenation of the input strings, with SEP inserted between them.
hive> select concat_ws(',','abc','def','gh') from tableName;
abc,def,gh

5. Substring: substr, substring ****
Syntax: substr(string A, int start), substring(string A, int start)
Returns: string
Description: returns the part of string A from position start to the end.
hive> select substr('abcde',3) from tableName;
cde
hive> select substring('abcde',3) from tableName;
cde
hive> select substr('abcde',-1) from tableName; (same behaviour as Oracle)
e

6. Substring with length: substr, substring ****
Syntax: substr(string A, int start, int len), substring(string A, int start, int len)
Returns: string
Description: returns the part of string A that starts at position start and has length len.
hive> select substr('abcde',3,2) from tableName;
cd
hive> select substring('abcde',3,2) from tableName;
cd
hive> select substring('abcde',-2,2) from tableName;
de

7. Upper case: upper, ucase ****
Syntax: upper(string A), ucase(string A)
Returns: string
Description: returns string A in upper case.
hive> select upper('abSEd') from tableName;
ABSED
hive> select ucase('abSEd') from tableName;
ABSED

8. Lower case: lower, lcase ***
Syntax: lower(string A), lcase(string A)
Returns: string
Description: returns string A in lower case.
hive> select lower('abSEd') from tableName;
absed
hive> select lcase('abSEd') from tableName;
absed

9. Trim: trim ***
Syntax: trim(string A)
Returns: string
Description: removes spaces from both ends of the string.
hive> select trim(' abc ') from tableName;
abc

10. Left trim: ltrim
Syntax: ltrim(string A)
Returns: string
Description: removes spaces from the left end of the string.
hive> select ltrim(' abc ') from tableName;
abc

11. Right trim: rtrim
Syntax: rtrim(string A)
Returns: string
Description: removes spaces from the right end of the string.
hive> select rtrim(' abc ') from tableName;
abc

12. Regular-expression replace: regexp_replace
Syntax: regexp_replace(string A, string B, string C)
Returns: string
Description: replaces the parts of string A that match the Java regular expression B with C. Note that escape characters are needed in some cases. Similar to regexp_replace in Oracle.
hive> select regexp_replace('foobar', 'oo|ar', '') from tableName;
fb

13. Regular-expression extract: regexp_extract
Syntax: regexp_extract(string subject, string pattern, int index)
Returns: string
Description: matches subject against the regular expression pattern and returns the capture group selected by index (index 0 returns the whole match).
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 1) from tableName;
the
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 2) from tableName;
bar
hive> select regexp_extract('foothebar', 'foo(.*?)(bar)', 0) from tableName;
foothebar
Note: escape characters are needed in some cases. In the query below the equals sign is escaped with a double backslash, following the rules for Java regular expressions inside a Hive string literal.
select data_field,
regexp_extract(data_field,'.*?bgStart\\=([^&]+)',1) as aaa,
regexp_extract(data_field,'.*?contentLoaded_headStart\\=([^&]+)',1) as bbb,
regexp_extract(data_field,'.*?AppLoad2Req\\=([^&]+)',1) as ccc
from pt_nginx_loginlog_st where pt = '2012-03-26' limit 2;

14. URL parsing: parse_url ****
Syntax: parse_url(string urlString, string partToExtract [, string keyToExtract])
Returns: string
Description: returns the specified part of the URL. Valid values for partToExtract are HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.
hive> select parse_url('www.tableName.com/path1/p.php…', 'HOST') from tableName;
www.tableName.com
hive> select parse_url('www.tableName.com/path1/p.php…', 'QUERY', 'k1') from tableName;
v1

15. JSON parsing: get_json_object ****
Syntax: get_json_object(string json_string, string path)
Returns: string
Description: parses the JSON string json_string and returns the content selected by path. Returns NULL if the input JSON string is invalid.
hive> select get_json_object('{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}], "bicycle":{"price":19.95,"color":"red"} },"email":"amy@only_for_json_udf_test.net","owner":"amy"}','$.owner') from tableName;
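The path argument of get_json_object can also reach into nested objects and array elements. The sketch below reuses a shortened version of the JSON document from the example above; the inline subquery, its alias `t`, the column name `js` and the key `no_such_key` are illustrative assumptions, and SELECT without FROM assumes Hive 0.13 or later.

```sql
SELECT
  get_json_object(js, '$.owner')               AS owner,             -- top-level key
  get_json_object(js, '$.store.bicycle.price') AS bicycle_price,     -- nested object
  get_json_object(js, '$.store.fruit[0].type') AS first_fruit_type,  -- array element, 0-based
  get_json_object(js, '$.no_such_key')         AS missing_key        -- NULL when the path does not exist
FROM (
  SELECT '{"store":{"fruit":[{"weight":8,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.95,"color":"red"}},"owner":"amy"}' AS js
) t;
```

Every value comes back as a string, so numeric fields such as the price usually need a cast before arithmetic.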

16. Space string: space
Syntax: space(int n)
Returns: string
Description: returns a string of n spaces.
hive> select space(10) from tableName;
hive> select length(space(10)) from tableName;
10

17. Repeat string: repeat ***
Syntax: repeat(string str, int n)
Returns: string
Description: returns str repeated n times.
hive> select repeat('abc',5) from tableName;
abcabcabcabcabc

18. ASCII of first character: ascii
Syntax: ascii(string str)
Returns: int
Description: returns the ASCII code of the first character of str.
hive> select ascii('abcde') from tableName;
97

19. Left pad: lpad
Syntax: lpad(string str, int len, string pad)
Returns: string
Description: pads str on the left with pad up to a length of len.
hive> select lpad('abc',10,'td') from tableName;
tdtdtdtabc
Note: unlike Greenplum and Oracle, the pad argument cannot be omitted.

20. Right pad: rpad
Syntax: rpad(string str, int len, string pad)
Returns: string
Description: pads str on the right with pad up to a length of len.
hive> select rpad('abc',10,'td') from tableName;
abctdtdtdt

21. Split string: split ****
Syntax: split(string str, string pat)
Returns: array
Description: splits str around matches of pat and returns the resulting array of strings.
hive> select split('abtcdtef','t') from tableName;
["ab","cd","ef"]

22. Find in set: find_in_set
Syntax: find_in_set(string str, string strList)
Returns: int
Description: returns the position of the first occurrence of str in strList, where strList is a comma-separated string. Returns 0 if str is not found.
hive> select find_in_set('ab','ef,ab,de') from tableName;
2
hive> select find_in_set('at','ef,ab,de') from tableName;
0

Aggregate functions

1. Count: count ***
Syntax: count(*), count(expr), count(DISTINCT expr[, expr...])
Returns: int
Description: count(*) counts the retrieved rows, including rows that contain NULL values; count(expr) returns the number of rows in which expr is not NULL; count(DISTINCT expr[, expr...]) returns the number of distinct non-NULL values.
hive> select count(*) from tableName;
20
hive> select count(distinct t) from tableName;
10

2. Sum: sum ***
Syntax: sum(col), sum(DISTINCT col)
Returns: double
Description: sum(col) returns the sum of col over the result set; sum(DISTINCT col) returns the sum of the distinct values of col.
hive> select sum(t) from tableName;
100
hive> select sum(distinct t) from tableName;
70

3. Average: avg ***
Syntax: avg(col), avg(DISTINCT col)
Returns: double
Description: avg(col) returns the average of col over the result set; avg(DISTINCT col) returns the average of the distinct values of col.
hive> select avg(t) from tableName;
50
hive> select avg(distinct t) from tableName;
30

4. Minimum: min ***
Syntax: min(col)
Returns: double
Description: returns the minimum value of col in the result set.
hive> select min(t) from tableName;
20

5. Maximum: max ***
Syntax: max(col)
Returns: double
Description: returns the maximum value of col in the result set.
hive> select max(t) from tableName;
120

6. Population variance: var_pop
Syntax: var_pop(col)
Returns: double
Description: returns the population variance of the non-NULL values of col (NULLs are ignored).

7. Sample variance: var_samp
Syntax: var_samp(col)
Returns: double
Description: returns the sample variance of the non-NULL values of col (NULLs are ignored).

8. Population standard deviation: stddev_pop
Syntax: stddev_pop(col)
Returns: double
Description: returns the population standard deviation, that is, the square root of the value returned by var_pop.

9. Sample standard deviation: stddev_samp
Syntax: stddev_samp(col)
Returns: double
Description: returns the sample standard deviation.

10. Percentile: percentile
Syntax: percentile(BIGINT col, p)
Returns: double
Description: returns the exact p-th percentile; p must be between 0 and 1. The col argument currently supports only integer types, not floating-point types.

11. Percentile (multiple): percentile
Syntax: percentile(BIGINT col, array(p1 [, p2]…))
Returns: array
Description: as above, but accepts several percentiles at once and returns an array containing the corresponding values.
select percentile(score, array(0.2,0.4)) from tableName; returns the values at the 0.2 and 0.4 positions.

12. Approximate percentile: percentile_approx
Syntax: percentile_approx(DOUBLE col, p [, B])
Returns: double
Description: returns the approximate p-th percentile; p must be between 0 and 1. The result is a double, and col may be a floating-point type. The parameter B trades memory for accuracy: the larger B is, the more accurate the result. The default is 10,000. When the number of distinct values in col is smaller than B, the result is the exact percentile.

13. Approximate percentile (multiple): percentile_approx
Syntax: percentile_approx(DOUBLE col, array(p1 [, p2]…) [, B])
Returns: array
Description: as above, but accepts several percentiles at once and returns an array containing the corresponding values.

14. Histogram: histogram_numeric
Syntax: histogram_numeric(col, b)
Returns: array<struct {'x','y'}>
Description: computes a histogram of col using b bins.
hive> select histogram_numeric(100,5) from tableName;
[{"x":100.0,"y":1.0}]

Complex type constructors

1. Map constructor: map ****
Syntax: map(key1, value1, key2, value2, …)
Description: builds a map from the given key/value pairs.
hive> Create table mapTable as select map('100','tom','200','mary') as t from tableName;
hive> describe mapTable;
t map<string,string>
hive> select t from mapTable;
{"100":"tom","200":"mary"}

2. Struct constructor: struct
Syntax: struct(val1, val2, val3, …)
Description: builds a struct from the given arguments.
hive> create table struct_table as select struct('tom','mary','tim') as t from tableName;
hive> describe struct_table;
t struct<col1:string,col2:string,col3:string>
hive> select t from struct_table;
{"col1":"tom","col2":"mary","col3":"tim"}

3. Array constructor: array
Syntax: array(val1, val2, …)
Description: builds an array from the given arguments.
hive> create table arr_table as select array("tom","mary","tim") as t from tableName;
hive> describe arr_table;
t array
hive> select t from arr_table;
["tom","mary","tim"]

Complex type access ****

1. Array access: A[n]
Syntax: A[n]
Operand types: A is an array, n is an int
Description: returns the n-th element of array A. Indexes start at 0. For example, if A is the array ['foo', 'bar'], then A[0] returns 'foo' and A[1] returns 'bar'.
hive> create table arr_table2 as select array("tom","mary","tim") as t from tableName;
hive> select t[0],t[1] from arr_table2;
tom mary

2. Map access: M[key]
Syntax: M[key]
Operand types: M is a map, key is one of its keys
Description: returns the value stored in map M under the given key. For example, if M is the map {'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'}, then M['all'] returns 'foobar'.
hive> Create table map_table2 as select map('100','tom','200','mary') as t from tableName;
hive> select t['200'],t['100'] from map_table2;
mary tom

3. Struct access: S.x
Syntax: S.x
Operand types: S is a struct
Description: returns field x of struct S. For example, for the struct foobar {int foo, int bar}, foobar.foo returns the foo field.
hive> create table str_table2 as select struct('tom','mary','tim') as t from tableName;
hive> describe str_table2;
t struct<col1:string,col2:string,col3:string>
hive> select t.col1,t.col3 from str_table2;
tom tim

Complex type size functions ****

1. Map size: size(Map<K,V>)
Syntax: size(Map<K,V>)
Returns: int
Description: returns the number of entries in the map.
hive> select size(t) from map_table2;
2

2. Array size: size(Array)
Syntax: size(Array)
Returns: int
Description: returns the number of elements in the array.
hive> select size(t) from arr_table2;
3

3. Type conversion ***
Type conversion function: cast
Syntax: cast(expr as <type>)
Returns: <type>
Description: returns expr converted to the given data type.
hive> select cast('1' as bigint) from tableName;
1
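One behaviour of cast worth knowing is what happens when the conversion cannot succeed. The sketch below contrasts successful and failing conversions; it assumes Hive 0.13 or later (SELECT without FROM), and the aliases are illustrative.

```sql
SELECT
  cast('1'   AS bigint) AS ok_int,        -- numeric string converts normally
  cast('3.9' AS double) AS ok_double,     -- decimal string converts normally
  cast('abc' AS bigint) AS not_a_number;  -- unparseable input yields NULL instead of an error
```

Because a failed cast returns NULL rather than raising an error, it is worth checking converted columns with IS NULL when loading untrusted data.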

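3. lateral view, explode, reflect and window functions in Hive

1. Using explode to split Map and Array columns of a Hive table

lateral view is used together with UDTFs such as explode (often applied to the array produced by split). It turns one row into several rows, and the resulting rows can then be aggregated. lateral view first applies the UDTF to every row of the base table; the UDTF expands each row into one or more rows; lateral view then combines the results into a virtual table that can be given an alias. explode itself can be used to expand a complex array or map column into multiple rows.

Requirement: the data has the following format
zhangsan    child1,child2,child3,child4    k1:v1,k2:v2
lisi    child5,child6,child7,child8    k3:v3,k4:v4

Fields are separated by \t. The task is to split out all the child values into a single column,

and also to split out the keys and values of the map, giving the following result:
+-----------+-------------+--+
| mymapkey  | mymapvalue  |
+-----------+-------------+--+
| k1        | v1          |
| k2        | v2          |
| k3        | v3          |
| k4        | v4          |
+-----------+-------------+--+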

Step 1: create the Hive database
hive (default)> create database hive_explode;
hive (default)> use hive_explode;

Step 2: create the Hive table, which will then be split with explode
hive (hive_explode)> create table t3(name string, children array<string>, address Map<string,string>) row format delimited fields terminated by '\t' collection items terminated by ',' map keys terminated by ':' stored as textFile;

Step 3: load the data
On node03, run the following commands to create the data file (fields are separated by tabs):
mkdir -p /export/servers/hivedatas/
cd /export/servers/hivedatas/
vim maparray
zhangsan    child1,child2,child3,child4    k1:v1,k2:v2
lisi    child5,child6,child7,child8    k3:v3,k4:v4

Load the data into the Hive table
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/maparray' into table t3;

Step 4: use explode to split the data
Split the array column:
hive (hive_explode)> SELECT explode(children) AS myChild FROM t3;

Split the map column:

hive (hive_explode)> SELECT explode(address) AS (myMapKey, myMapValue) FROM t3;
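The two queries above return only the exploded values. To keep the owning `name` next to each exploded value, explode can be wrapped in a LATERAL VIEW, as in the sketch below. It uses the `t3` table created above; the view aliases `child_view` and `addr_view` are illustrative names.

```sql
-- One output row per (name, child) pair.
SELECT name, myChild
FROM t3
LATERAL VIEW explode(children) child_view AS myChild;

-- One output row per (name, key, value) triple; a map explodes into two columns.
SELECT name, myMapKey, myMapValue
FROM t3
LATERAL VIEW explode(address) addr_view AS myMapKey, myMapValue;
```

Each source row is joined with the rows that explode produces from it, which is why the `name` column repeats once per element.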

2. Using explode to split a JSON string

Requirement: the data has the following format:
a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

The fields are separated by |. The goal is to extract every monthSales value as a single column:
4900
2090
6987

Step 1: create the Hive table
hive (hive_explode)> create table explode_lateral_view
> (area string,
> goods_id string,
> sale_info string)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '|'
> STORED AS textfile;

Step 2: prepare and load the data
Prepare the data as follows:
cd /export/servers/hivedatas
vim explode_json

a:shandong,b:beijing,c:hebei|1,2,3,4,5,6,7,8,9|[{"source":"7fresh","monthSales":4900,"userCount":1900,"score":"9.9"},{"source":"jd","monthSales":2090,"userCount":78981,"score":"9.8"},{"source":"jdmart","monthSales":6987,"userCount":1600,"score":"9.0"}]

Load the data into the Hive table
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/explode_json' overwrite into table explode_lateral_view;

Step 3: use explode to split the goods_id field (as an array)

hive (hive_explode)> select explode(split(goods_id,',')) as goods_id from explode_lateral_view;

Step 4: use explode to split the area field
hive (hive_explode)> select explode(split(area,',')) as area from explode_lateral_view;

5. Create the Hive table and load data (this table, person_info, is used by the rows-to-columns example in section 4 below)
Create the Hive table
hive (hive_explode)> create table person_info(name string, constellation string, blood_type string) row format delimited fields terminated by "\t";
Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/constellation.txt' into table person_info;

Step 5: split the JSON field
hive (hive_explode)> select explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')) as sale_info from explode_lateral_view;

Next, we try to use get_json_object to fetch the value of the monthSales key:
hive (hive_explode)> select get_json_object(explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{')),'$.monthSales') as sale_info from explode_lateral_view;

This fails:
FAILED: SemanticException [Error 10081]: UDTF's are not supported outside the SELECT clause, nor nested in expressions
The UDTF explode cannot be nested inside another function. Likewise, if you try to select a second column alongside it,
select explode(split(area,',')) as area,good_id from explode_lateral_view;
the query fails with
FAILED: SemanticException 1:40 Only a single expression in the SELECT clause is supported with UDTF's. Error encountered near token 'good_id'
When a UDTF is used directly in the SELECT clause, it must be the only expression. This is where LATERAL VIEW comes in.

3. Using explode with LATERAL VIEW

Query several columns with lateral view:
hive (hive_explode)> select goods_id2,sale_info from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2;
Here LATERAL VIEW explode(split(goods_id,','))goods acts as a virtual table that is joined with the base table explode_lateral_view as a Cartesian product. Several lateral views can also be chained:
hive (hive_explode)> select goods_id2,sale_info,area2 from explode_lateral_view LATERAL VIEW explode(split(goods_id,','))goods as goods_id2 LATERAL VIEW explode(split(area,','))area as area2;
The result is again the Cartesian product, this time of three tables.
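LATERAL VIEW also accepts an OUTER keyword, which matters when the UDTF produces no rows for some input. The sketch below follows the pattern used in the Hive documentation, exploding an empty array against the `explode_lateral_view` table from above; the alias `empty_view` and column name `x` are illustrative.

```sql
-- Without OUTER: explode() of an empty array emits nothing,
-- so every source row is dropped and the query returns no rows.
SELECT area, x
FROM explode_lateral_view
LATERAL VIEW explode(array()) empty_view AS x;

-- With OUTER: each source row is kept once, with x set to NULL.
SELECT area, x
FROM explode_lateral_view
LATERAL VIEW OUTER explode(array()) empty_view AS x;
```

In practice OUTER is useful when some rows hold empty or NULL arrays and those rows must still appear in the result.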

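Finally, the following statement converts this one row of JSON data into a fully two-dimensional table:

hive (hive_explode)> select get_json_object(concat('{',sale_info_1,'}'),'$.source') as source, get_json_object(concat('{',sale_info_1,'}'),'$.monthSales') as monthSales, get_json_object(concat('{',sale_info_1,'}'),'$.userCount') as userCount, get_json_object(concat('{',sale_info_1,'}'),'$.score') as score from explode_lateral_view LATERAL VIEW explode(split(regexp_replace(regexp_replace(sale_info,'\\[\\{',''),'}]',''),'},\\{'))sale_info as sale_info_1;

Summary:
Lateral View usually appears together with a UDTF; it works around the restriction that a UDTF must be the only expression in the SELECT clause.
Multiple Lateral Views produce something like a Cartesian product.
The Outer keyword makes a UDTF that returns no rows output NULL instead, so that source rows are not lost.

4. Rows to columns

1. Functions used
CONCAT(string A/col, string B/col…): returns the concatenation of the input strings; any number of inputs is supported.
CONCAT_WS(separator, str1, str2,...): a special form of CONCAT(). The first argument is the separator placed between the remaining arguments. The separator can be any string, like the other arguments. If the separator is NULL, the result is NULL. The function skips any NULL arguments after the separator. The separator is inserted between the strings being joined.
COLLECT_SET(col): accepts only primitive data types. It aggregates the values of a column, removing duplicates, and produces an array-typed field.

2. Data preparation
Table 6-6 Data preparation
name      constellation    blood_type
孙悟空    白羊座           A
老王      射手座           A
宋宋      白羊座           B
猪八戒    白羊座           A
凤姐      射手座           A

3. Requirement
Group together the people who share the same constellation and blood type. The expected result is:
射手座,A    老王|凤姐
白羊座,A    孙悟空|猪八戒
白羊座,B    宋宋

4. Create the local file constellation.txt and import the data
On the node03 server, run the following commands to create the file. Note that the fields are separated by \t.
cd /export/servers/hivedatas
vim constellation.txt

孙悟空    白羊座    A
老王    射手座    A
宋宋    白羊座    B
猪八戒    白羊座    A
凤姐    射手座    A

6. Query the data as required
hive (hive_explode)> select t1.base, concat_ws('|', collect_set(t1.name)) name from (select name, concat(constellation, "," , blood_type) base from person_info) t1 group by t1.base;

5. Columns to rows

1. Functions used
EXPLODE(col): expands a complex array or map column into multiple rows.
LATERAL VIEW
Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias
Explanation: used together with UDTFs such as explode (often over the output of split); it expands one column value into multiple rows, which can then be aggregated.

2. Data preparation
cd /export/servers/hivedatas
vim movie.txt
Fields are separated by \t.
《疑犯追踪》    悬疑,动做,科幻,剧情
《Lie to me》    悬疑,警匪,动做,心理,剧情
《战狼2》    战争,动做,灾难

3. Requirement
Expand the array of categories for each movie. The expected result is:
《疑犯追踪》    悬疑
《疑犯追踪》    动做
《疑犯追踪》    科幻
《疑犯追踪》    剧情
《Lie to me》    悬疑
《Lie to me》    警匪
《Lie to me》    动做
《Lie to me》    心理
《Lie to me》    剧情
《战狼2》    战争
《战狼2》    动做
《战狼2》    灾难

4. Create the Hive table and import the data
Create the Hive table
create table movie_info(movie string, category array<string>) row format delimited fields terminated by "\t" collection items terminated by ",";

Load the data
load data local inpath "/export/servers/hivedatas/movie.txt" into table movie_info;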

5. Query the data as required
select movie, category_name from movie_info lateral view explode(category) table_tmp as category_name;

6. The reflect function

The reflect function lets SQL call methods that ship with Java, which can replace many simple UDFs.

Using max from java.lang.Math to find the larger of two columns
Create the Hive table
create table test_udf(col1 int,col2 int) row format delimited fields terminated by ',';
Prepare the data
cd /export/servers/hivedatas
vim test_udf
1,2
4,3
6,4
7,5
5,6
Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf' overwrite into table test_udf;
Use max from java.lang.Math to find the larger of the two columns
hive (hive_explode)> select reflect("java.lang.Math","max",col1,col2) from test_udf;

Calling a different Java method for each record
Create the Hive table
hive (hive_explode)> create table test_udf2(class_name string,method_name string,col1 int , col2 int) row format delimited fields terminated by ',';
Prepare the data
cd /export/servers/hivedatas
vim test_udf2

java.lang.Math,min,1,2
java.lang.Math,max,2,3

Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/test_udf2' overwrite into table test_udf2;

Run the query
hive (hive_explode)> select reflect(class_name,method_name,col1,col2) from test_udf2;
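reflect resolves the Java method from the argument types it is given, so other static JDK methods that take int arguments can be called in the same way. The sketch below runs against the `test_udf` table created above; the choice of methods and the column aliases are illustrative, not part of the original walkthrough.

```sql
SELECT
  col1,
  col2,
  reflect("java.lang.Math", "min", col1, col2)         AS smaller,      -- Math.min(int, int)
  reflect("java.lang.Math", "abs", col1 - col2)        AS abs_diff,     -- Math.abs(int)
  reflect("java.lang.Integer", "toBinaryString", col1) AS col1_binary   -- Integer.toBinaryString(int)
FROM test_udf;
```

Since the method is looked up by reflection for the supplied argument types, a mismatch between the column types and the Java signature is the most likely cause of a failed call.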

Checking whether a value is a number
This uses a function from Apache Commons. The Commons jars are already on the Hadoop classpath, so the function can be called directly, as follows:
select reflect("org.apache.commons.lang.math.NumberUtils","isNumber","123")

7. Window functions and analytic functions

Hive also provides many window functions and analytic functions, mainly used in the following scenarios:
(1) sorting within partitions
(2) dynamic Group By
(3) Top N
(4) cumulative calculations
(5) hierarchical queries

1. Create the Hive table and load data
Create the table
hive (hive_explode)> create table order_detail(user_id string, device_id string, user_type string, price double, sales int) row format delimited fields terminated by ',';
Prepare the data

cd /export/servers/hivedatas
vim order_detail

zhangsan,1,new,67.1,2
lisi,2,old,43.32,1
wagner,3,new,88.88,3
liliu,4,new,66.0,1
qiuba,5,new,54.32,1
wangshi,6,old,77.77,2
liwei,7,old,88.44,3
wutong,8,new,56.55,6
lilisi,9,new,88.88,5
qishili,10,new,66.66,5
Load the data
hive (hive_explode)> load data local inpath '/export/servers/hivedatas/order_detail' into table order_detail;

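2. Window functions
FIRST_VALUE: the first value in the partition, after sorting, up to the current row.
LAST_VALUE: the last value in the partition, after sorting, up to the current row.
LEAD(col,n,DEFAULT): returns the value n rows after the current row within the window. The first argument is the column name, the second is the number of rows to look ahead (optional, default 1), and the third is the default value (used when the row n rows ahead does not exist; NULL if not specified).
LAG(col,n,DEFAULT): the opposite of LEAD; returns the value n rows before the current row within the window. The first argument is the column name, the second is the number of rows to look back (optional, default 1), and the third is the default value (used when the row n rows back does not exist; NULL if not specified).

3. The OVER clause
1. It can be used with the standard aggregate functions COUNT, SUM, MIN, MAX and AVG.
2. It can take a PARTITION BY clause with one or more partitioning columns of primitive data types.
3. It can take PARTITION BY together with ORDER BY, with one or more partitioning or ordering columns of any data type.
4. It can take a window specification. Window specifications support the following formats:
(ROWS | RANGE) BETWEEN (UNBOUNDED | [num]) PRECEDING AND ([num] PRECEDING | CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN CURRENT ROW AND (CURRENT ROW | (UNBOUNDED | [num]) FOLLOWING)
(ROWS | RANGE) BETWEEN [num] FOLLOWING AND (UNBOUNDED | [num]) FOLLOWING

When both ORDER BY and the window clause are missing, the window specification defaults to ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.

The OVER clause supports the following functions, but they cannot be combined with a window specification:
Ranking functions: Rank, NTile, DenseRank, CumeDist, PercentRank.
The Lead and Lag functions.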
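LEAD and LAG are described above without an example, so here is a minimal sketch against the `order_detail` table. It looks one row back and one row ahead within each user type, ordered by sales, with 0 as the fallback when no such row exists; the aliases `prev_sales` and `next_sales` are illustrative.

```sql
SELECT
  user_id,
  user_type,
  sales,
  -- value of sales on the previous row of the partition; 0 for the first row
  LAG(sales, 1, 0)  OVER (PARTITION BY user_type ORDER BY sales) AS prev_sales,
  -- value of sales on the next row of the partition; 0 for the last row
  LEAD(sales, 1, 0) OVER (PARTITION BY user_type ORDER BY sales) AS next_sales
FROM order_detail;
```

Note that neither call carries a ROWS BETWEEN clause, in line with the restriction stated above.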

使用窗口函数进行统计求销量使用窗口函数sum over统计销量

hive (hive_explode)> select user_id, user_type, sales, --分组内全部行 sum(sales) over(partition by user_type) AS sales_1 , sum(sales) over(order by user_type) AS sales_2 , --默认为从起点到当前行，若是sales相同，累加结果相同 sum(sales) over(partition by user_type order by sales asc) AS sales_3, --从起点到当前行，结果与sales_3不一样。根据排序前后不一样，可能结果累加不一样 sum(sales) over(partition by user_type order by sales asc rows between unbounded preceding and current row) AS sales_4, --当前行+往前3行 sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and current row) AS sales_5, --当前行+往前3行+日后1行 sum(sales) over(partition by user_type order by sales asc rows between 3 preceding and 1 following) AS sales_6, --当前行+日后全部行
sum(sales) over(partition by user_type order by sales asc rows between current row and unbounded following) AS sales_7 from order_detail order by user_type, sales, user_id;

统计以后求得结果以下： +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+ | user_id | user_type | sales | sales_1 | sales_2 | sales_3 | sales_4 | sales_5 | sales_6 | sales_7 | +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+ | liliu | new | 1 | 23 | 23 | 2 | 2 | 2 | 4 | 22 | | qiuba | new | 1 | 23 | 23 | 2 | 1 | 1 | 2 | 23 | | zhangsan | new | 2 | 23 | 23 | 4 | 4 | 4 | 7 | 21 | | wagner | new | 3 | 23 | 23 | 7 | 7 | 7 | 12 | 19 | | lilisi | new | 5 | 23 | 23 | 17 | 17 | 15 | 21 | 11 | | qishili | new | 5 | 23 | 23 | 17 | 12 | 11 | 16 | 16 | | wutong | new | 6 | 23 | 23 | 23 | 23 | 19 | 19 | 6 | | lisi | old | 1 | 6 | 29 | 1 | 1 | 1 | 3 | 6 | | wangshi | old | 2 | 6 | 29 | 3 | 3 | 3 | 6 | 5 | | liwei | old | 3 | 6 | 29 | 6 | 6 | 6 | 6 | 3 | +-----------+------------+--------+----------+----------+----------+----------+----------+----------+----------+--+

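Note: the result depends on ORDER BY, which sorts in ascending order by default. If ROWS BETWEEN is not specified, the frame defaults to the range from the start of the partition to the current row (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW), so rows with equal sort keys share the same running total. If ORDER BY is not specified, all values in the partition are summed.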
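The difference between sales_3 (default frame) and sales_4 (explicit ROWS frame) can be reproduced outside Hive. The following is a minimal Python sketch, not Hive's implementation: the function names are hypothetical, the data is the user_type = 'new' partition copied from the result table, and the order of tied rows (qiuba before liliu, qishili before lilisi) is an assumption inferred from the sales_4 column.

```python
from itertools import groupby

# (user_id, sales) for user_type = 'new', already sorted by sales ascending.
# The order of tied rows is an assumption inferred from the sales_4 column.
rows = [("qiuba", 1), ("liliu", 1), ("zhangsan", 2), ("wagner", 3),
        ("qishili", 5), ("lilisi", 5), ("wutong", 6)]


def running_sum_rows(rows):
    """ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW: counts physical rows."""
    total, out = 0, {}
    for user, sales in rows:
        total += sales
        out[user] = total
    return out


def running_sum_range(rows):
    """Default frame with ORDER BY: rows with an equal sort key are added together."""
    total, out = 0, {}
    for _, peers in groupby(rows, key=lambda r: r[1]):
        peers = list(peers)
        total += sum(sales for _, sales in peers)
        for user, _ in peers:
            out[user] = total
    return out


sales_3 = running_sum_range(rows)
sales_4 = running_sum_rows(rows)
print(sales_3["qiuba"], sales_4["qiuba"])      # 2 1
print(sales_3["qishili"], sales_4["qishili"])  # 17 12
```

Tied rows get the same total under the default frame, while under the ROWS frame the total depends on which tied row happens to come first.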

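The key is to understand ROWS BETWEEN, also called the WINDOW clause:

PRECEDING: rows before the current row
FOLLOWING: rows after the current row
CURRENT ROW: the current row
UNBOUNDED: no boundary (the start or the end of the partition)
UNBOUNDED PRECEDING: from the first row of the partition
UNBOUNDED FOLLOWING: to the last row of the partition

COUNT, AVG, MIN, and MAX are used the same way as SUM.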
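The frame boundaries above can be illustrated with a small Python sketch. This is only a model of ROWS BETWEEN under stated assumptions, not Hive code: `window_sum` is a hypothetical helper, `None` stands for UNBOUNDED, `0` stands for CURRENT ROW, and the input is the sales column of the user_type = 'new' partition sorted ascending.

```python
def window_sum(values, preceding=None, following=None):
    """Sum over ROWS BETWEEN <preceding> PRECEDING AND <following> FOLLOWING.

    None means UNBOUNDED, 0 means CURRENT ROW. `values` must already be
    in the ORDER BY order of a single partition.
    """
    n = len(values)
    out = []
    for i in range(n):
        lo = 0 if preceding is None else max(0, i - preceding)
        hi = n - 1 if following is None else min(n - 1, i + following)
        out.append(sum(values[lo:hi + 1]))
    return out


sales_new = [1, 1, 2, 3, 5, 5, 6]

print(window_sum(sales_new, 3, 0))     # rows between 3 preceding and current row
print(window_sum(sales_new, 3, 1))     # rows between 3 preceding and 1 following
print(window_sum(sales_new, 0, None))  # rows between current row and unbounded following
```

The three calls correspond to the sales_5, sales_6, and sales_7 columns of the query; replacing `sum` with `min`, `max`, or an average gives the other aggregates over the same frames.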

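Getting the first and last value in a group with first_value and last_value

Use first_value and last_value to get the first and last values within each group. Because the default frame ends at the current row, last_value returns the last value seen so far rather than the last value of the whole partition, which is why the aliases below start with curr_last.

select
    user_id,
    user_type,
    ROW_NUMBER() OVER(PARTITION BY user_type ORDER BY sales) AS row_num,
    first_value(user_id) over (partition by user_type order by sales desc) as max_sales_user,
    first_value(user_id) over (partition by user_type order by sales asc) as min_sales_user,
    last_value(user_id) over (partition by user_type order by sales desc) as curr_last_min_user,
    last_value(user_id) over (partition by user_type order by sales asc) as curr_last_max_user
from order_detail;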
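How last_value behaves depends on the window frame. The sketch below models this in Python under stated assumptions: `first_and_last` is a hypothetical helper, the rows are the user_type = 'old' partition taken from the earlier result table, the sort is ascending, and `full_frame=True` stands for adding ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the OVER clause.

```python
def first_and_last(rows, full_frame=False):
    """Return {user_id: (first_value, last_value)} for one partition.

    rows: list of (user_id, sales) sorted by sales ascending.
    full_frame=False models the default frame, which ends at the current
    row (including rows tied with it); full_frame=True models a frame
    that spans the whole partition.
    """
    out = {}
    for user, sales in rows:
        if full_frame:
            frame = rows
        else:
            frame = [r for r in rows if r[1] <= sales]
        out[user] = (frame[0][0], frame[-1][0])
    return out


old_rows = [("lisi", 1), ("wangshi", 2), ("liwei", 3)]

print(first_and_last(old_rows))
print(first_and_last(old_rows, full_frame=True))
```

With the default frame each row reports itself as the last value, while the full frame reports liwei for every row. first_value is lisi in both cases because both frames start at the first row of the partition.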

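4. Analytic functions

1) ROW_NUMBER(): numbers the rows within a partition sequentially, starting from 1, in the given order. For example, ordering by pv descending yields each day's pv rank within its group. ROW_NUMBER() has many uses, such as fetching the top-ranked record of each group or the first refer in a session.
2) RANK(): returns the rank of a row within its partition; ties leave gaps in the ranking.
3) DENSE_RANK(): returns the rank of a row within its partition; ties leave no gaps in the ranking.
4) CUME_DIST: (number of rows with a value less than or equal to the current value) / (total number of rows in the partition). For example, the share of all employees whose salary is less than or equal to the current salary.
5) PERCENT_RANK: (RANK of the current row within the partition - 1) / (total number of rows in the partition - 1).
6) NTILE(n): splits the ordered rows of a partition into n slices and returns the slice number of the current row. If the rows do not divide evenly, the extra rows go to the first slices. NTILE does not support ROWS BETWEEN; for example, NTILE(2) OVER(PARTITION BY cookieid ORDER BY createtime ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) is not allowed.

Using RANK, ROW_NUMBER, and DENSE_RANK with OVER

These functions make it possible to compute a top N per group.
Requirement: group the rows by user type and find the N rows with the highest sales.

select
    user_id,
    user_type,
    sales,
    RANK() over (partition by user_type order by sales desc) as r,
    ROW_NUMBER() over (partition by user_type order by sales desc) as rn,
    DENSE_RANK() over (partition by user_type order by sales desc) as dr
from order_detail;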

+-----------+------------+--------+----+-----+-----+
| user_id   | user_type  | sales  | r  | rn  | dr  |
+-----------+------------+--------+----+-----+-----+
| wutong    | new        | 6      | 1  | 1   | 1   |
| qishili   | new        | 5      | 2  | 2   | 2   |
| lilisi    | new        | 5      | 2  | 3   | 2   |
| wagner    | new        | 3      | 4  | 4   | 3   |
| zhangsan  | new        | 2      | 5  | 5   | 4   |
| qiuba     | new        | 1      | 6  | 6   | 5   |
| liliu     | new        | 1      | 6  | 7   | 5   |
| liwei     | old        | 3      | 1  | 1   | 1   |
| wangshi   | old        | 2      | 2  | 2   | 2   |
| lisi      | old        | 1      | 3  | 3   | 3   |
+-----------+------------+--------+----+-----+-----+
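The way the three ranking functions treat ties can be sketched in a few lines of Python. This is an illustration only, with a hypothetical `rankings` helper; the input is the sales column of the user_type = 'new' partition, already sorted descending as in the query.

```python
def rankings(sorted_keys):
    """Return (rank, row_number, dense_rank) for each sort key.

    sorted_keys must already be in ORDER BY order for a single partition.
    """
    out = []
    rank = dense = 0
    prev = object()  # sentinel that compares unequal to every key
    for row_number, key in enumerate(sorted_keys, start=1):
        if key != prev:
            rank = row_number  # RANK jumps to the row position, leaving gaps
            dense += 1         # DENSE_RANK advances by one, leaving no gaps
            prev = key
        out.append((rank, row_number, dense))
    return out


sales_new_desc = [6, 5, 5, 3, 2, 1, 1]
for key, (r, rn, dr) in zip(sales_new_desc, rankings(sales_new_desc)):
    print(key, r, rn, dr)

# Top N per group: keep the rows whose row_number is at most N.
top3 = [key for key, (_, rn, _) in zip(sales_new_desc, rankings(sales_new_desc)) if rn <= 3]
```

Filtering on `rn` always returns exactly N rows per group, whereas filtering on `r` or `dr` can return more than N rows when there are ties.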

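Using NTILE to compute percentages

NTILE splits the data into a given number of slices, which can then be used to select a percentage of the rows.

Slicing the data with NTILE:

select
    user_type,
    sales,
    -- split each partition into 2 slices
    NTILE(2) OVER(PARTITION BY user_type ORDER BY sales) AS nt2,
    -- split each partition into 3 slices
    NTILE(3) OVER(PARTITION BY user_type ORDER BY sales) AS nt3,
    -- split each partition into 4 slices
    NTILE(4) OVER(PARTITION BY user_type ORDER BY sales) AS nt4,
    -- split all rows into 4 slices
    NTILE(4) OVER(ORDER BY sales) AS all_nt4
from order_detail
order by user_type, sales;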

The result is as follows:

+------------+--------+------+------+------+----------+
| user_type  | sales  | nt2  | nt3  | nt4  | all_nt4  |
+------------+--------+------+------+------+----------+
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 1      | 1    | 1    | 1    | 1        |
| new        | 2      | 1    | 1    | 2    | 2        |
| new        | 3      | 1    | 2    | 2    | 3        |
| new        | 5      | 2    | 2    | 3    | 4        |
| new        | 5      | 2    | 3    | 3    | 3        |
| new        | 6      | 2    | 3    | 4    | 4        |
| old        | 1      | 1    | 1    | 1    | 1        |
| old        | 2      | 1    | 2    | 2    | 2        |
| old        | 3      | 2    | 3    | 3    | 2        |
+------------+--------+------+------+------+----------+
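The slice numbers in the table follow from a simple rule: every slice gets the same base size, and any leftover rows are handed to the first slices one at a time. A minimal Python sketch of that rule, with a hypothetical `ntile` helper that only needs the partition's row count:

```python
def ntile(n_rows, n_buckets):
    """Return the 1-based slice number for each of n_rows ordered rows."""
    base, extra = divmod(n_rows, n_buckets)
    out = []
    for bucket in range(1, n_buckets + 1):
        # The first `extra` slices receive one additional row each.
        size = base + (1 if bucket <= extra else 0)
        out.extend([bucket] * size)
    return out


print(ntile(7, 2))   # 'new' partition, nt2
print(ntile(7, 3))   # 'new' partition, nt3
print(ntile(7, 4))   # 'new' partition, nt4
print(ntile(10, 4))  # all rows, all_nt4 (in order of sales)
```

When a partition has fewer rows than slices, as with the 3 'old' rows and NTILE(4), only the first slices are used.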

Using NTILE to get the ids of the top 20% of users by sales:

select user_id
from (
    select
        user_id,
        NTILE(5) OVER(ORDER BY sales desc) AS nt
    from order_detail
) A
where nt = 1;
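The same selection can be traced by hand in Python. This is a rough sketch under stated assumptions: `top_slice` is a hypothetical helper, the rows are copied from the earlier result tables, and ties are broken by input order here, whereas in Hive the order of tied rows is not guaranteed.

```python
orders = [("liliu", 1), ("qiuba", 1), ("zhangsan", 2), ("wagner", 3),
          ("lilisi", 5), ("qishili", 5), ("wutong", 6),
          ("lisi", 1), ("wangshi", 2), ("liwei", 3)]


def top_slice(rows, n_buckets):
    """Return the user ids that fall into slice 1 of NTILE(n_buckets) by sales desc."""
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    base, extra = divmod(len(ranked), n_buckets)
    first_size = base + (1 if extra > 0 else 0)  # slice 1 takes a leftover row if any
    return [user for user, _ in ranked[:first_size]]


top_20_percent = top_slice(orders, 5)
print(top_20_percent)
```

With 10 rows and 5 slices, slice 1 holds 2 users: wutong plus one of the two users tied at sales = 5.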

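5. Enhanced aggregation: CUBE, GROUPING SETS, and ROLLUP

These analytic functions are typically used in OLAP for metrics that cannot simply be added up and that need to be rolled up and drilled down along different dimensions, for example UV counts by hour, day, and month.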

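GROUPING SETS

Within a single GROUP BY query, GROUPING SETS aggregates over several combinations of dimensions. It is equivalent to a UNION ALL of the GROUP BY results for each combination. GROUPING__ID indicates which grouping set a result row belongs to. The GROUPING__ID values shown below follow the numbering used by Hive releases before 2.3.0; later releases compute the values differently.

Requirement: compute the counts grouped by user_type and by sales separately.

0: jdbc:hive2://node03:10000> select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from order_detail
group by user_type, sales
GROUPING SETS(user_type, sales)
ORDER BY GROUPING__ID;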

The result is as follows:

+------------+--------+-----+---------------+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
+------------+--------+-----+---------------+

Requirement: compute the counts grouped by user_type, by sales, and by user_type + sales.

0: jdbc:hive2://node03:10000> select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from order_detail
group by user_type, sales
GROUPING SETS(user_type, sales, (user_type, sales))
ORDER BY GROUPING__ID;

The result is as follows:

+------------+--------+-----+---------------+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| NULL       | 1      | 3   | 2             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 1      | 2   | 3             |
| new        | 2      | 1   | 3             |
+------------+--------+-----+---------------+
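The statement that GROUPING SETS equals a UNION ALL of separate GROUP BY results can be checked with a small Python model. This is a sketch, not Hive's execution plan: `grouping_sets` and `COLUMN_INDEX` are hypothetical names, the rows are copied from the earlier result tables, NULL is represented by `None`, and GROUPING__ID is left out because its numbering depends on the Hive version.

```python
from collections import Counter

# (user_id, user_type, sales), copied from the earlier result tables.
orders = [("liliu", "new", 1), ("qiuba", "new", 1), ("zhangsan", "new", 2),
          ("wagner", "new", 3), ("lilisi", "new", 5), ("qishili", "new", 5),
          ("wutong", "new", 6), ("lisi", "old", 1), ("wangshi", "old", 2),
          ("liwei", "old", 3)]

COLUMN_INDEX = {"user_type": 1, "sales": 2}  # position of each GROUP BY column


def grouping_sets(rows, sets):
    """Return [((user_type, sales), pv), ...] for each grouping set in turn.

    A column that is not part of the current grouping set is reported as None.
    """
    result = []
    for cols in sets:  # one GROUP BY per grouping set, concatenated (UNION ALL)
        counts = Counter(
            tuple(row[idx] if name in cols else None
                  for name, idx in COLUMN_INDEX.items())
            for row in rows
        )
        result.extend(counts.items())
    return result


for key, pv in grouping_sets(orders, [("user_type",), ("sales",)]):
    print(key, pv)
```

Passing `[("user_type",), ("sales",), ("user_type", "sales")]` instead yields the 15 rows of the second result table.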

6. Using CUBE and ROLLUP

CUBE aggregates over all combinations of the GROUP BY dimensions.

Aggregating with CUBE

Requirement: compute the counts with no grouping at all, grouped by user_type, grouped by sales, and grouped by user_type + sales.

0: jdbc:hive2://node03:10000> select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from order_detail
group by user_type, sales
WITH CUBE
ORDER BY GROUPING__ID;

+------------+--------+-----+---------------+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+
| NULL       | NULL   | 10  | 0             |
| new        | NULL   | 7   | 1             |
| old        | NULL   | 3   | 1             |
| NULL       | 6      | 1   | 2             |
| NULL       | 5      | 2   | 2             |
| NULL       | 3      | 2   | 2             |
| NULL       | 2      | 2   | 2             |
| NULL       | 1      | 3   | 2             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+

Aggregating with ROLLUP

ROLLUP is a subset of CUBE: it aggregates hierarchically, starting from the leftmost dimension.

select
    user_type,
    sales,
    count(user_id) as pv,
    GROUPING__ID
from order_detail
group by user_type, sales
WITH ROLLUP
ORDER BY GROUPING__ID;

+------------+--------+-----+---------------+
| user_type  | sales  | pv  | grouping__id  |
+------------+--------+-----+---------------+
| NULL       | NULL   | 10  | 0             |
| old        | NULL   | 3   | 1             |
| new        | NULL   | 7   | 1             |
| old        | 3      | 1   | 3             |
| old        | 2      | 1   | 3             |
| old        | 1      | 1   | 3             |
| new        | 6      | 1   | 3             |
| new        | 5      | 2   | 3             |
| new        | 3      | 1   | 3             |
| new        | 2      | 1   | 3             |
| new        | 1      | 2   | 3             |
+------------+--------+-----+---------------+
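Both WITH CUBE and WITH ROLLUP can be read as shorthand for a list of grouping sets. The Python sketch below only enumerates those sets; the helper names `cube_sets` and `rollup_sets` are hypothetical, and the empty tuple stands for the grand total row with no grouping.

```python
from itertools import combinations


def cube_sets(cols):
    """CUBE: every subset of the GROUP BY columns, from none to all."""
    return [subset
            for size in range(len(cols) + 1)
            for subset in combinations(cols, size)]


def rollup_sets(cols):
    """ROLLUP: only the prefixes of the GROUP BY columns, left to right."""
    return [tuple(cols[:size]) for size in range(len(cols) + 1)]


dims = ("user_type", "sales")
print(cube_sets(dims))
print(rollup_sets(dims))
```

For two dimensions CUBE produces four grouping sets and ROLLUP three. The set that groups by sales alone is the one ROLLUP leaves out, which matches the missing grouping__id = 2 rows in the ROLLUP result.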
