hive基础1

时间 2019-11-09

标签 hive 基础栏目 Hadoop 繁體版

原文原文链接

Hive基础

一、介绍

Hive是OLAP(online analyze process，在线分析处理)。一般称为数据仓库，简称数仓。内置不少分析函数，可进行海量数据的在线分析处理。hive构建在hadoop之上，使用hdfs做为进行存储，计算过程采用的是Mapreduce完成，本质上hive是对hadoop的mr的封装，经过原始的mr方式进行数据处理与分析，每每效率较低，并且具备至关的复杂度，学习曲线较长。hive经常使用传统的sql方式做为操做手段，极大的下降了学习曲线，毕竟大部分人对sql仍是比较熟悉的。但在运行时，仍然要将sql进行翻译成mapreduce程序进行。java

二、hive安装

下载hive的软件包，解压以后配置环境变量便可。node

三、hive配置

hive中元数据存在关系型数据库中，默认是derby数据库，这里改为mysql数据库。hive的配置文件为conf/hive-site.xml目录下：mysql

<configuration>
  ...
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.231.1:3306/big11_hive</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>Username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>root</value>
    <description>password to use against metastore database</description>
  </property>
  <property>
    <name>hive.server2.enable.doAs</name>
    <value>false</value>
    <description>
      Setting this property to true will have HiveServer2 execute
      Hive operations as the user making the calls to it.
    </description>
  </property>
  <property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in metastore matches with one from Hive jars.  Also disable automatic
            schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
            proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>


  ...
</configuration>

经过hive命令查看或者修改配置：算法

header设置sql

# 查看属性
$hive>set hive.cli.print.header ;
# 修改属性
$hive>set hive.cli.print.header=true ;

四、初始化数据库

$>schematool -dbtype mysql -dbType mysql -initSchema

五、登陆hive的命令行终端

$>hive
$hive>

六、hive经常使用命令

建立数据库shell

$hive>create database if not exists big12 ;

删除库数据库
```
$hive>drop database if exists big12 ;
```

建立表apache

$hive>CREATE TABLE if not exists emp(
id int ,
name string ,
age int,
dep string,
salary int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
lines terminated by  '\n'
stored as textfile;

准备数据上传到emp表centos

[emp.txt]服务器

1,tom,22,001,3000
2,tomas,22,001,2500
3,tomasLee,25,002,1200
4,tomson,24,002,3400
5,peter,24,001,3800
6,john,25,001,4200
7,looser,24,001,5500
8,alex,25,003,6000
9,jason,24,003,6000
10,japser,22,003,3000

# local是上传文件到hdfs上(hive仓库目录下)
$hive>load data local inpath '/home/centos/emp2.txt' into table emp ;

# 移动hdfs的目录到hive的表目录下。
$hive>load data inpath '/user/centos/emp.txt' into table emp ;

重命名表
```
$hive>alter table emp rename to e ;
```

七、hive的复杂类型

7.1类型

map

映射，key-value对
struct

结构体，至关于元组（等价于java的实体类），有多个属性，每一个属性有不一样类型。在RDBMS中至关于一行记录。
array

也能够成为list，有多行构成。

7.2 使用复杂类型

建立表

$hive>CREATE TABLE emp(
name string,
arr ARRAY<string>,
stru STRUCT<sex:string,age:int>,
map1 MAP<string,int>,
map2 MAP<string,ARRAY<string>>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
lines terminated by '\n';

准备数据

Michael|Montreal,Toronto|Male,30|DB:80|Product:Developer^DLead
Will|Montreal|Male,35|Perl:85|Product:Lead,Test:Lead
Shelley|New York|Female,27|Python:80|Test:Lead,COE:Architect
Lucy|Vancouver|Female,57|Sales:89,HR:94|Sales:Lead

查询复杂数据类型

$hive>select arr[0] , stru.sex , map1["DB"] ,map2["Product"][0] from emp ;

CTAS方式建立表

# create table as .. select ..
$hive>create table emp2 ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
lines terminated by '\n' as select name ,arr from emp ;

Like方式建立表

like方式建立表和原表结构相同，不携带数据。
```
$hive>create external table emp3 like emp ;
```

表数据复制

$hive>insert into emp3 select * from emp ;

八、复杂类型对应的函数

查看全部函数
```
$hive>show functions ;
```

查看函数帮助

# 查看函数
$hive>desc function array ;

# 查看函数扩展信息
$hive>desc function extended array ;

array()
```
$hive>select array(1,2,3) ;
```

struct()

不带名成的结构体。

# 构造结构体
$hive>select struct(1, 'tomas' , 12 ,'hebei' , 80000) ;
# 无名列，使用coln访问
$hive>select struct(1, 'tomas' , 12 ,'hebei' , 80000).col5 ;

named_struct()

带有名称的结构体。

# 构造带名结构体
$hive>select named_struct('id',1,'name','tomas','sal' , 80000) ;
# 查询特定字段
$hive>select named_struct('id',1,'name','tomas','sal' , 80000).sal ;

map()

map()函数试音key-value映射，使用方式同named_struct相相似。

# 构造带名结构体
$hive>select map('id',1,'name','tomas','sal' , 80000) ;
# 查询特定字段
$hive>select map('id',1,'name','tomas','sal' , 80000).sal ;

九、分区表

在表目录再根据分区字段建立的子目录，可以有效缩小搜索范围。

建立分区表

$hive>CREATE TABLE par1(
id int,
name string,
age int
)
PARTITIONED BY (prov string, city string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':'
lines terminated by '\n';

手动添加分区

$hive>alter table par1 add partition(prov='henan' , city='kaifeng') partition(prov='henan' , city='zhengzhou');

显式分区信息
```
$hive>show partitions par1 ;
```

删除分区

$hive>alter table par1 DROP IF EXISTS PARTITION (prov='henan' , city='kaifeng');

加载数据到分区

$hive>load data local inpath '/home/centos/1.txt' into table par1 partition(prov='henan' , city='kaifeng') ;

# -f :force 覆盖原有文件
$>hdfs dfs -put -f par1.txt /user/hive/warehouse/big12.db/par1/prov=henan/city=kaifeng

复制数据到指定分区下

# 分区表在严格模式下，必须至少指定一个静态分区。
$hive>insert into par1 partition(prov='hebei' , city) select 1 , 'tom' , 'cc';       
# 指定了两个都是动态分区(错误的 )
$hive>insert into par1 partition(prov , city) select 10,'tom10',56,'Liaoning' , 'DL';

# 设置动态分区非严格模式
$hvie>set hive.exec.dynamic.partition.mode=nonstrict ;

动态分区模式

动态分区有严格和非严格之分。严格模式下插入数据时至少有一个是静态分区，非严格模式下均可以是动态分区。
```
# 非严格
$hive>set hive.exec.dynamic.partition.mode=nonstrict ; 
# 严格
$hive>set hive.exec.dynamic.partition.mode=strict ;
```

关闭动态分区

# true | false
$hive>set hive.exec.dynamic.partition=true ;

十、桶表

桶表原理就是hash，针对文件进行划分。分区表是针对目录划分。对记录须要单条检索的时候可使用桶表。分桶的大小推荐咱们每一个buck数据量是blocksize的2倍。桶表不能使用load指令进行数据的加载。经过表数据复制方式实现桶表数据存储。

建立桶表

$hive>CREATE TABLE buck1(
id int ,
name string
)
CLUSTERED BY (id) INTO 3 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
lines terminated by  '\n';

复制数据到桶表

$hive>insert into buck1 select id,name from par1 ;

十一、文件格式

在hive中列存储格式的文件具备较好的性能，面向列的存储将同一列的值连续存储起来，当进行投影（查询若干字段，而不是所有字段）查询时，能够发挥磁盘的线性读写，默认是文本文件。

textfile

文本文件，默认模式。

# 建立表指定格式
$hive>CREATE TABLE .. STORED AS textfile ;

# 修改表，指定格式
$hive>ALTER TABLE .. [PARTITION partition_spec] SET FILEFORMAT textfile ;

sequencefile

序列文件，key-value存储方式。能够指定压缩形式，有block、record、none。
rcfile

Record Columnar File，kv存储，相似于sequencefile，将数据文件水平切割多个组。若干group存放在一个hdfs中，先保存全部行的第一列，第二列，以此类推。该文件可切割.能够跳过不相关部分，更快获得数据，成本更低。

经过CTAS方式建立RCFile文件：
```
$hive>create table rcfile1 stored as rcfile as select * from buck1 ;
```
orc

Optimized Row Columnar，RCFile加强版。支持的大数据块(256M)。和rcfile的不一样是使用特殊的编码器感知数据类型，并依据不一样的类型进行压缩优化。同时也存储了基于column的一些基本的统计(MIN, MAX, SUM, and COUNT)信息,还有轻量级索引。支持范围较窄，只有hive和pig。
```
$hive>create table orc1 stored as orc as select * from buck1;
```
parquet

相似于orc，相对于orc文件格式，hadoop生态系统中大部分工程都支持parquet文件。
```
$hive>create table parquet1 stored as parquet as select * from buck1;
```

十二、事务支持

hive对事务的支持是有限制的，须要桶表+事务属性+orc文件+事务控制。

建立orc文件格式的桶表，并指定支持事务

$hive>create table tx(id int ,name string , age int)
clustered by (id) into 2 buckets
stored as orc TBLPROPERTIES('transactional'='true');

设置hive相关的属性支持

$hive>SET hive.support.concurrency = true;
SET hive.enforce.bucketing = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
SET hive.compactor.initiator.on = true;
SET hive.compactor.worker.threads = 1;

复制数据都事务表

$hive>insert into tx select id , name , 18 from par1 ;

1三、查询

去重

指定多个reduce有效，key进行hash处理，key只要相同，hash到同一reduce，所以可使用多个reduce实现去重。
```
$hive>set mapreduce.job.reduces= 3 ;
$hive>select distinct id ,name from par1 ;
```
排序

hive实现全排序有两种方式，order by方式和sort by方式。order by使用reduce实现，致使数据倾斜。
- order by
```
$hive>select * from par1 order by id desc ;
```
- sort by
  
  部分排序，在reduce端按照指定的key进行部分排序。如下语句在每一个reduce内按照sal降序排列。
```
$hive>create table tmp as select * from emp2 sort by sal desc ;
```
- distribute by
  
  按照指定字段进行分发，就是分区过程，该语句须要在sort by以前。
```
$hive>create table tmp3 as select * from emp2 distribute by dep sort by sal desc;
```
- cluster by
  
  若是distribute by 和 sort by使用的是相同的字段，则能够直接使用cluster by。
```
$hive>create table tmp4 as select * from emp2 cluster by sal desc ;
```
- 对气温数据进行全排序
  
  年份升序全排序，年分内气温值降序排列。
  1. 建立表
```
$hive>CREATE TABLE temps(
year int,
temp int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
lines terminated by '\n';
```
  2. 加载数据
```
$hive>load data
```
  3. 执行查询
```
$hive>create table tmp5 as select * from temps distribute by case when year < 1930 then 0 when year > 1960 then 2 else 1 end sort by year asc , temp desc ;
```

链接查询

链接查询实现方式能够是map端链接，也能够reduce端链接。

hive自动转换链接到map端链接使用以下属性控制：
```
# 是否启用自动转换成map端控制
$hive>set hive.auto.convert.join=true ;

#无条件任务链接自动转换
$hive>set hive.auto.convert.join.noconditionaltask=true ;

#
$hive>set hive.auto.convert.join.noconditionaltask.size=100000000 ;
```
- map端链接
  
  map端链接不须要reduce，天然也没有shuffle，性能好。使用场景是小表+大表，小表能够载入内存。
```
$hive>select a.*,b.* FROM custs a ,orders b WHERE b.cid = a.id and b.id < 10;
```
- reduce链接
  
  若是大表 + 大表方式，须要使用reduce端链接。

倾斜控制

hive对倾斜链接的控制。

# 是否启用倾斜链接控制，默认false。
$hive>set hive.optimize.skewjoin=true ;

# 链接key超过100000行。
$hive>set hive.skewjoin.key=100000;

$hive>set hive.optimize.skewjoin=true ;

# 控制倾斜链接的map端任务数
$hive>set hive.skewjoin.mapjoin.map.tasks=10000 ;

# 控制切片大小
$hive>set hive.skewjoin.mapjoin.min.split=33554432 ;

1四、hiveserver2服务器

hiveserver2是相似于mysql的服务器，接受jdbc的链接请求，转换成mr程序，在hadoop上运行。对于其余的开发人员就能够经过jdbc方式进行访问了。

14.1 启动hiveserver2服务器

# 后台启动hiveserver2服务器
$>/soft/hive/bin/hiveserver2 &

14.2 验证是否成功

$>netstat -anop |grep 10000

14.3 启动beeline命令行

# 启动
$>/soft/hive/bin/beeline

# 查看帮助
$beeline>!help

# 链接到库
$beeline>!connect jdbc:hive2://localhost:10000/big12 ;

# 执行查询
$beeline>select * from par1 ;

14.3 使用jdbc客户端访问hiveserver

@Test
public void testSelect() throws Exception {
  String driver = "org.apache.hive.jdbc.HiveDriver" ;
  Class.forName(driver) ;
  Connection conn = DriverManager.getConnection("jdbc:hive2://s101:10000/big12");
  Statement st = conn.createStatement();
  ResultSet rs = st.executeQuery("select * from par1 sort by age desc") ;
  while(rs.next()){
    int id = rs.getInt("id") ;
    String name = rs.getString("name") ;
    String prov= rs.getString("prov") ;
    String city = rs.getString("city") ;
    System.out.printf("%d,%s,%s,%s\r\n" , id ,name , prov , city);
  }
  rs.close();
  conn.close();
}

1五、导出导入表

15.1 导出

导出表包含表结构和数据。可用于数据迁移。

# 导出指定表的内容到hdfs目录下
$hive>export big12.par1 to '/user/centos/par1' ;

运行结果以下：

_metadata是元数据，包含的是表结构信息。

15.2 导入

导入时将导出的数据目录和表机构导入到当前的数据库中。

# 指定目录记录
$hive>import from '/user/centos/par1' ;

15.3 导出hive数据到本地目录

$hive>INSERT OVERWRITE LOCAL DIRECTORY '/home/centos/par1'
SELECT * FROM par1 ;

1六、压缩控制

16.1 中间结果的压缩处理

中间文件是指在mr之间产生的输出文件是否进行压缩。

# 指定是否压缩
$hive>set hive.exec.compress.intermediate=true ;
# 使用的压缩算法
$hive>SET hive.intermediate.compression.codec=org.apache.hadoop.io.compress.SnappyCodec ;

16.2 控制job的输出压缩

# 指定是否压缩
$hive>SET hive.exec.compress.output=true ;
# 使用的压缩算法
$hive>SET mapred.output.compression.codec=org.apache.hadoop.io.compress.Lz4Codec ;

1七、视图

视图是虚表，能够理解为一些复杂查询语句的别名，使用表的方式来对待。

17.1 建立视图

$hive>create view v1 as

17.2 使用视图

$hive>select * from v1 ;

17.3 删除视图

$hive>drop view if exists v1 ;

1八、虚列

hive自动为表分配的列，建表语句中不须要指定这些列的。

$hive>select id ,INPUT__FILE__NAME ,BLOCK__OFFSET__INSIDE__FILE ,ROW__OFFSET__INSIDE__BLOCK from par1 ;

1九、高级聚合函数

聚合函数和group by配合使用。select子句中的字段要么是聚合函数，或者是分组字段。

19.1 常规聚合

# ok
$hive>select dep , avg(sal) from emp2 group by dep ;
# error ,name不是分组字段，也不是聚合函数
$hive>select name , dep , avg(sal) from emp2 group by dep ;

19.2 高级函数

19.2.1 高级聚合函数

在group by以后经过grouping sets指定各类聚合条件。group by指定哪些字段参与聚合，grouping sets指定具体按照哪些组合进行聚合。

grouping sets

$hive>select dep,age,avg(sal) from emp2 group by dep,age grouping sets(dep,age, (dep,age)) ;

cube

$hive>select dep,age,avg(sal) from emp2 group by age , dep with cube ;
$hive>select dep,age,avg(sal) from emp2 group by age , dep grouping sets(dep,age,(dep,age),()) ;

rollup

$hive>select dep,age,substr(name,1,1) ,avg(sal) from emp2 group by dep,age,substr(name,1,1) with rollup ;

$hive>select dep,age,substr(name,1,1) ,avg(sal) from emp2 group by dep,age,substr(name,1,1) grouping sets((dep,age,substr(name,1,1)),(dep,age),(dep),()) ;

19.2.2 分析函数

聚合函数不能将常规字段进行统计输出，输出的只能是分组字段和聚合函数表达式。分析函数能够将聚合字段和每条记录组合输出。有开窗函数，开窗函数能够按照行开窗和范围开窗。

rows between .. and ..

行开窗，查行数。
range between ..add ..

范围开窗，对字段值进行加减后的范围肯定数据的范围。

语法结构：

function (arg1,..., argn) OVER ([PARTITION BY <..>] [ORDER BY <..>] [<window_clause>])

分区

# 使用分区
$hive>select id,name,age,salary , avg(salary) over (partition by dep) from emp2;

$hive>select id,name,age,salary , avg(salary) over (partition by dep , age) from emp2 ;

排序

# first_value
$hive>select id,name,age,sal,first_value(sal) over (order by id desc) from emp2  ;

# last_value, 开窗范围是前导全部行到当前行结束
# rows between 指定开窗范围
# UNBOUNDED PRECEDING 前导全部行
# current row 当前行
# UNBOUNDED following 后续全部行
$hive>select id,name,age,sal,last_value(sal) over (order by id desc rows BETWEEN UNBOUNDED PRECEDING AND current row) from emp2 ;

# 行开窗范围 ：前导全部行和后续全部行==所有数据
$hive>select id,name,age,sal,last_value(sal) over (order by id desc rows BETWEEN UNBOUNDED PRECEDING AND unbounded following) from emp2 ;

# 行开窗范围 : 上一行，当前行和下一行三者之间的最后一个值。
$hive>select id,name,age,sal,last_value(sal) over (order by age desc rows BETWEEN 1 PRECEDING AND 1 following) from emp2 ;

# 范围开窗范围 : 上一行，当前行和下一行三者之间的最后一个值。
$hive>select id,name,age,sal,last_value(sal) over (order by age desc range BETWEEN 1 PRECEDING AND 1 following) from emp2 ;

# max()
$hive>select id,name,dep,sal,MAX(sal) OVER (PARTITION BY dep ORDER BY sal rows BETWEEN 3 PRECEDING AND 3 following) as t2 from emp2;

# rank() ,有缝排名
$hive>select id,name,dep,sal,rank() OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

# dense_rank() ,有缝排名
$hive>select id,name,dep,sal,dense_rank() OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

# percent_rank() ,百分比排名
$hive>select id,name,dep,sal,percent_rank() OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

# Ntile()分桶函数 ,10条记录，3个桶 ,分配结果: 4,3,3
$hive>select id,name,dep,sal,ntile(4) OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

# lag()前导查询 lag(字段,前导的个数,默认值),前导个数不能是负数。
$hive>select id,name,dep,sal,lag(sal,2,'-1000') OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

# lead()后向查询 lag(字段,前导的个数,默认值)
$hive>select id,name,dep,sal,lead(sal,2,'-1000') OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

# row_number() 取行号。
$hive>select id,name,dep,sal,row_number() OVER (PARTITION BY dep ORDER BY sal desc) as t2 from emp2;

20、修改hive本地模式

hive的本地模式是指使用hadoop的本地模式来运行。

# 开启自动切换hive的本地模式
$hive>set hive.exec.mode.local.auto=true;
$hive>set hive.exec.mode.local.auto.inputbytes.max=134217728 ;
$hive>set hive.exec.mode.local.auto.input.files.max=4 ;

$hive>set mapreduce.input.fileinputformat.split.maxsize=50;
$hive>set mapreduce.input.fileinputformat.split.minsize=50
$hive>set dfs.blocksize ;

2一、索引

hive中索引本质上一种表，元数据库中查询的tbls表结果是index_table。

21.1 建立索引

建立索引后，索引表时空的，没有所数据。

$hive>CREATE INDEX idx_emp2_name ON TABLE emp2(name) AS 'COMPACT' WITH DEFERRED REBUILD;

21.2 查看元数据中的索引表

$mysql>select * from big12_hive.tbls ;

查询结果如图：

21.3 重建索引值

# 重建索引
$hive>ALTER INDEX idx_emp2_name ON emp2 REBUILD ;

# 查看索引表
$hive>select * from idx_emp2_name ;

2二、合并小文件

对太小文件不要单独启动一个mapper，默认状况下，每一个文件至少对应一个map。

# 每一个Map切片的最大值
$hive>set mapred.max.split.size=256000000;

# 每一个节点的切片的最小值
$hive>set mapred.min.split.size.per.node=1;

# 每一个机架的切片的最小值
$hive>set mapred.min.split.size.per.rack=1;

# hive的输入格式
$hive>set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

hadoop能够经过CombineTextinputFormat解决大量小文件问题：

// 避免产生大量的mapper
job.setInputFormatClass(CombineTextInputFormat.class);
// 设置路径过滤,哪些输入路径能够访问
CombineTextInputFormat.setInputPathFilter(job , MyPathFilter.class);
// 设置组合文件输入格式的切片的最大值，默认Long.MAX_VALUE。
CombineTextInputFormat.setMaxInputSplitSize(job , 700);

2三、hive的解释执行计划

explain指令能够翻译sql语句对应的mr执行过程。

$hive>explain select distinct dep,age from emp2

2四、hive表采样

# 
$hive>SELECT * FROM emp TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;

# 将id分红16个桶，取第3个桶的数据。
$hive>SELECT * FROM emp TABLESAMPLE(BUCKET 3 OUT OF 16 ON id) ;
# 从开始按比例采样
$hive>SELECT * FROM emp TABLESAMPLE(50 PERCENT) s ;
# 从开始位置采样100M
$hive>SELECT * FROM emp TABLESAMPLE(100M) s; 
# 从开始行采样 10行
$hive>SELECT * FROM emp TABLESAMPLE(10 ROWS);