Hive Table Operations

Date: 2019-11-13

Tags: hive, database; Category: Hadoop


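Creating Hive Tables in Detail

1. Hive has four kinds of tables: managed tables, external tables, partitioned tables, and bucketed tables

2. Creating a temporary table (keyword: temporary)

2.1 Syntax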

create temporary table student(
ip string comment 'student ip',
name string
);

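2.2 Hive has supported temporary tables since version 0.14.0

A temporary table is visible only to the current session and is dropped automatically when the session ends.

2.3 Notes

a. If a temporary table is created with the same name as an existing table, every reference to that name in the current session resolves to the temporary table; the original table becomes reachable again only after the temporary table is dropped or renamed;

b. Limitations: temporary tables do not support partition columns or index creation.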
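
The name-shadowing rule in point a can be sketched as follows. This is a minimal illustration, assuming a permanent table named student already exists in the current database; the column lists are only examples.

```sql
-- Assumption: a permanent table named student already exists.
-- Creating a temporary table with the same name shadows it for this session.
create temporary table student (
  ip string,
  name string
);

-- Resolves to the temporary table, not the permanent one.
select * from student;

-- Drops only the temporary table.
drop table student;

-- Resolves to the permanent table again.
select * from student;
```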

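2.4 Configuring the storage type of temporary tables

Since Hive 1.1, temporary table data can be kept in memory or on SSD. This is controlled by the hive.exec.temporary.table.storage parameter, which takes one of three values: memory, ssd, or default.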
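
As a minimal sketch, the parameter can be set for the current session before the temporary table is created. The table name student_tmp is hypothetical, and whether memory or ssd actually takes effect depends on how the underlying HDFS cluster is configured.

```sql
-- Ask Hive to place temporary table data in memory for this session.
-- Accepted values: memory, ssd, default.
set hive.exec.temporary.table.storage=memory;

-- student_tmp is a hypothetical name used only for this example.
create temporary table student_tmp (
  ip string comment 'student ip',
  name string
);
```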

3. External tables and managed tables (keyword: external)

3.1 Syntax

create external table student_ext(
id int,
name string
)
row format delimited fields terminated by '\t';

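3.2 Differences and similarities between managed and external tables

a. Creating an external table requires the additional keyword external;

b. Dropping a managed table deletes both the table's metadata and its data files on HDFS; dropping an external table deletes only the metadata, and the data files on HDFS are left in place;

c. This behavior of external tables keeps the data safe and guards against accidental deletion;

d. Apart from the extra keyword at creation time, all other statements work exactly as they do for managed tables.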
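
The difference in drop behavior can be checked with a short experiment. This is a sketch only: the table name student_ext_demo and the path /tmp/student_ext_demo are assumptions made for the example, and the dfs command is the one available inside the Hive CLI.

```sql
-- Hypothetical external table whose files live in an explicit HDFS directory.
create external table student_ext_demo (
  id int,
  name string
)
row format delimited fields terminated by '\t'
location '/tmp/student_ext_demo';

-- The "Table Type" field in the output reads EXTERNAL_TABLE.
describe formatted student_ext_demo;

-- Removes only the metadata; the directory and its files stay on HDFS.
drop table student_ext_demo;

-- In the Hive CLI, list the directory to confirm it still exists.
dfs -ls /tmp/student_ext_demo;
```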

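4. Partitioned table operations (keyword: partitioned by)

4.1 Creating a partitioned table (note the order: partitioned by comes before row format)

Partitioned by day + hour (partitioned tables are usually combined with external tables)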

-- on Hive 1.2.0 and later, date is a reserved keyword and has to be written as `date`
create external table student_parthour(
id int,
name string
)
partitioned by (date string,hour string)
row format delimited fields terminated by '\t';

4.2 Loading data

load data local inpath '/opt/modules/mydata/student.txt'
into table student_parthour partition(date='20161101',hour='18');

4.3 Querying the data in a partition

select * from student_parthour where date='20161101' and hour='18';

4.4 Listing the table's partitions: show partitions student_parthour;

4.5 Dropping a partition: alter table student_parthour drop partition(date='20161030');

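4.6 Use cases for partitioned tables

a. The data source is provided by a third party, or consists of log files;

b. Partitioned tables are usually built as external table + partitioned table;

c. Mainly used by scheduled jobs that load data;

d. Mainly used for year-over-year or period-over-period analysis.

Note: external tables are used because dropping an external table keeps its data, so an accidental drop cannot delete everything. Queries should filter on the partition columns whenever possible; without a partition filter, the whole table is scanned.

4.7 Advantages of partitioned tables: data can be extracted incrementally, period by period, by scheduled jobs; the main benefit is faster queries.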
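
The external table + partitioned table pattern and the effect of partition filters can be sketched as follows. The table access_log, its columns, and the /data/access_log paths are all hypothetical; the partition columns are named dt and hr here only to stay clear of reserved words.

```sql
-- Hypothetical external, partitioned log table (names and paths are assumptions).
create external table access_log (
  ip string,
  url string
)
partitioned by (dt string, hr string)
row format delimited fields terminated by '\t'
location '/data/access_log';

-- A scheduled job can register each new hour's directory as a partition.
alter table access_log add if not exists
  partition (dt='20161101', hr='18')
  location '/data/access_log/20161101/18';

-- Filtering on the partition columns reads only the matching directory.
select count(*) from access_log where dt='20161101' and hr='18';

-- Without a partition filter, every partition is scanned.
select count(*) from access_log;
```

Because the table is external, dropping it or one of its partitions removes only the metadata, so the hourly directories written by the upstream job remain intact.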

5. Creating a bucketed table (keywords: clustered by, sorted by)

5.1 Syntax

create table student_new(
id int,
name string
)
clustered by (id) sorted by(name) into 4 buckets
row format delimited fields terminated by '\t';

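5.2 Using bucketed tables

a. Hive hashes the column value and takes the remainder after dividing by the number of buckets to decide which bucket a record goes into;

b. Bucketing brings several benefits, for example for JOINs. When two tables are joined on a common column and both are bucketed on that column, only the buckets holding the same column values need to be joined with each other, which greatly reduces the amount of data each join has to process;

c. A Hive table can be divided into partitions, and clustered by further divides a table or a partition into buckets;

d. sorted by sorts the data inside each bucket, which speeds up some queries, such as map-side joins;

e. clustered by and sorted by do not affect how data is loaded: the user is responsible for loading the data correctly, including bucketing and sorting it. Running 'set hive.enforce.bucketing = true' lets Hive set the number of reducers in the final stage automatically so that it matches the number of buckets; alternatively, the user can set mapred.reduce.tasks to match the bucket count.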
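
Points a and e can be sketched together as follows. This is an illustration under assumptions: student_raw and student_bucketed are hypothetical tables, and the enforce settings apply to Hive versions before 2.0, where they still exist. The last query uses the built-in hash and pmod functions to preview the hash-modulo idea; the bucketing function Hive actually uses depends on the version, so the preview may not match the real bucket files.

```sql
-- Hypothetical staging table holding the raw rows (an assumption for this sketch).
create table student_raw (
  id int,
  name string
)
row format delimited fields terminated by '\t';

-- Hypothetical bucketed, sorted target with the same layout as the table in 5.1.
create table student_bucketed (
  id int,
  name string
)
clustered by (id) sorted by (name) into 4 buckets
row format delimited fields terminated by '\t';

-- Before Hive 2.0 these make inserts honor the bucket and sort definitions.
-- From Hive 2.0 on, bucketing and sorting are always enforced.
set hive.enforce.bucketing = true;
set hive.enforce.sorting = true;

-- Populate through a query so that Hive distributes the rows into buckets.
-- A plain "load data" only moves files and does not bucket the rows.
insert overwrite table student_bucketed
select id, name from student_raw;

-- Rough preview of "hash the column, take the remainder by the bucket count".
select id, pmod(hash(id), 4) as approx_bucket from student_raw;
```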

5.3 Main uses of buckets

a. Data sampling

b. Speeding up some queries, such as map-side joins
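
For the sampling use case, Hive's tablesample clause can read a single bucket instead of the whole table. A minimal sketch, assuming the student_new table from 5.1 (4 buckets on id) has been populated with bucketing enforced:

```sql
-- Read only the first of the four buckets, roughly a quarter of the rows.
select * from student_new tablesample(bucket 1 out of 4 on id);
```

Sampling on the same column and bucket count that the table was clustered by lets Hive pick the matching bucket file rather than scan every row.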

6. Creating a table with the same structure as an existing table (keyword: like)

A table created with like has only the table structure, with no data

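-- like copies the column definitions and row format from student, so they are not repeated here
create table student_new
like student
location '/user/hive/warehouse/student_new'; -- optional; give the new table its own directory, not the directory of student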
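
A quick way to confirm that like copies only the structure is sketched below. The name student_copy is hypothetical, and student is assumed to exist already.

```sql
-- student_copy is a hypothetical name; student is assumed to exist.
create table student_copy like student;

-- The column definitions and storage format match those of student.
describe formatted student_copy;

-- No rows are copied, so the new table starts out empty.
select count(*) from student_copy;
```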

7. Row format delimiter (keyword: row format delimited fields terminated by)

create table student_new(
id int,
name string
)
row format delimited fields terminated by '\t';

Common delimiters: \t (tab), comma, space
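
Switching the delimiter only changes the terminated by clause. As a sketch, a table for comma-separated text files could look like this; the name student_csv is hypothetical.

```sql
-- Hypothetical table for comma-separated text files.
create table student_csv (
  id int,
  name string
)
row format delimited
  fields terminated by ','
  lines terminated by '\n'
stored as textfile;
```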

8. Specifying the table location at creation (keyword: location)

create table student2(
id int,
name string
)
row format delimited fields terminated by '\t'
location '/app/mydata/'; -- this is a directory on HDFS

Ways to create a table

1 Creating a table directly

create table db_1031.emp(empno int,ename string,job string,mgr int,hiredate string,

sal double,comm double,deptno int)row format delimited fields terminated by '\t';

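2 Creating a new table from selected columns of another table (the new table contains the data)

create table db_0831.emp_as as select * from emp ;

-- alternative that keeps only some columns; run it instead of the statement above, since both create the same table name
create table db_0831.emp_as as select ename from emp ;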

3 Copying the table structure (without data)

create table db_0831.emp_like like emp ;