Summary of Hive Bucketed Tables

Date: 2019-11-21

This article is mainly reposted from: http://www.cnblogs.com/skyl/p/4737847.html

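A Hive table can be divided into partitions and buckets (BUCKET). Hive can further organize a table or a partition into buckets, so a bucket is a finer-grained division of the data range. Bucketing hashes the specified column, then takes the remainder of the hash value divided by the number of buckets to decide which bucket each record is stored in. Bucketing is declared with the CLUSTERED BY clause, and the data inside each bucket can be sorted with SORTED BY.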
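
The hash-then-remainder rule above can be sketched in a few lines of Python. This is only an illustration, not Hive's actual code: it assumes that an INT column hashes to its own value and that the hash is masked to a non-negative number before the remainder is taken. That matches the output shown later in this article, but other Hive versions may hash differently. The function names are hypothetical.

```python
def bucket_for(hash_value: int, num_buckets: int) -> int:
    """Return the 0-based bucket index for a row's hash value."""
    # Assumption: mask to a non-negative 31-bit value first, so that
    # negative hashes still land in the range [0, num_buckets).
    return (hash_value & 0x7FFFFFFF) % num_buckets


def assign_buckets(ids, num_buckets):
    """Group integer ids by bucket, assuming an INT hashes to itself."""
    buckets = {b: [] for b in range(num_buckets)}
    for row_id in ids:
        buckets[bucket_for(row_id, num_buckets)].append(row_id)
    return buckets


if __name__ == "__main__":
    # Seven ids spread over two buckets: even ids go to bucket 0,
    # odd ids go to bucket 1.
    print(assign_buckets(range(1, 8), 2))
```

Under these assumptions, ids 1 to 7 with two buckets split into even ids and odd ids.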

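Advantage 1: more efficient query processing. Buckets give the table extra structure, and Hive can exploit that structure for certain queries. In particular, a join of two tables that are bucketed on the same column can be implemented efficiently as a map-side join.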
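
To see why matching buckets help a join, here is a rough Python sketch of a bucket-by-bucket join. It is a simplified model under stated assumptions, not Hive's implementation: rows are tuples whose first field is an integer join key, the bucket is the key modulo the bucket count, and the names `bucketize` and `bucket_join` are made up for this illustration.

```python
from collections import defaultdict


def bucketize(rows, num_buckets):
    """Split rows into buckets by (first field mod num_buckets)."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[row[0] % num_buckets].append(row)
    return buckets


def bucket_join(left_rows, right_rows, num_buckets):
    """Inner-join two row lists on their first field, one bucket pair at a time."""
    left = bucketize(left_rows, num_buckets)
    right = bucketize(right_rows, num_buckets)
    joined = []
    for b in range(num_buckets):
        # Only the right-side rows of the SAME bucket go into the in-memory
        # lookup table; rows in other buckets can never match this bucket.
        lookup = defaultdict(list)
        for right_row in right[b]:
            lookup[right_row[0]].append(right_row)
        for left_row in left[b]:
            for right_row in lookup[left_row[0]]:
                joined.append(left_row + right_row[1:])
    return joined


if __name__ == "__main__":
    students = [(1, "zxm"), (2, "ljz"), (3, "cds")]
    scores = [(1, 90), (3, 75), (4, 60)]  # hypothetical data for illustration
    print(bucket_join(students, scores, 2))
```

Because each bucket of one table only has to be compared with the same-numbered bucket of the other, the lookup table held in memory at any time covers one bucket rather than the whole table.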

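Advantage 2: more efficient sampling. Sampling over the full data set is naturally inefficient, because it still has to read all the data. If a table is already bucketed on some column, a sample can instead read only the bucket with a given number, which reduces the amount of data accessed.

Disadvantage: queries that filter on ordinary business fields (rather than the bucketed column) gain little from bucketing.

Note in particular that the CLUSTERED BY and SORTED BY clauses do not affect how data is loaded. This means users are responsible for loading the data correctly themselves, including bucketing and sorting it. Running 'set hive.enforce.bucketing=true' makes Hive automatically set the number of reducers in the final stage to match the number of buckets. Users can also set mapred.reduce.tasks themselves to match the number of buckets, but 'set hive.enforce.bucketing=true' is the recommended approach.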

An example follows.

1) Create the temporary table student_tmp and load data into it.

hive> desc student_tmp;
hive> select * from student_tmp;

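2) Create the bucketed table.

Use the CLUSTERED BY clause to specify the column(s) to bucket on and the number of buckets. The data in each bucket can additionally be sorted on one or more columns with SORTED BY (the bucket files shown in step 5 appear in descending order of age, although Hive's documented default for SORTED BY is ascending). Because sorting turns the join of each pair of buckets into an efficient merge sort, it can further improve the efficiency of map-side joins.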

hive> create table student0
      (id INT, 
       age INT, 
       name STRING
       )
     partitioned by(stat_date STRING)
     row format delimited 
     fields terminated by ','; 
OK
Time taken: 0.292 seconds

hive> create table student1
      ( id INT, 
        age INT, 
        name STRING
       ) 
      partitioned by(stat_date STRING) 
      clustered by(id) sorted by(age) into 2 buckets 
      row format delimited 
      fields terminated by ',';
OK
Time taken: 0.215 seconds

3) Set the environment variable so that Hive automatically chooses the number of reducers to match the number of buckets.

hive> set hive.enforce.bucketing=true;

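4) Load the data.

Loading the bucketed table student1 with FROM ... INSERT ... SELECT goes through MapReduce, whereas loading the ordinary table student0 with LOAD does not start a MapReduce job. In fact, the data files of a bucketed table correspond to the reducer output files: counting buckets from 0, bucket n corresponds to output file 00000n_0.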

[root@hadoop01 hive]# more bucket.txt
1,20,zxm
2,21,ljz
3,19,cds
4,18,mac
5,22,android
6,23,symbian
7,25,wp

hive> LOAD data local INPATH '/root/hive/bucket.txt' 
    > OVERWRITE INTO TABLE student0                  
    > partition(stat_date="20120802");

hive> from student0                                                   
    > insert overwrite table student1 partition(stat_date="20120802") 
    > select id,age,name where stat_date="20120802"                   
    > sort by age;

5) Inspect the file directory.

hive> dfs -ls /user/hive/warehouse/student1/stat_date=20120802;
Found 2 items
-rw-r--r--   1 root supergroup         31 2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000000_0
-rw-r--r--   1 root supergroup         39 2015-08-17 21:23 /user/hive/warehouse/student1/stat_date=20120802/000001_0

hive> dfs -text /user/hive/warehouse/student1/stat_date=20120802/000000_0;
6,23,symbian
2,21,ljz
4,18,mac

hive> dfs -text /user/hive/warehouse/student1/stat_date=20120802/000001_0;
7,25,wp
5,22,android
1,20,zxm
3,19,cds

6) Look at sampled data.

hive> select * from student1                     
    > TableSample(bucket 1 out of 2 on id); 
OK
6       23      symbian 20120802
2       21      ljz     20120802
4       18      mac     20120802
Time taken: 10.871 seconds, Fetched: 3 row(s)

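Note: TABLESAMPLE is the sampling clause. Syntax: TABLESAMPLE(BUCKET x OUT OF y [ON colname])

y must be a multiple or a factor of the table's bucket count. Hive determines the sampling ratio from y. For example, with 64 buckets:

When y=32, (64/32=) 2 buckets of data are sampled.
When y=64, (64/64=) 1 bucket of data is sampled (the example above also samples exactly 1 bucket).
When y=128, (64/128=) 1/2 of a bucket is sampled.

x is the bucket to start sampling from. For example, with 64 buckets, TABLESAMPLE(BUCKET 3 OUT OF 32) means:

A total of (64/32=) 2 buckets are sampled: bucket 3 and bucket (3+32=) 35.
In the example above, a total of (2/2=) 1 bucket is sampled, and it is the first bucket.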
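
The arithmetic above can be checked with a small Python sketch. It only encodes the rule as this article states it (start at bucket x, then step by y) and is not taken from Hive's source. The function names are hypothetical, and the bucket list is computed only when y is a factor of the bucket count.

```python
from fractions import Fraction


def sample_size_in_buckets(y: int, num_buckets: int) -> Fraction:
    """How many buckets' worth of data is sampled: num_buckets / y."""
    return Fraction(num_buckets, y)


def sampled_buckets(x: int, y: int, num_buckets: int):
    """Return the 1-based bucket numbers read by BUCKET x OUT OF y.

    Only handles the case where y is a factor of num_buckets, so that
    whole buckets are read.
    """
    if not 1 <= x <= y:
        raise ValueError("x must be between 1 and y")
    if num_buckets % y != 0:
        raise ValueError("this sketch needs y to be a factor of num_buckets")
    # Start at bucket x and take every y-th bucket after it.
    return list(range(x, num_buckets + 1, y))


if __name__ == "__main__":
    print(sampled_buckets(3, 32, 64))       # the 64-bucket example
    print(sampled_buckets(1, 2, 2))         # the student1 example
    print(sample_size_in_buckets(128, 64))  # half a bucket
```

When y is a multiple of the bucket count (for example y=128 with 64 buckets), only a fraction of one bucket is read, so the sketch reports the fraction instead of a list of buckets.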