hive 优化实践

时间 2020-01-21

标签 hive 优化实践栏目 Hadoop 繁體版

原文原文链接

背景

在工做中须要同步pg数据库下的某张表到hive，使用的工具是开源的sqoop，业务表的数据表包含最近一年的数据，数据表的行数为366,830,898，数据表的字段个数为71个，数据表在pg中的空间大小为110G；pg中表没有惟一主键，同一个id的数据可能会出现屡次，且都是业务容许的正常场景。node

分析

全量同步数据
每次将pg的整表全量同步到hive分区表中，这种作法同步的速度很慢，这张表后面会有依赖，会影响后续数据的产出，且会有数据表block的风险，因此不适合用全量的方式同步
增量同步数据
查看天天有改动的数据的记录数，最多为200万，数据的体积约为700M。尝试单独用sqoop同步一天的数据速度很快，sqoop同步的时候要指定map划分的split字段，因此在pg中先在查询和分割字段上加上索引。增量的数据要和前一天的全量分区作合并，由于同一个id无论在增量的表中仍是全量的表中都会出现多行记录，因此并不能使用A left join B where b.id is not null的方式去处理，也不能left semi join 去处理。

尝试

not in 方式
查询在hive原始表中同时不在增量表中的id对应的数据，接着和增量的最新数据作union all。

hive -e "
insert overwrite table target_table partition(dt = '$curr_date')
select 
    a.*
from a
where 
    a.dt = '$curr_date - 1'
    and a.id not in (select id from b where b.dt = '$curr_date')
union all
select 
    *
from b 
where b.dt = '$curr_date'
"

这种方式理论上是能够实现的，实际执行中发现任务最后会生成不少的小文件。
1. 尝试手动设定reduce的数量 set mapred.reduce.tasks = 64，实际执行中并未起做用。
2. 尝试在map阶段先进行文件合并，例以下面的设置，map的数量确实减小了，可是reducer数量仍是同样没变。同理设定reduce完成之后的文件合并，同样不起做用。算法

set mapred.min.split.size=100000000;
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=50000000;
set mapred.min.split.size.per.rack=50000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

exists 方式
将not in 改写成not exists方式以下，执行的时候会启动64个reducer去执行reduce，速度相对上面的流程会快，可是这时候出现的了轻微数据倾斜的问题，部分节点执行的较慢，影响了整个任务的执行效率，且生成的文件大小不一致，大小差别很大。

hive -e "
set mapred.reduce.tasks = 64;
insert overwrite table traget_table partition (dt = '$curr_date')
 
select
a.*
from a
where
   a.dt = '$curr_date - 1'
   and not exists (select 1 from b where b.dt = '$curr_date' and a.id = b.id)
union all
select
   *
from b where b.dt =  '$curr_date'
"

更新sql以下，加上 DISTRIBUTE by rand ()，distribute by是控制在map端如何拆分数据给reduce端的，hive会根据distribute by后面列，对应reduce的个数进行分发，默认是采用hash算法。rand()方法会生成一个[0,1]之间的随机数，经过随机数进行数据的划分，由于每次都随机的，因此每一个reducer上的数据会很均匀。以下的这种设置，reducer会有64个，且每一个reducer上的数据量几乎同样。sql

hive -e "
set mapred.reduce.tasks = 64;
insert overwrite table traget_table partition (dt = '$curr_date')
 
select
a.*
from a
where
   a.dt = '$curr_date - 1'
   and not exists (select 1 from b where b.dt = '$curr_date' and a.id = b.id)
union all
select
   *
from b where b.dt =  '$curr_date'
DISTRIBUTE by rand ();
"

效果以下：
数据库