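MapJoin is a Hive optimization suited to joining a small table with a large table. Because the JOIN is performed on the map side and in memory, it does not need to launch any Reduce tasks and therefore skips the shuffle phase, which saves resources and improves JOIN efficiency to some extent.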
SELECT /*+ MAPJOIN(smalltable) */ smalltable.key, value FROM smalltable JOIN bigtable ON smalltable.key = bigtable.key
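One way to check whether a join really runs on the map side is to inspect its plan with EXPLAIN. The sketch below reuses the smalltable and bigtable names from the query above. It assumes the selected key comes from smalltable. A map-side join normally appears in the plan as a Map Join Operator, though the exact plan text depends on the Hive version and execution engine.

```sql
-- Print the execution plan instead of running the query.
-- Assumption: the selected key is taken from smalltable.
EXPLAIN
SELECT /*+ MAPJOIN(smalltable) */ smalltable.key, value
FROM smalltable JOIN bigtable ON smalltable.key = bigtable.key;
```

If the plan shows a regular Join Operator inside a reduce stage instead, the hint was probably ignored or the conversion did not apply.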
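hive.auto.convert.join: whether Hive automatically converts a common JOIN into a MapJoin based on input file size
hive.mapjoin.smalltable.filesize: size threshold in bytes for the small table; a table below this size is a candidate for MapJoin conversion (default 25000000, about 25 MB)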
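A minimal sketch of automatic conversion, using the two parameters above. The table names come from the earlier query. It assumes that value is a column of bigtable, and 25000000 is simply the documented default written out.

```sql
-- Let Hive decide on its own whether the join can run as a MapJoin.
SET hive.auto.convert.join=true;
-- Size threshold in bytes for the small table (documented default).
SET hive.mapjoin.smalltable.filesize=25000000;

-- No MAPJOIN hint: the join is converted only if smalltable is under the threshold.
-- Assumption: value is a column of bigtable.
SELECT smalltable.key, bigtable.value
FROM smalltable JOIN bigtable ON smalltable.key = bigtable.key;
```

Raising the threshold lets larger tables qualify, at the cost of more memory per map task.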
hive.auto.convert.join=false (disables automatic MapJoin conversion)
hive.ignore.mapjoin.hint=false (MAPJOIN hints are not ignored)
SELECT /*+ MAPJOIN(smallTableTwo) */ idOne, idTwo, value
FROM (
  SELECT /*+ MAPJOIN(smallTableOne) */ idOne, idTwo, value
  FROM bigTable JOIN smallTableOne ON (bigTable.idOne = smallTableOne.idOne)
) firstjoin
JOIN smallTableTwo ON (firstjoin.idTwo = smallTableTwo.idTwo)
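hive.auto.convert.join.noconditionaltask: whether Hive, when converting common JOINs into MapJoins based on input file size, also merges multiple MapJoins (MJs) into one
hive.auto.convert.join.noconditionaltask.size: when multiple MJs are merged into one, the total size of the small tables must be below this value; it takes effect only if hive.auto.convert.join.noconditionaltask is true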
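With these two parameters enabled, the nested hinted query can in principle be written as a plain multi-way join and left to Hive to merge. A sketch under stated assumptions: idOne and idTwo are taken from bigTable and value from smallTableTwo (the original query does not say which table owns each column), and 10000000 is the documented default size.

```sql
SET hive.auto.convert.join=true;
-- Allow direct conversion and merging of several MapJoins into one.
SET hive.auto.convert.join.noconditionaltask=true;
-- Upper bound in bytes on the combined size of the small tables (documented default).
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- No hints and no subquery.
-- Assumption: idOne and idTwo belong to bigTable, value to smallTableTwo.
SELECT bigTable.idOne, bigTable.idTwo, smallTableTwo.value
FROM bigTable
JOIN smallTableOne ON (bigTable.idOne = smallTableOne.idOne)
JOIN smallTableTwo ON (bigTable.idTwo = smallTableTwo.idTwo);
```

If smallTableOne and smallTableTwo together exceed the configured size, the joins are not merged into a single MapJoin.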