因为 hive 与传统关系型数据库面对的业务场景及底层技术架构都有着很大差别,所以,传统数据库领域的一些技能放到 Hive 中可能已再也不适用。关于 hive 的优化与原理、应用的文章,前面也陆陆续续的介绍了一些,但大多都偏向理论层面,本文就介绍一个实例,从实例中一步步加深对 hive 调优的认识与意识。算法
需求我作了简化,很简单,两张表作个 join,求指定城市,天天的 pv,用传统的 RDBMS SQL 写出来就这样的:sql
SELECT t.statdate, c.cname, count(t.cookieid) FROM tmpdb.city c JOIN ecdata.ext_trackflow t ON (t.area1= c.cname OR t.area2 =c.cname OR t.area3 = c.cname) WHERE t.statdate>='20140818' and t.statdate<='20140824' AND platform='pc' GROUP BY t.statdate, c.cname;
怎么样?根据 SQL 看懂需求没问题吧?数据库
而后把这条 SQL 贴到 hive 中去执行,而后你会发现报错了:ruby
FAILED: SemanticException [Error 10019]: Line 5:32 OR not supported in JOIN currently 'cname'
这是由于 hive 受限于 MapReduce 算法模型,只支持 equi-joins(等值 join),要实现上述的非等值 join,你能够采用笛卡儿积( full Cartesian product )来实现:cookie
SELECT t.statdate, c.cname, count(t.cookieid) FROM tmpdb.city c JOIN ecdata.ext_trackflow t WHERE t.statdate>='20140818' AND t.statdate<='20140824' AND platform='pc' AND (t.area1= c.cname OR t.area2 =c.cname OR t.area3 = c.cname) GROUP BY t.statdate, c.cname;
而后再拿着这条语句执行下。架构
既然不容许非等值 join,那咱们换一下思路,多个子查询 union all,而后汇总:优化
SELECT dt, name, count(cid) FROM (SELECT t.statdate dt, c.cname name, t.cookieid cid FROM tmpdb.city c JOIN ecdata.ext_trackflow t ON t.area1 =c.cname WHERE t.statdate>='20140818' AND t.statdate<='20140824' AND platform='pc' UNION ALL SELECT t.statdate dt, c.cname name, t.cookieid cid FROM tmpdb.city c JOIN ecdata.ext_trackflow t ON t.area2 =c.cname WHERE t.statdate>='20140818' AND t.statdate<='20140824' AND platform='pc' UNION ALL SELECT t.statdate dt, c.cname name, t.cookieid cid FROM tmpdb.city c JOIN ecdata.ext_trackflow t ON t.area3 =c.cname WHERE t.statdate>='20140818' AND t.statdate<='20140824' AND platform='pc') tmp_trackflow GROUP BY dt, name;
形成这个的缘由是子查询出来有字段相同,能够改字段方式解决ui
select *
from
(select si.name as siname,gsi.name as gsiname from student_infos si join good_student_infos gsi on si.name = gsi.name ) st;spa
若是不出现重复字段不会报错orm
联合查询代替hive不能join非相等查询问题