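I wrote a Hive SQL statement that ran extremely slowly: after more than an hour the job had only finished the map phase, and the reduce phase had reached just 20%.
The Hive statement is as follows: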
select count(distinct ip)
from (
    select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
    union all
    select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
    union all
    select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
) d
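Analysis: each of the three subqueries returns about 1 billion rows:
select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10" (about 1 billion rows)
select pub_ip as ip from f_app_boot_daily where year="2013" and month="10" (about 1 billion rows)
select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1 (about 1 billion rows)
The total is roughly 3 billion rows. With that much data, a count(distinct ip) with no group by shuffles all of the data to a single reducer, which causes severe data skew on that reducer.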
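Solution:
First, use group by to group the rows by ip. The group by can deduplicate the IPs across many reducers in parallel, and the outer count(*) then only has to count the rows that are already deduplicated. The rewritten SQL statement is as follows: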
select count(*)
from (
    select ip
    from (
        select ip as ip from comprehensive.f_client_boot_daily where year="2013" and month="10"
        union all
        select pub_ip as ip from f_app_boot_daily where year="2013" and month="10"
        union all
        select ip as ip from format_log.format_pv1 where year="2013" and month="10" and url_first_id=1
    ) d
    group by ip
) b
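To check that the rewrite returns the same number as the original, the two forms can be compared on a tiny inline data set. This is only a sketch: the three IP values are made up for illustration, and it assumes a Hive version that allows select without a from clause (0.13 or later).

```sql
-- Original form: count(distinct) over the unioned rows.
-- The toy data has 3 rows but only 2 different IPs, so this returns 2.
select count(distinct ip)
from (
    select '10.0.0.1' as ip
    union all
    select '10.0.0.1' as ip
    union all
    select '10.0.0.2' as ip
) d;

-- Rewritten form: deduplicate with group by, then count the groups.
-- The inner query yields one row per IP, so this also returns 2.
select count(*)
from (
    select ip
    from (
        select '10.0.0.1' as ip
        union all
        select '10.0.0.1' as ip
        union all
        select '10.0.0.2' as ip
    ) d
    group by ip
) b;
```

Both statements give the same answer; the difference is only in how the work is distributed when the input is large.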
Then, set a reasonable number of reducers so that the data is spread across multiple machines: set mapred.reduce.tasks=50;
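The reducer setting has to be issued in the same session before the query runs. A sketch of the options, where 50 is the value from this post and the other numbers are illustrative assumptions rather than recommendations:

```sql
-- Fixed reducer count, as used in this post.
set mapred.reduce.tasks=50;

-- On Hadoop 2 and later the same setting is named mapreduce.job.reduces;
-- the old name is deprecated but still mapped to it.
-- set mapreduce.job.reduces=50;

-- Alternative: leave the count at -1 and let Hive estimate it from input size.
-- The byte and max values below are example numbers only.
-- set mapred.reduce.tasks=-1;
-- set hive.exec.reducers.bytes.per.reducer=256000000;
-- set hive.exec.reducers.max=200;
```

A fixed count is simple and predictable for a one-off query; the size-based estimate is usually easier to live with when the input volume changes from month to month.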
After this optimization the speedup was very noticeable: the whole job now finishes in a little over 20 minutes.