Optimizing DISTINCT and GROUP BY in Hive

Date: 2021-07-10

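1. Avoid count(distinct), which easily causes performance problems: select count(distinct user_id) from a; Because the values must be deduplicated globally, Hive sends all of the map-stage output to a single reduce task, and that one task becomes the bottleneck on large tables. You can optimize this by running group by first and then count. Optimized: select count(*) from( select user_id from a group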
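The rewrite described above can be checked for equivalence on a small data set. The following is a minimal sketch, not the author's exact query: it assumes the table `a` and column `user_id` from the text, uses made-up sample rows, and runs the SQL in SQLite through Python's `sqlite3` module only as a convenient stand-in. It shows that both forms return the same number; it says nothing about how Hive plans or distributes the work. The subquery alias `t` is an assumption added because Hive has traditionally required a name for a subquery in the FROM clause.

```python
import sqlite3

# Hypothetical sample data: table `a` with a `user_id` column, as in the text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (user_id INTEGER)")
conn.executemany(
    "INSERT INTO a (user_id) VALUES (?)",
    [(1,), (2,), (2,), (3,), (3,), (3,)],
)

# Original form: a single global deduplication inside the aggregate.
count_distinct_sql = "SELECT COUNT(DISTINCT user_id) FROM a"

# Rewritten form: deduplicate with GROUP BY in a subquery, then count the rows.
# The alias `t` is an assumption (not shown in the truncated source query).
group_then_count_sql = (
    "SELECT COUNT(*) FROM (SELECT user_id FROM a GROUP BY user_id) t"
)

count_distinct = conn.execute(count_distinct_sql).fetchone()[0]
group_then_count = conn.execute(group_then_count_sql).fetchone()[0]

print(count_distinct, group_then_count)
```

One caveat on equivalence: count(distinct user_id) ignores NULL values, while group by produces a group for NULL, so the rewritten query counts one extra row if user_id can be NULL. Adding `where user_id is not null` to the inner query keeps the two results identical in that case.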