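Presto is an open-source distributed SQL query engine designed for interactive analytic queries over data ranging from gigabytes to petabytes. Its query language is ANSI-like SQL. I have used Presto for ad hoc queries in several projects, and this article summarizes some of the optimizations I found useful.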
INSERT INTO table nation_orc partition(p) SELECT * FROM nation SORT BY n_name;
If queries then need to filter on the n_name column, performance improves:
SELECT count(*) FROM nation_orc WHERE n_name='AUSTRALIA';
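One way to see why sorted data helps: ORC stores min/max statistics for each stripe and row group, and a reader can skip any chunk whose range cannot contain the filtered value. The Python sketch below is a toy model of that idea, not Presto's or ORC's actual code; the chunk size, the sample names other than AUSTRALIA, and the helper functions are assumptions made for illustration.

```python
# Toy model of min/max chunk skipping; names and chunk size are illustrative.

def build_chunks(values, chunk_size):
    """Split values into chunks and record each chunk's min and max."""
    chunks = []
    for i in range(0, len(values), chunk_size):
        rows = values[i:i + chunk_size]
        chunks.append({"min": min(rows), "max": max(rows), "rows": rows})
    return chunks


def count_equal(chunks, target):
    """Count rows equal to target; also report how many chunks were read."""
    matches = 0
    chunks_read = 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # the statistics prove the value is not in this chunk
        chunks_read += 1
        matches += sum(1 for v in chunk["rows"] if v == target)
    return matches, chunks_read


names = ["AUSTRALIA", "BRAZIL", "CANADA", "DENMARK", "EGYPT", "FRANCE"]
# Interleaved order: most chunks span a wide range of names.
unsorted_rows = names * 4
# Sorted order: each chunk covers a narrow range of names.
sorted_rows = sorted(unsorted_rows)

unsorted_result = count_equal(build_chunks(unsorted_rows, 4), "AUSTRALIA")
sorted_result = count_equal(build_chunks(sorted_rows, 4), "AUSTRALIA")
print("unsorted (matches, chunks read):", unsorted_result)
print("sorted   (matches, chunks read):", sorted_result)
```

Both layouts return the same count, but the sorted layout reads fewer chunks because most chunk ranges exclude the target value.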
1. Select only the columns you need
[GOOD]: SELECT time,user,host FROM tbl
[BAD]: SELECT * FROM tbl
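2. Always filter on the partition column
In this example acct_day is the partition column and visit_time is an ordinary column.
[GOOD]: SELECT time,user,host FROM tbl where acct_day=20171101
[BAD]: SELECT * FROM tbl where visit_time=20171101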
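3. Order GROUP BY columns by cardinality
List the GROUP BY columns from the most distinct values to the fewest; here uid has far more distinct values than gender.
[GOOD]: SELECT ... FROM tbl GROUP BY uid, gender
[BAD]: SELECT ... FROM tbl GROUP BY gender, uid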
4. Use LIMIT with ORDER BY
[GOOD]: SELECT * FROM tbl ORDER BY time LIMIT 100
[BAD]: SELECT * FROM tbl ORDER BY time
Sorting on fewer columns also speeds up the computation.
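The benefit of adding LIMIT can be illustrated outside Presto: with a limit, an engine only needs to keep the current top N rows instead of sorting and holding everything. Below is a minimal Python sketch of that idea, assuming made-up (time, user) rows; it shows the general top-N technique and is not Presto's implementation.

```python
import heapq

# Hypothetical (time, user) rows, deliberately out of order.
rows = [(1005, "e"), (1001, "a"), (1004, "d"), (1002, "b"), (1003, "c")]

# ORDER BY time: every row has to be sorted and kept.
full_sort = sorted(rows, key=lambda r: r[0])

# ORDER BY time LIMIT 2: only the 2 smallest rows need to be retained.
top_2 = heapq.nsmallest(2, rows, key=lambda r: r[0])

print(top_2)
```

The top-N result equals the first N rows of the full sort, but the memory needed grows with N rather than with the table size.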
5. Use approximate aggregate functions
SELECT approx_distinct(user_id) FROM access
If an exact distinct count is required, use COUNT with GROUP BY instead.
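The Count+Group rewrite is not spelled out here; one common form groups by the column in a subquery and counts the resulting rows. The sketch below checks that form against COUNT(DISTINCT ...) using Python's built-in sqlite3 as a stand-in engine, with a made-up access table. It only demonstrates that the two queries agree, not how Presto executes them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE access (user_id INTEGER)")
conn.executemany("INSERT INTO access VALUES (?)",
                 [(1,), (2,), (2,), (3,), (3,), (3,)])

# Exact distinct count written with DISTINCT.
count_distinct = conn.execute(
    "SELECT count(DISTINCT user_id) FROM access"
).fetchone()[0]

# The same count written as GROUP BY in a subquery plus an outer count.
count_group = conn.execute(
    "SELECT count(*) FROM (SELECT user_id FROM access GROUP BY user_id) t"
).fetchone()[0]

print(count_distinct, count_group)
```

One difference to keep in mind: count(DISTINCT ...) ignores NULLs, while the GROUP BY form produces a group for NULL, so filter NULLs out first if that matters.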
6. Use regexp_like instead of multiple LIKE clauses
[GOOD] SELECT ... FROM access WHERE regexp_like(method, 'GET|POST|PUT|DELETE')
[BAD] SELECT ... FROM access WHERE method LIKE '%GET%' OR method LIKE '%POST%' OR method LIKE '%PUT%' OR method LIKE '%DELETE%'
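The two forms select the same rows, because regexp_like tests whether the pattern occurs anywhere in the string, just as LIKE '%...%' does. A small Python sketch of that equivalence, using re.search as an analogue of regexp_like and a made-up list of method values:

```python
import re

# Hypothetical sample values for the method column.
methods = ["GET", "POST", "PUT", "DELETE", "HEAD", "OPTIONS", "xGETx"]


def matches_like(method):
    """Analogue of the four LIKE '%...%' conditions joined with OR."""
    return any(word in method for word in ("GET", "POST", "PUT", "DELETE"))


# Analogue of regexp_like(method, 'GET|POST|PUT|DELETE'): a single pattern.
pattern = re.compile("GET|POST|PUT|DELETE")


def matches_regexp(method):
    """True when the pattern occurs anywhere in the string."""
    return pattern.search(method) is not None


like_hits = [m for m in methods if matches_like(m)]
regexp_hits = [m for m in methods if matches_regexp(m)]
print(like_hits == regexp_hits)
```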
7. Put the larger table on the left side of a join
[GOOD] SELECT ... FROM large_table l join small_table s on l.id = s.id
[BAD] SELECT ... FROM small_table s join large_table l on l.id = s.id
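What if both the left and right tables are large? To avoid out-of-memory errors, either:
1) set the distributed-joins-enabled configuration property (presto version >=0.196), or
2) set the distributed_join session option at the start of each query:
-- set session distributed_join = 'true'
SELECT ... FROM large_table1 join large_table2 on large_table1.id = large_table2.id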
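The key is to use a distributed join. With this setting, Presto partitions both the left and right tables by the hash of the join key, so the right table is split up even when it is large.
The drawback is much more network data transfer, so it is slower than a broadcast join.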
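The partitioning step can be sketched in a few lines of Python. This is a single-process toy model of a hash-partitioned join, not Presto code: the worker count, the sample rows, and the function names are all assumptions for illustration.

```python
from collections import defaultdict

NUM_WORKERS = 3  # hypothetical number of workers


def partition_by_key(rows, num_workers):
    """Route each (key, value) row to a worker chosen by hashing the join key."""
    parts = defaultdict(list)
    for key, value in rows:
        parts[hash(key) % num_workers].append((key, value))
    return parts


def distributed_join(left, right, num_workers=NUM_WORKERS):
    """Join two row sets by partitioning both sides on the join key."""
    left_parts = partition_by_key(left, num_workers)
    right_parts = partition_by_key(right, num_workers)
    joined = []
    for worker in range(num_workers):
        # Each worker builds a hash table over its slice of the right table only.
        build = defaultdict(list)
        for key, value in right_parts.get(worker, []):
            build[key].append(value)
        # Rows with equal keys landed on the same worker, so a local probe suffices.
        for key, value in left_parts.get(worker, []):
            for right_value in build.get(key, []):
                joined.append((key, value, right_value))
    return joined


left = [(1, "l1"), (2, "l2"), (3, "l3"), (4, "l4")]
right = [(2, "r2"), (3, "r3"), (3, "r3b"), (5, "r5")]
result = sorted(distributed_join(left, right))
print(result)
```

No worker ever holds the whole right table, which is what keeps memory bounded; the price is that every row of both tables is routed to a worker by key, which corresponds to the extra network transfer.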
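8. Use rank() instead of row_number()
[GOOD]
SELECT checksum(rnk)
FROM (
  SELECT rank() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk
  FROM lineitem
) t
WHERE rnk = 1
[BAD]
SELECT checksum(rnk)
FROM (
  SELECT row_number() OVER (PARTITION BY l_orderkey, l_partkey ORDER BY l_shipdate DESC) AS rnk
  FROM lineitem
) t
WHERE rnk = 1
Note that rank() gives tied rows the same rank, so the two queries return the same rows only when l_shipdate is unique within each partition.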
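9. Prefer WITH clauses
When analyzing data with Presto, consider merging multiple queries into a single query, using the subqueries Presto provides.
This differs somewhat from the MySQL usage we are familiar with. Note the comma between the subqueries below.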
WITH subquery_1 AS (
  SELECT a1, a2, a3
  FROM Table_1
  WHERE a3 between 20180101 and 20180131
),
subquery_2 AS (
  SELECT b1, b2, b3
  FROM Table_2
  WHERE b3 between 20180101 and 20180131
)
SELECT subquery_1.a1, subquery_1.a2, subquery_2.b1, subquery_2.b2
FROM subquery_1
JOIN subquery_2 ON subquery_1.a3 = subquery_2.b3;
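If an existing Hive table does not yet use ORC and snappy, how can it be replaced seamlessly without affecting online applications?
For example, take the following Hive table: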
CREATE TABLE bdc_dm.res_category(
  channel_id1 int comment 'level-1 channel id',
  province string COMMENT 'province',
  city string comment 'city',
  uv int comment 'uv'
)
comment 'example'
partitioned by (landing_date int COMMENT 'date: yyyymmdd')
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  COLLECTION ITEMS TERMINATED BY ','
  MAP KEYS TERMINATED BY ':'
  LINES TERMINATED BY '\n';
Create the corresponding ORC table:
CREATE TABLE bdc_dm.res_category_orc(
  channel_id1 int comment 'level-1 channel id',
  province string COMMENT 'province',
  city string comment 'city',
  uv int comment 'uv'
)
comment 'example'
partitioned by (landing_date int COMMENT 'date: yyyymmdd')
row format delimited fields terminated by '\t'
stored as orc
TBLPROPERTIES ("orc.compress"="SNAPPY");
First load the data into the ORC table, then swap the table names:
insert overwrite table bdc_dm.res_category_orc partition(landing_date)
select * from bdc_dm.res_category where landing_date >= 20171001;
ALTER TABLE bdc_dm.res_category RENAME TO bdc_dm.res_category_tmp;
ALTER TABLE bdc_dm.res_category_orc RENAME TO bdc_dm.res_category;
Here res_category_tmp is a backup table; if no problems appear after it has run in production for a while, the backup can be dropped.
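Author: 叫我小名. Link: https://www.jianshu.com/p/f435ce79c966. Source: Jianshu. Copyright belongs to the author; for reposting in any form, contact the author for authorization and credit the source.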