使用过oracle或者其余关系数据库的DBA或者开发人员都有这样的经验,在子查询上都认为数据库已经作过优化,可以很好的选择驱动表执行,而后在把该经验移植到mysql数据库上,可是不幸的是,mysql在子查询的处理上有可能会让你大失所望,在咱们的生产系统上就碰到过一些案例,例如:html
SELECT i_id, sum(i_sell) AS i_sell FROM table_data WHERE i_id IN (SELECT i_id FROM table_data WHERE Gmt_create >= '2011-10-07 00:00:00') GROUP BY i_id;
(备注:sql的业务逻辑能够打个比方:先查询出10-07号新卖出的100本书,而后在查询这新卖出的100本书在整年的销量状况)。mysql
这条sql之因此出现的性能问题在于mysql优化器在处理子查询的弱点,mysql优化器在处理子查询的时候,会将将子查询改写。一般状况下,咱们但愿由内到外,先完成子查询的结果,而后在用子查询来驱动外查询的表,完成查询;可是mysql处理为将会先扫描外面表中的全部数据,每条数据将会传到子查询中与子查询关联,若是外表很大的话,那么性能上将会出现问题;
针对上面的查询,因为table_data这张表的数据有70W的数据,同时子查询中的数据较多,有大量是重复的,这样就须要关联近70W次,大量的关联致使这条sql执行了几个小时也没有执行完成,因此咱们须要改写sql:sql
SELECT t2.i_id, SUM(t2.i_sell) AS sold FROM (SELECT DISTINCT i_id FROM table_data WHERE gmt_create >= '2011-10-07 00:00:00') t1, table_data t2 WHERE t1.i_id = t2.i_id GROUP BY t2.i_id;
咱们将子查询改成了关联,同时在子查询中加上distinct,减小t1关联t2的次数;
改造后,sql的执行时间降到100ms之内。
mysql的子查询的优化一直不是很友好,一直有受业界批评比较多,也是我在sql优化中遇到过最多的问题之一,mysql在处理子查询的时候,会将子查询改写,一般状况下,咱们但愿由内到外,也就是先完成子查询的结果,而后在用子查询来驱动外查询的表,完成查询,可是偏偏相反,子查询不会先被执行;今天但愿经过介绍一些实际的案例来加深对mysql子查询的理解。下面将介绍一个完整的案例及其分析、调优的过程与思路。 数据库
用户反馈数据库响应较慢,许多业务动更新被卡住;登陆到数据库中观察,发现长时间执行的sql;mysql优化
| 10437 | usr0321t9m9 | 10.242.232.50:51201 | oms | Execute | 1179 | Sending Sql为: SELECT tradedto0_.* FROM a1 tradedto0_ WHERE tradedto0_.tradestatus='1' AND (tradedto0_.tradeoid IN (SELECT orderdto1_.tradeoid FROM a2 orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%')) AND tradedto0_.undefine4='1' AND tradedto0_.invoicetype='1' AND tradedto0_.tradestep='0' AND (tradedto0_.orderCompany LIKE '0002%') ORDER BY tradedto0_.tradesign ASC, tradedto0_.makertime DESC LIMIT 15;
UPDATE a1 SET tradesign='DAB67634-795C-4EAC-B4A0-78F0D531D62F', markColor=' #CD5555', memotime='2012-09- 22', markPerson='??' WHERE tradeoid IN ('gy2012092204495100032') ;
为了尽快恢复应用,将其长时间执行的sql kill掉后,应用恢复正常; oracle
db@3306 :explain SELECT tradedto0_.* FROM a1 tradedto0_ WHERE tradedto0_.tradestatus='1' AND (tradedto0_.tradeoid IN (SELECT orderdto1_.tradeoid FROM a2 orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%')) AND tradedto0_.undefine4='1' AND tradedto0_.invoicetype='1' AND tradedto0_.tradestep='0' AND (tradedto0_.orderCompany LIKE '0002%') ORDER BY tradedto0_.tradesign ASC, tradedto0_.makertime DESC LIMIT 15; +----+--------------------+------------+------+---------------+------+---------+------+-------+----- | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+------------+------+---------------+------+---------+------+-------+----- | 1 | PRIMARY | tradedto0_ | ALL | NULL | NULL | NULL | NULL | 27454 | Using where; Using filesort | | 2 | DEPENDENT SUBQUERY | orderdto1_ | ALL | NULL | NULL | NULL | NULL | 40998 | Using where | +----+--------------------+------------+------+---------------+------+---------+------+-------+-----
从执行计划上,咱们开始一步一步地进行优化:
首先,咱们看看执行计划的第二行,也就是子查询的那部分,orderdto1_进行了全表的扫描,咱们看看能不能添加适当的索引: 性能
db@3306:alter table a2 add index ind_a2(proname,procode,tradeoid); ERROR 1071 (42000): Specified key was too long; max key length is 1000 bytes
添加组合索引超过了最大key length限制:测试
db@3306 :DESC a2 ; +---------------------+---------------+------+-----+---------+-------+ | FIELD | TYPE | NULL | KEY | DEFAULT | Extra | +---------------------+---------------+------+-----+---------+-------+ | OID | VARCHAR(50) | NO | PRI | NULL | | | TRADEOID | VARCHAR(50) | YES | | NULL | | | PROCODE | VARCHAR(50) | YES | | NULL | | | PRONAME | VARCHAR(1000) | YES | | NULL | | | SPCTNCODE | VARCHAR(200) | YES | | NULL | |
db@3306 :SELECT MAX(LENGTH(PRONAME)),avg(LENGTH(PRONAME)) FROM a2; +----------------------+----------------------+ | MAX(LENGTH(PRONAME)) | avg(LENGTH(PRONAME)) | +----------------------+----------------------+ | 95 | 24.5588 |
ALTER TABLE MODIFY COLUMN PRONAME VARCHAR(156);
再进行执行计划分析:优化
db@3306 :explain SELECT tradedto0_.* FROM a1 tradedto0_ WHERE tradedto0_.tradestatus='1' AND (tradedto0_.tradeoid IN (SELECT orderdto1_.tradeoid FROM a2 orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%')) AND tradedto0_.undefine4='1' AND tradedto0_.invoicetype='1' AND tradedto0_.tradestep='0' AND (tradedto0_.orderCompany LIKE '0002%') ORDER BY tradedto0_.tradesign ASC, tradedto0_.makertime DESC LIMIT 15; +----+--------------------+------------+-------+-----------------+----------------------+---------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+------------+-------+-----------------+----------------------+---------+ | 1 | PRIMARY | tradedto0_ | ref | ind_tradestatus | ind_tradestatus | 345 | const,const,const,const | 8962 | Using where; Using filesort | | 2 | DEPENDENT SUBQUERY | orderdto1_ | index | NULL | ind_a2 | 777 | NULL | 41005 | Using where; Using index | +----+--------------------+------------+-------+-----------------+----------------------+---------+
发现性能仍是上不去,关键在两个表扫描的行数并无减少(8962*41005),上面添加的索引没有太大的效果,如今查看t表的执行结果: ui
db@3306 : SELECT orderdto1_.tradeoid FROM t orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%'; Empty SET (0.05 sec)
结果集为空,因此须要将t表的结果集作做为驱动表;
经过上面测试验证,普通的mysql子查询写法性能上是不好的,为mysql的子查询自然的弱点,须要将sql进行改写为关联的写法:
SELECT tradedto0_.* FROM a1 tradedto0_ , (SELECT orderdto1_.tradeoid FROM a2 orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%')t2 WHERE tradedto0_.tradestatus='1' AND (tradedto0_.tradeoid=t2.tradeoid) AND tradedto0_.undefine4='1' AND tradedto0_.invoicetype='1' AND tradedto0_.tradestep='0' AND (tradedto0_.orderCompany LIKE '0002%') ORDER BY tradedto0_.tradesign ASC, tradedto0_.makertime DESC LIMIT 15;
db@3306 :explain SELECT tradedto0_.* FROM a1 tradedto0_ , (SELECT orderdto1_.tradeoid FROM a2 orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%')t2 WHERE tradedto0_.tradestatus='1' AND (tradedto0_.tradeoid=t2.tradeoid) AND tradedto0_.undefine4='1' AND tradedto0_.invoicetype='1' AND tradedto0_.tradestep='0' AND (tradedto0_.orderCompany LIKE '0002%') ORDER BY tradedto0_.tradesign ASC, tradedto0_.makertime DESC LIMIT 15; +----+-------------+------------+-------+---------------+----------------------+---------+------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+------------+-------+---------------+----------------------+---------+------+ | 1 | PRIMARY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Impossible WHERE noticed after reading const tables | | 2 | DERIVED | orderdto1_ | index | NULL | ind_a2 | 777 | NULL | 41005 | Using where; Using index | +----+-------------+------------+-------+---------------+----------------------+---------+------+
db@3306 : SELECT tradedto0_.* FROM a1 tradedto0_ , (SELECT orderdto1_.tradeoid FROM a2 orderdto1_ WHERE orderdto1_.proname LIKE '%??%' OR orderdto1_.procode LIKE '%??%')t2 WHERE tradedto0_.tradestatus='1' AND (tradedto0_.tradeoid=t2.tradeoid) AND tradedto0_.undefine4='1' AND tradedto0_.invoicetype='1' AND tradedto0_.tradestep='0' AND (tradedto0_.orderCompany LIKE '0002%') ORDER BY tradedto0_.tradesign ASC, tradedto0_.makertime DESC LIMIT 15; Empty SET (0.03 sec)
缩短到了毫秒;
当一个查询是另外一个查询的条件时,称之为子查询。子查询可使用几个简单命令构造功能强大的复合命令。
子查询最经常使用于WHERE子句中。还用在SELECT,FROM子句中,下面分别举例说明。
1. 子查询用WHERE子句。
示例:显示emp表中职位为CLERK和SALESMAN的员工信息
SQL> SELECT * FROM emp WHERE job in('CLERK','SALESMAN'); EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO ----- ---------- --------- ----- ----------- --------- --------- ------ 7369 SMITH CLERK 7902 1980/12/17 800.00 20 7499 ALLEN SALESMAN 7698 1981/2/20 1600.00 300.00 30 7521 WARD SALESMAN 7698 1981/2/22 1250.00 500.00 30 7654 MARTIN SALESMAN 7698 1981/9/28 1250.00 1400.00 30 7844 TURNER SALESMAN 7698 1981/9/8 1500.00 0.00 30 7876 ADAMS CLERK 7788 1987/5/23 1100.00 20 7900 JAMES CLERK 7698 1981/12/3 950.00 30 7934 MILLER CLERK 7782 1982/1/23 1300.00 10 8 rows selected |
2. 子查询用from子句。
示例:显示emp表中5-10条记录。
SQL> SELECT empno,ename,job,hiredate,sal,comm,deptno 2 FROM (SELECT ROWNUM r,emp.* FROM emp ) T 3 WHERE T.r>=5 AND T.r<10; EMPNO ENAME JOB HIREDATE SAL COMM DEPTNO ----- ---------- --------- ----------- --------- --------- ------ 7654 MARTIN SALESMAN 1981/9/28 1250.00 1400.00 30 7698 BLAKE MANAGER 1981/5/1 2850.00 30 7782 CLARK MANAGER 1981/6/9 2450.00 10 7788 SCOTT ANALYST 1987/4/19 3000.00 20 7839 KING PRESIDENT 1981/11/17 5000.00 10 5 rows selected |
3.子查询用select子句
示例: 显示emp表中所员工信息及所在部门名称。
SQL> SELECT e.*, 2 (SELECT dname FROM dept WHERE deptno=e.deptno) as dname 3 FROM EMP e; EMPNO ENAME JOB MGR HIREDATE SAL COMM DEPTNO dname ----- ---------- --------- ----- ----------- --------- --------- ------ ----- 7369 SMITH CLERK 7902 1980/12/17 800.00 20 RESEARCH 7499 ALLEN SALESMAN 7698 1981/2/20 1600.00 300.00 30 SALES 7521 WARD SALESMAN 7698 1981/2/22 1250.00 500.00 30 SALES 7566 JONES MANAGER 7839 1981/4/2 2975.00 20 RESEARCH 7654 MARTIN SALESMAN 7698 1981/9/28 1250.00 1400.00 30 SALES 7698 BLAKE MANAGER 7839 1981/5/1 2850.00 30 SALES …… 14 rows selected |
写在前面的话:
——MySQL 的子查询为何有时候很糟糕——
引子:这样的子查询为何这么慢?
下面的例子是一个慢查,线上执行时间至关夸张。为何呢?
SELECT gid,COUNT(id) as count
FROM shop_goods g1
WHERE status =0 and gid IN (
SELECT gid FROM shop_goods g2 WHERE sid IN (1519066,1466114,1466110,1466102,1466071,1453929)
)
GROUP BY gid;
它的执行计划以下,请注意看关键词“DEPENDENT SUBQUERY”:
id select_type table type possible_keys key key_len ref rows Extra
------ ------------------ ------ -------------- -------------------------------------- ------------ ------- ------ ------ -----------
1 PRIMARY g1 index (NULL) idx_gid 5 (NULL) 850672 Using where
2 DEPENDENT SUBQUERY g2 index_subquery id_shop_goods,idx_sid,idx_gid idx_gid 5 func 1 Using where
基础知识:Dependent Subquery意味着什么
官方含义为:
SUBQUERY:子查询中的第一个SELECT;
DEPENDENT SUBQUERY:子查询中的第一个SELECT,取决于外面的查询 。
换句话说,就是 子查询对 g2 的查询方式依赖于外层 g1 的查询。
什么意思呢?它意味着两步:
第一步,MySQL 根据 select gid,count(id) from shop_goods where status=0 group by gid; 获得一个大结果集 t1,其数据量就是上图中的 rows=850672 了。
第二步,上面的大结果集 t1 中的每一条记录,都将与子查询 SQL 组成新的查询语句:select gid from shop_goods where sid in (15...blabla..29) and gid=%t1.gid%。等于说,子查询要执行85万次……即便这两步查询都用到了索引,但不慢才怪。
如此一来,子查询的执行效率竟然受制于外层查询的记录数,那还不如拆成两个独立查询顺序执行呢。
优化策略1:
你不想拆成两个独立查询的话,也能够与临时表联表查询,以下所示:
SELECT g1.gid,count(1)
FROM shop_goods g1,(select gid from shop_goods WHERE sid in (1519066,1466114,1466110,1466102,1466071,1453929)) g2
where g1.status=0 and g1.gid=g2.gid
GROUP BY g1.gid;
也能获得一样的结果,且是毫秒级。
它的执行计划为:
id select_type table type possible_keys key key_len ref rows Extra
------ ----------- -------------- ------ ------------------------- ------------- ------- ----------- ------ -------------------------------
1 PRIMARY <derived2> ALL (NULL) (NULL) (NULL) (NULL) 30 Using temporary; Using filesort
1 PRIMARY g1 ref idx_gid idx_gid 5 g2.gid 1 Using where
2 DERIVED shop_goods range id_shop_goods,idx_sid id_shop_goods 5 (NULL) 30 Using where; Using index
DERIVED 的官方含义为:
DERIVED:用于 from 子句里有子查询的状况。MySQL 会递归执行这些子查询,把结果放在临时表里。
DBA观点引用:MySQL 子查询的弱点
hidba 论述道(参考资源3):
mysql 在处理子查询时,会改写子查询。
一般状况下,咱们但愿由内到外,先完成子查询的结果,而后再用子查询来驱动外查询的表,完成查询。
例如:
select * from test where tid in(select fk_tid from sub_test where gid=10)
一般咱们会感性地认为该 sql 的执行顺序是:
sub_test 表中根据 gid 取得 fk_tid(2,3,4,5,6)记录,
而后再到 test 中,带入 tid=2,3,4,5,6,取得查询数据。
可是实际mysql的处理方式为:
select * from test where exists (
select * from sub_test where gid=10 and sub_test.fk_tid=test.tid
)
mysql 将会扫描 test 中全部数据,每条数据都将会传到子查询中与 sub_test 关联,子查询不会先被执行,因此若是 test 表很大的话,那么性能上将会出现问题。
《高性能MySQL》一书的观点引用
《高性能MySQL》的第4.4节“MySQL查询优化器的限制(Limitations of the MySQL Query Optimizer)”之第4.4.1小节“关联子查询(Correlated Subqueries)”也有相似的论述:
MySQL有时优化子查询很糟,特别是在WHERE从句中的IN()子查询。……
好比在sakila数据库sakila.film表中找出全部的film,这些film的actoress包括Penelope Guiness(actor_id = 1)。能够这样写:
mysql> SELECT * FROM sakila.film
-> WHERE film_id IN(
-> SELECT film_id FROM sakila.film_actor WHERE actor_id = 1);
mysql> EXPLAIN SELECT * FROM sakila.film ...;
+----+--------------------+------------+--------+------------------------+
| id | select_type | table | type | possible_keys |
+----+--------------------+------------+--------+------------------------+
| 1 | PRIMARY | film | ALL | NULL |
| 2 | DEPENDENT SUBQUERY | film_actor | eq_ref | PRIMARY,idx_fk_film_id |
+----+--------------------+------------+--------+------------------------+
根据EXPLAIN的输出,MySQL将全表扫描film表,对找到的每行执行子查询,这是很很差的性能。幸运的是,很容易改写为一个join查询:
mysql> SELECT film.* FROM sakila.film
-> INNER JOIN sakila.film_actor USING(film_id)
-> WHERE actor_id = 1;
另一个方法是经过使用GROUP_CONCAT()执行子查询做为一个单独的查询,手工产生IN()列表。有时候比join还快。(注:你不妨在咱们的库上试试看 SELECT goods_id,GROUP_CONCAT(cast(id as char))
FROM bee_shop_goods
WHERE shop_id IN (1519066,1466114,1466110,1466102,1466071,1453929)
GROUP BY goods_id;)
MySQL已经由于这种特定类型的子查询执行计划而被批评。
什么时候子查询是好的
MySQL并不老是把子查询优化得很糟。有时候仍是很优化的。下面是个例子:
mysql> EXPLAIN SELECT film_id, language_id FROM sakila.film
-> WHERE NOT EXISTS(
-> SELECT * FROM sakila.film_actor
-> WHERE film_actor.film_id = film.film_id
-> )G
……(注:具体文字仍是请阅读《高性能MySQL》吧)
是的,子查询并非老是被优化得很糟糕,具体问题具体分析,但别忘了 explain 。