I never worked with million-row tables during school, so I had no feel for how queries behave at that scale. Recently, while working on my graduation project, I needed a fairly large dataset for chart analysis and order recommendation. I had always wondered how to design tables so that ordinary SQL runs more efficiently, so I ran the experiments below and recorded them here. This article focuses on how table design affects query performance. I'm still a beginner; this is just a record of my tests. If you have better ideas or spot mistakes, please leave a comment and I'll gladly respond.
- Storage engine: InnoDB
- Character set: utf8mb4
- Database version: 5.7.18

InnoDB is the default engine, and the real business requires transactions, so I kept it. To avoid interference from other factors, all tests run on a Tencent Cloud MySQL Basic instance (1 core, 1000 MB RAM, 50 GB disk).
```sql
-- Create the order table
CREATE TABLE tb_order (
  `id` BIGINT(20) NOT NULL AUTO_INCREMENT,
  `item_name` VARCHAR(255) NOT NULL,
  `item_price` INT(11) UNSIGNED NOT NULL,
  `item_state` TINYINT(1) NOT NULL,
  `create_time` INT(11) UNSIGNED NOT NULL,
  `time_year` CHAR(4) NOT NULL,
  `time_month` CHAR(2) NOT NULL,
  `time_day` CHAR(2) NOT NULL,
  PRIMARY KEY (id)
) ENGINE = INNODB DEFAULT CHARSET = utf8mb4;

CREATE INDEX idx_order_ctime ON tb_order (create_time);
CREATE INDEX idx_order_state ON tb_order (`item_state`);
CREATE INDEX idx_order_day ON tb_order (`time_year`, `time_month`);
CREATE INDEX idx_order_dmonth ON tb_order (`time_year`, `time_month`, `time_day`);
```
```sql
-- Random string function
CREATE DEFINER=`root`@`%` FUNCTION `rand_string`(n INT) RETURNS varchar(255) CHARSET utf8
BEGIN
  DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789';
  DECLARE return_str varchar(255) DEFAULT '';
  DECLARE i INT DEFAULT 0;
  WHILE i < n DO
    SET return_str = concat(return_str, substring(chars_str, FLOOR(1 + RAND()*62), 1));
    SET i = i + 1;
  END WHILE;
  RETURN return_str;
END

-- Random datetime within a range
CREATE DEFINER=`root`@`%` FUNCTION `rand_date`(`startDate` date, `endDate` date) RETURNS datetime
BEGIN
  DECLARE sec INT DEFAULT 0;
  DECLARE ret DATETIME;
  SET sec = ABS(UNIX_TIMESTAMP(endDate) - UNIX_TIMESTAMP(startDate));
  SET ret = DATE_ADD(startDate, INTERVAL FLOOR(1 + RAND() * (sec - 1)) SECOND);
  RETURN ret;
END

-- Stored procedure that mocks orders
CREATE DEFINER=`root`@`%` PROCEDURE `mock_order`(IN `size` int UNSIGNED, IN `sd` date, IN `ed` date)
BEGIN
  DECLARE i int UNSIGNED DEFAULT 1;
  DECLARE randOrderName VARCHAR(10);
  DECLARE randOrderTime DATETIME;
  WHILE i <= size DO
    SELECT rand_string(10) INTO randOrderName;
    SELECT rand_date(sd, ed) INTO randOrderTime;
    INSERT INTO tb_order(`item_name`, `item_price`, `item_state`, `create_time`, `time_year`, `time_month`, `time_day`)
    VALUES(randOrderName, RAND()*100, ROUND(RAND()), UNIX_TIMESTAMP(randOrderTime),
           DATE_FORMAT(randOrderTime, '%Y'), DATE_FORMAT(randOrderTime, '%m'), DATE_FORMAT(randOrderTime, '%d'));
    SET i = i + 1;
  END WHILE;
END

-- Run the procedure
CALL mock_order(1200000, '2020-03-01', '2020-12-31');
-- CALL mock_order(FLOOR(RAND() * 9999999), '2020-03-01', '2020-12-31');
```
Total row count:
```sql
SELECT count(*) FROM tb_order
> OK
> Time: 0.607s
```
Data distribution:
```sql
SELECT count(*) as count, DATE_FORMAT(FROM_UNIXTIME(create_time),'%Y-%m') as date FROM tb_order GROUP BY date desc
> OK
> Time: 5.007s
```
Result distribution (screenshot not preserved).
```sql
SELECT count( 1 ) FROM tb_order WHERE item_state = 1 AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> Time: 6.348s
```
Side note: I wasn't sure whether calling UNIX_TIMESTAMP() inside the query affects the execution time compared with passing the timestamp value directly, so I ran a few comparisons between the two forms.
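For reference, the two forms compared look roughly like this; the literal values in the second form are assumed to be what UNIX_TIMESTAMP() returns for these datetimes on a UTC+8 server (they match the partition boundary values used later).

```sql
-- Form 1: convert the datetime inside the query
SELECT count(1) FROM tb_order
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');

-- Form 2: pass precomputed Unix timestamps directly
-- (1585670400 / 1588262399 assume the server runs in UTC+8)
SELECT count(1) FROM tb_order
WHERE item_state = 1
  AND create_time BETWEEN 1585670400 AND 1588262399;
```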
| Run | With UNIX_TIMESTAMP() | Raw timestamp |
|---|---|---|
| 1 | 6.105s | 5.768s |
| 2 | 6.239s | 6.199s |
| 3 | 5.681s | 6.161s |
| 4 | 5.687s | 5.605s |
| 5 | 6.118s | 5.621s |
| Average | 5.966s | 5.870s |
The results show the difference is small, about 0.09s on average, and simply running `SELECT UNIX_TIMESTAMP('2020-04-01 00:00:00')` by itself takes about 0.045s, so all of the tests below use UNIX_TIMESTAMP().
```sql
SELECT sum(item_price) as total FROM tb_order WHERE item_state = 1 AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> Time: 5.901s
```
```sql
SELECT count( * ) AS count, item_state AS state FROM tb_order WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) GROUP BY item_state
> OK
> Time: 12.126s
```
This is slow enough to be unacceptable.
```sql
SELECT * FROM tb_order WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> Time: 0.54s
```

```sql
SELECT SUM(item_price) as total FROM tb_order WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> Time: 0.154s
```

```sql
SELECT * FROM tb_order WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) LIMIT 10000,20
> OK
> Time: 0.107s
```
```sql
SELECT count(1) FROM tb_order WHERE item_state = 1 AND time_year = '2020' and time_month = '04'
> OK
> Time: 1.329s
```

```sql
SELECT sum( item_price ) AS total FROM tb_order WHERE item_state = 1 AND time_year = '2020' AND time_month = '04'
> OK
> Time: 1.23s
```

```sql
SELECT count( * ) AS count, item_state AS state FROM tb_order WHERE time_year = '2020' AND time_month = '04' GROUP BY item_state
> OK
> Time: 1.429s
```

```sql
SELECT * FROM tb_order WHERE time_year = '2020' and time_month = '04' and time_day = '01'
> OK
> Time: 0.663s
```

```sql
SELECT SUM(item_price) as total FROM tb_order WHERE time_year = '2020' AND time_month = '04' AND time_day = '01'
> OK
> Time: 0.091s
```

```sql
SELECT * FROM tb_order WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) LIMIT 10000,20
> OK
> Time: 0.107s
```
Because the timestamp-based query returns rows in create_time index order (ascending), while the plain split-date query returns them in no particular order, an ORDER BY is added to make the comparison fair.
```sql
SELECT * FROM tb_order WHERE time_year = '2020' AND time_month = '04' ORDER BY create_time LIMIT 10000, 20
> OK
> Time: 1.599s
```
Adding partitions changes the table's on-disk layout, so I copied the table first. Also, the column used in a RANGE partition must be part of every unique key, including the primary key, so the original primary key has to be dropped and recreated to include create_time.
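The copy step itself isn't shown here; a minimal sketch, assuming a plain structure-plus-data copy:

```sql
-- Copy the table definition (indexes included), then copy the rows
CREATE TABLE tb_order_2 LIKE tb_order;
INSERT INTO tb_order_2 SELECT * FROM tb_order;
```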
```sql
ALTER TABLE tb_order_2 CHANGE COLUMN id id BIGINT(20) UNSIGNED NOT NULL;
ALTER TABLE tb_order_2 DROP PRIMARY KEY;
ALTER TABLE tb_order_2 ADD PRIMARY KEY(id, create_time);
```
```sql
ALTER TABLE tb_order_2 PARTITION BY RANGE(create_time)(
  PARTITION p_2020_01 VALUES LESS THAN (1580486400),
  PARTITION p_2020_02 VALUES LESS THAN (1582992000),
  PARTITION p_2020_03 VALUES LESS THAN (1585670400),
  PARTITION p_2020_04 VALUES LESS THAN (1588262400),
  PARTITION p_2020_05 VALUES LESS THAN (1590940800),
  PARTITION p_2020_06 VALUES LESS THAN (1593532800),
  PARTITION p_2020_07 VALUES LESS THAN (1596211200),
  PARTITION p_2020_08 VALUES LESS THAN (1598889600),
  PARTITION p_2020_09 VALUES LESS THAN (1601481600),
  PARTITION p_2020_10 VALUES LESS THAN (1604160000),
  PARTITION p_2020_11 VALUES LESS THAN (1606752000),
  PARTITION p_2020_12 VALUES LESS THAN MAXVALUE
)
```
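Each boundary is the Unix timestamp of the first second of the following month; assuming the server time zone is UTC+8 (which matches the values above), they can be derived directly:

```sql
-- e.g. p_2020_03 holds rows with create_time < 2020-04-01 00:00:00
SELECT UNIX_TIMESTAMP('2020-04-01 00:00:00');  -- 1585670400 on a UTC+8 server
SELECT UNIX_TIMESTAMP('2020-05-01 00:00:00');  -- 1588262400 on a UTC+8 server
```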
Check how the rows are distributed across the partitions:

```sql
SELECT partition_name part, partition_expression expr, partition_description descr, table_rows
FROM information_schema.PARTITIONS
WHERE table_schema = SCHEMA() AND table_name = 'tb_order_2';
```
```sql
SELECT count( 1 ) FROM tb_order_2 WHERE item_state = 1 AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> Time: 0.133s
```
You can see this is much faster than the same query against the un-partitioned timestamp table. Let's EXPLAIN it.
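The original EXPLAIN screenshot wasn't preserved; in MySQL 5.7 the `partitions` column of EXPLAIN shows which partitions will actually be scanned, so the check looks roughly like this:

```sql
EXPLAIN SELECT count(1) FROM tb_order_2
WHERE item_state = 1
  AND create_time BETWEEN UNIX_TIMESTAMP('2020-04-01 00:00:00')
                      AND UNIX_TIMESTAMP('2020-04-30 23:59:59');
-- the `partitions` column should list only the partition(s) covering April 2020
```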
The EXPLAIN output confirms the query only touched the matching partition.
```sql
SELECT sum(item_price) as total FROM tb_order_2 WHERE item_state = 1 AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> Time: 0.526s
```
```sql
SELECT count( * ) AS count, item_state AS state FROM tb_order_2 WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) GROUP BY item_state
> OK
> Time: 0.273s
```
```sql
SELECT * FROM tb_order_2 WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> Time: 0.587s
```
For a full SELECT * over the range, the difference is not large.
```sql
SELECT SUM(item_price) as total FROM tb_order_2 WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> Time: 0.068s
```

```sql
SELECT * FROM tb_order_2 WHERE create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) LIMIT 10000,20
> OK
> Time: 0.073s
```
| Test type | Count monthly orders | Sum monthly order total | Group monthly orders by state | Query one day's orders | Sum one day's order total | Paginate monthly orders |
|---|---|---|---|---|---|---|
| Timestamp | 6.348s | 5.901s | 12.126s | 0.54s | 0.154s | 0.107s |
| Split date columns | 1.329s | 1.23s | 1.429s | 0.663s | 0.091s | 0.107s |
| Partition | 0.133s | 0.526s | 0.273s | 0.587s | 0.068s | 0.073s |
Overall, for time-based queries the efficiency ranking is partitioning > split date columns > plain timestamp. The timestamp column here is stored as an INT; I had previously read comparisons showing int > datetime > timestamp in query efficiency, which is why I chose INT when designing the table. In the end, the choice should follow your requirements: while the table is small and the row count hasn't grown, the different designs perform about the same, so start with a plain timestamp; once the table gets large, partition it, and if it is still too large, split it into multiple tables.
Of course, all of this is just my own (novice) take, and the number of test runs is small, so the results can't fully reflect real-world behavior. I hope you'll share more ideas.