Mysql百万数据查询之时间查询

Mysql百万数据查询之时间查询

在校期间,一直没有接触过百万级的数据,一直没有百万数据下查询的感受。
近期作毕业设计,须要用到较大的数据来进行图表分析和订单推荐。
一直疑惑如何设计表才能让普通的sql更高效,所以便如下尝试并记录于此。
本篇文章着重于表的设计对查询sql的影响
小弟不才,文章仅作记录,有更好的想法或有误请留言指出,必当积极回应。

优化设想

  1. 时间以int时间戳的形式保存,并创建相关索引
  2. 时间拆分,以year、month、day进行保存,并以year+month和year+month+day作索引(空间换时间)
  3. 以时间戳保存,并按月进行分区

测试环境

- 表引擎:innoDB
- 字符集:utf8mb4
- 数据库版本:5.7.18
数据库引擎为默认的innoDB、由于实际业务考虑事务,所以就默认innoDB
为了避免受其余环节影响,用的是`腾讯云mysql基础版1核1000MB/50GB`

建表

建立相关Order表

-- 建立order表
CREATE TABLE tb_order (
    `id` BIGINT ( 20 ) NOT NULL AUTO_INCREMENT,
    `item_name` VARCHAR ( 255 ) NOT NULL,
    `item_price` INT ( 11 ) UNSIGNED NOT NULL,
    `item_state` TINYINT ( 1 ) NOT NULL,
    `create_time` INT ( 11 ) UNSIGNED NOT NULL,
    `time_year` CHAR ( 4 ) NOT NULL,
    `time_month` CHAR ( 2 ) NOT NULL,
    `time_day` CHAR ( 2 ) NOT NULL,
    PRIMARY KEY ( id ) 
) ENGINE = INNODB DEFAULT CHARSET = utf8mb4;
CREATE INDEX idx_order_ctime ON tb_order ( create_time );
CREATE INDEX idx_order_state ON tb_order ( `item_state` );
CREATE INDEX idx_order_day ON tb_order ( `time_year`, `time_month` );
CREATE INDEX idx_order_dmonth ON tb_order ( `time_year`, `time_month`, `time_day`);

建立Mock存储过程

-- 随机字符串函数
CREATE DEFINER=`root`@`%` FUNCTION `rand_string`(n INT) RETURNS varchar(255) CHARSET utf8
BEGIN 
    DECLARE chars_str varchar(100) DEFAULT 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'; 
    DECLARE return_str varchar(255) DEFAULT '' ;
    DECLARE i INT DEFAULT 0; 
WHILE i < n DO 
    SET return_str = concat(return_str,substring(chars_str , FLOOR(1 + RAND()*62 ),1)); 
    SET i = i +1; 
END WHILE; 
    RETURN return_str; 
END

-- 随机范围时间函数
CREATE DEFINER=`root`@`%` FUNCTION `rand_date`(`startDate` date,`endDate` date) RETURNS datetime
BEGIN
    #Routine body goes here...
    DECLARE sec INT DEFAULT 0;
    DECLARE ret DATETIME;
    SET sec = ABS(UNIX_TIMESTAMP(endDate) - UNIX_TIMESTAMP(startDate));
    SET ret = DATE_ADD(startDate, INTERVAL FLOOR( 1+RAND ( ) * ( sec-1))SECOND);
    RETURN ret;
END

-- 模拟订单存储过程
CREATE DEFINER=`root`@`%` PROCEDURE `mock_order`(IN `size` int UNSIGNED,IN `sd` date,IN `ed` date)
BEGIN
    #Routine body goes here...
    DECLARE i int UNSIGNED DEFAULT 1;
    DECLARE randOrderName VARCHAR(10);
    DECLARE randOrderTime DATETIME;
    WHILE i<= size DO
        SELECT rand_string(10) INTO randOrderName;
        SELECT rand_date(sd,ed) INTO randOrderTime;
        INSERT INTO tb_order(`item_name`,`item_price`,`item_state`,`create_time`,`time_year`,`time_month`,`time_day`)
        VALUES(randOrderName,RAND()*100,ROUND(RAND()),UNIX_TIMESTAMP(randOrderTime),DATE_FORMAT(randOrderTime,'%Y'),DATE_FORMAT(randOrderTime,'%m'),DATE_FORMAT(randOrderTime,'%d'));
        SET i = i+1;

    END WHILE;
END

-- 执行存储过程
CALL mock_order(1200000,'2020-03-01','2020-12-31')

-- CALL mock_order(FLOOR(RAND() * 9999999),'2020-03-01','2020-12-31')

执行结果

数据总量:mysql

SELECT count(*) FROM tb_order
> OK
> 时间: 0.607s

执行结果.jpg)sql

数据分布:数据库

SELECT 
    count(*) as count,
    DATE_FORMAT(FROM_UNIXTIME(create_time),'%Y-%m') as date
FROM
    tb_order
GROUP BY 
    date desc
> OK
> 时间: 5.007s

数据分布:

结果分布express

测试过程

  1. 统计月订单数量
  2. 统计月订单总额
  3. 根据月份统计订单状态
  4. 查询日订单
  5. 统计日订单总额
  6. 分页查询月订单

时间戳

1. 统计月订单数量

SELECT
    count( 1 ) 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 6.348s

20200308204156.png

插曲:由于不肯定 使用 UNIX_TIMESTAMP() 函数对本来的查询结果是否有影响,决定多测测量使用 UNIX_TIMESTAMP() 与直接使用时间戳的时间对比
次数 使用转换函数 直接使用时间戳
1 6.105s 5.768s
2 6.239s 6.199s
3 5.681s 6.161s
4 5.687s 5.605s
5 6.118s 5.621s
平均值 5.966s 5.870s
结果显示 相差并非不少 0.09s 并且单纯的执行 select UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 须要0.045s 因此下列均使用 UNIX_TIMESTAMP() 进行测试

2. 统计月订单总额

SELECT
    sum(item_price) as total
FROM
    tb_order 
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 5.901s

20200308205618.png

3. 根据月份统计订单状态

SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
GROUP BY
    item_state
> OK
> 时间: 12.126s

20200308205805.png

时间慢的有点不可接受

4. 查询日订单

SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.54s

20200308212150.png

5. 统计日订单总额

SELECT
    SUM(item_price) as total
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.154s

20200308212401.png

6. 分页查询月订单

SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> 时间: 0.107s

20200308211153.png

时间拆分的形式

1. 统计月订单数量

SELECT
    count(1) 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND time_year = '2020' and time_month = '04'
> OK
> 时间: 1.329s

20200308211318.png

2. 统计月订单总额

SELECT
    sum( item_price ) AS total 
FROM
    tb_order 
WHERE
    item_state = 1 
    AND time_year = '2020' 
    AND time_month = '04'
> OK
> 时间: 1.23s

20200308211634.png

3. 根据月份统计订单状态

SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order 
WHERE
    time_year = '2020' 
    AND time_month = '04'
GROUP BY
    item_state
> OK
> 时间: 1.429s

20200308211725.png

4. 查询日订单

SELECT
    * 
FROM
    tb_order 
WHERE
    time_year = '2020' and time_month = '04' and time_day = '01'
> OK
> 时间: 0.663s

20200308212225.png

5. 统计日订单总额

SELECT
    SUM(item_price) as total
FROM
    tb_order 
WHERE
    time_year = '2020' AND time_month = '04' AND time_day = '01'
> OK
> 时间: 0.091s

20200308212430.png

6. 分页查询月订单

SELECT
    * 
FROM
    tb_order 
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> 时间: 0.107s

20200308212613.png

因为 根据 时间戳来查询的结果是按照时间戳索引排序的,所以是从小到大
而单纯的拆分查询并无有序
所以 添加一个 order by 来进行排序
SELECT
    * 
FROM
    tb_order 
WHERE
    time_year = '2020' 
    AND time_month = '04' 
ORDER BY
    create_time  
    LIMIT 10000,
    20
> OK
> 时间: 1.599s

20200308213113.png

时间分区

由于添加分区会改变表的文件结构,所以copy一个表

1. 新建主键

由于range分区创建的字段必须为主键或惟一键,所以须要删除原先主键并新建主键
ALTER TABLE tb_order_2 CHANGE COLUMN id id BIGINT(20) UNSIGNED NOT NULL;
alter table tb_order_2 DROP PRIMARY key;
alter table tb_order_2 add PRIMARY key(id,create_time);

2. 新建表分区

ALTER table tb_order_2 PARTITION BY RANGE(create_time)(
PARTITION  p_2020_01  VALUES LESS THAN (1580486400),
PARTITION  p_2020_02  VALUES LESS THAN (1582992000),
PARTITION  p_2020_03  VALUES LESS THAN (1585670400),
PARTITION  p_2020_04  VALUES LESS THAN (1588262400),
PARTITION  p_2020_05  VALUES LESS THAN (1590940800),
PARTITION  p_2020_06  VALUES LESS THAN (1593532800),
PARTITION  p_2020_07  VALUES LESS THAN (1596211200),
PARTITION  p_2020_08  VALUES LESS THAN (1598889600),
PARTITION  p_2020_09  VALUES LESS THAN (1601481600),
PARTITION  p_2020_10  VALUES LESS THAN (1604160000),
PARTITION  p_2020_11  VALUES LESS THAN (1606752000),
PARTITION  p_2020_12  VALUES LESS THAN MAXVALUE
)

20200312105829.png

3. 查看表分区结构

SELECT
    partition_name part,
    partition_expression expr,
    partition_description descr,
    table_rows 
FROM
    information_schema.PARTITIONS 
WHERE
    table_schema = SCHEMA () 
    AND table_name = 'tb_order_2';

20200312105949.png

4. 统计月订单

SELECT
    count( 1 ) 
FROM
    tb_order_2
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 0.133s
能够看出时间相比之间查询时间戳快不少。explain一下
20200312110146.png
用了分区进行查询

5. 统计月订单总额

SELECT
    sum(item_price) as total
FROM
    tb_order_2
WHERE
    item_state = 1 
    AND create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' )
> OK
> 时间: 0.526s

20200312110252.png

6. 根据月份统计订单状态

SELECT
    count( * ) AS count,
    item_state AS state 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
GROUP BY
    item_state
> OK
> 时间: 0.273s

20200312112123.png

7. 查询日订单

SELECT
    * 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.587s
所有查询的话 并 差距并不大
20200312112257.png

8. 统计日订单总额

SELECT
    SUM(item_price) as total
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-1 23:59:59' )
> OK
> 时间: 0.068s

20200312112342.png

9. 分页查询月订单

SELECT
    * 
FROM
    tb_order_2
WHERE
    create_time BETWEEN UNIX_TIMESTAMP( '2020-04-01 00:00:00' ) 
    AND UNIX_TIMESTAMP( '2020-04-30 23:59:59' ) 
    LIMIT 10000,20
> OK
> 时间: 0.073s

20200312112415.png

图表

测试类型 统计月订单数 统计月订单总额 根据月份统计订单状态 查询日订单 统计日订单总额 分页查询月订单
时间戳 6.348s 5.901s 12.126s 0.54s 0.154s 0.107s
时间拆分 1.329s 1.23s 1.429s 0.663s 0.091s 0.107s
分区 0.133s 0.526s 0.273s 0.587s 0.068s 0.073s

总结

整体看来,就时间查询来讲,效率是表分区>时间拆分>时间戳,时间戳的表结构为int结构,由于之间看过其余文章有对比过效率 int > datetime > timestamp,所以我在这里设计表的时候就采用了int的形式。

其实怎么选择都应该按需求而定,当咱们的表不大,数量级还没上去,哪一种结构查询相差并非很大,先按时间戳的形式查询,待表结构太大能够进行表的分区,结构若仍是很大,能够进行分表。函数

固然这全部都是我我的(菜鸟)的想法,统计次数也较少,没法彻底表明真实状况。但愿你们多提出想法测试

相关文章
相关标签/搜索