[Battle of Xierqi] | Redis Interview Hot Topics: Engineering and Architecture

Date: 2019-12-18

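Preface

The previous two articles gave an overview of Redis interview questions about the underlying implementation. If you are interested, you can revisit them:
[Battle of Xierqi] | Redis Interview Hot Topics: Underlying Implementation
[Battle of Xierqi] | Redis Interview Hot Topics: Underlying Implementation (Continued)

Next, let's look at questions about Redis engineering and architecture. These come up more often in interviews, because not everyone reads the source code, and an interview that asks only about source code is likely to turn into an awkward conversation.

When an interviewer does not expect the candidate to be deeply familiar with Redis, engineering questions are the natural choice. There are quite a few of them, so this series will cover them over several articles. Today we start with the first one.

After reading this article you will understand:
1. Redis memory reclamation in detail
2. The Redis persistence mechanism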

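Q1: Do you understand Redis memory reclamation? Explain your understanding.

1.1 Why reclaim memory?

Redis is an in-memory database. If data only ever goes in and never comes out, memory will eventually run out. In practice, many teams that use Redis as their primary store end up learning this the hard way. The exceptions are shops with money to burn, for which several terabytes of memory is small change, and workloads whose data stops growing after a certain point.

For ordinary companies like ours, where cost savings are a KPI, using Redis as a cache fits the budget better. Seen this way, the interviewer's question is fairly practical, and certainly better than being asked to hand-write a red-black tree. If the question happens to fall within your range, give the interviewer a thumbs-up first!

To keep a Redis service running safely and stably, memory usage has to stay below a threshold. That means deleting what should be deleted and cleaning up what should be cleaned up, so that memory is left for the key-value pairs that need it. A large river needs several warning water levels to make sure it neither bursts its banks nor runs dry. Redis is similar, except that it only has to worry about overflowing. Here is a diagram:

The diagram assumes a machine with 128GB of memory. Using 64GB is a fairly safe level. If usage approaches 80%, roughly 100GB, Redis is already carrying a heavy load. The exact ratios can be set according to company and personal operational experience.

My point is simply that, for safety and stability, 128GB of memory does not mean you can store 128GB of data. There is always a discount.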
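As a rough illustration of the watermark idea, the following C sketch classifies memory usage into three levels. The type and function names are hypothetical, and the 50% and 80% cut-offs simply mirror the 64GB and 80% figures used above for a 128GB machine; they are not Redis settings.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical watermark levels, not part of Redis. */
typedef enum { MEM_SAFE, MEM_WARNING, MEM_CRITICAL } mem_level;

/* Classify current usage against total memory, both given in bytes.
 * The 50% and 80% cut-offs are assumptions taken from the example in
 * the text; tune them to your own workload. */
mem_level classify_memory(uint64_t used, uint64_t total) {
    assert(total > 0 && used <= total);
    /* Integer percentage, rounded down. */
    uint64_t pct = used * 100 / total;
    if (pct >= 80) return MEM_CRITICAL;
    if (pct > 50) return MEM_WARNING;
    return MEM_SAFE;
}
```

With 128GB of memory, 64GB used lands in the safe level, 100GB (about 78%) in the warning level, and anything from 80% upward in the critical level.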

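1.2 Where is memory reclaimed from?

Redis memory usage has two parts: memory consumed by stored key-value pairs and memory consumed by the server's own operation. The latter obviously cannot be reclaimed, so we can only work on the key-value pairs. They fall into several categories: keys with an expiration, keys without one, hot data, and cold data. Expired keys need to be deleted. And if memory is still insufficient after every expired key-value pair has been deleted? Then some data has to be evicted.

Life is full of trade-offs. This reminds me of Titanic: the liner strikes an iceberg, seawater pours in, and with too few lifeboats people make a choice. Women and children go first, and the gentlemen stay behind. The escape scene is shown below: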

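1.3 How are expired key-value pairs deleted?

To delete expired key-value pairs, we need to answer the following questions:

Where are key-value pairs with an expiration stored?
How do we decide whether a key-value pair with a timeout can be deleted?
What deletion mechanisms exist, and how do we choose among them?

1.3.1 Where key-value pairs are stored

As usual, let's look at the source on GitHub. The redisDb struct in src/server.h gives the answer: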

typedef struct redisDb {
    dict *dict;                 /* The keyspace for this DB */
    dict *expires;              /* Timeout of keys with a timeout set */
    dict *blocking_keys;        /* Keys with clients waiting for data (BLPOP)*/
    dict *ready_keys;           /* Blocked keys that received a PUSH */
    dict *watched_keys;         /* WATCHED keys for MULTI/EXEC CAS */
    int id;                     /* Database ID */
    long long avg_ttl;          /* Average TTL, just for stats */
    unsigned long expires_cursor; /* Cursor of the active expire cycle. */
    list *defrag_later;         /* List of key names to attempt to defrag one by one, gradually. */
} redisDb;

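Redis is essentially one large key-value store. Keys are strings, and values are one of several object types: strings, lists, sorted sets, sets, hashes, and so on. All of these key-value pairs are stored in the dict member of redisDb. Here is an excellent diagram drawn by Huang Jianhong:

At this point the deletion mechanism is a step clearer: we just delete the target key-value pair from the dict of redisDb. It is not quite that simple, though. Redis must have its own rules for organizing expiring key-value pairs, so let's keep digging!

The expires member of redisDb is also of type dict, the same as the keyspace. Its keys are a subset of the keys in dict: expires holds an entry for every key that has an expiration set. Let's call it the expires dictionary. It is the focus of our study.

For a key, we can set an absolute or relative expiration time and query the remaining time:

Use EXPIRE and PEXPIRE to set a time to live in seconds or milliseconds. These set a relative expiration.
Use EXPIREAT and PEXPIREAT to expire a key at a given timestamp in seconds or milliseconds. These set an absolute expiration.
Use TTL and PTTL to query the remaining time to live of a key that has one.

These three groups of commands are very useful when designing a cache, so they are worth remembering.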
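The relationship between the relative and absolute forms can be sketched in C as follows. This is a minimal illustration, not Redis code: all function names are hypothetical, the current time is passed in as a parameter to keep the helpers deterministic, and clamping the remaining time at zero and rounding to the nearest second are simplifying assumptions.

```c
#include <assert.h>

/* EXPIRE-style input: a relative time to live in seconds becomes an
 * absolute Unix timestamp in milliseconds. */
long long expire_at_ms_from_ttl_seconds(long long now_ms, long long ttl_seconds) {
    assert(ttl_seconds >= 0);
    return now_ms + ttl_seconds * 1000;
}

/* EXPIREAT-style input: an absolute Unix timestamp in seconds is only
 * rescaled to milliseconds. */
long long expire_at_ms_from_unix_seconds(long long unix_seconds) {
    return unix_seconds * 1000;
}

/* PTTL-style query: remaining lifetime in milliseconds, clamped at 0
 * (an assumption of this sketch). */
long long remaining_ms(long long now_ms, long long expire_at_ms) {
    long long left = expire_at_ms - now_ms;
    return left > 0 ? left : 0;
}

/* TTL-style query: remaining lifetime rounded to the nearest second. */
long long remaining_seconds(long long now_ms, long long expire_at_ms) {
    return (remaining_ms(now_ms, expire_at_ms) + 500) / 1000;
}
```

Storing one absolute timestamp per key means a single comparison against the current time is enough to decide whether the key has expired, whichever command set the expiration.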

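The expires dictionary and the keyspace dict do not store the same content. A key in expires is a pointer to the corresponding key object in the keyspace, and its value is a Unix timestamp of type long long, in milliseconds. The relative durations given to EXPIRE and PEXPIRE are also converted into timestamps in the end. I drew a diagram of the structure of the expires dictionary: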

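1.3.2 Deciding whether a key-value pair has expired

To decide whether a key has expired and can be deleted, first check whether the key exists in the expires dictionary. If it does, compare its expiration timestamp with the current timestamp and make the deletion decision accordingly. A simple flow is shown below: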
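The two-step check can be sketched in C like this. The expire_entry type and both functions are hypothetical stand-ins: a small array with a linear search replaces the real hash table, and -1 is an assumed sentinel meaning "no expiration set".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for one entry of the expires dictionary: a key name and
 * its absolute expiration time in milliseconds. */
typedef struct {
    const char *key;
    long long expire_at_ms;
} expire_entry;

/* Step 1: look the key up. Returns its expiration timestamp, or -1 when
 * the key has no expiration set. */
long long lookup_expire(const expire_entry *expires, size_t n, const char *key) {
    assert(key != NULL);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(expires[i].key, key) == 0) return expires[i].expire_at_ms;
    }
    return -1;
}

/* Step 2: compare the stored timestamp with the current time.
 * Returns 1 if the key has expired, 0 otherwise. */
int key_is_expired(const expire_entry *expires, size_t n,
                   const char *key, long long now_ms) {
    long long when = lookup_expire(expires, n, key);
    if (when < 0) return 0; /* not in expires: the key never expires */
    return now_ms > when;
}
```

A key that is absent from the expires table is treated as persistent, which is why the lookup comes before the timestamp comparison.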

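1.3.3 Deletion strategies for key-value pairs

From the previous sections we know that Redis has two storage locations, the keyspace and the expires dictionary, and we know the structure of expires and how to tell whether a key has expired. So how is the deletion actually carried out?

Setting Redis aside for a moment, consider the possible deletion strategies:

Timed deletion: when an expiration is set on a key, create a timer that deletes the key-value pair the moment the expiration time arrives.
Periodic deletion: scan the database at fixed intervals, detecting and deleting the expired key-value pairs in it.
Lazy deletion: do not delete a key-value pair when it expires. Deletion is tied to use of the key: when a key is read, first check whether it has expired. If so, delete it; otherwise keep it.

Of these three strategies, timed deletion and periodic deletion are active deletion at different time granularities, while lazy deletion is passive.

Each strategy has its own pros and cons. Timed deletion is good for memory usage but unfriendly to the CPU. Lazy deletion is unfriendly to memory: if some key-value pairs are never accessed, a certain amount of memory is wasted. Periodic deletion is a compromise between timed deletion and lazy deletion.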
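To make the difference between lazy and periodic deletion concrete, here is a toy C sketch. It is not how Redis stores data: the fixed-size store, the slot layout, and every function name are hypothetical, and the sweep scans every slot instead of sampling.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STORE_CAPACITY 8

/* Hypothetical toy slot; expire_at_ms < 0 means "no expiration". */
typedef struct {
    const char *key;        /* NULL marks an empty slot */
    const char *value;
    long long expire_at_ms;
} slot;

typedef struct {
    slot slots[STORE_CAPACITY];
} store;

void store_init(store *st) {
    for (size_t i = 0; i < STORE_CAPACITY; i++) st->slots[i].key = NULL;
}

static int slot_expired(const slot *s, long long now_ms) {
    return s->key != NULL && s->expire_at_ms >= 0 && now_ms > s->expire_at_ms;
}

/* Insert or overwrite a key. Returns 0 on success, -1 if the store is full. */
int store_set(store *st, const char *key, const char *value, long long expire_at_ms) {
    assert(key != NULL);
    slot *target = NULL;
    for (size_t i = 0; i < STORE_CAPACITY; i++) {
        slot *s = &st->slots[i];
        if (s->key == NULL) {
            if (target == NULL) target = s;   /* remember first free slot */
        } else if (strcmp(s->key, key) == 0) {
            target = s;                       /* existing key: overwrite */
            break;
        }
    }
    if (target == NULL) return -1;
    target->key = key;
    target->value = value;
    target->expire_at_ms = expire_at_ms;
    return 0;
}

/* Lazy deletion: expiry is checked only when the key is read. */
const char *store_get(store *st, const char *key, long long now_ms) {
    for (size_t i = 0; i < STORE_CAPACITY; i++) {
        slot *s = &st->slots[i];
        if (s->key == NULL || strcmp(s->key, key) != 0) continue;
        if (slot_expired(s, now_ms)) {
            s->key = NULL;                    /* delete on access */
            return NULL;
        }
        return s->value;
    }
    return NULL;
}

/* Periodic deletion: a scan removes whatever has expired.
 * Returns the number of keys removed. */
size_t store_sweep(store *st, long long now_ms) {
    size_t removed = 0;
    for (size_t i = 0; i < STORE_CAPACITY; i++) {
        if (slot_expired(&st->slots[i], now_ms)) {
            st->slots[i].key = NULL;
            removed++;
        }
    }
    return removed;
}

size_t store_count(const store *st) {
    size_t used = 0;
    for (size_t i = 0; i < STORE_CAPACITY; i++) {
        if (st->slots[i].key != NULL) used++;
    }
    return used;
}
```

An expired key that nobody reads keeps its slot until store_sweep runs, which is exactly the memory cost of lazy deletion that a periodic pass is meant to bound.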

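Redis uses a combination of lazy deletion and periodic deletion. Timers are usually implemented with a min-heap, but Redis, considering the limited kinds and number of its time events, stores time events in an unordered linked list. Implementing timed deletion on top of that would mean an O(N) traversal to find the next key due for deletion.

Still, I think that if antirez really wanted timed deletion, he would not have kept the unordered linked list. So in my view the existing unordered list cannot be the fundamental reason Redis does not use timed deletion. My tentative guess is that antirez simply considered timed deletion unnecessary.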

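1.3.4 Implementation details of periodic deletion

Periodic deletion sounds simple, but how should its frequency and duration be controlled?

If it runs too rarely, it degenerates into lazy deletion. If it runs too long, it starts to resemble timed deletion. It is a genuinely hard problem! The timing of periodic deletion also needs thought, so let's see how Redis implements it. In src/expire.c I found the activeExpireCycle function, which implements periodic deletion. antirez wrote fairly detailed comments in the code, all in English. I read through them and got the general idea, which shows that learning English and reading material in English is an important way to learn.

Here is the code first. The core part is about 210 lines including comments: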

#define ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP 20 /* Keys for each DB loop. */
#define ACTIVE_EXPIRE_CYCLE_FAST_DURATION 1000 /* Microseconds. */
#define ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC 25 /* Max % of CPU to use. */
#define ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE 10 /* % of stale keys after which we do extra efforts. */

void activeExpireCycle(int type) {
    /* Adjust the running parameters according to the configured expire
     * effort. The default effort is 1, and the maximum configurable effort
     * is 10. */
    unsigned long
    effort = server.active_expire_effort-1, /* Rescale from 0 to 9. */
    config_keys_per_loop = ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP +
                           ACTIVE_EXPIRE_CYCLE_KEYS_PER_LOOP/4*effort,
    config_cycle_fast_duration = ACTIVE_EXPIRE_CYCLE_FAST_DURATION +
                                 ACTIVE_EXPIRE_CYCLE_FAST_DURATION/4*effort,
    config_cycle_slow_time_perc = ACTIVE_EXPIRE_CYCLE_SLOW_TIME_PERC +
                                  2*effort,
    config_cycle_acceptable_stale = ACTIVE_EXPIRE_CYCLE_ACCEPTABLE_STALE-
                                    effort;

    /* This function has some global state in order to continue the work
     * incrementally across calls. */
    static unsigned int current_db = 0; /* Last DB tested. */
    static int timelimit_exit = 0;      /* Time limit hit in previous call? */
    static long long last_fast_cycle = 0; /* When last fast cycle ran. */

    int j, iteration = 0;
    int dbs_per_call = CRON_DBS_PER_CALL;
    long long start = ustime(), timelimit, elapsed;

    /* When clients are paused the dataset should be static not just from the
     * POV of clients not being able to write, but also from the POV of
     * expires and evictions of keys not being performed. */
    if (clientsArePaused()) return;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST) {
        /* Don't start a fast cycle if the previous cycle did not exit
         * for time limit, unless the percentage of estimated stale keys is
         * too high. Also never repeat a fast cycle for the same period
         * as the fast cycle total duration itself. */
        if (!timelimit_exit &&
            server.stat_expired_stale_perc < config_cycle_acceptable_stale)
            return;

        if (start < last_fast_cycle + (long long)config_cycle_fast_duration*2)
            return;

        last_fast_cycle = start;
    }

    /* We usually should test CRON_DBS_PER_CALL per iteration, with
     * two exceptions:
     *
     * 1) Don't test more DBs than we have.
     * 2) If last time we hit the time limit, we want to scan all DBs
     * in this iteration, as there is work to do in some DB and we don't want
     * expired keys to use memory for too much time. */
    if (dbs_per_call > server.dbnum || timelimit_exit)
        dbs_per_call = server.dbnum;

    /* We can use at max 'config_cycle_slow_time_perc' percentage of CPU
     * time per iteration. Since this function gets called with a frequency of
     * server.hz times per second, the following is the max amount of
     * microseconds we can spend in this function. */
    timelimit = config_cycle_slow_time_perc*1000000/server.hz/100;
    timelimit_exit = 0;
    if (timelimit <= 0) timelimit = 1;

    if (type == ACTIVE_EXPIRE_CYCLE_FAST)
        timelimit = config_cycle_fast_duration; /* in microseconds. */

    /* Accumulate some global stats as we expire keys, to have some idea
     * about the number of keys that are already logically expired, but still
     * existing inside the database. */
    long total_sampled = 0;
    long total_expired = 0;

    for (j = 0; j < dbs_per_call && timelimit_exit == 0; j++) {
        /* Expired and checked in a single loop. */
        unsigned long expired, sampled;

        redisDb *db = server.db+(current_db % server.dbnum);

        /* Increment the DB now so we are sure if we run out of time
         * in the current DB we'll restart from the next. This allows to
         * distribute the time evenly across DBs. */
        current_db++;

        /* Continue to expire if at the end of the cycle more than 25%
         * of the keys were expired. */
        do {
            unsigned long num, slots;
            long long now, ttl_sum;
            int ttl_samples;
            iteration++;

            /* If there is nothing to expire try next DB ASAP. */
            if ((num = dictSize(db->expires)) == 0) {
                db->avg_ttl = 0;
                break;
            }
            slots = dictSlots(db->expires);
            now = mstime();

            /* When there are less than 1% filled slots, sampling the key
             * space is expensive, so stop here waiting for better times...
             * The dictionary will be resized asap. */
            if (num && slots > DICT_HT_INITIAL_SIZE &&
                (num*100/slots < 1)) break;

            /* The main collection cycle. Sample random keys among keys
             * with an expire set, checking for expired ones. */
            expired = 0;
            sampled = 0;
            ttl_sum = 0;
            ttl_samples = 0;

            if (num > config_keys_per_loop)
                num = config_keys_per_loop;

            /* Here we access the low level representation of the hash table
             * for speed concerns: this makes this code coupled with dict.c,
             * but it hardly changed in ten years.
             *
             * Note that certain places of the hash table may be empty,
             * so we want also a stop condition about the number of
             * buckets that we scanned. However scanning for free buckets
             * is very fast: we are in the cache line scanning a sequential
             * array of NULL pointers, so we can scan a lot more buckets
             * than keys in the same time. */
            long max_buckets = num*20;
            long checked_buckets = 0;

            while (sampled < num && checked_buckets < max_buckets) {
                for (int table = 0; table < 2; table++) {
                    if (table == 1 && !dictIsRehashing(db->expires)) break;

                    unsigned long idx = db->expires_cursor;
                    idx &= db->expires->ht[table].sizemask;
                    dictEntry *de = db->expires->ht[table].table[idx];
                    long long ttl;

                    /* Scan the current bucket of the current table. */
                    checked_buckets++;
                    while(de) {
                        /* Get the next entry now since this entry may get
                         * deleted. */
                        dictEntry *e = de;
                        de = de->next;

                        ttl = dictGetSignedIntegerVal(e)-now;
                        if (activeExpireCycleTryExpire(db,e,now)) expired++;
                        if (ttl > 0) {
                            /* We want the average TTL of keys yet
                             * not expired. */
                            ttl_sum += ttl;
                            ttl_samples++;
                        }
                        sampled++;
                    }
                }
                db->expires_cursor++;
            }
            total_expired += expired;
            total_sampled += sampled;

            /* Update the average TTL stats for this database. */
            if (ttl_samples) {
                long long avg_ttl = ttl_sum/ttl_samples;

                /* Do a simple running average with a few samples.
                 * We just use the current estimate with a weight of 2%
                 * and the previous estimate with a weight of 98%. */
                if (db->avg_ttl == 0) db->avg_ttl = avg_ttl;
                db->avg_ttl = (db->avg_ttl/50)*49 + (avg_ttl/50);
            }

            /* We can't block forever here even if there are many keys to
             * expire. So after a given amount of milliseconds return to the
             * caller waiting for the other active expire cycle. */
            if ((iteration & 0xf) == 0) { /* check once every 16 iterations. */
                elapsed = ustime()-start;
                if (elapsed > timelimit) {
                    timelimit_exit = 1;
                    server.stat_expired_time_cap_reached_count++;
                    break;
                }
            }
            /* We don't repeat the cycle for the current database if there are
             * an acceptable amount of stale keys (logically expired but yet
             * not reclaimed). */
        } while ((expired*100/sampled) > config_cycle_acceptable_stale);
    }

    elapsed = ustime()-start;
    server.stat_expire_cycle_time_used += elapsed;
    latencyAddSampleIfNeeded("expire-cycle",elapsed/1000);

    /* Update our estimate of keys existing but yet to be expired.
     * Running average with this sample accounting for 5%. */
    double current_perc;
    if (total_sampled) {
        current_perc = (double)total_expired/total_sampled;
    } else
        current_perc = 0;
    server.stat_expired_stale_perc = (current_perc*0.05)+
                                     (server.stat_expired_stale_perc*0.95);
}

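To be honest, this code has a lot of detail. I do not know the Redis source well, so I can only offer a rough reading that may contain mistakes. I still recommend that readers who can do so read the source themselves. To get the discussion started, here is my rough version:

The algorithm is adaptive. When few keys have expired, it spends very little CPU time on them. When many keys have expired, it works aggressively to avoid heavy memory consumption. In short, it runs more rounds when it finds many expired keys and fewer when it finds few.
Redis has many databases (db). The algorithm scans them one by one, and each run continues from the db after the one where the previous run stopped, wrapping around in a closed loop.
Periodic deletion has a fast cycle mode and a slow cycle mode, and the slow cycle is the main one. Its frequency depends on server.hz, which is usually set to 10, so the slow cycle runs 10 times per second. A run stops once it has used more than 25% of the CPU time of one period, which is 25ms out of every 100ms when server.hz is 10.
The slow cycle runs relatively long and can hit its time limit. A fast cycle runs for no more than 1ms, so each run is shorter but runs happen more often. During a run, if the share of expired keys among those sampled in a db is below the threshold, the algorithm moves on to the next db. The code comment still gives this threshold as 25%, but the loop condition in the listing uses config_cycle_acceptable_stale, which is 10% at the default effort.

The gist: periodic deletion is an adaptive, closed-loop, probabilistic sampling scan. It is bounded throughout by limits on execution time and CPU time and stops when a threshold is reached, so it does its work quietly while trying not to affect responses to clients.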
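The numeric limits mentioned above follow directly from the arithmetic at the top of activeExpireCycle. The C sketch below reproduces that arithmetic in isolation so the values can be checked by hand. The constants and formulas are copied from the listing; the struct, the function names, and the shortened macro names are hypothetical.

```c
#include <assert.h>

/* Base constants, copied from the listing above with shortened names. */
#define KEYS_PER_LOOP 20
#define FAST_DURATION_US 1000
#define SLOW_TIME_PERC 25
#define ACCEPTABLE_STALE 10

typedef struct {
    unsigned long keys_per_loop;     /* keys sampled per db loop */
    unsigned long fast_duration_us;  /* fast cycle time budget */
    unsigned long slow_time_perc;    /* slow cycle CPU budget, percent */
    unsigned long acceptable_stale;  /* tolerated stale share, percent */
} expire_params;

/* Rescale the parameters for an effort between 1 and 10. */
expire_params derive_params(int active_expire_effort) {
    assert(active_expire_effort >= 1 && active_expire_effort <= 10);
    unsigned long effort = (unsigned long)active_expire_effort - 1;
    expire_params p;
    p.keys_per_loop = KEYS_PER_LOOP + KEYS_PER_LOOP / 4 * effort;
    p.fast_duration_us = FAST_DURATION_US + FAST_DURATION_US / 4 * effort;
    p.slow_time_perc = SLOW_TIME_PERC + 2 * effort;
    p.acceptable_stale = ACCEPTABLE_STALE - effort;
    return p;
}

/* Time budget of one slow cycle in microseconds, given server.hz. */
long long slow_cycle_timelimit_us(unsigned long slow_time_perc, int hz) {
    assert(hz > 0);
    long long limit = (long long)slow_time_perc * 1000000 / hz / 100;
    return limit > 0 ? limit : 1;
}

/* Running average used for avg_ttl: 98% previous value, 2% new sample. */
long long update_avg_ttl(long long previous, long long sample) {
    if (previous == 0) previous = sample;
    return (previous / 50) * 49 + (sample / 50);
}
```

At the default effort of 1 and server.hz of 10, a slow cycle may run for 25000 microseconds (25ms) and samples 20 keys per loop. At the maximum effort of 10 it samples 65 keys per loop, the CPU budget rises to 43%, and the tolerated stale share drops to 1%.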

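1.3.5 Deleting key-value pairs with DEL

Before Redis 4.0, running DEL on a very large key-value pair could block the server. Version 4.0 added lazy freeing: new commands such as UNLINK remove the key right away and leave the actual memory cleanup and reclamation to a background I/O (BIO) thread.

I previously wrote an article about LazyFree in version 4.0: A Brief Analysis of LazyFree, a New Feature in Redis 4.0.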

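1.4 Memory eviction

To keep Redis running safely and stably, a maxmemory threshold is configured. When memory usage reaches the threshold, newly written key-value pairs can no longer be stored, and the memory eviction mechanism is needed. The Redis configuration offers several eviction policies:

noeviction: when memory cannot hold newly written data, write operations return an error;
allkeys-lru: when memory cannot hold newly written data, remove the least recently used key from the keyspace;
allkeys-random: when memory cannot hold newly written data, remove a random key from the keyspace;
volatile-lru: when memory cannot hold newly written data, remove the least recently used key among keys with an expiration set;
volatile-random: when memory cannot hold newly written data, remove a random key among keys with an expiration set;
volatile-ttl: when memory cannot hold newly written data, remove keys with an earlier expiration time first among keys with an expiration set;

The last three policies all operate on the expires dictionary, and when it is empty they behave like noeviction and fail the write. Random deletion with no strategy at all is not very sensible either, so allkeys-lru, the second option, which evicts according to LRU, is the usual choice.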
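The way each policy narrows the candidate set can be sketched in C as follows. This is an illustration under stated assumptions, not the Redis implementation: the key_meta struct and pick_victim function are hypothetical, the search is exact rather than sampled, and the two random policies are left out to keep the example deterministic.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    POLICY_NOEVICTION,
    POLICY_ALLKEYS_LRU,
    POLICY_VOLATILE_LRU,
    POLICY_VOLATILE_TTL
} eviction_policy;

/* Hypothetical per-key metadata; expire_at_ms < 0 means "no expiration". */
typedef struct {
    long long last_access_ms;
    long long expire_at_ms;
} key_meta;

/* Returns the index of the key to evict, or -1 when nothing can be
 * evicted and the write has to fail. */
int pick_victim(eviction_policy policy, const key_meta *keys, size_t n) {
    assert(keys != NULL || n == 0);
    if (policy == POLICY_NOEVICTION) return -1;

    int best = -1;
    for (size_t i = 0; i < n; i++) {
        int has_expire = keys[i].expire_at_ms >= 0;
        /* volatile-* policies only consider keys with an expiration. */
        if (policy != POLICY_ALLKEYS_LRU && !has_expire) continue;
        if (best < 0) {
            best = (int)i;
        } else if (policy == POLICY_VOLATILE_TTL) {
            /* Earliest expiration goes first. */
            if (keys[i].expire_at_ms < keys[best].expire_at_ms) best = (int)i;
        } else if (keys[i].last_access_ms < keys[best].last_access_ms) {
            /* Least recently used goes first. */
            best = (int)i;
        }
    }
    return best;
}
```

When no key has an expiration set, the volatile policies return -1, matching the failure mode described above in which they behave like noeviction.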

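In my view antirez has always had an engineering mindset and likes to use probabilistic designs for approximate implementations. LRU is no exception: Redis implements an approximate LRU algorithm, and after several versions of iteration its behavior is quite close to that of theoretical LRU. That is a good topic in its own right. Because of space constraints, I plan to discuss it in detail in a later article dedicated to LRU.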

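1.5 The relationship between expired-key deletion and memory eviction

Expired-key deletion is about handling expired keys. If keys have expired while memory is still sufficient, Redis does not use memory eviction to free space. It relies on the expired-key deletion strategies to remove the expired keys.

Memory eviction is about evicting data from memory. When memory runs short, some keys must be deleted according to a policy, even if they have not reached their expiration time or have no expiration at all, to free space so that new data can be written.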

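Q2: Explain your understanding of the Redis persistence mechanism.

In my view, persistence is both a highlight of Redis as a database and a hot interview topic. The main areas examined are how RDB works, how AOF works, their respective pros and cons, the engineering trade-offs between RDB and AOF, and the hybrid persistence strategy in newer versions of Redis. If you grasp these key points, you will get through persistence questions.

I previously wrote an article on persistence, Understanding Redis Persistence, which covers most of the points above and is worth a look.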

On the shoulders of giants

www.hoohack.me/2019/06/24/…

redisbook.readthedocs.io/en/latest/i…