Redis Study 3.2: Data Structures, Associative Arrays (Dictionaries)

Date: 2019-11-10


All of the source code studied in this chapter lives in two files, dict.h and dict.c.

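In Java, or in any other language that supports associative arrays, the first thing we learn is that an associative array (dictionary) is an "array" of key-value pairs. So how does Redis build one, step by step? Let's break that statement down: before anything else, there has to be a key-value structure.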

// key-value structure
typedef struct dictEntry {

    // key
    void *key;

    // value
    union {
        void *val;
        uint64_t u64;
        int64_t s64;
    } v;

    // why is this needed? it is used to resolve key collisions
    struct dictEntry *next;

} dictEntry;

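In the structure defined above, key is the key, and the value can be a pointer, a uint64_t integer, or an int64_t integer. What exactly does next do? It links together key-value pairs whose keys hash to the same index, which is how key collisions are resolved.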

The next question is how to build the "array". Redis defines it as follows:

typedef struct dictht {

    // the array
    dictEntry **table;

    // size
    unsigned long size;
    unsigned long sizemask;

    // number of nodes already stored
    unsigned long used;

} dictht;

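The table above is an array, and each of its elements is a pointer to a dictEntry. The size field records how many slots table has. Why is it needed? We often hear the term "hash bucket"; size states how many buckets this hash table has. And used? It is the number of nodes currently stored in the table. Note that it counts entries, not occupied indexes: a bucket holding a chain of three entries contributes three. That leaves sizemask. It is closely tied to hashing, and its value is always size-1. How it takes part in hashing is covered later, when we need it.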

The next step is our final goal, the associative array (dictionary) itself. Redis defines it like this:

typedef struct dict {

    dictType *type;

    void *privdata;

    dictht ht[2];

    int rehashidx; /* rehashing not in progress if rehashidx == -1 */

    int iterators; /* number of iterators currently running */

} dict;

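To implement a general-purpose dictionary, the definition cannot use concrete types, and so it cannot hard-code type-specific operations either. For that reason a Redis dictionary lets you configure your own operations for each kind of data. That is the job of the type field, which is defined as follows: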

// binds a different set of functions for each kind of dictionary
typedef struct dictType {

    // computes the hash value of a key
    unsigned int (*hashFunction)(const void *key);

    // duplicates a key
    void *(*keyDup)(void *privdata, const void *key);

    // duplicates a value
    void *(*valDup)(void *privdata, const void *obj);

    // compares two keys
    int (*keyCompare)(void *privdata, const void *key1, const void *key2);

    // destroys a key
    void (*keyDestructor)(void *privdata, void *key);

    // destroys a value
    void (*valDestructor)(void *privdata, void *obj);

} dictType;

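What is the privdata field for? Since different functions can be bound for different types, and most of them take privdata as their first argument, for now we can treat it as a field that holds general-purpose data for those functions.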

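The data itself is stored in the ht array, which has two elements of type dictht. Why two? One holds the real key-value pairs, and the other is used during rehashing.

What is the integer rehashidx for? It tracks the progress of a rehash. If the dictionary is not being rehashed, its value is -1.

The integer iterators records the number of iterators currently in use on this dictionary.

From the key-value structure, to the array of key-value entries (table), to the dictionary itself, the implementation path is now clear. Looking at these definitions, three key concepts are still unexplained: hashing, collisions, and rehashing.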

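What is hashing?

A simple example: suppose we want to add a key-value pair k1-v1 to a dictionary dict. As shown above, the data is really stored in the dict's ht array, whose elements are dictht structures, and each dictht in turn wraps an array (table). The most commonly used property of an array is its index, so to add the pair we have to work out which index of that array it belongs at.

Following that description, adding a key-value pair to a dictionary goes through these steps:

1. Use the hashFunction in this dict's type to compute the hash value of the key: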

keyHashValue=dict->type->hashFunction(k1);

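2. As mentioned earlier, a hash table has two important fields: size (the number of hash buckets) and sizemask (always equal to size-1). Combining sizemask with the hash value from step 1 gives the array index:

index=keyHashValue&dict->ht[0].sizemask;//we assume the data is stored in the first hash table of ht

As these two steps show, performance and the distribution of the data depend mainly on the hash function you bind.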

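What is a hash collision?

Why do hash collisions exist? From the steps for adding a new key-value pair, it is very likely that different keys will end up with the same array index. When that happens, we say there is a hash collision. How does Redis solve this? The answer is the next pointer defined in dictEntry, mentioned earlier. Through this pointer, different key-value pairs that map to the same index form a linked list. Notice that this list has no separate head and tail pointers, so for performance a new key-value pair that collides is placed at the front of the list, which keeps insertion at O(1).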

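What is rehashing?

Before discussing rehashing we need to understand the load factor. The load factor is the number of nodes already stored in the hash table (N) divided by the number of buckets the table has (M), that is, N/M. This ratio tells you how full the table is. Because each bucket can hold a chain of entries, N can exceed M, so the load factor can be greater than 1.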

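With the load factor in mind, it is easier to see why rehashing exists. Operations on a dictionary make it hold more and more, or fewer and fewer, key-value pairs, so the load factor can swing widely. To keep it within an acceptable range, we need to rehash. How?

When certain conditions are met (they are covered in a later chapter), the program triggers a rehash. The steps are:

1. Allocate space for the dictionary's ht[1]. Its size is the first power of two that is greater than or equal to ht[0].used*2. (For example, if used=4 then 4*2=8, and 8 is exactly 2^3. If used=5 then 5*2=10, and the smallest power of two that is at least 10 is 2^4=16, so ht[1] has size 16, and so on.)

2. Recompute the hash of every key-value pair in ht[0] and place it in ht[1].

3. Once all key-value pairs in ht[0] have been moved to ht[1], free ht[0]'s table, make ht[1] the new ht[0], and reset ht[1] to an empty hash table, ready for the next rehash.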

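But there is a problem here: when ht[0] holds a huge number of key-value pairs, would Redis stop responding and do nothing but rehash? If it did, Redis would be of little use. Redis therefore uses incremental rehashing. How does it work? The key is the dict->rehashidx counter.

1. Allocate space for ht[1]; the dict now holds both ht[0] and ht[1].

2. While the rehash is in progress, rehashidx holds the index of the ht[0] bucket to be migrated next, starting from 0.

3. Move the key-value pairs of ht[0] to ht[1] one bucket at a time; when the rehash completes, rehashidx is set back to -1.

As a result, while a rehash is in progress, operations have to take both hash tables into account: lookups and deletions check ht[0] and then ht[1], while new keys are added only to ht[1].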

That covers the overall picture. Below are the commonly used API functions.

// create a new dictionary
dict *dictCreate(dictType *type,
        void *privDataPtr)
{
    dict *d = zmalloc(sizeof(*d));

    _dictInit(d,type,privDataPtr);

    return d;
}

The function above uses a private function, _dictInit, defined as follows:

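// initialize a dictionary
int _dictInit(dict *d, dictType *type,
        void *privDataPtr)
{
    // reset both tables; as _dictReset shows, no space is allocated here
    _dictReset(&d->ht[0]);
    _dictReset(&d->ht[1]);

    // set the type-specific functions
    d->type = type;

    // set the private data
    d->privdata = privDataPtr;

    // set the rehash state of the hash tables
    d->rehashidx = -1;

    // set the number of safe iterators on the dictionary
    d->iterators = 0;

    return DICT_OK;
}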

It in turn uses the private function _dictReset:

static void _dictReset(dictht *ht)
{
    ht->table = NULL;
    ht->size = 0;
    ht->sizemask = 0;
    ht->used = 0;
}

// add a new key-value pair
int dictAdd(dict *d, void *key, void *val)
{

    dictEntry *entry = dictAddRaw(d,key);

    // the key already exists
    if (!entry) return DICT_ERR;

    // the key did not exist, so set its value
    dictSetVal(d, entry, val);

    // added successfully
    return DICT_OK;
}

dictEntry *dictAddRaw(dict *d, void *key)
{
    int index;
    dictEntry *entry;
    dictht *ht;

    // if the dict is being rehashed, perform a single rehash step
    if (dictIsRehashing(d)) _dictRehashStep(d);

    /* Get the index of the new element, or -1 if
     * the element already exists. */
    // compute the index of the key in the hash table
    // a value of -1 means the key already exists
    if ((index = _dictKeyIndex(d, key)) == -1)
        return NULL;

    /* Allocate the memory and store the new entry */
    // if the dictionary is being rehashed, add the new key to hash table 1
    // otherwise, add it to hash table 0
    ht = dictIsRehashing(d) ? &d->ht[1] : &d->ht[0];
    // allocate space for the new node
    entry = zmalloc(sizeof(*entry));
    // insert the new node at the head of the chain
    entry->next = ht->table[index];
    ht->table[index] = entry;
    // update the number of nodes stored in the hash table
    ht->used++;

    /* Set the hash entry fields. */
    // set the key of the new node
    dictSetKey(d, entry, key);

    return entry;
}

static void _dictRehashStep(dict *d) {
    if (d->iterators == 0) dictRehash(d,1);
}

int dictRehash(dict *d, int n) {
    // note: this is not thread-safe
    // return immediately if the dict is not being rehashed
    if (!dictIsRehashing(d)) return 0;

    // perform n migration steps
    while(n--) {
        dictEntry *de, *nextde;

        /* Check if we already rehashed the whole table... */
        // if hash table 0 is empty, the rehash is finished
        if (d->ht[0].used == 0) {
            // free hash table 0
            zfree(d->ht[0].table);
            // make the old hash table 1 the new hash table 0
            d->ht[0] = d->ht[1];
            // reset the old hash table 1
            _dictReset(&d->ht[1]);
            // clear the rehash flag
            d->rehashidx = -1;
            // the rehash is complete
            return 0;
        }

        /* Note that rehashidx can't overflow as we are sure there are more
         * elements because ht[0].used != 0 */
        // make sure rehashidx is within bounds
        assert(d->ht[0].size > (unsigned)d->rehashidx);

        // skip empty slots in the array and find the next non-empty index
        while(d->ht[0].table[d->rehashidx] == NULL) d->rehashidx++;

        // point to the head node of the chain at that index
        de = d->ht[0].table[d->rehashidx];
        /* Move all the keys in this bucket from the old to the new hash HT */
        // migrate every node in the chain to the new hash table
        while(de) {
            unsigned int h;

            // save the pointer to the next node
            nextde = de->next;

            /* Get the index in the new hash table */
            // compute the hash value and the index at which to insert the node in the new table
            h = dictHashKey(d, de->key) & d->ht[1].sizemask;

            // insert the node into the new hash table
            de->next = d->ht[1].table[h];
            d->ht[1].table[h] = de;

            // update the counters
            d->ht[0].used--;
            d->ht[1].used++;

            // move on to the next node
            de = nextde;
        }
        // clear the pointer at the index that has just been migrated
        d->ht[0].table[d->rehashidx] = NULL;
        // advance the rehash index
        d->rehashidx++;
    }

    return 1;
}

dictEntry *dictFind(dict *d, const void *key)
{
    dictEntry *he;
    unsigned int h, idx, table;

    // the dictionary is empty, return NULL directly
    if (d->ht[0].size == 0) return NULL; /* We don't have a table at all */

    // if the dict is being rehashed, perform a single rehash step
    if (dictIsRehashing(d)) _dictRehashStep(d);

    // compute the hash value of the key
    h = dictHashKey(d, key);
    // look the key up in the dictionary's hash tables; there are two of them
    for (table = 0; table <= 1; table++) {

        // compute the index
        idx = h & d->ht[table].sizemask;

        // walk every node of the chain at that index, looking for key
        he = d->ht[table].table[idx];
        while(he) {
            // return as soon as it is found
            if (dictCompareKeys(d, key, he->key))
                return he;

            he = he->next;
        }
        // not found in this table: if the dict is being rehashed, go on to search
        // the other hash table; otherwise return NULL right away
        if (!dictIsRehashing(d)) return NULL;
    }

    // reaching this point means the key is in neither hash table
    return NULL;
}

// get the value associated with the given key in the dict
void *dictFetchValue(dict *d, const void *key) {
    dictEntry *he;

    he = dictFind(d,key);

    return he ? dictGetVal(he) : NULL;
}

Insertion and lookup are covered above; deletion and update follow.

static int dictGenericDelete(dict *d, const void *key, int nofree)
{
    unsigned int h, idx;
    dictEntry *he, *prevHe;
    int table;

    // if the dict is empty, report a deletion error
    if (d->ht[0].size == 0) return DICT_ERR; /* d->ht[0].table is NULL */

    // perform a single rehash step
    if (dictIsRehashing(d)) _dictRehashStep(d);

    // compute the hash value
    h = dictHashKey(d, key);

    // iterate over the hash tables
    for (table = 0; table <= 1; table++) {

        // compute the index
        idx = h & d->ht[table].sizemask;
        // point to the chain at that index (it may hold several nodes)
        he = d->ht[table].table[idx];
        prevHe = NULL;
        // walk every node of the chain
        while(he) {

            if (dictCompareKeys(d, key, he->key)) {
                // found the target node

                /* Unlink the element from the list */
                // remove it from the chain
                if (prevHe)
                    prevHe->next = he->next;
                else
                    d->ht[table].table[idx] = he->next;

                // call the key and value destructors unless nofree is set
                if (!nofree) {
                    dictFreeKey(d, he);
                    dictFreeVal(d, he);
                }

                // free the node itself
                zfree(he);

                // update the number of stored nodes; used counts entries, not
                // occupied buckets, so removing one node from a chain lowers it by one
                d->ht[table].used--;

                // signal that the key was found and deleted
                return DICT_OK;
            }

            prevHe = he;
            he = he->next;
        }

        // reaching this point means the key was not found in hash table 0;
        // whether hash table 1 is searched depends on whether a rehash is in progress
        if (!dictIsRehashing(d)) break;
    }

    // not found
    return DICT_ERR; /* not found */
}

int dictDelete(dict *ht, const void *key) {
    return dictGenericDelete(ht,key,0);// call the key and value destructors
}

int dictDeleteNoFree(dict *ht, const void *key) {
    return dictGenericDelete(ht,key,1);// do not call the destructors
}

int dictReplace(dict *d, void *key, void *val)
{
    dictEntry *entry, auxentry;

    /* Try to add the element. If the key
     * does not exist dictAdd will succeed. */
    // try to add the key-value pair to the dictionary directly
    // if key does not exist yet, the add succeeds
    if (dictAdd(d, key, val) == DICT_OK)
        return 1;

    /* It already exists, get the entry */
    // reaching this point means key already exists, so find the node that holds it
    entry = dictFind(d, key);
    /* Set the new value and free the old one. Note that it is important
     * to do that in this order, as the value may just be exactly the same
     * as the previous one. In this context, think to reference counting,
     * you want to increment (set), and then decrement (free), and not the
     * reverse. */
    // first save a copy of the entry, which still holds the old value
    auxentry = *entry;
    // then set the new value
    dictSetVal(d, entry, val);
    // then free the old value
    dictFreeVal(d, &auxentry);

    return 0;
}

When learning Java's collection classes, one of the tools we use most often is the iterator. Redis's dict implements iterators too, in safe and unsafe variants.

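typedef struct dictIterator {

    // the dictionary being iterated
    dict *d;

    // table: number of the hash table being iterated, either 0 or 1
    // index: index in the hash table that the iterator currently points to
    // safe: whether the iterator is safe; 1 means safe, anything else unsafe
    int table, index, safe;

    // entry: pointer to the node the iteration has currently reached
    // nextEntry: the node after the current one; while a safe iterator is running,
    // the node entry points to may be modified or deleted, so an extra pointer keeps
    // the position of the next node and prevents the iterator from losing its place
    dictEntry *entry, *nextEntry;

    long long fingerprint; /* unsafe iterator fingerprint for misuse detection */
} dictIterator;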

// create an unsafe iterator
dictIterator *dictGetIterator(dict *d)
{
    dictIterator *iter = zmalloc(sizeof(*iter));

    iter->d = d;
    iter->table = 0;
    iter->index = -1;
    iter->safe = 0;
    iter->entry = NULL;
    iter->nextEntry = NULL;

    return iter;
}

// create a safe iterator
dictIterator *dictGetSafeIterator(dict *d) {
    dictIterator *i = dictGetIterator(d);

    i->safe = 1;

    return i;
}

That's all. This section ran a bit long, so thanks for bearing with me. If you have questions, contact me on QQ: 359311095