HashMap: Principles and Implementation

Date: 2019-11-13


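I used to start from Java's collection interfaces and ended up completely lost. Recently I read the online article "How HashMap Works and Its Implementation", and only then did I begin to understand how HashMap works.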

1. Overview

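From this article you will learn:

When would you use a HashMap? What are its characteristics?

Do you know how HashMap works?

Do you know how get and put work? What roles do equals() and hashCode() play?

Do you know how hash is implemented? Why is it implemented this way?

What happens when the size of a HashMap exceeds the capacity defined by the load factor?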

When we run the following (the keys are Chinese school-subject names):

HashMap<String, Integer> map = new HashMap<String, Integer>();
map.put("语文", 1);
map.put("数学", 2);
map.put("英语", 3);
map.put("历史", 4);
map.put("政治", 5);
map.put("地理", 6);
map.put("生物", 7);
map.put("化学", 8);
for(Entry<String, Integer> entry : map.entrySet()) {
	System.out.println(entry.getKey() + ": " + entry.getValue());
}

The output is:

政治: 5
生物: 7
历史: 4
数学: 2
化学: 8
语文: 1
英语: 3
地理: 6

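What happened? The rough structure below should give an intuitive feel for how a HashMap is laid out:

[Figure: rough structure of a HashMap, an array of buckets with nodes chained off each bucket]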

The official documentation describes HashMap as follows:

Hash table based implementation of the Map interface. This implementation provides all of the optional map operations, and permits null values and the null key. (The HashMap class is roughly equivalent to Hashtable, except that it is unsynchronized and permits nulls.) This class makes no guarantees as to the order of the map; in particular, it does not guarantee that the order will remain constant over time.

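A few key points: it implements the Map interface, permits null keys and values, is unsynchronized, does not guarantee any ordering (for example insertion order), and does not guarantee that the order stays the same over time.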
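A rough sketch of why the printed order differs from insertion order: iteration walks the bucket array by index, so the order follows each key's bucket rather than the order of the put calls. The class BucketOrderSketch, its helper names, and the English keys are assumptions made for this illustration. The prediction relies on OpenJDK 8+ internals (a default table of 16 buckets and no resize for 8 entries), which the documented contract does not promise.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BucketOrderSketch {
    // Same bit-spreading step as the hash() function quoted later in this article.
    static int spread(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Bucket index for a table of length n (n must be a power of two).
    static int indexFor(Object key, int n) {
        return (n - 1) & spread(key);
    }

    // Predicted iteration order: ascending bucket index,
    // insertion order within one bucket (List.sort is stable).
    static List<String> predictedOrder(List<String> keys, int n) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(Comparator.comparingInt(k -> indexFor(k, n)));
        return sorted;
    }

    public static void main(String[] args) {
        // Hypothetical English stand-ins for the subject names used above.
        List<String> keys = Arrays.asList("Chinese", "Math", "English", "History",
                                          "Politics", "Geography", "Biology", "Chemistry");
        Map<String, Integer> map = new HashMap<>();
        for (int i = 0; i < keys.size(); i++) {
            map.put(keys.get(i), i + 1);
        }
        System.out.println("predicted: " + predictedOrder(keys, 16));
        System.out.println("actual:    " + new ArrayList<>(map.keySet()));
    }
}
```

If the two printed lists match, that is a property of this particular implementation only; application code should never depend on it.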

2. Two Important Parameters

HashMap has two important parameters: capacity and load factor.

Initial capacity: The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created.

Load factor: The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased.

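Put simply, capacity is the number of buckets, and load factor is the maximum fraction of the table that may fill up before it grows. If iteration performance matters, do not set the capacity too high or the load factor too low. When the number of entries in the map exceeds capacity * load factor, the bucket array is resized to twice its current size.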
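As a small worked example of the capacity * load factor rule, the sketch below prints the resize threshold for the default settings (capacity 16, load factor 0.75) and for the next two doublings. ThresholdSketch and its threshold method are names invented for this illustration, not part of the JDK.

```java
class ThresholdSketch {
    // Entry count above which a table of the given capacity is resized.
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        int capacity = 16;         // default initial capacity
        float loadFactor = 0.75f;  // default load factor
        for (int i = 0; i < 3; i++) {
            System.out.println("capacity " + capacity
                    + " -> resize when size > " + threshold(capacity, loadFactor));
            capacity <<= 1;        // each resize doubles the table
        }
        // prints thresholds 12, 24 and 48 for capacities 16, 32 and 64
    }
}
```

With the defaults, the 13th insertion is the one that triggers the first resize.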

3. Implementation of put

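The put function roughly works as follows:

Hash the key's hashCode(), then compute the index;
If there is no collision, place the node directly in the bucket;
If there is a collision, append the node to the bucket's linked list;
If collisions make the list too long (length greater than or equal to TREEIFY_THRESHOLD), convert the list to a red-black tree;
If the key already exists, replace the old value (this keeps keys unique);
If the table is full (size exceeds load factor * current capacity), resize.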
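The steps above can be sketched as a toy chained map. This is a simplified illustration, not the JDK code: TinyChainedMap and its fixed 16-slot table are assumptions made for this example, and treeification and resizing are deliberately left out. The keys "Aa" and "BB" are used because they have the same hashCode() (2112), so they land in the same bucket.

```java
class TinyChainedMap<K, V> {
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value) {
            this.hash = hash;
            this.key = key;
            this.value = value;
        }
    }

    // Fixed-size table: this sketch never resizes.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];
    private int size;

    private static int hash(Object key) {
        int h;
        return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
    }

    // Returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int h = hash(key);
        int i = (table.length - 1) & h;          // step 1: hash, then index
        for (Node<K, V> e = table[i]; e != null; e = e.next) {
            if (e.hash == h && (e.key == key || (key != null && key.equals(e.key)))) {
                V old = e.value;                 // key exists: replace the old value
                e.value = value;
                return old;
            }
            if (e.next == null) {                // collision: append at the tail
                e.next = new Node<>(h, key, value);
                size++;
                return null;
            }
        }
        table[i] = new Node<>(h, key, value);    // no collision: bucket was empty
        size++;
        return null;
    }

    int size() {
        return size;
    }

    public static void main(String[] args) {
        TinyChainedMap<String, Integer> map = new TinyChainedMap<>();
        System.out.println(map.put("Aa", 1)); // null: empty bucket
        System.out.println(map.put("BB", 2)); // null: same bucket, appended to the chain
        System.out.println(map.put("Aa", 3)); // 1: existing key, value replaced
        System.out.println(map.size());       // 2
    }
}
```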

The code is as follows:

public V put(K key, V value) {
    // hash the key's hashCode()
    return putVal(hash(key), key, value, false, true);
}

final V putVal(int hash, K key, V value, boolean onlyIfAbsent,
               boolean evict) {
    Node<K,V>[] tab; Node<K,V> p; int n, i;
    // create the table if it is empty
    if ((tab = table) == null || (n = tab.length) == 0)
        n = (tab = resize()).length;
    // compute the index; if the bucket is empty, insert directly
    if ((p = tab[i = (n - 1) & hash]) == null)
        tab[i] = newNode(hash, key, value, null);
    else {
        Node<K,V> e; K k;
        // the first node already holds this key
        if (p.hash == hash &&
            ((k = p.key) == key || (key != null && key.equals(k))))
            e = p;
        // the bin is a tree
        else if (p instanceof TreeNode)
            e = ((TreeNode<K,V>)p).putTreeVal(this, tab, hash, key, value);
        // the bin is a linked list
        else {
            for (int binCount = 0; ; ++binCount) {
                if ((e = p.next) == null) {
                    p.next = newNode(hash, key, value, null);
                    if (binCount >= TREEIFY_THRESHOLD - 1) // -1 for 1st
                        treeifyBin(tab, hash);
                    break;
                }
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    break;
                p = e;
            }
        }
        // write the value
        if (e != null) { // existing mapping for key
            V oldValue = e.value;
            if (!onlyIfAbsent || oldValue == null)
                e.value = value;
            afterNodeAccess(e);
            return oldValue;
        }
    }
    ++modCount;
    // resize once size exceeds load factor * current capacity
    if (++size > threshold)
        resize();
    afterNodeInsertion(evict);
    return null;
}

4. Implementation of get

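Once put is understood, get is simple. The rough idea:

If the first node in the bucket matches, it is a direct hit;
If there is a collision, look up the entry with key.equals(k):
in a tree, search the tree using key.equals(k), O(log n);
in a linked list, search the list using key.equals(k), O(n).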
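Since get first locates the bucket through hashCode() and then confirms the match through equals(), a key class needs both methods to agree. The sketch below illustrates this with two hypothetical key classes, GoodPoint and BadPoint, invented for this example; only java.util.HashMap and java.util.Objects are real JDK APIs.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class KeyContractSketch {
    // Overrides both hashCode() and equals(): an equal point finds the
    // same bucket and then matches the stored key inside it.
    static final class GoodPoint {
        final int x, y;
        GoodPoint(int x, int y) { this.x = x; this.y = y; }
        @Override public int hashCode() { return Objects.hash(x, y); }
        @Override public boolean equals(Object o) {
            return o instanceof GoodPoint && ((GoodPoint) o).x == x && ((GoodPoint) o).y == y;
        }
    }

    // Overrides only hashCode(): the bucket is found, but equals()
    // falls back to reference identity, so the lookup misses.
    static final class BadPoint {
        final int x, y;
        BadPoint(int x, int y) { this.x = x; this.y = y; }
        @Override public int hashCode() { return Objects.hash(x, y); }
    }

    static Integer lookupGood() {
        Map<GoodPoint, Integer> map = new HashMap<>();
        map.put(new GoodPoint(1, 2), 42);
        return map.get(new GoodPoint(1, 2));
    }

    static Integer lookupBad() {
        Map<BadPoint, Integer> map = new HashMap<>();
        map.put(new BadPoint(1, 2), 42);
        return map.get(new BadPoint(1, 2));
    }

    public static void main(String[] args) {
        System.out.println(lookupGood()); // 42
        System.out.println(lookupBad());  // null
    }
}
```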

The code is as follows:

public V get(Object key) {
    Node<K,V> e;
    return (e = getNode(hash(key), key)) == null ? null : e.value;
}

final Node<K,V> getNode(int hash, Object key) {
    Node<K,V>[] tab; Node<K,V> first, e; int n; K k;
    if ((tab = table) != null && (n = tab.length) > 0 &&
        (first = tab[(n - 1) & hash]) != null) {
        // direct hit
        if (first.hash == hash && // always check first node
            ((k = first.key) == key || (key != null && key.equals(k))))
            return first;
        // miss on the first node
        if ((e = first.next) != null) {
            // look up in the tree
            if (first instanceof TreeNode)
                return ((TreeNode<K,V>)first).getTreeNode(hash, key);
            // look up in the linked list
            do {
                if (e.hash == hash &&
                    ((k = e.key) == key || (key != null && key.equals(k))))
                    return e;
            } while ((e = e.next) != null);
        }
    }
    return null;
}

5. Implementation of the hash function

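In both get and put, computing the index first applies a hash step to hashCode() and then derives the index from the resulting hash value, as shown in the figure below:

[Figure: computing the bucket index from hashCode()]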

The hash of hashCode() is computed like this:

static final int hash(Object key) {
    int h;
    return (key == null) ? 0 : (h = key.hashCode()) ^ (h >>> 16);
}

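As you can see, the function roughly does this: the high 16 bits are left unchanged, and the low 16 bits are XORed with the high 16 bits. The code comment reads: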

Computes key.hashCode() and spreads (XORs) higher bits of hash to lower. Because the table uses power-of-two masking, sets of hashes that vary only in bits above the current mask will always collide. (Among known examples are sets of Float keys holding consecutive whole numbers in small tables.) So we apply a transform that spreads the impact of higher bits downward. There is a tradeoff between speed, utility, and quality of bit-spreading. Because many common sets of hashes are already reasonably distributed (so don’t benefit from spreading), and because we use trees to handle large sets of collisions in bins, we just XOR some shifted bits in the cheapest possible way to reduce systematic lossage, as well as to incorporate impact of the highest bits that would otherwise never be used in index calculations because of table bounds.

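When the hash function was designed, the table length n was already a power of two, and the index is computed like this (using the bitwise & operation rather than the % remainder):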

(n - 1) & hash

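The designers considered this prone to collisions. Why? Consider n - 1 = 15 (binary 1111): only the low 4 bits of the hash actually take effect, so collisions are of course likely.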

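So the designers came up with a balanced approach (weighing speed, utility, and quality): XOR the high 16 bits into the low 16 bits. They also explain that most hashCode distributions are already quite good, and that when collisions do occur they are handled with an O(log n) tree. A single XOR keeps the overhead low and also avoids collisions caused by the high bits never taking part in the index computation (when the table is small).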
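To make the effect concrete, the sketch below takes three made-up hash codes that differ only above the low 16 bits and computes their index in a 16-slot table, with and without the XOR step. SpreadSketch and its method names are assumptions for this illustration; the spread expression itself is the one from hash() above.

```java
class SpreadSketch {
    // The bit-spreading step from hash(): fold the high 16 bits into the low 16.
    static int spread(int h) {
        return h ^ (h >>> 16);
    }

    // Index in a table of length n (a power of two).
    static int index(int hash, int n) {
        return (n - 1) & hash;
    }

    public static void main(String[] args) {
        int n = 16;
        int[] hashCodes = { 0x10000, 0x20000, 0x30000 }; // low 16 bits are all zero
        for (int h : hashCodes) {
            System.out.println(Integer.toHexString(h)
                    + ": raw index " + index(h, n)
                    + ", spread index " + index(spread(h), n));
        }
        // 10000: raw index 0, spread index 1
        // 20000: raw index 0, spread index 2
        // 30000: raw index 0, spread index 3
    }
}
```

Without spreading, all three values collide in bucket 0; with it they land in three different buckets. The XOR is not a cure-all, though: hash codes whose high bits also agree in their low 4 bits would still collide in a table this small.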

What if frequent collisions still occur? The authors' comment says they use trees to handle them (we use trees to handle large sets of collisions in bins). JEP 180 describes the problem:

Improve the performance of java.util.HashMap under high hash-collision conditions by using balanced trees rather than linked lists to store map entries. Implement the same improvement in the LinkedHashMap class.

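As mentioned earlier, retrieving an element from a HashMap takes two basic steps:

First hash the hashCode() and determine the bucket index;
If the key of the bucket's node is not the one we want, search the chain using key.equals().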

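Before Java 8, collisions were resolved with linked lists. When collisions occur, the two steps of a get cost O(1) + O(n). So when collisions are severe and n is large, the O(n) scan clearly hurts performance.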

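Java 8 therefore replaces the linked list with a red-black tree, bringing the cost down to O(1) + O(log n), which handles large n much better. The article "Java 8: HashMap Performance Improvement" reports performance test results.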

6. Implementation of resize

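During put, if the table's occupancy has exceeded the ratio set by the load factor, a resize occurs. Put simply, resize doubles the bucket array, recomputes each index, and moves the nodes into the new buckets. The comment on resize describes it like this: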

Initializes or doubles table size. If null, allocates in accord with initial capacity target held in field threshold. Otherwise, because we are using power-of-two expansion, the elements from each bin must either stay at same index, or move with a power of two offset in the new table.

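Roughly: when the limit is exceeded the table is resized, and because it uses power-of-two expansion (the length doubles), each element either stays at its original index or moves by a power-of-two offset from it.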

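How should we understand this? For example, when expanding from 16 to 32, the change looks like this:

[Figure: index computation for the same hashes with n = 16 and n = 32]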

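So when the index is recomputed, because n doubles, the mask n - 1 gains one more bit at the high end (marked in red in the original figure), and the new index changes like this:

[Figure: new index equals the old index, or the old index + oldCap]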

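So when expanding a HashMap we do not need to recompute the hash; we only need to check whether the newly significant bit of the original hash is 1 or 0. If it is 0 the index is unchanged; if it is 1 the index becomes "old index + oldCap". The figure below illustrates a resize from 16 to 32:

[Figure: resize from 16 to 32 buckets]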

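This design is quite clever: it saves the time of recomputing hash values, and because the newly significant bit can be treated as random (0 or 1), resize spreads the previously colliding nodes evenly across the new buckets.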
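The "old index or old index + oldCap" rule can be checked with a few lines of arithmetic. In the sketch below, ResizeIndexSketch and newIndex are names invented for this illustration; the four sample hashes are chosen so that they all share index 5 in a 16-slot table.

```java
class ResizeIndexSketch {
    // Index after doubling, derived only from the old index and one extra hash bit.
    static int newIndex(int hash, int oldCap) {
        int oldIndex = hash & (oldCap - 1);
        return (hash & oldCap) == 0 ? oldIndex : oldIndex + oldCap;
    }

    public static void main(String[] args) {
        int oldCap = 16;
        int[] hashes = { 5, 21, 37, 53 }; // all map to index 5 when n = 16
        for (int h : hashes) {
            System.out.println("hash " + h
                    + ": old index " + (h & (oldCap - 1))
                    + ", new index " + newIndex(h, oldCap)
                    + ", direct " + (h & (2 * oldCap - 1)));
        }
        // hash 5:  old index 5, new index 5,  direct 5
        // hash 21: old index 5, new index 21, direct 21
        // hash 37: old index 5, new index 5,  direct 5
        // hash 53: old index 5, new index 21, direct 21
    }
}
```

The "direct" column recomputes the index against the doubled table and always agrees with the shortcut, which is why the chain in bucket 5 splits into exactly two chains, at 5 and at 21.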

The code is as follows:

final Node<K,V>[] resize() {
    Node<K,V>[] oldTab = table;
    int oldCap = (oldTab == null) ? 0 : oldTab.length;
    int oldThr = threshold;
    int newCap, newThr = 0;
    if (oldCap > 0) {
        // already at the maximum: stop growing and just live with the collisions
        if (oldCap >= MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return oldTab;
        }
        // below the maximum: double the capacity
        else if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
                 oldCap >= DEFAULT_INITIAL_CAPACITY)
            newThr = oldThr << 1; // double threshold
    }
    else if (oldThr > 0) // initial capacity was placed in threshold
        newCap = oldThr;
    else {               // zero initial threshold signifies using defaults
        newCap = DEFAULT_INITIAL_CAPACITY;
        newThr = (int)(DEFAULT_LOAD_FACTOR * DEFAULT_INITIAL_CAPACITY);
    }
    // compute the new resize threshold
    if (newThr == 0) {

        float ft = (float)newCap * loadFactor;
        newThr = (newCap < MAXIMUM_CAPACITY && ft < (float)MAXIMUM_CAPACITY ?
                  (int)ft : Integer.MAX_VALUE);
    }
    threshold = newThr;
    @SuppressWarnings({"rawtypes","unchecked"})
        Node<K,V>[] newTab = (Node<K,V>[])new Node[newCap];
    table = newTab;
    if (oldTab != null) {
        // move every bucket into the new table
        for (int j = 0; j < oldCap; ++j) {
            Node<K,V> e;
            if ((e = oldTab[j]) != null) {
                oldTab[j] = null;
                if (e.next == null)
                    newTab[e.hash & (newCap - 1)] = e;
                else if (e instanceof TreeNode)
                    ((TreeNode<K,V>)e).split(this, newTab, j, oldCap);
                else { // preserve order
                    Node<K,V> loHead = null, loTail = null;
                    Node<K,V> hiHead = null, hiTail = null;
                    Node<K,V> next;
                    do {
                        next = e.next;
                        // stays at the original index
                        if ((e.hash & oldCap) == 0) {
                            if (loTail == null)
                                loHead = e;
                            else
                                loTail.next = e;
                            loTail = e;
                        }
                        // moves to original index + oldCap
                        else {
                            if (hiTail == null)
                                hiHead = e;
                            else
                                hiTail.next = e;
                            hiTail = e;
                        }
                    } while ((e = next) != null);
                    // put the low list at the original index
                    if (loTail != null) {
                        loTail.next = null;
                        newTab[j] = loHead;
                    }
                    // put the high list at original index + oldCap
                    if (hiTail != null) {
                        hiTail.next = null;
                        newTab[j + oldCap] = hiHead;
                    }
                }
            }
        }
    }
    return newTab;
}

7. Summary

We can now answer the opening questions and deepen our understanding of HashMap:

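1. When would you use a HashMap? What are its characteristics?
It is an implementation of the Map interface for storing key-value pairs. It accepts null keys and values and is unsynchronized. A HashMap stores Entry(hash, key, value, next) objects.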

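2. Do you know how HashMap works?
It stores and retrieves objects through hashing, via put and get. To store, we pass K/V to put, which calls hashCode() to compute the hash and find the bucket, then stores the entry; HashMap adjusts its capacity automatically based on occupancy (once the load factor is exceeded, it resizes to twice the size). To retrieve, we pass K to get, which calls hashCode() to compute the hash and find the bucket, then calls equals() to identify the entry. On a collision, HashMap chains the colliding elements in a linked list; in Java 8, if the colliding elements in one bucket exceed a limit (8 by default), a red-black tree replaces the list to speed up lookups.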

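3. Do you know how get and put work? What roles do equals() and hashCode() play?
Hashing the key's hashCode() and computing the index ((n - 1) & hash) gives the bucket position. If a collision occurs, key.equals() is used to find the matching node in the list or tree.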

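4. Do you know how hash is implemented? Why is it implemented this way?
In the Java 1.8 implementation, it XORs the high 16 bits of hashCode() into the low 16 bits: (h = k.hashCode()) ^ (h >>> 16). This balances speed, utility, and quality: even when the table length n is small, both the high and low bits take part in the index computation, without much overhead.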

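5. What happens when the size of a HashMap exceeds the capacity defined by the load factor?
Once the load factor (0.75 by default) is exceeded, the table is resized to twice its original length and the entries are redistributed into the new buckets. As section 6 shows, the hash stored in each node is reused rather than recomputed.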

The "Java collections cheat sheet" describes it like this:

A hash bucket array implemented as an Entry[] array; taking the key's hash value modulo the array size gives the array index.

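On insertion, if two keys fall into the same bucket (for example, hash values 1 and 17 both map to bucket 1 after modulo 16), Entry uses a next field to store multiple entries as a singly linked list, and the entry that enters the bucket later points its next at the bucket's current entry. (This describes head insertion; the Java 8 putVal shown in section 3 appends new nodes at the tail instead.)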

To look up a key with hash value 17, first locate bucket 1, then traverse all elements in the bucket's list, comparing keys one by one.

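When the number of entries reaches 75% of the number of buckets (many articles say it is when 75% of the buckets are in use, but the code says otherwise), the bucket array is doubled and all existing entries are redistributed, so it is best to supply an estimated size up front.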

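The modulo is done with a bit operation (hash & (arrayLength - 1)), which is faster, so the array size is always a power of two; an arbitrary initial value such as 17 is rounded up to 32. The default initial size, applied when the first element is inserted, is 16.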

iterator() walks along the hash bucket array, so the order looks random.

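JDK 8 adds a threshold, 8 by default: when the entries in one bucket exceed it, they are stored in a red-black tree instead of a singly linked list to speed up key lookup.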
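Two of the points above lend themselves to a quick check: rounding a requested capacity up to a power of two, and the equivalence of masking and modulo for such sizes. The sketch below is a stand-in written for this article, not the JDK's internal rounding routine; PowerOfTwoSketch and roundUpToPowerOfTwo are hypothetical names, and the method is only meant for capacities between 1 and 2^30.

```java
class PowerOfTwoSketch {
    // Smallest power of two >= cap, for 1 <= cap <= 2^30.
    static int roundUpToPowerOfTwo(int cap) {
        return cap <= 1 ? 1 : Integer.highestOneBit(cap - 1) << 1;
    }

    public static void main(String[] args) {
        System.out.println(roundUpToPowerOfTwo(17)); // 32
        System.out.println(roundUpToPowerOfTwo(16)); // 16

        int n = 16;
        int hash = -123456789;
        // For a power-of-two n, masking yields the same value as the
        // non-negative remainder, even for negative hash values.
        System.out.println(((n - 1) & hash) == Math.floorMod(hash, n)); // true
    }
}
```

Note that the plain % operator can return a negative result for a negative hash, which is one more reason the mask is the convenient choice for an array index.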

References

How HashMap Works
Java 8: HashMap Performance Improvement
JEP 180: Handle Frequent HashMap Collisions with Balanced Trees
Differences Between ConcurrentHashMap and Hashtable
Differences Between HashMap and Hashtable