Java Basics Series (3): HashMap in Depth

Date: 2019-11-17


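This post analyzes how HashMap works. Why pick this topic? The reason is simple: when I interviewed candidates, I would occasionally ask about HashMap. 99% of programmers know HashMap and nearly all of them can use it, from fresh graduates to programmers with 5 or even 10 years of experience. But HashMap involves far more than put and get. I hope this analysis at least helps candidates handle an interviewer's questions.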

1. A look back at my interviews

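Q: "You have used HashMap. Can you tell me about it?"

A: "Of course. HashMap is a <key, value> storage structure: you can quickly store data under a key with put and then get it back quickly with get." Then: "HashMap is not thread-safe; Hashtable is thread-safe, which it achieves with synchronized. HashMap lookups are very fast," and so on. At this point it is clear that he is comfortable using HashMap as a tool.

Q: "Do you know how HashMap works during put and get?"

A: "HashMap computes a hash value from the key and maps that hash value to the object's reference. On get, it computes the key's hash value first and then finds the object." By now he sounds less confident.

Q: "Why are strings so commonly used as HashMap keys? Can other objects, or custom objects, be used? Why?"

A: "I haven't looked into that. I usually just use String out of habit."

Q: "You mentioned that HashMap is not thread-safe. What do you understand by thread safety? What is the principle behind it? What are some ways to avoid thread-safety problems?"

A: "Thread safety means that when multiple threads access an object, the object may end up with unexpected results. Generally you need locking to be thread-safe."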

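In fact, after these questions I can pretty much judge this programmer's fundamentals: middling at best, and there is no need to ask anything further.

From my personal point of view, HashMap interview questions can probe a candidate's grasp of threading, the Java memory model, visibility and immutability, hash computation, linked lists, and binary operators such as &, |, <<, and >>. So a single HashMap is enough to test someone's technical foundation.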

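2. Concepts

2.1 HashMap class diagram

The class diagram here is drawn from JDK 1.6; see Figure 1.

Figure 1: HashMap class diagram (JDK 1.6)

2.2 HashMap storage structure

HashMap is so simple to use that the question naturally arises: how does it store things, and what does its storage structure look like? Many programmers don't know, yet if you step just one level into put and get you see what is really there. Put simply, HashMap's storage structure is built from an array and linked lists, as the figure shows:

In the figure, the Y axis is the array and the X axis is the linked lists. As you know, an array occupies contiguous memory and has a fixed size; once allocated, that memory cannot be taken by other references. Its strength is fast lookup: access by index is O(1), while insertion and deletion are slower, O(n). A linked list is stored non-contiguously and has no fixed size; its characteristics are the opposite: insertion and deletion are fast, lookup is slow. HashMap can be seen as a compromise between the two.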

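2.3 How HashMap works

(1) First check whether the key is null. If it is, look directly in the chain at Entry[0]. Otherwise compute the key's hashCode and then apply a second hash to it. The resulting hash value is an int.

(2) Use the hash value to find the array slot: reduce it by the length of Entry[] (the implementation uses a bitwise AND, h & (length - 1), which is equivalent to a modulo because the length is a power of two). The result is the index into the Entry array.

(3) Finding the array slot means finding the linked list stored there. Insertion, deletion, and lookup of values then proceed as ordinary linked-list operations.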
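The three steps above can be sketched as a tiny map of my own. This is a teaching sketch under stated assumptions, not the JDK source: the names TinyChainedMap, Node, hashOf, and indexOf are hypothetical, the table is fixed at 16 slots, and there is no resizing or removal.

```java
import java.util.Objects;

/** Hypothetical teaching sketch (not the JDK source): a fixed-size table of singly linked chains. */
class TinyChainedMap<K, V> {
    /** One link in a bucket's chain, mirroring the Entry fields: hash, key, value, next. */
    private static final class Node<K, V> {
        final int hash;
        final K key;
        V value;
        Node<K, V> next;

        Node(int hash, K key, V value, Node<K, V> next) {
            this.hash = hash;
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // The length must be a power of two so that (hash & (length - 1)) is a valid index.
    @SuppressWarnings("unchecked")
    private final Node<K, V>[] table = (Node<K, V>[]) new Node[16];

    // Step 1: a null key always maps to bucket 0; otherwise mix the hashCode a second time.
    private static int hashOf(Object key) {
        if (key == null) {
            return 0;
        }
        int h = key.hashCode();
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    // Step 2: reduce the hash to an array index.
    private int indexOf(int hash) {
        return hash & (table.length - 1);
    }

    // Step 3: walk the chain; replace the value on a match, otherwise prepend a new node.
    public V put(K key, V value) {
        int hash = hashOf(key);
        int i = indexOf(hash);
        for (Node<K, V> n = table[i]; n != null; n = n.next) {
            if (n.hash == hash && Objects.equals(n.key, key)) {
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        table[i] = new Node<>(hash, key, value, table[i]);
        return null;
    }

    public V get(Object key) {
        int hash = hashOf(key);
        for (Node<K, V> n = table[indexOf(hash)]; n != null; n = n.next) {
            if (n.hash == hash && Objects.equals(n.key, key)) {
                return n.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        TinyChainedMap<String, Integer> map = new TinyChainedMap<>();
        map.put("one", 1);
        map.put(null, 0);
        System.out.println(map.put("one", 11)); // the previous value, 1
        System.out.println(map.get("one"));     // 11
        System.out.println(map.get(null));      // 0
    }
}
```

The sketch keeps the hash in each node so that the chain walk can skip most non-matching nodes with a cheap int comparison before calling equals.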

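2.4 HashMap terminology

Variable	Term	Description
size	Size	Number of key-value mappings currently stored in the HashMap.
threshold	Threshold	When the size reaches the threshold, the table has to be resized (threshold = capacity * loadFactor).
loadFactor	Load factor	Load factor of the HashMap; 0.75 (75%) by default.
modCount	Modification count	Number of times the HashMap has been structurally modified (entries added or removed).
Entry	Entry	The object that actually stores a mapping; it consists of key, value, hash, and next.

2.5 HashMap initialization

Most people initialize a map by calling new HashMap(). Here I analyze the constructor new HashMap(int initialCapacity, float loadFactor). The code is as follows: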

public HashMap(int initialCapacity, float loadFactor) {
        // initialCapacity is the initial capacity of the HashMap; the maximum capacity is MAXIMUM_CAPACITY = 1 << 30.
        if (initialCapacity < 0)
            throw new IllegalArgumentException("Illegal initial capacity: " +
                                               initialCapacity);
        if (initialCapacity > MAXIMUM_CAPACITY)
            initialCapacity = MAXIMUM_CAPACITY;

        // loadFactor is the load factor (DEFAULT_LOAD_FACTOR = 0.75 by default); it is used to compute the threshold.
        if (loadFactor <= 0 || Float.isNaN(loadFactor))
            throw new IllegalArgumentException("Illegal load factor: " +
                                               loadFactor);

        // Find a power of 2 >= initialCapacity
        int capacity = 1;
        while (capacity < initialCapacity)
            capacity <<= 1;

        this.loadFactor = loadFactor;
        threshold = (int)(capacity * loadFactor);
        table = new Entry[capacity];
        init();
    }

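As the code shows, initialization needs to know the initial capacity. Because the index into the Entry array is later computed from the hash with a bitwise AND, the length of the Entry array must be a power of two, so the constructor rounds the requested capacity up to the next power of two.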
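To make the rounding concrete, here is a small sketch that reuses the constructor's loop outside of HashMap. The class name CapacityDemo and its method names are my own and not part of the JDK.

```java
/** Hypothetical helper that repeats the capacity and threshold arithmetic of the constructor. */
class CapacityDemo {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    /** Smallest power of two that is >= the requested capacity, clamped to MAXIMUM_CAPACITY. */
    static int roundUpToPowerOfTwo(int initialCapacity) {
        if (initialCapacity < 0) {
            throw new IllegalArgumentException("Illegal initial capacity: " + initialCapacity);
        }
        if (initialCapacity > MAXIMUM_CAPACITY) {
            initialCapacity = MAXIMUM_CAPACITY;
        }
        int capacity = 1;
        while (capacity < initialCapacity) {
            capacity <<= 1;
        }
        return capacity;
    }

    /** The size at which the table is resized. */
    static int threshold(int capacity, float loadFactor) {
        return (int) (capacity * loadFactor);
    }

    public static void main(String[] args) {
        for (int requested : new int[] {0, 1, 10, 16, 17, 1000}) {
            int capacity = roundUpToPowerOfTwo(requested);
            System.out.println(requested + " -> capacity " + capacity
                    + ", threshold " + threshold(capacity, 0.75f));
        }
        // For example, a request for 10 yields capacity 16 and threshold 12.
    }
}
```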

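2.6 Hash computation and collisions in HashMap

HashMap first computes hashCode() and then applies a second hash. The code is as follows:

// Compute the second hash
int hash = hash(key.hashCode());

// Use the hash to find the array index
int i = indexFor(hash, table.length);

Before studying HashMap's hash algorithm, let's look at the hash algorithm of the JDK's String. The code is as follows: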

    /**
     * Returns a hash code for this string. The hash code for a
     * <code>String</code> object is computed as
     * <blockquote><pre>
     * s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
     * </pre></blockquote>
     * using <code>int</code> arithmetic, where <code>s[i]</code> is the
     * <i>i</i>th character of the string, <code>n</code> is the length of
     * the string, and <code>^</code> indicates exponentiation.
     * (The hash value of the empty string is zero.)
     *
     * @return a hash code value for this object.
     */
    public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;
            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

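As the JDK API documentation shows, the formula is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], where s[i] is the character at index i and n is the length of the string. Why the fixed constant 31? There has been a lot of discussion about it; essentially it is a number chosen for optimization. The main reference is this passage from Joshua Bloch's Effective Java: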

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

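In short: 31 was chosen because it is an odd prime. Had the multiplier been even, an overflowing multiplication would lose information, because multiplying by 2 is the same as shifting. The benefit of using a prime is less clear, but it is the traditional choice. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction: 31 * i == (i << 5) - i. Modern virtual machines perform this optimization automatically.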
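Both the polynomial formula and the shift identity can be checked with a few lines. The sketch below is mine (StringHashDemo and its method names are hypothetical); it recomputes the documented formula by hand and compares it with String.hashCode().

```java
/** Hypothetical check of the documented String hash formula and the 31 * i shift identity. */
class StringHashDemo {
    /** Evaluates s[0]*31^(n-1) + ... + s[n-1] with int arithmetic, using Horner's rule. */
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    /** True when 31 * i equals (i << 5) - i; int overflow wraps the same way on both sides. */
    static boolean shiftIdentityHolds(int i) {
        return 31 * i == (i << 5) - i;
    }

    public static void main(String[] args) {
        System.out.println(polynomialHash("abc"));               // 96354
        System.out.println("abc".hashCode());                    // 96354
        System.out.println(shiftIdentityHolds(12345));             // true
        System.out.println(shiftIdentityHolds(Integer.MAX_VALUE)); // true, even with overflow
    }
}
```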

Now to the main question: why does HashMap apply a second hash? The code is as follows:

    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

Before answering that, let's look at how HashMap uses the hash to find the array index.

    /**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }

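Here h is the hash value and length is the length of the array. Because length is always a power of two, this bitwise AND gives the same result as taking h modulo length (for a negative h it matches the non-negative mathematical remainder, not Java's % operator), and it is cheaper to compute. When is this kind of computation used? Typically for partitioning into groups, for example distributing 100 numbers across 16 groups. It is very widely used.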
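The equivalence between the mask and the remainder only holds for power-of-two lengths, which a short check makes visible. IndexForDemo is a hypothetical name; indexFor is copied from the source above and Math.floorMod is the standard library method.

```java
/** Hypothetical comparison of h & (length - 1) with the modulo operation. */
class IndexForDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    public static void main(String[] args) {
        // Power-of-two length: the mask agrees with the non-negative remainder.
        for (int h : new int[] {15, 31, 47, 100, -1}) {
            System.out.println(h + ": mask=" + indexFor(h, 16)
                    + " floorMod=" + Math.floorMod(h, 16)
                    + " %=" + (h % 16));
        }
        // Length 10 is not a power of two: the mask no longer matches the remainder.
        System.out.println("5 with length 10: mask=" + indexFor(5, 10) + " %=" + (5 % 10));
    }
}
```

For h = -1 the mask yields 15 while -1 % 16 yields -1 in Java, which is one reason the mask is the safer way to derive an array index.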

Now that we know how the grouping works, let's look at a few examples. The code is as follows:

        int h=15,length=16;
        System.out.println(h & (length-1));
        h=15+16;
        System.out.println(h & (length-1));
        h=15+16+16;
        System.out.println(h & (length-1));
        h=15+16+16+16;
        System.out.println(h & (length-1));

All four print 15. Why? Let's look at the values in binary.

System.out.println(Integer.parseInt("0001111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("0011111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("0111111", 2) & Integer.parseInt("0001111", 2));

System.out.println(Integer.parseInt("1111111", 2) & Integer.parseInt("0001111", 2));

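Here you can see that in the bitwise AND only the low bits take part in the computation. The high bits do not, because the high bits of the mask are all 0. As a result, as long as the low bits are the same, the result is the same no matter what the high bits are. If HashMap relied on this alone, such hashes would always collide in the same array slot, the linked list starting at that slot would keep growing, and lookups would become slow; that could hardly be called high performance. So HashMap has to address this problem and spread keys across the array as evenly as possible, to avoid hash clustering.

Back to the point: how does HashMap handle this, and how does it perform the second hash?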

    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

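This function is what reduces hash collisions: it folds the high bits of the hash code into the low bits, so that they also influence the array index. For resolving the collisions that still occur, there are several general methods:

(1) Open addressing (linear probing, quadratic probing, pseudo-random probing)

(2) Rehashing

(3) Separate chaining

(4) A common overflow area

HashMap uses separate chaining. These methods will be covered separately in later posts, so I won't go into them here.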
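To see what the second hash buys, the sketch below feeds 16 hash codes that differ only in their high bits (i << 16) into a 16-slot table, once directly and once through hash(). SpreadDemo and distinctBuckets are hypothetical names; hash and indexFor are copied from the JDK 1.6 source shown above.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical experiment: how many buckets do hash codes that differ only in high bits reach? */
class SpreadDemo {
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    /** Counts the distinct buckets used by the hash codes 0 << 16, 1 << 16, ..., 15 << 16. */
    static int distinctBuckets(boolean useSecondHash) {
        Set<Integer> buckets = new HashSet<>();
        for (int i = 0; i < 16; i++) {
            int hashCode = i << 16; // the low 16 bits are always zero
            int h = useSecondHash ? hash(hashCode) : hashCode;
            buckets.add(indexFor(h, 16));
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        System.out.println("without second hash: " + distinctBuckets(false)); // 1
        System.out.println("with second hash:    " + distinctBuckets(true));  // 16
    }
}
```

Without the second hash every key lands in bucket 0; with it, this particular input spreads over all 16 buckets. Other inputs will not always spread this evenly.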

2.7 HashMap put()

So much for the basic concepts. Now to the main topic: how does HashMap store an object? The code is as follows:

 /**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

As the code shows, the steps are:

(1) First check whether key is null. If it is, hand it to putForNullKey(value). The code is as follows:

    /**
     * Offloaded version of put for null keys
     */
    private V putForNullKey(V value) {
        for (Entry<K,V> e = table[0]; e != null; e = e.next) {
            if (e.key == null) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }
        modCount++;
        addEntry(0, null, value, 0);
        return null;
    }

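As the code shows, an entry whose key is null is stored in the chain that starts at table[0]. The method walks every Entry in the table[0] chain. If it finds an Entry whose key is null, it replaces the value and returns the old value; if it finds none, it adds a new node.

(2) Compute the key's hashCode, apply the second hash to the result, and use indexFor(hash, table.length) to find the index i in the Entry array.

(3) Then walk the chain whose head is table[i]. If a node with the same hash and an equal key is found, replace its value with the new one and return the old value.

(4) What is modCount for? As everyone knows, HashMap is not thread-safe. But in applications that tolerate some failure, you may not want to pay Hashtable's synchronization cost just because of a 1% chance of trouble. HashMap uses a fail-fast mechanism to deal with this. In the source, modCount is declared like this: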

    transient volatile int modCount;

modCount is declared volatile, so under the Java memory model a write to modCount by one thread is visible to later reads by other threads. In HashMap, modCount only really matters during iteration.

private abstract class HashIterator<E> implements Iterator<E> {
        Entry<K,V> next;    // next entry to return
        int expectedModCount;    // For fast-fail
        int index;        // current slot
        Entry<K,V> current;    // current entry

        HashIterator() {
           expectedModCount = modCount;
            if (size > 0) { // advance to first entry
                Entry[] t = table;
                while (index < t.length && (next = t[index++]) == null)
                    ;
            }
        }

        public final boolean hasNext() {
            return next != null;
        }

        final Entry<K,V> nextEntry() {
            // This check is the heart of fail-fast
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            Entry<K,V> e = next;
            if (e == null)
                throw new NoSuchElementException();

            if ((next = e.next) == null) {
                Entry[] t = table;
                while (index < t.length && (next = t[index++]) == null)
                    ;
            }
        current = e;
            return e;
        }

        public void remove() {
            if (current == null)
                throw new IllegalStateException();
            if (modCount != expectedModCount)
                throw new ConcurrentModificationException();
            Object k = current.key;
            current = null;
            HashMap.this.removeEntryForKey(k);
          expectedModCount = modCount;
        }

    }

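When iteration starts, the Iterator copies modCount into expectedModCount. During iteration it compares the two on every step to detect whether the HashMap has been modified, either by the same thread outside the iterator or by another thread. If modCount and expectedModCount differ, the structure of the HashMap has changed underneath the iterator, and a ConcurrentModificationException is thrown.

That is why operations such as put (when it adds an entry) and remove include a modCount++.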
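Fail-fast does not need a second thread to trigger. The sketch below uses only java.util.HashMap from a single thread; the class and method names (FailFastDemo and so on) are mine. It shows a structural change during iteration being rejected, and Iterator.remove() as the supported way to delete while iterating.

```java
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical single-threaded demonstration of the fail-fast check. */
class FailFastDemo {
    static Map<String, Integer> sample() {
        Map<String, Integer> map = new HashMap<>();
        map.put("a", 1);
        map.put("b", 2);
        map.put("c", 3);
        return map;
    }

    /** Adds a key while iterating; the iterator's next step notices the changed modCount. */
    static boolean putDuringIterationThrows() {
        Map<String, Integer> map = sample();
        try {
            for (String key : map.keySet()) {
                map.put("added-during-iteration", 0); // structural change made outside the iterator
            }
        } catch (ConcurrentModificationException e) {
            return true;
        }
        return false;
    }

    /** Removes through the iterator, which resynchronizes expectedModCount after each removal. */
    static int removeAllViaIterator() {
        Map<String, Integer> map = sample();
        Iterator<String> it = map.keySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove();
        }
        return map.size();
    }

    public static void main(String[] args) {
        System.out.println(putDuringIterationThrows()); // true
        System.out.println(removeAllViaIterator());     // 0
    }
}
```

The check is a best-effort bug detector, not a synchronization mechanism, so a program should not rely on the exception for correctness.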

(5) If no node with the same hash and key is found, add a new node with addEntry(). The code is as follows:

  void addEntry(int hash, K key, V value, int bucketIndex) {
    Entry<K,V> e = table[bucketIndex];
        table[bucketIndex] = new Entry<K,V>(hash, key, value, e);
        if (size++ >= threshold)
            resize(2 * table.length);
    }

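The insertion takes a shortcut here: every new node is placed at the head of the chain, and the new head's next points to the old head.

(6) If the size of the HashMap exceeds the threshold, the table has to be resized and its capacity expanded; see the resize() section below.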
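The steps above also explain what a custom key class needs: step (3) only finds an existing entry when both the hash and equals agree. Below is a hedged sketch against java.util.HashMap; Point, RawPoint, and the method names are hypothetical types of my own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical demonstration of how put() treats equal keys, identity keys, and the null key. */
class CustomKeyDemo {
    /** Key that overrides equals and hashCode consistently. */
    static final class Point {
        private final int x;
        private final int y;

        Point(int x, int y) {
            this.x = x;
            this.y = y;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Point)) {
                return false;
            }
            Point p = (Point) o;
            return x == p.x && y == p.y;
        }

        @Override
        public int hashCode() {
            return Objects.hash(x, y);
        }
    }

    /** Key that inherits identity-based equals and hashCode from Object. */
    static final class RawPoint {
        final int x;
        final int y;

        RawPoint(int x, int y) {
            this.x = x;
            this.y = y;
        }
    }

    /** The second put finds the entry created by the first and returns its old value. */
    static String replaceWithEqualKey() {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "first");
        return map.put(new Point(1, 2), "second");
    }

    /** Two RawPoint instances are never equal, so the map ends up with two entries. */
    static int sizeWithIdentityKeys() {
        Map<RawPoint, String> map = new HashMap<>();
        map.put(new RawPoint(1, 2), "first");
        map.put(new RawPoint(1, 2), "second");
        return map.size();
    }

    /** The null key is a legal key with a single slot of its own. */
    static String nullKeyRoundTrip() {
        Map<String, String> map = new HashMap<>();
        map.put(null, "a");
        map.put(null, "b");
        return map.get(null);
    }

    public static void main(String[] args) {
        System.out.println(replaceWithEqualKey());  // first
        System.out.println(sizeWithIdentityKeys()); // 2
        System.out.println(nullKeyRoundTrip());     // b
    }
}
```

String is a convenient key largely because it is immutable and already implements equals and hashCode; a custom class can serve equally well as long as it does the same.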

2.8 HashMap get()

Once you understand put, get is easy to follow. The code is as follows:

    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
        for (Entry<K,V> e = table[indexFor(hash, table.length)];
             e != null;
             e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

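Don't be fooled by how short this code is: it hides a serious problem. Remember that HashMap is not thread-safe, and under concurrent use this loop can spin forever. Why? The usual cause is a concurrent resize: when two threads run transfer() at the same time, the relinking can leave a chain with a cycle, so that e = e.next never reaches null. A thread that then looks up a key in that bucket loops forever, and the symptom is a stuck request and very high CPU usage. Be sure to keep this in mind.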
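When a map really is shared between threads, the standard library already offers safe alternatives. The sketch below is a minimal illustration under my own names (ConcurrentPutDemo, fillFromTwoThreads); it deliberately does not run a plain HashMap concurrently, because the outcome of that would be unpredictable.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical demonstration: two threads filling one thread-safe map with disjoint keys. */
class ConcurrentPutDemo {
    /** Each thread inserts 1000 distinct keys; returns the final size of the map. */
    static int fillFromTwoThreads(Map<Integer, Integer> map) {
        Thread first = new Thread(() -> {
            for (int i = 0; i < 1000; i++) {
                map.put(i, i);
            }
        });
        Thread second = new Thread(() -> {
            for (int i = 1000; i < 2000; i++) {
                map.put(i, i);
            }
        });
        first.start();
        second.start();
        try {
            first.join();
            second.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for writers", e);
        }
        return map.size();
    }

    public static void main(String[] args) {
        // Both wrappers are safe to share between threads; a plain HashMap is not.
        System.out.println(fillFromTwoThreads(new ConcurrentHashMap<>()));
        System.out.println(fillFromTwoThreads(
                Collections.synchronizedMap(new HashMap<Integer, Integer>())));
    }
}
```

Both calls report 2000 entries, because no insertion is lost and no chain is corrupted.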

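2.9 HashMap size()

HashMap's size is simple: it is not computed on demand. Instead, size is incremented every time an Entry is added and decremented every time one is removed. This trades space for time. Because HashMap makes no thread-safety guarantees, it can get away with a plain counter like this, which is efficient.

2.10 HashMap resize()

When the size of the HashMap exceeds the threshold, its capacity has to be expanded. The code is as follows: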

    void resize(int newCapacity) {
        Entry[] oldTable = table;
        int oldCapacity = oldTable.length;
        if (oldCapacity == MAXIMUM_CAPACITY) {
            threshold = Integer.MAX_VALUE;
            return;
        }

        Entry[] newTable = new Entry[newCapacity];
        transfer(newTable);
        table = newTable;
        threshold = (int)(newCapacity * loadFactor);
    }

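As the code shows, if the old capacity has already reached MAXIMUM_CAPACITY, threshold is set to Integer.MAX_VALUE and the method returns without growing. Otherwise it allocates a new Entry array, twice the length of the old one, and moves the entries of the old Entry[] into the new Entry[]. The code is as follows: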

    void transfer(Entry[] newTable) {
        Entry[] src = table;
        int newCapacity = newTable.length;
        for (int j = 0; j < src.length; j++) {
            Entry<K,V> e = src[j];
            if (e != null) {
                src[j] = null;
                do {
                    Entry<K,V> next = e.next;
                    int i = indexFor(e.hash, newCapacity);
                    e.next = newTable[i];
                    newTable[i] = e;
                    e = next;
                } while (e != null);
            }
        }
    }

During the move, the array index of every entry is recomputed with int i = indexFor(e.hash, newCapacity);.
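Recomputing the index after doubling has a simple consequence: an entry either keeps its index or moves up by exactly the old capacity, depending on one extra bit of its hash. A small sketch, with ResizeIndexDemo and staysOrMovesByOldCapacity as hypothetical names and indexFor copied from the source above:

```java
/** Hypothetical illustration of how bucket indexes change when the table doubles. */
class ResizeIndexDemo {
    static int indexFor(int h, int length) {
        return h & (length - 1);
    }

    /** True when the new index equals the old one, or the old one plus the old capacity. */
    static boolean staysOrMovesByOldCapacity(int hash, int oldCapacity) {
        int oldIndex = indexFor(hash, oldCapacity);
        int newIndex = indexFor(hash, 2 * oldCapacity);
        return newIndex == oldIndex || newIndex == oldIndex + oldCapacity;
    }

    public static void main(String[] args) {
        // All four hashes share bucket 5 in a 16-slot table.
        for (int hash : new int[] {5, 21, 37, 53}) {
            System.out.println("hash " + hash
                    + ": index in 16 = " + indexFor(hash, 16)
                    + ", index in 32 = " + indexFor(hash, 32));
        }
        // After doubling, 5 and 37 stay in bucket 5 while 21 and 53 move to bucket 21.
    }
}
```

This is why a resize tends to split long chains roughly in half instead of just copying them.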

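That covers the core of HashMap. There is also some iterator code, which I won't go through piece by piece here. HashMap was also upgraded in JDK 1.7; specifically, a hash seed now takes part in the hash computation.

That more or less completes today's walk through the HashMap source. Next I will analyze the source of ConcurrentHashMap, which makes up for HashMap's lack of thread safety and Hashtable's poor performance. It is currently the high-performance, thread-safe HashMap-style class.

It's late. I hope this helps; good night.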