Java HashMap with Example

According to object-oriented programming principles, objects are intended to behave like 'black boxes' that developers can use through their public interface alone, without needing to know how they work internally. Unfortunately this is not always true: sometimes it is important to have at least a basic idea of the internal implementation in order to use objects properly. This is especially true when dealing with concurrent execution and multi-threading, where it becomes important to know whether a class is designed to be thread safe. The Java Collections are a set of classes that must be handled with caution. In this article we will describe the Java HashMap class, how it is implemented, and how it behaves in a multi-threaded environment.

Story Behind Java HashMap

The Java HashMap class is an implementation of what is called a hash table. But before delving into hash tables, let's recall some of the basic data structures in Java.

When we face the need to store and retrieve a set of objects, the first data structure that comes to mind in Java, as in other programming languages, is an array. An array allows us to store a set of values of a single data type contiguously in memory, with each element associated with an index.

We can randomly access each element of an array by its index, i.e. we can access every element in a single step, which is ideal from the standpoint of performance. A downside of arrays is that their size is fixed: you must specify the array size when you declare the array.

A different data structure is the so-called linked list. The elements of a linked list are not contiguous in memory, and the list can grow. Each element contains the stored object as well as a pointer to the next element, which is null if there are no further elements. A downside of the linked list is that it cannot be accessed randomly: to reach a specific element we must traverse the list from the first element until we find it.

Array and Linked List
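The linked list described above can be sketched in a few lines of Java. The `Node` and `length` names here are illustrative helpers written for this article, not part of the Java collections API:

```java
// A minimal sketch of a singly linked list, as described above.
public class LinkedListSketch {
    static class Node<T> {
        T value;
        Node<T> next; // null when this is the last element
        Node(T value) { this.value = value; }
    }

    // Traversal must start from the head: no random access is possible.
    static <T> int length(Node<T> head) {
        int count = 0;
        for (Node<T> n = head; n != null; n = n.next) count++;
        return count;
    }

    public static void main(String[] args) {
        Node<String> head = new Node<>("first");
        head.next = new Node<>("second"); // the list grows without reallocation
        System.out.println(length(head)); // 2
    }
}
```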

The limitations of these two data structures can be overcome using a data model known as a "hash table". A hash table is an array coupled with a so-called hash function. The hash function takes an object that is intended to play the role of a key and returns an integer called the hash value. The hash value is then used to calculate the index of the hash table slot in which to put the key and the corresponding value. The positions of the hash table in which the key and value pairs are stored are called "buckets".

To be more precise, each bucket actually contains a pointer to a linked list, and the key and value are put into the first available element of this linked list. The hash function must always give the same hash value for identical keys, otherwise retrieval would be impossible, but it may return the same hash value for different keys. When this happens we speak of a "collision". This is where the linked lists associated with the hash table come in handy: they solve the problem of collisions. In the figure below the hash function returns the same hash value, and therefore the same bucket, for Key 1 and Key 2. The result is that in bucket number 2 the two entries related to Key 1 and Key 2 are stored in the linked list one after the other, and Entry 1 points to Entry 2.

Then, when we want to retrieve, for instance, Entry 2, we use the hash function to calculate the hash value of Key 2, from the hash value we calculate the table's index, and then we traverse the linked list checking the key contained in each entry until we find one equal to Key 2.
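The put and get mechanics just described can be sketched as a toy separate-chaining hash table. This is a simplified illustration written for this article (no resizing, no key replacement), not the real HashMap implementation:

```java
import java.util.LinkedList;

// Toy separate-chaining hash table, illustrating the bucket + linked list
// model described above.
public class ToyHashTable {
    static final int BUCKETS = 16;

    static class Entry {
        final String key;
        final String value;
        Entry(String key, String value) { this.key = key; this.value = value; }
    }

    @SuppressWarnings("unchecked")
    private final LinkedList<Entry>[] table = new LinkedList[BUCKETS];

    // hashCode() plays the role of the hash function; the bucket index is
    // derived from the hash value.
    private int indexFor(String key) {
        return Math.floorMod(key.hashCode(), BUCKETS);
    }

    public void put(String key, String value) {
        int i = indexFor(key);
        if (table[i] == null) table[i] = new LinkedList<>();
        table[i].add(new Entry(key, value)); // colliding keys chain in one list
    }

    public String get(String key) {
        int i = indexFor(key);
        if (table[i] == null) return null;
        for (Entry e : table[i]) {           // traverse the chain comparing keys
            if (e.key.equals(key)) return e.value;
        }
        return null;
    }

    public static void main(String[] args) {
        ToyHashTable t = new ToyHashTable();
        t.put("Key 1", "Entry 1");
        t.put("Key 2", "Entry 2");
        System.out.println(t.get("Key 2")); // Entry 2
    }
}
```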

The approach to collisions described above is called "separate chaining". For completeness, other strategies exist that do not involve linked lists. One of these is called "linear probing": if a collision occurs, the record is assigned the next available slot in the table. Linear probing, however, suffers from what is called "clustering", the tendency of filled slots to become contiguous, which negatively affects performance.

An important factor in hash table performance is the ratio between the number of entries and the number of buckets, called the load factor. If the load factor reaches a certain threshold, performance begins to decrease. If the load factor is low and the hash function is well designed, the hash values are spread uniformly across the table, there are few collisions and performance is at its best.

How Does the Java HashMap Implement a Hash Table?

The Java HashMap class implements the Map interface and provides a hash table data structure. The main methods of its public interface are essentially put(K key, V value) and get(Object key), and it has the following constructors:

  • HashMap()
  • HashMap(int initialCapacity)
  • HashMap(int initialCapacity, float loadFactor)

The put() method takes the key and the corresponding value, creates an instance of an implementation of the Entry interface with them (the Entry object represents just the key and value pair) and stores it into a bucket, or, to be more precise, into the first available element of the linked list pointed to by the bucket (in case of no collisions the linked list will have only one element).

It is advisable to create a HashMap instance with an initial capacity big enough to avoid its resizing when the load factor reaches its threshold.
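The advice above can be turned into a small sizing rule of thumb. The `capacityFor` helper below is a hypothetical utility written for this article; it computes a capacity large enough that the expected number of entries stays below the default 0.75 load factor threshold:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: pre-sizing a HashMap so the expected number of entries stays below
// the default 0.75 load-factor threshold, avoiding a resize.
public class HashMapSizing {
    // Smallest capacity such that expectedEntries / capacity <= 0.75.
    static int capacityFor(int expectedEntries) {
        return (int) (expectedEntries / 0.75f) + 1;
    }

    public static void main(String[] args) {
        // 100 entries can now be inserted without triggering a resize.
        Map<String, Integer> map = new HashMap<>(capacityFor(100));
        System.out.println(capacityFor(100)); // 134
    }
}
```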

The part of the hash function is played by the hashCode() method, which is inherited from the Object class, so every class has it. Any object that is to be used as a key in a HashMap must be an instance of a class that overrides the hashCode() method in order to provide a proper hash value. If the hashCode() implementation is good enough, we can expect the hash values to be spread uniformly and collisions to be minimized.

The Java HashMap uses an internal function that takes the hash value returned by the key's hashCode() method and returns a final, improved value (one that is supposed to be better distributed along the table), which is eventually used to calculate the index of the bucket in which to put the entry.

The get() method retrieves an object by its key: it uses the hashCode() method of the key object to generate the corresponding hash value and, from it, the array index, then traverses the linked list from the beginning, using the equals() method of the keys already stored in the list to find the matching key. The equals() method is also inherited from the Object class and must be overridden accordingly.
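The signature of equals(), as declared in java.lang.Object, is:

```java
public boolean equals(Object obj)
```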

It takes an object as input and compares it with the object on which it is called, returning true if they are equal.

The default implementation provided by the Object class returns true only if the two objects have the same reference, which means they are exactly the same object. This is not what we want when comparing keys: the right implementation must guarantee that objects having the same internal state are found equal.

The get() method returns null if the map does not contain a matching entry for the key.

It is important to note that only immutable objects make good keys. If an object is not immutable, its internal state can change, and it is then not guaranteed that the get() method will be able to retrieve a previously inserted object.
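The hazard described above can be demonstrated directly. The `MutableKeyHazard` and `MutableKey` classes below are hypothetical, written for this demonstration: after the key is mutated, its hash value changes, so lookups probe the wrong bucket (or find a key whose state no longer matches):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch showing why mutable keys are dangerous: mutating a key after
// insertion changes its hash value, so get() looks in the wrong bucket.
public class MutableKeyHazard {
    static class MutableKey {
        int value; // mutable state used by hashCode() and equals()
        MutableKey(int value) { this.value = value; }
        @Override public int hashCode() { return 31 + value; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).value == value;
        }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey(1);
        map.put(key, "payload");

        key.value = 2; // the key's internal state (and hash value) changes

        // The entry is effectively lost:
        System.out.println(map.get(key));               // null: wrong bucket
        System.out.println(map.get(new MutableKey(1))); // null: equals() fails
    }
}
```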

The HashMap has a default load factor threshold of 0.75. When it reaches this limit, the HashMap is resized: a new array is created with twice the size of the original, and the hash function (the hashCode() method) is used again to redistribute the objects among the new bucket locations. This is called re-hashing.

Accessing Java HashMap Concurrently

The Java HashMap class is not thread safe and is not suited to multi-threaded environments unless access to its instances is synchronized externally.

There is also a particular situation that can lead to a race condition resulting in an infinite loop: when the HashMap instance reaches the load factor threshold of 0.75 (see the previous paragraph), a resize of the map is triggered.

The resizing works by creating a new array twice the size of the original. Each original linked list is then assigned to the new table, but with its elements in reverse order, because they are taken from and put at the head of the list in order to avoid the extra work of traversing the whole list. This reverse order is the cause of a possible infinite loop that can occur when more than one thread is involved in the resizing.

This behavior is caused by the aforementioned reverse order of the re-hashed linked lists combined with the actual implementation of the function that does the re-hashing. If two threads run this function concurrently, one of them may find the variables containing the linked list pointers in an inconsistent state, which can eventually leave the last element of the list pointing to the first, causing an infinite loop in any subsequent operation.

Java HashMap Example

Here is an example of using HashMap in Java. The MyKey class below represents our key: it contains an integer value and overrides the hashCode() and equals() methods. In this example the implementations of hashCode() and equals() have been generated using the utilities of the Eclipse platform (see the screenshot below).
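The original listing was not preserved, so the following is a reconstruction of the MyKey class as described above, with Eclipse-style generated hashCode() and equals(); the field name `keyValue` follows the article's text:

```java
// Sketch of the MyKey class described in the text: an immutable key holding
// one int, with Eclipse-style hashCode() and equals().
public class MyKey {
    private final int keyValue;

    public MyKey(int keyValue) { this.keyValue = keyValue; }

    public int getKeyValue() { return keyValue; }

    @Override
    public int hashCode() {
        final int prime = 31; // prime multiplier, as generated by Eclipse
        int result = 1;
        result = prime * result + keyValue;
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;                         // same instance
        if (obj == null || getClass() != obj.getClass()) return false;
        MyKey other = (MyKey) obj;
        return keyValue == other.keyValue;                    // same internal state
    }
}
```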

The reason for using a prime number in the hash code calculation is to minimize the possibility of collisions. From the hash value, the final table index in which to store the entry can be calculated as <hash value> % <number of buckets>, i.e. the modulus operator applied to the hash value and the table size, but for performance reasons a bitwise calculation is used instead. That calculation relies on the fact that a HashMap normally has a number of buckets that is a power of 2 (16 is the default), which for even hash values can lead to a higher probability of collisions unless a prime number is used to correct it.
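The equivalence between the modulus and the bitwise calculation can be verified directly: for a power-of-two table size 2^k, hash % 2^k equals hash & (2^k - 1) for non-negative hash values. A small sketch:

```java
// Sketch: with a power-of-two number of buckets, the bucket index can be
// computed with a bitwise AND instead of the slower modulus operator.
public class IndexCalculation {
    static int indexFor(int hash, int buckets) {
        return hash & (buckets - 1); // keeps only the low-order bits
    }

    public static void main(String[] args) {
        int buckets = 16; // HashMap's default capacity
        int hash = 12345;
        System.out.println(hash % buckets);            // 9
        System.out.println(indexFor(hash, buckets));   // 9, same result
    }
}
```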

The equals() method implementation is self-explanatory: it returns true if the compared objects are exactly the same instance, or if the keyValue fields of the two objects are equal, i.e. if the internal state of the two objects is equal; otherwise it returns false.

The HashMapExample class below creates a HashMap in its main method, puts a number of entries into it using MyKey instances as keys, retrieves a single value by a particular key and then retrieves all the values by iterating over the entry set of the HashMap object.
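The original listing was not preserved, so here is a reconstruction of what the HashMapExample class describes; MyKey is reproduced in minimal form so the example is self-contained, and the value strings are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the HashMapExample class described in the text.
public class HashMapExample {
    static class MyKey {
        private final int keyValue;
        MyKey(int keyValue) { this.keyValue = keyValue; }
        @Override public int hashCode() { return 31 + keyValue; }
        @Override public boolean equals(Object obj) {
            return obj instanceof MyKey && ((MyKey) obj).keyValue == keyValue;
        }
    }

    public static void main(String[] args) {
        Map<MyKey, String> map = new HashMap<>();
        for (int i = 1; i <= 3; i++) {
            map.put(new MyKey(i), "Value " + i); // MyKey instances as keys
        }

        // Retrieve a single value by a particular key.
        System.out.println(map.get(new MyKey(2))); // Value 2

        // Retrieve all the values by iterating over the entry set.
        for (Map.Entry<MyKey, String> entry : map.entrySet()) {
            System.out.println(entry.getValue());
        }
    }
}
```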

Conclusion

As an implementation of the hash table data structure, HashMap is a good choice in a non-concurrent environment, since it guarantees good performance. When concurrency is a concern, however, it is important to keep the application free of race conditions and data corruption. Some of the possible choices are:

  • Synchronize access to the HashMap externally
  • Use ConcurrentHashMap
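Both options above can be sketched briefly. Collections.synchronizedMap wraps the HashMap so every call locks the whole map, while ConcurrentHashMap provides finer-grained internal synchronization:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the two thread-safe options listed above.
public class ThreadSafeMaps {
    public static void main(String[] args) {
        // Option 1: a synchronized wrapper; every method call locks the map.
        Map<String, Integer> synced = Collections.synchronizedMap(new HashMap<>());
        synced.put("a", 1);

        // Option 2: ConcurrentHashMap, designed for concurrent access.
        Map<String, Integer> concurrent = new ConcurrentHashMap<>();
        concurrent.put("a", 1);

        System.out.println(synced.get("a") + " " + concurrent.get("a")); // 1 1
    }
}
```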

Another choice would be the Hashtable class, since it is thread safe, but it has the downside of poor performance. ConcurrentHashMap performs better and is thread safe like Hashtable, so the two are interchangeable except for the details of their locking behavior.

Author Bio: Mario Casari is a software architect with experience in complex architectures, mainly in Java. He has worked in many different fields, among them banking unattended systems, navy messaging, e-health and pharmacovigilance. He writes about Java programming on his blog, not only Java.


Translated from: https://www.thecrazyprogrammer.com/2015/06/java-hashmap-with-example.html