I love ‘Java Concurrency in Practice’. It’s a Java concurrency book written by Brian Goetz, Doug Lea and Joshua Bloch, and is considered the definitive guide on the subject. These fine folks were also involved in JSR 166 and have authored many of the concurrency/collection classes in the JDK. It is fascinating to walk through their code: there is a lot to learn about code structure, performance, trade-offs, etc.
Let’s start with HashMap and its cousin LinkedHashMap.
In short, HashMap is backed by an array. During a put operation, the hashcode of the key is calculated, and an Entry (key + value) is inserted into the array at index hashcode % array-size. If more keys are added with the same hashcode, a linked list is formed with the previously added keys.
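The mechanics above can be sketched with a toy version (the class and names here are mine for illustration, not the JDK's code): an array of buckets, where colliding entries are chained into a singly linked list with the newest entry at the head.

```java
import java.util.Objects;

// Toy illustration of "array + chaining" (not the real JDK implementation).
class ToyHashMap<K, V> {
    private static class Entry<K, V> {
        final K key;
        V value;
        Entry<K, V> next;
        Entry(K key, V value, Entry<K, V> next) {
            this.key = key; this.value = value; this.next = next;
        }
    }

    @SuppressWarnings("unchecked")
    private final Entry<K, V>[] table = new Entry[16];

    public void put(K key, V value) {
        // Mask off the sign bit so the index is never negative.
        int index = (key.hashCode() & 0x7FFFFFFF) % table.length;
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) { e.value = value; return; }  // overwrite
        }
        table[index] = new Entry<>(key, value, table[index]);  // prepend to the chain
    }

    public V get(K key) {
        int index = (key.hashCode() & 0x7FFFFFFF) % table.length;
        for (Entry<K, V> e = table[index]; e != null; e = e.next) {
            if (Objects.equals(e.key, key)) return e.value;
        }
        return null;  // key absent
    }
}
```

The real HashMap replaces the `%` with a bit mask and lazily allocates the array, as the next sections show.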
Let’s focus on the interesting parts of the class (whole code here: HashMap & LinkedHashMap)
Initialization
If the initial capacity and load factor are not provided, they default to 16 and 0.75 respectively.
Closest power of 2
If an initial capacity is given, it is increased to the closest power of 2 using this bit-twiddling algorithm.
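The trick is modeled on the JDK's `tableSizeFor` (Java 8 version): smear the highest set bit of `cap - 1` into every lower position, then add 1 to land on the next power of two. (The real method also caps the result at MAXIMUM_CAPACITY, omitted here.)

```java
public class TableSize {
    // Modeled on java.util.HashMap.tableSizeFor (Java 8); MAXIMUM_CAPACITY
    // clamping is omitted for brevity.
    static int tableSizeFor(int cap) {
        int n = cap - 1;     // so an exact power of two maps to itself
        n |= n >>> 1;        // top 2 bits set
        n |= n >>> 2;        // top 4 bits set
        n |= n >>> 4;        // top 8 bits set
        n |= n >>> 8;        // top 16 bits set
        n |= n >>> 16;       // every bit below the highest one is now set
        return (n < 0) ? 1 : n + 1;  // all-ones + 1 == next power of two
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(17));  // 32
        System.out.println(tableSizeFor(16));  // 16
    }
}
```

Starting with `cap - 1` is what keeps 16 at 16 instead of bumping it to 32.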
Why a power of 2: when the table is resized (doubled), the elements in a linked list can be reassigned to new indexes without performing a modulo operation. This is awesome!
E.g.: if hash = 9 and oldTableSize = 8, then oldIndex = 1 (9 % 8); with newTableSize = 16, newIndex = 9 (9 % 16) = oldIndex + oldTableSize.
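The example works out like this in code: after doubling, a single extra bit of the hash decides whether an entry stays at its old index or moves to oldIndex + oldTableSize, which is exactly what `HashMap.resize` exploits to split each list cheaply.

```java
public class ResizeIndexDemo {
    public static void main(String[] args) {
        int hash = 9, oldSize = 8, newSize = 16;
        int oldIndex = hash & (oldSize - 1);  // 9 % 8  == 1
        int newIndex = hash & (newSize - 1);  // 9 % 16 == 9
        // Only the bit at position oldSize matters: if (hash & oldSize) == 0
        // the entry stays at oldIndex, otherwise it moves to
        // oldIndex + oldSize. No modulo needed.
        boolean staysPut = (hash & oldSize) == 0;
        System.out.println(newIndex == (staysPut ? oldIndex : oldIndex + oldSize));  // true
    }
}
```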
Table not initialized yet: interestingly, the array (the table field) is not initialized in the constructor, so no memory is allocated until the first put.
Linked-Lists vs Trees
If there are multiple elements with the same hashcode, they are linked to each other, forming a linked list (also called a bin). Starting with Java 8, as part of this JEP, if the number of elements with the same hashcode crosses a certain threshold, the list is converted to a balanced Red-Black tree.
A Red-Black tree is a sorted tree in which a search takes at most O(log(n)) operations. For the sorting to work, though, all the keys need to be comparable to each other (i.e. they need to implement the Comparable interface).
If the keys are not comparable, then of course it ends up as a tree where each node has only one child. This consumes twice the space and is generally twice as slow.
Hashcode
Best case: O(1). If the hashcodes of all keys are distinct, get and put operations run in O(1), i.e. constant time (independent of the number of keys in the map).
Worst case (bins): O(n). If the hashcodes of all keys are the same, those same operations can take O(n) time (n = number of keys in the map).
Worst case (trees): O(log(n)). If the hashcodes of all keys are the same, those same operations can take O(log(n)) time (n = number of keys in the map).
Note: technically, hashcode % table-length needs to be distinct, because even distinct hashcodes can end up in the same bin/tree, i.e. collide.
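The worst case is easy to provoke with a deliberately bad key (the class below is mine, for illustration): every instance collides, so all entries pile into a single bin, and lookups still work only because equals() distinguishes the keys while walking that bin.

```java
import java.util.HashMap;
import java.util.Map;

// A deliberately bad key: a constant hashCode sends every entry to one bin.
final class BadKey {
    final int id;
    BadKey(int id) { this.id = id; }
    @Override public int hashCode() { return 42; }  // every key collides
    @Override public boolean equals(Object o) {
        return o instanceof BadKey && ((BadKey) o).id == id;
    }
}

public class WorstCaseDemo {
    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 100; i++) {
            map.put(new BadKey(i), "v" + i);
        }
        // Correctness is preserved, but each get() must walk the single bin.
        System.out.println(map.get(new BadKey(7)));  // v7
    }
}
```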
String class: a special hash function and a new private field hash32 (to cache the hashcode) were added to the String class in Java 7, using a more complex hash to generate better-distributed values. This was specifically to improve performance in HashMaps (all variants). After trees replaced long bins (see section below), the field was removed, since the worst-case operations now take O(log(n)) instead of O(n).
Probability of list of size k
Unless there are programming mistakes or malicious code, bin sizes generally follow a Poisson distribution.
In layman's terms, typical keys inserted into a HashMap have hashcodes which are generally unique.
The probability of 2 elements having the same hashcode is low, 3 of them having the same is even lower, and 8 elements having the same hashcode is extremely rare (roughly 1 in 10 million).
Thus the probability of having to create trees is very low (although, if you use a custom class as a key and its hashcode is badly written, it is still very much possible).
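These odds can be reproduced with the Poisson formula itself, using lambda = 0.5 as the expected bin size (the figure the JDK's own source comment uses for the default 0.75 load factor): P(k) = e^-lambda * lambda^k / k!.

```java
public class BinSizeProbability {
    // Poisson probability of a bin holding exactly k entries, lambda = 0.5.
    static double poisson(double lambda, int k) {
        double factorial = 1;
        for (int i = 2; i <= k; i++) factorial *= i;
        return Math.exp(-lambda) * Math.pow(lambda, k) / factorial;
    }

    public static void main(String[] args) {
        for (int k = 0; k <= 8; k++) {
            System.out.printf("P(bin size = %d) = %.8f%n", k, poisson(0.5, k));
        }
        // P(8) comes out around 0.00000006, matching the figure quoted in
        // the HashMap source comment: treeification is a rare event.
    }
}
```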
Code for computing hash
So, in addition to the hashcode that we write for our keys, the hashcode itself is shifted and XORed with itself (bitwise). This is because the table index uses only the lower bits of the hash, so hashcodes that differ only in their upper bits would otherwise always collide.
Example of such floats:
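The post's original example isn't reproduced here; one well-known case (also mentioned in the JDK's hash() comment) is Floats holding consecutive whole numbers, whose hashcodes differ only in the upper bits:

```java
public class FloatHashDemo {
    public static void main(String[] args) {
        // The lower 16 bits of these hashcodes are all zero, so without
        // spreading, every key lands in bin 0 of any table up to 65536 slots.
        for (float f = 1.0f; f <= 4.0f; f += 1.0f) {
            int h = Float.hashCode(f);    // same as floatToIntBits for these
            int spread = h ^ (h >>> 16);  // HashMap's spreading step
            System.out.printf("%.1f  raw index: %d  spread index: %d%n",
                    f, h & 65535, spread & 65535);
        }
        // The raw index is 0 for every key; the spread indexes are distinct.
    }
}
```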
Put Key-Value
Let's skip the hash function for now.
(n - 1) & hash is the same as hash % n when n is a power of 2. Who knew!
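It's quick to convince yourself: for a power-of-two n, n - 1 is an all-ones mask (16 - 1 = 0b1111), so the AND keeps exactly the low bits that % would return for a non-negative hash.

```java
public class MaskVsMod {
    public static void main(String[] args) {
        int n = 16;  // any power of two
        for (int hash : new int[] {9, 16, 17, 12345}) {
            // Masking with n - 1 keeps the remainder bits directly.
            System.out.println((hash & (n - 1)) == (hash % n));  // true
        }
        // Bonus: for a negative hash, & still yields a valid non-negative
        // index, while Java's % would yield a negative remainder.
        System.out.println(-9 & 15);  // 7
        System.out.println(-9 % 16);  // -9
    }
}
```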
Code comments inline
Tree
A linked list is converted to a Red-Black tree only if the list's size exceeds a threshold (8) and the table size is at least a minimum capacity (64).
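That two-part condition can be sketched as follows; the constant names mirror the JDK's TREEIFY_THRESHOLD and MIN_TREEIFY_CAPACITY, but the helper method itself is hypothetical, for illustration only.

```java
public class TreeifyDecision {
    static final int TREEIFY_THRESHOLD = 8;      // mirrors the JDK constant
    static final int MIN_TREEIFY_CAPACITY = 64;  // mirrors the JDK constant

    // Hypothetical helper summarizing what putVal/treeifyBin decide.
    static String onLongBin(int binSize, int tableSize) {
        if (binSize < TREEIFY_THRESHOLD) return "keep list";
        // In a small table, a long bin is likely just crowding; doubling the
        // table spreads the bin out, so HashMap resizes instead of treeifying.
        if (tableSize < MIN_TREEIFY_CAPACITY) return "resize table";
        return "treeify bin";
    }

    public static void main(String[] args) {
        System.out.println(onLongBin(8, 16));  // resize table
        System.out.println(onLongBin(8, 64));  // treeify bin
        System.out.println(onLongBin(3, 64));  // keep list
    }
}
```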
Code comments inline
Get value from key
Not much to see here; straightforward code.
If it's a list, traverse to the end (if it's a tree, traverse the tree).
Code comments inline.
Resizing table
Used both for resizing and for initializing the table the first time.
Code Comments inline
LinkedHashMap
Its Node extends HashMap's Node and adds before and after Entry references to keep track of insertion order.
Note that these new references form a doubly linked list, which is completely unrelated to HashMap's per-bin list that links elements with the same hashcode.
Code comments inline
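The effect of those extra links is easiest to see from the outside: LinkedHashMap iterates in insertion order, while plain HashMap iterates in bin order.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderDemo {
    public static void main(String[] args) {
        // The before/after links make iteration follow insertion order.
        Map<String, Integer> linked = new LinkedHashMap<>();
        linked.put("banana", 2);
        linked.put("apple", 1);
        linked.put("cherry", 3);
        System.out.println(linked.keySet());  // [banana, apple, cherry]
    }
}
```

(A LinkedHashMap can also be built with accessOrder = true, which re-links entries on every get and is the usual building block for LRU caches.)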
Skipped
Iterators
Serialization
Red-Black Tree balancing
Splitting the Tree
EntrySet and Values methods
Conclusion
For a core data structure used billions (probably trillions) of times, it's okay to introduce complexity to gain extra performance.
There are always trade-offs between performance and space overhead.
Bitwise operators are performant and powerful.
HashMap class has 2734 lines of code (& comments)!
Seemingly simple looking operations can involve huge amount of code.
Cake and the cherry
I got the following Twitter reply from Joshua Bloch, which just made my day.
It goes to show the power of Twitter in democratizing communication (e.g. between rockstar and average programmers), and also that the programmer community at large is kind, helpful and generous.
Thanks: Special thanks to my friend Neerav Vadodaria for proof-reading the article.