Data Structure & Algorithm Takeaway: Hash Algorithm

What is hash algorithm?

Hash algorithm is a method that turns an one-dimension array into a quicker data structure called hash table, and correspondingly, turns the operations (search, insert, modify, delete) on the array into operations on the hash table.

As we know, the operations on the array cost $O (n)$ time, where $n$ is the length of the array. However, the operations on the hash table reduces time complexity to $O (1)$ . This is what hash algorithm is all about.

The hash algorithm comprises two parts:

Construct the hash table beforehand;
Operation on the hash table when requested.

which makes the hash algorithm more of a data structure rather than an algorithm. That’s why it is often called hash table interchangeably.

What is the requirement for the array?

The array can contain any type of data (other than nested arrays), including integer, string, float, etc. Any type of array can be turned into a hash table.

What is the data structure of hash table?

Hash table is still a one-dimension array. However, it works together with the hash function, or else it is just an normal array.

What is hash function?

Hash function is a function or map $h$ paired with hash table to work. It maps the data itself to non-negative integer, which is called hash value. We call the data to be hashed as key. We usually denote them as $h = h (k)$ .

This hash value is used as the index of the data in the hash table, which we will cover later. The index of an array is always a non-negative integer, so the hash value must be a non-negative integer.

How hash algorithm turns an array into a hash table?

Hash algorithm takes an array and applies a hash function to each element, generating a hash value for each element. These hash values are then used as indices to store the elements in the hash table, mapping the original array elements to their corresponding positions in the hash table.

Why is it called “hash”?

In English, “hash” means “to chop up into small pieces”, or a kind of dish that is hashed from leftover food, with the extended meaning “to confuse or mix up”. The word “hash” was chosen because a hash function “messes up” or “scrambles” data — just like chopping and mixing ingredients in a hash meal.

What is the relation of hash algorithm, hash function and hash table?

Hash function is pre-defined. Hash function is not a specific function but any function that can work in the hash algorithm. Hash algorithm is all about choosing a hash function. When the hash function is set, the hash algorithm is set. Hash table is the result of the hash algorithm turning an array into.

Why does hash table achieve operation time complexity

O (1)

We take search operation as an example. In conventional array, as we know nothing about where the query data is stored, we have to brutally loop through the array to find the data, which is $O (n)$ complexity. In hash table, since the table is created by the hash value output by the hash function as the index, we can directly feed the query data into the hash function to get the index, and then directly access the data in the hash table. A function is always $O (1)$ , indexing directly from index is $O (1)$ , so the complexity of search operation in hash table is $O (1)$ .

What is the requirement of the hash function?

Ideally, it must be an injective function, which means different data must have different hash values (if $a \neq b$ , $h (a) \neq h (b)$ ), so that the data can be put in unique positions without collisions. In an array, one element can only occupy one index.

However, in practice, hash function cannot achieve this property and collisions do happen. In this case, the hash algorithm must handle collisions.

What does hash algorithm handling collisions do?

In this case, the hash algorithm must handle collisions, which is part of the algorithm. It includes:

When constructing the hash table: if a new element to be put into the hash table finds an existing element occupying its hash value position, the algorithm needs to provide a way to find a new position to store the new element.
When searching for an element (operating): as the hash table construction suggests, the element may not be in the position of its hash value, so the algorithm needs to provide a way to find the element in the hash table starting from its hash value position.

What are the methods for hash algorithm handling collisions?

Chaining: Each position in the hash table points to a linked list (or another data structure) that holds all elements that hash to the same index. When a collision occurs, the new element is simply added to the list at that index. In this case the hash table is not a one-dimension array anymore, but an array of linked lists.

An illustration of chaining, a collision handling strategy of hash algorithm. Source: GeeksForGeeks

Open Addressing: When a collision occurs, the algorithm searches for the next available position in the hash table to store the new element. The next position to probe for availability can be defined in various ways, such as:

Linear probing: checking the next index sequentially, for example, if the hash value is $h$ , the next index to check is $h + 1$ , then $h + 2$ , and so on.
Quadratic probing: checking indices in a quadratic manner, for example, if the hash value is $h$ , the next index to check is $h + 1$ , then $h + 4$ , then $h + 9$ , and so on.
Double hashing: it is still a linear probing, but the step size is not a fixed $1$ , but a flexible value depending on the element $k$ to be put. We denote this function as $h_{2}$ , the hash value is $h = h (k)$ is then the next index to check is $h (k) + h_{2} (k)$ , then $h (k) + 2 h_{2} (k)$ , and so on. Because the function $h_{2}$ is also a function mapping the element to a non-negative integer, it is also called a hash function. That’s why it is called double hashing.

Note that if the probing reach the end of the hash table, it will wrap around to the beginning of the table, so that the probing can continue until an empty position is found. Mathematically, it is $m \mod n$ , where $m$ is the index to check, and $n$ is the length of the hash table.

An illustration open addressing (linear probing), a collision handling strategy of hash algorithm. Source: LibreTexts Engineering

Is hash table always better than array?

Hash table reduces the operation complexity, but it could introduce some overhead, mainly the memory usage for storing the hash table. Because each value must be stored in the hash table, the length of the hash table is at least equal to the original array. It also depends on the quality of the hash function. Some bad hash functions can map to very large indices, leaving a lot of empty space in the hash table.

If the array is very small, the overhead of applying the hash function may not be worth it, and a simple array might be more efficient.

In practice, do we have to design the hash function ourselves?

In practice, we often use existing hash functions provided by programming languages or libraries, as they are well-tested and optimized. However, in some cases, especially when dealing with specific data types or structures, we may need to design a custom hash function to achieve better performance or to meet specific requirements.

What are the real applications of hash algorithm?

Python dictionary is implemented as hash table to store key-value pairs. That is why Python dictionary can retrieve the value of a key immediately in $O (1)$ time. The dictionary is a core data structure in this higher-level language: Python. This makes Python very easy to use. The similar data structure are also implemented in other programming languages, such as Java’s HashMap, C++’s unordered_map, and JavaScript’s Object or Map.
Caching: Hash tables are used in caching mechanisms to store and quickly retrieve frequently accessed data. The caching mechanism is widely used in web applications, databases, and operating systems to improve performance by reducing the need to recompute or fetch data repeatedly.
Database Indexing: Hash indexes allow for fast data retrieval in databases.
Cryptography: For some hash functions, it is very difficult to reverse the hash function. That’s why hash functions are used in passcodes. For example, the password of a user is stored in the database as a hash value, so that even if the database is compromised, the original password cannot be easily retrieved. This is called password hashing.
Data Deduplication and Integrity Check: Some hash functions can map any length of binary sequence. Because every file or data is a binary sequence in computer, they all have a unique hash value, which is exactly corresponding to their unique content. Therefore, if two files or data have the same hash value, they are exactly the same. This feature makes hash algorithm useful for checking file or data integrity.