Hash Functions

Overview

This post is part of the Cryptography Series

Prerequisite Knowledge: None

Extreme Basics

Hash functions at their core serve one purpose: They take an arbitrary value in (typically, of ANY LENGTH), and return a FIXED-LENGTH arbitrary value. This arbitrary returned value is commonly referred to as “the hash” of the input data. As an example, a hash function could receive any value (of any length), and return a value from 0-255 (a single byte). Note that this is just an example, hash functions have a wide variety of return values unique to the specific function.

Of course, this means that there are a lot of different ways to implement hash functions. For instance, a function that takes a numerical input and returns “0” if it’s even and “1” if it’s odd is technically a hash function. But it's not very useful, which brings me to the next topic:

Good (or Ideal Cryptographic) Hash Functions

Almost always, when we refer to hash functions in relation to blockchain, we’re referring to ideal cryptographic hash functions. In this post, I'll call them "good" hash functions, to illustrate the point that some hash functions are bad for secure cryptography, and so I don't have to keep typing the words "ideal cryptographic".

Radical Asymmetry

Good hash functions are designed such that the original input to the hash function is extremely difficult to calculate based on the output. However, given a specific input, its actually very easy to compute the hashed output. In other words, the concept of "radical asymmetry" applies. In the above example with even and odd numbers, you can learn something about the input given its hash (specifically, whether it is even or odd), so it is not a very good hash function. A good hash function should never allow hashes to impart any knowledge about the initial input.

Collision Resistance

Good hash functions also have the property of collision resistance. This means that, in a reasonable span of time, no two inputs should ever result in the same hash, AKA a collision. This reveals another failure of the even/odd hash function I mentioned above. Both inputs “2” and “6” will result in the same hash. For a simple example of why this matters, check out this article on checksums! The whole point of checksums is to detect whether two files match each other by checking their hashes. Collision resistance is a key feature that many other hash function applications depend upon.

You may now be thinking:

"But wait, can’t all hash functions accept infinite inputs and map them to a finite set of hashes? That means there HAVE to be collisions! There aren't enough hashes to go around!”

The answer to this question is technically yes, and practically no. Technically this is true, but hash algorithms like Keccak256 output hashes that are 256 bits long. If you were hashing at 1 quadrillion hashes per second (which is very, very fast), it would theoretically take roughly 10^16 years to find a collision.

Uniformity Across Degrees and Range

All good hash functions should have both of the above characteristics as they apply to each “degree” (or bit) of the hash. This property means that, if we were to treat a hash function's output as a series of 1-bit outputs of their own special hash functions, the special hash functions for each individual bit should not give away any information about the input. We should never have a bit that is "always 1 when the input is odd" or "always zero when the input starts with the letter Q", for instance. This is actually just the property of radical asymmetry as it applies to each degree.

This property also means that the likelihood of finding hash collisions should decrease as more bits are added to the hash. These characteristics should apply the same to each bit, regardless of their order within the hash. There should never be any correlation between two bits in the hash. For instance, the probability of bits 1,2,4,7,15, and 19 being the same for two hashes should be equal to the probability of bits 21,22,23,26,31, and 44 being the same, since in both cases we are dealing with a set of 6 different bits. And this should be true for any 6 bits you could pick from the hash. The features mentioned in this paragraph and the one immediately before it are part of a property known as "uniformity across degrees", as all degrees have the same properties to them.

The property of uniformity across degrees states that the likelihoods of a 0 or a 1 in each bit should be roughly 50%. This actually helps to form the property of uniformity across range, which states that there's an equal probability of getting any number within the entire potential output range of the function.

Determinism

A very important, but easy to understand concept about good hash functions is that they are deterministic. No matter how many times you hash the word "cat" with Sha256, you'll get:

77af778b51abd4a3c51c5ddd97204a9c3ae614ebccb75a606c3b6865aed6744e

It doesnt matter what time of day it is, whether this is the first or five-thousandth time you've hashed this, or what your favorite flavor of ice cream is. This hash will never change, which is great and is a cornerstone feature of good hash functions.

TL, DR

Simply put, the best way to describe a good (cryptographic) hash function is to treat it as a random number generator, EXCEPT that the same input will always produce the same output.

Sooo… What?

Blockchains were created to be an immutable record for all transactions within a system. If we want to ensure that the transactions are actually immutable, we can create a hash of all transactions on the blockchain. This way, everybody can check their entire copy of the blockchain by simply comparing this hash. Even if one transaction, or just one bit of one transaction is different, the hashes won’t match at all, and we will know that something went wrong. To be more precise, most blockchains use merkle trees, which are an application of hashes, to detect changes. But we won't get into that right now. Since hash functions shouldn’t allow two inputs to create the same output, it is theoretically impossible for an attacker to create a new version of the blockchain that results in the same hash as the original. This is how blockchains ensure transaction immutability.

In smart contracts, hashes can be implemented to obscure a value (a number, a string of text, etc.) so that it isn’t publicly readable. By using a commit-reveal scheme, one can “commit” a piece of data by putting its hash on the blockchain. Since it is the hash of the data, and we’re using a good hash function (:wink:), nobody can reconstruct the original input given its hash. Then, at a time when it is deemed acceptable to release the data, the original input can be "revealed", and interested parties can verify that it hashes to the value that was originally committed. This little trick is great for setting up games in which players don’t want to reveal their moves to other players immediately, or for elections in which voters don’t want to reveal their votes immediately.

There are a myriad of other different concepts within the general blockchain environment that utilize hash functions, including digital signatures, proof of work, decentralized storage retrieval (used in IPFS), and GitHub commit management. This is, of course, not close to an exhaustive list, and people are constantly discovering new and exciting applications of hash functions.

Thanks for reading! If you have any comments or questions, we absolutely 100% encourage you to leave them here.

5 Likes

Hi @burrrata,

Welcome to the community :wave:

I am loving the crypto in cryptocurrencies series.

Thanks for reading! There are a lot of lists I don’t want to be a part of, but I’d love to be on the Awesome Blockchain list if you end up finding it :smile:

2 Likes

Thanks a lot for this summary @DericEcourcy!

A few weeks ago we had a live session with @nikeshnazareth about this topic. I’ll share my notes here.

Properties of hash functions

  • Fixed size output, not dependent on the input.
  • Pre-image resistance: It’s a one-way function. The output looks pretty random. Small changes in the input give very different output. Given an output, it shouldn’t be possible to find the input.
  • Second pre-image resistance: There is no information to be gained after knowing a few examples of pre-images.
  • Collision resistance: It is computationally impossible to find two inputs that map to the same output.
  • All properties apply to all degrees: Every bit is equally resistant.

Hash functions can be seen as an oracle that returns random bits, except that when it’s a repeated input it will remember the previous answer and return it.

Use cases

  • Passwords: Don’t store the password directly. Store a hash of the password plus a salt.
  • Digital signatures: Usually what you sign is a hash of the message.
  • Pseudo-random number generator: Use a small random number and hash it to get a new bigger random number. There is an entropy caveat, because they are correlated.
  • Hash commitments.
1 Like

Would like to dig a bit deeper on Uniformity across degrees and range. Is there any good reference on theoretical proof or numerical tests on this property? Very curious about if every bit has roughly 50% chance of being either 0 or 1.