
Hash Functions

Topics

Introduction
MDx and SHAx families of Hash Functions
Attacks on Hash Functions

Def. Hash Function

A hash function is a mapping H such that:

H maps inputs of arbitrary lengths to outputs of length n, where n is fixed.
Mathematically: H: {0,1}^* → {0,1}ⁿ
- In general, H: S → T where |S| > |T|.
H(x) can be efficiently computed for all x ∈ {0,1}^*.
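As a quick illustration of the fixed output length, the sketch below uses SHA-256 from Python's hashlib as a stand-in for H, so n = 256. The choice of SHA-256 is an assumption made for the example; the definition itself does not depend on it.

```python
import hashlib

def H(x: bytes) -> bytes:
    """SHA-256 used as a concrete H: arbitrary-length input, 256-bit output."""
    return hashlib.sha256(x).digest()

# Inputs of very different lengths all map to n = 256 output bits.
for msg in (b"", b"a", b"a" * 1_000_000):
    print(len(msg), "input bytes ->", 8 * len(H(msg)), "output bits")
```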

Security Requirements

Preimage resistance
2nd preimage resistance
Collision resistance

Notes: A preimage-resistant hash function is called a one-way hash function (OWHF). A hash function that is preimage, 2nd preimage, and collision resistant is called a cryptographic hash function.

Applications of Hash Functions

Password protection on a multi-user computer system

Server stores (uid, H(passwd)) in a password file.
If an attacker gets the file, she does not learn any passwords.
Requires preimage resistance.
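The scheme above can be sketched as follows. This is a minimal illustration that assumes SHA-256 as H and an in-memory dictionary as the password file; the names password_file, register and check are hypothetical. Deployed systems usually add a per-user salt and a deliberately slow hash, which is outside the scheme described here.

```python
import hashlib
import hmac

def H(passwd: str) -> str:
    """Hash a password with SHA-256 (assumed choice of H) and return hex."""
    return hashlib.sha256(passwd.encode("utf-8")).hexdigest()

# Hypothetical in-memory "password file": uid -> H(passwd).
password_file = {}

def register(uid: str, passwd: str) -> None:
    # Only the hash is stored, never the password itself.
    password_file[uid] = H(passwd)

def check(uid: str, passwd: str) -> bool:
    stored = password_file.get(uid)
    if stored is None:
        return False
    # Constant-time comparison of the stored and recomputed hashes.
    return hmac.compare_digest(stored, H(passwd))

register("alice", "correct horse")
print(check("alice", "correct horse"), check("alice", "wrong guess"))
```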

Modification Detection Codes (MDC)

To ensure that a message m is not modified by unauthorized means, one computes H(m) and protects H(m) from unauthorized modification. (e.g. virus protection).

Message digests for digital signature schemes

For reasons of efficiency, instead of signing a (long) message, the (much shorter) message digest is signed.
Requires cryptographic hash function.
Why collision resistance?

Suppose Alice finds x₁, x₂ with H(x₁) = H(x₂) and x₁ ≠ x₂.
Then Alice signs x₁ and later claims to have signed x₂.

Message Authentication Codes (MAC)

Provides data integrity and data origin authentication.

Pseudorandom bit generation

Distilling random bits s from several 'random' sources x₁, x₂, ..., x_t
Outputs s = H(x₁, x₂, ..., x_t)
Used in OpenSSL and /dev/random

Key derivation function (KDF)

Deriving a cryptographic key from a shared secret key.

Remarks.

Collision resistance is not always necessary.
Depending on the application, other properties may be needed, for example, 'near-collision resistance', or 'partial preimage resistance', etc.

General Attacks

For finding preimages

Given y ∈ {0,1}ⁿ, repeatedly select distinct x' ∈ {0,1}^* until H(x') = y.
Infeasible if n ≥ 128.
Remark. It has been proven that this generic attack for finding preimages is optimal.
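The generic preimage search can be tried on a toy hash. The sketch below assumes H is SHA-256 truncated to n = 16 bits, so that about 2¹⁶ trials suffice; the truncation and the 8-byte counter encoding of candidates are assumptions made to keep the run short.

```python
import hashlib
from itertools import count

N_BITS = 16  # toy output length; real hash functions use n >= 128

def H(x: bytes, n_bits: int = N_BITS) -> int:
    """Toy hash: the top n_bits bits of SHA-256(x), as an integer."""
    digest = hashlib.sha256(x).digest()
    return int.from_bytes(digest, "big") >> (256 - n_bits)

def find_preimage(y: int, n_bits: int = N_BITS):
    """Try distinct inputs x' until H(x') == y; return (x', number of trials)."""
    for i in count():
        candidate = i.to_bytes(8, "big")
        if H(candidate, n_bits) == y:
            return candidate, i + 1

target = H(b"secret")
preimage, trials = find_preimage(target)
print("found", preimage.hex(), "after", trials, "trials")
```

The preimage found is almost never the original input; any x' with H(x') = y counts.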

For finding collision

Repeatedly select arbitrary distinct x ∈ {0,1}^* and store (H(x), x) in a table sorted by first entry.
Continue until a collision is found.
E(# of operations) ≈ sqrt(π2ⁿ/2) ≈ sqrt(2ⁿ) (by Birthday paradox)
E(space required) ≈ sqrt(π2ⁿ/2) ≈ sqrt(2ⁿ)
Infeasible if n ≥ 160
If n = 128:

E(run time) ≈ 2⁶⁴ (barely feasible)
E(space required) ≈ 7x10⁸ Tbytes (infeasible)

Remark. It has been proven that this generic attack for finding collisions is optimal.
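A sketch of the table-based collision search on a toy hash, assuming SHA-256 truncated to n bits. A Python dict is used in place of the sorted table described above; it plays the same role (lookup by hash value) and needs the same order of storage.

```python
import hashlib
from itertools import count

def H(x: bytes, n_bits: int) -> int:
    """Toy hash: the top n_bits bits of SHA-256(x), as an integer."""
    digest = hashlib.sha256(x).digest()
    return int.from_bytes(digest, "big") >> (256 - n_bits)

def find_collision(n_bits: int):
    """Hash distinct inputs, remembering each H(x), until two inputs collide."""
    table = {}  # H(x) -> x
    for i in count():
        x = i.to_bytes(8, "big")
        h = H(x, n_bits)
        if h in table:
            return table[h], x, i + 1
        table[h] = x

x1, x2, trials = find_collision(32)
print("collision after", trials, "hashes; sqrt(2^32) =", 2 ** 16)
```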

Pollard's rho algorithm for finding collision

Set x_{i+1} = H(x_i)
Search for i ≠ j with x_i = x_j
Then H(x_{i-1}) = H(x_{j-1}), which is a collision whenever x_{i-1} ≠ x_{j-1}

Image Source: https://link.springer.com/chapter/10.1007/978-981-99-1588-0_4/figures/1

Problem: naive implementation needs to store all x_i values to know when it loops.

Floyd's Cycle-Finding algorithm for finding collision

To avoid storing all the x_i's:

Set y₀ = x₀
Set y_{i+1} = H(H(y_i))
Then y_i = x_{2i}

Claim: x_i = y_i for some i > 0

x_i is eventually periodic
Suppose ∃ N and t > 0 such that x_i = x_{i+t} for all i > N
Let kt > N for some k
Then y_{kt} = x_{2kt} = x_{kt + kt} = x_{kt}

Hence one can detect collisions with only two elements of storage and 3 times the computational cost:

Store x_i and y_i
Compute x_{i+1} and y_{i+1}
If x_{i+1} == y_{i+1}: stop
Else: (x_i, y_i) → (x_{i+1}, y_{i+1}) and repeat

Example of Floyd's cycle implementation in python
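A minimal sketch, assuming a toy hash (SHA-256 truncated to 24 bits) so that the cycle is short. Phase 1 is the detection loop described above. Phase 2, which recovers the two colliding inputs by restarting one pointer at x₀, is the standard follow-up step and is an addition to the steps listed in these notes.

```python
import hashlib

N_BYTES = 3  # toy 24-bit hash so the walk cycles quickly

def H(x: bytes) -> bytes:
    """Toy hash: the first N_BYTES bytes of SHA-256(x)."""
    return hashlib.sha256(x).digest()[:N_BYTES]

def floyd_collision(x0: bytes):
    """Return (a, b) with a != b and H(a) == H(b), or None if x0 is on the cycle."""
    # Phase 1: x moves one step, y moves two steps, until x_i == y_i = x_{2i}.
    x, y = H(x0), H(H(x0))
    while x != y:
        x = H(x)
        y = H(H(y))
    # Phase 2: restart x at x0 and move both one step at a time.
    # They would meet at the first point of the cycle, so we stop one step
    # earlier, when their successors agree: the two predecessors collide.
    x = x0
    if x == y:
        return None  # x0 itself lies on the cycle: no tail, no collision here
    while H(x) != H(y):
        x, y = H(x), H(y)
    return x, y

a, b = floyd_collision(b"seed")
print(a.hex(), b.hex(), H(a).hex(), H(b).hex())
```

The start value b"seed" has 4 bytes while every hash output has 3, so it cannot lie on the cycle and the None branch is not taken for this input.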

VW Parallel Collision Search

Van Oorschot & Wiener (1993)
Very small space requirements
Easy to parallelize:

m-fold speedup with m processors

Described a 1996 US $10 million machine that can find collisions for 128-bit hash functions in 21 days; collisions for 160-bit hash functions (such as SHA-1) would take about 3500 years.
The collision-finding algorithm can easily be modified to find "meaningful" collisions
Define a distinguished point to be a bitstring whose first k bits are zeros:

Compute x_i = H(x_{i-1})
Store x_i if and only if x_i is a distinguished point.
If x_i is a distinguished point, compare it to all previously stored distinguished points.
If there is a match, backtrace to find a collision.

Compared to brute force search:

Reduce storage by a factor of 2^k
Increase computational time by an additive term of about 2^k
Parallelizes trivially to multiple machines or processors.
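The distinguished-point idea can be sketched on one processor by running many short trails in sequence. Assumptions for this example: a toy 24-bit hash (truncated SHA-256), k = 6, and 8-byte counters as trail starting points. Because the starting points are longer than any hash output, no starting point can lie on another trail, which keeps the backtrace simple. The names trail, backtrace and MAX_TRAIL are hypothetical.

```python
import hashlib
from itertools import count

N_BYTES = 3             # toy 24-bit hash
K = 6                   # distinguished point: first K bits are zero
MAX_TRAIL = 20 * 2**K   # give up on trails stuck in a cycle with no such point

def H(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()[:N_BYTES]

def is_distinguished(x: bytes) -> bool:
    return int.from_bytes(x, "big") >> (8 * len(x) - K) == 0

def trail(start: bytes):
    """Iterate x_i = H(x_{i-1}) until a distinguished point; return (point, length)."""
    x = start
    for steps in range(1, MAX_TRAIL + 1):
        x = H(x)
        if is_distinguished(x):
            return x, steps
    return None

def backtrace(start_a, len_a, start_b, len_b):
    """Two trails end at the same point: walk them in lockstep to the merge."""
    a, b = start_a, start_b
    while len_a > len_b:            # align distances to the shared end point
        a, len_a = H(a), len_a - 1
    while len_b > len_a:
        b, len_b = H(b), len_b - 1
    while H(a) != H(b):             # stop just before the trails merge
        a, b = H(a), H(b)
    return a, b

def find_collision():
    seen = {}  # distinguished point -> (start, trail length); the only storage
    for i in count():
        start = i.to_bytes(8, "big")
        result = trail(start)
        if result is None:
            continue
        point, length = result
        if point in seen:
            return backtrace(start, length, *seen[point])
        seen[point] = (start, length)

a, b = find_collision()
print(a.hex(), b.hex(), H(a).hex())
```

In a parallel setting each processor would run trail on its own starting points and report only the distinguished points to a shared table.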

Image Source: https://people.scs.carleton.ca/~paulv/papers/JoC97.pdf

Hash Function Construction

1. The Davies-Meyer Construction

Idea: Build a hash function from block cipher.

Let E_k be an m-bit block cipher with n-bit key.
Let IV be a fixed m-bit initial value.
To compute H(x), do:

Break up x||1 into n-bit blocks: x̄ = x₁, x₂, ..., x_t
Pad out the last block with 0 bits if required.
Define H₀ = IV.
Compute H_i = E_{x_i}(H_{i-1}) ⊕ H_{i-1} for i = 1, ..., t
Define H(x) = H_t
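The steps above can be sketched with a toy block cipher. Everything about toy_encrypt is hypothetical: it is a small 8-round Feistel network with m = 32-bit blocks and n = 64-bit keys, invented for this example and not secure. The padding is done at byte level (0x80 is the bit 1 followed by seven 0 bits).

```python
MASK16 = 0xFFFF
KEY_BYTES = 8          # n = 64-bit key, so message blocks are 64 bits
IV = 0x01234567        # arbitrary fixed m = 32-bit initial value (assumption)

def toy_encrypt(key: bytes, block: int) -> int:
    """Hypothetical 32-bit Feistel cipher E_key(block). Illustration only."""
    round_keys = [int.from_bytes(key[i:i + 2], "big") for i in range(0, 8, 2)]
    left, right = block >> 16, block & MASK16
    for rnd in range(8):
        k = round_keys[rnd % 4]
        f = ((right * 0x9E37 + k + rnd) ^ (right >> 5)) & MASK16
        left, right = right, left ^ f
    return (left << 16) | right

def davies_meyer(x: bytes, iv: int = IV) -> int:
    # Form x||1 and pad the last block with 0 bits.
    padded = x + b"\x80"
    padded += b"\x00" * (-len(padded) % KEY_BYTES)
    h = iv                                   # H_0 = IV
    for i in range(0, len(padded), KEY_BYTES):
        x_i = padded[i:i + KEY_BYTES]        # message block used as the key
        h = toy_encrypt(x_i, h) ^ h          # H_i = E_{x_i}(H_{i-1}) XOR H_{i-1}
    return h                                 # H(x) = H_t

print(f"{davies_meyer(b'hello world'):08x}")
```

The message block enters as the cipher key and the chaining value as the plaintext; the final XOR is what stops the step from being inverted by simply decrypting.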

https://fr.wikipedia.org/wiki/Construction_de_Davies-Meyer#/media/Fichier:Davies-meyer.png

2. The Merkle-Damgard Construction

Fixed IV ∈ {0,1}ⁿ
Compression function f: {0,1}^{n+r} → {0,1}ⁿ (efficiently computable)

To compute H(m) where m has bitlength b < 2^r do:
1. Break up m into r-bit blocks: m̄ = m₀, m₁, ..., m_{t-1}
  Pad out the last block with 0's if necessary.
2. Define m_t, the length-block, to hold the right-justified binary representation of b.
3. Define H₀ = IV
4. Compute H_{i+1} = f(H_i, m_i) for i = 0, 1, ..., t
  H_i's are called chaining variables
5. Return H(m) = H_{t+1}
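A minimal sketch of the construction, assuming a toy compression function f built from truncated SHA-256 with n = 32 and r = 64, and an all-zero IV. These parameters are chosen for illustration; only the iteration structure follows the steps above.

```python
import hashlib

N_BYTES = 4            # n = 32-bit chaining variable
R_BYTES = 8            # r = 64-bit message blocks
IV = bytes(N_BYTES)    # fixed public IV (all zeros here, an assumption)

def f(h: bytes, block: bytes) -> bytes:
    """Toy compression function {0,1}^(n+r) -> {0,1}^n."""
    assert len(h) == N_BYTES and len(block) == R_BYTES
    return hashlib.sha256(h + block).digest()[:N_BYTES]

def md_hash(m: bytes) -> bytes:
    b = 8 * len(m)                                    # bitlength of m
    padded = m + b"\x00" * (-len(m) % R_BYTES)        # step 1: zero padding
    blocks = [padded[i:i + R_BYTES] for i in range(0, len(padded), R_BYTES)]
    blocks.append(b.to_bytes(R_BYTES, "big"))         # step 2: length block m_t
    h = IV                                            # step 3: H_0 = IV
    for block in blocks:                              # step 4: H_{i+1} = f(H_i, m_i)
        h = f(h, block)
    return h                                          # step 5: H(m) = H_{t+1}

print(md_hash(b"hello world").hex())
```

The length block is what distinguishes messages that differ only in trailing zero bytes, since their zero-padded blocks are identical.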

https://www.researchgate.net/figure/The-Merkle-Damgard-construction-of-SHA-0-1-and-2-hash-functions_fig2_339991930

MDx and SHAx families of Hash Functions

MDx-Family

MDx is a family of iterated hash functions using the Merkle-Damgard construction.
MD4 was proposed by Ron Rivest in 1990.
MD4 has 128-bit output.
MD5 is a strengthened version of MD4 designed by Ron Rivest in 1991.
MD5 has 128-bit output

SHA-1 and SHA-2

Secure Hash Algorithm (SHA) was designed by NSA and published by NIST in 1993 (FIPS 180).
160-bit iterated hash function, based on MD4.
Slightly modified to SHA-1 (FIPS 180-1) in 1995 in order to fix an (undisclosed) security weakness.
SHA-2 family designed by NSA and published by NIST in 2001.

Based on MDx-family of designs, but more complicated.
Four security levels: SHA-224, SHA-256, SHA-384, SHA-512

General Structure of MDx and SHAx (except for SHA-3) hash functions

MD4, MD5, SHA-1 and SHA-2 are iterated hash functions
Input is divided into 512-bit blocks (1024-bit blocks for SHA-512).
Input is padded (1 followed by 0's) to a bitlength congruent to 448 mod 512 (896 mod 1024 for SHA-512).
A final (zero-padded) 64-bit number containing the bitlength of the input in binary is appended (128-bit for SHA-512)
A complicated function (compression function) f: {0,1}ⁿ x {0,1}⁵¹² → {0,1}ⁿ
IV is a public constant
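The padding rule for the 512-bit-block functions can be sketched as below. The function name md_pad is hypothetical. SHA-1 and SHA-256 store the 64-bit length big-endian, while MD4 and MD5 store it little-endian, hence the byteorder parameter.

```python
def md_pad(message: bytes, byteorder: str = "big") -> bytes:
    """Pad to a multiple of 512 bits: a 1 bit, then 0 bits, then a 64-bit length."""
    bit_length = 8 * len(message)
    padded = message + b"\x80"                    # the bit 1 followed by seven 0 bits
    padded += b"\x00" * ((56 - len(padded)) % 64) # now 448 mod 512 bits long
    padded += bit_length.to_bytes(8, byteorder)   # 64-bit length of the input
    return padded

for size in (0, 3, 55, 56, 64):
    print(size, "bytes ->", len(md_pad(b"a" * size)) // 64, "block(s)")
```

An input of 56 bytes already needs a second block, because the 1 bit and the 64-bit length no longer fit in the first one.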

https://www.researchgate.net/figure/General-structure-of-a-cryptographic-hash-function_fig12_330659532


SHA-3

The SHA-2 design is similar to SHA-1, so there were concerns that the SHA-1 weaknesses would extend to SHA-2.
SHA-3: NIST hash function competition

64 candidates submitted by Oct. 31, 2008 deadline
51 were accepted for the first round
14 were selected for the second round (Jul, 2009)
Encouraged the public to study the hash functions
Selected a list of finalists (Third quarter, 2010)
Announced a winner: Keccak (Bertoni, Daemen, Peeters, Van Assche) (Second quarter 2012)
Standard published by NIST in 2015.

High Level Description of SHA-3

SHA-3 is a sponge construction using the Keccak permutation.
Sponge construction:

Components:

State S ∈ {0,1}^b
function f: {0,1}^b → {0,1}^b (often a permutation)
a padding function.

The state S is divided into two parts:

R ∈ {0,1}^r and C ∈ {0,1}^c where b=r+c
r is called the rate and c is called the capacity.

To hash a message m:

Initialize state S=R||C to zero
Pad message to break into r-bit blocks
[Absorb Stage] For each r-bit block B:

Replace R←R⊕B
Replace S←f(S)

[Squeeze Stage] While more output bits are needed:

Output R
Replace S←f(S)
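The absorb and squeeze stages can be sketched with a toy sponge. Assumptions: b = 128, r = c = 64, f is truncated SHA-256 (a function rather than a permutation, and not Keccak), and the padding is a simple byte-level rule loosely modelled on a "1, zeros, 1" pattern. The output of this sketch is unrelated to real SHA-3 digests.

```python
import hashlib

B = 16   # state size b = 128 bits
R = 8    # rate r = 64 bits; capacity c = b - r = 64 bits

def f(state: bytes) -> bytes:
    """Stand-in for the permutation: truncated SHA-256 on the b-bit state."""
    return hashlib.sha256(state).digest()[:B]

def pad(m: bytes) -> bytes:
    """Toy padding to a multiple of r bits: first pad byte |= 0x01, last |= 0x80."""
    p = bytearray(R - len(m) % R)   # between 1 and R padding bytes
    p[0] |= 0x01
    p[-1] |= 0x80
    return m + bytes(p)

def sponge_hash(m: bytes, out_len: int) -> bytes:
    state = bytes(B)                              # S = R||C initialised to zero
    padded = pad(m)
    for i in range(0, len(padded), R):            # absorb stage
        block = padded[i:i + R]
        rate = bytes(s ^ t for s, t in zip(state[:R], block))   # R <- R xor B
        state = f(rate + state[R:])                             # S <- f(S)
    out = b""
    while len(out) < out_len:                     # squeeze stage
        out += state[:R]                          # output R
        state = f(state)                          # S <- f(S)
    return out[:out_len]

print(sponge_hash(b"hello world", 20).hex())
```

Only the rate part is ever XORed with input or emitted as output; the capacity part is touched only through f.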

Keccak Permutation

The Keccak permutation is 24 repetitions of:
XOR constant (affine) ∘ Bitwise combine (non-linear) ∘ Permutations ∘ Linear mixing

Remarks. NIST's Policy on Hash Functions (Dec, 2022)

https://csrc.nist.gov/projects/hash-functions/nist-policy-on-hash-functions
Should stop using SHA-1 for digital signatures and other applications that require collision resistance
May still use SHA-1 for HMAC, KDFs and random number generators.
May use SHA-2 for all applications that employ secure hash algorithms.
SHA-3 may also be used, but is not required.
Withdraw SHA-1 from approved usage by end of 2030.

Attacks on MDx, SHA-1 and SHA-2

Generic Attacks

To find a preimage for an n-bit hash:

Try random inputs until desired hash is found.
O(2ⁿ) operations on average

To find a collision for an n-bit hash:

Try random inputs until two of them hash to the same value
O(2^(n/2)) operations on average (Birthday paradox)

Non-generic Attacks

MD4 (RSA Laboratories)

Collision in 2¹⁵ steps (Dobbertin, 1996)
Collision in 2³ steps (Wang et al., 2005)
Collision in 2⁰ steps (Leurent, 2008)

MD5 (RSA Laboratories)

Collision in MD5 compression function (Dobbertin, 1996)
Collision in 2³⁹ steps (Wang and Yu, 2004)
Collision in 31 seconds on notebook computer (Klima, 2006)
Preimage (theoretical) in 2^123.4 (Sasaki and Aoki, 2009)

SHA (NIST/NSA based on MD4)

Collision (theoretical) in 2⁶¹ (Chabaud and Joux, 1998)
Collision found in 2⁵¹ (Joux et al., 2004)
Collision found in 2⁴⁰ (Wang et al. 2004)

SHA-1 (NIST/NSA based on SHA)

Collision (theoretical) in 2⁶³ (Wang et al., 2005)
Collision found in 2^63.1 (CWI/Google, 2017)
No preimage/2nd preimage attacks known on SHA-1

Remarks.

MD5 should not be used if collision resistance is required, but is not horrible as a one-way function.
MD5 remains widely used in legacy software.
MD5 shows up about 850 times in Windows source code.

Wang's Collision Finding Attack on SHA-1

Fix any n-bit string I.
Find two (different) 1-block messages x and y such that F(I,x) = F(I,y)
By selecting I=IV, Wang's attack can be used to find two 1-block messages x and y such that SHA-1(x)=SHA-1(y)
The attack takes about 2⁶³ steps
The attacker does not have much control over x and y, so these messages are essentially meaningless.

Remark. Wang's collision-finding attack on MD5 works similarly, except that it finds two (different) 2-block messages x=(x₁,x₂) and y=(y₁,y₂) such that F(I,x)=F(I,y). It takes about 2³⁹ steps, and the messages are still essentially meaningless because of the limited control over x and y.

Exploit a Single Hash Function

Let H be a hash function (MD5, SHA-1, ...)
Suppose we find x,y such that H(x)=H(y)
Suppose the collision finding method does not allow us to control the structure of x and y.
Suppose the method takes considerable (but feasible) time.
Question: How can an attacker, who has expended considerable resources to find two meaningless messages x and y that collide, make repeated use of this collision in a practical setting?

Extending a Collision

Notation: For a t-block message x=(x₁,...,x_t) and n-bit string I, define F(I,x)=H_t where H₀=I and H_i=f(H_{i-1},x_i) for i=1,2,...,t
Observation: Suppose x and y are messages of the same block-length such that F(I,x)=F(I,y). Then F(I,x||z)=F(I,y||z) for any message z. Furthermore, if I=IV, then H(x||z)=H(y||z) for any message z.
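The observation can be checked on a toy Merkle-Damgard hash. Assumptions: a 24-bit compression function built from truncated SHA-256, 64-bit blocks, an all-zero IV, and a one-block collision found by a birthday search rather than by Wang's attack. The name find_block_collision is hypothetical.

```python
import hashlib
from itertools import count

N_BYTES, R_BYTES = 3, 8      # toy sizes: n = 24, r = 64
IV = bytes(N_BYTES)

def f(h: bytes, block: bytes) -> bytes:
    return hashlib.sha256(h + block).digest()[:N_BYTES]

def F(I: bytes, blocks) -> bytes:
    """F(I, x) = H_t with H_0 = I and H_i = f(H_{i-1}, x_i)."""
    h = I
    for block in blocks:
        h = f(h, block)
    return h

def H(m: bytes) -> bytes:
    """Toy Merkle-Damgard hash: zero padding plus a final length block."""
    padded = m + b"\x00" * (-len(m) % R_BYTES)
    blocks = [padded[i:i + R_BYTES] for i in range(0, len(padded), R_BYTES)]
    blocks.append((8 * len(m)).to_bytes(R_BYTES, "big"))
    return F(IV, blocks)

def find_block_collision(I: bytes):
    """Birthday search for one-block x != y with f(I, x) == f(I, y)."""
    table = {}
    for i in count():
        x = i.to_bytes(R_BYTES, "big")
        h = f(I, x)
        if h in table:
            return table[h], x
        table[h] = x

x, y = find_block_collision(IV)
for z in (b"", b"pay Bob $10", b"a much longer suffix " * 5):
    print(H(x + z) == H(y + z))   # the same collision works for every suffix z
```

The expensive search runs once; each new suffix z then yields a fresh colliding pair at no extra cost, which is what the following sections exploit.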

Exploiting Wang's Collision

Suppose MD5 (or SHA-1) is being used in a hash-then-sign signature scheme.
Let M₁ and M₂ be two documents in postscript format. Suppose M₁ is harmless for Alice and M₂ is harmful for Alice.
Daum and Lucks (2005) showed how Wang's attack on MD5 (or possibly SHA-1) can be used to find two new postscript files M̂₁ and M̂₂ such that:

When M̂₁ is viewed or printed, it looks the same as M₁. Similarly for M̂₂.
MD5(M̂₁)=MD5(M̂₂)

The attacker Eve sends the postscript file M̂₁ to Alice. Alice views or prints the file, then signs it and returns it to Eve. Since MD5(M̂₁)=MD5(M̂₂), Eve also has Alice's signature on M̂₂.