H maps inputs of arbitrary lengths to outputs of length n, where n is fixed.
Mathematically: H: {0,1}* → {0,1}n
In general, H: S → T where |S| > |T|.
H(x) can be efficiently computed for all x ∈ {0,1}*.
Security Requirements
Preimage resistance
Hard to invert given just an output.
2nd preimage resistance
Hard to find a second input that has the same hash value as a given first input.
Collision resistance
Hard to find two different inputs that have the same hash values.
Notes: Preimage resistant hash function is called one-way hash function (OWHF), and preimage/2nd preimage/collision resistant hash function is called cryptographic hash function.
Applications of Hash Functions
Password protection on a multi-user computer system
Server stores (uid, H(passwd)) in a password file.
If an attacker gets the file, she does not learn any passwords.
Requires preimage resistance.
Modification Detection Codes (MDC)
To ensure that a message m is not modified by unauthorized means, one computes H(m) and protects H(m) from unauthorized modification. (e.g. virus protection).
Message digests for digital signature schemes
For reasons of efficiency, instead of signing a (long) message, the (much shorter) message digest is signed.
Requires cryptographic hash function.
Why collision resistance?
Suppose Alice find x1, x2 with H(x1) = H(x2) and x1 != x2.
Then Alice signs x1 and later claim to have signed x2.
Message Authentication Codes (MAC)
Provides data integrity and data origin authentication.
Pseudorandom bit generation
Distilling random bits s from several 'random' sources x1, x2, ..., xt
Outputs s = H(x1, x2, ..., xt)
Used in OpenSSL and /dev/random
Key derivation function (KDF)
Deriving a cryptographic key from a shared secret key.
Remarks.
Collision resistance is not always necessary.
Depending on the application, other properties may be needed, for example, 'near-collision resistance', or 'partial preimage resistance', etc.
General Attacks
For finding preimages
Given y ∈ {0,1}n, repeatedly select distinct x' ∈ {0,1}* until H(x') = y.
Infeasible if n ≥ 128.
Remark. It has been proven that this generic attack for finding preimages is optimal.
For finding collision
Repeatedly select arbitrary distinct x ∈ {0,1}* and store (H(x), x) in a table sorted by first entry. Continue until a collision is found.
E(# of operations) ≈ sqrt(π2n/2) ≈ sqrt(2n) (by Birthday paradox)
E(space required) ≈ sqrt(π2n/2) ≈ sqrt(2n)
Infeasible if n ≥ 160
If n = 128:
E(run time) ≈ 264 (barely feasible)
E(space required) ≈ 7x108 Tbytes (infeasible)
Remark. It has been proven that this generic attack for finding preimages is optimal.
Pollard's rho algorithm for finding collision
Set xi+1 = H(xi)
Search for xi = xj
Then H(xi-1) = H(xj-1)
Image Source: https://link.springer.com/chapter/10.1007/978-981-99-1588-0_4/figures/1Problem: naive implementation needs to store all xi values to know when it loops.
Floyd's Cycling-Finding algorithm for finding collision
To avoid storing all the xi's:
Set y0 = x0
Set yi+1 = H(H(yi))
Then yi = x2i
Claim: xi = yi for some i
xi is eventually periodic
Suppose ∃ N such that xi = xi+t for i > N
Let kt > N for some k
Then ykt = x2kt = xkt + kt = xkt
Hence one can detect collisions with only two elements of storage and 3 times the computational cost:
Store xi and yi
Compute xi+1 and yi+1
If xi+1 == yi+1: stop
Else: (xi, yi) → (xi+1, yi+1) and repeat
Example of Floyd's cycle implementation in python
VW Parallel Collision Search
Van Oorschot & Wiener (1993)
Very small space requirements
Easy to parallelize:
m-fold sppedup with m processors
Described a 1996 US $10 million machine which can find collisions for 128-bit hash functions in 21 days, collisions for 160-bit hash functions (such as SHA1) would take about 3500 years.
The collision-finding algorithm can easily be modified to find "meaningful" collisions
Define a distinguished point to be bitstring whose first k bits are zeros:
Compute xi = H(xi-1)
Store xi if and only if xi is a distinguished point.
If xi is a distinguished point, compare it to all previously stored distinguished points.
If there is a match, backtrace to find a collision.
Compared to brute force search:
Reduce storage by a factor of 2k
Increase computational time by an additive 2k
Parallelizes trivially to multiple machines or processors.
Should stop using SHA-1 for digital signatures and other applications that require collision resistance
May still use SHA-1 for HMAC, KDFs and random number generators.
May use HSA-2 for all applications that employ secure hash algorithms.
SHA-3 may also be used, but not a must
Withdraw SHA-1 from approved usage by end of 2030.
Attacks on MDx, SHA-1 and SHA-2
Generic Attacks
To find a preimage for an n-bit hash:
Try random inputs until desired hash is found.
O(2n) operations on average
To find a collision for an n-bit hash:
Try random inputs until two matches found
O(2n/2) on average (Birthday paradox)
Non-generic Attacks
MD4 (RSA Laboratories)
Collision in 215 steps (Bobbertin, 1996)
Collision in 23 steps (Wang et al., 2005)
Collision in 20 steps (Leurent, 2008)
MD5 (RSA Laboratories)
Collision in MD5 compression function (Dobbertin, 1996)
Collision in 239 stpes (Wang and Yu, 2004)
Collision in 31 seconds on notebook computer (Kilma, 2006)
Preimage (theoretical) in 2123.4 (Sasaki and Aoki, 2009)
SHA (NIST/NSA based on MD4)
Collision (theoretical) in 261 (Chabaud and Joux, 1998)
Collision found in 251 (Joux et al., 2004)
Collision found in 240 (Wang et al. 2004)
SHA-1 (NIST/NSA based on SHA)
Collision (theoretical) in 263 (Wang et al., 2005)
Collision found in 263.1 (CWI/Google, 2017)
No preimage/2nd preimage attacks known on SHA-1
Remarks.
MD5 should not be used if collision resistance is required, but is not horrible as a one-way function.
MD5 remains widely used in legacy software.
MD5 shows up about 850 times in Windows source code.
Wang's Collision Finding Attack on SHA-1
Fix any n-bit string I.
Find two (different) 1-block message x=x' and y=y' such that F(I,x') = F(I,y')
By selecting I=IV, Wang's attack can be used to find two 1-block messages x and y such taht SHA-1(x)=SHA-1(y)
The attack takes about 263 steps
The attacker does not have much control over x and y, so these messages are essentially meaningless.
Remark.
Similarly, Wang's collision finding attack on MD5 work where now she finds two (different) 2-block messages x=(x1,x2) and y=(y1,y2) such that F(I,x)=F(I,y). It takes about 239 steps and still essentially meaningless due to limited control on x and y.
Exploit a Single Hash Function
Let H be a hash function (MD5, SHA-1, ...)
Suppose we find x,y such that H(x)=H(y)
Suppose the collision finding method does not allow us to control the structure of x and y.
Suppose the method takes considerable (but feasible) time.
Question: How can an attacker, who has expended considerable resources to find two meaningless messages x and y that collide, make repreated use of this collision in a practical setting?
Extending a Collision
Notion: For a t-block messages x=(x1,...,xt) and n-bit string I, define F(I,x)=Ht where H0=I and Hi=f(Hi-1,xi) for i=1,2,...,t
Observation: Suppose x and y are messages of the same block-length such that F(I,x)=F(I,y). Then F(I,x||z)=F(I,y||z) for any message z. Furthermore, if I=IV, then H(x,z)=H(y,z) for any message z.
Exploiting Wang's Collision
Suppose MD5(or SHA-1) is being used in a hash-then-sign signature scheme.
Let M1 and M2 be two documents in postscript format. Suppose M1 is harmless for Alice and M2 is harmful for Alice
Daum and Lucks (2005): showed that how Wang's attack on MD5(or possibly SHA-1) can be used to find two new postscript files M̂1 and M̂2 such that:
When M̂1 is viewed or printed, it looks the same as M1. Similarly for M̂2.
MD5(M̂1)=MD5(M̂2)
The attacker Eve sends the postscript file M̂1 to Alice. Alice views or prints the file, and then signs it and returns to Eve. Since MD(M̂1)=MD5(M̂2), Eve also has Alice's signatuer on M̂2.