Security · April 9, 2026 · 11 min read · StringTools Team

Hash Functions Explained — MD5, SHA-256, and Beyond

What is a Hash Function

A hash function is a mathematical algorithm that takes an input of arbitrary size and produces a fixed-size output, commonly called a hash value, digest, or checksum. The fundamental concept is straightforward: feed any amount of data into the function, and you always get back a string of a predetermined length. A SHA-256 hash, for example, always produces a 256-bit output regardless of whether the input is a single character, a full novel, or a multi-gigabyte file. This fixed-size output acts as a digital fingerprint of the input data, providing a compact representation that can be used for comparison, verification, and identification purposes.
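As a quick illustration of the fixed-size property, the sketch below hashes three inputs of very different sizes with SHA-256 using Python's standard `hashlib` module. The sample inputs are made up for the demonstration, and the 1 MiB buffer merely stands in for a large file.

```python
import hashlib

# Three inputs of very different sizes (illustrative values only).
inputs = {
    "one character": b"a",
    "a sentence": b"The quick brown fox jumps over the lazy dog",
    "1 MiB of zeros": bytes(1024 * 1024),
}

digest_lengths = {}
for label, data in inputs.items():
    digest = hashlib.sha256(data).digest()
    digest_lengths[label] = len(digest) * 8  # convert bytes to bits
    print(f"{label:>15}: {len(data):>8} bytes in, {len(digest) * 8} bits out")
```

Every line of output should report 256 bits, whatever the input size.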

Hash functions are deterministic, meaning the same input always produces the same output. If you hash the word hello with SHA-256 today, tomorrow, or on any other computer, you will get the identical hash value every time. This determinism is what makes hash functions useful for verification: if two copies of data produce the same hash under a collision-resistant function, you can be confident they are identical without comparing the data itself byte by byte. This property is leveraged across countless applications, from verifying software downloads to detecting duplicate files to indexing data in hash tables.

The concept of hashing extends far beyond cryptography. Non-cryptographic hash functions are used in data structures like hash tables and hash maps, where the goal is fast data lookup rather than security. Cryptographic hash functions add additional security properties that make them suitable for applications where an adversary might try to manipulate the data or forge a matching hash. Understanding the distinction between cryptographic and non-cryptographic hash functions is important because using the wrong type in a security context can leave systems vulnerable to attack.
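To make the non-cryptographic case concrete, here is a minimal sketch of hash-based bucket lookup. It uses CRC-32 from Python's `zlib` module as a stand-in for a fast non-cryptographic hash. The function names, the bucket count, and the sample keys are all hypothetical, and real hash table implementations use different functions and handle resizing.

```python
import zlib

def bucket_index(key: str, num_buckets: int) -> int:
    """Map a key to a bucket using CRC-32, a fast non-cryptographic checksum."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets

# A toy hash table: a fixed list of buckets, each holding the keys that map to it.
buckets = [[] for _ in range(8)]
for key in ["apple", "banana", "cherry", "date"]:
    buckets[bucket_index(key, len(buckets))].append(key)

def contains(key: str) -> bool:
    """Look in exactly one bucket instead of scanning every stored key."""
    return key in buckets[bucket_index(key, len(buckets))]

print(contains("banana"), contains("grape"))
```

Nothing here resists an adversary: CRC-32 collisions are trivial to construct, which is acceptable for lookup speed but not for security.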

Cryptographic hash functions are one-way functions, meaning it is computationally infeasible to reverse the process and recover the original input from the hash output. Given a hash value, there is no known efficient operation that reveals what input produced it. The only general way to find an input that produces a specific hash is to try different inputs until one matches, which is computationally impractical for well-designed hash functions with large output sizes, provided the set of plausible inputs is itself large. This one-way property is what makes hash functions suitable for storing passwords, because even if an attacker obtains the hash values, they cannot directly compute the original passwords and must resort to guessing.
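Because guessing is the only route back from a hash, the protection depends on how many inputs are plausible. The sketch below assumes a hypothetical secret that is a 4-digit PIN and recovers it from its SHA-256 hash by trying all 10,000 candidates. The function name and the PIN scenario are illustrative, not taken from any real system.

```python
import hashlib
from typing import Optional

def brute_force_pin(target_hex: str) -> Optional[str]:
    """Try every 4-digit PIN and return the one whose SHA-256 matches, if any."""
    for candidate in range(10_000):
        pin = f"{candidate:04d}"
        if hashlib.sha256(pin.encode("ascii")).hexdigest() == target_hex:
            return pin
    return None

# Only the hash is "known" to the attacker; the PIN is found by exhaustive guessing.
stolen_hash = hashlib.sha256(b"4821").hexdigest()
print(brute_force_pin(stolen_hash))
```

The same loop is hopeless against a random 256-bit input, which is why one-wayness in practice depends on both the hash function and the size of the input space.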

Properties of Cryptographic Hashes

Pre-image resistance is the property that makes it computationally infeasible to find any input that produces a given hash output. This is the formal statement of the one-way nature of cryptographic hash functions. If an attacker knows a hash value and wants to find an input that hashes to it, the best available approach should be brute force: trying random inputs until one matches. For a 256-bit hash, this means searching through an astronomical number of possibilities, making the attack practically impossible with current or foreseeable computing technology. Pre-image resistance is essential for password hashing because it prevents attackers from reversing stolen hash values back into usable passwords.

Second pre-image resistance means that given a specific input and its hash, it is computationally infeasible to find a different input that produces the same hash. This property is critical for data integrity verification. When you download a file and verify its hash against a published value, second pre-image resistance means that an attacker cannot feasibly create a malicious file with the same hash as the legitimate one, as long as the published hash itself comes from a trustworthy source. Without this property, an attacker could substitute a corrupted or backdoored file that passes the hash verification check, undermining the entire integrity verification process.

Collision resistance is the strongest of the three core properties and requires that it be computationally infeasible to find any two distinct inputs that produce the same hash output. Note the subtle but important difference from second pre-image resistance: collision resistance means the attacker is free to choose both inputs, while second pre-image resistance fixes one input. Due to the birthday paradox, finding collisions is significantly easier than finding pre-images. For a hash function with an n-bit output, a collision can theoretically be found in approximately 2^(n/2) operations rather than 2^n operations. This is why cryptographic hash functions need large output sizes to provide adequate collision resistance.
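The birthday bound is easy to observe on a deliberately weakened hash. The sketch below truncates SHA-256 to 24 bits (an assumption made purely so the search finishes quickly) and looks for any two inputs that collide. With n = 24, the bound suggests a collision after roughly 2^12 attempts, far fewer than the 2^24 a pre-image search would need. The helper names are hypothetical.

```python
import hashlib
from itertools import count

def truncated_hash(data: bytes, bits: int) -> int:
    """Keep only the top `bits` bits of SHA-256 to simulate a small hash."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - bits)

def find_collision(bits: int):
    """Hash the messages b"0", b"1", b"2", ... until two share a truncated hash."""
    seen = {}  # truncated hash -> first message that produced it
    for i in count():
        message = str(i).encode("ascii")
        h = truncated_hash(message, bits)
        if h in seen:
            return seen[h], message, i + 1
        seen[h] = message

first, second, attempts = find_collision(24)
print(f"{first!r} and {second!r} collide after {attempts} attempts")
```

The attempt count should land in the low thousands, the same order of magnitude as 2^12 = 4096. The same arithmetic is why a 256-bit hash offers about 128 bits of collision resistance.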

The avalanche effect is a desirable property where a small change in the input produces a dramatically different hash output. Changing a single bit of the input should change approximately half the bits of the output, making the new hash appear completely unrelated to the original. This property ensures that similar inputs do not produce similar hashes, which would leak information about the relationship between inputs. The avalanche effect is what makes hash functions useful for detecting even the tiniest modifications to data: a file with a single byte changed will produce a completely different hash, making the alteration immediately detectable.
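A small sketch can make the avalanche effect visible. The two inputs below, `hello` and `helln`, differ in exactly one bit (the last byte is 0x6f versus 0x6e), and the code counts how many of the 256 output bits differ between their SHA-256 digests. The function name is hypothetical.

```python
import hashlib

def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions where the SHA-256 digests of a and b differ."""
    ha = int.from_bytes(hashlib.sha256(a).digest(), "big")
    hb = int.from_bytes(hashlib.sha256(b).digest(), "big")
    return bin(ha ^ hb).count("1")

# "o" is 0x6f and "n" is 0x6e, so the inputs differ in a single bit.
changed = differing_bits(b"hello", b"helln")
print(f"{changed} of 256 output bits changed")
```

The count should be somewhere near 128, that is, about half of the output bits.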

MD5 History and Limitations

MD5, which stands for Message-Digest Algorithm 5, was designed by Ronald Rivest in 1991 as an improvement over its predecessor MD4. It produces a 128-bit hash value, typically represented as a 32-character hexadecimal string. For over a decade, MD5 was the dominant hash function used across security applications, including SSL certificates, digital signatures, password storage, and file integrity verification. Its speed and simplicity made it easy to implement and widely adopted, and for years it was considered cryptographically secure.

The first serious cracks in MD5 appeared in 1996 when Hans Dobbertin found collisions in the compression function, though not in the full MD5 algorithm. In 2004, a team of Chinese researchers led by Xiaoyun Wang demonstrated practical collision attacks against the full MD5 algorithm, generating two different inputs with the same MD5 hash in about an hour on an IBM p690 cluster. Subsequent research brought the time to find collisions down to seconds, and then to fractions of a second, on ordinary hardware. By 2008, researchers demonstrated a practical attack against MD5 as used in SSL certificates, creating a rogue certificate authority certificate that browsers would trust.

Despite being cryptographically broken for over two decades, MD5 remains surprisingly common in non-security applications. Many systems still use MD5 as a checksum for detecting accidental data corruption, where the threat model does not include an adversary intentionally crafting collisions. File transfer systems, backup software, and content delivery networks sometimes use MD5 checksums for quick integrity verification because MD5 is fast and the checksums are compact. In these contexts, the known collision vulnerabilities do not matter because the goal is detecting transmission errors, not defending against targeted attacks.
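For the accidental-corruption case described above, a checksum comparison might look like the following sketch. It assumes Python 3.9 or later, where `hashlib.md5` accepts `usedforsecurity=False` to mark the call as a non-security use. The payload and the simulated bit flip are made up for the example.

```python
import hashlib

def md5_checksum(data: bytes) -> str:
    """MD5 as a plain checksum; usedforsecurity=False marks the non-security use."""
    return hashlib.md5(data, usedforsecurity=False).hexdigest()

original = b"payload sent over the wire"

# Simulate a transmission error by flipping a single bit in the fourth byte.
corrupted = bytearray(original)
corrupted[3] ^= 0x01

sent = md5_checksum(original)
received = md5_checksum(bytes(corrupted))
print(sent)
print(received)
print("corruption detected:", sent != received)
```

This catches random errors only. An attacker who controls the data can craft colliding inputs, so the same check says nothing about deliberate tampering.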

Using MD5 for any security-critical purpose is strongly discouraged and has been for years. Passwords should never be hashed with MD5 because its speed makes brute-force attacks trivially fast, and rainbow tables for MD5 are widely available. Digital signatures and certificates must not use MD5 because the collision attacks allow an adversary to craft two documents with the same hash, one harmless and one malicious, so that a signature obtained on the first is equally valid for the second. Any system still relying on MD5 for security should be migrated to a modern alternative like SHA-256 or SHA-3 as a matter of urgency. The transition away from MD5 is a case study in how long cryptographic migrations take even when the vulnerability is well-understood and widely publicized.

SHA Family Overview

The Secure Hash Algorithm family was developed by the National Security Agency and published by the National Institute of Standards and Technology as federal information processing standards. SHA-1, published in 1995, produces a 160-bit hash and was widely used as the successor to MD5. However, theoretical attacks against SHA-1 emerged in 2005, and in 2017 researchers at Google and CWI Amsterdam demonstrated a practical collision attack called SHAttered that produced two different PDF files with identical SHA-1 hashes. Browsers and certificate authorities have since deprecated SHA-1, and Git has begun moving away from it, though its legacy persists in older systems.

SHA-2 is a family of hash functions published in 2001 that includes SHA-224, SHA-256, SHA-384, and SHA-512, named after their output sizes in bits. SHA-256 is the most widely used member and has become the standard hash function for most modern security applications. It is used in TLS certificates, Bitcoin mining, digital signatures, code signing, and countless other protocols. SHA-256 produces a 64-character hexadecimal string and provides 128 bits of collision resistance, which is considered sufficient security for the foreseeable future. SHA-512 offers a larger output for applications that need an extra security margin or benefit from the 64-bit arithmetic on modern processors.
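The output sizes are easy to confirm with Python's `hashlib`, which exposes all four SHA-2 variants named above. This is only a quick check of digest lengths, and the input `hello` is arbitrary.

```python
import hashlib

sha2_bits = {}
for name in ("sha224", "sha256", "sha384", "sha512"):
    h = hashlib.new(name, b"hello")
    sha2_bits[name] = h.digest_size * 8  # digest_size is reported in bytes
    print(f"{name}: {sha2_bits[name]} bits, {len(h.hexdigest())} hex characters")
```

Each hexadecimal character encodes 4 bits, which is why a 256-bit digest prints as 64 characters.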

SHA-3, standardized in 2015, was selected through an open competition organized by NIST to provide an alternative to the SHA-2 family. The winning algorithm, Keccak, uses an internal structure called a sponge construction, which is fundamentally different from the Merkle–Damgård construction used by MD5, SHA-1, and SHA-2. This structural diversity means that if a vulnerability is discovered in the SHA-2 construction, SHA-3 is unlikely to be affected by the same attack. SHA-3 offers the same four output sizes as SHA-2 (224, 256, 384, and 512 bits) and adds SHAKE128 and SHAKE256, which are extendable-output functions that can produce hash values of any desired length.
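The difference between a fixed-output SHA-3 function and an extendable-output function can be seen with `hashlib`, which provides both. In the sketch below, the message is arbitrary and the requested lengths (16 and 64 bytes) are chosen only for illustration.

```python
import hashlib

message = b"extendable-output example"

# SHA3-256 always returns 32 bytes (64 hex characters).
fixed = hashlib.sha3_256(message).hexdigest()

# SHAKE128 takes the desired output length, in bytes, at digest time.
short_out = hashlib.shake_128(message).hexdigest(16)
long_out = hashlib.shake_128(message).hexdigest(64)

print(len(fixed), len(short_out), len(long_out))
print("short output is a prefix of the long one:", long_out.startswith(short_out))
```

Because SHAKE produces a single output stream that is simply cut off at the requested length, the shorter digest is a prefix of the longer one.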

BLAKE2 and BLAKE3 are modern hash functions that offer performance advantages over the SHA family while maintaining strong security properties. BLAKE2, finalized in 2012, is faster than MD5 on modern processors while providing security comparable to SHA-3. BLAKE3, released in 2020, leverages parallelism to achieve even higher speeds on multi-core processors. Both are widely used in applications where performance is critical, such as file hashing, deduplication, and content addressing. While not NIST standardized, BLAKE2 and BLAKE3 are well-analyzed and trusted by the cryptographic community, and they are increasingly adopted in new systems where the NIST pedigree is not a regulatory requirement.
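BLAKE2 ships with Python's standard library, so a short sketch is possible without extra packages. BLAKE3 is not part of `hashlib` and would need a third-party package, so it is not shown. The input and the 32-byte digest size are assumptions chosen for the example.

```python
import hashlib

data = b"content to fingerprint"

# BLAKE2b defaults to a 64-byte digest; digest_size selects a shorter output.
default_digest = hashlib.blake2b(data).hexdigest()
short_digest = hashlib.blake2b(data, digest_size=32).hexdigest()

# BLAKE2s is the variant aimed at smaller platforms; it outputs 32 bytes by default.
small_digest = hashlib.blake2s(data).hexdigest()

print(len(default_digest), len(short_digest), len(small_digest))
```

Note that the 32-byte BLAKE2b digest is not a truncation of the 64-byte one: the requested length is part of the function's parameters, so the two outputs are unrelated.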

Hash Functions in Password Storage

Storing passwords as plain text is a catastrophic security practice that exposes every user account the moment the database is compromised. Hashing passwords before storage means that even if an attacker gains access to the database, they obtain hash values rather than usable passwords. However, simply hashing passwords with a general-purpose hash function like SHA-256 is insufficient because these functions are designed to be fast, and speed is the enemy of password security. An attacker with a modern GPU can compute billions of SHA-256 hashes per second, testing an enormous number of potential passwords in a brute-force attack.

Salting is the practice of prepending or appending a random string to each password before hashing it. Each user gets a unique salt that is stored alongside their hash value in the database. Salting defeats precomputed attacks like rainbow tables, which are massive lookup tables mapping common passwords to their hash values. Without salts, an attacker who finds a hash matching an entry in their rainbow table immediately knows the password. With unique salts, the attacker would need a separate rainbow table for every possible salt value, making precomputation impractical. Salts do not need to be secret; their purpose is to ensure that identical passwords produce different hash values.
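The effect of a salt can be shown in a few lines. The sketch below gives two hypothetical users the same password and a different random salt each. It uses `hashlib.scrypt` (available when Python is built against OpenSSL 1.1 or later) so the example does not lean on a fast hash, and the cost parameters are illustrative assumptions rather than recommendations.

```python
import hashlib
import os

def salted_hash(password: str, salt: bytes) -> bytes:
    """Derive a 32-byte hash from the password and a per-user salt."""
    return hashlib.scrypt(
        password.encode("utf-8"),
        salt=salt,
        n=2**14, r=8, p=1,          # illustrative cost parameters
        maxmem=64 * 1024 * 1024,
        dklen=32,
    )

# Two users who happen to choose the same password get different salts.
alice_salt = os.urandom(16)
bob_salt = os.urandom(16)
alice_hash = salted_hash("correct horse", alice_salt)
bob_hash = salted_hash("correct horse", bob_salt)

print("same password, different hashes:", alice_hash != bob_hash)
print("same salt reproduces the hash:", salted_hash("correct horse", alice_salt) == alice_hash)
```

The salt is stored in the clear next to the hash. It only has to be unique, which is why a fresh random value per user is enough.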

Purpose-built password hashing functions like bcrypt, scrypt, and Argon2 are specifically designed to be slow and resource-intensive, making brute-force attacks costly. Bcrypt, introduced in 1999, includes a configurable work factor that determines how many iterations of its internal function are performed. Increasing the work factor by one doubles the computation time, allowing the difficulty to scale with hardware improvements. Scrypt, published in 2009, adds memory-hardness to the equation, requiring significant amounts of RAM in addition to CPU time, which makes the function resistant to attacks using specialized hardware like GPUs and ASICs that have limited memory per core.
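As a rough sketch of how a tunable, memory-hard function is used in practice, the code below stores a password with scrypt through Python's `hashlib.scrypt` and verifies it later. The function names, the record layout, and the parameter values (`n`, `r`, `p`) are assumptions for illustration; scrypt's working memory is roughly 128 × n × r bytes, about 16 MiB with these values.

```python
import hashlib
import hmac
import os

MAXMEM = 128 * 1024 * 1024  # upper bound on memory the call may use

def hash_password(password: str, n: int = 2**14, r: int = 8, p: int = 1) -> dict:
    """Hash a password with a fresh salt and keep the cost parameters with it."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(password.encode("utf-8"), salt=salt,
                            n=n, r=r, p=p, maxmem=MAXMEM, dklen=32)
    return {"salt": salt, "n": n, "r": r, "p": p, "digest": digest}

def verify_password(password: str, record: dict) -> bool:
    """Recompute the hash with the stored salt and parameters, then compare."""
    candidate = hashlib.scrypt(password.encode("utf-8"), salt=record["salt"],
                               n=record["n"], r=record["r"], p=record["p"],
                               maxmem=MAXMEM, dklen=len(record["digest"]))
    # compare_digest runs in constant time to avoid leaking where bytes differ.
    return hmac.compare_digest(candidate, record["digest"])

record = hash_password("s3cret passphrase")
print(verify_password("s3cret passphrase", record))
print(verify_password("wrong guess", record))
```

Storing the parameters alongside each hash is a design choice that lets the cost be raised later: old records still verify with their original settings and can be rehashed the next time the user logs in.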

Argon2, the winner of the Password Hashing Competition in 2015, represents the current state of the art in password hashing. It comes in three variants: Argon2d is optimized for resistance against GPU-based attacks, Argon2i is optimized for resistance against side-channel attacks, and Argon2id combines both approaches. Argon2 allows independent configuration of time cost, memory cost, and parallelism degree, giving system administrators fine-grained control over the trade-off between security and user experience. The recommended approach is to set these parameters as high as the production hardware and acceptable response time allow, typically aiming for a hash computation time of several hundred milliseconds per password.

Hash Functions for File Integrity

File integrity verification is one of the most practical everyday applications of hash functions. When you download software, firmware, or any important file from the internet, the provider typically publishes a hash value alongside the download link. After downloading the file, you compute the hash locally and compare it to the published value. If the hashes match, you can be confident that the file was not corrupted during transfer and that you received the exact same file the provider intended to distribute. This process is especially important for software installers and system images, where even a single bit of corruption could cause installation failures or system instability.
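The download check described above might be sketched as follows. The file is read in chunks so that large installers do not have to fit in memory. The function names and the chunk size are hypothetical, and the published hash is assumed to be a hexadecimal SHA-256 value.

```python
import hashlib
import hmac

def sha256_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Compute the SHA-256 of a file by streaming it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def matches_published_hash(path: str, published_hex: str) -> bool:
    """Compare the local hash with the published one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_file(path), published_hex.strip().lower())
```

A call such as `matches_published_hash("installer.iso", published_value)` returns `True` only when the local file hashes to the published value. Both the file name and the variable are placeholders here.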

Package managers for programming languages and operating systems rely heavily on hash functions to verify the integrity and authenticity of packages. When you run npm install, pip install, or apt-get install, the package manager downloads the requested package and compares its hash against a value recorded in the package registry. This verification happens automatically and transparently, protecting users from corrupted downloads and, when combined with digital signatures, from tampered packages. The security of the entire software supply chain depends on the collision resistance of the hash functions used in these verification systems.

Version control systems use hash functions as the foundation of their data model. Git, the most widely used version control system, identifies every object in its repository by its SHA-1 hash, including commits, trees, blobs, and tags. This content-addressable storage model means that any change to any file in the repository produces a different hash for the blob, which cascades up through the tree and commit objects, making it computationally infeasible to modify the repository history without detection as long as the hash function resists collisions. While Git is migrating to SHA-256 due to the known weaknesses in SHA-1, the architectural pattern of content-addressable storage using hash functions remains fundamental.
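Git's object naming can be reproduced in a few lines. For a blob in a SHA-1 repository, Git hashes a short header (the word `blob`, a space, the content length in bytes, and a NUL byte) followed by the file content. The function name below is hypothetical, and the sketch covers blobs only, not trees or commits.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the object ID Git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The empty blob has a well-known ID that appears in many repositories.
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# Changing one character of the content yields an unrelated ID.
print(git_blob_id(b"hello\n"))
print(git_blob_id(b"hellO\n"))
```

The result should match what `git hash-object` prints for a file with the same bytes.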

Forensic analysis and legal proceedings use hash functions to establish the integrity of digital evidence. When investigators collect digital evidence such as hard drive images, email archives, or document collections, they immediately compute hash values for all collected materials. These hash values are recorded in a chain of custody log and can be recomputed at any time to prove that the evidence has not been altered since collection. Courts accept hash-based integrity verification as proof that digital evidence is authentic and unmodified, making hash functions a cornerstone of digital forensics methodology.

Choosing the Right Algorithm

The choice of hash algorithm depends on the specific use case and threat model. For general-purpose data integrity verification where the threat is accidental corruption rather than intentional tampering, almost any modern hash function will suffice. SHA-256 is the safe default choice for most applications because it is widely supported, well-analyzed, performant on modern hardware, and provides a generous security margin. Unless you have a specific reason to choose something different, SHA-256 is the algorithm to use.

For password hashing, general-purpose hash functions are the wrong tool entirely. Use Argon2id as the first choice, bcrypt as a well-established alternative, or scrypt when memory-hardness is particularly important. Never use MD5, SHA-1, SHA-256, or any other fast hash function for password storage, even with salting. The critical requirement for password hashing is that the function must be deliberately slow and resource-intensive to make brute-force attacks economically infeasible. Configure the work factor parameters as high as your hardware and latency budget allow, and plan to increase them over time as hardware improves.

For high-performance applications like file deduplication, content addressing, and data pipeline checksums where billions of hashes may need to be computed, consider BLAKE3 or xxHash depending on whether cryptographic security is needed. BLAKE3 provides cryptographic security with exceptional performance by leveraging parallelism across multiple cores. xxHash is a non-cryptographic hash function that offers even higher speed for applications where collision resistance against adversaries is not required. Choosing between cryptographic and non-cryptographic functions requires honest assessment of whether an attacker could benefit from crafting collisions in your specific context.
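A minimal deduplication sketch shows the content-addressing idea. It uses BLAKE2b from the standard library as a stand-in, since BLAKE3 and xxHash both require third-party packages. The function name and the in-memory dictionary store are assumptions for illustration.

```python
import hashlib

def deduplicate(blobs):
    """Store each distinct blob once, keyed by its BLAKE2b digest."""
    store = {}  # digest -> content, one entry per distinct blob
    keys = []   # digest for each input, in the original order
    for blob in blobs:
        key = hashlib.blake2b(blob, digest_size=32).hexdigest()
        store.setdefault(key, blob)  # keep the first copy, skip later duplicates
        keys.append(key)
    return store, keys

store, keys = deduplicate([b"report.pdf bytes", b"photo.jpg bytes", b"report.pdf bytes"])
print(f"{len(keys)} inputs, {len(store)} stored")
```

If untrusted parties can submit content, a cryptographic hash matters here: with a non-cryptographic function, an attacker could craft a colliding blob that silently takes the place of someone else's data.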

When regulatory compliance or standards adherence is required, the choice may be constrained by the applicable framework. FIPS 140-2 and FIPS 140-3 compliance requires using NIST-approved algorithms, which means SHA-2 or SHA-3 family functions. PCI DSS, HIPAA, and other regulatory frameworks reference NIST guidelines for cryptographic standards. International standards like ISO 27001 are generally more flexible but still expect the use of recognized, well-analyzed algorithms. In regulated environments, the safest path is to use SHA-256 or SHA-3-256 and document the algorithm choice as part of your security architecture.