Hamming error-correcting codes

Introduction

Hamming codes are a class of schemes that add a small number of extra bits to a message so that some errors can be detected or corrected. The message length is arbitrary, and the maximum capability per codeword is correcting 1 bit error or detecting up to 3 bit errors. For some codeword lengths, the Hamming code is perfect in the sense that the error detection/correction capability is achieved without wasting any codewords.

This page explains Hamming codes starting from basic mathematical definitions, continuing onto simple algorithms to build intuition, and finally fleshing out actual Hamming codes.

Definitions

Bit

Short for binary digit, a bit is a variable that has two possible values, 0 or 1.

Message

A string (finite sequence) of bits that a sender wants to convey to a receiver (whether through space or time). Messages can take on any value, and no assumptions can be made about them in general.

Codeword

A string of bits produced by an encoder and consumed by a decoder. The encoder produces a sent codeword, which obeys the code’s rules and is errorless by definition. Afterward, zero or more bits of the sent codeword (but not its length) get changed to become the received codeword. The decoder consumes a received codeword that may or may not follow the code’s rules.

Error-correcting/detecting code (ECC)

A pair of functions where the encoder takes a length-\(k\) message to produce a length-\(n\) codeword, and the decoder takes a length-\(n\) codeword to produce a length-\(k\) message or an error condition. Each codeword must be at least as long as its corresponding message (\(n ≥ k\)).

Data bit

For a given code scheme, a data bit is a position (at index \(i\)) in codewords such that there exists a position (at index \(j\)) in messages where, for every message, bit \(j\) of the message equals bit \(i\) of the encoded codeword. In other words, the data bits are exactly the positions where message bits get copied verbatim into the codeword.

Parity bit

For a given code scheme, a parity bit is a position (at index \(i\)) in codewords whose value is computed (i.e. not independent like a data bit) to satisfy the parity convention of its parity group.

Parity group

Given a non-negative integer \(m\), a parity group over length-\(m\) bit strings is a set of positions or indexes to extract bits from for the purpose of computing parity.

Parity value

The parity value of a {string, collection, or multiset} of bits counts the number of bits that are 1, and the result is either even or odd. Expressed in another way, the parity is the {sum of all the bits} modulo 2, where the result of 0 maps to even and 1 maps to odd.
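
For instance, this definition translates directly into a one-line Python function:

    def parity(bits):
        # Sum of all the bits modulo 2: 0 means even, 1 means odd.
        return sum(bits) % 2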

Parity convention

The parity convention is either even or odd. A parity group is considered to be valid if its parity matches the convention. Although the convention can be specified on a group-by-group basis, there is no benefit in doing so. In fact, using even parity is more elegant algebraically because it makes the Hamming code linear, so there is no need to consider the choice of odd parity. Inverting any fixed set of data and/or parity bits is an isomorphism that does not increase or decrease error-correcting/detecting power.

Hamming distance

The Hamming distance between two bit strings of equal length is the number of bit positions in which they differ. This concept is useful in quantifying how far apart the valid codewords of a given error-correcting/detecting code are.


Simple codes

Before diving into the full complexity of Hamming codes, let’s work our way up from simpler schemes to understand how their components play a role in Hamming codes.

For the interactive demos below, the workflow always consists of starting with a user-defined input message, running the encoding function to yield a codeword, allowing the user to manipulate the codeword to deliberately introduce errors, and running the decoding function to yield an output message or an error. In the codeword, data bits are colored green, parity bits are colored blue, and a decoding error makes all bits colored red.

Simple parity codes

The simplest error-detecting code is to add one parity bit to a message, which we will put at the end.

If any single bit of the codeword is flipped (whether a data bit or parity bit), then the parity check fails and thus an error is detected. So all 1-bit errors can be detected, and in fact all errors with an odd number of bits changed are detected and all errors with an even number of bits changed are not detected. Furthermore, it is impossible to correct erroneous codewords because there is no way to locate which bit(s) are erroneous.

Starting from any valid codeword, flipping 1 bit always produces an erroneous codeword, and flipping 2 bits always produces a different valid codeword. Hence, the parity code has a Hamming distance of 2.
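
As a concrete illustration, here is a minimal Python sketch of this code, assuming even parity (the function names are illustrative):

    def encode(message):
        # Append one parity bit so that the whole codeword has even parity.
        return message + [sum(message) % 2]

    def decode(codeword):
        # Any odd number of bit flips makes the overall parity odd.
        if sum(codeword) % 2 != 0:
            raise ValueError("Parity check failed")
        return codeword[:-1]  # Strip the parity bit to recover the message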

Grid parity codes

Let’s try something a little clever: Arrange the message bits in a rectangular grid, then make a parity group for each row and each column, appending the parity bits to the right and bottom sides of the rectangle. The bottom right of the padded rectangle doesn’t contain a bit.

If one data bit in the codeword gets changed, then exactly two parity checks will fail – the one for that bit’s row and the one for that bit’s column; these let us easily locate and fix the error. Otherwise, if one parity bit is flipped, then only one parity check fails, so we choose not to change any data bits and simply discard the parity bits to yield the output message.

Furthermore, we can correct some more classes of errors. If one row parity check fails and multiple column parity checks fail, then we can correct all these errors, which sit at the intersections of the failing row with the failing columns (there must be an odd number of them, or else the row check would pass). Symmetrically, we can do the same with the words row and column swapped. But if at least 2 row parity checks and at least 2 column parity checks fail, then the situation is ambiguous and we cannot correct the codeword.
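
Here is a rough Python sketch of this scheme for an \(r×c\) grid with even parity, handling only the single-error case described above (the layout and names are my own assumptions):

    def encode(message, rows, cols):  # Requires len(message) == rows * cols
        grid = [message[r*cols : (r+1)*cols] for r in range(rows)]
        row_par = [sum(row) % 2 for row in grid]
        col_par = [sum(grid[r][c] for r in range(rows)) % 2 for c in range(cols)]
        return message + row_par + col_par

    def decode(codeword, rows, cols):
        data = codeword[: rows*cols]
        row_par = codeword[rows*cols : rows*cols + rows]
        col_par = codeword[rows*cols + rows :]
        grid = [data[r*cols : (r+1)*cols] for r in range(rows)]
        bad_rows = [r for r in range(rows) if sum(grid[r]) % 2 != row_par[r]]
        bad_cols = [c for c in range(cols) if
                    sum(grid[r][c] for r in range(rows)) % 2 != col_par[c]]
        if len(bad_rows) == 1 and len(bad_cols) == 1:
            grid[bad_rows[0]][bad_cols[0]] ^= 1  # Locate and fix the error
        elif len(bad_rows) >= 2 and len(bad_cols) >= 2:
            raise ValueError("Ambiguous errors")
        # A lone failing row or column check means a flipped parity bit;
        # discard the parity bits without changing any data.
        return [bit for row in grid for bit in row]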

Stated without rigorous proof, the Hamming distance of grid parity codes is 3. We know that it is at most 3 because the all-zero message encodes to the all-zero codeword, while the message consisting of a 1 followed by all 0s encodes to a codeword with one data bit set to 1 and two parity bits set to 1; these two codewords differ in 3 bits. Proving the lower bound would require a discussion of linear codes and the shortcut involving Hamming weight.

Because the code has a Hamming distance of 3, each valid codeword “owns” its neighborhood of 0- and 1-bit errors, which allows all 1-bit errors to be corrected. As noted earlier, some but not all higher-order errors can also be corrected. If we switch the decoder to only detecting errors without correcting them, then all 2-bit errors and some higher-order errors can be detected.

In terms of space efficiency, for messages of length \(rc\), the grid parity code adds \(r+c\) parity bits. If the message length is a square number \(n\), then the scheme adds \(2\sqrt{n}\) parity bits.

We can actually add a parity bit to the empty space at the bottom right of the padded rectangle and define its parity group to be all the bits (including itself). This increases the code’s Hamming distance to 4. If the decoder performs correction, then all 2-bit errors can be detected; otherwise without correction, the decoder can detect all 3-bit errors.

Almost Hamming codes

Let’s continue to increase the cleverness of the error-correcting code. The idea is to make parity groups based on the binary representation of the indexes of the data bits. Suppose that the message has length \(n\), and we index its bits starting from 0.

  • For the 0th parity group, we exclude the 0th data bit, include the 1st data bit, and repeat this pattern.

  • For the 1st parity group, we exclude the 0th and 1st data bits, include the 2nd and 3rd data bits, and repeat this pattern.

  • For the 2nd parity group, we exclude the {0th, 1st, 2nd, 3rd} data bits, include the {4th, 5th, 6th, 7th} data bits, and repeat this pattern.

  • ... In general, for the \(i\)th parity group, we include exactly those data bits whose index’s binary representation has a 1 in the \(i\)th position.

We keep making parity groups and stop before reaching a parity group that contains no data bits. (For example, if the message has length 3, then the 2nd parity group covers no data because it would start at the 4th bit.) For each parity group, we also append a parity bit to the codeword and include it in the group.

What we have accomplished is that if data bit \(i\) of the codeword is flipped, then zero or more parity checks will fail, and the indexes of the failing parity groups easily translate into the index of the erroneous bit. For example, if data bit 6 is flipped, then it causes parity groups 1 and 2 to fail. Mapping each failing parity group \(i\) to the number \(2^i\) and summing these numbers, we get \(2^1 + 2^2 = 6\), which is exactly the index of the erroneous data bit.
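
This index arithmetic is one line of Python apiece; a small illustration (the function name is hypothetical):

    # Data bit i is covered by parity group j iff bit j of i's binary
    # representation is 1.
    def covering_groups(i):
        return [j for j in range(i.bit_length()) if (i >> j) & 1]

    print(covering_groups(6))         # [1, 2]
    print(sum(2**j for j in [1, 2]))  # 6, the index of the erroneous bit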

However, this naive scheme has a few fatal problems. First, the data bit at index 0 is not covered by any parity group because the index has no 1s in its binary representation. Another way to view this is that data bit 0 being erroneous cannot be distinguished from a valid codeword. This can be easily fixed by renumbering the message bit indexes to start from 1 instead of the usual 0. Second, if just a parity bit is flipped, then that would cause the decoder to conclude that some data bit is erroneous. Fixing this involves interleaving data and parity bits in the codeword such that parity bits exist at exactly the indexes that are powers of 2, and that would give us a real Hamming code.


Hamming codes

Building on the intuition from the simpler algorithms above, here is our coding scheme. First, choose the codeword length \(n\) as a positive integer. Next, assign indexes to the codeword bits starting from 1. Each codeword bit at an index that is not a power of 2 will be a data bit.

Each codeword bit at an index \(2^i\) will be a parity bit. That parity bit belongs to parity group \(i\), which consists of all codeword bits where the index \(m\)’s \(i\)th bit is 1. For example, parity group 0 has its parity bit at index \(2^0 = 1\), and it includes the indexes 1, 3, 5, 7, etc. Parity group 1 has its parity bit at index \(2^1 = 2\), and it includes the indexes 2, 3, 6, 7, 10, 11, etc. Parity group 2 has its parity bit at index \(2^2 = 4\), and it includes the indexes 4, 5, 6, 7, 12, 13, 14, 15, etc.

Given a codeword length of \(n\), the number of parity bits is \(\left\lfloor \log_2 n \right\rfloor + 1\). For example, \(n = 7\) has 3 parity bits, and \(n = 15\) has 4 parity bits. All the other bits are data bits, so \(k = n - \left\lfloor \log_2 n \right\rfloor - 1\). We call this the \((n, k)\) Hamming code.
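
For positive \(n\), the quantity \(\left\lfloor \log_2 n \right\rfloor + 1\) is simply the bit length of \(n\), so in Python:

    n = 15
    num_parity_bits = n.bit_length()  # floor(log2(15)) + 1 = 4
    k = n - num_parity_bits           # 11 data bits, giving the (15, 11) code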

The encoding procedure is straightforward given how we defined the parity groups, so now we talk about the decoding procedure. For each parity group \(i\), assign it the value 0 if the check passes or \(2^i\) if the check fails. Then, sum up all these values to yield the syndrome value. If the syndrome is zero, then the codeword is already valid; moreover, zero is not a valid bit index in the codeword. If the codeword has a 1-bit error, then the syndrome correctly tells us the index of the bit that got flipped, so we flip it and extract the data bits from the modified codeword. Otherwise, with more than 1 flipped bit, the Hamming code is unable to fix the errors and rarely able to detect them either.
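
Here is a minimal Python sketch of the whole scheme (even parity, illustrative function names). It uses the fact that summing \(2^i\) over the failing parity groups is the same as XORing together the indexes of all the 1-bits in the received codeword:

    def is_power_of_2(x):
        return x > 0 and (x & (x - 1)) == 0

    def encode(message, n):
        codeword = [0] * (n + 1)  # Index 0 is unused
        # Copy message bits into the positions that are not powers of 2.
        data = iter(message)
        for i in range(1, n + 1):
            if not is_power_of_2(i):
                codeword[i] = next(data)
        # Set each parity bit so that its parity group has even parity.
        i = 1
        while i <= n:
            codeword[i] = sum(codeword[j] for j in range(1, n + 1)
                              if j & i and j != i) % 2
            i *= 2
        return codeword[1:]

    def decode(codeword):
        n = len(codeword)
        bits = [0] + list(codeword)  # Shift to 1-based indexing
        # Syndrome = sum of 2^i over failing groups = XOR of indexes of 1-bits,
        # because bit i of that XOR is exactly the parity of group i.
        syndrome = 0
        for j in range(1, n + 1):
            if bits[j] == 1:
                syndrome ^= j
        if syndrome > n:
            raise ValueError("Uncorrectable error detected")
        if syndrome != 0:
            bits[syndrome] ^= 1  # Correct the single-bit error
        return [bits[i] for i in range(1, n + 1) if not is_power_of_2(i)]

    # Example with the (7, 4) code:
    msg = [1, 0, 1, 1]
    cw = encode(msg, 7)  # [0, 1, 1, 0, 0, 1, 1]
    cw[2] ^= 1           # Flip the codeword bit at 1-based index 3
    assert decode(cw) == msg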

Properties

Whenever the codeword length \(n\) is one less than a power of 2 (e.g. 1, 3, 7, 15, 31), the Hamming code is “perfect”. This means that every valid codeword “owns” its entire neighborhood of itself (0-bit error) and 1-bit errors but nothing else, and no codewords are left unused. To prove this, let \(i\) be such that \(n = 2^i - 1\). We have \(i\) parity bits and \(k = n - i\) data bits (and the message length is \(k\)). There are \(2^k\) possible messages. Each message has \(n\) possible 1-bit errors, so each message owns \(n + 1\) codewords. That means \(2^k (n + 1)\) codewords are assigned. Substituting values, this equals \(2^{n - i} [(2^i - 1) + 1] = 2^n\), which means all possible codewords are accounted for. Moreover, every 2-bit error will necessarily get corrected to the wrong message. Examples of perfect Hamming codes include (3, 1) (which matches the simplest repetition code), (7, 4), (15, 11), (31, 26), (63, 57).

In the Hamming code decoder, we can choose whether to correct errors or not. Note that when we correct an error, then by definition we have detected it. In correction mode, we can correct all 1-bit errors and no more. In detection mode, we can detect all 2-bit errors. This is because each valid codeword owns its neighborhood of 1-bit errors: to go from one valid codeword to another, we need to flip one bit to reach the edge of the first codeword’s neighborhood, flip another bit to enter a different codeword’s neighborhood, and flip one more bit to land on that valid codeword. So the Hamming distance of the code is 3, and flipping 2 bits never produces a valid codeword.

If the length of the Hamming codeword is not one less than a power of 2, then the code is imperfect and some 2-bit errors can be detected but never corrected. For example, suppose \(n = 6\). If we take a valid codeword and flip the bits at indexes 3 and 4, then the syndrome value is 7, which is not a valid index (because the highest index is 6). But also if we take a valid codeword and flip the bits at indexes 2 and 5, we also get a syndrome value of 7, which means this particular value is inherently ambiguous.
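
In the syndrome-as-XOR view from the decoder sketch above, this ambiguity is plain arithmetic:

    print(3 ^ 4)  # 7, which exceeds n = 6, so the error is detected
    print(2 ^ 5)  # 7 again, from a completely different 2-bit error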

For pedagogical purposes, we chose to put the parity bits at codeword indexes that are powers of 2 because it makes the structure easier to understand mathematically. But in the real world, we usually prefer all the data bits to be grouped together and arranged in the same order as the input message bits. In what is known as a systematic code, a codeword is formed by taking the message verbatim and then appending some parity bits. This rearrangement of codeword bits does not increase or decrease the power of the error-correcting/detecting code.

Also, instead of going through the trouble of defining unique terms like parity groups, we can view Hamming codes as matrix operations over the finite field \(\mathbb{Z}_2\) (or \(\text{GF}(2)\)). Every message is a length-\(k\) column vector, the encoder is an \(n×k\) binary matrix, and every codeword is a length-\(n\) column vector.
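
For example, one standard choice for a systematic (7, 4) code is the generator matrix \(G\) and parity-check matrix \(H\) below, where encoding a message \(m\) is the product \(c = Gm\) and the decoder computes the syndrome \(s = Hc\) (all arithmetic modulo 2). The syndrome is zero exactly when the codeword is valid, and a 1-bit error at position \(j\) yields the \(j\)th column of \(H\):

\[ G = \begin{pmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 1&1&0&1 \\ 1&0&1&1 \\ 0&1&1&1 \end{pmatrix}, \qquad H = \begin{pmatrix} 1&1&0&1&1&0&0 \\ 1&0&1&1&0&1&0 \\ 0&1&1&1&0&0&1 \end{pmatrix}. \]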

Just like in the grid parity code, we can wrap any Hamming code with a simple parity scheme by appending a parity bit that covers all the bits (including the new parity bit itself). If we run the decoder in correction mode, then we can correct all 1-bit errors and now additionally detect all 2-bit errors. In detection mode, we can detect all 3-bit errors. This popular scheme is called “single error correction, double error detection” (SECDED) and is used in practical applications such as some computer memory modules (ECC RAM). In the pedagogical variant of the Hamming code, we can put the extra global parity bit at index 0 in the codeword (which was previously unused) and special-case its behavior.
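
Here is a rough Python sketch of such a SECDED wrapper, building on the encode() and is_power_of_2() sketches from earlier; the case analysis follows the rules just described and assumes a perfect inner codeword length:

    def encode_secded(message, n):
        cw = encode(message, n)
        return cw + [sum(cw) % 2]  # Append an overall even-parity bit

    def decode_secded(codeword):
        inner = codeword[:-1]
        overall_ok = sum(codeword) % 2 == 0
        syndrome = 0
        for j, bit in enumerate(inner, start=1):
            if bit:
                syndrome ^= j
        if syndrome != 0 and overall_ok:
            raise ValueError("Double-bit error detected")
        if syndrome != 0:  # Odd number of flips: assume a single inner error
            inner[syndrome - 1] ^= 1
        # syndrome == 0 with bad overall parity: the overall bit itself flipped
        return [b for i, b in enumerate(inner, start=1)
                if not is_power_of_2(i)]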

Although Hamming codes can be scaled up to arbitrarily long messages, it is not wise to do so. The upside is that the number of added parity bits grows slowly, being logarithmic in the message length. The downside is that the error-correcting power is always 1 bit, regardless of how long the codeword is. If we make the reasonable assumption that the error rate per bit is a fixed value in any particular real-world scenario, then longer codewords always have a higher chance of suffering 2 or more bit errors than shorter codewords do.

Source code

  • Java

  • Python
