Knuth-Morris-Pratt (KMP) algorithm explained

Almost every program has to somehow process and transform strings. Very often we want to identify some fixed pattern (substring) in the string, or, formally speaking, find an occurrence of $\beta := b_0 b_1 \dots b_{n-1}$ in $\alpha := a_0 a_1 \dots a_{m-1} \neq \varepsilon$. A naive algorithm would just traverse $\alpha$ and check if $\beta$ starts at the current character of $\alpha$. Unfortunately, this approach leads us to an algorithm with a complexity of $O(n \cdot m)$. In this post we will discuss a more efficient algorithm solving this problem - the Knuth-Morris-Pratt (KMP) algorithm.
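For reference, the naive approach can be sketched as follows (a minimal illustration for comparison; the function name is mine, not part of KMP):

```python
def naive_search(a: str, b: str) -> int:
    """Return the index of the first occurrence of b in a, or -1.

    Worst case O(n * m): for every start position in a we may
    compare up to len(b) characters.
    """
    m, n = len(a), len(b)
    for start in range(m - n + 1):
        if a[start:start + n] == b:  # up to n character comparisons
            return start
    return -1
```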

The Knuth-Morris-Pratt algorithm

Obviously, the substring search algorithm has to somehow compare both strings character-after-character. Suppose we are scanning the string from left to right and we found $b_0$. Then we can start comparing further characters of $\beta$ with the next characters of $\alpha$. If the remaining characters match, we’ve found an occurrence of $\beta$ in $\alpha$. So the key question is what the algorithm should do if there is a mismatch during this comparison process, as shown in this example:

[Figure: Knuth-Morris-Pratt algorithm failure function motivation]

If the characters we are currently comparing don’t match, then we need to somehow shift the green, “matching” area of $\beta$. This green area, which we expand if the characters match and shift otherwise, is often referred to as the window. If we found a mismatch, it is possible that some suffix of the window together with the current character forms a prefix of $\beta$, so we must shift the window so that this prefix of $\beta$ ends at the current position. In the example above, we must shift the window by 5 characters to the right so that the arrow denoting the current position points at the first character b of the window.

Or, for example, in this situation

[Figure: Another example illustrating why we need the failure function]

we should shift the window by 2 characters to the right and get:

[Figure: The failure function tells us how we should shift the window]

The current elements (above the arrow) match, so we can proceed by comparing further characters. By the way, it is important to shift only by 2 characters and not by 4, because otherwise it is possible that we skip too many characters and don’t detect a valid occurrence of $\beta$ in $\alpha$.

To generalize and formalize how we shift the window when the current characters don’t match, we will define the so-called failure function:

Let $f(j)$ be the length of the longest proper prefix of $b_0 b_1 \dots b_j$ such that this prefix is also a suffix of $b_0 b_1 \dots b_j$. In this context, proper means that the prefix is not the whole string itself.
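This definition translates directly into a brute-force computation, which is useful for building intuition and for testing the efficient algorithm derived later (a sketch meant as an executable specification only; the helper name is mine):

```python
def failure_brute_force(b: str) -> list:
    """Compute f(j) for every j straight from the definition: the length
    of the longest proper prefix of b[0..j] that is also a suffix of
    b[0..j]. Runs in roughly O(n^3) time, so it is only a reference."""
    n = len(b)
    f = []
    for j in range(n):
        s = b[:j + 1]
        best = 0
        # try prefix lengths from the longest proper prefix down to 1
        for length in range(j, 0, -1):
            if s[:length] == s[-length:]:
                best = length
                break
        f.append(best)
    return f
```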

The idea behind the failure function is that when we find mismatching characters while comparing the window with $\alpha$, we should shift the window so that its longest proper prefix that is also its suffix becomes the new window: this prefix has length $f(i)$, where $i$ is the index of the last character of the window (green area). In other words, the window should be shifted by $w - f(i)$ characters to the right, where $w = i + 1$ is the size of the window. If, after shifting the window, there is still a mismatch, then it is possible that a shorter suffix matches some prefix of $b_0 \dots b_i$, so we need to repeat the process again.

With this idea and assuming the failure function is computed and stored in the f array, we can implement the Knuth-Morris-Pratt algorithm the following way:

# returns the index of the first occurrence of b in a
def kmp(a: str, b: str, f: list) -> int:
    
    m = len(a)
    assert m >= 1
    n = len(b)
    p = 0
    
    for i in range(m):
        
        while p != 0 and a[i] != b[p]:
            p = f[p - 1]
        
        if a[i] == b[p]:
            p += 1

        # invariant formally defined below
        assert a[i + 1 - p:i + 1] == b[0:p]

        if p == n:
            return i + 1 - n
    
    return -1 # no match found

Proof of correctness: To formally prove the correctness of the above program, we will define the following invariant, which is also verified with the assert statement in the code: After every iteration of the for loop, before the last if statement, it holds that $a_{i+1-p} \dots a_i = b_0 \dots b_{p-1}$. Intuitively, it means that the window is correct:

[Figure: Invariant of the implementation of the KMP algorithm visualized]

Induction base: We excluded the case $\alpha = \varepsilon$ in the formal definition of the problem, so the loop will be entered. On the first iteration, the window is empty ($p = 0$) and the while loop will not be entered because we don’t need to shift the empty window. If the first characters match, then $p$ will be incremented, establishing the invariant. Otherwise, the window will remain empty.

Induction step: By the induction hypothesis, we assume that the invariant holds for some $i$ and $p$. If $p = n$, it means that $a_{i+1-n} \dots a_i = b_0 \dots b_{n-1} = \beta$, so we found an occurrence of $\beta$ in $\alpha$ at the position $i + 1 - n$, and the algorithm terminates. If this is not the case and there is a further character in $\alpha$, we enter a new iteration of the loop and shift the window by $p - f(p - 1)$ characters to the right by assigning $f(p - 1)$ to p. Intuitively, by doing this we shift the window so that the new window is the greatest possible proper prefix of the old window that is also its suffix. For this reason it is guaranteed that the window remains valid and we only need to compare the current elements: if a[i] != b[p], then the prefix we took is probably still too large and we should search for a smaller one by repeating the window-shifting step, until we either find a match or $p$ becomes zero, meaning that the window is empty. The body of the if statement after the loop expands the window in case the current characters match, establishing the invariant.

It is also important to note that the window should be shifted by exactly $p - f(p - 1)$ characters to the right, because if we instead jumped to some shorter prefix that is also a suffix of $b_0 \dots b_{p-1}$, it would be possible that we don’t detect an occurrence of $\beta$ in $\alpha$. This completes the proof.

In the example above, the window gets shifted by $p - f(p - 1) = 5 - f(4) = 5 - 3 = 2$ characters to the right, $i$ gets incremented and the window gets expanded (by incrementing $p$) because a[i] == b[p]:

[Figure: The invariant is established after 1 iteration]

Computing the failure function

An important part of the KMP algorithm is, of course, the computation of the failure function we defined above. We can come up with an efficient algorithm computing it by approaching the problem with dynamic programming. Suppose we want to compute $f(n - 1)$ and we have already computed $f(i)$ for all $0 \le i \le n - 2$. The value of $f(n - 2)$ gives us the length of the longest suffix ending at $b_{n-2}$ such that there is a corresponding prefix of this length (equal parts of the string are visualized with braces):

$$\beta = \underbrace{b_0 \dots b_{f(n-2)-1}}_{f(n-2)} \, b_{f(n-2)} \dots b_{n-2-f(n-2)} \, \underbrace{b_{n-1-f(n-2)} \dots b_{n-2}}_{f(n-2)} \, b_{n-1}$$

By the way, it is possible that the parts of the formula marked with braces intersect, because there is no guarantee that $f(n-2) - 1 < n - 1 - f(n-2) \Leftrightarrow f(n-2) < \frac{n}{2}$.

If the characters after the equal parts in braces are also the same, or, formally, if $b_{f(n-2)} = b_{n-1}$, then we can conclude that $f(n-1) = f(n-2) + 1$.

If $b_{f(n-2)} \neq b_{n-1}$, then the longest prefix of $\beta$ that is also its suffix must have length $k \le f(n-2)$. Suppose we found such a maximum $k < f(n-2)$ such that:

$$\begin{aligned} b_0 \dots b_{k-1} &= b_{n-1-k} \dots b_{n-2} \\ \underbrace{\overbrace{b_0 \dots b_{k-1}}^{k} b_{k} \dots b_{f(n-2)-1}}_{f(n-2)} &= \underbrace{b_{n-1-f(n-2)} \dots b_{n-2-k} \overbrace{b_{n-1-k} \dots b_{n-2}}^{k}}_{f(n-2)} \end{aligned}$$

Then, by combining these equations, it follows that there must be a suffix of $b_0 \dots b_{k-1} b_k \dots b_{f(n-2)-1}$ of length $k$ that is equal to the prefix $b_0 \dots b_{k-1}$.

So, the largest candidate for such a shorter suffix that is also a prefix has length $k := f(f(n-2) - 1)$, and if $b_k = b_{n-1}$, then $f(n-1) = k + 1$. We can apply this important fact we derived multiple times — if $b_k \neq b_{n-1}$, then the next candidate has length $l := f(k - 1)$, and if $b_l = b_{n-1}$, then $f(n-1) = l + 1$, and so on.

The base case for the dynamic programming approach is simple: $f(0) = 0$, because the only proper prefix of $b_0$ is the empty string $\varepsilon$.

We are now ready to implement the computation of the failure function:

def compute_f(b: str) -> list:

    n = len(b)
    f = [0]

    for i in range(1, n):
        
        k = f[i - 1]
        
        while k != 0 and b[i] != b[k]:
            k = f[k - 1]

        if b[i] == b[k]:
            k += 1

        f.append(k)  # f[i] = k

    return f

In the case of a mismatch, the while loop searches for a shorter suffix that is also a prefix of $\beta$. If one was found, the loop terminates with $k$ pointing at the first character after that prefix. If the loop terminated because of $k = 0$, then there is no shorter non-empty prefix that is also a suffix, so the value of the failure function is either 0 or 1, depending on whether the current character is equal to the first one.
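A useful way to read this loop: repeatedly applying k = f[k - 1] walks through all non-empty borders (prefixes that are also suffixes) of a prefix of $\beta$, from longest to shortest. A small standalone sketch of this observation (the helper name is mine, and the f array below was worked out by hand for the pattern "abaaba"):

```python
def borders(f: list, j: int) -> list:
    """Return the lengths of all non-empty borders of b[0..j],
    longest first, by following the failure-function chain
    f(j), f(f(j) - 1), ..."""
    lengths = []
    k = f[j]
    while k > 0:
        lengths.append(k)
        k = f[k - 1]
    return lengths

# failure function of "abaaba", computed by hand for illustration
f_abaaba = [0, 0, 1, 1, 2, 3]
print(borders(f_abaaba, 5))  # [3, 1]: "aba" and "a" are borders of "abaaba"
```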

Of course, we can run the functions above and check the result:

a = "baabbbaabbaabbbabaabbbaabaabababba"
b = "baababa"
print("a:", a)
print("b:", b)
f = compute_f(b)
print("Failure function:")
print("i   :", [i for i in range(len(b))])
print("f(i):", f)  # [0, 0, 0, 1, 2, 1, 2]
found = kmp(a, b, f)
print("Result:", found)  # 24
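The result can be cross-checked against Python’s built-in substring search, which must agree on the index of the first occurrence (shown purely as a sanity check; str.find uses its own search algorithm internally):

```python
a = "baabbbaabbaabbbabaabbbaabaabababba"
b = "baababa"
# the built-in search should report the same first occurrence
print(a.find(b))  # 24
```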

Finding all matches

The Knuth-Morris-Pratt algorithm can be easily extended to detect all occurrences of β\beta in α\alpha:

def kmp(a: str, b: str, f: list) -> list:

    m = len(a)
    assert m >= 1
    n = len(b)
    p = 0

    matches = []

    for i in range(m):
        
        while p != 0 and a[i] != b[p]:
            p = f[p - 1]
        
        if a[i] == b[p]:
            p += 1

        if p == n:
            matches.append(i + 1 - n)
            p = f[p - 1]
    
    return matches

If the window has length $n$ and we thus found an occurrence of $\beta$ in $\alpha$, all we need to do is shift the window by $n - f(n - 1)$ characters to the right. We need to do it in the body of the if p == n: statement, because the while loop expects that the window is a proper prefix of $\beta$ and thus that $p$ is a valid index.
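Note that this version reports overlapping occurrences. As a reference for what the output should look like, here is a brute-force enumeration of all (possibly overlapping) matches (a slow $O(n \cdot m)$ sketch for comparison only; the function name is mine):

```python
def all_matches_brute_force(a: str, b: str) -> list:
    """Return the start indices of all (possibly overlapping)
    occurrences of b in a, in O(len(a) * len(b)) time."""
    n = len(b)
    return [i for i in range(len(a) - n + 1) if a[i:i + n] == b]

print(all_matches_brute_force("ababa", "aba"))  # [0, 2] -- overlapping matches
```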

Complexity analysis

It may seem that both the computation of the failure function and the matching itself have a superlinear complexity ($O(n^2)$ and $O(n \cdot m)$ respectively) because of the 2 nested loops in both algorithms. However, this is not the case. We can first show that the failure function can be computed in linear time:

Failure function computation complexity

We’ve already argued above that $f(i) \le f(i - 1) + 1$ for all $i \in \{1, \dots, n - 1\}$ and $f(0) = 0$. It follows that $f(i) \le i$ also holds for all $0 \le i < n$. Because of that, during a single iteration of the for loop, the while loop makes at most $k$ iterations, where $k$ is the initial value of the k variable.

We will now prove that the body of the while loop (in particular, the k = f[k - 1] statement) gets executed at most $n$ times during the entire algorithm: every execution of k = f[k - 1] strictly decreases $k$, because $f(k - 1) \le k - 1 < k$, and $k$ never becomes negative. At the same time, $k$ is increased at most once (by 1) per iteration of the for loop. Hence the total number of decreases over the whole run cannot exceed the total number of increases, which is at most $n - 1$.

Therefore, the overall running time of the algorithm is in $O(n)$.

Matching algorithm complexity

Assuming the failure function is computed and stored in an array so that the access time is constant, the complexity of the Knuth-Morris-Pratt algorithm is $O(m)$. The proof of this fact is analogous to the previous proof: $p$ is incremented at most once per iteration of the for loop, and every iteration of the while loop strictly decreases $p$.
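To make the amortized argument tangible, here is an instrumented variant of compute_f (a sketch mirroring the function above; the counting is my addition) that counts how often the fallback k = f[k - 1] actually executes. The total never exceeds the length of the pattern:

```python
def compute_f_counting(b: str):
    """Like compute_f above, but additionally counts how many times
    the fallback k = f[k - 1] is executed over the whole run."""
    n = len(b)
    f = [0]
    fallbacks = 0
    for i in range(1, n):
        k = f[i - 1]
        while k != 0 and b[i] != b[k]:
            k = f[k - 1]  # fall back to a shorter prefix
            fallbacks += 1
        if b[i] == b[k]:
            k += 1
        f.append(k)
    return f, fallbacks

f, fallbacks = compute_f_counting("baababa" * 100)
print(fallbacks <= len("baababa" * 100))  # True: at most one fallback per increment of k
```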


Copyright © 2019 — 2021 Alexander Mayorov. All rights reserved.