Rewrite Rabin-Karp explanation for clarity #1356


Closed · wants to merge 5 commits

Changes from all commits
32 changes: 23 additions & 9 deletions src/string/rabin-karp.md
@@ -6,17 +6,27 @@ e_maxx_link: rabin_karp

# Rabin-Karp Algorithm for string matching

-This algorithm is based on the concept of hashing, so if you are not familiar with string hashing, refer to the [string hashing](string-hashing.md) article.
-
-This algorithm was authored by Rabin and Karp in 1987.
-
-Problem: Given two strings - a pattern $s$ and a text $t$, determine if the pattern appears in the text and if it does, enumerate all its occurrences in $O(|s| + |t|)$ time.
-
-Algorithm: Calculate the hash for the pattern $s$.
-Calculate hash values for all the prefixes of the text $t$.
-Now, we can compare a substring of length $|s|$ with $s$ in constant time using the calculated hashes.
-So, compare each substring of length $|s|$ with the pattern. This will take a total of $O(|t|)$ time.
-Hence the final complexity of the algorithm is $O(|t| + |s|)$: $O(|s|)$ is required for calculating the hash of the pattern and $O(|t|)$ for comparing each substring of length $|s|$ with the pattern.
+Consider the string-matching problem: given a pattern of length $m$ and a text of length $n$, find all matches of the pattern in the text in $\Theta(m+n) = \Theta(n)$ average time.
+
+The naive solution is to simply check all length-$m$ substrings of the length-$n$ text, but that would take $\Theta(mn)$ time.
+
+This algorithm was authored by Rabin and Karp in 1987 and is based on the concept of [string hashing](string-hashing.md).
+
+The Rabin-Karp algorithm uses the concept of a "rolling hash". A hash function is chosen in such a way that the hash of the first text substring of size $m$ is computed in $\Theta(m)$ time, but the hashes of the subsequent length-$m$ substrings are each computed in $O(1)$ (hence the term "rolling"). The hash of the pattern is then compared to the hash of each text substring, and if they are equal, there is a match with high probability.
+
+## The hash function
+
+The trick is the special polynomial hash function, also known as a Rabin fingerprint. It is defined as
+
+$$h(S[i..j]) = S[i] x^{m-1} + S[i+1] x^{m-2} + \cdots + S[j-1] x + S[j]$$
+
+This is a polynomial in the variable $x$, where $x$ is chosen to be at least as large as the alphabet size, so that every fixed-length substring has a unique hash. For example, for a string of bytes, we can take $x = 256$. However, the resulting hash value is large, so we work modulo a large prime $p$. This creates a small chance of hash collisions that needs to be handled. In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.
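A minimal sketch of computing this fingerprint with Horner's rule; the function name is illustrative, and $x = 256$, $p = 10^9 + 7$ are example choices, not prescribed by the article:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// h(S) = S[0] x^{m-1} + S[1] x^{m-2} + ... + S[m-1]  (mod p),
// evaluated left to right by Horner's rule.
uint64_t fingerprint(const std::string& S) {
    const uint64_t x = 256, p = 1'000'000'007;
    uint64_t h = 0;
    for (unsigned char c : S)
        h = (h * x + c) % p;  // h stays below p, so h * x cannot overflow
    return h;
}

int main() {
    std::cout << fingerprint("abracadabra") << '\n';
}
```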
@adamant-pwn (Member) commented on Oct 13, 2024:

> In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.

It doesn't, hash collisions only lead to false matches.

Besides, we should add:

  1. What should be randomized (either $p$ or $x$ should do) and how;
  2. What is the expected probability of a false match due to hash collision.

But I'm generally not sure whether it should be covered right here, or in the general hashing article (in which case this paragraph should link to a corresponding paragraph).

@jxu (Contributor Author) replied:

The way it is usually presented is that for every hash match we still check for correctness. And the time complexity analysis is in CLRS.

@adamant-pwn (Member) replied:

Then it diverges from the provided implementation, and searching for all matches of $a^{n/2}$ in $a^n$ would be quadratic due to the checks for correctness...
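(Indeed, the pattern $a^{n/2}$ matches at every one of the $n/2 + 1$ alignments, and each correctness check costs $\Theta(n/2)$, for $\Theta(n^2)$ total.)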

@jxu (Contributor Author) replied:

Yes, the assumption in CLRS is that the number of valid matches is low. I'm not opposed to rewriting it to assume a hash collision never happens, but I would make that explicit.

@adamant-pwn (Member) replied on Oct 14, 2024:

I'd say what we need is not the assumption that a collision never happens, but rather a proper estimate of the probability of it happening, so that people can adjust their randomization range (e.g. the number of mods or bases to be checked) to ensure a probability that they're fine with.
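(For reference, the standard estimate when the base $x$ is drawn uniformly at random modulo a prime $p$: two distinct length-$m$ strings $u \ne v$ collide only if $x$ is a root of the nonzero difference polynomial $h(u) - h(v)$ of degree less than $m$, so

$$\Pr[h(u) = h(v)] \le \frac{m-1}{p},$$

and a union bound over the $n - m + 1$ text windows gives $O(nm/p)$ expected false matches.)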


Now, to "roll" the hash forward,

$$h(S[i+1..j+1) = x \cdot h(S[i..j]) - S[i] x^m + S[j+1]$$
@adamant-pwn (Member) commented:

This isn't consistent with the computations in the code snippet, as the code snippet uses prefix sums instead of rolling.

@jxu (Contributor Author) replied on Oct 13, 2024:

I think the rolling is more intuitive and that's how CLRS and Sleator present it. I can make a note saying the code uses prefix sums instead. If I make new code it'll probably be a separate PR.

@adamant-pwn (Member) replied:

I think competitive programmers usually already know about prefix sums when they learn hashing, so an explanation via prefix sums should be more intuitive, as it refers to a familiar concept, while rolling is, in essence, an ad hoc trick. Prefix sums are also more versatile and practical, as they allow comparing any two substrings without the need to compute hashes of all substrings of the same size, which is what you often need in many generic hash problems (so they'd need to learn the prefix sums approach anyway).

So, mentioning rolling is fine of course, as knowing it broadens the horizons, but I'd also put at least as much emphasis on prefix sums just due to how practical and useful they are in the context of polynomial hashing. The only concern I have here is that it will probably make more sense to have a dedicated section about prefix sums in the hashing article itself, and link to that from here.
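For concreteness, a sketch of the prefix-hash approach described here; the struct name and the choices $x = 256$, $p = 10^9 + 7$ are illustrative:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Prefix hashes: h[i] is the hash of s[0..i-1], so the hash of any
// substring is recovered in O(1), allowing comparison of arbitrary
// equal-length substrings.
struct PrefixHash {
    static constexpr uint64_t x = 256, p = 1'000'000'007;
    std::vector<uint64_t> h, xp;  // h[i]: prefix hash; xp[i]: x^i mod p

    PrefixHash(const std::string& s) : h(s.size() + 1, 0), xp(s.size() + 1, 1) {
        for (size_t i = 0; i < s.size(); i++) {
            h[i + 1] = (h[i] * x + (unsigned char)s[i]) % p;
            xp[i + 1] = xp[i] * x % p;
        }
    }

    // Hash of s[i..i+len-1]: h[i+len] - h[i] * x^len  (mod p)
    uint64_t get(int i, int len) const {
        return (h[i + len] + (p - h[i] * xp[len] % p)) % p;
    }
};
```

Two substrings then compare as `get(i, len) == get(j, len)` (up to collisions), with no need to precompute the hashes of all substrings of one fixed length.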

@jxu (Contributor Author) replied:

Idk if using prefix sums is even Rabin-Karp any more. Maybe Rabin-Karp can be grouped in with other sliding-window type functions.

@adamant-pwn (Member) replied:

To some extent it is; e.g. this refers to the prefix sums based approach as a variant of Rabin-Karp. Though it is not popular in the literature at all, in the competitive programming context it's common to use "Rabin-Karp algorithm" to refer to the idea of using a polynomial hash for substring matching, rather than to the specific rolling-hash-based variant. This, in turn, was partly popularized by the fact that the original e-maxx.ru article took this approach...


+What we did is multiply all terms by $x$, then subtract off the $S[i]$ term, then add the new $S[j+1]$ term. This lets us compute the next hash in constant time.
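To make the rolling concrete, here is a minimal sketch of the rolling-hash search (distinct from the article's prefix-sum implementation below); the function name is illustrative, and $x = 256$, $p = 10^9 + 7$ are example choices:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Finds all occurrences of pattern s in text t using the rolling hash above.
// On a hash match this sketch reports a match without verifying it, so a
// collision would produce a false positive (see the discussion above).
std::vector<int> rabin_karp_rolling(const std::string& s, const std::string& t) {
    const uint64_t x = 256, p = 1'000'000'007;
    int m = s.size(), n = t.size();
    std::vector<int> occurrences;
    if (m > n) return occurrences;

    uint64_t xm = 1;  // x^m mod p, needed to remove the outgoing character
    for (int i = 0; i < m; i++) xm = xm * x % p;

    uint64_t hs = 0, ht = 0;  // hashes of the pattern and the first window
    for (int i = 0; i < m; i++) {
        hs = (hs * x + (unsigned char)s[i]) % p;
        ht = (ht * x + (unsigned char)t[i]) % p;
    }

    for (int i = 0; i + m <= n; i++) {
        if (hs == ht) occurrences.push_back(i);
        if (i + m < n)  // roll: h <- x*h - t[i]*x^m + t[i+m]  (mod p)
            ht = (ht * x + (unsigned char)t[i + m] + (p - xm) * (unsigned char)t[i]) % p;
    }
    return occurrences;
}

int main() {
    for (int i : rabin_karp_rolling("aba", "abacaba")) std::cout << i << ' ';  // 0 4
}
```

If, as in CLRS, every hash match is additionally verified by a direct comparison, the false positives disappear, at the cost of the worst-case $\Theta(nm)$ behaviour discussed above.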

## Implementation
```{.cpp file=rabin_karp}
// ... (function body unchanged; collapsed in the diff view)
```

@@ -54,3 +64,7 @@ vector<int> rabin_karp(string const& s, string const& t) {
* [Codeforces - Palindromic characteristics](https://codeforces.com/problemset/problem/835/D)
* [Leetcode - Longest Duplicate Substring](https://leetcode.com/problems/longest-duplicate-substring/)

+## Resources
+
+* [Sleator - String matching algorithms](https://contest.cs.cmu.edu/295/s20/tutorials/strings.mark)
+* [CLRS](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) 3ed, Ch 32.2