Rewrite Rabin-Karp explanation for clarity #1356
@@ -6,17 +6,27 @@ e_maxx_link: rabin_karp
# Rabin-Karp Algorithm for string matching

This algorithm is based on the concept of hashing, so if you are not familiar with string hashing, refer to the [string hashing](string-hashing.md) article.
This algorithm was authored by Rabin and Karp in 1987.
Consider the string-matching problem: given a pattern of length $m$ to find in a text of length $n$, find all matches in $\Theta(m+n) = \Theta(n)$ average time.

Problem: Given two strings, a pattern $s$ and a text $t$, determine if the pattern appears in the text and, if it does, enumerate all its occurrences in $O(|s| + |t|)$ time.
The naive solution is to simply check all length-$m$ substrings of the length-$n$ text, but that would take $\Theta(mn)$ time.
Algorithm: Calculate the hash for the pattern $s$.
Calculate hash values for all the prefixes of the text $t$.
Now, we can compare a substring of length $|s|$ with $s$ in constant time using the calculated hashes.
So, compare each substring of length $|s|$ with the pattern. This will take a total of $O(|t|)$ time.
Hence the final complexity of the algorithm is $O(|t| + |s|)$: $O(|s|)$ is required for calculating the hash of the pattern and $O(|t|)$ for comparing each substring of length $|s|$ with the pattern.
This algorithm was authored by Rabin and Karp in 1987 and is based on the concept of [string hashing](string-hashing.md).

The Rabin-Karp algorithm uses the concept of a "rolling hash". In it, a hash function is chosen in such a way that the hash of the first text substring of size $m$ is computed in $\Theta(m)$ time, but the hashes of subsequent substrings of length $m$ are computed in $O(1)$ per substring (hence the term "rolling"). Then the hash of the pattern is compared to the hash of each text substring, and if they are the same, there is a match with high probability.

## The hash function

The trick is a special polynomial hash function, also known as a Rabin fingerprint. It is defined as

$$h(S[i..j]) = S[i] x^{m-1} + S[i+1] x^{m-2} + \cdots + S[j-1] x + S[j]$$

This is a polynomial in the variable $x$, where $m = j - i + 1$ is the length of the substring. $x$ is chosen to be at least as large as the alphabet, so that every fixed-length substring has a unique hash. For example, for a string of bytes, we can take $x = 256$. However, the resulting hash value is large, so we work modulo a large prime $p$. This creates a small chance of hash collisions that needs to be handled. In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.
> It doesn't, hash collisions only lead to false matches. Besides, we should add:
>
> But I'm generally not sure whether it should be covered right here, or in the general hashing article (in which case this paragraph should link to a corresponding paragraph).

> The way it is presented usually is that for every hash match we still check for correctness. And the time complexity is in CLRS.

> Then it diverges from the provided implementation, and searching for all matches of

> Yes, the assumption in CLRS is that the number of valid matches is low. I'm not opposed to rewriting it to assume a hash collision never happens, but I would make that explicit.

> I'd say what we need is not the assumption that a collision never happens, but rather a proper estimate on the probability of it happening, so that people can adjust their randomization range and e.g. the number of mods or bases to be checked to ensure a probability that they're fine with.
Now, to "roll" the hash forward,

$$h(S[i+1..j+1]) = x \cdot h(S[i..j]) - S[i] x^m + S[j+1]$$
> This isn't consistent with the computations in the code snippet, as the code snippet uses prefix sums instead of rolling.

> I think the rolling is more intuitive and that's how CLRS and Sleator present it. I can make a note saying the code uses prefix sums instead. If I make new code it'll probably be a separate PR.

> I think competitive programmers usually already know about prefix sums when they learn hashing, so an explanation via prefix sums should be more intuitive, as it refers to a familiar concept, while rolling is, in essence, an ad hoc trick. Prefix sums are also more versatile and practical, as they allow comparing any two substrings without the need to compute hashes of all substrings of the same size, which is what you often need in many generic hash problems (so, they'd need to learn the prefix sums approach anyway). So, mentioning rolling is fine of course, as knowing it broadens the horizons, but I'd also put at least as much emphasis on prefix sums just due to how practical and useful they are in the context of polynomial hashing. The only concern I have here is that it will probably make more sense to have a dedicated section about prefix sums in the hashing article itself, and link to that from here.

> Idk if using prefix sums is even Rabin-Karp any more. Maybe Rabin-Karp can be grouped in with other sliding-window type functions.

> To some extent it is, e.g. this refers to the prefix sums based approach as a variant of Rabin-Karp. Though it is not popular in literature at all, in the competitive programming context it's common to use "Rabin-Karp algorithm" to refer to the idea of using a polynomial hash for substring matches, rather than to the specific rolling hash based variant. This, in turn, was partly popularized by the fact that the original e-maxx.ru article took this approach...
What we did is multiply all terms by $x$, then subtract the $S[i]$ term, then add the new $S[j+1]$ term. This lets us compute each subsequent hash in constant time.

## Implementation

```{.cpp file=rabin_karp}

@@ -54,3 +64,7 @@ vector<int> rabin_karp(string const& s, string const& t) {
* [Codeforces - Palindromic characteristics](https://codeforces.com/problemset/problem/835/D)
* [Leetcode - Longest Duplicate Substring](https://leetcode.com/problems/longest-duplicate-substring/)

## Resources

* [Sleator - String matching algorithms](https://contest.cs.cmu.edu/295/s20/tutorials/strings.mark)
* [CLRS](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) 3rd ed., Ch. 32.2