Rewrite Rabin-Karp explanation for clarity #1356


Closed · wants to merge 5 commits

Changes from all commits
32 changes: 23 additions & 9 deletions src/string/rabin-karp.md
@@ -6,17 +6,27 @@ e_maxx_link: rabin_karp

# Rabin-Karp Algorithm for string matching

-This algorithm is based on the concept of hashing, so if you are not familiar with string hashing, refer to the [string hashing](string-hashing.md) article.
-
-This algorithm was authored by Rabin and Karp in 1987.
-
-Problem: Given two strings - a pattern $s$ and a text $t$, determine if the pattern appears in the text and if it does, enumerate all its occurrences in $O(|s| + |t|)$ time.
-
-Algorithm: Calculate the hash for the pattern $s$.
-Calculate hash values for all the prefixes of the text $t$.
-Now, we can compare a substring of length $|s|$ with $s$ in constant time using the calculated hashes.
-So, compare each substring of length $|s|$ with the pattern. This will take a total of $O(|t|)$ time.
-Hence the final complexity of the algorithm is $O(|t| + |s|)$: $O(|s|)$ is required for calculating the hash of the pattern and $O(|t|)$ for comparing each substring of length $|s|$ with the pattern.
+Consider the string-matching problem: given a pattern of length $m$ and a text of length $n$, find all matches of the pattern in the text in $\Theta(m+n) = \Theta(n)$ average time.
+
+The naive solution is to simply check all length-$m$ substrings of the length-$n$ text, but that would take $\Theta(mn)$ time.
+
+This algorithm was authored by Rabin and Karp in 1987 and is based on the concept of [string hashing](string-hashing.md).
+
+The Rabin-Karp algorithm uses the concept of a "rolling hash". A hash function is chosen in such a way that the hash of the first text substring of size $m$ is computed in $\Theta(m)$ time, but the hashes of the subsequent length-$m$ substrings are each computed in $O(1)$ (hence the term "rolling"). The hash of the pattern is then compared to the hash of each text substring, and if they are equal, there is a match with high probability.
+
+## The hash function
+
+The trick is the special polynomial hash function, also known as a Rabin fingerprint. It is defined as
+
+$$h(S[i..j]) = S[i] x^{m-1} + S[i+1] x^{m-2} + \cdots + S[j-1] x + S[j]$$
+
+This is a polynomial in the variable $x$, where $x$ is chosen to be at least as large as the alphabet size, so that every fixed-length substring has a unique hash. For example, for a string of bytes, we can take $x = 256$. However, the resulting hash value is large, so we work modulo a large prime $p$. This creates a small chance of hash collisions that needs to be handled. In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.
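A minimal sketch of computing this fingerprint with Horner's rule; the function name is illustrative, and $x = 256$, $p = 10^9 + 7$ are example choices, not prescribed by the article:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

// h(S) = S[0] x^{m-1} + S[1] x^{m-2} + ... + S[m-1]  (mod p),
// evaluated left to right by Horner's rule.
uint64_t fingerprint(const std::string& S) {
    const uint64_t x = 256, p = 1'000'000'007;
    uint64_t h = 0;
    for (unsigned char c : S)
        h = (h * x + c) % p;  // h stays below p, so h * x cannot overflow
    return h;
}

int main() {
    std::cout << fingerprint("abracadabra") << '\n';
}
```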
@adamant-pwn (Member) commented on Oct 13, 2024:

> In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.

It doesn't, hash collisions only lead to false matches.

Besides, we should add:

  1. What should be randomized (either $p$ or $x$ should do) and how;
  2. What is the expected probability of a false match due to hash collision.

But I'm generally not sure whether it should be covered right here, or in the general hashing article (in which case this paragraph should link to a corresponding paragraph).

@jxu (Contributor Author) replied:

The way it is usually presented is that for every hash match we still check for correctness. And the time complexity analysis is in CLRS.

@adamant-pwn (Member) replied:

Then it diverges from the provided implementation, and searching for all matches of $a^{n/2}$ in $a^n$ would be quadratic due to the checks for correctness...
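(Indeed, the pattern $a^{n/2}$ matches at every one of the $n/2 + 1$ alignments, and each correctness check costs $\Theta(n/2)$, for $\Theta(n^2)$ total.)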

@jxu (Contributor Author) replied:

Yes, the assumption in CLRS is that the number of valid matches is low. I'm not opposed to rewriting it to assume a hash collision never happens, but I would make that explicit.

@adamant-pwn (Member) replied on Oct 14, 2024:

I'd say what we need is not the assumption that a collision never happens, but rather a proper estimate of the probability of it happening, so that people can adjust their randomization range (e.g. the number of mods or bases to be checked) to ensure a probability that they're fine with.
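(For reference, the standard estimate when the base $x$ is drawn uniformly at random modulo a prime $p$: two distinct length-$m$ strings $u \ne v$ collide only if $x$ is a root of the nonzero difference polynomial $h(u) - h(v)$ of degree less than $m$, so

$$\Pr[h(u) = h(v)] \le \frac{m-1}{p},$$

and a union bound over the $n - m + 1$ text windows gives $O(nm/p)$ expected false matches.)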


Now, to "roll" the hash forward,

$$h(S[i+1..j+1) = x \cdot h(S[i..j]) - S[i] x^m + S[j+1]$$
@adamant-pwn (Member) commented:

This isn't consistent with the computations in the code snippet, as the code snippet uses prefix sums instead of rolling.

@jxu (Contributor Author) replied on Oct 13, 2024:

I think the rolling is more intuitive and that's how CLRS and Sleator present it. I can make a note saying the code uses prefix sums instead. If I make new code it'll probably be a separate PR.

@adamant-pwn (Member) replied:

I think competitive programmers usually already know about prefix sums when they learn hashing, so an explanation via prefix sums should be more intuitive, as it refers to a familiar concept, while rolling is, in essence, an ad hoc trick. Prefix sums are also more versatile and practical, as they allow comparing any two substrings without the need to compute hashes of all substrings of the same size, which is what you often need in many generic hash problems (so they'd need to learn the prefix sums approach anyway).

So, mentioning rolling is fine of course, as knowing it broadens the horizons, but I'd also put at least as much emphasis on prefix sums just due to how practical and useful they are in the context of polynomial hashing. The only concern I have here is that it will probably make more sense to have a dedicated section about prefix sums in the hashing article itself, and link to that from here.
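For concreteness, a sketch of the prefix-hash approach described here; the struct name and the choices $x = 256$, $p = 10^9 + 7$ are illustrative:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Prefix hashes: h[i] is the hash of s[0..i-1], so the hash of any
// substring is recovered in O(1), allowing comparison of arbitrary
// equal-length substrings.
struct PrefixHash {
    static constexpr uint64_t x = 256, p = 1'000'000'007;
    std::vector<uint64_t> h, xp;  // h[i]: prefix hash; xp[i]: x^i mod p

    PrefixHash(const std::string& s) : h(s.size() + 1, 0), xp(s.size() + 1, 1) {
        for (size_t i = 0; i < s.size(); i++) {
            h[i + 1] = (h[i] * x + (unsigned char)s[i]) % p;
            xp[i + 1] = xp[i] * x % p;
        }
    }

    // Hash of s[i..i+len-1]: h[i+len] - h[i] * x^len  (mod p)
    uint64_t get(int i, int len) const {
        return (h[i + len] + (p - h[i] * xp[len] % p)) % p;
    }
};
```

Two substrings then compare as `get(i, len) == get(j, len)` (up to collisions), with no need to precompute the hashes of all substrings of one fixed length.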

@jxu (Contributor Author) replied:

Idk if using prefix sums is even Rabin-Karp any more. Maybe Rabin-Karp can be grouped in with other sliding-window type functions.

@adamant-pwn (Member) replied:

To some extent it is; e.g. this refers to the prefix sums based approach as a variant of Rabin-Karp. Though it is not popular in the literature at all, in the competitive programming context it's common to use "Rabin-Karp algorithm" to refer to the idea of using a polynomial hash for substring matching, rather than to the specific rolling-hash-based variant. This, in turn, was partly popularized by the fact that the original e-maxx.ru article took this approach...


+What we did is multiply all terms by $x$, then subtract off the $S[i]$ term, then add the new $S[j+1]$ term. This lets us compute the next hash in constant time.
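To make the rolling concrete, here is a minimal sketch of the rolling-hash search (distinct from the article's prefix-sum implementation below); the function name is illustrative, and $x = 256$, $p = 10^9 + 7$ are example choices:

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Finds all occurrences of pattern s in text t using the rolling hash above.
// On a hash match this sketch reports a match without verifying it, so a
// collision would produce a false positive (see the discussion above).
std::vector<int> rabin_karp_rolling(const std::string& s, const std::string& t) {
    const uint64_t x = 256, p = 1'000'000'007;
    int m = s.size(), n = t.size();
    std::vector<int> occurrences;
    if (m > n) return occurrences;

    uint64_t xm = 1;  // x^m mod p, needed to remove the outgoing character
    for (int i = 0; i < m; i++) xm = xm * x % p;

    uint64_t hs = 0, ht = 0;  // hashes of the pattern and the first window
    for (int i = 0; i < m; i++) {
        hs = (hs * x + (unsigned char)s[i]) % p;
        ht = (ht * x + (unsigned char)t[i]) % p;
    }

    for (int i = 0; i + m <= n; i++) {
        if (hs == ht) occurrences.push_back(i);
        if (i + m < n)  // roll: h <- x*h - t[i]*x^m + t[i+m]  (mod p)
            ht = (ht * x + (unsigned char)t[i + m] + (p - xm) * (unsigned char)t[i]) % p;
    }
    return occurrences;
}

int main() {
    for (int i : rabin_karp_rolling("aba", "abacaba")) std::cout << i << ' ';  // 0 4
}
```

If, as in CLRS, every hash match is additionally verified by a direct comparison, the false positives disappear, at the cost of the worst-case $\Theta(nm)$ behaviour discussed above.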

## Implementation
```{.cpp file=rabin_karp}
// ... (function body unchanged; collapsed in the diff view)
```

@@ -54,3 +64,7 @@ vector<int> rabin_karp(string const& s, string const& t) {
* [Codeforces - Palindromic characteristics](https://codeforces.com/problemset/problem/835/D)
* [Leetcode - Longest Duplicate Substring](https://leetcode.com/problems/longest-duplicate-substring/)

+## Resources
+
+* [Sleator - String matching algorithms](https://contest.cs.cmu.edu/295/s20/tutorials/strings.mark)
+* [CLRS](https://en.wikipedia.org/wiki/Introduction_to_Algorithms) 3ed, Ch 32.2