Rewrite Rabin-Karp explanation for clarity #1356


Closed
wants to merge 5 commits

Conversation

@jxu jxu (Contributor) commented Oct 13, 2024


Visit the preview URL for this PR (for commit bb6fa74):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-20T19:57:13.357100300Z)

This algorithm was authored by Rabin and Karp in 1987.

Problem: Given two strings - a pattern $s$ and a text $t$, determine if the pattern appears in the text and if it does, enumerate all its occurrences in $O(|s| + |t|)$ time.
Problem: Given two strings - a length $m$ pattern to find in a length $n$ text, find all matches in $\Theta(m+n) = \Theta(n)$ average time.
Member

I think for the intro, it may be better to start with the problem formulation, and then go on to talk about the algorithm. E.g. as follows:

Consider the following problem: You are given a length $n$ text, and a length $m \leq n$ pattern to find in the text. You need to find all occurrences of the pattern in the text in $\Theta(m+n) = \Theta(n)$ expected time.

The algorithm described below was authored by Rabin and Karp in 1987 and is based on the concept of [string hashing](string-hashing.md).

We should also write that here randomization affects the correctness of the algorithm, rather than its execution time, hence there should be analysis of collision probabilities.

Contributor Author

I don't think the randomization affects the correctness if you check for hash collisions. If you ignore them, then yes.

Member

If the task is to find all occurrences of the pattern in the text, you can't check them all; that would lead to worst-case $O(nm)$ time even with a perfect hash function. The current implementation and the task statement heavily imply that we just report every hash coincidence as a match.

@jxu jxu (Contributor Author) Oct 14, 2024

I reported the worst-case time too. I do not think we should compromise the correctness because of the worst-case time which should be rare. Sleator goes over both options so maybe both should be mentioned.

Member

Checking occurrences is somewhat fine if you only want to check whether the pattern occurs at all (and possibly report one specific match). We can suggest doing a full match check in this specific case, but then we should also provide some bounds on the probability that the number of false matches is high.

For the formulation of finding all matches, though, we shouldn't assume that the number of valid matches is low at all; this is exactly the kind of assumption that will be wrong in a competitive programming context with a reasonably well-prepared test set.


$$h(S[i..j]) = S[i] x^{m-1} + S[i+1] x^{m-2} + \cdots + S[j-1] x + S[j]$$

This is a polynomial in the variable $x$, where $x$ is chosen to be at least as large as the alphabet so that, before reduction, every fixed-length substring has a unique hash. For example, for a string of bytes, we can take $x = 256$. However, the resulting hash value is large, so we work modulo a large prime $p$. This creates a small chance of hash collisions that needs to be handled. In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.
@adamant-pwn adamant-pwn (Member) Oct 13, 2024

In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.

It doesn't, hash collisions only lead to false matches.

Besides, we should add:

  1. What should be randomized (either $p$ or $x$ should do) and how;
  2. What is the expected probability of a false match due to hash collision.

But I'm generally not sure whether it should be covered right here, or in the general hashing article (in which case this paragraph should link to a corresponding paragraph).

Contributor Author

The way it is usually presented is that for every hash match we still check for correctness. And the time complexity analysis is in CLRS.

Member

Then it diverges from the provided implementation, and searching for all matches of $a^{n/2}$ in $a^n$ would be quadratic due to the checks for correctness...

Contributor Author

Yes, the assumption in CLRS is that the number of valid matches is low. I'm not opposed to rewriting it to assume a hash collision never happens, but I would make that explicit.

@adamant-pwn adamant-pwn (Member) Oct 14, 2024

I'd say what we need is not the assumption that a collision never happens, but rather a proper estimate on the probability of it happening, so that people can adjust their randomization range and e.g. number of mods or bases to be checked to ensure a probability that they're fine with.
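As a rough sketch of the kind of estimate meant here (a standard heuristic, not from this thread, assuming the hash of each non-matching window behaves like a uniform value in $\{0, \dots, p-1\}$):

$$\Pr[\text{a fixed non-matching window collides with the pattern}] \approx \frac{1}{p},$$

$$\mathbb{E}[\text{false matches}] \lesssim \frac{n - m + 1}{p} \le \frac{n}{p}.$$

For example, $n = 10^6$ and $p \approx 10^9$ give about $10^{-3}$ expected false matches, and checking $k$ independent moduli drives this down to roughly $n / p^k$.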


Now, to "roll" the hash forward,

$$h(S[i+1..j+1]) = x \cdot h(S[i..j]) - S[i] x^m + S[j+1]$$
Member

This isn't consistent with the computations in the code snippet, as the code snippet uses prefix sums instead of rolling.

@jxu jxu (Contributor Author) Oct 13, 2024

I think the rolling is more intuitive and that's how CLRS and Sleator present it. I can make a note saying the code uses prefix sums instead. If I make new code it'll probably be a separate PR.

Member

I think competitive programmers usually already know about prefix sums by the time they learn hashing, so an explanation via prefix sums should be more intuitive, as it refers to a familiar concept, while rolling is, in essence, an ad hoc trick. Prefix sums are also more versatile and practical, as they allow comparing any two substrings without the need to compute hashes of all substrings of the same size, which is what you often need in many generic hash problems (so they'd need to learn the prefix sums approach anyway).

So, mentioning rolling is fine of course, as knowing it broadens the horizons, but I'd put at least as much emphasis on prefix sums just due to how practical and useful they are in the context of polynomial hashing. The only concern I have here is that it would probably make more sense to have a dedicated section about prefix sums in the hashing article itself, and link to that from here.

Contributor Author

I'm not sure using prefix sums is even Rabin-Karp anymore. Maybe Rabin-Karp can be grouped in with other sliding-window type techniques.

Member

To some extent it is; e.g. this refers to the prefix-sums-based approach as a variant of Rabin-Karp. Though it is not popular in the literature at all, in a competitive programming context it's common to use "Rabin-Karp algorithm" to refer to the idea of using a polynomial hash for substring matching, rather than to the specific rolling-hash-based variant. This, in turn, was partly popularized by the fact that the original e-maxx.ru article took this approach...

Contributor

Visit the preview URL for this PR (for commit fec16e6):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-20T23:55:51.192703689Z)

jxu added a commit that referenced this pull request Oct 14, 2024
Contributor

Visit the preview URL for this PR (for commit 8833f67):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-21T00:05:36.794629067Z)

jxu added a commit that referenced this pull request Oct 14, 2024
Contributor

Visit the preview URL for this PR (for commit 18daf94):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-21T00:14:58.619836763Z)

@jxu jxu marked this pull request as draft October 14, 2024 00:33
@adamant-pwn adamant-pwn deleted the branch main October 14, 2024 18:53
@adamant-pwn adamant-pwn reopened this Oct 14, 2024
@adamant-pwn adamant-pwn changed the base branch from master to main October 14, 2024 19:16
Contributor

github-actions bot commented Oct 14, 2024

Preview the changes for PR #1356 (1c129a7) here: https://cp-algorithms.github.io/cp-algorithms/1356/

@jxu jxu closed this Oct 24, 2024
@jxu jxu deleted the jxu-patch-1 branch October 24, 2024 15:02
github-actions bot added a commit that referenced this pull request Oct 24, 2024
Successfully merging this pull request may close these issues.

Rabin-Karp for String Matching: Mention rolling hash
2 participants