Rewrite Rabin-Karp explanation for clarity #1356


Closed
wants to merge 5 commits

Conversation

@jxu jxu (Contributor) commented Oct 13, 2024


Visit the preview URL for this PR (for commit bb6fa74):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-20T19:57:13.357100300Z)

This algorithm was authored by Rabin and Karp in 1987.

Problem: Given two strings - a pattern $s$ and a text $t$, determine if the pattern appears in the text and if it does, enumerate all its occurrences in $O(|s| + |t|)$ time.
Problem: Given two strings - a length $m$ pattern to find in a length $n$ text, find all matches in $\Theta(m+n) = \Theta(n)$ average time.
Member

I think for the intro, it may be better to start with the problem formulation, and then go on to talk about the algorithm. E.g. as follows:

Consider the following problem: You are given a length $n$ text, and a length $m \leq n$ pattern to find in the text. You need to find all occurrences of the pattern in the text in $\Theta(m+n) = \Theta(n)$ expected time.

The algorithm described below was authored by Rabin and Karp in 1987 and is based on the concept of [string hashing](string-hashing.md).

We should also write that here randomization affects the correctness of the algorithm, rather than its execution time, hence there should be analysis of collision probabilities.

Contributor Author

I don't think the randomization affects the correctness if you check for hash collisions. If you ignore them, then yes.

Member

If the task is to find all occurrences of the pattern in the text, you can't check them all; that would lead to worst-case $O(nm)$ time even with a perfect hash function. The current implementation and the task statement heavily imply that we just report every hash coincidence as a match.

@jxu jxu (Contributor Author) Oct 14, 2024

I reported the worst-case time too. I do not think we should compromise the correctness because of the worst-case time which should be rare. Sleator goes over both options so maybe both should be mentioned.

Member

Checking occurrences is somewhat fine if you only want to check whether the pattern occurs at all (and possibly report one specific match). We can suggest doing a full match check in this specific case, but then we should also provide some bounds on the probability that the number of false matches is high.

For the formulation of finding all matches, though, we shouldn't assume that the number of valid matches is low at all; this is exactly the kind of assumption that will be wrong in a competitive programming context with a reasonably well-prepared test set.


$$h(S[i..j]) = S[i] x^{m-1} + S[i+1] x^{m-2} + \cdots + S[j-1] x + S[j]$$

This is a polynomial in the variable $x$, where $x$ is chosen to be at least as large as the alphabet so that, before reduction, every fixed-length substring has a unique hash. For example, for a string of bytes, we can take $x = 256$. However, the resulting hash value is large, so we work modulo a large prime $p$. This creates a small chance of hash collisions that needs to be handled. In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.
@adamant-pwn adamant-pwn (Member) Oct 13, 2024

In the worst case of a hash collision every time, the performance degrades to $\Theta(nm)$.

It doesn't, hash collisions only lead to false matches.

Besides, we should add:

  1. What should be randomized (either $p$ or $x$ should do) and how;
  2. What is the expected probability of a false match due to hash collision.

But I'm generally not sure whether it should be covered right here, or in the general hashing article (in which case this paragraph should link to a corresponding paragraph).

Contributor Author

The way it is usually presented is that for every hash match we still check for correctness. And the time complexity analysis is in CLRS.

Member

Then it diverges from the provided implementation, and searching for all matches of $a^{n/2}$ in $a^n$ would be quadratic due to the checks for correctness...

Contributor Author

Yes, the assumption in CLRS is that the number of valid matches is low. I'm not opposed to rewriting it to assume a hash collision never happens, but I would make that explicit.

@adamant-pwn adamant-pwn (Member) Oct 14, 2024

I'd say what we need is not the assumption that a collision never happens, but rather a proper estimate on the probability of it happening, so that people can adjust their randomization range and e.g. number of mods or bases to be checked to ensure a probability that they're fine with.
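As a rough sketch of the kind of estimate meant here (a standard heuristic, not from this thread, assuming the hash of each non-matching window behaves like a uniform value in $\{0, \dots, p-1\}$):

$$\Pr[\text{a fixed non-matching window collides with the pattern}] \approx \frac{1}{p},$$

$$\mathbb{E}[\text{false matches}] \lesssim \frac{n - m + 1}{p} \le \frac{n}{p}.$$

For example, $n = 10^6$ and $p \approx 10^9$ give about $10^{-3}$ expected false matches, and checking $k$ independent moduli drives this down to roughly $n / p^k$.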


Now, to "roll" the hash forward,

$$h(S[i+1..j+1]) = x \cdot h(S[i..j]) - S[i] x^m + S[j+1]$$
Member

This isn't consistent with the computations in the code snippet, as the code snippet uses prefix sums instead of rolling.

@jxu jxu (Contributor Author) Oct 13, 2024

I think the rolling is more intuitive and that's how CLRS and Sleator present it. I can make a note saying the code uses prefix sums instead. If I make new code it'll probably be a separate PR.

Member

I think competitive programmers usually already know about prefix sums by the time they learn hashing, so an explanation via prefix sums should be more intuitive, as it refers to a familiar concept, while rolling is, in essence, an ad hoc trick. Prefix sums are also more versatile and practical, as they allow comparing any two substrings without the need to compute hashes of all substrings of the same size, which is what you often need in many generic hash problems (so they'd need to learn the prefix sums approach anyway).

So, mentioning rolling is fine of course, as knowing it broadens the horizons, but I'd put at least as much emphasis on prefix sums just due to how practical and useful they are in the context of polynomial hashing. The only concern I have here is that it would probably make more sense to have a dedicated section about prefix sums in the hashing article itself, and link to that from here.

Contributor Author

I'm not sure using prefix sums is even Rabin-Karp anymore. Maybe Rabin-Karp can be grouped in with other sliding-window type techniques.

Member

To some extent it is; e.g. this refers to the prefix-sums-based approach as a variant of Rabin-Karp. Though it is not popular in the literature at all, in a competitive programming context it's common to use "Rabin-Karp algorithm" to refer to the idea of using a polynomial hash for substring matching, rather than to the specific rolling-hash-based variant. This, in turn, was partly popularized by the fact that the original e-maxx.ru article took this approach...

Contributor

Visit the preview URL for this PR (for commit fec16e6):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-20T23:55:51.192703689Z)

jxu added a commit that referenced this pull request Oct 14, 2024
Contributor

Visit the preview URL for this PR (for commit 8833f67):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-21T00:05:36.794629067Z)

jxu added a commit that referenced this pull request Oct 14, 2024
Contributor

Visit the preview URL for this PR (for commit 18daf94):

https://cp-algorithms--preview-1356-obqsdag6.web.app

(expires 2024-10-21T00:14:58.619836763Z)

@jxu jxu marked this pull request as draft October 14, 2024 00:33
@adamant-pwn adamant-pwn deleted the branch main October 14, 2024 18:53
@adamant-pwn adamant-pwn reopened this Oct 14, 2024
@adamant-pwn adamant-pwn changed the base branch from master to main October 14, 2024 19:16
Contributor

github-actions bot commented Oct 14, 2024

Preview the changes for PR #1356 (1c129a7) here: https://cp-algorithms.github.io/cp-algorithms/1356/

@jxu jxu closed this Oct 24, 2024
@jxu jxu deleted the jxu-patch-1 branch October 24, 2024 15:02
github-actions bot added a commit that referenced this pull request Oct 24, 2024
Successfully merging this pull request may close these issues.

Rabin-Karp for String Matching: Mention rolling hash
2 participants