Fixed spelling mistake in suffix-automaton.md #1391

Merged: 3 commits, Nov 8, 2024
30 changes: 15 additions & 15 deletions src/string/suffix-automaton.md
@@ -221,11 +221,11 @@ Let us describe this process:
(Initially we set $last = 0$, and we will change $last$ in the last step of the algorithm accordingly.)
- Create a new state $cur$, and assign it with $len(cur) = len(last) + 1$.
The value $link(cur)$ is not known at the time.
- - Now we to the following procedure:
+ - Now we do the following procedure:
We start at the state $last$.
While there isn't a transition through the letter $c$, we will add a transition to the state $cur$, and follow the suffix link.
If at some point there already exists a transition through the letter $c$, then we will stop and denote this state with $p$.
- - If it haven't found such a state $p$, then we reached the fictitious state $-1$, then we can just assign $link(cur) = 0$ and leave.
+ - If we haven't found such a state $p$, then we reached the fictitious state $-1$, then we can just assign $link(cur) = 0$ and leave.
- Suppose now that we have found a state $p$, from which there exists a transition through the letter $c$.
We will denote the state, to which the transition leads, with $q$.
- Now we have two cases. Either $len(p) + 1 = len(q)$, or not.
@@ -241,7 +241,7 @@ Let us describe this process:

- In any of the three cases, after completing the procedure, we update the value $last$ with the state $cur$.

- If we also want to know which states are **terminal** and which are not, the we can find all terminal states after constructing the complete suffix automaton for the entire string $s$.
+ If we also want to know which states are **terminal** and which are not, we can find all terminal states after constructing the complete suffix automaton for the entire string $s$.
To do this, we take the state corresponding to the entire string (stored in the variable $last$), and follow its suffix links until we reach the initial state.
We will mark all visited states as terminal.
It is easy to understand that by doing so we will mark exactly the states corresponding to all the suffixes of the string $s$, which are exactly the terminal states.
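
The terminal-state marking described in this hunk can be sketched in C++ as follows. This is a minimal illustration, assuming a map-based automaton built with the extension procedure from the article; the names `SuffixAutomaton`, `terminal_states`, and `is_suffix` are hypothetical and not taken from the article's implementation.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative state layout (an assumption, not the article's exact struct).
struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

struct SuffixAutomaton {
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.emplace_back();                 // initial state t_0
        for (char c : s) extend(c);
    }

    // One step of the online construction: append character c.
    void extend(char c) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                State copy = st[q];        // clone q: same transitions and link
                copy.len = st[p].len + 1;
                int clone = static_cast<int>(st.size());
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// Follow suffix links from `last` and mark every visited state as terminal.
// The initial state is marked as well (it corresponds to the empty suffix).
std::vector<bool> terminal_states(const SuffixAutomaton& sa) {
    std::vector<bool> term(sa.st.size(), false);
    for (int v = sa.last; v != -1; v = sa.st[v].link) term[v] = true;
    return term;
}

// A pattern is a suffix of s iff walking it from t_0 ends in a terminal state.
bool is_suffix(const SuffixAutomaton& sa, const std::vector<bool>& term,
               const std::string& p) {
    int v = 0;
    for (char c : p) {
        auto it = sa.st[v].next.find(c);
        if (it == sa.st[v].next.end()) return false;
        v = it->second;
    }
    return term[v];
}
```

The walk over suffix links visits at most one state per suffix length class, so the marking costs no more than the number of states.
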
@@ -280,7 +280,7 @@ The linearity of the number of transitions, and in general the linearity of the

- In the second case we came across an existing transition $(p, q)$.
This means that we tried to add a string $x + c$ (where $x$ is a suffix of $s$) to the machine that **already exists** in the machine (the string $x + c$ already appears as a substring of $s$).
- Since we assume that the automaton for the string $s$ is build correctly, we should not add a new transition here.
+ Since we assume that the automaton for the string $s$ is built correctly, we should not add a new transition here.

However there is a difficulty.
To which state should the suffix link from the state $cur$ lead?
@@ -324,7 +324,7 @@ If we consider all parts of the algorithm, then it contains three places in the
- The second place is the copying of transitions when the state $q$ is cloned into a new state $clone$.
- Third place is changing the transition leading to $q$, redirecting them to $clone$.

- We use the fact that the size of the suffix automaton (both in number of states and in the number of transitions) is **linear**.
+ We use the fact that the size of the suffix automaton (both in the number of states and in the number of transitions) is **linear**.
(The proof of the linearity of the number of states is the algorithm itself, and the proof of linearity of the number of states is given below, after the implementation of the algorithm).

Thus the total complexity of the **first and second places** is obvious, after all each operation adds only one amortized new transition to the automaton.
@@ -334,7 +334,7 @@ We denote $v = longest(p)$.
This is a suffix of the string $s$, and with each iteration its length decreases - and therefore the position $v$ as the suffix of the string $s$ increases monotonically with each iteration.
In this case, if before the first iteration of the loop, the corresponding string $v$ was at the depth $k$ ($k \ge 2$) from $last$ (by counting the depth as the number of suffix links), then after the last iteration the string $v + c$ will be a $2$-th suffix link on the path from $cur$ (which will become the new value $last$).

- Thus, each iteration of this loop leads to the fact that the position of the string $longest(link(link(last))$ as suffix of the current string will monotonically increase.
+ Thus, each iteration of this loop leads to the fact that the position of the string $longest(link(link(last))$ as a suffix of the current string will monotonically increase.
Therefore this cycle cannot be executed more than $n$ iterations, which was required to prove.

### Implementation
@@ -444,7 +444,7 @@ Let the current non-continuous transition be $(p, q)$ with the character $c$.
We take the correspondent string $u + c + w$, where the string $u$ corresponds to the longest path from the initial state to $p$, and $w$ to the longest path from $q$ to any terminal state.
On one hand, each such string $u + c + w$ for each incomplete strings will be different (since the strings $u$ and $w$ are formed only by complete transitions).
On the other hand each such string $u + c + w$, by the definition of the terminal states, will be a suffix of the entire string $s$.
- Since there are only $n$ non-empty suffixes of $s$, and non of the strings $u + c + w$ can contain $s$ (because the entire string only contains complete transitions), the total number of incomplete transitions does not exceed $n - 1$.
+ Since there are only $n$ non-empty suffixes of $s$, and none of the strings $u + c + w$ can contain $s$ (because the entire string only contains complete transitions), the total number of incomplete transitions does not exceed $n - 1$.

Combining these two estimates gives us the bound $3n - 3$.
However, since the maximum number of states can only be achieved with the test case $\text{"abbb\dots bbb"}$ and this case has clearly less than $3n - 3$ transitions, we get the tighter bound of $3n - 4$ for the number of transitions in a suffix automaton.
@@ -460,7 +460,7 @@ For the simplicity we assume that the alphabet size $k$ is constant, which allow

### Check for occurrence

- Given a text $T$, and multiple patters $P$.
+ Given a text $T$, and multiple patterns $P$.
We have to check whether or not the strings $P$ appear as a substring of $T$.

We build a suffix automaton of the text $T$ in $O(length(T))$ time.
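
As a rough illustration of the occurrence check this section describes, the sketch below builds the automaton for $T$ once and then answers each query by walking the pattern from the initial state. It assumes a map-based automaton; `SuffixAutomaton` and `contains` are illustrative names, not the article's API.

```cpp
#include <map>
#include <string>
#include <vector>

// Illustrative state layout (an assumption, not the article's exact struct).
struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

struct SuffixAutomaton {
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.emplace_back();                 // initial state t_0
        for (char c : s) extend(c);
    }

    void extend(char c) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                State copy = st[q];        // clone q
                copy.len = st[p].len + 1;
                int clone = static_cast<int>(st.size());
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// P occurs in T iff every character of P has a transition along the walk.
// Each query costs O(length(P)) transitions (times the map lookup cost).
bool contains(const SuffixAutomaton& sa, const std::string& pattern) {
    int v = 0;
    for (char c : pattern) {
        auto it = sa.st[v].next.find(c);
        if (it == sa.st[v].next.end()) return false;
        v = it->second;
    }
    return true;
}
```

The empty pattern is reported as present, since the walk never leaves the initial state.
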
@@ -525,7 +525,7 @@ The value $ans[v]$ can be computed using the recursion:

$$ans[v] = \sum_{w : (v, w, c) \in DAWG} d[w] + ans[w]$$

- We take the answer of each adjacent vertex $w$, and add to it $d[w]$ (since every substrings is one character longer when starting from the state $v$).
+ We take the answer of each adjacent vertex $w$, and add to it $d[w]$ (since every substring is one character longer when starting from the state $v$).

Again this task can be computed in $O(length(S))$ time.

@@ -547,15 +547,15 @@ long long get_tot_len_diff_substings() {
}
```

- This approaches runs in $O(length(S))$ time, but experimentally runs 20x faster than the memoized dynamic programming version on randomized strings. It requires no extra space and no recursion.
+ This approach runs in $O(length(S))$ time, but experimentally runs 20x faster than the memoized dynamic programming version on randomized strings. It requires no extra space and no recursion.

### Lexicographically $k$-th substring {data-toc-label="Lexicographically k-th substring"}

Given a string $S$.
We have to answer multiple queries.
For each given number $K_i$ we have to find the $K_i$-th string in the lexicographically ordered list of all substrings.

- The solution of this problem is based on the idea of the previous two problems.
+ The solution to this problem is based on the idea of the previous two problems.
The lexicographically $k$-th substring corresponds to the lexicographically $k$-th path in the suffix automaton.
Therefore after counting the number of paths from each state, we can easily search for the $k$-th path starting from the root of the automaton.
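
One possible way to realize this search is sketched below. It assumes a map-based automaton (so transitions are iterated in increasing character order) and counts paths with $d[v] = 1 + \sum d[w]$, where the $1$ stands for the empty path. The names `count_paths` and `kth_substring` are hypothetical, and $k$ is taken to be 1-based over the distinct non-empty substrings.

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Illustrative state layout (an assumption, not the article's exact struct).
struct State {
    int len = 0, link = -1;
    std::map<char, int> next;
};

struct SuffixAutomaton {
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.emplace_back();
        for (char c : s) extend(c);
    }

    void extend(char c) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                State copy = st[q];
                copy.len = st[p].len + 1;
                int clone = static_cast<int>(st.size());
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }
};

// d[v] = number of paths starting at v, including the empty path.
// Every transition leads to a state with larger len, so processing states
// by decreasing len visits all targets before their sources.
std::vector<long long> count_paths(const SuffixAutomaton& sa) {
    int n = static_cast<int>(sa.st.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return sa.st[a].len > sa.st[b].len; });
    std::vector<long long> d(n, 1);
    for (int v : order)
        for (const auto& e : sa.st[v].next) d[v] += d[e.second];
    return d;
}

// Returns the k-th (1-based) distinct substring in lexicographic order,
// or an empty string if k is out of range.
std::string kth_substring(const SuffixAutomaton& sa,
                          const std::vector<long long>& d, long long k) {
    if (k < 1 || k > d[0] - 1) return "";
    std::string res;
    int v = 0;
    while (k > 0) {
        for (const auto& e : sa.st[v].next) {
            if (k <= d[e.second]) {   // the answer starts with this character
                res += e.first;
                v = e.second;
                --k;                  // consume the path that ends right here
                break;
            }
            k -= d[e.second];         // skip all paths through this transition
        }
    }
    return res;
}
```

For example, the distinct substrings of `"aba"` in sorted order are `a`, `ab`, `aba`, `b`, `ba`, so `kth_substring` with $k = 4$ yields `b`.
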

@@ -577,7 +577,7 @@ Total time complexity is $O(length(S))$.

For a given text $T$.
We have to answer multiple queries.
- For each given pattern $P$ we have to find out how many times the string $P$ appears in the string $T$ as substring.
+ For each given pattern $P$ we have to find out how many times the string $P$ appears in the string $T$ as a substring.

We construct the suffix automaton for the text $T$.

@@ -603,7 +603,7 @@ Therefore initially we have $cnt = 1$ for each such state, and $cnt = 0$ for all
Then we apply the following operation for each $v$: $cnt[link(v)] \text{ += } cnt[v]$.
The meaning behind this is, that if a string $v$ appears $cnt[v]$ times, then also all its suffixes appear at the exact same end positions, therefore also $cnt[v]$ times.

- Why don't we overcount in this procedure (i.e. don't count some position twice)?
+ Why don't we overcount in this procedure (i.e. don't count some positions twice)?
Because we add the positions of a state to only one other state, so it can not happen that one state directs its positions to another state twice in two different ways.

Thus we can compute the quantities $cnt$ for all states in the automaton in $O(length(T))$ time.
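
The $cnt$ propagation in this hunk could look roughly like the sketch below. It assumes that $cnt$ is stored in the state, set to $1$ for states created directly during extension and to $0$ for clones and the initial state, and that states are processed by decreasing $len$. The names `compute_cnt` and `count_occurrences` are illustrative only.

```cpp
#include <algorithm>
#include <map>
#include <numeric>
#include <string>
#include <vector>

// Illustrative state layout with an occurrence counter (an assumption).
struct State {
    int len = 0, link = -1;
    long long cnt = 0;
    std::map<char, int> next;
};

struct SuffixAutomaton {
    std::vector<State> st;
    int last = 0;

    explicit SuffixAutomaton(const std::string& s) {
        st.emplace_back();                 // initial state: cnt = 0
        for (char c : s) extend(c);
        compute_cnt();
    }

    void extend(char c) {
        int cur = static_cast<int>(st.size());
        st.emplace_back();
        st[cur].len = st[last].len + 1;
        st[cur].cnt = 1;                   // created directly, not cloned
        int p = last;
        while (p != -1 && !st[p].next.count(c)) {
            st[p].next[c] = cur;
            p = st[p].link;
        }
        if (p == -1) {
            st[cur].link = 0;
        } else {
            int q = st[p].next[c];
            if (st[p].len + 1 == st[q].len) {
                st[cur].link = q;
            } else {
                State copy = st[q];
                copy.len = st[p].len + 1;
                copy.cnt = 0;              // clones start with cnt = 0
                int clone = static_cast<int>(st.size());
                st.push_back(copy);
                while (p != -1 && st[p].next[c] == q) {
                    st[p].next[c] = clone;
                    p = st[p].link;
                }
                st[q].link = st[cur].link = clone;
            }
        }
        last = cur;
    }

    // cnt[link(v)] += cnt[v], visiting states by decreasing len so that
    // every state is final before it is added to its suffix link.
    void compute_cnt() {
        std::vector<int> order(st.size());
        std::iota(order.begin(), order.end(), 0);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return st[a].len > st[b].len; });
        for (int v : order)
            if (st[v].link != -1) st[st[v].link].cnt += st[v].cnt;
    }
};

// Number of occurrences of a non-empty pattern in the text (0 if absent).
long long count_occurrences(const SuffixAutomaton& sa, const std::string& p) {
    int v = 0;
    for (char c : p) {
        auto it = sa.st[v].next.find(c);
        if (it == sa.st[v].next.end()) return 0;
        v = it->second;
    }
    return sa.st[v].cnt;
}
```

Sorting by $len$ here costs $O(n \log n)$; a counting sort over $len$ would keep the preprocessing linear, which is omitted to keep the sketch short.
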
@@ -690,7 +690,7 @@ void output_all_occurrences(int v, int P_length) {
### Shortest non-appearing string

Given a string $S$ and a certain alphabet.
- We have to find a string of smallest length, that doesn't appear in $S$.
+ We have to find a string of the smallest length, that doesn't appear in $S$.

We will apply dynamic programming on the suffix automaton built for the string $S$.

@@ -706,7 +706,7 @@ The answer to the problem will be $d[t_0]$, and the actual string can be restore
### Longest common substring of two strings

Given two strings $S$ and $T$.
- We have to find the longest common substring, i.e. such a string $X$ that appears as substring in $S$ and also in $T$.
+ We have to find the longest common substring, i.e. such a string $X$ that appears as a substring in $S$ and also in $T$.

We construct a suffix automaton for the string $S$.
