UNIT - V
Pattern Matching and Tries: Pattern matching algorithms (brute force, the Boyer-Moore
algorithm, the Knuth-Morris-Pratt algorithm), Standard Tries, Compressed Tries, Suffix Tries.
Pattern Matching
The simplest string matching technique is the one almost anyone would think of first.
Start with the first letter of the text and the first letter of the pattern and check
whether the two letters are equal. If they are, compare the second letters of the text
and the pattern, and so on. If they are not, slide the pattern one position so that its
first letter lines up with the second letter of the text, and compare again.
The Brute Force string matching algorithm works exactly like this, which is why it is
also called the Naive string matching algorithm.
do
    if (text letter == pattern letter)
        compare next letter of pattern to next letter of text
    else
        move pattern down text by one letter and restart from the first pattern letter
while (entire pattern not found and not end of text)
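The loop above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name brute_force_search and the sample strings are assumptions made for this example, and the function reports every occurrence rather than stopping at the first one.

```python
def brute_force_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    matches = []
    # Try every alignment of the pattern against the text.
    for start in range(n - m + 1):
        j = 0
        # Compare letter by letter until a mismatch or the pattern ends.
        while j < m and text[start + j] == pattern[j]:
            j += 1
        if j == m:
            matches.append(start)
        # After a mismatch (or a match) the pattern moves by one position.
    return matches


# Sample strings assumed for illustration.
print(brute_force_search("THIS IS A SIMPLE EXAMPLE", "SIMPLE"))  # [10]
```

No preprocessing is done, so the running time is entirely matching time: up to m comparisons at each of the n - m + 1 alignments.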
In the figure above, red boxes mark pattern letters that mismatch the text, and green
boxes mark pattern letters that match the text. According to the figure:
In the first row we check whether the first letter of the pattern matches the first letter
of the text. It does not, because "S" is the first letter of the pattern and "T" is the first
letter of the text. We therefore move the pattern by one position, as shown in the second row.
Next we compare the first letter of the pattern with the second letter of the text. This is
also a mismatch, so we continue checking and moving. In the fourth row the first letter of
the pattern matches the text. We then do not move the pattern; instead we advance to the
next letter of the pattern. The pattern is moved by one position only when a mismatch is
found. In the last row, every letter of the pattern matches a run of consecutive letters
in the text.
Example 2
Worst Case
Best case
Advantages
1. A very simple technique that requires no preprocessing, so the total running time
is the same as the matching time.
Disadvantages
1. Very inefficient, because the pattern moves by only one position after each
mismatch.
The Boyer-Moore Algorithm
The Boyer-Moore (B-M) algorithm takes a backward approach: the pattern string (P) is
aligned with the start of the text string (T), and the characters of the pattern are then
compared from right to left, beginning with the rightmost character.
If the text character being compared does not occur anywhere in the pattern, no match
can be found by comparing any further characters at this position, so the pattern can be
shifted completely past the mismatching character.
To determine the possible shifts, the B-M algorithm uses two preprocessing strategies
simultaneously. Whenever a mismatch occurs, the algorithm computes a shift with both
strategies and selects the longer one, so it uses the most efficient strategy for each
individual case.
NOTE: The Boyer-Moore algorithm starts matching from the last character of the pattern.
The two strategies are called the heuristics of B-M because they are used to reduce the
search. They are:
1. Bad character heuristic
2. Good suffix heuristic
The idea of the bad character heuristic is simple. The character of the text that does
not match the current character of the pattern is called the bad character. Upon a
mismatch, we shift the pattern until either:
1) The mismatch becomes a match, or
2) Pattern P moves past the mismatched character.
Case 1: the mismatch becomes a match
We look up the position of the last occurrence of the mismatching character in the
pattern. If the character occurs in the pattern, we shift the pattern so that this
occurrence is aligned with the mismatching character in the text.
Case 2: the pattern moves past the mismatched character
If the mismatching character does not occur in the pattern at all, we shift the pattern
completely past it.
This means we need some extra information to produce a shift on encountering a bad
character: the last position of every character in the pattern, and the set of
characters used in the pattern.
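As a rough sketch of how the bad character information can be stored and used, the Python below keeps the last position of each pattern character in a dictionary. The names bad_character_table and bad_character_search are assumptions for this example, and only the bad character heuristic is applied (no good suffix rule).

```python
def bad_character_table(pattern):
    """Map each character of the pattern to the index of its last occurrence."""
    return {ch: i for i, ch in enumerate(pattern)}


def bad_character_search(text, pattern):
    """Boyer-Moore style search using only the bad character heuristic."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    last = bad_character_table(pattern)
    matches = []
    s = 0  # current shift of the pattern relative to the text
    while s <= n - m:
        j = m - 1
        # Compare from right to left.
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += 1  # conservative shift after a full match
        else:
            # Align the last occurrence of the bad character with the text,
            # or jump past it when it is absent (treated as index -1).
            # max(1, ...) guards against a zero or negative shift.
            s += max(1, j - last.get(text[s + j], -1))
    return matches


print(bad_character_search("ABAAABCD", "ABC"))  # [4]
```

The dictionary keys double as the set of characters used in the pattern, so a failed lookup is exactly the "character does not occur" case.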
In the good suffix heuristic, let t be the substring of text T that matched a suffix of
pattern P before the mismatch. We shift the pattern until one of three cases holds.
Case 1: another occurrence of t in P is aligned with t in T
Pattern P might contain more occurrences of t. In that case, we shift the pattern to
align such an occurrence with t in text T. For example:
Explanation: In the above example, a substring t of text T matched pattern P (in green)
before the mismatch at index 2. We now search for another occurrence of t ("AB") in P.
We find one starting at position 1 (yellow background), so we shift the pattern right by
2 positions to align t in P with t in T. This is the weak rule of the original
Boyer-Moore algorithm.
Case 2: a prefix of P matches a suffix of t
We will not always find another occurrence of t in P; sometimes there is none at all.
In such cases we can sometimes find a suffix of t that matches a prefix of P, and align
them by shifting P. For example:
Explanation: In the above example, t ("BAB") matched P (in green) at index 2-4 before
the mismatch. Because there is no other occurrence of t in P, we search for a prefix of
P that matches a suffix of t. We find the prefix "AB" (yellow background) starting at
index 0, which matches not the whole of t but its suffix "AB" starting at index 3. So we
shift the pattern by 3 positions to align the prefix with the suffix.
Case 3: P moves past t
If neither of the above two cases applies, we shift the pattern past t. For example:
As part of preprocessing, an array shift is created. Each entry shift[i] contains the
distance the pattern will shift if a mismatch occurs at position i-1, that is, when the
suffix of the pattern starting at position i has matched and a mismatch occurs at
position i-1. Preprocessing is done separately for the strong good suffix rule and for
case 2 discussed above.
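One common way to fill the shift array is sketched below in Python. It follows the widely used border-position formulation; the helper array bpos and the function names good_suffix_shifts and good_suffix_search are assumptions for this example, not names taken from these notes. The search uses only the good suffix heuristic.

```python
def good_suffix_shifts(pattern):
    """Build shift[], where shift[i] applies after a mismatch at i - 1."""
    m = len(pattern)
    shift = [0] * (m + 1)
    bpos = [0] * (m + 1)  # bpos[i]: start of the widest border of pattern[i:]

    # Strong good suffix rule.
    i, j = m, m + 1
    bpos[i] = j
    while i > 0:
        # Try to extend the border to the left; on failure record a shift
        # and fall back to the next shorter border.
        while j <= m and pattern[i - 1] != pattern[j - 1]:
            if shift[j] == 0:
                shift[j] = j - i
            j = bpos[j]
        i -= 1
        j -= 1
        bpos[i] = j

    # Case 2: a prefix of the pattern matches a suffix of t.
    j = bpos[0]
    for i in range(m + 1):
        if shift[i] == 0:
            shift[i] = j
        if i == j:
            j = bpos[j]
    return shift


def good_suffix_search(text, pattern):
    """Search using only the good suffix heuristic."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    shift = good_suffix_shifts(pattern)
    matches = []
    s = 0
    while s <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == text[s + j]:
            j -= 1
        if j < 0:
            matches.append(s)
            s += shift[0]
        else:
            # Mismatch at j: the suffix starting at j + 1 had matched.
            s += shift[j + 1]
    return matches


print(good_suffix_shifts("ABA"))                     # [2, 2, 2, 1]
print(good_suffix_search("ABAAAABAACD", "ABA"))      # [0, 5]
```

A full Boyer-Moore implementation would take the larger of this shift and the bad character shift at each mismatch.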
This algorithm takes O(mn) time in the worst case and O(n log(m)/m) on average,
which is sublinear in the sense that not all characters are inspected.
Applications
This algorithm is highly useful in tasks such as recursively searching files for virus
patterns, searching databases for keys or data, text and word processing, and any other
task that requires handling large amounts of data at very high speed.
The Knuth-Morris-Pratt (KMP) Algorithm
The Naive pattern searching algorithm does not work well in cases where many matching
characters are followed by a mismatching character. Following are some examples:
pat[] = "AAAAB"
txt[] = "ABABABCABABABCABABABC"
pat[] = "ABABAC" (not a worst case, but a bad case for Naive)
The KMP algorithm is one of the most popular pattern matching algorithms. KMP stands
for Knuth-Morris-Pratt. The algorithm was conceived by Donald Knuth and Vaughan
Pratt together, and independently by James H. Morris, in 1970. In 1977 all three
jointly published it.
KMP was the first linear-time algorithm for string matching. It is used to find a
Pattern in a Text.
KMP first builds an LPS table (Longest proper Prefix which is also a Suffix, also called
the prefix table) for the pattern, using the following steps:
• Step 1 - Define a one-dimensional array LPS[size] whose size equals the length of the
Pattern.
• Step 2 - Define variables i and j. Set i = 0, j = 1 and LPS[0] = 0.
• Step 3 - Compare the characters at Pattern[i] and Pattern[j].
• Step 4 - If they match, set LPS[j] = i+1 and increment both i and j by one. Go to
Step 3.
• Step 5 - If they do not match, check the value of i. If it is 0, set LPS[j] = 0 and
increment j by one; if it is not 0, set i = LPS[i-1]. Go to Step 3.
• Step 6 - Repeat the above steps until all the values of LPS[] are filled.
Let us use the above steps to create the prefix table for a pattern...
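Steps 1 to 6 translate almost line for line into Python. The sketch below is illustrative only; the function name build_lps is an assumption, and the two sample patterns are the ones listed earlier in this section.

```python
def build_lps(pattern):
    """Build the LPS (prefix) table following Steps 1-6."""
    m = len(pattern)
    lps = [0] * m            # Step 1 (LPS[0] = 0 from Step 2)
    i, j = 0, 1              # Step 2
    while j < m:             # Step 6: repeat until the table is filled
        if pattern[i] == pattern[j]:   # Steps 3 and 4
            lps[j] = i + 1
            i += 1
            j += 1
        elif i == 0:         # Step 5, i is 0
            lps[j] = 0
            j += 1
        else:                # Step 5, i is not 0
            i = lps[i - 1]
    return lps


print(build_lps("ABABAC"))  # [0, 0, 1, 2, 3, 0]
print(build_lps("AAAAB"))   # [0, 1, 2, 3, 0]
```

Each entry LPS[j] is the length of the longest proper prefix of Pattern[0..j] that is also a suffix of it.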
We use the LPS table to decide how many characters can be skipped when a mismatch
occurs.
When a mismatch occurs, check the LPS value of the pattern character just before the
mismatched character. If it is 0, start comparing the first character of the pattern
with the mismatched character in the text (if the mismatch is at the first character of
the pattern itself, move on to the next character of the text). If it is not 0, resume
comparing at the pattern character whose index equals that LPS value, against the
mismatched character in the text.
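The matching rule can be sketched in Python as follows. This is a minimal illustration under assumed names (build_lps, kmp_search); the LPS table is rebuilt here in a compact form so that the block runs on its own, and the first sample text is assumed for illustration.

```python
def build_lps(pattern):
    """LPS table: lps[j] is the longest proper prefix of pattern[:j+1]
    that is also a suffix of it."""
    lps = [0] * len(pattern)
    length = 0
    for j in range(1, len(pattern)):
        while length > 0 and pattern[j] != pattern[length]:
            length = lps[length - 1]
        if pattern[j] == pattern[length]:
            length += 1
        lps[j] = length
    return lps


def kmp_search(text, pattern):
    """Return the start index of every occurrence of pattern in text."""
    if not pattern:
        return []
    lps = build_lps(pattern)
    matches = []
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, fall back using the LPS value of the previous
        # pattern character; the text index i never moves backwards.
        while j > 0 and ch != pattern[j]:
            j = lps[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            matches.append(i - j + 1)
            j = lps[j - 1]
    return matches


print(kmp_search("ABABDABACDABABCABAB", "ABABCABAB"))   # [10]
print(kmp_search("ABABABCABABABCABABABC", "ABABAC"))    # []
```

Because the text index only advances, the matching phase inspects each text character a bounded number of times, which is where the linear running time comes from.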
EXAMPLE 1
Example 2
• KMP does not work as well when the size of the alphabet increases, because there are
more chances of a mismatch.
A trie is an efficient information retrieval data structure. The term trie comes from
the word retrieval.
Definition of a Trie
Properties of Tries
EXAMPLE
Using a trie, search complexity can be brought down to an optimal limit (the key length).
Given multiple strings, the task is to insert them into a trie.
Examples:
        root
       /    \
      c      t
      |      |
      a      h
      | \    |
      l  t   e
      |      | \
      l      i  r
      | \    |  |
      e  i   r  e
      |  |
      r  n

        root
       / | \
      l  n  t
      |  |
      l  d
      | \ |
      e  i y
      |  |
      r  n
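Insertion into a trie can be sketched in Python as below. The class names TrieNode and Trie, the is_end flag, and the sample words are all assumptions made for this illustration; the words are not taken from the original examples.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to a child TrieNode
        self.is_end = False   # True if some key ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key):
        """Walk down from the root, creating missing nodes along the way."""
        node = self.root
        for ch in key:
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
        node.is_end = True

    def search(self, key):
        """Return True only if key was inserted as a whole word."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end


trie = Trie()
for word in ["cat", "caller", "calling", "there", "their"]:  # assumed words
    trie.insert(word)
print(trie.search("cat"), trie.search("ca"))  # True False
```

Both operations visit one node per character, so their cost depends on the key length and not on how many keys are stored.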
Trie deletion
When deleting a key from a trie, four cases arise:
1. The key may not be in the trie. The delete operation should not modify the trie.
2. The key is present as a unique key (no part of the key contains another key as a
prefix, and the key itself is not a prefix of another key in the trie). Delete all of
its nodes.
3. The key is a prefix of another, longer key in the trie. Unmark its leaf node.
4. The key is present in the trie and has at least one other key as its prefix. Delete
nodes from the end of the key back to the leaf node of the longest prefix key.
Time Complexity: The time complexity of the deletion operation is O(n), where n is
the key length.
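The four cases can be handled by one recursive routine, sketched here in Python. The names TrieNode, insert, contains and delete are assumptions for this example; the sample keys a, an, and, any are the ones used for standard tries later in these notes.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_end = False


def insert(root, key):
    node = root
    for ch in key:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_end = True


def contains(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_end


def delete(node, key, depth=0):
    """Delete key below node; return True if node may be pruned by its parent."""
    if depth == len(key):
        if not node.is_end:
            return False            # case 1: key absent, trie unchanged
        node.is_end = False         # case 3: unmark the end of the key
        return not node.children    # cases 2 and 4: prune if nothing hangs below
    child = node.children.get(key[depth])
    if child is None:
        return False                # case 1: key absent, trie unchanged
    if delete(child, key, depth + 1):
        del node.children[key[depth]]
        # Stop pruning at a node that ends another key or has other children.
        return not node.is_end and not node.children
    return False


root = TrieNode()
for word in ["a", "an", "and", "any"]:
    insert(root, word)
delete(root, "and")
print(contains(root, "and"), contains(root, "an"))  # False True
```

The recursion descends once along the key and unwinds once, which matches the O(n) bound in the key length.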
A trie is a tree that stores strings. The maximum number of children of a node is
equal to the size of the alphabet. A trie supports search, insert and delete operations
in O(L) time, where L is the length of the key.
Hashing: In hashing, we convert the key to a small value, and that value is used to
index the data. Hashing supports search, insert and delete operations in O(L) time on
average.
Self-Balancing BST: The time complexity of search, insert and delete operations in a
self-balancing Binary Search Tree (BST) (such as a Red-Black Tree, AVL Tree or Splay
Tree) is O(L * log n), where n is the total number of words and L is the length of the
word. The advantage of self-balancing BSTs is that they maintain order, which makes
operations like minimum, maximum, closest (floor or ceiling) and kth largest faster.
Why Trie?
1. With a trie, we can insert and find strings in O(L) time, where L represents the length
of a single word. This is faster than the O(L * log n) of a BST. It can also be faster
than hashing because of the way it is implemented: we do not need to compute any hash
function, and no collision handling is required (as in open addressing and separate
chaining).
2. Another advantage of a trie is that we can easily print all words in alphabetical
order, which is not easily possible with hashing.
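The alphabetical-order property in point 2 can be seen with a small Python sketch. It uses nested dictionaries as trie nodes and "$" as an assumed end-of-word marker (so keys must not contain "$"); the function names and sample words are also assumptions for this example.

```python
def build_trie(words):
    """Build a trie of nested dicts; "$" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = {}
    return root


def words_in_order(node, prefix=""):
    """Collect all stored words by visiting children in sorted order."""
    result = []
    if "$" in node:
        result.append(prefix)   # a word ends here; emit it before longer ones
    for ch in sorted(k for k in node if k != "$"):
        result.extend(words_in_order(node[ch], prefix + ch))
    return result


trie = build_trie(["there", "cat", "their", "caller", "calling"])
print(words_in_order(trie))
# ['caller', 'calling', 'cat', 'their', 'there']
```

A hash table has no comparable traversal: its keys would have to be collected and sorted separately.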
APPLICATIONS OF TRIES
String handling and processing is one of the most important topics for programmers.
Many real-time applications are based on string processing.
A data structure that is very important for string handling is the trie, which is based
on the prefixes of strings.
TYPES OF TRIES
1. Standard Tries
2. Compressed Tries
3. Suffix Tries
STANDARD TRIES
Strings={ a,an,and,any}
Handling Keys(strings)
Example:
EXAMPLE
COMPRESSED TRIE
6. While performing an insertion, it may be necessary to un-group characters that are
already grouped.
7. While performing a deletion, it may be necessary to re-group characters that are
already grouped.
A compressed trie can be stored in O(s) space, where s = |S|, by using O(1)-space index
ranges at the nodes.
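The grouping of characters can be illustrated by compressing a standard trie after it is built. The Python below is only a sketch under assumptions: nodes are nested dictionaries, "$" is an assumed end-of-key marker, the function names and sample words are made up for this example, and edge labels are stored as strings rather than as the index ranges mentioned above.

```python
END = "$"  # assumed end-of-key marker; must not occur inside keys


def build_trie(words):
    """Standard trie: one character per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def compress(node):
    """Return a compressed copy in which each chain of single-child
    nodes is merged into one edge labelled with the grouped characters."""
    compressed = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge
        # and no key ends at it.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


words = ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]
ctrie = compress(build_trie(words))
print(sorted(ctrie))            # ['b', 's']
print(sorted(ctrie["b"]))       # ['e', 'id', 'u']
print(sorted(ctrie["s"]["to"])) # ['ck', 'p']
```

Inserting a new key such as "bit" into the compressed form would require splitting the grouped edge "id" again, which is the un-grouping described in point 6.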
SUFFIX TRIES
1. A suffix trie is a compressed trie for all the suffixes of a text.
2. Suffix tries are a space-efficient data structure for storing a string that allows many
kinds of queries to be answered quickly.
Example
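A small Python sketch of the idea follows. For simplicity it builds an uncompressed trie of all suffixes, which takes quadratic space in the worst case, so it only illustrates the kind of query a suffix trie answers. The function names and the sample text "minimize" are assumptions for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def is_substring(root, query):
    """A string occurs in the text exactly when it is a prefix of some
    suffix, that is, when it spells a path starting at the root."""
    node = root
    for ch in query:
        if ch not in node:
            return False
        node = node[ch]
    return True


suffix_trie = build_suffix_trie("minimize")
print(is_substring(suffix_trie, "nimi"))  # True
print(is_substring(suffix_trie, "zim"))   # False
```

A substring query takes time proportional to the length of the query, independent of the length of the text.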
Note on Boyer-Moore good suffix preprocessing: the preprocessing relies on the idea of
a border. A border is a substring that is both a proper suffix and a proper prefix of a
string. For example, in the string "ccacc", "c" is a border and "cc" is a border because
each appears at both ends of the string, but "cca" is not a border.
B) ASSIGNMENT QUESTIONS
List the data structures which are used in RDBMS, the Network Data Model,
and the Hierarchical Data Model.
What is a Stack?
List the areas of application where the stack data structure can be used.
d) Objectives
1. Cataloging is the process of creating ______ database of the library resources.
2. A programme is a set of ______ which are made to perform a well-designed task.
3. The open source software is a software for which ______ is open.
4. Local area network (LAN) connects computers and devices spread in an area of ______.
5. The switching technique provides path for data movement on network from
______ to destination.
6. Name the characteristic on the basis of which E-resources can be categorized.
a) Content and Accessibility
b) Print and Analogue
c) Online and print
d) Accessibility and dispatch
7. What does PDF stand for?
a) Printable defined format
b) Portable document format
c) Printable document file
d) Principal document format
8. Finding out all the relevant items on the stated topic is known as
a) High precision search
b) High recall search
c) Brief search
d) None of the above