DS V Unit Notes
DS V Unit Notes
In above red boxes says mismatch letters against letters of the text and green boxes
says match letters against letters of the text. According to the above
In first raw we check whether first letter of the pattern is matched with the first
letter of the text. It is mismatched, because "S" is the first letter of pattern and "T"
is the first letter of text. Then we move the pattern by one position. Shown in
second raw.
Then check first letter of the pattern with the second letter of text. It is also
mismatched. Likewise we continue the checking and moving process. In fourth
raw we can see first letter of the pattern matched with text. Then we do not do any
moving but we increase testing letter of the pattern. We only move the position of
pattern by one when we find mismatches. Also in last raw, we can see all the
letters of the pattern matched with the some letters of the text continuously.
Running Time Analysis Of Brute Force String Matching Algorithm
let length of the text as n (|text| = n) and length of the pattern as m (|pattern| =
m).
In worst case, in each position we should have to make m comparisons.
Also we should have to make these m comparisons in (n-m+1) positions.
Therefore all the comparisons in total string matching process is {m*(n-m+1)}.
Therefore running time of Brute Force String Matching Algorithm is O(m(n-
m+1)). Simply we say that as O(nm) if n<<<m.
From the above example, |text|= 24. but we do comparisons only for first 16 letters.
Then we can get |text|= 16 and |pattern|=6.
In worst case we should have to compare these 6 letters of the pattern
with 6 letters of the text in each position. Also we can see there are 11 positions for
first 16 letters of text. that is (|text|-|pattern|+1) = 11. Positions we can find using
number of raw's in above example.
Therefore, all the comparisons for above example is {|pattern|*[|text|-|pattern|
+1]}. This means, to match our pattern with first 16 letters of text we should have
to make 6*11 comparisons. But if |text|<<<|pattern|, we simply get all the
comparison by |text|*|pattern|. Therefore running time should be O(|pattern|*|text|).
Advantages
1. Very simple technique and also that does not require any preprocessing. Therefore
total running time is the same as its matching time.
Disadvantages
1. Very inefficient method. Because this method takes only one position movement in
each time.
#include<stdio.h>
#include<string.h>
char t[100],p[50];
void main()
{
int pos;
clrscr();
printf("Enter the Source String ");
scanf("%s",t);
printf("Enter the pattern ");
scanf("%s",p);
pos=brute_force();
if(pos==-1)
printf("%s pattern not found in text",p);
else
printf("%s pattern found at index %d",p,pos);
getch();
}
int brute_force()
{
int n,j,m,i;
n=strlen(t);
m=strlen(p);
for(i=0;i<n;i++)
{
j=0;
while(j<m && t[i+j]==p[j])
{
j++;
if(j==m)
return i+1; //pattern found
}
}
return -1; //pattern not found
}
Boyer Moore Algorithm
Pattern searching is an important problem in computer science. When we do search
for a string in notepad/word file or browser or database, pattern searching
algorithms are used to show the search results. A typical problem statement would
be- Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char
pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n
> m.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10
case 1
int badchar[NO_OF_CHARS];
else
/* Shift the pattern so that the bad character
in text aligns with the last occurrence of
it in pattern. The max function is used to
make sure that we get a positive shift.
We may get a negative shift if the last
occurrence of bad character in pattern
is on the right side of the current
character. */
s += max(1, j - badchar[txt[s+j]]);
}
}
Knuth-Morris-Pratt algorithm
KMP Algorithm is one of the most popular patterns matching algorithms. KMP stands for
Knuth Morris Pratt. KMP algorithm was invented by Donald Knuthand Vaughan
Pratt together and independently by James H Morris in the year 1970. In the year 1977, all
the three jointly published KMP Algorithm.
KMP algorithm was the first linear time complexity algorithm for string matching.
KMP algorithm is one of the string matching algorithms used to find a Pattern in a Text.
KMP algorithm is used to find a "Pattern" in a "Text". This algorithm campares character by
character from left to right. But whenever a mismatch occurs, it uses a preprocessed table
called "Prefix Table" to skip characters comparison while matching. Some times prefix table is also
known as LPS Table. Here LPS stands for "Longest proper Prefix which is also Suffix".
#include<stdio.h>
#include<string.h>
char txt[100],pat[100];
int M ,N ,lps[100],j=0,i=0;
void computeLPSArray()
{
int len = 0, i;
lps[0] = 0;
i = 1;
while(i < M)
{
if(pat[i] == pat[len])
{
len++;
lps[i] = len;
i++;
}
else
{
if( len != 0 )
len = lps[len-1];
else
{
lps[i] = 0;
i++;
}
}
}
}
void KMPSearch()
{
int j=0,i=0;
M = strlen(pat);
N = strlen(txt);
computeLPSArray();
while(i < N)
{
if(pat[j] == txt[i])
{
j++;
i++;
}
if (j == M)
{
printf("Found pattern at index %d \n", i-j);
j = lps[j-1];
}
else if(pat[j] != txt[i])
{
if(j != 0)
j = lps[j-1];
else
i = i+1;
}
}
}
int main()
{
printf("\n ENTER THE TEXT : ");
gets(txt);
printf("\n ENTER THE PATTERN : ");
gets(pat);
KMPSearch();
return 0;
}
output:-
Tries
All the search trees are used to store the collection of numerical values but they are
not suitable for storing the collection of words or strings. Trie is a data structure
which is used to store the collection of strings and makes searching of a pattern in
words more easy. The term trie came from the word retrieval. Trie data structure
makes retrieval of a string from the collection of strings more easily. Trie is also
called as Prefix Tree and some times Digital Tree. A trie is defined as follows...
Trie is a tree like data structure used to store collection of strings.
A trie can also be defined as follows...
Trie is an efficient information storage and retrieval data structure.
The trie data structure provides fast pattern matching for string data values. Using
trie, we bring the search complexity of a string to the optimal limit. A trie searches
a string in O(m) time complexity, where m is the length of the string.
In trie, every node except the root stores a character value. Every node in trie can
have one or a number of children. All the children of a node are alphabetically
ordered. If any two strings have a common prefix then they will have the same
ancestors.
Example
The standard Trie data structure
o Definitions:
Alphabet = a set of characters
o Example:
S = { bear, bell, bid, bull, buy, sell, stock, stop } (s = 8)
Trie:
Note:
There are 8 leaf nodes !!
se a binary tree
This implementation is called a bitwise trie
Implementing a trie using an array of references
o Array implementation:
If the trie is dense and/or alphabet Σ is small (e.g., Σ = {a, b,
c, ..., z}), you can use an array to store the labels
Node structure:
o Example:
Example:
E.g.:
Example:
Then we create nodes for letters are are not in the trie: "ck":
o Psuedo code:
p = root;
for ( i = 0; i < k.length(); i++ )
{
nextChar = k[i];
if ( nextChar ∈ p.char[] )
{
nextLink = link associated with the char entry;
p = nextLink; // Traverse
}
else
{
p = new TrieNode(); // Create new node
/*
------------------------------------------------------
When the while loop ends, we have found or created
the external node for the key "k"
------------------------------------------------------ */
insert v into node p;
}
o When a key (string) is a prefix of another key, the path of the first
key would end in an internal node
Example: at and ate
o Solution:
Add a special termination symbol ◊ to the alphabet Σ
The termination symbol ◊ has the lower value in
the alphabet
Example:
o Note:
We typically use the NUL character '\0' as termination symbol
Compressed tries
o A compressed trie is a trie with one additional rule:
Each internal node has ≥ 2 children
o Redundant node:
An internal node v is redundant if
Node v is not the root node, and
Node v has 1 child node
Example:
Redundant chain of edges:
A chain of edges:
is redundant if:
Example:
Compression algorithm:
Replace:
a redundant chain or edges (v0,v1), (v1,v2), ...,
(vk−1,vk)
by one edge:
(v0 ,vk)
Replace:
the label vk
by the label:
v1 v2 ... vk
Example:
Before compression:
After compression:
What is different with the implementation of a compressed trie
Result:
Another example:
Result:
Properties of the compress trie
Hence:
# internal nodes in a compress trie < s
Total # nodes < 2 × s
Suffix tries
Suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees
help in solving a lot of string related problems like pattern matching, finding
distinct substrings in a given string, finding longest palindrome etc.
Before going to suffix tree, let's first try to understand what a compressed trie is.
Consider the following set of strings:
{"banana","nabd","bcdef","bcfeg","aaaaaa","aaabaa" }
A standard trie for the above set of strings will look like:
And a compressed trie for the given set of strings will look like:
As it might be clear from the images show above, in a compressed trie, edges that
direct to a node having single child are combined together to form a single edge
and their edge labels are concatenated. So this means that each internal node in a
compressed trie has atleast two children. Also it has atmost N leaves, where N is
the number of strings inserted in the compressed trie. Now both the facts: Each
internal node having atleast two children, and that there are N leaves, implies that
there are atmost 2N−1 nodes in the trie. So the space complexity of a compressed
trie is O(N) as compared to the O(N2) of a normal trie.
So that is one reason why to use compressed tries over normal tries.
Before going to construction of suffix trees, there is one more thing that should be
understood, Implicit Suffix Tree. In Implicit suffix trees, there are atmost N leaves,
while in normal one there should be exactly N leaves. The reason for
atmost N leaves is one suffix being prefix of another suffix. Following example
will make it clear. Consider the string "banana"
Implicit Suffix Tree for the above string is shown in image below:
To avoid getting an Implicit Suffix Tree we append a special character that is not
equal to any other character of the string. Suppose we append $ to the given string
then, so the new string is "banana$". Now its suffix tree will be
A Suffix Tree for a given text is a compressed trie for all suffixes of the given
text. We have discussed Standard Trie. Let us understand Compressed Trie with
the following array of words.
{bear, bell, bid, bull, buy, sell, stock, stop}
Following is standard trie for the above input set of words.
Following is the compressed trie. Compress Trie is obtained from standard trie by
joining chains of single nodes. The nodes of a compressed trie can be stored by
storing index ranges at the nodes.
Please note that above steps are just to manually create a Suffix Tree. We will be
discussing actual algorithm and implementation in a separate post.
How to search a pattern in the built suffix tree?
We have discussed above how to build a Suffix Tree which is needed as a
preprocessing step in pattern searching. Following are abstract steps to search a
pattern in the built Suffix Tree.
1) Starting from the first character of the pattern and root of Suffix Tree, do
following for every character.
…..a) For the current character of pattern, if there is an edge from the current node
of suffix tree, follow the edge.
…..b) If there is no edge, print “pattern doesn’t exist in text” and return.
2) If all characters of pattern have been processed, i.e., there is a path from root for
characters of the given pattern, then print “Pattern found”.
Let us consider the example pattern as “nan” to see the searching process.
Following diagram shows the path followed for searching “nan” or “nana”.
How does this work?
Every pattern that is present in text (or we can say every substring of text) must be
a prefix of one of all possible suffixes. The statement seems complicated, but it is a
simple statement, we just need to take an example to check validity of it.
Suffix tree can be used for a wide range of problems. Following are some famous
problems where Suffix Trees provide optimal time complexity solution.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string