DS V Unit Notes

UNIT-V PATTERN MATCHING AND TRIES

Pattern matching in computer science is the checking and locating of specific
sequences of data of some pattern among raw data or a sequence of tokens. Unlike
pattern recognition, the match has to be exact in the case of pattern matching. Pattern
matching is one of the most fundamental and important paradigms in several
programming languages. Many applications make use of pattern matching as a major
part of their tasks.
Brute force is a straightforward approach to solving a problem, usually based directly
on the problem statement and definitions.
Brute Force (Naive) String Matching Algorithm
When we think about string matching, the simplest technique that comes to mind is
the following: starting from the first letter of the text and the first letter of the
pattern, check whether these two letters are equal. If they are, check the second
letter of the pattern against the second letter of the text. If they are not equal, move
the pattern forward so that its first letter is aligned with the second letter of the text,
and compare again. This is the obvious technique everyone can think of.
The Brute Force string matching algorithm works exactly like this, which is why it
is also called the Naive string matching algorithm (naive means basic). Let us learn
this method using an example.

Red Boxes - Mismatch, Green Boxes - Match

In the figure above, red boxes mark letters of the pattern that mismatch the
corresponding letters of the text, and green boxes mark letters that match. According
to the figure:
In the first row we check whether the first letter of the pattern matches the first
letter of the text. It is a mismatch, because "S" is the first letter of the pattern and "T"
is the first letter of the text. Then we move the pattern by one position, as shown in
the second row.
Next we compare the first letter of the pattern with the second letter of the text. It is
also a mismatch. We continue this checking and moving process in the same way.
In the fourth row the first letter of the pattern matches the text; in that case we do not
move the pattern but instead advance to the next letter of the pattern. We only shift
the pattern by one position when we find a mismatch. In the last row, all the
letters of the pattern match consecutive letters of the text.
Running Time Analysis of the Brute Force String Matching Algorithm
Let the length of the text be n (|text| = n) and the length of the pattern be m
(|pattern| = m).
In the worst case, at each position we have to make m comparisons, and we have to
make these m comparisons at (n-m+1) positions.
Therefore the total number of comparisons in the whole matching process is
m*(n-m+1), and the running time of the Brute Force string matching algorithm is
O(m(n-m+1)). We usually simplify this to O(nm), since when m << n the factor
(n-m+1) is close to n.

In the example above, |text| = 24, but we perform comparisons only over the first 16
letters, so effectively |text| = 16 and |pattern| = 6.
In the worst case we would have to compare the 6 letters of the pattern with 6 letters
of the text at each position, and there are 11 possible positions within the first 16
letters of the text, that is (|text| - |pattern| + 1) = 16 - 6 + 1 = 11. The number of
positions equals the number of rows in the example above.
Therefore the total number of comparisons for this example is
|pattern| * (|text| - |pattern| + 1) = 6 * 11 = 66 comparisons to match our pattern
against the first 16 letters of the text. When |pattern| << |text|, we approximate the
total number of comparisons by |text| * |pattern|, so the running time is
O(|pattern| * |text|).
Advantages
1. A very simple technique that does not require any preprocessing, so the total
running time is the same as the matching time.
Disadvantages
1. A very inefficient method, because the pattern is shifted by only one position at a
time.

Brute-Force String Matching


Searching for a pattern, P[0...m-1], in text, T[0...n-1]
Algorithm BruteForceStringMatch(T[0...n-1], P[0...m-1])
for i ← 0 to n-m do
j←0
while j < m and P[j] = T[i+j] do
j++
if j = m then return i
return -1

#include <stdio.h>
#include <string.h>

char t[100], p[50];

int brute_force(void);

int main()
{
    int pos;
    printf("Enter the source string: ");
    scanf("%99s", t);
    printf("Enter the pattern: ");
    scanf("%49s", p);
    pos = brute_force();
    if (pos == -1)
        printf("%s pattern not found in text\n", p);
    else
        printf("%s pattern found at index %d\n", p, pos);
    return 0;
}

int brute_force(void)
{
    int n = strlen(t);                /* length of the text */
    int m = strlen(p);                /* length of the pattern */
    int i, j;
    for (i = 0; i <= n - m; i++)      /* try every alignment of the pattern */
    {
        j = 0;
        while (j < m && t[i + j] == p[j])
            j++;
        if (j == m)
            return i;                 /* pattern found at index i */
    }
    return -1;                        /* pattern not found */
}
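For example, with the source string "AABAACAADAABAABA" and the pattern "AABA", the program above reports the pattern found at index 0; since brute_force() returns on the first match, only the first occurrence is reported.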
Boyer Moore Algorithm
Pattern searching is an important problem in computer science. When we search for
a string in a notepad/word file, a browser, or a database, pattern searching
algorithms are used to show the search results. A typical problem statement would
be: given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char
pat[], char txt[]) that prints all occurrences of pat[] in txt[]. You may assume that n
> m.
Examples:
Input: txt[] = "THIS IS A TEST TEXT"
pat[] = "TEST"
Output: Pattern found at index 10

Input: txt[] = "AABAACAADAABAABA"


pat[] = "AABA"
Output: Pattern found at index 0
Pattern found at index 9
Pattern found at index 12

Boyer Moore is a combination of the following two approaches.

1) Bad Character Heuristic
2) Good Suffix Heuristic
Both of the above heuristics can also be used independently to search for a pattern in a
text. Let us first understand how the two independent approaches work together in the
Boyer Moore algorithm. If we take a look at the Naive algorithm, it slides the pattern
over the text one position at a time. The KMP algorithm does preprocessing over the
pattern so that the pattern can be shifted by more than one position. The Boyer Moore
algorithm does preprocessing for the same reason. It processes the pattern and creates
different arrays for the two heuristics. At every step, it slides the pattern by the maximum
of the slides suggested by the two heuristics, so it uses the better of the two heuristics at
every step.
Unlike the previous pattern searching algorithms, the Boyer Moore algorithm starts matching from
the last character of the pattern.

Bad Character Heuristic


The idea of the bad character heuristic is simple. The character of the text which doesn't
match the current character of the pattern is called the Bad Character. Upon a
mismatch, we shift the pattern until:
1) The mismatch becomes a match, or
2) The pattern P moves past the mismatched character.
Case 1 – Mismatch becomes a match
We look up the position of the last occurrence of the mismatching character in the pattern;
if the mismatching character exists in the pattern, then we shift the pattern so that this
occurrence gets aligned with the mismatching character in the text T.

case 1

Explanation: In the above example, we get a mismatch at position 3. Here the
mismatching character is "A". Now we search for the last occurrence of "A" in the pattern.
We find "A" at position 1 in the pattern (displayed in blue), and this is its last occurrence.
Now we shift the pattern by 2 positions so that "A" in the pattern gets aligned with "A" in the text.
Case 2 – Pattern moves past the mismatched character
We look up the position of the last occurrence of the mismatching character in the pattern;
if the character does not exist in the pattern, we shift the pattern past the mismatching character.

Explanation: Here we have a mismatch at position 7. The mismatching character "C"
does not exist in the pattern, so we shift the pattern past position 7, and
eventually, in the above example, we get a perfect match of the pattern (displayed in
green). We do this because "C" does not exist in the pattern, so at every shift before
position 7 we would get a mismatch and the search would be fruitless.
In the following implementation, we preprocess the pattern and store the last
occurrence of every possible character in an array of size equal to the alphabet size. If the
character is not present at all, the heuristic may produce a shift by m (the length of the pattern).
Therefore, the bad character heuristic takes O(n/m) time in the best case.
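As a small illustration, for the pattern "ABC" used in the driver program below, badCharHeuristic() fills badchar['A'] = 0, badchar['B'] = 1, badchar['C'] = 2, and leaves every other entry as -1, so a mismatch against any character that does not occur in the pattern lets the pattern shift completely past that character.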
/* C Program for Bad Character Heuristic of Boyer
   Moore String Matching Algorithm */
#include <limits.h>
#include <string.h>
#include <stdio.h>

#define NO_OF_CHARS 256

// A utility function to get maximum of two integers
int max(int a, int b) { return (a > b) ? a : b; }

// The preprocessing function for Boyer Moore's
// bad character heuristic
void badCharHeuristic(char *str, int size, int badchar[NO_OF_CHARS])
{
    int i;

    // Initialize all occurrences as -1
    for (i = 0; i < NO_OF_CHARS; i++)
        badchar[i] = -1;

    // Fill the actual value of last occurrence
    // of a character
    for (i = 0; i < size; i++)
        badchar[(int) str[i]] = i;
}

/* A pattern searching function that uses the Bad
   Character Heuristic of the Boyer Moore Algorithm */
void search(char *txt, char *pat)
{
    int m = strlen(pat);
    int n = strlen(txt);

    int badchar[NO_OF_CHARS];

    /* Fill the bad character array by calling
       the preprocessing function badCharHeuristic()
       for the given pattern */
    badCharHeuristic(pat, m, badchar);

    int s = 0;   // s is the shift of the pattern with
                 // respect to the text
    while (s <= (n - m))
    {
        int j = m - 1;

        /* Keep reducing index j of the pattern while
           characters of pattern and text are
           matching at this shift s */
        while (j >= 0 && pat[j] == txt[s + j])
            j--;

        /* If the pattern is present at the current
           shift, then index j will become -1 after
           the above loop */
        if (j < 0)
        {
            printf("\n pattern occurs at shift = %d", s);

            /* Shift the pattern so that the next
               character in the text aligns with the last
               occurrence of it in the pattern.
               The condition s+m < n is necessary for
               the case when the pattern occurs at the end
               of the text */
            s += (s + m < n) ? m - badchar[txt[s + m]] : 1;
        }
        else
            /* Shift the pattern so that the bad character
               in the text aligns with the last occurrence of
               it in the pattern. The max function is used to
               make sure that we get a positive shift.
               We may get a negative shift if the last
               occurrence of the bad character in the pattern
               is on the right side of the current
               character. */
            s += max(1, j - badchar[txt[s + j]]);
    }
}

/* Driver program to test above function */
int main()
{
    char txt[] = "ABAAABCD";
    char pat[] = "ABC";
    search(txt, pat);
    return 0;
}
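With the driver values above (txt = "ABAAABCD", pat = "ABC"), the program prints: pattern occurs at shift = 4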

Knuth-Morris-Pratt algorithm
The KMP algorithm is one of the most popular pattern matching algorithms. KMP stands for
Knuth-Morris-Pratt. The KMP algorithm was invented by Donald Knuth and Vaughan
Pratt together, and independently by James H. Morris, in the year 1970. In 1977, all
three jointly published the KMP algorithm.
The KMP algorithm was the first linear time complexity algorithm for string matching.
It is one of the string matching algorithms used to find a Pattern in a Text.

The KMP algorithm compares the pattern with the text character by
character from left to right. But whenever a mismatch occurs, it uses a preprocessed table
called the "Prefix Table" to skip character comparisons while matching. Sometimes the prefix
table is also known as the LPS Table. Here LPS stands for "Longest proper Prefix which is also Suffix".

Steps for Creating LPS Table (Prefix Table)


 Step 1 - Define a one dimensional array with the size equal to the length of the Pattern.
(LPS[size])
 Step 2 - Define variables i & j. Set i = 0, j = 1 and LPS[0] = 0.
 Step 3 - Compare the characters at Pattern[i] and Pattern[j].
 Step 4 - If both are matched then set LPS[j] = i+1 and increment both i & j values by one.
Go to Step 3.
 Step 5 - If both are not matched then check the value of variable 'i'. If it is '0' then set LPS[j]
= 0 and increment 'j' value by one; if it is not '0' then set i = LPS[i-1]. Go to Step 3.
 Step 6 - Repeat the above steps until all the values of LPS[] are filled.

Let us use the above steps to create the prefix table for a pattern, as in the worked example below.
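For instance, for the pattern "ABABCABAB" (a pattern chosen here purely for illustration), the steps above give:

Index   : 0 1 2 3 4 5 6 7 8
Pattern : A B A B C A B A B
LPS     : 0 0 1 2 0 1 2 3 4

For example, LPS[3] = 2 because "AB" is the longest proper prefix of "ABAB" that is also a suffix of it, and LPS[4] = 0 because no proper prefix of "ABABC" is also a suffix of it.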


How to use the LPS Table
We use the LPS table to decide how many character comparisons can be skipped when a
mismatch occurs.
When a mismatch occurs, check the LPS value of the character just before the mismatched
character in the pattern. If it is '0', then start comparing the first character of the pattern with the
character of the text that follows the mismatched position. If it is not '0', then continue comparing
from the character of the pattern whose index equals that LPS value, against the mismatched
character of the text.

How the KMP Algorithm Works


Let us see a working example of KMP Algorithm to find a Pattern in a Text...
code in C:

#include <stdio.h>
#include <string.h>

char txt[100], pat[100];
int M, N, lps[100];

/* Build the LPS ("longest proper prefix which is also suffix") table
   for the pattern pat[0..M-1] */
void computeLPSArray()
{
    int len = 0, i;
    lps[0] = 0;                       /* lps[0] is always 0 */
    i = 1;
    while (i < M)
    {
        if (pat[i] == pat[len])
        {
            len++;
            lps[i] = len;
            i++;
        }
        else
        {
            if (len != 0)
                len = lps[len - 1];   /* fall back to the previous border */
            else
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}

void KMPSearch()
{
    int j = 0, i = 0;                 /* j indexes pat[], i indexes txt[] */
    M = strlen(pat);
    N = strlen(txt);
    computeLPSArray();
    while (i < N)
    {
        if (pat[j] == txt[i])
        {
            j++;
            i++;
        }

        if (j == M)
        {
            printf("Found pattern at index %d \n", i - j);
            j = lps[j - 1];
        }
        else if (pat[j] != txt[i])
        {
            if (j != 0)
                j = lps[j - 1];       /* skip comparisons using the LPS table */
            else
                i = i + 1;
        }
    }
}

int main()
{
    printf("\n ENTER THE TEXT : ");
    fgets(txt, sizeof(txt), stdin);
    txt[strcspn(txt, "\n")] = '\0';   /* strip the trailing newline */
    printf("\n ENTER THE PATTERN : ");
    fgets(pat, sizeof(pat), stdin);
    pat[strcspn(pat, "\n")] = '\0';
    KMPSearch();
    return 0;
}
output:-

ENTER THE TEXT : Welcome To CampusCoke


ENTER THE PATTERN : C
Found pattern at index 11
Found pattern at index 17
--------------------------------

Tries
Ordinary search trees are used to store collections of numerical values, but they are
not well suited for storing collections of words or strings. A trie is a data structure
which is used to store a collection of strings and makes searching for a pattern in
those words easier. The term trie comes from the word retrieval: the trie data
structure makes retrieval of a string from a collection of strings easier. A trie is also
called a Prefix Tree and sometimes a Digital Tree. A trie is defined as follows...
A trie is a tree-like data structure used to store a collection of strings.
A trie can also be defined as follows...
A trie is an efficient information storage and retrieval data structure.
The trie data structure provides fast pattern matching for string data values. Using a
trie, we bring the search complexity for a string down to the optimal limit: a trie
searches for a string in O(m) time, where m is the length of the string.
In a trie, every node except the root stores a character value. Every node in a trie can
have one or more children. All the children of a node are alphabetically
ordered. If any two strings have a common prefix then they will have the same
ancestors.
Example
 The standard Trie data structure
o Definitions:
 Alphabet Σ = a set of characters
 Let S = a set of s strings (= keys) over an alphabet Σ
 A trie T that stores the keys in S is a structure where:
 Each node of T (except the root node) is labeled with a character c ∈ Σ
 The root node has no label !!!
 Each internal node of T can have ≤ |Σ| children
 The keys (characters) are stored in alphabetical order inside an internal node
 The trie T has s external nodes
 Each external node is associated with one string in S
(and index (= location) information is stored for these strings)
 The path from the root node to an external node yields exactly one string in S

o Example:
 S = { bear, bell, bid, bull, buy, sell, stock, stop } (s = 8)

Trie:
 Note:
 There are 8 leaf nodes !!

 How to implement a trie

o There are many ways to implement a trie
 Use an array of references
 Each array element represents one letter of the alphabet
 Each array reference will point to a sub-trie that corresponds to the strings that start with the corresponding letter
 Use a binary tree
 This implementation is called a bitwise trie
 Implementing a trie using an array of references
o Array implementation:
 If the trie is dense and/or the alphabet Σ is small (e.g., Σ = {a, b, c, ..., z}), you can use an array to store the labels
 Node structure: (a C sketch of such a node is given below)

o Example:
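Below is a minimal sketch of such an array-of-references node in C. This is an illustration only; the field names (child, isEndOfKey) and the lowercase-only alphabet Σ = {a, ..., z} are assumptions of this sketch, not part of the original notes.

#include <stdbool.h>

#define ALPHABET_SIZE 26                        /* assuming Σ = {a, b, c, ..., z} */

/* One trie node: an array of references, one slot per letter of the alphabet */
struct TrieNode {
    struct TrieNode *child[ALPHABET_SIZE];      /* child[c - 'a'] points to the sub-trie
                                                   for keys whose next letter is c,
                                                   or is NULL if no such key exists */
    bool isEndOfKey;                            /* marks that a key ends at this node */
};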

Implementing a trie using a binary tree

o The binary tree implementation:
 The binary tree implementation of a trie is known as a bitwise trie
 The implementation uses the alphabet Σ = {0, 1}
 In other words, the implementation stores sequences of bits
 The keys are read as a sequence of bits

Example:

bear = 'b' 'e' 'a' 'r'
     = 01100010 01100101 01100001 01110010

o A bitwise trie is a binary tree:

Structural properties of the standard trie

o Properties of the standard trie:
 Every internal node has ≤ |Σ| children
 This follows from the way that the trie is constructed
 The trie T on the set S with s strings (keys) has exactly s external nodes
 This is because a path from the root node to one external node corresponds to 1 key
 The height of the trie T on the set S = the length of the longest string ∈ S
 This is because a path from the root node to one external node corresponds to 1 key
 The longest path = the longest key ∈ S
 The number of nodes in the trie T on the set S = O(n), where n = # characters in the strings ∈ S
 In the worst case, every character in the keys is different

E.g.:

 S = { try, die, flaw, pack }

 Inserting into a standard trie


o High level description:

p = root; // Start at the root

for ( each character c in the keyword ) do


{
if ( c is already stored in the current node's children )
traverse to that sub-trie
else
create a new node for c (and traverse to it)
}

insert value into leaf node

Example:

 Insert stock in this trie:


 First we traverse the prefix that is already stored in the
trie: "sto":

 Then we create nodes for the letters that are not yet in the trie: "ck":

o Pseudo code:

Value put( Key k, Value v )
{
int i;
TrieNode p, q;

p = root;
for ( i = 0; i < k.length(); i++ )
{
nextChar = k[i];

if ( nextChar ∈ p.char[] )
{
nextLink = link associated with the char entry;
p = nextLink; // Traverse
}
else
{
q = new TrieNode(); // Create new node
insert "nextChar" with link q into p;
p = q; // Traverse to the new node
}
}

/*
------------------------------------------------------
When the for loop ends, we have found or created
the external node for the key "k"
------------------------------------------------------ */
insert v into node p;
}
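As a concrete counterpart to the pseudocode above, here is a minimal C sketch of insertion into an array-of-references trie. It repeats the node type from the earlier sketch; the names trie_new_node, trie_insert, child and isEndOfKey, and the lowercase-only alphabet, are assumptions of this sketch.

#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

#define ALPHABET_SIZE 26

struct TrieNode {
    struct TrieNode *child[ALPHABET_SIZE];
    bool isEndOfKey;
};

/* Allocate a zero-initialized node (all child references start out empty) */
struct TrieNode *trie_new_node(void)
{
    return calloc(1, sizeof(struct TrieNode));
}

/* Insert a lowercase key into the trie rooted at root */
void trie_insert(struct TrieNode *root, const char *key)
{
    struct TrieNode *p = root;                    /* start at the root */
    for (size_t i = 0; i < strlen(key); i++)
    {
        int index = key[i] - 'a';
        if (p->child[index] == NULL)              /* character not stored yet:  */
            p->child[index] = trie_new_node();    /* create a new node for it   */
        p = p->child[index];                      /* traverse to that sub-trie  */
    }
    p->isEndOfKey = true;                         /* mark the end of the key */
}

For example, calling trie_insert(root, "stock") on the trie built from the earlier example set first follows the existing path "sto" and then creates new nodes for "ck", exactly as described above.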

 Advantages of Tries over an ordinary map


 Looking up data in a trie is in the worst case O(m), where m = length of the key
 Lookup in a (tree-based) map is O(lg(n)), where n = # entries !!!

So a trie has performance similar to that of a hash table !!!

 Unlike a hash table, a trie can provide an alphabetical ordering of the entries by key

(I.e., a trie implements an ordered map while a hash table cannot !)

Handling keys that are prefixes of another key

o The standard trie has the property that:
 Only the external nodes can store information
(The path formed by the internal nodes represents the key)

o When a key (string) is a prefix of another key, the path of the first
key would end in an internal node
Example: at and ate

o Solution:
 Add a special termination symbol ◊ to the alphabet Σ
 The termination symbol ◊ has the lowest value in the alphabet
I.e., the termination symbol ◊ precedes every character in the alphabet Σ !!
 We append the termination symbol ◊ to each keyword stored in the trie

Example:
o Note:
 We typically use the NUL character '\0' as the termination symbol

Compressed tries
o A compressed trie is a trie with one additional rule:
 Each internal node has ≥ 2 children

Such a compacted trie is also known as:

 a Patricia trie or a Radix trie


o In order to enforce the above rule, the labels are generalized:


 Each node is labeled with a string (multiple characters)

(The label used to be a single character)

Converting a standard trie to a compressed trie

o Redundant node:
 An internal node v is redundant if
 Node v is not the root node, and
 Node v has 1 child node

Example:
Redundant chain of edges:

 A chain of edges:

(v0,v1), (v1,v2), ..., (vk−1,vk)

is redundant if:

 Nodes v1, v2, ..., vk−1 are redundant


 Nodes v0 and vk are not redundant

Example:

Compression algorithm:
 Replace:
 the redundant chain of edges (v0,v1), (v1,v2), ..., (vk−1,vk)

by one edge:

 (v0, vk)

 Replace:
 the label of vk

by the label:

 v1 v2 ... vk

Example:

 Before compression:

After compression:
What is different with the implementation of a compressed trie

o Recall how the standard trie is implemented:

o The compressed trie uses strings (multiple characters) as labels:


 Find a node that has only one child node:

 Result:
Another example:

 Find a node that has only one child node:

 Result:
Properties of the compressed trie

 Each internal node has ≥ 2 children and ≤ |Σ| children

 A compressed trie T storing s strings (keys) has: s external nodes

 A compressed trie T storing s strings (keys) has: O(s) total number of nodes
 This is because a compressed trie is in the worst case comparable to a binary tree
 In a binary tree:
 # external nodes = # internal nodes + 1

 Hence:
 # internal nodes in a compressed trie < s
 Total # nodes < 2 × s

Suffix tries
A suffix tree is a compressed trie of all the suffixes of a given string. Suffix trees
help in solving a lot of string-related problems like pattern matching, finding
distinct substrings in a given string, finding the longest palindrome, etc.

Before going to suffix tree, let's first try to understand what a compressed trie is.
Consider the following set of strings:
{"banana","nabd","bcdef","bcfeg","aaaaaa","aaabaa" }
A standard trie for the above set of strings will look like:

And a compressed trie for the given set of strings will look like:

As should be clear from the images shown above, in a compressed trie, edges that
lead to a node having a single child are combined together to form a single edge,
and their edge labels are concatenated. This means that each internal node in a
compressed trie has at least two children. It also has at most N leaves, where N is
the number of strings inserted into the compressed trie. These two facts, that each
internal node has at least two children and that there are at most N leaves, imply that
there are at most 2N−1 nodes in the trie. So the space complexity of a compressed
trie is O(N), as compared to the O(N^2) of a normal trie (for example, when the
strings are the suffixes of a string of length N).
That is one reason to prefer compressed tries over normal tries.
Before going to the construction of suffix trees, there is one more thing that should be
understood: the Implicit Suffix Tree. In an implicit suffix tree there are at most N leaves,
while in a normal suffix tree there should be exactly N leaves. The reason for having at
most N leaves is that one suffix may be a prefix of another suffix. The following example
will make it clear. Consider the string "banana".
The Implicit Suffix Tree for the above string is shown in the image below:

To avoid getting an Implicit Suffix Tree, we append a special character that is not
equal to any other character of the string. Suppose we append $ to the given string;
the new string is then "banana$". Now its suffix tree will be

Now let's move on to the construction of suffix trees.

A suffix tree, as mentioned previously, is a compressed trie of all the suffixes of a
given string, so the brute force approach is to consider all the suffixes of the
given string as separate strings and insert them into the trie one by one (see the
sketch below). But the time complexity of the brute force approach is O(N^2),
which is of no use for large values of N.
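A minimal C-style sketch of this brute-force construction is shown below; it assumes a trie node type and a trie_insert() routine like the ones sketched earlier in these notes (those names are assumptions of the sketch, and for simplicity the compression step is not shown).

#include <string.h>

/* Brute-force suffix-trie construction: insert every suffix of text into a trie.
   struct TrieNode and trie_insert() are assumed to be declared as in the earlier sketch. */
void build_suffix_trie(struct TrieNode *root, const char *text)
{
    size_t n = strlen(text);
    for (size_t i = 0; i < n; i++)
        trie_insert(root, &text[i]);   /* &text[i] is the suffix starting at position i */
}

This builds the standard (uncompressed) suffix trie in O(N^2) time and space; joining the chains of single-child nodes afterwards, as described later, yields the compressed suffix tree.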

A Suffix Tree for a given text is a compressed trie for all suffixes of the given
text. We have discussed Standard Trie. Let us understand Compressed Trie with
the following array of words.
{bear, bell, bid, bull, buy, sell, stock, stop}
Following is the standard trie for the above input set of words.

Following is the compressed trie. A compressed trie is obtained from the standard trie by
joining chains of single-child nodes. The nodes of a compressed trie can be stored
compactly by storing index ranges at the nodes.

How to build a Suffix Tree for a given text?


As discussed above, a Suffix Tree is a compressed trie of all suffixes, so the following are
very abstract steps to build a suffix tree from a given text.
1) Generate all suffixes of the given text.
2) Consider all suffixes as individual words and build a compressed trie.
Let us consider an example text “banana\0”, where ‘\0’ is the string termination
character. Following are all suffixes of “banana\0”:
banana\0
anana\0
nana\0
ana\0
na\0
a\0
\0
If we consider all of the above suffixes as individual words and build a trie, we get the
following. If we join chains of single-child nodes, we get the following compressed trie,
which is the Suffix Tree for the given text “banana\0”.

Please note that the above steps are just a way to create a Suffix Tree manually;
efficient construction algorithms are not discussed in these notes.
How to search a pattern in the built suffix tree?
We have discussed above how to build a Suffix Tree, which is needed as a
preprocessing step in pattern searching. Following are the abstract steps to search for a
pattern in the built Suffix Tree.
1) Starting from the first character of the pattern and the root of the Suffix Tree, do the
following for every character.
…..a) For the current character of the pattern, if there is an edge from the current node
of the suffix tree, follow the edge.
…..b) If there is no edge, print “pattern doesn’t exist in text” and return.
2) If all characters of the pattern have been processed, i.e., there is a path from the root
for the characters of the given pattern, then print “Pattern found”.
Let us consider the example pattern “nan” to see the searching process.
The following diagram shows the path followed for searching “nan” or “nana”.
How does this work?
Every pattern that is present in the text (in other words, every substring of the text) must
be a prefix of one of the suffixes of the text. The statement may seem complicated, but it
is simple; for example, the substring “nan” of “banana” is a prefix of the suffix “nana”,
which is why walking the path n-a-n from the root of the suffix tree finds it.

Applications of Suffix Tree

A suffix tree can be used for a wide range of problems. Following are some famous
problems where Suffix Trees provide optimal time complexity solutions.
1) Pattern Searching
2) Finding the longest repeated substring
3) Finding the longest common substring
4) Finding the longest palindrome in a string
