0% found this document useful (0 votes)
17 views62 pages

String Matching Algo (1)

The document discusses string matching techniques, including the naive string matching algorithm and the Rabin-Karp algorithm, which are used to find occurrences of a pattern within a text. It outlines the complexities of these algorithms and provides examples to illustrate their functionality. Applications of string matching include spell checkers, search engines, and bioinformatics.

Uploaded by

rsinglame24
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views62 pages

String Matching Algo (1)

The document discusses string matching techniques, including the naive string matching algorithm and the Rabin-Karp algorithm, which are used to find occurrences of a pattern within a text. It outlines the complexities of these algorithms and provides examples to illustrate their functionality. Applications of string matching include spell checkers, search engines, and bioinformatics.

Uploaded by

rsinglame24
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 62

String Matching

Finding all occurrences of a pattern in


the text.
Applications
e Spell Checkers.
e Search Engines.
e Spam Filters.
* Intrusion Detection System.
e Plagiarism Detection.
* Bioinformatics — DNA Sequencing.
* Digital Forensics.
* Information Retrieval, etc.
String Matching Problem
* Let,
Text T'[1..n] 1s an array of length n.
e Pattern P [1..m] is an array of length m < n.
P and T are drawn from a finite alphabet X.
—2={0,1} orX={a,b, ..., z}.
* Example:
—T=abcabaabcabac
—P=abaa
Contd...; [aToTclalblala
b clalo]alc
sfiftzox P alblal|a

sfift:lx P alblala

51ift=2x P alblala

51ift=3/ P alblala

s1ift=4x P albla|a

51ift=5x P alblala

s1ift=6x P alblala

51ift=7x P alblala

shift=8x P albla|a

shift=0}( P a|blala
Contd...
text T albl|lc|la|lb|lala|/lb|c|a|b|la|c

5=3
pattern P ——> a|b|a|a

» Shift s a valid shift, if P occurs with shift s in 7.


—0<s<n-mand T[s+ 1.s +m]=P[l.m].
* Otherwise, shift s is an invalid shift.
* String-matching problem means finding all valid
shifts with which a given pattern P occurs in a given
text 7.
Naive String Matching Algorithm
NAIVE-STRING-MATCHER(T,P)
1. n="T.length
m = P.length
fors=0ton—m
AT

ifP[l.m]|==T][s+ 1..s+m]
print “Pattern occurs with shift” s

e Complexity: O((n—m + 1)m)


NAIVE-STRING-MATCHER (T, P)
Example ] n = T.length
2 m = P.length
3 fors =0ton—m
4 if P[1..ml==T[s+1..5+m]
D print “Pattern occurs with shift” s
* Text T'= acaabc, and pattern P = aab.
a4 clajla|b|c alc|ala|b]|c

q=0aab =TI lalb

alc|la|a|b|c alc|la|a|b|c

LS
b2
1]

I}

]
)
Rabin-Karp Algorithm
* Uses elementary number-theoretic notions.
—Equivalence of two numbers modulo a third
number.
* Let,X=1{0,1,2,3,4,5,6,7,8, 9}.
String of k consecutive characters represents a
length-k decimal number.
—Thus, character string 31415 corresponds to the
decimal number 31,415.
* Note:
—In the general case, each character is a digit in
radix-d notation, where d = [X|.)
Example
P 3111415 mod
13 =7

' {2]3|5(9(0(2(|3|1|4|1|5(2]6[7[3]|9]|9]|2
\ Y
1 }

5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 | 2 SRS
2 [SoR S SO 2 (1

I T -+« mod 13
311|017 |8|4(5|10|11}F |9 |11
valid spurious
match hit
A few calculations...

* For a pattern P [1..m], let p denote its corresponding


value in radix-d notation.

» Using Horner’s rule, p can be computed in time O(m).

p=P[m]l+dP[m—-1]+dP[m-2]+...+d(P[2]+dP[1])...)).

* Similarly, for a text 7'[1..n], let ¢, denotes the radix-d


notation value of the length-m substring 7' [s + 1.5 +
m], fors=0,1, ....n—m.

* Again, t, can be computed from 7 [1..m] in O(m).


Contd...
* Each of the remaining values ¢, ¢,, ..., ¢, , can be
computed in constant time.
— Subtracting d” T [s + 1] removes the high-order digit
from 7, multiplying the result by d shifts the number left
by one digit position, and adding 7 [s + m + 1] brings in
the appropriate low-order digit.
to=d,—d" 'T[s+1])+T[s+m+1].
e Leth=d" 1, then
tg=d@,—hT[s+1)+T[s+m+1].
* Modulus:
— p modulo g takes O(m) time.
— For all ¢, ¢, modulo g takes O(n —m + 1) time.
Algorithm
RABIN-KARP-MATCHER (T, P, d.q)
I n = T.length
2 m = P.length
3 h=d™ "' modg
4 p=20
5 fg =33

6 fori = ltom // preprocessing


7 p = (dp + P[i]) mod ¢
8 to = (dty + T[i]) mod ¢
9 fors =0ton—m // matching
10 ifp==t;
11 if P[1..m]==T[s+1..5 4+ m]
12 print “Pattern occurs with shift” s
13 ifs<n—m
14 ts11 = (d(ts —T[s+ 1]h) + T[s +m + 1]) mod ¢
Example -1
Index 1 2 (3 (45|67 ]8]9]10 1211314 |15| 16| 17 18 19

Text (T) 2 31519(0(2|3|1(4]1


Pattern: 31415

>=1{0,1,....9) d=[Z=10 q=13


n=19, m=5n-m= 14.
h =105"mod13 =10*mod13 =3.
p =(3x104+1x10°+4x102+1x10'+5x
109 mod 13 =7.
tp=2x10*+3x10°+5x
102+ 9 x 10" + 0 x 10°) mod 13 = 8.

Step 1: s=0,t,=8,p="7.
p==t, — No.
s<14 — Yes.
t.=d(t,—hT[s+1])+ T[s +m+ 1] (mod 13)
t, =10 (8 — 3 (2)) + 2 (mod 13)
t, =10 (2) +2 (mod 13) =22 (mod 13) =9.
Contd...
Index 1 2 (3 (45|67 ]8]9]10 1213|1415 16 17 18 19

Text (T) 2 31519(0(2(3|1(4]1


Pattern: 31415

>=10,1,...,9} d=[Z/=10
n=19, m=5n-m= 14.
h =10 mod 13 =10*mod 13 =3.
p =(3x10%+1x103+4x 102+ 1 x 10!+ 5 x 10°) mod 13 =7.
tp=2x10*+3x10°+5x
102+ 9 x 10" + 0 x 10°) mod 13 = 8.

Step 2: s=1,t,=9,p="7.
p==t, — No.
s<14 — Yes.
t.=d(t,—hT[s+1])+ T[s +m+ 1] (mod 13)
t, =10 (9 -3 (3)) + 3 (mod 13)
t, =10 (0)+ 3 (mod 13) =3 (mod 13) 3.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T) |2 (3]15]910(2[3({1|4(1[5]|2|6|7[3]|9]|9|2|1
Pattern: 31415
>=1{0,1,...,9} d=Z=10 q=13
* n=19,m=5,n—-m=14.
h =10>'mod13 =10*mod 13 =3.
p =(3x104+1x10°+4x102+1x10'+5x 109 mod 13 = 7.
tp=(2
% 10*+3 x 103+ 5 x 102+ 9 x 10! + 0 x 10°) mod 13 = 8.
Step 3: s=2,t,=3,p="7.
p==t, — No.
s<14 — Yes.
t,y=d@,—hT[s+1])+T[s+m+ 1] (mod 13)
t;, =103-3(5))+1(mod 13) =10(-12)+ 1 (mod 13)
Because-12mod13=1 t; =10 (1) + 1 (mod 13) =11 (mod 13) =11.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T)|2(3[5]910(2[3|1|4|1[5|2|6|7[3]|9]|9|2|1
Pattern: 31415
>=1{0,1,...,9} d=Z=10 q=13
* n=19,m=5,n-m=14.
h =10>'mod13 =10*mod13 =3.
p =(x104+1x10°+4x102+1x 10" +5x 109 mod 13 =7.
ty=(2%104 +3 x 103+ 5 x 102+ 9 x 10! + 0 x 10°) mod 13 = 8.
Step 4: s=3,3=11,p="7.
p==1t; — No.
s<14 — Yes.
too=d(t,—hT[s+1])+ T[s+m+ 1] (mod 13)
ty, =10(11-3(9)) +4 (mod 13) =10 (-16) +4 (mod 13)
Because -16 mod 13=10 t; = 10 (10) +4 (mod 13) =104 (mod 13) =0.
Contd...
Index 1 2 (3 (4|56 7]8 13 14 15 16 17 18 19

Text (T) 2 315/19(0(2(3|1


Pattern: 31415

>=10,1,...,9} d=[Z/=10
n=19, m=5n-m= 14.
h =10 mod 13 =10*mod 13 =3.
p =(3x10%+1x103+4x102+1 x 10! + 5 x 10 mod 13 = 7.
tp=02*x10*+3x10°+5x
10>+ 9 x 10" + 0 x 10°) mod 13 = 8.

Step 5: s=4,t,=0,p="7.
p==t, — No.
s<14 — Yes.
t.=d(t,—hT[s+1])+ T[s +m+ 1] (mod 13)
t; =10 (0—3 (0)) + 1 (mod 13)
ts; =1 (mod 13) =1.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T) |2 (359|023 1|4[1(5|2|6|7[3]|9]|9|2]1


Pattern: 31415
>=1{0,1,...,9} d=Z=10 q=13
* n=19,m=5,n—-m=14.
h =10>'mod13 =10*mod13 =3.
p =(x104+1x10°+4x102+1x 10" +5x 109 mod 13 =7.
ty=(2%104 +3 x 103+ 5 x 102+ 9 x 10! + 0 x 10°) mod 13 = 8.
Step 6: s=5,ts=1,p="7.
p=—=1t5; — No.
s<14 — Yes.
T[s+m+ 1] (mod 13)
tooy=d(t,—hT[s+1])+
te =10 (1 -3 (2)) +5 (mod 13) =10 (-5)
+ 5 (mod 13)
Because-5mod13=8 tc =10(8)+5 (mod 13) =85 (mod 13) =7.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T)[2|3(5]9(0({23|1(4[1|5(2|6[7|3|9]|9|2]1
Pattern: 31415
>x={0,1,...,9} d=Z|=10 q=13
* n=19,m=5n-m=14.
h =10"mod 13 =10*mod 13 =3.
p =G x10*+1x103+4x102+1x10'+5x 10% mod 13 =7.
ty2=02x10*+3x10°+5x 102+ 9 x 10" + 0 x 10°) mod 13 = 8.
Step 7: s=6,t,=7,p="7.
p==t, — Yes.
Character by character matching p[1..5] == T[7..11].
{31415}==1{31415}. Match, hence s = 6 is a valid shift.
s<14 — Yes.
tiy=d(t,—hT[s+1])+T[s+m+ 1] (mod 13)
t, =10(7-3(3)) + 2 (mod 13) =10 (-2) + 2 (mod
13)
Because-2mod13=11" ¢ — 10 (11)+2
(mod 13) =112
(mod 13) =8.
Contd...
Index 1 2 (3 (45|67 ]8]9]10 1211314 |15| 16| 17 18 19

Text (T) 2 3151910(2(|3(1(4(1


Pattern: 31415

>=1{0,1,....9) d=[Z=10 q=13


n=19, m=5n-m= 14.
h =105"mod13 =10*mod13 =3.
p =(3x104+1x10°+4x102+1x10'+5x
109 mod 13 = 7.
tp=02x10*+3x10°+5x
102+ 9 x 10" + 0 x 10°) mod 13 = 8.

Step 8: s=7,t,=8,p="7.
p==t, — No.
s<14 — Yes.
t.=d(t,—hT[s+1])+ T[s +m+ 1] (mod 13)
te =10 (8 — 3 (1)) + 6 (mod 13)
t; =10 (5) + 6 (mod 13) =56 (mod 13) =4.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T)|2(3[5]|9|0(2[3|1[4[1[5](2]6|7[3|9]|9|2|1
Pattern: 31415
>=1{0,1,...,9} d=Z=10 q=13
* n=19,m=5,n—-m=14.
h =10>'mod13 =10*mod 13 =3.
p =(3x104+1x10°+4x102+1x10'+5x 109 mod 13 =7.
tp=(2
% 10*+3 x 103+ 5 x 102+ 9 x 10! + 0 x 10°) mod 13 = 8.
Step 9: s=8t=4,p="7.
p==1t; — No.
s<14 — Yes.
tyo=d(t,—hT[s+1]) + T[s+m+ 1] (mod 13)
ty =10 (4—-3(4)) + 7 (mod 13) =10 (-8)+ 7 (mod 13)
Because-8mod13=5 to = 10 (5)+ 7 (mod 13) =57 (mod 13) =5.
Contd...
Index 1 2 (3 (45|67 ]8]9]10 12113141516 |17 | 18 19

Text (T) 2 3151910(2(3(1(4f1


Pattern: 31415

>=1{0,1,....9) d=[Z=10 q=13


n=19, m=5n-m= 14.
h =105"mod13 =10*mod13 =3.
p =GB x10*+1x103+4x102+1x10'+5x10% mod 13 =7.
tp=02x10*+3x10°+5x
10>+ 9 x 10! + 0 x 10°) mod 13 = 8.

Step 10: s=9,t,=5,p="7.


p==t, — No.
s<14 — Yes.
t.=d(t,—hT[s+1])+ T[s +m+ 1] (mod 13)
t,y =10 (53 (1)) + 3 (mod 13)
t,o =10 (2)+ 3 (mod 13) =23 (mod 13) = 10.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T)|2(3[5]|9|0(2[3[1|4|1]5]2|6[7[3]9]|9|2|1
Pattern: 31415
>=1{0,1,...,9} d=Z=10 q=13
* n=19,m=5,n-m=14.
h =10>'mod 13 =10*mod13 =3.
p =(x104+1x10°+4x102+1x 10" +5x 109 mod 13 =7.
ty=(2%104 +3 x 103+ 5 x 102+ 9 x 10! + 0 x 10°) mod 13 = 8.
Step 11: s=10,t,,=10,p=7.
p ==1,,— No.
s<14 — Yes.
t,y=d@,—hT[s+1])+T[s+m+ 1] (mod 13)
t,, =10(10-3(5)) +9 (mod 13) =10(-5)+9 (mod 13)
Because-5mod13=8 t,; =10 (8) +9 (mod 13) =89 (mod 13) =11.
Contd...
Index 1 2 (3 (45|67 ]8]9]10 1211314 |15| 16| 17 18 19

Text (T) 2 3151910(2(3(1(4]|1


Pattern: 31415

>=1{0,1,....9) d=[Z=10 q=13


n=19, m=5n-m= 14.
h =105"mod13 =10*mod13 =3.
p =(x104+1x10°+4x102+1x 10" +5x 109 mod 13 =7.
tp=02*x10*+3x10°+5x
10>+ 9 x 10" + 0 x 10°) mod 13 = 8.

Step 12: s=11,t,=11,p="7.


p==t,;, — No.
s<14 — Yes.
t.=d(t,—hT[s+1])+ T[s +m+ 1] (mod 13)
t;, =10 (11 -3 (2)) +9 (mod 13)
t;, =10 (5)+9 (mod 13) =59 (mod 13) =7.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T)[2|3(5]9(0(2|3[|1|4]1|5(2|6[713[9]9|2]1
Pattern: 31415
>x={0,1,...,9} d=Z|=10 q=13
* n=19,m=5n-m= 14,
h =10>'mod 13 =10*mod 13 =3.
p =3x10*+1x103+4x10>+1x10'+5x10°% mod 13=7.
ty2=2x10*+3x10°+5x 102+ 9 x 10" + 0 x 10°) mod 13 = 8.
Step 13: s=12,t,=7,p="7.
p == t;, — Yes.
Character by character matching p[1..5] ==T[13..17].
{31415}==1{67399}. Mismatch occurs at first character.
s<14 — Yes.
tiy=d(t,—hT[s+1])+T[s+m+ 1] (mod 13)
t;; =10(7-3(6)) +2 (mod 13) =10 (-11)+ 2 (mod 13)
Because-11mod13=2 t . =10 (2)+2 (mod 13) =22 (mod 13) =09.
Contd...
Index 1(2(3(4(|(5|6|7|8]|]9]|10|11|12|13|14|15|16(17 (18|19

Text(T)|2(3[5]|9(0(2[3[1|4|1[5]|2|6[7[3]9]9|2|1
Pattern: 31415
>=1{0,1,...,9} d=Z=10 q=13
* n=19,m=5,n—-m=14.
h =10>'mod13 =10*mod 13 =3.
p =(3x104+1x10°+4x102+1x10'+5x 109 mod 13 = 7.
tp=(2%
104 +3 x 103+ 5 x 102+ 9 x 10! + 0 x 10°) mod 13 = 8.
Step 14: s=13,t3=9,p=7.
p ==1t,3—
No.
s<14 — Yes.
t,y=d@,—hT[s+1])+T[s+m+ 1] (mod 13)
ty, =1009-3(7))+1(mod 13) =10(-12)+ 1 (mod 13)
Because-12mod13=1 t,, =10 (1)+ 1 (mod 13) =11 (mod 13) =11.
Contd...
Index 1 2 (3 (4|56 7]8 13 14 15 16 17 18 19

Text (T) 2 3151910(2]3]1


Pattern: 31415

>=10,1,...,9} d=[g/=10
n=19, m=5n-m= 14.
h =10 mod 13 =10*mod 13 =3.
p =(3x10%+1x103+4x 102+ 1 x 10! +5 x 10°) mod 13 =7.
tp=02x10*+3x10°+5x
102+ 9 x 10" + 0 x 10°) mod 13 = 8.

Step 15: s=14,t,=11,p=17.


p==t,,— No.
s<14 — No.

Step 16: s=15.


Loop terminates.
Index 1 2 3 4 5 6 7 8

Example 2 Text(T)]ala|[b|b|c|la|b]a
ASCII 97 | 97 | 98 | 98 | 99 | 97 | 98 | 97

n=8,n;1=3,n—m=5. Pattern:cab
h =26>'mod3
=262m0d3=1. Zz{a,b,...,Z} d=|2|:26 q=3

p =(99 x 262+ 97 x 26! +98 x 26%) mod 3 = 1.


ty=(97 x 262+ 97 x 26! + 98 x 26°) mod 3 = 2.

Step 1: s=0,t,=2,p=1.
p=—=1t, — No.
s<5 — Yes.
t,g=d@,—hT[s+ 1)+ T[s+m+ 1] (mod 3)
t, =26 (2-1(97))+98 (mod 3)
t, =26(2—1 (1))
+2 (mod 3)
(because 97 mod 3 = 1 and 98 mod 3 =2)
=26 (1) +2 (mod 3)
=28 (mod 3) =1.
Index 1 2 3 4 5 6 7 8

Contd'" Text(T)| ala | blb|c|la|b]|a


ASCII 97 | 97 | 98 | 98 | 99 | 97 | 98 | 97
Pattern:cab

Step 2: s=Lt;,=1Lp=1
p==t;, — Yes.
Character by character matching p[1..3] == T[2..4].
{cab} == {abb}. Mismatch occurs at first character.
s<5 — Yes.
t,o1=d@,—hT[s+1])+ T[s+m+ 1] (mod3)
t, =26 (1 —1(97)) + 99 (mod 3)
t, =26 (1 —1(1))+ 0 (mod 3)
(because 97 mod 3 =1 and 99 mod 3 =0)
=26 (0) (mod 3)
=0.
Index 1 2 3 4 5 6 7 8

Contd'" Text(T)|a|a|b|blc|al|b|a
ASCII 97 | 97 | 98 | 98 | 99 | 97 | 98 | 97
Pattern:cab

Step 3: s=2,t,=0,p=1.
p==1t, — No.
s<5 — Yes.
tioy=d@,—hT[s+ 1)+ T[s+m+ 1] (mod3)
t; =26 (0—1(98)) + 97 (mod 3)
t; =26 (0—-1(2))+ 1 (mod 3)
(because 97 mod 3 =1 and 98 mod 3 = 2)
=26 (0—-2)+ 1 (mod3)
=26 (0+ 1)+ 1 (mod3)
(because 3's complement of -2 = 1)
=26(1)+1(mod3) =27(mod3) =0.
Index 1 2 3 4 5 6 7 8

Contd'" Text(T)|a|a|b|blcla|b|a
ASCII 97 | 97 | 98 | 98 | 99 | 97 | 98 | 97
Pattern:cab

Step 4: s=3,t=0,p=1.
p ==1t; — No.
s<5 — Yes.
tioy=d@,—hT[s+ 1)+ T[s+m+ 1] (mod3)
t, =26 (0—1(98)) + 98 (mod 3)
t, =26 (0—1(2)) +2 (mod 3)
(because 98 mod 3 = 2)
=26 (0—2)+ 2 (mod 3)
=26 (0+ 1)+ 2 (mod3)
(because 3's complement of -2 = 1)
=26(1)+2(mod3) =28(mod3) =1.
Index 1 2 3 4 5 6 7 8
Contd... Text(T)| a|a|b|b|c|al|b]|a
ASCII 97 | 97 | 98 | 98 | 99 | 97 | 98 | 97
Pattern:cab

Step 5: s=4,t,=1,p=1.
p==t, — Yes.
Character by character matching p[1..3] == T[5..7].
{cab} == {cab}. Match, hence s =4 is a valid shift.
s<5 — Yes.
t,o1=d@,—hT[s+1])+ T[s+m+ 1] (mod3)
ts =26 (1—-1(99)) +97 (mod 3)
ts =26 (1-1(0)) + 1 (mod 3)
(because 97 mod 3 =1 and 99 mod 3 =0)
=26 (1-0)+1 (mod3)
=26(1)+1(mod3) =27(mod3) =0.
Index
Contd...
1

Text (T) a

ASCII 97 97 98 98

Pattern:cab

Step 6: s=5,t;=0,p=1.
p ==t; — No.
s<5 — No.

Step 7: s=6.
Loop terminates.

Text (T)
Complexity
* Takes ©(m) preprocessing time.
* Worst-case running time is O(m (n —m + 1)).
—Example: P=aMand T =a", each of the [n—m +
1] possible shifts is valid.
* In many applications, there are a few valid shifts
(say some constant c). In such applications, the
expected matching time is only O(n —m + 1) + cm),
plus the time required to process spurious hits.
Contd...
* Probabilistic analysis
— The probability of a false positive hit for a random
input is 1/q.
— The expected number of false positive hits is O(n/q).
—The expected run time is O(n) + O(m(v + n/q))), if v is
the number of valid shifts.
* Choosing g 2 m and having only a constant number of
hits, then the expected matching time is O(n + m).
* Since m £ n, this expected matching time is O(n).
Knuth-Morris-Pratt Algorithm
* Based on the concept of prefix function for a
pattern.
— Encapsulates knowledge about how the pattern
matches against shifts of itself.
—This information can be used to avoid testing of
invalid shifts.
*T: bacbababaabcbab
e P: ababaca
bla|c|b a bla

B) > alb|a
e

bla|e|lb al/lbla

~
s =542
* Given that pattern characters P [1..q] match text
characters T [s + 1..s + q], what is the least shift
s' > s such that for some k < q,
P[1.k]=T[s'+ 1..s' + k], where s'+ k=s+ q?
1e.s'=s+(q—k)
* Best case k = 0. Skips (q — 1) shifts.
Contd... Note: P, = P[1..]. Similarly, P, = P[1..q].
* In other words, knowing that P is a suffix of T, ,, find the
longest proper prefix P, of P, that is also a suffix of T, .

bla|/lc|/b|la|/bla/ b|lala|b|lc|bla|b| T
%
L) »al|lbla|/bla|cl|al|P
- q -

* Since suffix P, of T, , should also be suffix of P,. Thus, an


equivalent statement is “determine the greatest k < g, such
that Py is a suffix of P”. albla[bla] P,

abla| P,
Prefix Function

o))
&}
—_

N
w

3
Ve
P: ababaca klolol1l2
P,=P[1..q]
P, = {a}. P, = {}. No matching prefix and suffix of P;.
P, ={a b}. P, = {}. No matching prefix and suffix of P,.
P, ={aba}. P, ={a}. P,={abab}. P, ={ab}.

Prefix | Suffix | k Prefix |Suffix |k

aba |bab |3
~—t


O
@)
>

%)

o)
)

g
Na)
* P: ababaca k]0]0 2 01
* Py=P[l.q]
* P.={ababa}.P,={abal. e P, ={a}.
P,={ababaca}.
Keep
the longest. Prefix Suffix k
Prefix Suffix k
EX
CR
ab ca 2
ab ba 2 aba aca 3
abab baca 4
abab baba 4 ababa |abaca |5
ababac|babaca|6
* P,={ababac} P, ={}. No matching prefix and suffix of P, as 'c'
does not appear in any of the proper prefix.
Prefix Function
q— i [1]|2(3]|4]|5]|6
Pli] |la|b|a|b|a|c |a
k—ox[i] |[0]0|1]|2[3]0]1

e Given a pattern P [1..m], the prefix function for the


pattern P is the function IT : {1, 2, ..., m} — {0, 1,
..., m— 1} such that
I [q] = max {k : k <qand P, is a proper suffix of P
* It contains the length of the longest prefix of P that
is a proper suffix of P_.
Contd...
COMPUTE-PREFIX-FUNCTION (P)
1 m = P.length
let r[1..m] be a new array
9 ={ll=10
4 k=0
5 forqg =2tom
6 while £ > Oand Pk + 1] # Plqg]
q k = mlk]
8 if P[k + 1] == Plq]
9 k =k + 1
10 nlgl =k
Il return
KMP Algorithm
KMP-MATCHER(T. P)
1 n = T.length
2 m = P.length
3 m = COMPUTE-PREFIX-FUNCTION(P)
4 ¢g=0 // number of characters matched
5 fori = 1ton // scan the text from left to right
6 while ¢ > O and Plg + 1] # Ti]
7 q = 7lq] // next character does not match
8 if Plg+ 1]==T1i]
9 qg=q+1 // next character matches
10 ifg==m // is all of P matched?
11 print “Pattern occurs with shift™ 7 —m
12 q = mlq] // look for the next match
Complexity
* The COMPUTE-PREFIXFUNCTION runs in 6(m) time as the while
loop in lines 6—7 executes at most m — 1 times altogether.
1. Line 4 starts k at 0, and the only way to increase k is the increment
operation in line 9, which executes at most once per iteration of the
for loop of lines 5-10. Thus, the total increase in k is at most m — 1.
2. Second, since k < q upon entering the for loop and each iteration of
the loop increments g and k < g always. Assignments in lines 3 and 10
ensure that N[q] < qforallg=1, 2, ..., m, which means that each
iteration of the while loop decreases k.
3. Third, k never becomes negative.
— Altogether, the total decrease in k from the while loop is bounded
from above by the total increase in k over all iterations of the for loop,
which is m —1.
* Using similar analysis, the matching time of KMP-MATCHER is
O(n).
i 2[37¢
Example (Preprocessing) p[;a ; b|a ;
il |o]o|1]2
* m=7,MN[1]=0,k=0.
Stepl: q=2,k=0 Plk + 1] == P[q]
(if) P[1] == P[2]. Mismatch.
n[2] =o0.
Step 2: g=3,k=0. Plk + 1] == P[q]
(if) P[1] == P[3]. Match. -+ = 1
n[3] = 1.
Step 3: g=4,k=1. Plk + 1] == P[q]
(if) P[2] == P[4]. Match. kK++ =2
ni4] = 2.
i [1]|2(3]4
Contd... Pl |a|b|a|b
x[i] [0]0]1]2
Step 4: g=5k=2 P[k
+ 1] == P[q
(if) P[3] == P[5]. Match. k++=13
n[s] = 3.
Step 5: g=6,k=3. Plk
+ 1] == P[q]
(while) P[4] == P[6]. Mismatch. k=n[k]=1
(while) P[2] == P[6]. Mismatch. k=n[k]=0

(ify P[1]==P[6]. Mismatch.


n[ej = 0.
Step 6: g=7,k=0. P[k + 1] == P[q]
(if) P[1] == P[7]. Match. k++=1
ni7i = 1.
Step 7: g=8 Loop terminates.
Contd... n==15 IR :

N oV
Bl o g l

B
{ewl

SS I o
(Matching) 7] |0

e T: |bla|c|b|la|bla|bla|b|lal|c]|a

* P: |a|lbla|blalc]|a

Step 1: i=1,9=0 Plg + 1] ==T][i]


(if) P[1] ==TI[1]. Mismatch.
blalc|b|la|b|la|b|al|b|alc]|a
X
- i 2134
Contd... n=15 BRI
m=7 :
il [o]lo|1]2

Step 2: i=2,9=0 Plg+1] ==T]i]


(if) P[1] ==TI[2]. Match. g++=1

blalc|b|alb|la|bla|blalc|a]a

albla|lblalc
- I ]
Contd... n=15 s

B
m=7 :

SS I o
wfi] |0

Step 3: i=3,9=1 Plg+ 1] ==TJi


(while) P[2] ==T[3]. Mismatch.
q=N[q]=0
blal|c|b|a|b|la|b|la|b|a]|c

albla|blalc|a

(if) P[1] ==TI[3]. Mismatch.


b alc|/b|lalblalblalblalc
- i 2 4
Contd... n=1> BRI
m=7 :
x[i] (o]0 2

Step 4: i=4,9=0 Plg+1] ==T]i]


(if) P[1] ==TI[4]. Mismatch.

blalc|b|albla|bja|blalc]|a

X
alblal|b c|a
Contd...

&
ol

|O
"

)
W
Step 5: i=5,q=0

(if) P[1] ==TI[5].

b|a C b a blal|b alblal|c


- i 2134
Contd... ”m-=175 BRI
il [o]lo|1]2

Step 6: i=6,q=1 Plg+ 1] ==T]i]


(if) P[2] ==TI[6]. Match. gq++=2

blalc|b|alb|la|bla|blalc|a]a

a|b|a alc|a
- i 2134
Contd... nm—=175 Pli] |a|b|a|b
il [o]lo|1]2

Step 7: i=7,9=2 Plg+ 1] ==T]i]


(if) P[3] ==TI[7]. Match. g++=3

blalc|b|la|b]|a albla|c|ala

a|b|a alc|a
Contd... n- L _
ettt i 2134

~ 2] |o]o|1]2

Step 8: i=8,q=3 + 1] == T[]


P[q

(if) P[4] ==TI[8]. Match. qg++=4

blalc|b|la|b]|a albla|c|ala

a|b|a alc|a
- i 2134
Contd... ”m-_15 BRI
_ xli] |0]0]1]2

Step 9: i=9,qg=4 Plg+ 1] ==T]i]


(if) P[5] ==TI[9]. Match. g++=5

blalc|b|alb|la|bla|blalc|a]a

albla|blalc|a
n=15 i (11234 6|7
Contd--- m=7 Pli] |a|b|a|b c|a
_ xli] |ojo][1]2]3]o]1

Step10: i=10,q=5 Plg+ 1] ==T]i]


(while) P[6] ==T[10]. Mismatch. q=MN[g]=3

bla|c|b|a|b|la|b|la|b|la|c|a|lal|b

ababaéa

(ify P[4]==T[10]. Match. q++=4


blalc|b|a|lb|la|b|ja|lblalc|la|a]|b
Shifts skipped = 1 and first three I i i E
charactgrs are not compared. slblalolalcla
Comparison starts from P[4].
Contd... n=15 " '

B
- Pli]

SS I o
Step1l: i=11,9=4 P[g + 1] == TJ[i

(if) P[5]==T[11]. Match.

{---1 T
{---1 T
{---4 o

o
o
@
Contd... n=15
_ Pli] |a=

B
SS I o
Step12: i=12,9=5 Plg+1] ==1
(if) P[6]==T[12]. Match.

O
QD
===+

===t

===+
o

o
Q)
— i |1]2|3
Contd... n=1> T REIEIE

B
SS I o
Step13: i=13,9g=6
(if) P[7]==T[13]. Match. q++=7
bla|c|b|al|b d d b a

{---4i T
i i i
1 1 1
1 1 1
d d b d

o
* g==m. Yes.
— Pattern occurs with s hifti—m=13-7=6.

—q="N[q]
= 1.
— i 2 4 6|7
Contd--- n__15 Pli] |a|b|a|b c|a
m=7 :
x[li] 10|01 ]|2|3|0|1

Stepl14: i=14,q=1 Plg+ 1] ==T]i]


(while) P[2] ==T[14]. Mismatch. g=M[q]=0

blalc|blalbla|bla|blaljc|alal|b

Shiftsskipped:SandfirstchIaracterisnot ; b alblalcla
compared. Comparison starts from P[2].

(if) P[1]==T[14]. Match. g++=1


blajc|b|a|bla|b|a|b|alc|alal|b
RO
m=7 :
m[i]

Step15: i=15,qg=1 Plq + 1]


(if) P[2]==T[15]. Match.

bla|c|b|a|bla|b|la|lb|a|c|a]a

d
Step16: i=16
Loop terminates.

You might also like