String Matching Algo (1)
[Figure: the pattern P = abaa tried against the text at shifts 0 through 8; every shift is marked as a mismatch except shift 3, which matches.]
text T: abcabaabcabac
s = 3
pattern P: abaa
if P[1..m] == T[s+1..s+m]
print "Pattern occurs with shift" s
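The test above, repeated for every shift, gives the naive matcher. The following is a minimal Python sketch of that idea, not the slides' own code; the function name `naive_match` is made up for illustration, shifts are 0-based as in the slides, and the text is the one read off the figure.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    for s in range(n - m + 1):
        # Python slices are 0-based, so text[s:s+m] plays the role of T[s+1..s+m].
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("abcabaabcabac", "abaa"))  # [3]
```

Each of the n - m + 1 shifts may cost up to m character comparisons, which is the cost the later algorithms try to reduce.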
[Figure: a pattern slid along the text acaabc at successive shifts.]
Rabin-Karp Algorithm
* Uses elementary number-theoretic notions.
—Equivalence of two numbers modulo a third
number.
* Let Σ = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}.
String of k consecutive characters represents a
length-k decimal number.
—Thus, character string 31415 corresponds to the
decimal number 31,415.
* Note:
—In the general case, each character is a digit in
radix-d notation, where d = |Σ|.
Example
Pattern P = 31415, and 31415 mod 13 = 7.
Text T = 2359023141526739921 (positions 1 to 19).
Value of each length-5 window mod 13, for shifts s = 0 to 14:
8, 9, 3, 11, 0, 1, 7, 8, 4, 5, 10, 11, 7, 9, 11
The window at s = 6 (31415) is a valid match; the window at s = 12 (67399) is a spurious hit.
A few calculations...
p = P[m] + d(P[m-1] + d(P[m-2] + ... + d(P[2] + d P[1]) ...)).
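The nested form above is Horner's rule. As a rough sketch (the helper name `horner_mod` is an assumption for illustration), it can be evaluated left to right while reducing modulo q at every step, so the intermediate values never grow large:

```python
def horner_mod(digits: str, d: int, q: int) -> int:
    """Evaluate a digit string as a radix-d number, reduced modulo q."""
    value = 0
    for ch in digits:
        # Shift the running value one radix-d place and add the next digit.
        value = (d * value + int(ch)) % q
    return value


p = horner_mod("31415", 10, 13)   # pattern from the example
t0 = horner_mod("23590", 10, 13)  # first length-5 window of the text
print(p, t0)  # 7 8
```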
Step 1: s = 0, t_0 = 8, p = 7.
p == t_0? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_1 = 10 (8 - 3(2)) + 2 (mod 13)
t_1 = 10 (2) + 2 (mod 13) = 22 (mod 13) = 9.
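The update used in this step can be written as a one-line function. This is only a sketch: the name `roll` and its parameter names are illustrative, and the defaults d = 10, q = 13, h = 3 are the values of this particular example.

```python
def roll(t_s: int, leaving: int, entering: int,
         d: int = 10, q: int = 13, h: int = 3) -> int:
    """Compute t_{s+1} from t_s by dropping T[s+1] and appending T[s+m+1]."""
    # Python's % always returns a value in 0..q-1 for positive q,
    # so a negative intermediate such as 10 * (-12) needs no special handling.
    return (d * (t_s - h * leaving) + entering) % q


print(roll(8, 2, 2))  # 9, i.e. t_1 from t_0
print(roll(3, 5, 1))  # 11, a case where t_s - h*T[s+1] is negative
```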
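Index:    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Text (T): 2 3 5 9 0 2 3 1 4 1  5  2  6  7  3  9  9  2  1
Pattern: 31415
Σ = {0, 1, ..., 9}, d = |Σ| = 10, q = 13
* n = 19, m = 5, n - m = 14.
h = 10^(m-1) mod 13 = 10^4 mod 13 = 3.
p = (3 × 10^4 + 1 × 10^3 + 4 × 10^2 + 1 × 10^1 + 5 × 10^0) mod 13 = 7.
t_0 = (2 × 10^4 + 3 × 10^3 + 5 × 10^2 + 9 × 10^1 + 0 × 10^0) mod 13 = 8.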
Step 2: s = 1, t_1 = 9, p = 7.
p == t_1? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_2 = 10 (9 - 3(3)) + 3 (mod 13)
t_2 = 10 (0) + 3 (mod 13) = 3.
Step 3: s = 2, t_2 = 3, p = 7.
p == t_2? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_3 = 10 (3 - 3(5)) + 1 (mod 13) = 10 (-12) + 1 (mod 13)
Because -12 mod 13 = 1, t_3 = 10 (1) + 1 (mod 13) = 11.
Step 4: s = 3, t_3 = 11, p = 7.
p == t_3? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_4 = 10 (11 - 3(9)) + 4 (mod 13) = 10 (-16) + 4 (mod 13)
Because -16 mod 13 = 10, t_4 = 10 (10) + 4 (mod 13) = 104 (mod 13) = 0.
Step 5: s = 4, t_4 = 0, p = 7.
p == t_4? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_5 = 10 (0 - 3(0)) + 1 (mod 13)
t_5 = 1 (mod 13) = 1.
Step 7: s = 6, t_6 = 7, p = 7.
p == t_6? Yes.
Character by character matching P[1..5] == T[7..11].
{31415} == {31415}. Match, hence s = 6 is a valid shift.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_7 = 10 (7 - 3(3)) + 2 (mod 13) = 10 (-2) + 2 (mod 13)
Because -2 mod 13 = 11, t_7 = 10 (11) + 2 (mod 13) = 112 (mod 13) = 8.
Step 8: s = 7, t_7 = 8, p = 7.
p == t_7? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_8 = 10 (8 - 3(1)) + 6 (mod 13)
t_8 = 10 (5) + 6 (mod 13) = 56 (mod 13) = 4.
Step 9: s = 8, t_8 = 4, p = 7.
p == t_8? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_9 = 10 (4 - 3(4)) + 7 (mod 13) = 10 (-8) + 7 (mod 13)
Because -8 mod 13 = 5, t_9 = 10 (5) + 7 (mod 13) = 57 (mod 13) = 5.
Step 11: s = 10, t_10 = 10, p = 7.
p == t_10? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_11 = 10 (10 - 3(5)) + 9 (mod 13) = 10 (-5) + 9 (mod 13)
Because -5 mod 13 = 8, t_11 = 10 (8) + 9 (mod 13) = 89 (mod 13) = 11.
Step 13: s = 12, t_12 = 7, p = 7.
p == t_12? Yes.
Character by character matching P[1..5] == T[13..17].
{31415} == {67399}. Mismatch occurs at the first character, so s = 12 is a spurious hit.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_13 = 10 (7 - 3(6)) + 2 (mod 13) = 10 (-11) + 2 (mod 13)
Because -11 mod 13 = 2, t_13 = 10 (2) + 2 (mod 13) = 22 (mod 13) = 9.
Step 14: s = 13, t_13 = 9, p = 7.
p == t_13? No.
s < 14? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 13)
t_14 = 10 (9 - 3(7)) + 1 (mod 13) = 10 (-12) + 1 (mod 13)
Because -12 mod 13 = 1, t_14 = 10 (1) + 1 (mod 13) = 11.
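The hand-worked steps can be checked mechanically. The sketch below is one possible Python rendering of the Rabin-Karp loop, not the slides' own code; the function name, the `code` parameter (which maps a character to its numeric value), and the three returned lists are illustrative choices.

```python
def rabin_karp(text, pattern, d, q, code=int):
    """Return (valid shifts, spurious hits, window values t_0..t_{n-m})."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)  # d^(m-1) mod q
    p = t = 0
    for i in range(m):  # Horner's rule for p and t_0
        p = (d * p + code(pattern[i])) % q
        t = (d * t + code(text[i])) % q
    valid, spurious, values = [], [], []
    for s in range(n - m + 1):
        values.append(t)
        if p == t:
            # Equal residues only suggest a match; confirm character by character.
            if text[s:s + m] == pattern:
                valid.append(s)
            else:
                spurious.append(s)
        if s < n - m:
            t = (d * (t - h * code(text[s])) + code(text[s + m])) % q
    return valid, spurious, values


valid, spurious, values = rabin_karp("2359023141526739921", "31415", 10, 13)
print(values)    # [8, 9, 3, 11, 0, 1, 7, 8, 4, 5, 10, 11, 7, 9, 11]
print(valid)     # [6]
print(spurious)  # [12]
```

The printed window values agree with t_0 through t_14 as computed by hand, including the valid shift at s = 6 and the spurious hit at s = 12.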
Example 2
Index:    1  2  3  4  5  6  7  8
Text (T): a  a  b  b  c  a  b  a
ASCII:    97 97 98 98 99 97 98 97
Pattern: cab
Σ = {a, b, ..., z}, d = |Σ| = 26, q = 3
n = 8, m = 3, n - m = 5.
h = 26^(m-1) mod 3 = 26^2 mod 3 = 1.
Step 1: s = 0, t_0 = 2, p = 1.
p == t_0? No.
s < 5? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 3)
t_1 = 26 (2 - 1(97)) + 98 (mod 3)
t_1 = 26 (2 - 1(1)) + 2 (mod 3)
(because 97 mod 3 = 1 and 98 mod 3 = 2)
= 26 (1) + 2 (mod 3)
= 28 (mod 3) = 1.
Step 2: s = 1, t_1 = 1, p = 1.
p == t_1? Yes.
Character by character matching P[1..3] == T[2..4].
{cab} == {abb}. Mismatch occurs at the first character, so s = 1 is a spurious hit.
s < 5? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 3)
t_2 = 26 (1 - 1(97)) + 99 (mod 3)
t_2 = 26 (1 - 1(1)) + 0 (mod 3)
(because 97 mod 3 = 1 and 99 mod 3 = 0)
= 26 (0) (mod 3)
= 0.
Step 3: s = 2, t_2 = 0, p = 1.
p == t_2? No.
s < 5? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 3)
t_3 = 26 (0 - 1(98)) + 97 (mod 3)
t_3 = 26 (0 - 1(2)) + 1 (mod 3)
(because 97 mod 3 = 1 and 98 mod 3 = 2)
= 26 (-2) + 1 (mod 3)
= 26 (1) + 1 (mod 3)
(because -2 mod 3 = 1)
= 27 (mod 3) = 0.
Step 4: s = 3, t_3 = 0, p = 1.
p == t_3? No.
s < 5? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 3)
t_4 = 26 (0 - 1(98)) + 98 (mod 3)
t_4 = 26 (0 - 1(2)) + 2 (mod 3)
(because 98 mod 3 = 2)
= 26 (-2) + 2 (mod 3)
= 26 (1) + 2 (mod 3)
(because -2 mod 3 = 1)
= 28 (mod 3) = 1.
Step 5: s = 4, t_4 = 1, p = 1.
p == t_4? Yes.
Character by character matching P[1..3] == T[5..7].
{cab} == {cab}. Match, hence s = 4 is a valid shift.
s < 5? Yes.
t_{s+1} = d(t_s - h T[s+1]) + T[s+m+1] (mod 3)
t_5 = 26 (1 - 1(99)) + 97 (mod 3)
t_5 = 26 (1 - 1(0)) + 1 (mod 3)
(because 97 mod 3 = 1 and 99 mod 3 = 0)
= 26 (1) + 1 (mod 3)
= 27 (mod 3) = 0.
Step 6: s = 5, t_5 = 0, p = 1.
p == t_5? No.
s < 5? No.
Step 7: s = 6.
Loop terminates.
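As a cross-check on the rolling arithmetic, each window value can also be recomputed from scratch. The sketch below does that for Example 2, using ASCII codes as digit values with d = 26 and q = 3 as on the slides; the names `window_hash` and `classify` are illustrative.

```python
def window_hash(window: str, d: int = 26, q: int = 3) -> int:
    """Hash a window from scratch, using ASCII codes as digit values."""
    value = 0
    for ch in window:
        value = (d * value + ord(ch)) % q
    return value


def classify(text: str, pattern: str) -> list[tuple[int, int, str]]:
    """For each shift s, report (s, t_s, 'valid' | 'spurious' | '-')."""
    m, p = len(pattern), window_hash(pattern)
    rows = []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        t = window_hash(window)
        if t != p:
            kind = "-"
        elif window == pattern:
            kind = "valid"
        else:
            kind = "spurious"
        rows.append((s, t, kind))
    return rows


for row in classify("aabbcaba", "cab"):
    print(row)
# (0, 2, '-'), (1, 1, 'spurious'), (2, 0, '-'),
# (3, 0, '-'), (4, 1, 'valid'), (5, 0, '-')
```

With q as small as 3, a third of random windows would be expected to collide with p, which is why a spurious hit shows up even in this eight-character text.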
Complexity
* Takes Θ(m) preprocessing time.
* Worst-case running time is O((n - m + 1) m).
- Example: P = a^m and T = a^n; each of the n - m + 1 possible shifts is valid.
* In many applications, there are only a few valid shifts (say some constant c). In such applications, the expected matching time is only O((n - m + 1) + cm) = O(n + m), plus the time required to process spurious hits.
* Probabilistic analysis
- The probability of a false positive hit for a random input is 1/q.
- The expected number of false positive hits is O(n/q).
- The expected run time is O(n) + O(m(v + n/q)), where v is the number of valid shifts.
* Choosing q ≥ m and having only a constant number of valid shifts (v = O(1)), the expected matching time is O(n + m).
* Since m ≤ n, this expected matching time is O(n).
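The worst case can be made concrete by counting character comparisons during verification. The sketch below is an instrumented variant written for illustration only: the function name is invented, and the defaults d = 256 and q = 101 are arbitrary assumptions, not values from the slides.

```python
def rabin_karp_comparisons(text, pattern, d=256, q=101):
    """Return (valid shifts, number of character comparisons in verification)."""
    n, m = len(text), len(pattern)
    h = pow(d, m - 1, q)
    p = t = 0
    for i in range(m):
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    comparisons, shifts = 0, []
    for s in range(n - m + 1):
        if p == t:
            j = 0
            while j < m:  # verify the hit one character at a time
                comparisons += 1
                if text[s + j] != pattern[j]:
                    break
                j += 1
            if j == m:
                shifts.append(s)
        if s < n - m:
            t = (d * (t - h * ord(text[s])) + ord(text[s + m])) % q
    return shifts, comparisons


n, m = 20, 5
shifts, comparisons = rabin_karp_comparisons("a" * n, "a" * m)
print(len(shifts), comparisons)  # 16 80, i.e. n - m + 1 shifts and m(n - m + 1) comparisons
```

With P = a^m and T = a^n every window hashes to the same value as the pattern, so every shift is verified in full and the count reaches m(n - m + 1) regardless of the choice of q.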
Knuth-Morris-Pratt Algorithm
* Based on the concept of prefix function for a
pattern.
— Encapsulates knowledge about how the pattern
matches against shifts of itself.
—This information can be used to avoid testing of
invalid shifts.
* T: bacbababaabcbab
* P: ababaca
[Figure: P aligned under T at shift s = 4, where the first q = 5 pattern characters match; the next shift worth testing is s' = s + 2.]
* Given that pattern characters P [1..q] match text
characters T [s + 1..s + q], what is the least shift
s' > s such that for some k < q,
P[1..k] = T[s'+1..s'+k], where s' + k = s + q?
i.e., s' = s + (q - k)
* Best case k = 0. Skips (q - 1) shifts.
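The question posed above can be answered by brute force straight from its definition. The sketch below is for illustration only (the function name `next_shift` is made up, and a practical matcher would use the prefix function instead of rescanning the text); it uses the text and pattern of this slide with s = 4 and q = 5.

```python
def next_shift(text: str, pattern: str, s: int, q: int) -> tuple[int, int]:
    """Least s' > s with P[1..k] == T[s'+1..s'+k] and s' + k == s + q.

    Assumes pattern[:q] already matches text[s:s+q]. Returns (s', k).
    """
    for s_new in range(s + 1, s + q + 1):
        k = s + q - s_new
        # 0-based slices: pattern[:k] is P[1..k], text[s_new:s_new+k] is T[s'+1..s'+k].
        if pattern[:k] == text[s_new:s_new + k]:
            return s_new, k
    raise ValueError("unreachable for q >= 1: k = 0 always matches at s' = s + q")


print(next_shift("bacbababaabcbab", "ababaca", 4, 5))  # (6, 3)
```

Here k = 3, so s' = s + (q - k) = 4 + 2 = 6, and the shift s + 1 = 5 is skipped without being tested.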
Note: P_k = P[1..k]. Similarly, P_q = P[1..q].
* In other words, knowing that P_q is a suffix of T_{s+q}, find the longest proper prefix P_k of P_q that is also a suffix of T_{s+q}.
[Figure: T = bacbababaabcbab with P aligned under it; the q matched characters are marked, and the prefix P_k = aba is shown beneath them.]
Prefix Function
P: ababaca
q: 1 2 3 4
k: 0 0 1 2
P_q = P[1..q]
P_1 = {a}. P_k = {}. No matching prefix and suffix of P_1.
P_2 = {ab}. P_k = {}. No matching prefix and suffix of P_2.
P_3 = {aba}. P_k = {a}.
P_4 = {abab}. P_k = {ab}. (The length-3 candidates, prefix aba and suffix bab, do not match.)
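* P: ababaca
  q: 1 2 3 4 5 6 7
  k: 0 0 1 2 3 0 1
* P_q = P[1..q]
* P_5 = {ababa}. P_k = {aba}.
* P_7 = {ababaca}. P_k = {a}.
Keep the longest.

Candidates for P_5 = ababa:
Prefix   Suffix   k   Match?
a        a        1   yes
ab       ba       2   no
aba      aba      3   yes
abab     baba     4   no

Candidates for P_7 = ababaca:
Prefix   Suffix   k   Match?
a        a        1   yes
ab       ca       2   no
aba      aca      3   no
abab     baca     4   no
ababa    abaca    5   no
ababac   babaca   6   no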
* P_6 = {ababac}. P_k = {}. No matching prefix and suffix of P_6, as 'c' does not appear in any proper prefix of P_6.
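The enumeration of prefixes and suffixes above translates directly into a brute-force computation. This is a sketch of the definition only (the function name is invented, and the approach is deliberately slow, cubic in the worst case); it keeps the longest matching length for each P_q.

```python
def prefix_function_bruteforce(pattern: str) -> list[int]:
    """For each q = 1..m, the longest proper prefix of P_q that is also its suffix."""
    pi = []
    for q in range(1, len(pattern) + 1):
        p_q = pattern[:q]
        k = 0
        for length in range(1, q):  # proper prefixes only, so length < q
            if p_q[:length] == p_q[-length:]:
                k = length  # later (longer) matches overwrite shorter ones
        pi.append(k)
    return pi


print(prefix_function_bruteforce("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```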
Prefix Function
q → i:     1 2 3 4 5 6 7
P[i]:      a b a b a c a
k → π[i]:  0 0 1 2 3 0 1
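The table can also be filled in a single left-to-right pass by reusing entries already computed. The following is a sketch of the standard linear-time construction, written with 0-based Python indices, so `pi[q - 1]` here corresponds to π[q] in the table; the function name is illustrative.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    m = len(pattern)
    pi = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        # Fall back through shorter borders until the next character extends one.
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

The fallback loop only ever decreases k, and k increases by at most one per iteration, which is the usual argument for the Θ(m) running time.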
(Matching)
* T: bacbabababacaab
* P: ababaca
n = 15, m = 7
Step 5: i = 5, q = 0.
[Figure: P aligned under T at successive positions, next to the table of i, P[i], and π[i].]
Step 11: i = 11, q = 4. Test P[q+1] == T[i].
(if) P[5] == T[11]. Match.
Step 12: i = 12, q = 5. Test P[q+1] == T[i].
(if) P[6] == T[12]. Match.
Step 13: i = 13, q = 6.
(if) P[7] == T[13]. Match. q++ = 7.
* q == m. Yes.
- Pattern occurs with shift i - m = 13 - 7 = 6.
- q = π[q] = π[7] = 1.
Shifts skipped: 5, and the first character is not compared. Comparison starts from P[2].
Step 16: i = 16.
Loop terminates.
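The trace above follows the usual KMP matching loop. Below is a self-contained Python sketch of that loop, offered as an illustration rather than the slides' own code; indices are 0-based, so the loop variable `i` is one less than the slide's i, and the function names are assumptions.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """pi[i] = longest proper prefix of pattern[:i+1] that is also its suffix."""
    pi = [0] * len(pattern)
    k = 0
    for q in range(1, len(pattern)):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_match(text: str, pattern: str) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return []
    pi = compute_prefix_function(pattern)
    shifts = []
    q = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]  # slide the pattern; no text character is re-read
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i + 1 - m)  # the slide's i - m, with 1-based i
            q = pi[q - 1]             # keep the matched border and continue
    return shifts


print(kmp_match("bacbabababacaab", "ababaca"))  # [6]
```

After the match at shift 6 the loop continues with q = π[7] = 1, which is the state the trace describes when it resumes comparison at P[2].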