A Transformer-Based Framework
for Scene Text Recognition
PRABU SELVAM1 , JOSEPH ABRAHAM SUNDAR KOILRAJ1 ,
CARLOS ANDRÉS TAVERA ROMERO 2 , (Member, IEEE),
MESHAL ALHARBI 3 , ABOLFAZL MEHBODNIYA 4 , (Senior Member, IEEE),
JULIAN L. WEBBER 4 , (Senior Member, IEEE),
AND SUDHAKAR SENGAN 5 , (Member, IEEE)
1 School of Computing, SASTRA Deemed University, Thanjavur, Tamil Nadu 613401, India
2 COMBA R&D Laboratory, Faculty of Engineering, Universidad Santiago de Cali, Cali 76001, Colombia
3 Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
4 Department of Electronics and Communications Engineering, Kuwait College of Science and Technology, Safat 20185145, Kuwait
5 Department of Computer Science and Engineering, PSN College of Engineering and Technology, Tirunelveli, Tamil Nadu 627152, India
ABSTRACT Scene Text Recognition (STR) has become a popular and long-standing research problem in computer vision communities. Almost all the existing approaches mainly adopt the connectionist temporal classification (CTC) technique. However, these existing approaches are not very effective for irregular STR. In this research article, we introduce a new encoder-decoder framework to identify both regular and irregular natural scene text, which is developed based on the transformer framework. The proposed framework is divided into four main modules: Image Transformation, Visual Feature Extraction (VFE), Encoder and Decoder. Firstly, we employ a Thin Plate Spline (TPS) transformation in the image transformation module to normalize the original input image and reduce the burden of subsequent feature extraction. Secondly, in the VFE module, we use ResNet as the Convolutional Neural Network (CNN) backbone to retrieve text image feature maps from the rectified word image. However, the VFE module generates one-dimensional feature maps that are not suitable for locating multi-oriented text on two-dimensional word images. We propose 2D Positional Encoding (2DPE) to preserve the sequential information. Thirdly, feature aggregation and feature transformation are carried out simultaneously in the encoder module. We replace the original scaled dot-product attention model of the standard transformer framework with an Optimal Adaptive Threshold-based Self-Attention (OATSA) model to filter noisy information effectively and focus on the most contributive text regions. Finally, we introduce a new architectural-level bi-directional decoding approach in the decoder module to generate a more accurate character sequence. Eventually, we evaluate the effectiveness and robustness of the proposed framework in both horizontal and arbitrary text recognition through extensive experiments on seven public benchmarks, including the IIIT5K-Words, SVT, ICDAR 2003, ICDAR 2013, ICDAR 2015, SVT-P and CUTE80 datasets. We also demonstrate that our proposed framework outperforms most of the existing approaches by a substantial margin.

INDEX TERMS Connectionist temporal classification, scene text recognition, self-attention, transformer, optical character recognition, deep learning.
I. INTRODUCTION
Text information disclosed in natural scene images is vital for visual interpretation and conceptual understanding, and is useful in numerous deep learning systems [2], including robot navigation, industrial automation and driver assistance. The research community has therefore paid special attention to STR. Despite decades of research into Optical Character Recognition (OCR) [3], text recognition
from natural scene images remains challenging due to a variety of factors, for example, huge variation in text font and text colour, low contrast, complex backgrounds, perspective distortion, occlusion, uneven lighting and so on. A wide variety of STR approaches has been discussed in the literature with significant success in recent years, benefitting from the emergence of deep learning. Traditional approaches recognized text from scene images by first locating individual characters and then using a CNN to recognize each cropped character [4]. However, a great volume of inter-character and intra-character conflict will effectively decrease the recognition network's performance. These methods rely largely on a robust character detector. An attentional Recurrent Neural Network (RNN) can effectively handle the sequence-to-sequence (seq2seq) problem of recognizing regular text, whereas recognizing arbitrarily shaped text is more challenging for an RNN model.

Conventional irregular STR methods can be divided into four main categories: text shape rectification-based approaches [4], [5], Connectionist Temporal Classification (CTC) based approaches [6], [7], multi-direction encoding-based approaches [8], [9] and attention-based approaches [10], [11], [12]. Shape rectification is used to normalize the word image, eliminate distortion and make irregular text recognition simpler. Methods based on shape rectification [5] attempt the transformation of irregular text into regular text before using regular text recognizers. A Spatial Transformer Network (STN) [13] was the first text normalization network; it was utilized to rectify individual character regions and complete word images. Later, Shi et al. [4] incorporated a Thin-Plate Spline (TPS) transformation to handle more complex text distortions. Sophisticated rectification networks can correct such distortions, and they are becoming a new trend. However, these factors have an impact on the performance and memory use of recognition algorithms. CTC has made substantial improvements in a variety of areas, including voice recognition and web handwritten character identification. CTC is a widely used prediction algorithm in STR. Yu et al. [14] and Shi et al. [7] were among the first to apply CTC to STR, inspired by its achievement in speech recognition. Luo et al. [15] used a CTC-based prediction algorithm for model learning and showed superior performance. Despite its remarkable performance in STR, CTC has certain shortcomings: CTC's underlying approach is complex, CTC exhibits peaky distribution issues, and CTC is ineffective for two-dimensional prediction tasks like irregular STR.

Attention-based prediction methods have surpassed CTC in decoding in recent years, because of their capability to attend to the appropriate place. RNNs consistently use the attention mechanism in their prediction module for the STR problem. The input text image pattern, the output text sequence pattern and their difference are primarily learned by the attention mechanism by examining the experience of the final output. A variety of attention-based approaches have evolved in the STR field, influenced by the growth of machine translation frameworks. There are various flaws in the attention mechanism: the technique needs additional storage space and computation power, it suffers from the attention drift problem, and the latest attention mechanism research is primarily focused on languages with just a few character groups (e.g., English, French).

The Transformer [16], a modern attention alternative, has been extensively used to increase parallelization and minimize complexity for STR. Some efforts were made to replace recurrent neural networks with non-recurrent structures in the domain of regular text recognition, such as convolution-based and attention-based approaches. Nevertheless, both approaches depend on seq2seq structures, which are inadequate for handling arbitrarily shaped text. It is worth mentioning that the Transformer was originally designed for language translation tasks such as English to French, French to English and so on. It takes one-dimensional sequences as its input. Since position information is not naturally encoded within the input set of sequences, the model is less sensitive to the positioning of input sequences than RNN and LSTM frameworks with associative bias. The Transformer is permutation equivariant, because its Self-Attention (SA) and Feed-Forward Network (FFN) layers calculate the result of each component in the input sequence separately. While the 1D Positional Encoding (PE) approach employed in the Transformer may handle the permutation equivariance issue that arises in 1D sequences associated with NLP, it cannot preserve the horizontal and vertical features produced by CNNs for 2D images.

In short, the primary contributions of our research work are as follows:
• The Transformer [16] in Natural Language Processing (NLP) takes only 1D sequences as its input. On the other hand, scene text recognizers have to handle 2D images. To solve this permutation equivariance problem and to preserve the order of sequential information, we modify the conventional transformer architecture to recognize texts in the scene image. Here, we introduce a novel mechanism to convert the spatial encoder from 1D to 2D, by expanding the standard transformer's 1D Positional Encoding (1DPE) to 2D Positional Encoding (2DPE).
• Input word images in natural scenes take different forms, including curved and skewed texts. If such input word images are transmitted unchanged, the feature extraction step must learn an invariant representation for such geometry. To eliminate distortion on the input word images and make text recognition easier, we employ a thin-plate spline (TPS) transformation [4]. The rectified or normalized images enhance text recognition accuracy, particularly for datasets with a majority of arbitrary texts and perspectively distorted texts. TPS can be selected or deselected in our framework.
• We propose a new mechanism called Optimal Adaptive Threshold-based Self-Attention (OATSA) that explicitly ignores the least contributive components in the attention matrix to limit the extraction of irrelevant elements and improve attention focus. The OATSA approach can efficiently filter noisy information and generate a more accurate word.
• We introduce a unified bi-directional decoder architecture integrated into the modified Transformer that works in both right-to-left (R2L) and left-to-right (L2R) directions for STR to obtain more robust character sequences. The bidirectional decoding method, which uses two distinct decoders, outperforms the stand-alone decoders. We accomplish this by executing bidirectional decoding at the architectural level, rather than at the input level, as was done in earlier research works.
• The proposed framework is a non-recurrent network, trained in an end-to-end manner. This framework can be trained concurrently without employing any RNN modules. The proposed framework achieves comparable or superior performance compared with most existing methods on both horizontal and arbitrarily shaped scene text benchmarks, for example 97.1% on ICDAR03, 98.0% on ICDAR13, 87.2% on ICDAR15, and 89.3% on CUTE80.

II. RELATED WORK
In the past decade, there has been a growing interest in natural STR in the computer vision community, which differs from classical handwritten character and text recognition in terms of characteristics and difficulties. Extensive studies can be found in [1], [2], and [3].

Traditional STR systems used text detectors to extract various candidate character positions and then used a character classifier to recognize the characters. These traditional methods depend on low-level features for STR, including the stroke width transform [17], connected components [18], Histogram of Oriented Gradients (HOG) descriptors [19] and so on. Wang et al. [20] used HOG descriptors to train a character recognizer, which then used a sliding window to recognize characters in cropped word images. Seok and Kim [21] represent the target character set as an Implicit Shape Model (ISM) to achieve robustness on the character set. A Hough forest is trained to localize and group the character candidates, and a semi-Markov conditional random field is used to score the text candidates. However, the low capability of hand-crafted features limits the performance of traditional text recognition systems. Yao et al. [22] introduced a new way of reliably identifying individual characters called Strokelets, providing a histogram feature for recognizing character components in natural scene images.

Several researchers have started using deep learning models for STR as a result of the fast growth of neural networks. For a 90k-word classification, Jaderberg et al. [23] designed a CNN classifier composed of four convolutional layers and two fully connected layers used to retrieve high-level feature maps from the text image. This technique, however, was confined to a pre-defined lexicon. To get around this constraint, several authors have recently considered STR as a sequence translation task. Shi et al. [7] introduced a new unified deep neural network architecture named Convolutional Recurrent Neural Network (CRNN), which combines the functions of both CNNs and Recurrent Neural Networks (RNNs). CRNN can handle input images of various dimensions and generate predictions of different lengths. CRNN is capable of handling random strings (e.g. phone numbers), sentences, and other scripts such as Chinese words, and is not confined to recognizing words from a known dictionary. Since the spatial dependencies between local image patches in a CNN are not explored and utilized, Shi et al. [5] specially designed a new deep neural network that has both convolutional and recurrent layers to transform irregular word images into more readable regular word images via an STN. This network generates a sequence of feature vectors for any arbitrary-size input word image. Finally, an attention-based sequence recognizer is used to generate a character sequence. Lin et al. [24] designed a special network by combining a sequential transformation network, used to rectify irregular text by dividing the complex transformation step into multiple basic transformation steps, and an attention-based recognition network.

Recent deep networks can develop robust representations that are tolerant to imaging distortions and changes in text style, but they still have issues handling scene texts including viewpoint and curvature distortions. To deal with such issues, Zhan and Lu [25] established an end-to-end STR network called ESIR, which reduces viewpoint distortion and text line curvature iteratively and thereby improves the performance of STR systems. The pose of text lines in scenes is estimated using an innovative rectification network that introduces a different line fitting transformation. In addition, an iterative rectification mechanism is created, which corrects scene text distortions in a fronto-parallel perspective. Litman et al. [26] presented a new encoder-decoder architecture named Selective Context ATtentional Text Recognizer (SCATTER) for predicting character sequences against complicated image backgrounds. A deep Bi-LSTM encoder is designed for encoding contextual dependencies. A two-step 1D attention method is used to decode the character sequence. Instead of rectifying the complete text image, Liu et al. [27] suggested using a Character-Level Encoder (CLE) to identify and rectify specific characters in the word image. The arbitrary orientation network (AON) was developed by Cheng et al. [28] to directly capture the deep feature representations of irregular texts in four directions along with character location clues. A filter gate mechanism was designed to integrate the four-direction character sequences of features, and an attention-based decoder was employed to generate
character sequences. To tackle the ''attention drift'' problem, Cheng et al. [29] proposed a Focusing Attention Network (FAN) method comprised of two main components: an attention network (AN) and a focusing network (FN). The AN is designed to recognize character targets and the FN is designed to evaluate and adjust the attention by properly attending to the character target locations in the word images. A Multi-Branch Guided Attention Network (MBGAN) was proposed by Wang and Liu [30] to acquire invariant semantic features. [...] Dong et al. [32] adapted the Transformer architecture for the voice recognition problem. Similarly, Yu et al. [33] developed a network for a reading comprehension problem by integrating internal convolution layers with a universal self-attention module. Both these models were inspired by the Transformer framework. Dehghani et al. [34] recently expanded the Transformer architecture by developing a new model called the ''Universal Transformer'' to handle string copying and other rational interpretation with string lengths that are longer than those seen during training. There have also been numerous attempts to interpret scene text without the use of recurrent networks. Based on the Transformer model, Chen et al. [35] developed a new non-recurrent seq2seq framework for STR, which includes a self-attention block functioning as a fundamental component in both the encoder and decoder architecture. Yang et al. [36] proposed an STR model that is simple yet powerful, based on holistic representation-guided attention. An attention-based sequence decoder is linked directly to two-dimensional CNN features. The holistic depiction may steer the attention-based decoder to more precisely concentrate on text regions. Because of their inherent model architecture, all of these existing approaches are mostly focused on regular STR, and they find it hard to recognize irregular text.

In contrast with the convolution network and attention mechanism, we propose a simple but powerful STR model with an Optimal Adaptive Threshold-based Self-Attention (OATSA) mechanism in this paper. This method directly maps word images into character sequences, and it also works well on both horizontal and arbitrarily shaped scene text images.

III. PROPOSED FRAMEWORK
FIGURE 1. Illustration of the overall pipeline of our proposed framework.
[...]

A. IMAGE TRANSFORMATION
Word images in natural scenes rarely include tightly confined regular text. This encourages us to do a spatial transformation before recognition to convert input images into ones that recognizers can read easily. We employ a TPS transformation to transform an input text image (I) into a normalized image (I′) as shown in Fig. 2. Text images come in various shapes, for example tilted, perspective and curved texts. Such complex-shaped text images force the feature extraction steps to learn an invariant representation concerning such geometry. The TPS rectification algorithm is an alternative to the STN, which has been used for various aspect ratios of text lines to reduce this complexity. TPS interpolates between a collection of fiducial points using a smooth spline. TPS identifies several fiducial points at the upper and lower enveloping points, and then normalizes the character region to a predefined rectangle.
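To make the rectification step concrete, the sketch below shows a minimal spatial-transformer-style rectification module in PyTorch. It is an illustrative assumption rather than the paper's exact implementation: a small localization CNN predicts transformation parameters and the input word image is resampled with grid_sample. For brevity it predicts an affine grid as a simplified stand-in for the full thin-plate-spline grid built from fiducial points; the module names and the 32×100 output size are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RectificationSketch(nn.Module):
    """Simplified STN-style rectifier (affine stand-in for the TPS grid)."""
    def __init__(self, out_h: int = 32, out_w: int = 100):
        super().__init__()
        self.out_size = (out_h, out_w)
        # Small localization network that looks at the distorted word image.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 6),  # 6 affine parameters (a TPS version would predict 2K fiducial points)
        )
        # Initialize to the identity transform so training starts from "no rectification".
        self.loc[-1].weight.data.zero_()
        self.loc[-1].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        theta = self.loc(x).view(-1, 2, 3)                    # (B, 2, 3) transformation matrices
        grid = F.affine_grid(theta, (x.size(0), x.size(1), *self.out_size),
                             align_corners=False)             # sampling grid over the output image
        return F.grid_sample(x, grid, align_corners=False)    # rectified image I'

# Usage: rectified = RectificationSketch()(torch.rand(2, 3, 64, 256))
```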
B. VISUAL FEATURE EXTRACTION (VFE) AND 2D POSITIONAL ENCODING (2DPE)
According to recent research, ResNet [37] is gaining a lot of traction in the research community because of its unique features of dealing with the overfitting and vanishing-gradient issues, parameter efficiency, capturing well-defined feature representations and introducing the ''identity shortcut connection''. Therefore, we use ResNet as the CNN backbone for the VFE.

Table 1 illustrates the 50-layer residual network configuration, as used in [10], that we utilize for our feature extraction stage. The CNN processes the rectified image (I′) to extract a compact feature representation F ∈ R^{W×H×c}, where W, H and c represent the rectified image's width, height and number of channels, respectively. To reduce the encoding stage's computational cost, we use a 1×1 convolution to reduce the feature map channels, yielding F′ ∈ R^{W×H×d}, where d < c.
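A minimal sketch of this VFE stage is given below, assuming a torchvision ResNet-50 trunk truncated before its global pooling and a 1×1 convolution for channel reduction. The layer choices and the reduced dimension d = 512 are illustrative, not the paper's exact configuration (the paper uses a modified 50-layer ResNet tailored to text images).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VisualFeatureExtractor(nn.Module):
    """ResNet-50 trunk followed by a 1x1 conv that reduces c=2048 channels to d."""
    def __init__(self, d: int = 512):
        super().__init__()
        trunk = resnet50(weights=None)
        # Keep everything up to the last residual stage; drop avgpool and fc.
        self.backbone = nn.Sequential(*list(trunk.children())[:-2])
        self.reduce = nn.Conv2d(2048, d, kernel_size=1)   # F -> F' with d < c

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(img)     # (B, 2048, H, W): compact feature map F
        return self.reduce(feat)      # (B, d, H, W): reduced feature map F'

# Example: a 32x100 rectified word image gives a small 2D feature map.
fmap = VisualFeatureExtractor()(torch.rand(2, 3, 32, 100))
print(fmap.shape)  # torch.Size([2, 512, 1, 4]) with the default strides
```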
Conventional Transformers lack recurrence and convolution layers, so it is important to provide absolute or relative location information of the characters in the word or sentence. [...] By defining the position code P(·) as given in Eq. (1) – Eq. (4), we generalize the original 1D positional encoding over the feature map of dimension H×W×d′, where pos(hor) and pos(ver) represent the horizontal and vertical positions respectively, fre_i, fre_j ∈ R are the 2D positional encoding signal's learnable frequencies, ch represents the number of channels in F′ and i, j ∈ [0, d/4]. The position code (P) and the 2D feature map (F′) are combined so that the model can notice each character's positional information; the 2D encoding map P added to F′ is represented by F″ = F′ + P. The Transformer's encoder only takes a set of vectors as input. Hence, it is necessary to vectorize the d channels of F″ and stack them together to create a single feature matrix M, as shown in Eq. (5). The function Mat2Vec is used to convert each feature map channel into a vector, represented by x_{i,j} = Mat2Vec(F″(:, :, j)) ∈ R^{1×HW}.
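Since Eq. (1)–(4) are not reproduced in this extract, the sketch below uses the widely used fixed 2D sinusoidal formulation (half of the channels encode the horizontal coordinate, half the vertical) as a stand-in for the paper's learnable-frequency variant, and then flattens the encoded map into the vector sequence M expected by the encoder. Function names and sizes are illustrative assumptions.

```python
import math
import torch

def positional_encoding_2d(d: int, h: int, w: int) -> torch.Tensor:
    """Fixed 2D sinusoidal position code P of shape (d, h, w); d must be divisible by 4."""
    pe = torch.zeros(d, h, w)
    half = d // 2
    div = torch.exp(torch.arange(0, half, 2) * (-math.log(10000.0) / half))  # frequencies
    pos_w = torch.arange(w).float().unsqueeze(1)   # horizontal positions
    pos_h = torch.arange(h).float().unsqueeze(1)   # vertical positions
    # First half of the channels encodes the horizontal coordinate ...
    pe[0:half:2] = torch.sin(pos_w * div).t().unsqueeze(1).expand(-1, h, -1)
    pe[1:half:2] = torch.cos(pos_w * div).t().unsqueeze(1).expand(-1, h, -1)
    # ... second half encodes the vertical coordinate.
    pe[half::2] = torch.sin(pos_h * div).t().unsqueeze(2).expand(-1, -1, w)
    pe[half + 1::2] = torch.cos(pos_h * div).t().unsqueeze(2).expand(-1, -1, w)
    return pe

def mat2vec(fmap: torch.Tensor) -> torch.Tensor:
    """Add P to F' and flatten the H x W grid into a sequence M of HW vectors."""
    b, d, h, w = fmap.shape
    f2 = fmap + positional_encoding_2d(d, h, w).to(fmap)   # F'' = F' + P
    return f2.flatten(2).transpose(1, 2)                   # (B, HW, d)

M = mat2vec(torch.rand(2, 512, 8, 25))
print(M.shape)  # torch.Size([2, 200, 512])
```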
C. ENCODER
The encoder is composed of two important layers: Multi-Head Self-Attention (MHSA) and Feed-Forward Network Layers (FFNL). An encoder captures a set of vectors (M) as its input and processes these vectors by passing them through the self-attention layer and the FFN layer in order. Then the output from the FFN layer is sent to the next encoder. Around each of the two layers, we use a residual connection followed by layer normalization, as shown in Fig. 3, i.e., LayerNorm(x + Sublayer(x)), where the function Sublayer(x) is implemented by the sublayers themselves. All sublayers in the model, as well as the embedding layers, generate results with dimension 512 to facilitate these residual connections.
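A minimal PyTorch rendering of one such encoder layer is shown below, using the post-norm arrangement LayerNorm(x + Sublayer(x)) described above. The hidden sizes are illustrative, and the generic nn.MultiheadAttention is used for brevity; in the paper's model the attention step is replaced by OATSA, sketched later.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """Post-norm Transformer encoder layer: MHSA and FFN, each wrapped as LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model: int = 512, n_heads: int = 16, d_ff: int = 2048):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, m: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.mhsa(m, m, m)          # self-attention over the HW feature vectors
        m = self.norm1(m + attn_out)              # residual connection + layer normalization
        m = self.norm2(m + self.ffn(m))           # position-wise feed-forward sublayer
        return m

# Usage: the flattened feature matrix M from the VFE stage is fed straight in.
out = EncoderLayer()(torch.rand(2, 200, 512))     # (B, HW, d_model)
```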
[...] In the MHSA layer, each input vector is linearly projected into a value matrix V ∈ R^{W×d} (see Eq. (9)), a key matrix K ∈ R^{W×d} (see Eq. (8)) and a query matrix Q ∈ R^{W×d} (see Eq. (7)). The queries, keys, and values for the self-attention modules are all generated from the same sequence. The attention output matrix is derived as follows (see Eq. (6)):

Attention(Q, K, V) = softmax((Q × K^T) / √d_k) V    (6)
Q = [q_1, q_2, . . . , q_w]^T,  q_i = W_q x_i + b_q    (7)
K = [k_1, k_2, . . . , k_w]^T,  k_i = W_k x_i + b_k    (8)
V = [v_1, v_2, . . . , v_w]^T,  v_i = W_v x_i + b_v    (9)

where d_k is the dimension of the keys; the scaling factor is used to prevent a very small gradient of the softmax function.
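For reference, Eq. (6)–(9) correspond to the short function below (a generic sketch of standard scaled dot-product attention, not code from the paper); the learnable projections W_q, W_k and W_v are modelled with nn.Linear.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductAttention(nn.Module):
    """Eq. (6)-(9): project the inputs to Q, K, V and apply softmax(QK^T / sqrt(d_k)) V."""
    def __init__(self, d_model: int = 512, d_k: int = 64):
        super().__init__()
        self.wq = nn.Linear(d_model, d_k)   # q_i = W_q x_i + b_q
        self.wk = nn.Linear(d_model, d_k)   # k_i = W_k x_i + b_k
        self.wv = nn.Linear(d_model, d_k)   # v_i = W_v x_i + b_v

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.wq(x), self.wk(x), self.wv(x)               # (B, W, d_k) each
        scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # attention matrix before softmax
        return F.softmax(scores, dim=-1) @ v                       # weighted sum of the values

out = ScaledDotProductAttention()(torch.rand(2, 200, 512))         # (2, 200, 64)
```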
The multi-head attention method involves projecting the queries, keys, and values 'n' times with various learnable projection weights, allowing the model to collect useful information from several representation subspaces at the same time. In the Transformer, both dot-product attention and MHSA are effective in practice, with multi-head attention being a concatenation of dot-product attention heads, as given in Eq. (10) and Eq. (11). Multiple heads of attention are employed in each transformer layer. Furthermore, the multi-head attention mechanism has been introduced to the attention layer to improve the method's expressive capability. Before calculating attention, Q, K, and V are all partitioned into numerous heads and run through distinct, learnt linear projections. After the attention calculation, the multiple heads are concatenated and projected again:

MHAttention(Q, K, V) = Concat(head_1, . . . , head_n) W^O    (10)
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (11)

Using multiple attention heads has the advantage of allowing the model to learn to focus on various parts of the input image for each attention head at different phases of the encoding process. [...] In the standard formulation, however, the softmax function distributes credit to all context components. This is inappropriate, since a lot of credit could be given to information that is not relevant and should be discarded. For example, the traditional self-attention method calculates attention weights by multiplying the specified query by the key from several modalities. The weighted sum is then calculated by applying the attention matrix to the value. However, many irrelevant words may have a minimal association with the encoded image attributes, leading to a very modest amount after multiplying the provided query by the key. When attention scores are relatively close, a SOTA approach, namely constrained local attention, cannot filter irrelevant information and will break the long-term dependency.
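Eq. (10)–(11) amount to the compact sketch below (again generic Transformer code rather than the paper's implementation): the per-head projections are realized by splitting one large linear projection into n heads.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    """Eq. (10)-(11): n scaled dot-product heads, concatenated and projected by W^O."""
    def __init__(self, d_model: int = 512, n_heads: int = 16):
        super().__init__()
        assert d_model % n_heads == 0
        self.h, self.d_head = n_heads, d_model // n_heads
        self.wq = nn.Linear(d_model, d_model)   # stacks W_i^Q for all heads
        self.wk = nn.Linear(d_model, d_model)   # stacks W_i^K
        self.wv = nn.Linear(d_model, d_model)   # stacks W_i^V
        self.wo = nn.Linear(d_model, d_model)   # output projection W^O

    def split(self, t: torch.Tensor) -> torch.Tensor:
        b, w, _ = t.shape
        return t.view(b, w, self.h, self.d_head).transpose(1, 2)   # (B, heads, W, d_head)

    def forward(self, q, k, v):
        q, k, v = self.split(self.wq(q)), self.split(self.wk(k)), self.split(self.wv(v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        heads = F.softmax(scores, dim=-1) @ v                        # head_1 ... head_n
        concat = heads.transpose(1, 2).reshape(q.size(0), -1, self.h * self.d_head)
        return self.wo(concat)                                       # Concat(heads) W^O

x = torch.rand(2, 200, 512)
print(MultiHeadAttention()(x, x, x).shape)   # torch.Size([2, 200, 512])
```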
FIGURE 4. a) Standard scaled dot-product attention. b) Proposed Optimal Adaptive Threshold-based Self-Attention (OATSA).
In our proposed work, we integrate a new threshold module, namely Optimal Adaptive Threshold-based Self-Attention (OATSA), into the standard self-attention calculation, as shown in Fig. 4(b), to discard irrelevant information and preserve the long-term dependencies, replacing the standard scaled dot-product attention shown in Fig. 4(a). We integrate our OATSA module between the scaling and softmax functions to concentrate the attention. Since the softmax function is dominated by the elements with higher numerical values, the OATSA module chooses the elements with higher numerical values and discards the elements with lower numerical values, as shown in Fig. 5. In the first step, we perform the dot product operation between the query and key to derive the attention matrix P. The elements of the attention matrix P are represented by {p_11, p_12, . . . , p_mn}. The elements having higher values in the attention matrix P are assumed to be the most contributive elements. To aggregate focus, we choose the most contributive elements from each row in the attention matrix P. In the next step, we divide the attention matrix P into 'n' chunks and calculate the mean value for each chunk. The elements lower than the threshold multiplicative factor (mean(p_iw) ∗ t) are assigned to negative infinity (see Eq. (11)). Based on the hypothesis that elements assigned negative infinity do not contain any relevant information while elements with higher numerical values contain closely relevant information, the negative-infinity values are finally replaced by zero by applying the softmax function (see Eq. (12)) to the attention matrix (OP_t). The working procedure of the OATSA module is given in Table 2. The proposed Optimal Adaptive Threshold-based Self-Attention (OATSA) module remarkably eliminates noisy information.
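The following sketch renders the procedure of Fig. 5 in PyTorch under our reading of the description: the scaled score matrix is split row-wise into chunks, each chunk's mean multiplied by a threshold factor t serves as an adaptive cutoff, sub-threshold scores are set to −∞, and the subsequent softmax maps them to zero. The chunking granularity and the defaults (w = 4 chunks per row, t = 0.67, matching the values reported in the experiments) are our interpretation, not the authors' released code.

```python
import math
import torch
import torch.nn.functional as F

def oatsa_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                    w: int = 4, t: float = 0.67) -> torch.Tensor:
    """Optimal Adaptive Threshold-based Self-Attention, as we read Fig. 5:
    scores below (chunk mean * t) are pushed to -inf so softmax zeroes them out."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))      # attention matrix P
    *lead, rows, cols = scores.shape
    pad = (-cols) % w                                             # pad so each row splits into w chunks
    padded = F.pad(scores, (0, pad), value=0.0)
    chunks = padded.view(*lead, rows, w, -1)                      # (..., rows, w, chunk_len)
    thresh = chunks.mean(dim=-1, keepdim=True) * t                # adaptive per-chunk threshold
    masked = chunks.masked_fill(chunks < thresh, float('-inf'))   # drop least contributive elements
    masked = masked.view(*lead, rows, -1)[..., :cols]             # undo padding
    return F.softmax(masked, dim=-1) @ v                          # -inf -> 0 after softmax

q = k = v = torch.rand(2, 8, 200, 64)                             # (batch, heads, length, d_head)
print(oatsa_attention(q, k, v).shape)                             # torch.Size([2, 8, 200, 64])
```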
[...] The position-wise feed-forward sublayer consists of two layers of 1×1 convolution with a ReLU activation function, followed by a residual connection. The attention vectors are taken ''one at a time'' by the feed-forward network. The finest part is that, unlike with an RNN, each of these attention vectors is independent of the others. As a result, parallelization may be used here, which makes a huge impact. We can now feed all of the words into the encoder block at the same time and obtain the set of encoded vectors for each word at the same time.

D. DECODER
The decoder is made up of N identical transformer decoder layers, each of which contains three sub-layers. The embeddings of the decoded output sequence of characters are sent into the decoder. MHSA is used in the first layer of the network; the masked mechanism in the first layer prevents the model from seeing future data. This masking approach ensures that the model only utilizes the previous words to generate the current word. An MHSA layer without the masking technique makes up the second layer. It applies a multi-head attention mechanism over the first layer's result and the encoder output. This layer serves as the foundation for correlating text and image information with the self-attention layer. A position-wise fully connected FFN is incorporated in the third layer. Following layer normalization, the Transformer establishes a residual connection for all three layers. To convert the Transformer's result into probabilities for each character of the sequence, we attach a Fully Connected (FC) layer and a softmax layer at the top. All the characters in the phrase can be created simultaneously, unlike in RNN-based decoders.
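A minimal sketch of one such three-sublayer decoder layer (masked self-attention, encoder-decoder attention, position-wise FFN) is given below. It is written with the generic nn.MultiheadAttention for brevity; in the paper's model the attention step would instead use OATSA, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """Three sub-layers: masked MHSA, encoder-decoder attention, position-wise FFN."""
    def __init__(self, d_model: int = 512, n_heads: int = 16, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, tgt: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        L = tgt.size(1)
        causal = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)  # hide future characters
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norms[0](tgt + x)
        x, _ = self.cross_attn(tgt, memory, memory)     # correlate characters with image features
        tgt = self.norms[1](tgt + x)
        return self.norms[2](tgt + self.ffn(tgt))

# Usage: character embeddings attend over the encoder's HW feature vectors.
out = DecoderLayer()(torch.rand(2, 25, 512), torch.rand(2, 200, 512))
```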
FIGURE 5. Working procedure of the proposed Adaptive Threshold-based attention mechanism, which retains the most participative elements by assigning higher probabilities to them. a) The attention matrix p_ij is obtained by performing the dot product between the key (K) and the query (Q). b) The attention matrix p_ij is split into 'n' (3) chunks. c) The mean value is computed for each chunk, and an element is set to −∞ if its value is less than the threshold (0.7) times the chunk mean. d) The softmax function is applied to p_t to replace −∞ with 0. The final matrix contains the most contributive elements.
The conventional seq2seq model's decoder preserves output dependencies only in one direction, leaving the other direction uncaptured. For instance, in certain fonts, a decoder that recognizes the character sequence from L2R may have trouble choosing the initial letter between upper-case ''I'' and lower-case ''l''. These initial characters are difficult to differentiate perceptibly, and the decoder has no memory of previously deciphered characters. Such challenging characters can be recognized easily using an R2L decoder, since the succeeding characters suggest the initial character based on the preceding language context.
Decoders that function in opposite directions are therefore complementary. Hence, we design a bidirectional decoder (see Fig. 6), which comprises decoders with opposing directions, to make use of the dependencies in both ways. The decoder is primarily designed to predict texts from both directions (L2R and R2L). After running the decoder, two recognition results are generated. During inference, to aggregate the outcomes, we merely choose the candidate with the highest log-softmax recognition score, which is the total of all its predicted symbols' recognition scores. In addition to positional embedding and token embedding, we introduce a direction embedding during decoding to add more contextual information. The framework is instructed to decipher the text string from L2R or R2L using this direction embedding. The same decoder architecture and constraint can be utilized for either output sequence processing order by adding the direction embedding. Similar to the position embedding, this direction embedding also gives additional context information to the framework.
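The sketch below illustrates, under our reading of the description, how the two decoding passes can be fused at inference time: the same decoder is run once with an L2R and once with an R2L direction embedding, each pass yields a summed log-softmax score, and the higher-scoring hypothesis is kept. The interface and variable names are hypothetical placeholders, not the paper's code.

```python
import torch

# A direction embedding can simply be a 2-entry table added to token + position embeddings:
direction_emb = torch.nn.Embedding(2, 512)   # index 0 = L2R pass, index 1 = R2L pass

def fuse_bidirectional(logp_l2r: torch.Tensor, ids_l2r: torch.Tensor,
                       logp_r2l: torch.Tensor, ids_r2l: torch.Tensor):
    """Pick the decoding direction whose summed log-softmax character scores are higher.

    logp_*: per-step log-probabilities of the chosen characters, shape (T,).
    ids_*:  the corresponding character ids; the R2L hypothesis is assumed to have
            already been flipped back into natural reading order.
    """
    score_l2r = logp_l2r.sum()          # total recognition score of the L2R pass
    score_r2l = logp_r2l.sum()          # total recognition score of the R2L pass
    return (ids_l2r, score_l2r) if score_l2r >= score_r2l else (ids_r2l, score_r2l)

# Toy example with a 5-character word and hypothetical scores.
l2r_scores = torch.log(torch.tensor([0.9, 0.8, 0.95, 0.7, 0.85]))
r2l_scores = torch.log(torch.tensor([0.6, 0.9, 0.9, 0.9, 0.9]))
best_ids, best_score = fuse_bidirectional(l2r_scores, torch.tensor([8, 0, 21, 12, 4]),
                                          r2l_scores, torch.tensor([8, 0, 21, 12, 4]))
```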
The direction embedding enables the network to decipher the text information not only in the L2R direction but also in the R2L direction. If decoding is performed in only one direction (L2R), then the character loss for R2L-deciphered images cannot be decreased. We consider the output sequence decoding direction as two subtasks: each character sequence decoding (L2R direction and R2L direction) is a sub-task of the conditional output sequence algorithm. To channel the result into the correct decoding direction, we create two separate 512-d vectors at the beginning of training. Each scene text image in the set is decoded twice during each training iteration step, first in the L2R direction and next in the R2L direction. The reversed ground truth of the original description is the ground-truth description for the R2L-deciphered character sequence. The decoder achieves strong performance by combining the outputs of the two directions, which also helps the classifier to predict the right character. The total loss is the sum of the losses suffered by both the L2R and R2L directions, as given in Eq. (13), where y_k, P_l2r, P_r2l and I represent the ground truth of the kth character, the predicted result in the left-to-right direction, the predicted result in the R2L direction and the input image, respectively. We also include a supervisory branch that projects each visual component from the 'c' dimension to the number of alphabet classes in order to predict the character it belongs to, as well as calculating the cross-entropy loss between ground truth and prediction.
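Eq. (13) itself is not recoverable from this extract; a plausible form consistent with the description above (summed cross-entropy over both decoding directions, with T the target length) would be the following. This is our reconstruction, not the authors' exact notation.

```latex
\mathcal{L} \;=\; -\sum_{k=1}^{T}\Big(\log P_{l2r}\!\left(y_{k}\mid I\right)
              \;+\; \log P_{r2l}\!\left(y_{k}\mid I\right)\Big) \tag{13}
```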
IV. EXPERIMENTS
In this section, we exhaustively evaluate the performance of our model. Numerous experiments were performed on challenging STR benchmark datasets, including four regular datasets and three irregular datasets. The dataset descriptions are given as follows. The results from the experiments show that our proposed framework outperforms most of the existing approaches.

A. DATASETS
In this paper, we train our framework with only two synthetic datasets: Synth90k [45] and SynthText [46]. We evaluated the significance and robustness of our proposed STR framework on seven standard benchmark datasets: four regular scene text datasets and three irregular scene text datasets.

Synth90k is the synthetic text dataset proposed by Gupta et al. [45]. A total of 9 million word pictures were acquired from a collection of 90k frequent English words. The entire set of dataset images was used only for training. Each image in Synth90k has a word-level ground-truth annotation. These images were generated with the help of a synthetic text engine and are quite realistic.

SynthText is another synthetic text dataset that was only used for training, proposed by Jaderberg et al. [46]. The process of image generation was similar to that of [45]. Unlike [45], the SynthText dataset was originally created for text detection. Characters are rendered as full-size images.

IIIT5K-Words [38] (IIIT5K): there are 3k cropped word test images in the IIIT5K dataset, collected from the internet. Each image is associated with a 50-word short lexicon and a 1,000-word long lexicon. A few words were created randomly and the rest were created from the dictionary.

Street View Text [39] (SVT): this dataset comprises 647 cropped text pictures acquired from Google Street View (GSV). Each image contains a 50-word lexicon. Most of the images in the SVT dataset are severely distorted, noisy, blurred and of low resolution.

ICDAR 2003 [40] (IC03) consists of 251 scene text images with text-labelled bounding boxes. For a fair comparison, we excluded word images containing non-alphanumeric characters or images with less than 3 characters, as suggested by Wang et al. [20]. The updated dataset comprises 867 cropped word pictures. Images in the IC03 dataset include both a 50-word lexicon and a ''full lexicon''.

ICDAR 2013 [41] (IC13) inherits most of its images from IC03; word images containing non-alphanumeric characters were removed from the dataset. The filtered test dataset contains 1015 cropped word images with no lexicon associated with them.

ICDAR 2015 [42] (IC15) consists of 6545 cropped text images, 4468 images used for training and 2077 images used for testing. No lexicon is associated with it. Most of the text in the word images in this dataset has irregular shapes such as horizontal, oriented and curved. IC15 dataset images were collected with Google Glass without careful positioning and focusing.

SVT-Perspective [43] (SVT-P) consists of 645 cropped word pictures. Images in the SVT-P dataset include both a 50-word lexicon and a ''full lexicon''. SVT-P dataset images are collected from GSV, with most of the
images captured at a side-view angle. Therefore, the images in the SVT-P dataset are heavily distorted, noisy, blurred and of low resolution.

CUTE80 [44] (CUTE) contains a collection of 80 high-resolution images taken in naturalistic environments. It contains 288 cropped word pictures for testing. CUTE is the most challenging dataset, since most of the word images consist of arbitrarily shaped letters. No lexicon is associated with this dataset. It was collected with the intent of evaluating the performance of irregular STR.

TABLE 3. The performance comparison of different modified CNN backbones. The modified ResNet50 captures well-defined feature representations and provides an excellent balance between model size and accuracy.
[...] We compared several modified CNN backbones: VGG16, ResNet18, ResNet34, ResNet50 and ResNet164. Among them, ResNet50 provides an excellent balance between model size and accuracy, and it captures well-defined feature representations. Hence, we choose ResNet50 as our CNN backbone.

After numerous experiments, the number of attention heads in the encoder and decoder is kept at 16. Similarly, the number of decoder blocks is dynamically changed during each iteration and finally set to 3. The results suggest that N = 3 yields the best outcomes for our model, as shown in Table 4. This effect contradicts the Transformer's experimental results, which suggest that utilizing additional blocks improves language translation and irregular text recognition performance. Increasing the number of heads leads to an overfitting problem.

Self-attention is critical in many seq2seq activities such as chatbots, language translation and so on, because it is capable of capturing long-term relationships. We explored the effect of the improved self-attention block in our proposed framework for regular and irregular STR. To improve the encoder, initially, we built a self-attention layer on top of the convolutional layers. On the other hand, we dropped the self-attention layer from the decoder to analyze the influence of the self-attention layer on the decoder side. The recognition accuracy of the generalized model is marginally lower than that of the standard system (91.6% vs. 97.7% on the IIIT5K dataset and 83.3% vs. 90.6% on the SVT-P dataset), as shown in Table 5, but it is still comparable to earlier approaches.

In contrast to language translation approaches, we identified that applying the self-attention mechanism in STR has a significant impact on performance. We believe there are three alternative reasons. Firstly, the length of the character sequences in standard STR tasks is often less than that required for machine translation. Secondly, the CNN-based encoder effectively represents the long-range relationships between the words; for example, the receptive field produced by ResNet50's final feature layer has a great influence on long-term dependencies. Finally, self-attention is often employed in machine translation to represent the relationships between words in a phrase or even a paragraph.
TABLE 5. The encoder and decoder performance comparison with and without the self-attention block. Comparing row 1 and row 2, we find that dropping the self-attention block from the decoder side of our framework produces a significant performance drop. Rows 2 and 3 illustrate that adding the self-attention block to the encoder side shows only a slight improvement.
TABLE 6. Experiment with various decoders. ‘‘Normal’’ denotes an L2R direction, ‘‘Reversed’’ signifies an R2L direction and ‘‘Bidirectional’’ denotes a
combination of them.
TABLE 7. The performance of the proposed method with and without the image transformation technique.
Between words that are far apart, there are nevertheless rich syntactic and semantic relationships. In contrast, each input image in STR generally comprises a single word, and a self-attention module is mainly employed to represent the character relationships within an input text. The ties that bind the letters of a word together are usually weaker than those that bind the words of a sentence. This might clarify why self-attention does not help to enhance irregular text recognition performance.

We analyze the recognition accuracies of several decoders to assess the efficacy of the bidirectional decoder. The Normal decoder only understands text that is read in the L2R direction; Reversed recognizes text only in the R2L direction; the Bidirectional decoder works in both L2R and R2L directions and chooses the one with the greatest recognition accuracy. In many cases, the Normal and Reversed decoders produce equivalent accuracies, as shown in Table 6. Normal surpasses Reversed on SVT, IC15 and CUTE, while Reversed excels on IIIT5K, IC03 and SVT-P. Although in the worst case the difference in recognition accuracy between the Normal and Reversed decoders is minimal, when they are combined they provide a significant performance improvement.

We carried out experiments to see how text rectification impacted our framework. As a text rectification approach, we employed the image normalization technique proposed in [4]. Without an image normalization block, the proposed Transformer-based 2D-attention mechanism can still locate a single character scattered in 2D space. In this context, the image normalization block has minimal effect on our framework (see Table 7).

We provide a novel technique called Optimal Adaptive Threshold-based Self-Attention (OATSA) that effectively realizes an explicit sparse Transformer. The performance and importance of the OATSA technique are shown in Table 8. The OATSA technique preserves the long-term dependencies, which are defined by the distribution of neighbour nodes. It can focus the attention of the standard Transformer on the most contributive components. The Optimal Adaptive Threshold is integrated into self-attention and performs as an attention mechanism in the decoder, allowing the model to produce more accurate words.
TABLE 8. Performance comparison among four variations. ResNet50 is used as the proposed model's feature extraction module. ''Rectification'' and ''no rectification'' indicate whether the image normalization step was performed. 2D positional encoding represents the model that keeps track of the character position in each iteration while performing the recognition task. OATSA represents the Optimal Adaptive Threshold-based Self-Attention algorithm. For the decoders, ''Normal'' denotes the L2R direction, ''Reversed'' signifies the R2L direction and ''Bidirectional'' denotes a combination of them.
FIGURE 6. The attention heat maps provide a visual representation of the 2D attention weights obtained from all of the decoding stages on a
standard benchmark.
After extensive comparative experiments, the optimal values of w and t are 4 and 0.67, respectively. Fig. 7 shows the visual representation of the 2D attention weights obtained from all of the decoding stages on a standard benchmark.

D. COMPARISONS WITH EXISTING METHODS
In this section, we compare the effectiveness and robustness of the proposed method, setting the number of decoder blocks (N) to 3, the number of heads (H) to 16 and d = 1024, against the current SOTA methods on a variety of regular and irregular text benchmark datasets. To be fair, we merely list all the performance results in lexicon-free mode. Most of these existing approaches are trained on the same datasets. The proposed Transformer-based scene text recognition model was compared with 21 existing methods on both regular and irregular datasets, and the results are shown in Table 9. The proposed framework comfortably outperforms the current SOTA approaches of Yang et al. [36] and Lu et al. [48] on standard datasets such as SVT, IC13, SVT-P, CUTE and IC15. The proposed method produces better recognition accuracy on regular text datasets such as IC03, IC13 and [...]. Some existing methods report higher accuracy since they employed extra word images with character-level annotations for model training. Note that Litman et al. [26] and Lu et al. [48] use an additional image dataset, SynthAdd [47], for training and produced better recognition accuracies of 86.9% and 84.5% on SVT-P and 87.5% on CUTE80. Zhang et al. [31] use the Wiki dataset for additional training to achieve the top performance on CUTE80. Still, the proposed method outperforms Litman et al. [26], Lu et al. [48] and Zhang et al. [31] on IC03, IC13, and IC15. The recognition accuracy of the proposed method on all datasets substantially surpasses that of linguistic-based approaches, notably on irregular texts (leading by +3.4%
TABLE 9. On several benchmarks, the overall performance of our STR model is compared with that of previous state-of-the-art approaches. All values are expressed as a percentage (%). All outcomes are in the no-lexicon setting, represented by ''None''. ''90K'', ''ST'', ''SA'' and ''Wiki'' stand for Synth90K, SynthText, SynthAdd and Wikitext-103, respectively; ''word'' and ''char'' denote the use of word-level or character-level annotations; and ''self'' denotes the use of a self-designed convolution network or self-made synthetic datasets.
on IC15 and +2.8% on CUTE datasets). Our method outperforms the prior SOTA approach by Yang et al. [36] by a margin of +7.7% on SVT, +4.8% on IC13, +14.2% on IC15, +9.7% on SVT-P and +5.9% on CUTE. The significant improvement validates the effectiveness of our method. We are only 0.6% behind Zhang et al. [31] on CUTE
FIGURE 7. Illustration of success and failure cases of the proposed method. ‘‘GT’’ stands for ‘‘Ground Truth,’’ ‘‘Pred’’ stands for ‘‘Prediction’’. Blurry,
low resolution (LR) and illumination are some of the reasons for failure.
(91.3% vs. 91.9%). In addition, it is highlighted that the IIIT5K dataset has plenty of images with background noise. Still, the proposed approach achieves the highest recognition accuracy of 97.7%. Although our model employs only word-level annotations, it outperforms the character-level model of Liao et al. [49] on IIIT5K (97.7% vs. 91.9%). Many samples in the IC15 dataset were not horizontally positioned, which is beyond the scope of the present study. As a result, we normalized the samples based on the image ratio. Most of the image samples in the SVT-Perspective dataset are perspective-distorted and of low resolution. Although the proposed method performs the same as the model presented by Liao et al. [12], it did obtain SOTA performance in the majority of SVT-Perspective scenarios. As shown in Fig. 8, our method outperforms Yang et al. [36] on the STR problem, and the proposed framework is capable of recognizing blurred word images and irregular text. Our model learns both self-attention and input-output attention, where the encoder and decoder both preserve feature-feature and target-target interactions. This makes the intermediate representations more resistant to spatial distortion. Furthermore, our approach considerably reduces the issue of attention drifting.

V. CONCLUSION
In this paper, we presented a new simple yet powerful framework for both regular and irregular STR based on the transformer framework. The proposed framework breaks down into four modules: image transformation, feature extraction, encoder and decoder. First, the transformation module utilizes a Thin Plate Spline (TPS) transformation to normalize the irregular or arbitrary word image into a more readable word image, which greatly helps to reduce the complexity of extracting text features. Second, the Visual Feature Extraction (VFE) module uses ResNet as the CNN backbone to extract well-defined feature representations and expands the standard transformer's 1D Positional Encoding (1DPE) to 2D Positional Encoding (2DPE) to capture the order of sequential information from the 2D rectified word image. Third, the Multi-Head Self-Attention (MHSA) and Feed-Forward Network Layers (FFNL) in the encoder module perform feature aggregation and feature transformation concurrently. Finally, we proposed a new Optimal Adaptive Threshold-based Self-Attention (OATSA) model and an architectural-level bi-directional decoding approach in the decoder module that greatly help the framework to generate a more accurate character sequence. The OATSA model replaces the standard scaled dot-product attention. Moreover, it can be used in both the encoder and decoder modules to filter noisy information and focus on image text regions. The proposed framework is trained with word-level annotations; it can handle words of any length in the lexicon-free mode. Comprehensive experimental results on challenging standard benchmarks
including IIIT5K-Words, Street View Text, CUTE80 and the ICDAR datasets show that our methodology outperforms SOTA approaches.

REFERENCES
[24] Q. Lin, C. Luo, L. Jin, and S. Lai, ‘‘STAN: A sequential transformation 957
886 [1] Y. Zhu, C. Yao, and X. Bai, ‘‘Scene text detection and recognition: attention-based network for scene text recognition,’’ Pattern Recognit., 958
887 Recent advances and future trends,’’ Frontiers Comput. Sci., vol. 10, no. 1, vol. 111, pp. 1–9, Mar. 2021. 959
888 pp. 19–36, 2016. [25] F. Zhan and S. Lu, ‘‘ESIR: End-to-end scene text recognition via itera- 960
889 [2] Q. Ye and D. Doermann, ‘‘Text detection and recognition in imagery: tive image rectification,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern 961
890 A survey,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, Recognit. (CVPR), Jun. 2019, pp. 2054–2063. 962
891 pp. 1480–1500, Jul. 2015. [26] R. Litman, O. Anschel, S. Tsiper, R. Litman, S. Mazor, and R. Manmatha, 963
892 [3] S. Long, X. He, and C. Yao, ‘‘Scene text detection and recognition: The ‘‘SCATTER: Selective context attentional scene text recognizer,’’ in Proc. 964
893 deep learning era,’’ Int. J. Comput. Vis., vol. 129, no. 1, pp. 1–26, 2018. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, 965
894 [4] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, ‘‘ASTER: An pp. 11962–11972. 966
895 attentional scene text recognizer with flexible rectification,’’ IEEE Trans. [27] W. Liu, C. Chen, and K.-Y. K. Wong, ‘‘Char-Net: A character-aware neural 967
896 Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2035–2048, Jun. 2019. network for distorted scene text recognition,’’ in Proc. Assoc. Advancement 968
897 [5] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, ‘‘Robust scene text recognition Artif. Intell., 2018, pp. 7154–7161. 969
898 with automatic rectification,’’ in Proc. IEEE Conf. Comput. Vis. Pattern [28] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, ‘‘AON: Towards 970
899 Recognit. (CVPR), Jun. 2016, pp. 4168–4176. arbitrarily-oriented text recognition,’’ in Proc. IEEE/CVF Conf. Comput. 971
900 [6] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, ‘‘Edit probability for scene text Vis. Pattern Recognit., Jun. 2018, pp. 5571–5579. 972
901 recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), [29] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou, ‘‘Focusing attention: 973
902 Jun. 2018, pp. 1508–1516. Towards accurate text recognition in natural images,’’ in Proc. IEEE Int. 974
903 [7] B. Shi, X. Bai, and C. Yao, ‘‘An end-to-end trainable neural network Conf. Comput. Vis. (ICCV). Venice, Italy, Oct. 2017, pp. 5086–5094. 975
904 for image-based sequence recognition and its application to scene text [30] C. Wang and C.-L. Liu, ‘‘Multi-branch guided attention network for irreg- 976
905 recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, ular text recognition,’’ Neurocomputing, vol. 425, pp. 278–289, Feb. 2021. 977
906 pp. 2298–2304, Nov. 2016. [31] Y. Zhang, Z. Fu, F. Huang, and Y. Liu, ‘‘PMMN: Pre-trained multi-modal 978
907 [8] Y. Gao, Y. Chen, J. Wang, M. Tang, and H. Lu, ‘‘Reading scene text network for scene text recognition,’’ Pattern Recognit. Lett., vol. 151, 979
with fully convolutional sequence modeling," Neurocomputing, vol. 339, pp. 161–170, Apr. 2019.
[9] P. Selvam and J. A. S. Koilraj, "A deep learning framework for grocery product detection and recognition," Food Anal. Methods, to be published, doi: 10.1007/s12161-022-02384-2.
[10] Y. Baek, S. Shin, J. Baek, S. Park, J. Lee, D. Nam, and H. Lee, "Character region attention for text spotting," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 504–521.
[11] S. Prabu and K. J. A. Sundar, "Enhanced attention-based encoder–decoder framework for text recognition," Intell. Automat. Soft Comput., vol. 35, no. 2, pp. 2071–2086, 2023.
[12] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai, "Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 532–548, Feb. 2021.
[13] M. Jaderberg, "Spatial transformer networks," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2015, pp. 2017–2025.
[14] D. Yu, X. Li, C. Zhang, J. Han, J. Liu, and E. Ding, "Towards accurate scene text recognition with semantic reasoning networks," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 12110–12119.
[15] C. Luo, L. Jin, and Z. Sun, "MORAN: A multi-object rectified attention network for scene text recognition," Pattern Recognit., vol. 90, pp. 109–118, Jun. 2019.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," 2017, arXiv:1706.03762.
[17] C. Yao, X. Bai, and W. Liu, "A unified framework for multioriented text detection and recognition," IEEE Trans. Image Process., vol. 23, no. 11, pp. 4737–4749, Nov. 2014.
[18] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "Object segmentation based on the integration of adaptive K-means and GrabCut algorithm," in Proc. Int. Conf. Wireless Commun. Signal Process. Netw. (WiSPNET), Mar. 2022, pp. 213–216, doi: 10.1109/WiSPNET54241.2022.9767099.
[19] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, "End-to-end text recognition with convolutional neural networks," in Proc. Int. Conf. Pattern Recognit. (ICPR), Nov. 2012, pp. 3304–3308.
[20] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 1457–1464.
[21] J.-H. Seok and J. H. Kim, "Scene text recognition using a Hough forest implicit shape model and semi-Markov conditional random fields," Pattern Recognit., vol. 48, no. 11, pp. 3584–3599, 2015.
pp. 103–111, Nov. 2021.
[32] L. Dong, S. Xu, and B. Xu, "Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 5884–5888.
[33] A. W. Yu, D. Dohan, T. Luong, R. Zhao, K. Chen, and Q. Le, "QANet: Combining local convolution with global self attention for reading comprehension," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–16.
[34] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, "Universal transformers," in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–23.
[35] Y. Chen, H. Shu, W. Xu, Z. Yang, Z. Hong, and M. Dong, "Transformer text recognition with deep learning algorithm," Comput. Commun., vol. 178, pp. 153–160, Oct. 2021.
[36] L. Yang, P. Wang, H. Li, Z. Li, and Y. Zhang, "A holistic representation guided attention network for scene text recognition," Neurocomputing, vol. 414, pp. 67–75, 2020.
[37] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[38] A. Mishra, K. Alahari, and C. V. Jawahar, "Scene text recognition using higher order language priors," in Proc. Brit. Mach. Vis. Conf., Surrey, U.K., 2012, pp. 127.1–127.11.
[39] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 1457–1464.
[40] S. M. Lucas, "ICDAR 2003 robust reading competitions: Entries, results, and future directions," Int. J. Document Anal. Recognit. (IJDAR), vol. 7, nos. 2–3, pp. 105–122, 2005.
[41] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. I. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. D. L. Heras, "ICDAR 2013 robust reading competition," in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Washington, DC, USA, 2013, pp. 1484–1493.
[42] D. Karatzas, "ICDAR 2015 competition on robust reading," in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Tunis, Tunisia, Aug. 2015, pp. 1156–1160.
[43] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, "Recognizing text with perspective distortion in natural scenes," in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, Dec. 2013, pp. 569–576.
[44] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, "A robust arbitrary text detection system for natural scene images," Expert Syst. Appl., vol. 41, no. 18, pp. 8027–8048, 2014.
[45] A. Gupta, A. Vedaldi, and A. Zisserman, "Synthetic data for text localisation in natural images," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2315–2324.
[46] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Synthetic data and artificial neural networks for natural scene text recognition," in Proc. Adv. Neural Inf. Process. Syst. Workshop, 2014, pp. 1–10.
[47] H. Li, P. Wang, C. Shen, and G. Zhang, "Show, attend and read: A simple and strong baseline for irregular text recognition," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8610–8617.
[48] N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, and X. Bai, "MASTER: Multi-aspect non-local network for scene text recognition," Pattern Recognit., vol. 117, pp. 1–10, Sep. 2021.
[49] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, "Scene text recognition from two-dimensional perspective," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8714–8721.
[50] Y. Huang, Z. Sun, L. Jin, and C. Luo, "EPAN: Effective parts attention network for scene text recognition," Neurocomputing, vol. 376, pp. 202–213, Feb. 2020.
[51] Y. Wu, J. Fan, R. Tao, J. Wang, H. Qin, A. Liu, and X. Liu, "Sequential alignment attention model for scene text recognition," J. Vis. Commun. Image Represent., vol. 80, pp. 1–8, Oct. 2021.

ABOLFAZL MEHBODNIYA (Senior Member, IEEE) received the Ph.D. degree from the INRS-EMT University of Quebec, Montreal, Canada, in 2010. He is currently an Associate Professor and the Head of the Department of Electronics and Communication Engineering (ECE), Kuwait College of Science and Technology (KCST). Before coming to KCST, he worked as a Marie-Curie Senior Research Fellow at University College Dublin, Ireland. Prior to that, he worked as an Assistant Professor at Tohoku University, Japan, and as a Research Scientist at the Advanced Telecommunication Research (ATR) International, Kyoto, Japan. His research interests include communications engineering, the IoT and artificial intelligence in wireless networks, and real-world applications. He has received numerous awards, including the JSPS Young Faculty Startup Grant, the KDDI Foundation Grant, the Japan Radio Communications Society (RCS) Active Researcher Award, the European Commission Marie Skłodowska-Curie Fellowship, and the NSERC Visiting Fellowships in Canadian Government Laboratories. He is a Senior Member of IEICE.

PRABU SELVAM received the B.E. degree in computer science and engineering from the Shirdi Sai Engineering College, in 2011, and the M.E. degree in computer science and engineering from the Sathyabama Institute of Science and Technology, in 2013. He is currently working as a Research Scholar with the School of Computing, SASTRA Deemed University, Thanjavur. His current research interests include pattern recognition and computer vision. He carried out the coding and the implementation of the ideas.

JOSEPH ABRAHAM SUNDAR KOILRAJ received the Ph.D. degree from SASTRA Deemed University, Thanjavur, in 2017. He is currently working as an Assistant Professor with the School of Computing, SASTRA Deemed University. He has published and presented more than 20 technical papers in international/national journals and conferences. His current research interests include image processing and pattern recognition. In the current study, he analyzed, interpreted, and evaluated the experimental results.

JULIAN L. WEBBER (Senior Member, IEEE) received the Ph.D. degree from Bristol University, in 2004. Following postdoctoral research on wireless communications at Hokkaido University from 2007, he joined the Advanced Telecommunications Research Institute International, Kyoto, in 2012. Since 2018, he has been a Visiting Researcher and a Research Assistant Professor at Osaka University. He is currently an Associate Professor and the Head of the Department of Electronics and Communication Engineering (ECE), Kuwait College of Science and Technology. His research interests include communications engineering, machine learning, and signal and image processing, with an emphasis on real-time implementation. He is a member of IEICE.

MESHAL ALHARBI received the M.Sc. degree in computer science from Wayne State University, USA, in 2014, and the Ph.D. degree in computer science from Durham University, U.K., in 2020. He has ten years of experience in teaching/research/industry. He is currently an Assistant Professor of artificial intelligence with the Department of Computer Science, Prince Sattam Bin Abdulaziz University, Saudi Arabia. His research interests include artificial intelligence applications and algorithms, agent-based modeling and simulation applications, disaster/emergency management and resilience, optimization applications, and

than 100 projects for UG and PG students in engineering streams. He is a recognized Research Supervisor at Anna University under the Information and Communication Engineering Faculty. He has published papers in 140 international journals, 20 international conferences, and ten national conferences. He has published three textbooks for the Anna University, Chennai Syllabus. He has filed 20 Indian and three international patents in various fields of interest. His research interests include security, MANET, the IoT, cloud computing, and machine learning. He is a member of various professional bodies, such as MISTE, MIEEE, MIAENG, MIACSIT, MICST, MIE, and MIEDRC. He received the Award of Honorary Doctorate (Doctor of Letters, D.Litt.) from International Economics University, SAARC Countries, in Education and Students Empowerment, in April 2017.