
Received 21 August 2022, accepted 8 September 2022, date of publication 16 September 2022, date of current version 29 September 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3207469

A Transformer-Based Framework
for Scene Text Recognition
PRABU SELVAM 1, JOSEPH ABRAHAM SUNDAR KOILRAJ 1,
CARLOS ANDRÉS TAVERA ROMERO 2, (Member, IEEE),
MESHAL ALHARBI 3, ABOLFAZL MEHBODNIYA 4, (Senior Member, IEEE),
JULIAN L. WEBBER 4, (Senior Member, IEEE),
AND SUDHAKAR SENGAN 5, (Member, IEEE)
1 School of Computing, SASTRA Deemed University, Thanjavur, Tamil Nadu 613401, India
2 COMBA R&D Laboratory, Faculty of Engineering, Universidad Santiago de Cali, Cali 76001, Colombia
3 Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
4 Department of Electronics and Communications Engineering, Kuwait College of Science and Technology, Safat 20185145, Kuwait
5 Department of Computer Science and Engineering, PSN College of Engineering and Technology, Tirunelveli, Tamil Nadu 627152, India

Corresponding author: Sudhakar Sengan ([email protected])


This work was supported by the Dirección General de Investigaciones at Universidad Santiago de Cali under Grant 01-2022.

ABSTRACT Scene Text Recognition (STR) has become a popular and long-standing research problem in the computer vision community. Almost all existing approaches mainly adopt the connectionist temporal classification (CTC) technique; however, these approaches are not very effective for irregular STR. In this article, we introduce a new encoder-decoder framework, built on the transformer architecture, that identifies both regular and irregular natural scene text. The proposed framework is divided into four main modules: Image Transformation, Visual Feature Extraction (VFE), Encoder and Decoder. Firstly, we employ a Thin Plate Spline (TPS) transformation in the image transformation module to normalize the original input image and reduce the burden of subsequent feature extraction. Secondly, in the VFE module, we use ResNet as the Convolutional Neural Network (CNN) backbone to retrieve text image feature maps from the rectified word image. However, the VFE module generates one-dimensional feature maps that are not suitable for locating multi-oriented text on two-dimensional word images, so we propose 2D Positional Encoding (2DPE) to preserve the sequential information. Thirdly, feature aggregation and feature transformation are carried out simultaneously in the encoder module. We replace the original scaled dot-product attention of the standard transformer with an Optimal Adaptive Threshold-based Self-Attention (OATSA) model to filter noisy information effectively and focus on the most contributive text regions. Finally, we introduce a new architectural-level bi-directional decoding approach in the decoder module to generate a more accurate character sequence. We evaluate the effectiveness and robustness of the proposed framework on both horizontal and arbitrary text recognition through extensive experiments on seven public benchmarks, including the IIIT5K-Words, SVT, ICDAR 2003, ICDAR 2013, ICDAR 2015, SVT-P and CUTE80 datasets, and demonstrate that it outperforms most existing approaches by a substantial margin.

INDEX TERMS Connectionist temporal classification, scene text recognition, self-attention, transformer, optical character recognition, deep learning.

I. INTRODUCTION

Text information disclosed in natural scene images is vital for visual interpretation, conceptual understanding and reasoning [1]. Reading texts in natural scene images is useful in numerous deep learning systems [2], including robot navigation, industrial automation and driver assistance. The areas of computer vision and pattern recognition have paid special attention to STR.
Despite decades of research into Optical Character Recognition (OCR) [3], text recognition from natural scene images remains challenging due to a variety of factors, for example large variation in text font and colour, low contrast, complex backgrounds, perspective distortion, occlusion, uneven lighting and so on. A wide variety of STR approaches has been discussed in the literature with significant success in recent years, benefitting from the emergence of deep learning. Traditional approaches recognized text from scene images by first locating individual characters and then using a CNN to recognize each cropped character [4]. However, a great volume of inter-character and intra-character conflict effectively decreases the recognition network's performance, and these methods rely largely on a robust character detector. An attentional Recurrent Neural Network (RNN) can effectively handle the sequence-to-sequence (seq2seq) problem of recognizing regular text, whereas recognizing arbitrarily shaped text is more challenging for an RNN model.

Conventional irregular STR methods can be divided into four main categories: text shape rectification-based approaches [4], [5], Connectionist Temporal Classification (CTC) based approaches [6], [7], multi-direction encoding-based approaches [8], [9] and attention-based approaches [10], [11], [12]. Shape rectification is used to normalize the word image, eliminate distortion and make irregular text recognition simpler. Methods based on shape rectification [5] attempt to transform irregular text into regular text before applying regular text recognizers. The Spatial Transformer Network (STN) [13] was an early text normalization network; it was utilized to rectify individual character regions and complete word images. Later, Shi et al. [4] incorporated a Thin-Plate Spline (TPS) transformation mechanism as a text normalization module to handle more complex text distortions. Sophisticated rectification modules are necessary to manage a range of distortions and are becoming a new trend; however, they have an impact on the performance and memory use of recognition algorithms. CTC has made substantial improvements in a variety of areas, including voice recognition and web handwritten character identification, and is a widely used prediction algorithm in STR. Yu et al. [14] and Shi et al. [7] were the first to apply CTC to STR, inspired by its achievements in speech recognition. Luo et al. [15] used a CTC-based prediction algorithm for model learning and showed superior performance. Despite its remarkable performance in STR, CTC has certain shortcomings: its underlying approach is complex, it exhibits peaky distribution issues, and it is ineffective for two-dimensional prediction tasks like irregular STR.

Attention-based prediction methods have surpassed CTC in decoding in recent years because of their capability to attend to the appropriate place. RNN-based recognizers consistently use the attention mechanism in their prediction module for the STR problem. The input text image pattern, the output text sequence pattern and their correspondence are primarily learned by the attention mechanism by examining the final character sequences and the encoded feature vectors. A wide variety of attention-based approaches has evolved in the STR field, influenced by the growth of machine translation frameworks. There are, however, various flaws in the attention mechanism: it needs additional storage space and computation power, it suffers from the attention drift problem, and the latest attention mechanism research is primarily focused on languages with just a few character groups (e.g., English, French).

The Transformer [16], a modern attention alternative, has been extensively used to increase parallelization and minimize complexity for STR. Some efforts were made to replace recurrent neural networks with non-recurrent structures in the domain of regular text recognition, such as convolution-based and attention-based approaches. Nevertheless, both approaches depend on seq2seq structures, which are inadequate for handling arbitrarily shaped text. It is worth mentioning that the Transformer was originally designed for language translation tasks, such as English to French or French to English, and takes one-dimensional sequences as its input. Since position information is not naturally encoded within the input set of sequences, the model is less sensitive to the positioning of input sequences than RNN and LSTM frameworks with associative bias. The transformer is permutation equivariant because of its Self-Attention (SA) and Feed-Forward Network (FFN) layers, which calculate the result of each component in the input sequence separately. While the 1D Positional Encoding (PE) approach employed in the Transformer may handle the permutation equivariance issue that arises in the 1D sequences associated with NLP, it cannot preserve the horizontal and vertical features produced by CNNs for a 2D input image.

In short, the primary contributions of our research work are as follows:

• The Transformer [16] in Natural Language Processing (NLP) takes only 1D sequences as its input, whereas scene text recognizers must handle 2D images. To solve this permutation equivariance problem and to preserve the order of sequential information, we modify the conventional transformer architecture to recognize texts in the scene image. Here, we introduce a novel mechanism to convert the spatial encoder from 1D to 2D by expanding the standard transformer's 1D Positional Encoding (1DPE) to 2D Positional Encoding (2DPE).

• Input word images in natural scenes take different forms, including curved and skewed texts. If such input word images are passed on unchanged, the feature extraction step must learn an invariant representation for such geometry. To eliminate distortion in the input word images and make text recognition easier, we employ a thin-plate spline (TPS) transformation [4]. The rectified or normalized images enhance text recognition accuracy, particularly for datasets with a majority of arbitrary texts and perspectively distorted texts. TPS can be selected or deselected in our framework.


• We propose a new mechanism called Optimal Adaptive Threshold-based Self Attention (OATSA) that explicitly ignores the least contributive components in the attention matrix to limit the extraction of irrelevant elements and improve attention focus. The OATSA approach can efficiently filter noisy information and generate a more accurate word.

• We introduce a unified bi-directional decoder architecture, integrated into the modified Transformer, that works in both right-to-left (R2L) and left-to-right (L2R) directions for STR to obtain more robust character sequences. The bidirectional decoding method, which uses two distinct decoders, outperforms stand-alone decoders. We accomplish this by executing bidirectional decoding at the architectural level, rather than at the input level as in earlier research works.

• The proposed framework is a non-recurrent network, trained in an end-to-end manner. It can be trained concurrently without employing any RNN modules, and it achieves comparable or superior performance compared with most existing methods on both horizontal and arbitrarily shaped scene text benchmarks, for example 97.1% on ICDAR03, 98.0% on ICDAR13, 87.2% on ICDAR15, and 89.3% on CUTE80.

II. RELATED WORK
In the past decade, there has been a growing interest in natural-scene STR in the computer vision community, which differs from classical handwritten character and text recognition in terms of characteristics and difficulties. Extensive studies can be found in [1], [2], and [3].

Traditional STR systems used text detectors to extract various candidate character positions and then used a character classifier to recognize the characters. These traditional methods depend on low-level features for STR, including the stroke width transform [17], connected components [18], Histogram of Oriented Gradients (HOG) descriptors [19] and so on. Wang et al. [20] used HOG descriptors to train a character recognizer, which then used a sliding window to recognize characters in cropped word images. Seok and Kim [21] represent the target character set as an Implicit Shape Model (ISM) to achieve robustness on the character set; a Hough forest is trained to localize and group the character candidates, and a semi-Markov conditional random field (semi-CRF) framework is constructed for the recognition of text candidates. However, the low capability of hand-crafted features limits the performance of traditional text recognition systems. Yao et al. [22] introduced a new way of reliably identifying individual characters, called Strokelets, providing a histogram feature for recognizing character components in natural scene images.

Several researchers have started using deep learning models for STR as a result of the fast growth of neural networks. For 90k-word classification, Jaderberg et al. [23] designed a CNN classifier composed of four convolutional layers and two fully connected layers to retrieve high-level feature maps from the text image. This technique, however, was confined to a pre-defined lexicon. To get around this constraint, several authors have recently considered STR as a sequence translation task. Shi et al. [7] introduced a new unified deep neural network architecture named the Convolutional Recurrent Neural Network (CRNN), which combines the functions of both CNNs and Recurrent Neural Networks (RNNs). CRNN can handle input images of various dimensions and generate predictions of different lengths; it is capable of handling random strings (e.g. phone numbers), sentences and other scripts such as Chinese words, and is not confined to recognizing words from a known dictionary. Since the spatial dependencies between local image patches in a CNN are not explored and utilized, Shi et al. [5] specially designed a new deep neural network that has both convolutional and recurrent layers to transform irregular word images into more readable regular word images via an STN. This network generates a sequence of feature vectors for an input word image of arbitrary size; finally, an attention-based sequence recognizer is used to generate a character sequence. Lin et al. [24] designed a special network that combines a sequential transformation network, used to rectify irregular text by dividing the complex transformation step into multiple basic transformation steps, with an attention-based text recognizer adopted to classify and capture character sequences.

Recent deep networks can develop robust representations that are tolerant to imaging distortions and changes in text style, but they still have issues handling scene texts with viewpoint and curvature distortions. To deal with such issues, Zhan and Lu [25] established an end-to-end STR network called ESIR, which reduces viewpoint distortion and text line curvature iteratively and improves the performance of STR systems. The posture of text lines in scenes is estimated using an innovative rectification network that introduces a novel line fitting transformation. In addition, an iterative rectification mechanism corrects scene text distortions into a fronto-parallel perspective. Litman et al. [26] presented a new encoder-decoder architecture named Selective Context ATtentional Text Recognizer (SCATTER) for predicting character sequences against complicated image backgrounds; a deep Bi-LSTM encoder is designed for encoding contextual dependencies, and a two-step 1D attention method is used to decode the text.

Instead of rectifying the complete text image, Liu et al. [27] suggested using a Character-Level Encoder (CLE) to identify and rectify specific characters in the word image. The arbitrary orientation network (AON) was developed by Cheng et al. [28] to directly capture deep feature representations of irregular texts in four directions along with character location clues; a filter gate mechanism was designed to integrate the four-direction character sequence features, and an attention-based decoder was employed to generate character sequences.


To tackle the ''attention drift'' problem, Cheng et al. [29] proposed a Focusing Attention Network (FAN) comprising two main components: an attention network (AN) and a focusing network (FN). The AN is designed to recognize character targets, and the FN is designed to evaluate and adjust the attention by properly attending to the character target locations in the word images. A Multi-Branch Guided Attention Network (MBGAN) was proposed by Wang and Liu [30] to acquire invariant semantic representations of decoded character sequences, which can handle numerous irregularity aspects for irregular text recognition; furthermore, the MBGAN contributes to the prevention of attention drift. Similarly, Bai et al. [6] introduced a technique called edit probability (EP) to efficiently deal with the issue of missing or unnecessary characters causing misalignment between the training texts and the resulting probability distribution over character sequences. Most deep learning models rely heavily on visual data such as well-annotated images and do not efficiently employ linguistic data such as texts; Zhang et al. [31] introduced a novel network called the Pre-trained Multi-Modal Network (PMMN) that uses both visual and linguistic data for accurate STR.

In recent years, some research has been done to eliminate the recurrent architecture from seq2seq learning algorithms, allowing for entirely parallel computation and faster processing speeds. Vaswani et al. [16] introduced a novel standard network architecture solely based on attention mechanisms, with no recurrence or convolutions, called the ''Transformer'' for machine translation tasks. Replacing position-pair computation by RNNs, its core self-attention module obtains the interdependence between distinct positions in a sequence, leading to greater parallelization and minimal network complexity. Dong et al. [32] incorporated a Transformer architecture for the voice recognition problem. Similarly, Yu et al. [33] developed a network for a reading comprehension problem by integrating internal convolution layers with a universal self-attention module; both of these models were inspired by the Transformer framework. Dehghani et al. [34] recently expanded the Transformer architecture by developing a new model called the ''Universal Transformer'' to handle string copying and other rational interpretation with string lengths longer than those seen during training. There have also been numerous attempts to interpret scene text without the use of recurrent networks. Based on the Transformer model, Chen et al. [35] developed a new non-recurrent seq2seq framework for STR, which includes a self-attention block functioning as a fundamental component in both the encoder and decoder architecture to understand character dependencies. Yang et al. [36] proposed an STR network that is much simpler and more powerful, based on holistic representation-guided attention; an attention-based sequence decoder is linked directly to two-dimensional CNN features, and the holistic representation steers the attention-based decoder to concentrate more precisely on text regions. Because of their inherent model architecture, all of these existing approaches are mostly focused on regular STR and find it hard to recognize irregular text.

In contrast with the convolution network and attention mechanism, we propose a simple but powerful STR model with an Optimal Adaptive Threshold-based Self-Attention (OATSA) mechanism in this paper. This method directly maps word images into character sequences, and it also works well on both horizontal and arbitrarily shaped scene text images.

III. PROPOSED SYSTEM
We propose a modified Transformer-based architecture to recognize arbitrarily shaped text from natural scene images. Fig. 1 illustrates the overall pipeline of the proposed framework. The modified transformer can be categorized into four main modules: Image Transformation, VFE, Encoder and Decoder. Both the encoder and decoder utilize a multi-layer stack of transformers. The encoder module is designed to obtain a high-level feature representation from a scene text image. The decoder block is designed to generate the sequence of characters from the feature maps while paying attention to the encoder output. Transformers are attention-based deep-learning architectures that use a self-attention module to scan through each constituent of a sequence and post updates by accumulating information from the entire sequence. The attention mechanism greatly helps the transformers capture the global dependencies among the input and output sequences, relationships that previous deep-learning approaches find challenging to capture.

A. IMAGE TRANSFORMATION
Text recognizers are most effective when their input images contain tightly confined regular text. This encourages us to perform a spatial transformation before recognition to convert input images into ones that recognizers can read easily. We employ a TPS transformation to transform an input text image (I) into a normalized image (I') as shown in Fig. 2. Text images come in various shapes, for example tilted, perspective and curved texts. Such complex-shaped text images force the feature extraction steps to learn an invariant representation with respect to such geometry. The TPS rectification algorithm is an alternative to the STN and has been used for various aspect ratios of text lines to reduce this complexity. TPS interpolates between a collection of fiducial points using a smooth spline. It identifies several fiducial points at the upper and lower enveloping points, and then normalizes the character region to a predefined rectangle.
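As a concrete illustration, the rectification step can be sketched in PyTorch as follows. This is a simplified, hypothetical TPS-style rectifier, not the authors' implementation: a small localization network predicts K fiducial points on the input image, a thin-plate-spline system maps every pixel of the rectified output back to the input, and F.grid_sample produces the normalized image I'. The helper names (tps_grid, TPSRectifier), layer sizes and fiducial count are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

def tps_grid(base_pts, pred_pts, out_h, out_w):
    """base_pts, pred_pts: (K, 2) in [-1, 1]; returns a sampling grid (out_h, out_w, 2)."""
    K = base_pts.size(0)
    # TPS radial basis U(r) = r^2 log(r^2); the small epsilon avoids log(0) on the diagonal.
    d2 = torch.cdist(base_pts, base_pts).pow(2)
    rbf = d2 * torch.log(d2 + 1e-6)
    P = torch.cat([base_pts.new_ones(K, 1), base_pts], dim=1)            # (K, 3)
    L = base_pts.new_zeros(K + 3, K + 3)
    L[:K, :K], L[:K, K:], L[K:, :K] = rbf, P, P.t()
    Y = torch.cat([pred_pts, pred_pts.new_zeros(3, 2)], dim=0)           # (K+3, 2)
    W = torch.linalg.solve(L, Y)                                         # TPS coefficients
    # Evaluate the mapping on a regular grid over the rectified output image.
    ys = torch.linspace(-1, 1, out_h, device=base_pts.device)
    xs = torch.linspace(-1, 1, out_w, device=base_pts.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")
    pts = torch.stack([gx, gy], dim=-1).reshape(-1, 2)                   # (H*W, 2) as (x, y)
    d2p = torch.cdist(pts, base_pts).pow(2)
    feat = torch.cat([d2p * torch.log(d2p + 1e-6),
                      pts.new_ones(pts.size(0), 1), pts], dim=1)         # (H*W, K+3)
    return (feat @ W).view(out_h, out_w, 2)                              # input coords in [-1, 1]

class TPSRectifier(nn.Module):
    def __init__(self, num_fiducial=20, out_size=(32, 100)):
        super().__init__()
        self.out_size = out_size
        # Fixed base fiducial points along the top and bottom edges of the rectified image.
        half = num_fiducial // 2
        xs = torch.linspace(-1, 1, half)
        top = torch.stack([xs, torch.full_like(xs, -1.0)], dim=1)
        bottom = torch.stack([xs, torch.full_like(xs, 1.0)], dim=1)
        self.register_buffer("base_pts", torch.cat([top, bottom], dim=0))
        # Tiny localization network predicting the fiducial points on the input image.
        self.loc = nn.Sequential(
            nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, num_fiducial * 2), nn.Tanh(),
        )

    def forward(self, img):                                   # img: (B, 3, H, W)
        pred = self.loc(img).view(img.size(0), -1, 2)         # predicted fiducials in [-1, 1]
        grids = torch.stack([tps_grid(self.base_pts, p, *self.out_size) for p in pred])
        return F.grid_sample(img, grids, align_corners=True)  # rectified image I'

Because the module is just another differentiable layer in front of the backbone, it can be dropped ("deselected") without touching the rest of the pipeline, matching the optional use of TPS described above.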


B. VISUAL FEATURE EXTRACTION (VFE) AND 2D POSITIONAL ENCODING (2DPE)
According to recent research, ResNet [37] is gaining a lot of traction in the research community because of its unique features: handling the overfitting and vanishing-gradient issues, parameter efficiency, capturing well-defined feature representations and introducing the ''identity shortcut connection''. Therefore, we use ResNet as the CNN backbone for the VFE.

FIGURE 1. Overall pipeline of the proposed framework.

FIGURE 2. Visualization of the image normalization step: a) input image (I); b) fiducial points predicted on the input text image (I), represented by red markers; c) normalized image (I').

TABLE 1. ResNet feature extraction network configuration.

Table 1 illustrates the 50-layer residual network configuration, as used in [10], that we utilize for our feature extraction stage. The CNN processes the rectified image (I') to extract a compact feature representation F ∈ R^{W×H×c}, where W, H and c represent the rectified image's width, height and number of channels respectively. To reduce the encoding stage's computational cost, we use a 1×1 convolution to reduce the feature map channels, giving F' ∈ R^{W×H×d}, where d < c.

Conventional Transformers lack recurrence and convolution layers, so it is important to provide absolute or relative location information of the characters in the word or sentence to the model, which relies on the sequence's order for processing. To this end, we use 2D positional encoding maps P ∈ R^{W×H×d}, as in [16], to capture the spatial information. The input is a tensor F' of dimension H×W×d, and an attention score is used to correlate a query and a key at each position in the input F', resulting in the attention score tensor S with dimension H×W×d'. The multi-head self-attention layer can better capture 2D spatial information using the positionally encoded map F'. By introducing a fixed 2D positional encoding P(·) as given in Eq. (1) – Eq. (4), we generalize the original transformer's 1D encoding to be suitable for the 2D image feature:

PE(hor, ver, 2i) = sin(pos(hor) · fre_i)    (1)
PE(hor, ver, 2i + 1) = cos(pos(hor) · fre_i)    (2)
PE(hor, ver, 2j + ch/2) = sin(pos(ver) · fre_j)    (3)
PE(hor, ver, 2j + 1 + ch/2) = cos(pos(ver) · fre_j)    (4)

where pos(hor) and pos(ver) represent the horizontal and vertical positions respectively, fre_i, fre_j ∈ R are the 2D positional encoding signal's learnable frequencies, ch represents the number of channels in F', and i, j ∈ [0, d/4]. The position code P and the feature map F' are combined so that the character information at each position can be noticed; the 2D encoding map P is added to F', giving F'' = F' + P. The Transformer's encoder only takes a set of vectors as input, so it is necessary to vectorize the d channels of F'' and stack them together to create a single feature matrix M, as shown in Eq. (5). The function Mat2Vec converts each channel of the feature map into a vector, x_{i,j} = Mat2Vec(F''(:, :, j)) ∈ R^{1×HW}, and the vectors are stacked row-wise:

M = [x_{i,1}; x_{i,2}; …; x_{i,d}] ∈ R^{d×WH}    (5)
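A minimal PyTorch sketch of the 2D positional encoding of Eqs. (1)–(4) and the Mat2Vec flattening of Eq. (5) is shown below. It assumes a feature map F' of shape (B, d, H, W) produced by the 1×1 convolution, uses fixed sinusoidal frequencies (the paper treats fre_i, fre_j as learnable), and outputs the HW tokens of dimension d that the encoder consumes (i.e. the transpose of M); the function names are ours.

import math
import torch

def positional_encoding_2d(d: int, height: int, width: int) -> torch.Tensor:
    """Return P of shape (d, H, W): the first d/2 channels encode x, the last d/2 encode y."""
    assert d % 4 == 0, "channel count must be divisible by 4"
    pe = torch.zeros(d, height, width)
    d_half = d // 2
    freq = torch.exp(torch.arange(0, d_half, 2) * (-math.log(10000.0) / d_half))  # (d/4,)
    xs = torch.arange(width).unsqueeze(1) * freq    # (W, d/4)
    ys = torch.arange(height).unsqueeze(1) * freq   # (H, d/4)
    # Horizontal terms fill the first half of the channels, Eqs. (1)-(2).
    pe[0:d_half:2] = torch.sin(xs).t().unsqueeze(1).expand(-1, height, -1)
    pe[1:d_half:2] = torch.cos(xs).t().unsqueeze(1).expand(-1, height, -1)
    # Vertical terms fill the second half, Eqs. (3)-(4).
    pe[d_half::2] = torch.sin(ys).t().unsqueeze(2).expand(-1, -1, width)
    pe[d_half + 1::2] = torch.cos(ys).t().unsqueeze(2).expand(-1, -1, width)
    return pe

def flatten_features(feat: torch.Tensor) -> torch.Tensor:
    """Mat2Vec: (B, d, H, W) -> (B, H*W, d) token sequence for the Transformer encoder."""
    b, d, h, w = feat.shape
    pe = positional_encoding_2d(d, h, w).to(feat.device)
    feat = feat + pe                         # F'' = F' + P
    return feat.flatten(2).transpose(1, 2)   # each of the HW positions becomes one token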


C. ENCODER
The encoder is composed of two important layers: Multi-Head Self-Attention (MHSA) and Feed-Forward Network Layers (FFNL). The encoder takes the set of vectors (M) as its input and processes these vectors by passing them through the self-attention layer and the FFN layer in order; the output of the FFN layer is then sent to the next encoder. Around each of the two layers we use a residual connection followed by layer normalization, as shown in Fig. 3, i.e. LayerNorm(x + Sublayer(x)), where the function Sublayer(x) is implemented by the sublayers themselves. All sublayers in the model, as well as the embedding layers, generate outputs of dimension 512 to facilitate these residual connections.

FIGURE 3. Visualization of the vectors and layer normalization associated with the self-attention layer and the Feed-Forward Network.

1) MULTIHEAD SELF ATTENTION (MHSA)
In a transformer-based STR network, the Multi-Head Self-Attention (MHSA) mechanism is mainly used to manage size disparities in the text instances. The vector M is linearly transformed into three matrices: Query (Q), Key (K) and Value (V). The dot-product form is used to translate the vector calculation into a matrix calculation. The attention mechanism outputs the weighted sum of the value matrix V ∈ R^{W×d} (see Eq. (9)), where the weights are computed from the corresponding key matrix K ∈ R^{W×d} (see Eq. (8)) and query matrix Q ∈ R^{W×d} (see Eq. (7)). The queries, keys and values for the self-attention modules are all generated from the same sequence. The attention output matrix is derived as follows (see Eq. (6)):

Attention(Q, K, V) = softmax(QK^T / √d_k) V    (6)
Q = [q_1, q_2, …, q_w]^T, q_i = W_q x_i + b_q    (7)
K = [k_1, k_2, …, k_w]^T, k_i = W_k x_i + b_k    (8)
V = [v_1, v_2, …, v_w]^T, v_i = W_v x_i + b_v    (9)

where b and W_q, W_k, W_v are the bias terms and the weight matrices of the Query, Key and Value, respectively. In the scaling factor 1/√d_k, d_k is the dimension of the queries and keys; the scaling is used to prevent very small gradients of the softmax function.

The multi-head attention method involves projecting the queries, keys and values 'n' times with different learnable projection weights, allowing the model to collect useful information from several representation subspaces at the same time. In the Transformer, both dot-product attention and MHSA are effective in practice, with multi-head attention being a concatenation of dot-product attention heads, as given in Eq. (10) and Eq. (11). Multiple attention heads are employed in each transformer layer, and the multi-head attention mechanism improves the expressive capability of the attention layer. Before calculating attention, Q, K and V are partitioned into the heads and passed through distinct, learnt linear projections; after the attention calculation, the heads are joined together:

MHAttention(Q, K, V) = Concat(head_1, …, head_n) W^O    (10)
where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)    (11)

Using multiple attention heads has the advantage of allowing the model to learn to focus on various parts of the input image with each attention head at different phases of the encoding process, and it provides several ''representation subspaces'' to the attention layer.
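For reference, a compact PyTorch sketch of the scaled dot-product attention of Eq. (6) and the multi-head projection and concatenation of Eqs. (10)–(11) is given below; the module name and default sizes (d = 512, 16 heads) are illustrative rather than the authors' exact code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, num_heads: int = 16):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        # W_q, W_k, W_v of Eqs. (7)-(9) for all heads at once, plus the output projection W^O.
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:        # x: (B, T, d_model)
        b, t, _ = x.shape
        def split(y):                                           # (B, T, d) -> (B, h, T, d_k)
            return y.view(b, t, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5      # Eq. (6): QK^T / sqrt(d_k)
        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, t, -1)      # Concat(head_1, ..., head_h)
        return self.w_o(out)                                    # ... W^O, Eq. (10)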


2) OPTIMAL ADAPTIVE THRESHOLD-BASED SELF-ATTENTION (OATSA)
The self-attention operation in the classic Transformer architecture, on the other hand, has an evident flaw: it distributes credit to all context components. This is inappropriate, since a lot of credit can be given to information that is not relevant and should be discarded. For example, the traditional self-attention method calculates attention weights by multiplying the given query by the key from several modalities, and the weighted sum is then obtained by applying the attention matrix to the value. However, many irrelevant words may have only a minimal association with the encoded image attributes, leading to a very modest value after multiplying the provided query by the key. When attention scores are relatively close, a SOTA approach such as constrained local attention cannot filter irrelevant information and will break the long-term dependency.

FIGURE 4. a) Standard scaled dot-product attention. b) Proposed Optimal Adaptive Threshold-based Self-Attention (OATSA).

In our proposed work, we integrate a new threshold module, namely Optimal Adaptive Threshold-based Self-Attention (OATSA), into the standard self-attention calculation as shown in Fig. 4(b), replacing the standard scaled dot-product attention shown in Fig. 4(a), in order to discard irrelevant information and preserve the long-term dependencies. We place the OATSA module between the scale and softmax functions to concentrate the attention. Since the softmax function is dominated by the elements with higher numerical values, the OATSA module keeps the elements with higher numerical values and discards the elements with lower numerical values, as shown in Fig. 5. In the first step, we perform the dot-product operation between the query and the key to derive the attention matrix P, whose elements are denoted {p_11, p_12, …, p_mn}. The elements with higher values in the attention matrix P are assumed to be the most contributive elements. To aggregate focus, we choose the most contributive elements from each row of the attention matrix P. In the next step, we divide the attention matrix P into 'n' chunks and calculate the mean value of each chunk. The elements lower than the threshold, given by the multiplicative factor mean(p_iw)·t, are assigned negative infinity (see Eq. (11)). Based on this hypothesis, the elements set to negative infinity do not contain any relevant information, while the elements with higher numerical values contain closely relevant information. Finally, the negative-infinity values are replaced by zero by applying the softmax function (see Eq. (12)) to the thresholded attention matrix (OP_t). The working procedure of the OATSA module is given in Table 2. The proposed OATSA module remarkably eliminates noisy information.
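A hedged sketch of the OATSA thresholding step, as we read the description above, is shown below: the scaled score matrix is split row-wise into chunks, each chunk's mean scaled by a factor t serves as an adaptive threshold, sub-threshold scores are set to −∞, and the subsequent softmax maps them to zero. The chunk count (4) and t (0.67) follow the ablation study reported later in the paper; the authors' exact formulation may differ in detail.

import torch
import torch.nn.functional as F

def oatsa_attention(q, k, v, num_chunks: int = 4, t: float = 0.67):
    """q, k, v: (B, heads, T, d_k). Returns the thresholded attention output."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5            # scaled dot-product scores
    b, h, tq, tk = scores.shape
    assert tk % num_chunks == 0, "key length must be divisible by the chunk count"
    chunks = scores.reshape(b, h, tq, num_chunks, tk // num_chunks)
    thr = chunks.mean(dim=-1, keepdim=True) * t              # adaptive threshold per chunk
    chunks = chunks.masked_fill(chunks < thr, float("-inf")) # drop least contributive scores
    scores = chunks.reshape(b, h, tq, tk)
    return F.softmax(scores, dim=-1) @ v                      # softmax turns -inf into 0

Dropping this function in place of the plain softmax line of the multi-head attention sketch above yields the sparse, OATSA-style encoder attention described in this subsection.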


TABLE 2. Optimal Adaptive Threshold-based Self-Attention (OATSA) algorithm.

FIGURE 5. Working procedure of the proposed Adaptive Threshold-based attention mechanism, which obtains the most participative elements by assigning them higher probabilities. a) The attention matrix p_ij is obtained by taking the dot product between the key (K) and the query (Q). b) The attention matrix p_ij is split into 'n' (3) chunks. c) The mean value is computed for each chunk, and an element is set to −∞ if it is lower than the threshold value (0.7). d) The softmax function is applied to p_t to replace −∞ with 0; the final matrix contains the most contributive elements.

3) FEED FORWARD NETWORK LAYERS (FFNL)
To make the original transformer more robust in preserving the features produced by the encoder's multi-head self-attention mechanism, we employ an improved version of the FFN layer. The main objective of the FFN is to convert the attention vectors into a format that the following encoder or decoder layer can understand. The modified FFN is made up of two layers of 1 × 1 convolution with a ReLU activation function, followed by a residual connection. Attention vectors are taken ''one at a time'' by the Feed-Forward Network. The finest part is that, unlike with an RNN, each of these attention vectors is independent of the others. As a result, parallelization can be exploited here, which makes a huge impact: we can feed all of the tokens into the encoder block at the same time and obtain the set of encoded vectors for every position simultaneously.

D. DECODER
The decoder is made up of N identical transformer decoder layers, each of which contains three sub-layers. The embeddings of the decoded output sequence of characters are fed into the decoder. MHSA is used in the first sub-layer; the masking mechanism in this layer prevents the model from seeing future data, ensuring that the model only utilizes the previous characters to generate the current character. An MHSA layer without the masking technique makes up the second sub-layer; it applies multi-head attention over the first sub-layer's result and serves as the foundation for correlating text and image information through the self-attention layer. A position-wise fully connected FFN is incorporated in the third sub-layer. Following layer normalization, the Transformer establishes a residual connection around all three sub-layers. To convert the Transformer's output into probabilities for each character in the sequence, we attach a Fully Connected (FC) layer and a softmax layer at the top. All the characters in the phrase can be generated simultaneously, unlike with an LSTM. The top encoder's output is converted into a set of K and V attention vectors; these are utilized by each decoder layer in its ''encoder-decoder attention'' layer, which helps the decoder focus on the appropriate positions in the input sequence.
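A minimal sketch of such a masked decoder stack, built from PyTorch's stock TransformerDecoder for clarity, is given below. The paper's decoder additionally uses the OATSA attention and the direction embeddings introduced in the next subsection, so this is an assumption-laden skeleton rather than the authors' module; the class name and sizes are illustrative.

import torch
import torch.nn as nn

class CharDecoder(nn.Module):
    def __init__(self, num_classes: int = 37, d_model: int = 512,
                 num_heads: int = 16, num_layers: int = 3):
        super().__init__()
        self.embed = nn.Embedding(num_classes, d_model)
        layer = nn.TransformerDecoderLayer(d_model, num_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, num_classes)      # the FC + softmax head

    def forward(self, prev_chars, memory):
        # prev_chars: (B, T) previously decoded symbols; memory: (B, H*W, d) encoder output.
        t = prev_chars.size(1)
        # Causal mask: position i may only attend to positions <= i (the "masked" first sub-layer).
        causal = torch.triu(torch.full((t, t), float("-inf"), device=prev_chars.device), 1)
        x = self.decoder(self.embed(prev_chars), memory, tgt_mask=causal)
        return self.classifier(x)                              # logits over the 37-symbol alphabet

During training all target positions are evaluated in parallel thanks to the mask; at inference the decoder is run autoregressively, feeding each newly predicted character back in until ''<EOS>'' is produced.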


1) BI-DIRECTION EMBEDDING
The conventional seq2seq decoder generates output in only one direction, leaving the other direction uncaptured. For instance, in certain fonts, a decoder that recognizes the character sequence from L2R may have trouble choosing the initial letter between an upper-case 'I' and a lower-case 'l'; these initial characters are difficult to differentiate perceptually, and the decoder has no memory of previously deciphered characters. Such challenging characters can be recognized easily by an R2L decoder, since the succeeding characters suggest the initial character based on the preceding language context. Decoders that operate in opposite directions can therefore be beneficial.

We propose an architectural-level bi-directional decoder (see Fig. 6), which comprises decoders with opposing directions, to make use of the dependencies in both ways. The decoder is designed to predict texts from both directions (L2R and R2L). After running the decoder, two recognition results are generated. During inference, to aggregate the outcomes, we simply choose the hypothesis with the highest log-softmax recognition score, which is the total of all predicted symbols' recognition scores. In addition to positional embedding and token embedding, we introduce a direction embedding during decoding to add more contextual information. The framework is instructed to decipher the text string from L2R or R2L using this direction embedding; the same decoder architecture and constraints can be applied to either output-sequence processing order by adding the direction embedding. Similar to the position embedding, the direction embedding gives additional context information to the framework.

The direction embedding enables the network to decipher the text information not only in the L2R direction but also in the R2L direction. If decoding is performed in only one direction (L2R), the character loss for R2L-deciphered sequences cannot be decreased. We consider the two output-sequence decoding directions as two subtasks: each character sequence decoding (L2R and R2L) is a sub-task of the conditional output sequence algorithm. To channel the result into the correct decoding direction, we create two separate 512-d vectors at the beginning of training. Each scene text image is decoded twice during each training iteration, first in the L2R direction and then in the R2L direction. The reversed ground truth of the original transcription serves as the ground truth for the R2L-deciphered character sequence. The decoder achieves strong performance by combining the outputs of the two directions, which also helps the classifier to predict the right character. The total loss is the sum of the losses suffered in the L2R and R2L directions, as given in Eq. (13):

Loss_total = −(1/2) Σ_k [log P_l2r(y_k | I) + log P_r2l(y_k | I)]    (13)

where y_k, P_l2r, P_r2l and I represent the ground truth of the k-th character, the predicted result in the left-to-right direction, the predicted result in the R2L direction and the input image, respectively. We also include a supervisory branch that projects each visual component from the 'c' dimension to the number of alphabet classes in order to predict the character it belongs to, and we calculate the cross-entropy loss between the ground truth and this prediction.
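A sketch of the training objective of Eq. (13), assuming two decoder passes (L2R and R2L) that share the encoder memory, is shown below together with the log-softmax selection rule used at inference. The decoder handles, the padding convention and the helper names are placeholders for the paper's actual modules.

import torch
import torch.nn.functional as F

def bidirectional_loss(decoder_l2r, decoder_r2l, memory, targets, pad_idx: int = 0):
    """targets: (B, T) ground-truth character indices in L2R order (teacher forcing)."""
    logits_l2r = decoder_l2r(targets[:, :-1], memory)           # predict the next char, L2R
    rev = targets.flip(dims=[1])                                 # reversed ground truth for R2L
    logits_r2l = decoder_r2l(rev[:, :-1], memory)                # predict the next char, R2L
    loss_l2r = F.cross_entropy(logits_l2r.transpose(1, 2), targets[:, 1:], ignore_index=pad_idx)
    loss_r2l = F.cross_entropy(logits_r2l.transpose(1, 2), rev[:, 1:], ignore_index=pad_idx)
    return 0.5 * (loss_l2r + loss_r2l)                           # Eq. (13)

def pick_direction(logprobs_l2r, logprobs_r2l, seq_l2r, seq_r2l):
    """At inference, keep the decoding whose summed log-softmax score is higher."""
    return seq_l2r if logprobs_l2r.sum() >= logprobs_r2l.sum() else seq_r2l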


IV. EXPERIMENT
In this section, we exhaustively evaluate the performance of our model. Numerous experiments were performed on challenging STR benchmark datasets, including four regular datasets and three irregular datasets. The dataset descriptions are as follows. The results of these experiments show that our proposed framework comprehensively outperforms current SOTA methods.

A. DATASETS
In this paper, we train our framework with only two synthetic datasets: Synth90k [45] and SynthText [46]. We evaluated the significance and robustness of our proposed STR framework on seven standard benchmark datasets: four regular scene text datasets and three irregular scene text datasets.

Synth90k is the synthetic text dataset proposed in [45]. A total of 9 million word pictures were generated from a collection of 90k frequent English words. The entire dataset was used only for training. Each image in Synth90k has a word-level ground-truth annotation. These images were generated with the help of a synthetic text engine and are quite realistic.

SynthText is another synthetic text dataset used only for training, proposed in [46]. The process of image generation was similar to that of [45]. Unlike [45], the SynthText dataset was originally created for text detection; characters are rendered onto full-size images.

IIIT5K-Words [38] (IIIT5K) contains 3k cropped word test images collected from the internet. Each image is associated with a 50-word short lexicon and a 1,000-word long lexicon. A few words were created randomly and the rest were taken from the dictionary.

Street View Text [39] (SVT) comprises 647 cropped text pictures acquired from Google Street View (GSV). Each image has a 50-word lexicon. Most of the images in the SVT dataset are severely distorted, noisy, blurred and of low resolution.

ICDAR 2003 [40] (IC03) consists of 251 scene text images with text-labelled bounding boxes. For a fair comparison, we excluded word images containing non-alphanumeric characters or fewer than 3 characters, as suggested by Wang et al. [20]. The updated dataset comprises 867 cropped word pictures. Images in the IC03 dataset come with both a 50-word lexicon and a ''full lexicon''.

ICDAR 2013 [41] (IC13) inherits most of its image samples from its predecessor IC03. For a fair comparison, words with non-alphanumeric characters were removed from the dataset. The filtered test set contains 1015 cropped word images with no associated lexicon.

ICDAR 2015 [42] (IC15) consists of 6545 cropped text images, 4468 used for training and 2077 used for testing. No lexicon is associated with it. Most of the text in the word images in this dataset has irregular shapes: horizontal, oriented and curved. IC15 images were captured by Google Glasses without proper positioning and focusing.

SVT-Perspective [43] (SVT-P) consists of 645 cropped word pictures. Images in the SVT-P dataset come with both a 50-word lexicon and a ''full lexicon''. SVT-P images were collected from GSV, mostly captured at a side-view angle; therefore, the images are heavily distorted, noisy, blurred and of low resolution.

CUTE80 [44] (CUTE) contains a collection of 80 high-resolution images taken in naturalistic environments, providing 288 cropped word pictures for testing. CUTE is the most challenging dataset, since most of the word images consist of arbitrarily shaped letters. No lexicon is associated with this dataset. It was collected with the intent of evaluating the performance of irregular STR.

B. IMPLEMENTATION DETAILS
The specifications of the ResNet-based CNN architecture for robust text feature extraction are shown in Table 1. We implemented our method using the PyTorch framework. A single NVIDIA GTX-1080Ti GPU with 12 GB memory was used to conduct all the experiments. We trained our model completely from scratch using the synthetic images of SynthText and Synth90k. The training data consists of 14 million synthetic images, 6 million from Gupta et al. [45] and 8 million from Jaderberg et al. [23]; no additional data was utilized. In our tests, we do not employ any geometric-level or pixel-level labelling. The model is trained using only synthetic text, with no fine-tuning for each dataset. The AdaDelta optimization strategy is used to train our model with a batch size of 64. The learning rate is initially set to 1.0, as suggested by Shi et al. [4], and decays to 0.1 at step 0.6 M and to 0.01 at step 0.8 M. Although the AdaDelta optimizer's learning rate is adaptive, we find that slower learning rates lead to better convergence. The proposed model classifies 37 classes: 26 letters (a-z), 10 digits (0-9) and a special ''<EOS>'' symbol.
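A minimal sketch of this training configuration (37-symbol charset, AdaDelta with an initial learning rate of 1.0 decayed to 0.1 at 0.6 M steps and 0.01 at 0.8 M steps, batch size 64) is given below; the scheduler is stepped once per training iteration, and the helper names are ours rather than the authors'.

import string
import torch

CHARSET = list(string.ascii_lowercase) + list(string.digits) + ["<EOS>"]   # 37 classes
CHAR2IDX = {c: i for i, c in enumerate(CHARSET)}
BATCH_SIZE = 64

def build_optimizer(model: torch.nn.Module):
    optimizer = torch.optim.Adadelta(model.parameters(), lr=1.0)
    # Piecewise-constant decay: x0.1 at 600k iterations and x0.1 again at 800k iterations.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[600_000, 800_000], gamma=0.1)
    return optimizer, scheduler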


TABLE 3. Performance comparison of different modified CNN backbones. The modified ResNet50 captures well-defined feature representations and provides an excellent balance between model size and accuracy.

TABLE 4. Performance comparison when adding or removing decoder blocks and attention heads in the proposed framework. Increasing the number of heads decreases the performance marginally due to overfitting (with N=4, H=16).

C. ABLATION STUDIES
For the image encoding and feature extraction, we experimented with various CNN models (see Table 3), such as VGG16, ResNet18, ResNet34, ResNet50 and ResNet164. Among these, ResNet50 provides an excellent balance between model size and accuracy and captures well-defined feature representations. Hence, we choose ResNet50 as our CNN backbone.

After numerous experiments, the number of attention heads in the encoder and decoder is kept at 16. Similarly, the number of decoder blocks is varied across runs and finally set to 3. The results suggest that N = 3 yields the best outcome for our model, as shown in Table 4. This contradicts the Transformer's experimental results, which suggest that utilizing additional blocks improves performance for language translation and irregular text recognition. Increasing the number of heads leads to an overfitting problem.

Self-attention is critical in many seq2seq activities, such as chatbots and language translation, because it is capable of capturing long-term relationships. We explored the effect of the improved self-attention block in our proposed framework for regular and irregular STR. To improve the depiction of relationships between distant image patches, we initially built a self-attention layer on top of the convolutional layers. Conversely, we dropped the self-attention layer from the decoder to analyze the influence of the self-attention layer on the decoder side. The recognition accuracy of the reduced model is marginally lower than that of the standard system (91.6% vs. 97.7% on the IIIT5K dataset and 83.3% vs. 90.6% on the SVT-P dataset), as shown in Table 5, but it is still comparable to earlier approaches.

TABLE 5. Encoder and decoder performance comparison with and without the self-attention block. Comparing rows 1 and 2, dropping the self-attention block from the decoder side of our framework produces a significant performance drop. Rows 2 and 3 illustrate that the self-attention block on the encoder side brings a slight improvement.

In contrast to language translation approaches, we identified that applying the self-attention mechanism in STR has a limited impact on performance. We believe there are three possible reasons. Firstly, the character sequences in standard STR tasks are often shorter than those required for machine translation. Secondly, the CNN-based encoder already represents the long-range relationships effectively; for example, the receptive field produced by ResNet50's final feature layer has a great influence on long-term dependencies. Finally, self-attention is usually employed in machine translation to represent the relationships between words in a phrase or even a paragraph, where rich syntactic and semantic relationships exist even between words that are far apart. In contrast, each input image in STR generally comprises a single word, and the self-attention module is mainly employed to represent the character relationships within an input text; the ties that bind the letters of a word are usually weaker than those that bind the words of a sentence. This may clarify why self-attention does not help to enhance irregular text recognition performance.

We analyze the recognition accuracies of several decoders to assess the efficacy of the bidirectional decoder. The Normal decoder only reads text in the L2R direction; Reversed recognizes text only in the R2L direction; the Bidirectional decoder works in both the L2R and R2L directions and chooses the result with the greater recognition score. In many cases, the Normal and Reversed decoders produce equivalent accuracies, as shown in Table 6.

TABLE 6. Experiments with various decoders. ''Normal'' denotes the L2R direction, ''Reversed'' the R2L direction and ''Bidirectional'' a combination of the two.

Normal surpasses Reversed on SVT, IC15 and CUTE, while Reversed excels on IIIT5K, IC03 and SVT-P. Even in the worst case, the difference in recognition accuracy between the Normal and Reversed decoders is minimal, and when they are combined they provide a significant performance improvement.

We also carried out experiments to see how text rectification impacted our framework. As the text rectification approach, we employed the image normalization technique described above. Even without an image normalization block, the proposed Transformer-based 2D-attention mechanism can locate individual characters scattered in 2D space; in this context, the image normalization block has minimal effect on our framework (see Table 7).

TABLE 7. Performance of the proposed method with and without the image transformation technique.

We provide a novel technique, Optimal Adaptive Threshold-based Self-Attention (OATSA), that effectively yields an explicitly sparse Transformer. The performance and importance of the OATSA technique are shown in Table 8. The OATSA technique preserves the long-term dependencies, which are defined by the distribution of neighbouring nodes, and it focuses the attention of the standard Transformer on the most contributive components. The Optimal Adaptive Threshold is integrated into self-attention and also acts as an attention mechanism in the decoder, allowing the model to produce more accurate words.


P. Selvam et al.: Transformer-Based Framework for Scene Text Recognition

TABLE 8. Performance comparison among four variations. ResNet50 is used as the proposed model’s feature extraction module. Rectification and no
rectification indicate the image normalization step was performed and the image normalization step not performed. 2D positional encoding represents
the model that performs recognition tasks; it keeps tracking the character position in each iteration. The OATSA represent the Optimal Adaptive
Threshold-based Self-Attention Algorithm. Different decoders are ‘‘Normal’’ denotes an L2R direction, ‘‘Reversed’’ signifies an R2L direction and
‘‘Bidirectional’’ denotes a combination of them.

FIGURE 6. The attention heat maps provide a visual representation of the 2D attention weights obtained from all of the decoding stages on a
standard benchmark.

After extensive comparative experiments, the optimal values of w and t are 4 and 0.67, respectively. Fig. 6 shows the visual representation of the 2D attention weights obtained from all of the decoding stages on a standard benchmark.

D. COMPARISONS WITH EXISTING METHODS
In this section, we compare the effectiveness and robustness of the proposed method against current SOTA methods on a variety of regular and irregular text benchmark datasets, setting the number of decoder blocks (N) to 3, the number of heads (H) to 16 and d = 1024. To be fair, we only list performance results obtained in lexicon-free mode, and most of these existing approaches are trained on the same datasets. The proposed Transformer-based scene text recognition model was compared with 21 existing methods on both regular and irregular datasets, and the results are shown in Table 9. The proposed framework comfortably outperforms the current SOTA approaches of Yang et al. [36] and Lu et al. [48] on standard datasets such as SVT, IC13, SVT-P, CUTE and IC15. The proposed method produces better recognition accuracy on regular text datasets such as IC03, IC13 and IC15. We do not compare the results of Liao et al. [49], since they employed extra word images with character-level annotations for model training. Note that Litman et al. [26] and Lu et al. [48] use an additional image dataset, SynthAdd [47], for training and achieved better recognition accuracies of 86.9% and 84.5% on SVT-P and 87.5% on CUTE80. Zhang et al. [31] use the Wiki dataset for additional training to achieve the top performance on CUTE80. Still, the proposed method outperforms Litman et al. [26], Lu et al. [48] and Zhang et al. [31] on IC03, IC13, and IC15. The recognition accuracy of the proposed method on all datasets substantially surpasses that of linguistic-based approaches, notably on irregular texts (leading by +3.4% on IC15 and +2.8% on CUTE datasets).


TABLE 9. On several benchmarks, the overall performance of our STR model is compared with that of previous state-of-the-art approaches. All values are expressed as a percentage (%), and all outcomes are in lexicon-free mode, denoted ‘‘None’’. ‘‘90K’’, ‘‘ST’’, ‘‘SA’’ and ‘‘Wiki’’ stand for Synth90K, SynthText, SynthAdd and Wikitext-103, respectively; ‘‘word’’ and ‘‘char’’ denote the use of word-level or character-level annotations; and ‘‘self’’ denotes the use of a self-designed convolution network or self-made synthetic datasets.

Our method outperforms the prior SOTA approach of Yang et al. [36] by a margin of +7.7% on SVT, +4.8% on IC13, +14.2% on IC15, +9.7% on SVT-P and +5.9% on CUTE. This significant improvement validates the effectiveness of our method. We are only 0.6% behind Zhang et al. [31] on CUTE (91.3% vs. 91.9%).


FIGURE 7. Illustration of success and failure cases of the proposed method. ‘‘GT’’ stands for ‘‘Ground Truth,’’ ‘‘Pred’’ stands for ‘‘Prediction’’. Blurry,
low resolution (LR) and illumination are some of the reasons for failure.

In addition, it is worth highlighting that the IIIT5K dataset contains plenty of images with background noise; still, the proposed approach achieves the highest recognition accuracy of 97.7%. Although our model employs only word-level annotations, it outperforms the character-level model of Liao et al. [49] on IIIT5K (97.7% vs. 91.9%). Many samples in the IC15 dataset are not horizontally positioned, which is beyond the scope of the present study; as a result, we normalized these samples based on the image ratio. Most of the image samples in the SVT-Perspective dataset are perspective-distorted and of low resolution. Although the proposed method performs on par with the model presented by Liao et al. [12], it obtains SOTA performance in the majority of SVT-Perspective scenarios. As shown in Fig. 7, our method outperforms Yang et al. [36] on the STR problem, and the proposed framework is capable of recognizing blurred word images and irregular text. Our model learns both self-attention and input-output attention, where the encoder and decoder both preserve feature-feature and target-target interactions. This makes the intermediate representations more resistant to spatial distortion. Furthermore, our approach considerably reduces the issue of attention drifting.

V. CONCLUSION
In this paper, we presented a new, simple yet powerful framework for both regular and irregular STR based on the transformer architecture. The proposed framework breaks down into four modules: image transformation, feature extraction, encoder and decoder. First, the transformation module utilizes a Thin Plate Spline (TPS) transformation to normalize an irregular or arbitrary word image into a more readable word image, which greatly helps to reduce the complexity of extracting text features. Second, the Visual Feature Extraction (VFE) module uses ResNet as the CNN backbone to extract well-defined feature representations and extends the standard transformer's 1D Positional Encoding (1DPE) to a 2D Positional Encoding (2DPE) that captures the order of sequential information from the 2D rectified word image. Third, the Multi-Head Self-Attention (MHSA) and Feed-Forward Network Layers (FFNL) in the encoder module perform feature aggregation and feature transformation concurrently. Finally, the decoder module combines the proposed Optimal Adaptive Threshold-based Self-Attention (OATSA) model with an architecture-level bidirectional decoding approach, which greatly helps the framework generate a more accurate character sequence. The OATSA model replaces the standard Scaled Dot-Product Attention; moreover, it can be used in both the encoder and decoder modules to filter noisy information effectively and choose the most contributive components to focus on image text regions. The proposed framework is trained with word-level annotations, and it can handle words of any length in lexicon-free mode. Comprehensive experimental results on challenging standard benchmarks, including IIIT5K-Words, Street View Text, CUTE80 and the ICDAR datasets, show that our methodology outperforms SOTA approaches.
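As a companion to the 2D positional encoding summarized above, the sketch below shows one common sinusoidal construction that concatenates row-wise and column-wise 1D encodings. It is an illustrative assumption rather than the exact 2DPE used in our framework, and the 8 x 25 feature-map size in the example is likewise hypothetical (only d = 1024 matches our configuration).

```python
import math
import torch

def sinusoidal_1d(length: int, dim: int) -> torch.Tensor:
    """Standard 1D sinusoidal positional encoding of shape (length, dim)."""
    position = torch.arange(length, dtype=torch.float32).unsqueeze(1)        # (L, 1)
    div_term = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / dim))                       # (dim/2,)
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

def sinusoidal_2d(height: int, width: int, dim: int) -> torch.Tensor:
    """2D positional encoding: half of the channels encode the row index and
    the other half the column index. Returns shape (height, width, dim)."""
    assert dim % 4 == 0, "dim must be divisible by 4 for this construction"
    pe_h = sinusoidal_1d(height, dim // 2).unsqueeze(1).expand(height, width, dim // 2)
    pe_w = sinusoidal_1d(width, dim // 2).unsqueeze(0).expand(height, width, dim // 2)
    return torch.cat([pe_h, pe_w], dim=-1)

# Illustrative use: an 8 x 25 rectified feature map with d = 1024 channels.
pe = sinusoidal_2d(8, 25, 1024)   # added element-wise to the CNN feature map
```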


REFERENCES
[1] Y. Zhu, C. Yao, and X. Bai, ‘‘Scene text detection and recognition: Recent advances and future trends,’’ Frontiers Comput. Sci., vol. 10, no. 1, pp. 19–36, 2016.
[2] Q. Ye and D. Doermann, ‘‘Text detection and recognition in imagery: A survey,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 7, pp. 1480–1500, Jul. 2015.
[3] S. Long, X. He, and C. Yao, ‘‘Scene text detection and recognition: The deep learning era,’’ Int. J. Comput. Vis., vol. 129, no. 1, pp. 1–26, 2018.
[4] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, ‘‘ASTER: An attentional scene text recognizer with flexible rectification,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2035–2048, Jun. 2019.
[5] B. Shi, X. Wang, P. Lyu, C. Yao, and X. Bai, ‘‘Robust scene text recognition with automatic rectification,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4168–4176.
[6] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, ‘‘Edit probability for scene text recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2018, pp. 1508–1516.
[7] B. Shi, X. Bai, and C. Yao, ‘‘An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2016.
[8] Y. Gao, Y. Chen, J. Wang, M. Tang, and H. Lu, ‘‘Reading scene text with fully convolutional sequence modeling,’’ Neurocomputing, vol. 339, pp. 161–170, Apr. 2019.
[9] P. Selvam and J. A. S. Koilraj, ‘‘A deep learning framework for grocery product detection and recognition,’’ Food Anal. Methods, to be published, doi: 10.1007/s12161-022-02384-2.
[10] Y. Baek, S. Shin, J. Baek, S. Park, J. Lee, D. Nam, and H. Lee, ‘‘Character region attention for text spotting,’’ in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2020, pp. 504–521.
[11] S. Prabu and K. J. A. Sundar, ‘‘Enhanced attention-based encoder–decoder framework for text recognition,’’ Intell. Automat. Soft Comput., vol. 35, no. 2, pp. 2071–2086, 2023.
[12] M. Liao, P. Lyu, M. He, C. Yao, W. Wu, and X. Bai, ‘‘Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 43, no. 2, pp. 532–548, Feb. 2021.
[13] M. Jaderberg, ‘‘Spatial transformer networks,’’ in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), 2015, pp. 2017–2025.
[14] D. Yu, X. Li, C. Zhang, J. Han, J. Liu, and E. Ding, ‘‘Towards accurate scene text recognition with semantic reasoning networks,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Seattle, WA, USA, Jun. 2020, pp. 12110–12119.
[15] C. Luo, L. Jin, and Z. Sun, ‘‘MORAN: A multi-object rectified attention network for scene text recognition,’’ Pattern Recognit., vol. 90, pp. 109–118, Jun. 2019.
[16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, ‘‘Attention is all you need,’’ 2017, arXiv:1706.03762.
[17] C. Yao, X. Bai, and W. Liu, ‘‘A unified framework for multioriented text detection and recognition,’’ IEEE Trans. Image Process., vol. 23, no. 11, pp. 4737–4749, Nov. 2014.
[18] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, ‘‘Object segmentation based on the integration of adaptive K-means and GrabCut algorithm,’’ in Proc. Int. Conf. Wireless Commun. Signal Process. Netw. (WiSPNET), Mar. 2022, pp. 213–216, doi: 10.1109/WiSPNET54241.2022.9767099.
[19] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, ‘‘End-to-end text recognition with convolutional neural networks,’’ in Proc. Int. Conf. Pattern Recognit. (ICPR), Nov. 2012, pp. 3304–3308.
[20] K. Wang, B. Babenko, and S. Belongie, ‘‘End-to-end scene text recognition,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 1457–1464.
[21] J.-H. Seok and J. H. Kim, ‘‘Scene text recognition using a Hough forest implicit shape model and semi-Markov conditional random fields,’’ Pattern Recognit., vol. 48, no. 11, pp. 3584–3599, 2015.
[22] C. Yao, X. Bai, B. Shi, and W. Liu, ‘‘Strokelets: A learned multi-scale representation for scene text recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 4042–4049.
[23] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, ‘‘Synthetic data and artificial neural networks for natural scene text recognition,’’ in Proc. Adv. Neural Inf. Process. Syst. Workshop, 2014, pp. 1–10.
[24] Q. Lin, C. Luo, L. Jin, and S. Lai, ‘‘STAN: A sequential transformation attention-based network for scene text recognition,’’ Pattern Recognit., vol. 111, pp. 1–9, Mar. 2021.
[25] F. Zhan and S. Lu, ‘‘ESIR: End-to-end scene text recognition via iterative image rectification,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2019, pp. 2054–2063.
[26] R. Litman, O. Anschel, S. Tsiper, R. Litman, S. Mazor, and R. Manmatha, ‘‘SCATTER: Selective context attentional scene text recognizer,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 11962–11972.
[27] W. Liu, C. Chen, and K.-Y. K. Wong, ‘‘Char-Net: A character-aware neural network for distorted scene text recognition,’’ in Proc. Assoc. Advancement Artif. Intell., 2018, pp. 7154–7161.
[28] Z. Cheng, Y. Xu, F. Bai, Y. Niu, S. Pu, and S. Zhou, ‘‘AON: Towards arbitrarily-oriented text recognition,’’ in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., Jun. 2018, pp. 5571–5579.
[29] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou, ‘‘Focusing attention: Towards accurate text recognition in natural images,’’ in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Venice, Italy, Oct. 2017, pp. 5086–5094.
[30] C. Wang and C.-L. Liu, ‘‘Multi-branch guided attention network for irregular text recognition,’’ Neurocomputing, vol. 425, pp. 278–289, Feb. 2021.
[31] Y. Zhang, Z. Fu, F. Huang, and Y. Liu, ‘‘PMMN: Pre-trained multi-modal network for scene text recognition,’’ Pattern Recognit. Lett., vol. 151, pp. 103–111, Nov. 2021.
[32] L. Dong, S. Xu, and B. Xu, ‘‘Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,’’ in Proc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), Apr. 2018, pp. 5884–5888.
[33] A. W. Yu, D. Dohan, T. Luong, R. Zhao, K. Chen, and Q. Le, ‘‘QANet: Combining local convolution with global self attention for reading comprehension,’’ in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–16.
[34] M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and Ł. Kaiser, ‘‘Universal transformers,’’ in Proc. Int. Conf. Learn. Represent., 2019, pp. 1–23.
[35] Y. Chen, H. Shu, W. Xu, Z. Yang, Z. Hong, and M. Dong, ‘‘Transformer text recognition with deep learning algorithm,’’ Comput. Commun., vol. 178, pp. 153–160, Oct. 2021.
[36] L. Yang, P. Wang, H. Li, Z. Li, and Y. Zhang, ‘‘A holistic representation guided attention network for scene text recognition,’’ Neurocomputing, vol. 414, pp. 67–75, 2020.
[37] K. He, X. Zhang, S. Ren, and J. Sun, ‘‘Deep residual learning for image recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Las Vegas, NV, USA, Jun. 2016, pp. 770–778.
[38] A. Mishra, K. Alahari, and C. V. Jawahar, ‘‘Scene text recognition using higher order language priors,’’ in Proc. Brit. Mach. Vis. Conf., Surrey, U.K., 2012, pp. 127.1–127.11.
[39] K. Wang, B. Babenko, and S. Belongie, ‘‘End-to-end scene text recognition,’’ in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 1457–1464.
[40] S. M. Lucas, ‘‘ICDAR 2003 robust reading competitions: Entries, results, and future directions,’’ Int. J. Document Anal. Recognit. (IJDAR), vol. 7, nos. 2–3, pp. 105–122, 2005.
[41] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L. G. I. Bigorda, S. R. Mestre, J. Mas, D. F. Mota, J. A. Almazan, and L. P. D. L. Heras, ‘‘ICDAR 2013 robust reading competition,’’ in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Washington, DC, USA, 2013, pp. 1484–1493.
[42] D. Karatzas, ‘‘ICDAR 2015 competition on robust reading,’’ in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Tunis, Tunisia, Aug. 2015, pp. 1156–1160.
[43] T. Q. Phan, P. Shivakumara, S. Tian, and C. L. Tan, ‘‘Recognizing text with perspective distortion in natural scenes,’’ in Proc. IEEE Int. Conf. Comput. Vis., Sydney, NSW, Australia, Dec. 2013, pp. 569–576.
[44] A. Risnumawan, P. Shivakumara, C. S. Chan, and C. L. Tan, ‘‘A robust arbitrary text detection system for natural scene images,’’ Expert Syst. Appl., vol. 41, no. 18, pp. 8027–8048, 2014.
[45] A. Gupta, A. Vedaldi, and A. Zisserman, ‘‘Synthetic data for text localisation in natural images,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2315–2324.


[46] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, ‘‘Synthetic data and artificial neural networks for natural scene text recognition,’’ in Proc. Adv. Neural Inf. Process. Syst. Workshop, 2014, pp. 1–10.
[47] H. Li, P. Wang, C. Shen, and G. Zhang, ‘‘Show, attend and read: A simple and strong baseline for irregular text recognition,’’ in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8610–8617.
[48] N. Lu, W. Yu, X. Qi, Y. Chen, P. Gong, R. Xiao, and X. Bai, ‘‘MASTER: Multi-aspect non-local network for scene text recognition,’’ Pattern Recognit., vol. 117, pp. 1–10, Sep. 2021.
[49] M. Liao, J. Zhang, Z. Wan, F. Xie, J. Liang, P. Lyu, C. Yao, and X. Bai, ‘‘Scene text recognition from two-dimensional perspective,’’ in Proc. AAAI Conf. Artif. Intell., 2019, pp. 8714–8721.
[50] Y. Huang, Z. Sun, L. Jin, and C. Luo, ‘‘EPAN: Effective parts attention network for scene text recognition,’’ Neurocomputing, vol. 376, pp. 202–213, Feb. 2020.
[51] Y. Wu, J. Fan, R. Tao, J. Wang, H. Qin, A. Liu, and X. Liu, ‘‘Sequential alignment attention model for scene text recognition,’’ J. Vis. Commun. Image Represent., vol. 80, pp. 1–8, Oct. 2021.

PRABU SELVAM received the B.E. degree in computer science and engineering from the Shirdi Sai Engineering College, in 2011, and the M.E. degree in computer science and engineering from the Sathyabama Institute of Science and Technology, in 2013. He is currently working as a Research Scholar with the School of Computing, SASTRA Deemed University, Thanjavur. His current research interests include pattern recognition and computer vision. He carried out the coding and the implementation of the ideas.

JOSEPH ABRAHAM SUNDAR KOILRAJ received the Ph.D. degree from SASTRA Deemed University, Thanjavur, in 2017. He is currently working as an Assistant Professor with the School of Computing, SASTRA Deemed University. He has published and presented more than 20 technical papers in international/national journals and conferences. His current research interests include image processing and pattern recognition. In the current study, he analyzed, interpreted, and evaluated the experimental results.

CARLOS ANDRÉS TAVERA ROMERO (Member, IEEE) received the degree in system engineering and the Ph.D. degree in computer science engineering from the Universidad del Valle, Cali. Since 1998, he has been a Teacher and a Project Tutor for undergraduate students and a Tutor for master's and Ph.D. students' projects. He is currently a full-time Professor at the Universidad Santiago de Cali, Cali, Colombia, and an Information Systems Developer with various registered products. He is also the Leader of the information systems development research line at the COMBA R&D Laboratory, Universidad Santiago de Cali.

MESHAL ALHARBI received the M.Sc. degree in computer science from Wayne State University, USA, in 2014, and the Ph.D. degree in computer science from Durham University, U.K., in 2020. He has ten years of experience in teaching/research/industry. He is currently an Assistant Professor of artificial intelligence with the Department of Computer Science, Prince Sattam Bin Abdulaziz University, Saudi Arabia. His research interests include artificial intelligence applications and algorithms, agent-based modeling and simulation applications, disaster/emergency management and resilience, optimization applications, and machine learning.

ABOLFAZL MEHBODNIYA (Senior Member, IEEE) received the Ph.D. degree from the INRS-EMT University of Quebec, Montreal, Canada, in 2010. He is currently an Associate Professor and the Head of the Department of Electronics and Communication Engineering (ECE), Kuwait College of Science and Technology. Before coming to KCST, he worked as a Marie-Curie Senior Research Fellow at University College Dublin, Ireland. Prior to that, he worked as an Assistant Professor at Tohoku University, Japan, and as a Research Scientist at the Advanced Telecommunication Research (ATR) International, Kyoto, Japan. His research interests include communications engineering, the IoT and artificial intelligence in wireless networks, and real-world applications. He received numerous awards, including the JSPS Young Faculty Startup Grant, the KDDI Foundation Grant, the Japan Radio Communications Society (RCS) Active Researcher Award, the European Commission Marie Skłodowska-Curie Fellowship, and the NSERC Visiting Fellowships in Canadian Government Laboratories. He is a Senior Member of IEICE.

JULIAN L. WEBBER (Senior Member, IEEE) received the Ph.D. degree from Bristol University, in 2004, followed by postdoctoral research on wireless communications at Hokkaido University from 2007. In 2012, he joined the Advanced Telecommunications Research Institute International, Kyoto. Since 2018, he has been a Visiting Researcher and a Research Assistant Professor at Osaka University. He is currently an Associate Professor and the Head of the Department of Electronics and Communication Engineering (ECE), Kuwait College of Science and Technology. His research interests include communications engineering, machine learning, and signal and image processing, with an emphasis on real-time implementation. He is a member of IEICE.

SUDHAKAR SENGAN (Member, IEEE) received the M.E. degree from the Faculty of Computer Science and Engineering, Anna University, Chennai, Tamil Nadu, India, in 2007, and the Ph.D. degree in information and communication engineering from Anna University. He has 20 years of experience in teaching/research/industry. He is currently working as a Professor and the Director of International Relations at the Department of Computer Science and Engineering, PSN College of Engineering and Technology (Autonomous), Tirunelveli, Tamil Nadu, India. He has guided more than 100 projects for UG and PG students in engineering streams. He is a Recognized Research Supervisor at Anna University under the Information and Communication Engineering Faculty. He has published papers in 140 international journals, 20 international conferences, and ten national conferences. He has published three textbooks for the Anna University, Chennai syllabus. He has filed 20 Indian and three international patents in various fields of interest. His research interests include security, MANET, the IoT, cloud computing, and machine learning. He is a member of various professional bodies, such as MISTE, MIEEE, MIAENG, MIACSIT, MICST, MIE, and MIEDRC. He received the Award of Honorary Doctorate (Doctor of Letters, D.Litt.) from International Economics University, SAARC Countries, in Education and Students Empowerment, in April 2017.
