Article
Scene Text Recognition Based on Improved CRNN
Wenhua Yu 1,2, Mayire Ibrayim 1,2,* and Askar Hamdulla 1,3
1 College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China;
yuwenhua@stu.xju.edu.cn (W.Y.); askar@xju.edu.cn (A.H.)
2 Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China
3 Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi 830017, China
* Correspondence: mayire401@xju.edu.cn; Tel.: +86-133-1988-9043
Abstract: Text recognition is an important research topic in computer vision. Scene text, which refers to text in real scenes, sometimes needs to attract attention and may therefore be deformed. At the same time, the image acquisition process is affected by factors such as occlusion and noise, making scene text recognition more challenging. In this paper, we improve the CRNN model for text recognition, which has relatively low accuracy, performs poorly on irregular text, and obtains text sequence information from only a single aspect, resulting in incomplete information acquisition. Firstly, to address the low text recognition accuracy and poor recognition of irregular text, we add label smoothing to ensure the model's generalization ability. Then, we introduce the smoothing loss function from speech recognition into the field of text recognition and add a language model to increase the channels of information acquisition, ultimately improving text recognition accuracy. The method was experimentally verified on six public datasets and compared with other advanced methods. The experimental results show that it performs well in most benchmark tests and that the improved model outperforms the original model in recognition performance.
Keywords: CRNN; text recognition; label smoothing; language model; deep learning
1. Introduction
Text recognition is an important direction in the field of computer vision. With the continuous development of deep learning fields such as computer vision, pattern recognition, and machine learning, scene text recognition based on deep learning has developed on this basis. Text recognition can be divided into two branches according to the recognition algorithm: segmentation-based recognition algorithms and recognition algorithms that do not require segmentation. A segmentation-based natural scene text recognition algorithm usually needs to locate each character contained in the input text image, identify each character through a single-character recognizer, and then combine all the characters into a string sequence to obtain the final recognition result. Natural scene text recognition algorithms without segmentation treat the entire text line as a whole and directly map the input text image to a sequence of target strings, thus avoiding the disadvantages and performance limitations of single-character segmentation; this is also the current mainstream approach [1]. In the process of text recognition, a series of labels is usually predicted, and the whole recognition process can be regarded as a sequence recognition problem [2]. The CRNN [2] algorithm, a segmentation-free sequence recognition method, is a neural network that integrates feature extraction, sequence modeling, and transcription. The feature map is first extracted using a convolutional neural network (CNN); then the feature dependencies are captured using a recurrent neural network (RNN), predictions are made over the features, and the output prediction distribution is fed to connectionist temporal classification (CTC) [3] for processing before the final text sequence is output.
Because RNN models are an important branch of deep neural networks, they are designed primarily to process sequences. To learn text sequences directly using RNN models, there are two requirements: the mapping relationship between input and output sequences needs to be labeled in advance, and that mapping must be a one-to-one correspondence. Because text and speech signals are continuous signals, they are difficult to segment, and the volume of text recognition data is in the millions; segmentation and annotation would be costly, time-consuming, and impractical. Therefore, an RNN cannot be applied directly to text recognition, so the CRNN network adds the CTC proposed by Alex Graves et al. [3] after the RNN to make the RNN applicable to text recognition. The CTC algorithm is an end-to-end training method for RNNs. It extends the output layer of an RNN by: converting the data dependency of segmentation and label mapping into feature extraction over a sliding time window, transforming the input–output relationship from one-to-one to many-to-one; adding blank characters and performing deduplication and blank-removal operations on consecutive identical characters in the output sequence; reducing complexity and increasing speed by drawing on the forward–backward algorithm of the hidden Markov model (HMM) to compute the loss function; and using dynamic programming to compute the training paths, avoiding impractical exhaustive or brute-force enumeration. The decoding process of CTC maps the path generated by CTC into a final sequence. Combining these features, an example of a final sequence mapped after deduplication and removal of blank characters using CTC is shown in Figure 1.
Figure 1. Conversion correspondence diagram. The RNN output path "- - S S - T T - A A - T T - E E - -" is collapsed by CTC (merging repeats and removing blanks) into the final sequence "STATE".
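To make the collapsing rule concrete, the following minimal Python sketch (an illustration, not code from the paper) reproduces the mapping in Figure 1: consecutive repeats are merged first, and blank symbols are then removed.

```python
def ctc_collapse(path: str, blank: str = "-") -> str:
    # CTC decoding rule: merge consecutive repeats, then drop blanks.
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("--SS-TT-AA-TT-EE--"))  # -> STATE, as in Figure 1
```

The blank symbol is what allows CTC to distinguish a genuinely doubled letter (which requires a blank between the two repeats) from several consecutive frames of the same character.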
CRNN models have low text recognition accuracy [4,5], poor recognition of irregular text, and incomplete information acquisition, since they obtain text sequence information from a single level. Commonly used improvement schemes mostly focus on the two networks, CNN and RNN, analyzing the network's inadequate feature extraction, the presence of exploding and vanishing gradients [6], and the poor recognition of indefinitely long text sequences. Targeted replacements have been made to improve feature extraction networks, replacing recurrent neural networks with long short-term memory networks [7] or adding residual modules. Although better performance is achieved, information acquisition remains incomplete and limited to the text domain only. Therefore, this paper takes CTC and the whole CRNN as the entry point: it adds a label smoothing strategy, introduces the smoothing loss function from the field of speech recognition into the field of text recognition, and adds a language model, taking into account the acquisition of information from multiple aspects to improve recognition accuracy.
The main contributions of this paper are as follows. Firstly, for the low accuracy of recognition results: data labeled with hard labels introduce noise and information loss, leading to poor model generalization and recognition results that are easily affected. Secondly, we add label smoothing to obtain soft labels, which carry more information, are more robust to noise, and improve the generalization ability of the model. Thirdly, after combining CTC with label smoothing, the loss function under label smoothing is redefined. Finally, the language model is connected after the CRNN model, and the CRNN prediction results are input to the language model as its prior knowledge, so that the complementary nature of visual and language information can be used to obtain text information at multiple levels, further improving the accuracy of text recognition and achieving relatively high recognition accuracy on the six test sets. As the improved model is divided into two main parts, the language model and the CRNN with label smoothing, and the latter can be replaced with other visual models, the two parts are relatively independent.
The rest of the paper is organized as follows. Section 2 summarizes the relevant
research in this field, with a focus on text recognition in the field of deep learning. In
Section 3, the CRNN recognition model, which combines label smoothing and the language
model, is introduced in detail. Experimental results and corresponding discussions are
provided in Section 4. The paper concludes with a summary in Section 5.
2. Related Work
As the field of deep learning continues to develop, the direction of scene text recog-
nition has also been developed, and many researchers have proposed many new and
relevant recognition algorithms. The CRNN [2] network model, proposed in 2015, is a
classical model in the field of text recognition, combining CNN, RNN, and CTC [3] to
perform text recognition from the perspective of text sequences and avoid the limitation of
accurate slicing. First, the input image is converted into a grayscale map, feature extraction
is performed using CNN, contextual information is learned using RNN, and finally the
network is optimized using CTC to solve the text alignment problem. In 2016, the RARE [8]
algorithm was proposed, combining spatial transformation networks and sequence recog-
nition networks for curved text correction recognition. In 2016, the R2AM [9] algorithm
was proposed, which for the first time introduced an attention mechanism into the field of
text recognition and implemented soft feature selection in the decoding process to utilize
image features better. The STAR-Net [10] network, proposed in 2016, uses spatial trans-
formation to remove text distortions and uses residual convolution blocks to construct
feature extractors, particularly effective in distortion-rich scenes of text. The GRCNN [11]
model was proposed in 2017. It introduces a gating strategy in the recurrent convolution
layer (RCL) to control the context information and balance the transmission of forward
and recursive information. By combining GRCNN with bidirectional LSTM, the entire
network can be trained end-to-end, thereby effectively recognizing text information in images. In 2018, the paper [12] proposed an optical character recognition system called Rosetta. The system consists of two stages: a text detection stage, based on the Faster-RCNN model, detects text regions in the image, and a character recognition stage, based on fully convolutional networks, processes the detected text regions and recognizes the text content.
Benchmark [13], presented in 2019, provides a module-by-module analysis of models for the STR task, which helps researchers gain insight into the models and make improvements to
existing models. The semantically enhanced codec architecture for recognizing low-quality
scene text was proposed in SEED [14] in 2020. As transformers continue to evolve and
transformers as decoders become more common in STR tasks, recognition tasks are be-
ginning to focus on more than just recognition accuracy. ViTSTR [15], proposed in 2021,
uses a simple single-stage model architecture built on a computationally and parametri-
cally efficient visual transformer (ViT) to maximize accuracy, speed, and computational
efficiency. TRBA [16] makes full use of real data through data augmentation, collecting
unlabeled data and introducing semi-supervised and self-supervised improvements to the
model, moving in the direction of text recognition for scenes with fewer labels. Ref. [17]
proposed cascaded attention networks using three attention modules, covering horizontal continuity properties, contextual information, and the two-dimensional visual distribution, addressing
the drift phenomenon in encoding and decoding architectures. Text is Text [18] uses a single
model to deal with both scene text recognition (STR) and handwritten text recognition (HTR), introducing a knowledge distillation (KD)-based framework for the combined STR and HTR task, while proposing four distillation losses specifically designed for the unique features of the two forms of text recognition. Proposed
in 2022, character-context decoupling [19] focuses on open-set text recognition tasks and
proposes a character-context decoupling framework to alleviate the problem of confound-
ing effects of contextual information over visual information of individual characters by
separating contextual and character-visual information, with good results on both open
and closed datasets. With the continuous development of text recognition algorithms,
the addition of attention mechanisms and various encoding and decoding networks has
achieved good results in terms of recognition accuracy. However, the overall network
models have become more complex and less easy to understand. Typically, these models
require high-performance experimental equipment. Therefore, in contrast to those models, this paper proposes a text recognition approach based on the classic CRNN network model, which has a clear and understandable structure; it achieves good recognition results while placing lower demands on experimental equipment.
3. Methods
From the overall network structure diagram in Figure 2, it can be seen that the CRNN
network with label smoothing (LS) added is used as the visual model, and the images with
text areas cropped out are fed into the visual model, and features are extracted from the
input images by the CNN in the visual model to obtain the feature maps. Next, the feature
sequence is fed into a two-layer BiLSTM network for prediction (BiLSTM is an improvement
over the bidirectional RNN network). The BiLSTM network learns the feature vectors in
the sequence and outputs a predicted label distribution. Using a modified CTC loss, the
series of label distributions obtained in the RNN are converted into a final label order as
the prediction result of the visual model, and the prediction result is fed into the language
model bidirectional cloze network (BCN). After a multi-headed attention mechanism and a
feed-forward network, it is then subjected to a linear transformation to obtain the language model
prediction results; the final output is the text STATE in the picture. Figure 3 is a flow chart
of the improved CRNN network recognition process. The serial numbers 1 and 2 are
the improved parts of the CRNN network, and the improvement points and the overall
recognition process can be clarified by the color change. The main network structure
used is a three-layer structure consisting of convolutional layers, recurrent layers, and
transcription layers, using CNN+RNN+CTC. The convolutional layers consist of 7 layers
of convolutional neural networks, and the basic structure uses the VGG structure. First, the
input image is converted to a grayscale image, and then the grayscale image is resized to a
size of W × 32, with a fixed height of 32. In the third and fourth pooling layers, a kernel size of 1 × 2 (rather than 2 × 2) is used to preserve the true aspect ratio, and a batch normalization (BN) layer is
introduced to speed up convergence. The feature sequence obtained from the convolutional
layers is predicted by the recurrent layers using an RNN (the RNN network can be a type
of recurrent neural network such as LSTM or GRU) to predict the label distribution of the
feature sequence, which represents the probability distribution of the true label of each
time step in the feature sequence. The feature maps extracted by the CNN are split by
column, and each column of 512-dimensional features is input into two layers of 256-unit
bidirectional LSTMs for classification. The label distribution obtained from the recurrent
layers is converted into the final recognition result by the transcription layer using CTC. The
CTC algorithm performs deduplication and other operations to obtain the final recognition
result, and label smoothing is added to this process. The recognition result is used as prior knowledge for the language model.
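The following PyTorch-style sketch summarizes the visual model just described. It is an illustration, not the authors' released code: the layer counts and sizes follow the text (a VGG-style stack of seven convolutional layers, 1 × 2-style pooling in the later pooling stages to keep the width, 512-dimensional per-column features, and two 256-unit bidirectional LSTM layers), while the exact kernels and strides are assumptions.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes: int = 37):  # 26 letters + 10 digits + blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),              # halve height, keep width
            nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.Conv2d(512, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),              # halve height, keep width
            nn.Conv2d(512, 512, 2, 1, 0), nn.ReLU(),   # collapse height to 1
        )
        self.rnn = nn.LSTM(512, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 1, 32, W) grayscale, height fixed at 32
        feats = self.cnn(images)                   # (N, 512, 1, W')
        feats = feats.squeeze(2).permute(2, 0, 1)  # (W', N, 512): one column per step
        seq, _ = self.rnn(feats)                   # two stacked 256-unit BiLSTMs
        return self.fc(seq).log_softmax(-1)        # per-step label distribution
```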
Figure 3. Overall network flow diagram. The input image passes through the vision model (CRNN: CNN + RNN + CTC with label smoothing) to produce a vision prediction, which is fed to the BCN language model to produce the predicted output sequence.
As can be seen in Figure 1, in the text recognition task, the role of the CTC transcription layer is to take the predicted text sequence output by the RNN as input and transform it into a label sequence. Mathematically, transcription is used to find the label sequence with the highest probability based on the prediction [2]. The probabilities of the label sequences use the conditional probabilities from CTC [3].

In the CRNN network model, from the input image to the output text recognition result, the model can be considered to apply feature information from the visual aspect to text sequence recognition, without applying feature information from other modalities. Considering that the overall recognition accuracy of the CRNN network model is still low and that relying on visual information from a single modality is not informative enough, it is necessary to add other auxiliary information to improve the overall recognition accuracy on top of the visual model. At this stage, linguistic features have been used to some effect in the text domain. Linguistic features refer to considering the context between characters to infer the class of a character, rather than relying on the glyphic features of the character. In this paper, we choose to add a language model, a network model that obtains information from both visual and linguistic features rather than visual features alone, for more comprehensive information acquisition, which helps to improve the accuracy of text recognition. The data used for text recognition are labeled data. Taking the English alphabet as an example: if recognition is not case sensitive, the English alphabet has 26 letters; adding the Arabic numerals 0–9 and the CTC blank, the data eventually correspond to 37 character classes, so the text recognition problem is essentially a multi-class classification problem, with the option of adding label smoothing.
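As a quick sanity check of the 37-class count (26 letters plus 10 digits plus the CTC blank), a case-insensitive character table could look as follows:

```python
import string

CHARS = string.ascii_lowercase + string.digits         # 36 text symbols
char_to_idx = {c: i + 1 for i, c in enumerate(CHARS)}
BLANK_IDX = 0                                          # index reserved for the CTC blank
print(len(CHARS) + 1)                                  # -> 37 classes in total
```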
3.1. Label Smoothing (LS)

Label smoothing is a method of model regularization that can significantly improve the generalization ability and learning speed of multi-class neural networks, and it is typically used to prevent model overfitting [20]. Smoothing labels in this way prevents networks from becoming overconfident and has been used in many state-of-the-art models, including image classification, language translation, and speech recognition. In addition to improving generalization, label smoothing improves model calibration and can significantly improve beam search [21].

Consider applying label smoothing to text recognition: since the CTC component of the CRNN model comes from the direction of speech recognition, label smoothing is a general method of improving generalization ability by adding label noise, which has the effect of penalizing low-entropy output distributions (i.e., overconfident predictions). In the classification process, the commonly used one-hot encoding has poor generalization ability and places too much trust in the labels, assuming that the differences between categories are large, which is actually difficult to achieve [22]. To address the issues with one-hot encoding, label smoothing was proposed [21], and the calculation formula is shown in (1).
Label Smoothing = onehot × (1 − ε) + ε/c,    (1)
In Equation (1), onehot is the one-hot encoding variable of the label, such as [0, 1] or [1, 0, 0]; ε is a hyper-parameter greater than 0 and less than 1; and c is the number of categories. The default value of the parameter ε is 0.1. When the one-hot code is [0, 1], it becomes [0.05, 0.95] after label smoothing according to the formula. Adding label smoothing is achieved by adding a regularization term to the CTC objective function, consisting of the KL divergence between the predictive distribution, P, of the network and the uniform distribution, U, over the labels [23].
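Before turning to the combined objective, note that Equation (1) is a one-liner; the following NumPy sketch reproduces the worked example above:

```python
import numpy as np

def smooth_labels(onehot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    c = onehot.shape[-1]                   # number of categories
    return onehot * (1.0 - eps) + eps / c  # Equation (1)

print(smooth_labels(np.array([0.0, 1.0])))  # -> [0.05 0.95]
```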
L(θ_online) ≜ (1 − α) L_CTC + α ∑_{t=1}^{T} D_KL(P_t ‖ U),    (2)
The adjustable parameter, α, in Equation (2) is used to balance the weights of the regularization term and the CTC loss. From Equation (2), it can be intuitively seen that the whole loss function after adding label smoothing contains a CTC part and a KL divergence part. Both parts are governed by α: when α is 0, the loss reduces to L_CTC, and when α is 1, it becomes D_KL(P_t ‖ U). The CRNN model diagram after adding label smoothing is shown in Figure 4.
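One possible PyTorch realization of Equation (2) is sketched below. It is an assumption-laden illustration, not the paper's implementation: log_probs are taken to be the log-softmax outputs of the recurrent layers, and the KL term is averaged over time steps and batch rather than summed.

```python
import math
import torch
import torch.nn.functional as F

def ctc_with_label_smoothing(log_probs, targets, input_lens, target_lens,
                             alpha: float = 0.005):
    # log_probs: (T, N, C) log-softmax outputs; index 0 is the CTC blank.
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    # D_KL(P_t || U) = sum_c P_t(c) * (log P_t(c) + log C); it penalizes
    # low-entropy (overconfident) per-time-step distributions.
    C = log_probs.size(-1)
    kl = (log_probs.exp() * (log_probs + math.log(C))).sum(-1).mean()
    return (1 - alpha) * ctc + alpha * kl
```

With α = 0 this reduces to plain CTC and with α = 1 to the pure KL term, matching the limiting cases noted above.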
Figure 4. Schematic diagram of the CRNN + label smoothing model (CNN → RNN → CTC with label smoothing → vision prediction).
3.2. Language Model

CTC-based models support end-to-end training and do not require pre-alignment of data; only input and output sequences are required for training. The main drawbacks of the CTC model are that it still assumes conditional independence between outputs, and that it only has acoustic modeling capability, lacking language modeling capability [24]. Enhanced language modeling usually requires the acquisition of linguistic features, that is, inferring the class of a character by considering the context between characters rather than the glyphic features of the character; such features are usually paired with the visual features extracted by visual models. The language model BCN was proposed in ABINet [25], where its purpose is to iteratively correct and check letters by fusing visual and language model features over n checking iterations. Considering the actual running time and the limited semantic information extracted after repeated iterations, only the language model BCN is added to the CRNN network, without the fusion-and-iteration mechanism. From Figure 5 it can be seen that the output of the CRNN is used as input to the language model, providing prior knowledge for the language model to acquire semantic information, and the gradient is back-propagated as an auxiliary task for CRNN recognition so that the output of the CRNN contains more semantic information.
Figure 5. Language model BCN. Character-position encodings serve as attention queries and the vision prediction as keys and values; multi-head attention with diagonal masks is followed by a feed-forward network and a linear layer that outputs character probabilities.
The prediction results of the visual model are used to provide prior knowledge for the language model, and text recognition is performed at both the visual and the contextual semantic level to obtain more comprehensive information about the text. Each layer of the BCN is a series of multi-head attention and feed-forward networks with residual connections and layer normalization. The network takes as input a sequential encoding of character positions rather than the characters themselves, and the character probability vector from the vision prediction is passed directly into the multi-head attention module. In the multi-head attention mechanism, the diagonal attention mask is designed to avoid seeing the current character and to achieve simultaneous access to the information to the left and right of the character, combining both sides to make a prediction.
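A minimal sketch of such a diagonal mask, assuming (as in common transformer implementations) that the mask is added to the attention logits before the softmax:

```python
import torch

def bcn_diagonal_mask(seq_len: int) -> torch.Tensor:
    # Position i may attend to every position except i itself, so the cloze-style
    # prediction for character i sees only its left and right context.
    mask = torch.zeros(seq_len, seq_len)
    mask.fill_diagonal_(float("-inf"))  # -inf logits become zero attention weight
    return mask
```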
4. Experiments

4.1. Datasets

In the field of text recognition, the demand for data volume is very large. Compared with the training data volume of a few thousand or a few hundred in the field of text detection, the training data volume in the field of text recognition is in the millions. The datasets used in this paper are all public datasets, which can be divided into synthetic datasets and real datasets. Since the training of network models requires a large amount of data as support, most text recognition models are trained using synthetic datasets; real datasets are used for the evaluation of the training results of text recognition models.

1. Synthetic datasets for training
MJSynth [26] is a synthetic plain English text dataset. The dataset contains 3000 folders
with about 9 million images, rendering the text onto natural images and then performing
a random transformation. The words in each image in the dataset are labeled and the
character orientation is mainly horizontal. SynthText [27] is also a synthetic text dataset, but
the difference is that SynthText is designed mainly for text detection, so the text is rendered
onto the complete natural image, consisting of 800,000 scene images. To accommodate text
recognition, the words are cropped according to the word annotation bounding boxes in
the experiments, and a total of about 7 million text images are cropped.
2. Real datasets for evaluation
Depending on the format of the text in the image, the real dataset can be divided into
regular and irregular text. Regular datasets include IIIT5k-Words (IIIT5k), SVT (Street View
Text), and IC13 (ICDAR2013). Irregular datasets include IC15 (ICDAR2015), SVTP (Street
View Text Perspective), and CUTE80.
IIIT5k-Words (IIIT5k) [28] consists of 3000 test images collected from the Internet.
The text in the images is mostly regular text, and, for each image, two dictionaries of
different sizes, 1000 words and 50 words, are matched. Each dictionary consists of real
annotations and other commonly used words; the images in SVT [29] are mainly cropped
from 249 Google Street View images and contain 647 low-resolution, noisy text
images, most of which are horizontal text images. Each image matches a 50-word dictionary;
the vast majority of text images in ICDAR 2013 (IC13) [30] are from IC03, with new images
added for data augmentation, for a total of 1015 text images, most of which are regular text
images, with some text images blurred due to uneven illumination.
ICDAR 2015 [31] from Challenge 4 of the ICDAR 2015 Robust Reading Competition,
called incidental scene text, consists mainly of plain English text of multi-directional scenes.
This dataset consists of some randomly taken street view images with low resolution, and
most of the text in the figures is relatively small and blurred, so it is relatively difficult to
detect. ICDAR 2015 divides the training and test sets into 1000 and 500 images, each of
which contains multi-directional text, which is labeled in word units using a rectangular
box of 4 points; SVTP [32] was selected from the side view of Google Street View. The
dataset consists of 645 cropped text images, most of which have distortion factors such as
low resolution, noise, and blur. Each image provides a dictionary of 50 words. CUTE80
(CUTE) [33] contains 288 high-resolution text images cropped from the original dataset.
The text in this dataset is mainly curved and directional text, and no associated dictionaries
are provided.
In this paper, limited by the experimental equipment and the actual running time, we use the synthetic dataset MJ in the training phase; the dataset ST could be added for training if experimental conditions allow.
4.2. Experimental Results

Word recognition accuracy is used as the evaluation index for the merits of text recognition algorithms; it is calculated as in Equation (6):
Accuracy = N/M × 100%,    (6)
In Equation (6), M represents the total number of samples to be recognized, and N represents the number of samples recognized completely correctly.
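In other words, Equation (6) is exact-match word accuracy; a minimal sketch:

```python
def word_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    # N: samples recognized completely correctly; M: total number of samples.
    n = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * n / len(ground_truths)
```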
4.2.1. Comparison of Experimental Results between the Actual Run Baseline Model and
the Original Baseline Model
In order to evaluate the text recognition performance of the improved CRNN network
model objectively, considering that the test data in the original CRNN paper did not include
all six datasets, the CRNN test data involved in the comparison are from the replication
of Benchmark [16], and the CRNN-base represents the results obtained from the actual
experiments on the Linux experimental platform. From Table 1, we can see that CRNN-base gains 0.05%, 3.83%, 3.01%, 6.53%, 5.07%, and 0.75% over CRNN on IC13, SVT, IIIT5K, IC15, SVTP, and CUTE, respectively, and the experimental data are real and reliable.
Table 1. Comparison table of CRNN baseline experiments (accuracy of English dataset (%)).
4.2.2. Comparison of Adding Label Smoothing with the Baseline Model Approach
The experimental comparison table after adding label smoothing is shown in Table 2.
Considering that label smoothing is mostly applied in the classification task and the default
value of α is 0.1, the optimal value of α can be taken as 0.01 when adding label smoothing
in other tasks. Although text recognition can also be considered a multi-classification
problem, the applicability of label smoothing in the field of text recognition for the value of
α needs to be considered. The optimal value of the label smoothing parameter, α, needs to be determined, and this paper uses a GridSearch method to verify it around 0.01 versus 0.1. GridSearch tunes parameters by exhaustive search: it traverses all candidate parameter values in turn, tries every possibility, and selects the best-performing value as the final result. The specific experimental results are shown in Table 2.
Table 2. Comparison table of α values (accuracy of English dataset (%); "+" indicates an increase over CRNN-base and "−" indicates a decrease below CRNN-base).
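The search itself is simple to express. In the sketch below, train_and_eval is a hypothetical stand-in for training the model with a given α and returning validation accuracy, and the candidate values are illustrative, bracketing the defaults 0.1 and 0.01 discussed above:

```python
def grid_search_alpha(train_and_eval, candidates=(0.1, 0.05, 0.01, 0.005, 0.001)):
    # Exhaustively try every candidate and keep the best-performing one.
    scores = {alpha: train_and_eval(alpha) for alpha in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```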
From the experimental results of GridSearch, it can be seen that, when the default
value of 0.1 is used for label smoothing, there is a decrease in the accuracy rate based on
the CRNN-base, indicating that the parameter default value is not ideal in the field of text
recognition; therefore, the algorithms and modules in other fields, when applied to new
fields, require parameter adjustment to find the best parameters. From the experimental
results, it can be seen that, except for an α default value of 0.1, other values have different
degrees of growth. In particular, the best effect is achieved when α = 0.005, where the gains on the irregular datasets exceed those on the regular datasets, indicating stronger applicability to irregular text; on IC13, SVT, IIIT5K, IC15, SVTP, and CUTE, the improvements over the CRNN-base model are 3.64%, 5.01%, 3.70%, 6.22%, 5.63%, and 7.15%,
respectively, which is significant. As a result, the field of label smoothing applications
has been extended from classification, machine translation, image segmentation, and
speech recognition to text recognition. By implementing regularization, the model is
prevented from predicting labels too confidently during training to improve generalization
ability, and can also be tuned with parameters to achieve significant improvements in
recognition accuracy.
4.2.3. Comparison of Adding Language Model with the Baseline Model Approach
The experimental comparison table after adding the BCN language model is shown
in Table 3. From the experimental results, it can be seen that adding the language model
to extract semantic information and adding label smoothing can achieve the effect of
improving the accuracy rate, and there are different degrees of improvement in both regular
and irregular texts, with more growth points in the irregular datasets than in the regular
datasets. In IC13, SVT, IIIT5K, IC15, SVTP, and CUTE, the increases over the base model
are 2.35%, 3.24%, 2.62%, 4.71%, 4.39%, and 5.21%, respectively. From the experimental
results, it can be seen that there is a problem of incomplete information acquisition when
considering the text information from the visual model alone for recognition. The addition
of the language model not only considers the visual aspect of a single character’s glyphic
features but also infers the category of the character from the context between characters,
broadening the access to information and improving the overall effect.
Table 3. Comparison table after adding the language model (accuracy of English dataset (%)).
5. Conclusions
In this paper, we propose an improved CRNN for scene text recognition. The improved
CRNN model is used for sequence recognition of text in images, trained on synthetic
datasets, and tested on six public datasets. Experimental results show that the improved
CRNN outperforms the original CRNN network in accuracy and compares favorably with other methods. The main contribution of this paper is to improve
the original CRNN network and propose a new idea of text recognition based on a neural
network. The CRNN network model retains the overall architecture of the original network,
and the use of label smoothing in predicting output results can effectively improve the
generalization ability of the model, improve the anti-interference ability of the model,
prevent the generation of over-fitting, and thus improve the recognition accuracy. The
smoothing loss function in speech recognition is introduced into the field of text recognition, and the CTC loss function is redefined. A
language model is added to fuse sequence information with language information for
text recognition, increasing the information acquisition channels and ultimately achieving
the goal of improving the accuracy of text recognition. With the addition of the language
model, the overall recognition accuracy is improved, but the number of parameters also
increases and the network run time becomes longer. Therefore, network optimization for
the improved network model is needed in future work to reduce the number of parameters
and running time. In addition, the public dataset is mainly in English, and CRNN has
achieved certain results in both Chinese and English recognition, but real-life text images
can be mixed with multiple languages, and further research is needed for the recognition
of mixed text.
Author Contributions: Conceptualization, W.Y.; methodology, W.Y.; software, W.Y.; validation, W.Y.
and M.I.; formal analysis, W.Y.; investigation, W.Y., A.H. and M.I.; resource, A.H. and M.I.; data
curation, W.Y.; writing—original draft preparation, W.Y.; writing—review and editing, M.I. All
authors have read and agreed to the published version of the manuscript.
Funding: This work has been supported by the Natural Science Foundation of China (62166043,
U2003207).
Data Availability Statement: Publicly available datasets were analyzed in this study. Our datasets
can be obtained from [https://github.com/clovaai/deep-text-recognition-benchmark] (11 December
2020), and [https://github.com/FangShancheng/ABINet] (10 June 2021).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Liu, C.; Chen, X.; Luo, C.; Jin, L.; Xue, Y.; Liu, Y. A deep learning approach for natural scene text detection and recognition. Chin.
J. Graph. 2021, 26, 1330–1367. [CrossRef]
2. Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to
Scene Text Recognition. arXiv 2015, arXiv:1507.05717. [CrossRef] [PubMed]
3. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence
data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA,
USA, 25–29 June 2006; pp. 369–376.
4. Liu, Y.; Wang, Y.; Shi, H. A Convolutional Recurrent Neural-Network-Based Machine Learning for Scene Text Recognition
Application. Symmetry 2023, 15, 849. [CrossRef]
5. Lei, Z.; Zhao, S.; Song, H.; Shen, J. Scene text recognition using residual convolutional recurrent neural network. Mach. Vis. Appl.
2018, 29, 861–871. [CrossRef]
6. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw.
1994, 5, 157–166. [CrossRef] [PubMed]
7. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
8. Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust scene text recognition with automatic rectification. In Proceedings of the 2016
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4168–4176.
9. Lee, C.-Y.; Osindero, S. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2231–2239.
10. Liu, W.; Chen, C.; Wong, K.-Y.K.; Su, Z.; Han, J. Star-net: A spatial attention residue network for scene text recognition. BMVC
2016, 2, 7.
11. Wang, J.; Hu, X. Gated recurrent convolution neural network for OCR. Adv. Neural Inf. Process. Syst. 2017, 30, 334–343.
12. Borisyuk, F.; Gordo, A.; Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp.
71–79.
13. Baek, J.; Kim, G.; Lee, J.; Park, S.; Han, D.; Yun, S.; Oh, S.J.; Lee, H. What is wrong with scene text recognition model comparisons?
Dataset and model analysis. In Proceedings of the 2019 IEEE/CVF international Conference on Computer Vision, Seoul, Republic
of Korea, 27 October–2 November 2019; pp. 4715–4723.
14. Qiao, Z.; Zhou, Y.; Yang, D.; Zhou, Y.; Wang, W. Seed: Semantics enhanced encoder-decoder framework for scene text recognition.
In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June
2020; pp. 13528–13537.
15. Atienza, R. Vision transformer for fast and efficient scene text recognition. In Proceedings of the Document Analysis and
Recognition—ICDAR 2021: 16th International Conference, Lausanne, Switzerland, 5–10 September 2021; Proceedings, Part I 16.
Springer International Publishing: Cham, Switzerland, 2021; pp. 319–334.
16. Baek, J.; Matsui, Y.; Aizawa, K. What if we only use real datasets for scene text recognition? Toward scene text recognition with
fewer labels. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN,
USA, 20–25 June 2021; pp. 3113–3122.
17. Zhang, M.; Ma, M.; Wang, P. Scene text recognition with cascade attention network. In Proceedings of the 2021 International
Conference on Multimedia Retrieval, New York, NY, USA, 21–24 August 2021; pp. 385–393.
18. Bhunia, A.K.; Sain, A.; Chowdhury, P.N.; Song, Y.-Z. Text is text, no matter what: Unifying text recognition using knowledge
distillation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada,
10–17 October 2021; pp. 983–992.
19. Liu, C.; Yang, C.; Yin, X.C. Open-Set Text Recognition via Character-Context Decoupling. In Proceedings of the 2022 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4523–4532.
20. Liu, M.; Zhou, L. A cervical cell classification method based on migration learning and label smoothing strategy. Mod. Comput.
2022, 28, 1–9+32.
21. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? Adv. Neural Inf. Process. Syst. 2019, 32, 422.
22. Zhao, L. Research on User Behavior Recognition Based on CNN and LSTM. Master’s Thesis, Nanjing University of Information
Engineering, Nanjing, China, 2021. [CrossRef]
23. Kim, S.; Seltzer, M.L.; Li, J.; Zhao, R. Improved training for online end-to-end speech recognition systems. arXiv 2017,
arXiv:1711.02212.
24. Qin, C. Research on End-to-End Speech Recognition Technology. Ph.D. Thesis, Strategic Support Force Information Engineering
University, Zhengzhou, China, 2020. [CrossRef]
25. Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read like humans: Autonomous, bidirectional and iterative language modeling for
scene text recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville,
TN, USA, 20–25 June 2021; pp. 7098–7107.
26. Cheng, L.; Yan, J.; Chen, M.; Lu, Y.; Li, Y.; Hu, L. A multi-scale deformable convolution network model for text recognition. In
Proceedings of the Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021); SPIE: Paris, France, 2022;
Volume 12083, pp. 627–635.
27. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324.
28. Mishra, A.; Alahari, K.; Jawahar, C.V. Top-down and bottom-up cues for scene text recognition. In Proceedings of the 2012 IEEE
Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2687–2694.
29. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on
Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464.
30. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; de las Heras, L.P.
ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and
Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493.
31. Phan, T.Q.; Shivakumara, P.; Tian, S.; Tan, C.L. Recognizing text with perspective distortion in natural scenes. In Proceedings of
the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 569–576.
32. Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert
Syst. Appl. 2014, 41, 8027–8048. [CrossRef]
33. Cheng, Z.; Xu, Y.; Bai, F.; Niu, Y.; Pu, S.; Zhou, S. Aon: Towards arbitrarily-oriented text recognition. In Proceedings of the 2018
IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5571–5579.