Article
Scene Text Recognition Based on Improved CRNN
Wenhua Yu 1,2, Mayire Ibrayim 1,2,* and Askar Hamdulla 1,3
1 College of Information Science and Engineering, Xinjiang University, Urumqi 830017, China;
yuwenhua@stu.xju.edu.cn (W.Y.); askar@xju.edu.cn (A.H.)
2 Xinjiang Key Laboratory of Signal Detection and Processing, Urumqi 830017, China
3 Xinjiang Key Laboratory of Multilingual Information Technology, Urumqi 830017, China
* Correspondence: mayire401@xju.edu.cn; Tel.: +86-133-1988-9043
Abstract: Text recognition is an important research topic in computer vision. Scene text, which refers to text in real scenes, sometimes needs to attract attention and may therefore be deformed. At the same time, the image acquisition process is affected by factors such as occlusion and noise, making scene text recognition more challenging. In this paper, we improve the CRNN model for text recognition, which has relatively low accuracy, performs poorly on irregular text, and obtains text sequence information from only a single aspect, resulting in incomplete information acquisition. Firstly, to address the low text recognition accuracy and poor recognition of irregular text, we add label smoothing to ensure the model's generalization ability. Then, we introduce the smoothing loss function from speech recognition into the field of text recognition and add a language model to increase the channels of information acquisition, ultimately improving text recognition accuracy. The method was experimentally verified on six public datasets and compared with other advanced methods. The experimental results show that it performs well in most benchmark tests and that the improved model outperforms the original model in recognition performance.
Keywords: CRNN; text recognition; label smoothing; language model; deep learning
1. Introduction
Text recognition is an important direction in the field of computer vision. With the continuous development of deep learning fields such as computer vision, pattern recognition, and machine learning, scene text recognition based on deep learning has developed on this basis. Text recognition can be divided into two branches according to the recognition algorithm: segmentation-based recognition algorithms and recognition algorithms that do not require segmentation. A segmentation-based natural scene text recognition algorithm usually needs to locate each character contained in the input text image, identify each character through a single-character recognizer, and then combine all the characters into a string sequence to obtain the final recognition result. Natural scene text recognition algorithms without segmentation treat the entire text line as a whole and directly map the input text image to a sequence of target strings, thus avoiding the disadvantages and performance limitations of single-character segmentation; this is also the current mainstream approach [1]. In the process of text recognition, a series of labels is usually predicted, and the whole recognition process can be regarded as a sequence recognition problem [2]. The CRNN [2] algorithm, a segmentation-free sequence recognition method, is a neural network that integrates feature extraction, sequence modeling, and transcription. The feature map is first extracted using a convolutional neural network (CNN); then the feature dependencies are captured using a recurrent neural network (RNN), predictions are made over the features, and the output prediction distribution is fed to connectionist temporal classification (CTC) [3] for processing before the final text sequence is output.
Because RNN models are an important branch of deep neural networks, they are designed primarily to process sequences. To learn text sequences directly using RNN models, there are two requirements: the mapping relationship between input and output sequences needs to be labeled in advance, and that mapping must be a one-to-one correspondence. Because text and speech signals are continuous signals, they are difficult to segment, and the volume of text recognition data is in the millions; segmentation and annotation would be costly, time-consuming, and impractical. Therefore, an RNN cannot be applied directly to text recognition, so the CRNN network adds the CTC proposed by Alex Graves et al. [3] after the RNN to make the RNN applicable to text recognition. The CTC algorithm is an end-to-end training method for RNNs. It extends the output layer of an RNN by: converting the data dependency of segmentation and label mapping into feature extraction over a sliding time window, transforming the input–output relationship from one-to-one to many-to-one; adding blank characters and performing deduplication and blank-removal operations on consecutive identical characters in the output sequence; reducing complexity and increasing speed by drawing on the forward–backward algorithm of the hidden Markov model (HMM) to compute the loss function; and using dynamic programming to compute the training paths, avoiding impractical exhaustive or brute-force enumeration. The decoding process of CTC maps the path generated by CTC into a final sequence. Combining these features, an example of a final sequence mapped after deduplication and removal of blank characters using CTC is shown in Figure 1.
Figure 1. Conversion correspondence diagram. The RNN output path "- - S S - T T - A A - T T - E E - -" is collapsed by CTC (merging repeats and removing blanks) into the final sequence "STATE".
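To make the collapsing rule concrete, the following minimal Python sketch (an illustration, not code from the paper) reproduces the mapping in Figure 1: consecutive repeats are merged first, and blank symbols are then removed.

```python
def ctc_collapse(path: str, blank: str = "-") -> str:
    # CTC decoding rule: merge consecutive repeats, then drop blanks.
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:
            out.append(ch)
        prev = ch
    return "".join(out)

print(ctc_collapse("--SS-TT-AA-TT-EE--"))  # -> STATE, as in Figure 1
```

The blank symbol is what allows CTC to distinguish a genuinely doubled letter (which requires a blank between the two repeats) from several consecutive frames of the same character.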
CRNN models have low text recognition accuracy [4,5], poor recognition of irregular text, and incomplete information acquisition, since they obtain text sequence information from a single level. Commonly used improvement schemes mostly focus on the two networks, CNN and RNN, analyzing the network's inadequate feature extraction, the presence of exploding and vanishing gradients [6], and the poor recognition of indefinitely long text sequences. Targeted replacements have been made to improve feature extraction networks, replacing recurrent neural networks with long short-term memory networks [7] or adding residual modules. Although better performance is achieved, information acquisition remains incomplete and limited to the text domain only. Therefore, this paper takes CTC and the whole CRNN as the entry point: it adds a label smoothing strategy, introduces the smoothing loss function from the field of speech recognition into the field of text recognition, and adds a language model, taking into account the acquisition of information from multiple aspects to improve recognition accuracy.
The main contributions of this paper are as follows. Firstly, for the low accuracy of recognition results: data labeled with hard labels introduce noise and information loss, leading to poor model generalization and recognition results that are easily affected. Secondly, we add label smoothing to obtain soft labels, which carry more information, are more robust to noise, and improve the generalization ability of the model. Thirdly, after combining CTC with label smoothing, the loss function under label smoothing is redefined. Finally, the language model is connected after the CRNN model, and the CRNN prediction results are input to the language model as its prior knowledge, so that the complementary nature of visual and language information can be used to obtain text information at multiple levels, further improving the accuracy of text recognition and achieving relatively high recognition accuracy on the six test sets. As the improved model is divided into two main parts, the language model and the CRNN with label smoothing, and the latter can be replaced with other visual models, the two parts are relatively independent.
The rest of the paper is organized as follows. Section 2 summarizes the relevant
research in this field, with a focus on text recognition in the field of deep learning. In
Section 3, the CRNN recognition model, which combines label smoothing and the language
model, is introduced in detail. Experimental results and corresponding discussions are
provided in Section 4. The paper concludes with a summary in Section 5.
2. Related Work
As the field of deep learning continues to develop, the direction of scene text recog-
nition has also been developed, and many researchers have proposed many new and
relevant recognition algorithms. The CRNN [2] network model, proposed in 2015, is a
classical model in the field of text recognition, combining CNN, RNN, and CTC [3] to
perform text recognition from the perspective of text sequences and avoid the limitation of
accurate slicing. First, the input image is converted into a grayscale map, feature extraction
is performed using CNN, contextual information is learned using RNN, and finally the
network is optimized using CTC to solve the text alignment problem. In 2016, the RARE [8]
algorithm was proposed, combining spatial transformation networks and sequence recog-
nition networks for curved text correction recognition. In 2016, the R2AM [9] algorithm
was proposed, which for the first time introduced an attention mechanism into the field of
text recognition and implemented soft feature selection in the decoding process to utilize
image features better. The STAR-Net [10] network, proposed in 2016, uses spatial trans-
formation to remove text distortions and uses residual convolution blocks to construct
feature extractors, particularly effective in distortion-rich scenes of text. The GRCNN [11]
model was proposed in 2017. It introduces a gating strategy in the recurrent convolution
layer (RCL) to control the context information and balance the transmission of forward
and recursive information. By combining GRCNN with bidirectional LSTM, the entire
network can be trained end-to-end, thereby effectively recognizing text information in images. In 2018, the paper [12] proposed an optical character recognition system called Rosetta. The system consists of two stages: a text detection stage, based on the Faster-RCNN model, detects text regions in the image, and a character recognition stage, based on fully convolutional networks, processes the detected text regions and recognizes the text content.
Benchmark [13], presented in 2019, provides a module-by-module analysis of models for the STR task, which helps researchers gain insight into the models and make improvements to
existing models. The semantically enhanced codec architecture for recognizing low-quality
scene text was proposed in SEED [14] in 2020. As transformers continue to evolve and
transformers as decoders become more common in STR tasks, recognition tasks are be-
ginning to focus on more than just recognition accuracy. ViTSTR [15], proposed in 2021,
uses a simple single-stage model architecture built on a computationally and parametri-
cally efficient visual transformer (ViT) to maximize accuracy, speed, and computational
efficiency. TRBA [16] makes full use of real data through data augmentation, collecting
unlabeled data and introducing semi-supervised and self-supervised improvements to the
model, moving in the direction of text recognition for scenes with fewer labels. Ref. [17]
proposed cascaded attention networks using three attention modules, covering horizontal continuity properties, contextual information, and the two-dimensional visual distribution, addressing
the drift phenomenon in encoding and decoding architectures. Text is Text [18] uses a single
model to deal with both scene text recognition (STR) and handwritten text recognition (HTR), introducing a knowledge distillation (KD)-based framework for the combined STR and HTR task, while proposing four distillation losses specifically designed for the unique features of the two forms of text recognition. Proposed
in 2022, character-context decoupling [19] focuses on open-set text recognition tasks and
proposes a character-context decoupling framework to alleviate the problem of confound-
ing effects of contextual information over visual information of individual characters by
separating contextual and character-visual information, with good results on both open
and closed datasets. With the continuous development of text recognition algorithms,
the addition of attention mechanisms and various encoding and decoding networks has
achieved good results in terms of recognition accuracy. However, the overall network
models have become more complex and less easy to understand. Typically, these models
require high-performance experimental equipment. Therefore, in contrast to those models, this paper proposes a text recognition approach based on the classic CRNN network model, which has a clear and understandable structure; it achieves good recognition results while placing lower demands on experimental equipment.
3. Methods
From the overall network structure diagram in Figure 2, it can be seen that the CRNN
network with label smoothing (LS) added is used as the visual model, and the images with
text areas cropped out are fed into the visual model, and features are extracted from the
input images by the CNN in the visual model to obtain the feature maps. Next, the feature
sequence is fed into a two-layer BiLSTM network for prediction (BiLSTM is an improvement
over the bidirectional RNN network). The BiLSTM network learns the feature vectors in
the sequence and outputs a predicted label distribution. Using a modified CTC loss, the
series of label distributions obtained in the RNN are converted into a final label order as
the prediction result of the visual model, and the prediction result is fed into the language
model bidirectional cloze network (BCN). After a multi-headed attention mechanism and a
feed-forward network, it is then subjected to a linear transformation to obtain the language model
prediction results; the final output is the text STATE in the picture. Figure 3 is a flow chart
of the improved CRNN network recognition process. The serial numbers 1 and 2 are
the improved parts of the CRNN network, and the improvement points and the overall
recognition process can be clarified by the color change. The main network structure
used is a three-layer structure consisting of convolutional layers, recurrent layers, and
transcription layers, using CNN+RNN+CTC. The convolutional layers consist of 7 layers
of convolutional neural networks, and the basic structure uses the VGG structure. First, the
input image is converted to a grayscale image, and then the grayscale image is resized to a
size of W × 32, with a fixed height of 32. In the third and fourth pooling layers, a kernel size of 1 × 2 (rather than 2 × 2) is used to preserve the true aspect ratio, and a batch normalization (BN) layer is
introduced to speed up convergence. The feature sequence obtained from the convolutional
layers is predicted by the recurrent layers using an RNN (the RNN network can be a type
of recurrent neural network such as LSTM or GRU) to predict the label distribution of the
feature sequence, which represents the probability distribution of the true label of each
time step in the feature sequence. The feature maps extracted by the CNN are split by
column, and each column of 512-dimensional features is input into two layers of 256-unit
bidirectional LSTMs for classification. The label distribution obtained from the recurrent
layers is converted into the final recognition result by the transcription layer using CTC. The
CTC algorithm performs deduplication and other operations to obtain the final recognition
result, and label smoothing is added to this process. The recognition result is used as prior knowledge for the language model.
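The following PyTorch-style sketch summarizes the visual model just described. It is an illustration, not the authors' released code: the layer counts and sizes follow the text (a VGG-style stack of seven convolutional layers, 1 × 2-style pooling in the later pooling stages to keep the width, 512-dimensional per-column features, and two 256-unit bidirectional LSTM layers), while the exact kernels and strides are assumptions.

```python
import torch
import torch.nn as nn

class CRNNSketch(nn.Module):
    def __init__(self, num_classes: int = 37):  # 26 letters + 10 digits + blank
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, 1, 1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, 1, 1), nn.ReLU(),
            nn.Conv2d(256, 256, 3, 1, 1), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),              # halve height, keep width
            nn.Conv2d(256, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.Conv2d(512, 512, 3, 1, 1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.MaxPool2d((2, 1), (2, 1)),              # halve height, keep width
            nn.Conv2d(512, 512, 2, 1, 0), nn.ReLU(),   # collapse height to 1
        )
        self.rnn = nn.LSTM(512, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # images: (N, 1, 32, W) grayscale, height fixed at 32
        feats = self.cnn(images)                   # (N, 512, 1, W')
        feats = feats.squeeze(2).permute(2, 0, 1)  # (W', N, 512): one column per step
        seq, _ = self.rnn(feats)                   # two stacked 256-unit BiLSTMs
        return self.fc(seq).log_softmax(-1)        # per-step label distribution
```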
Figure 3. Overall network flow diagram. The input image passes through the vision model (CRNN: CNN + RNN + CTC with label smoothing) to produce a vision prediction, which is fed to the BCN language model to produce the predicted output sequence.
As can be seen in Figure 1, in the text recognition task, the role of the CTC transcription layer is to take the predicted text sequence output by the RNN as input and transform it into a label sequence. Mathematically, transcription is used to find the label sequence with the highest probability based on the prediction [2]. The probabilities of the label sequences use the conditional probabilities from CTC [3].

In the CRNN network model, from the input image to the output text recognition result, the model can be considered to apply feature information from the visual aspect to text sequence recognition, without applying feature information from other modalities. Considering that the overall recognition accuracy of the CRNN network model is still low and that relying on visual information from a single modality is not informative enough, it is necessary to add other auxiliary information to improve the overall recognition accuracy on top of the visual model. At this stage, linguistic features have been used to some effect in the text domain. Linguistic features refer to considering the context between characters to infer the class of a character, rather than relying on the glyphic features of the character. In this paper, we choose to add a language model, a network model that obtains information from both visual and linguistic features rather than visual features alone, for more comprehensive information acquisition, which helps to improve the accuracy of text recognition. The data used for text recognition are labeled data. Taking the English alphabet as an example: if recognition is not case sensitive, the English alphabet has 26 letters; adding the Arabic numerals 0–9 and the CTC blank, the data eventually correspond to 37 character classes, so the text recognition problem is essentially a multi-class classification problem, with the option of adding label smoothing.
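As a quick sanity check of the 37-class count (26 letters plus 10 digits plus the CTC blank), a case-insensitive character table could look as follows:

```python
import string

CHARS = string.ascii_lowercase + string.digits         # 36 text symbols
char_to_idx = {c: i + 1 for i, c in enumerate(CHARS)}
BLANK_IDX = 0                                          # index reserved for the CTC blank
print(len(CHARS) + 1)                                  # -> 37 classes in total
```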
3.1. Label Smoothing (LS)

Label smoothing is a method of model regularization that can significantly improve the generalization ability and learning speed of multi-class neural networks, and it is typically used to prevent model overfitting [20]. Smoothing labels in this way prevents networks from becoming overconfident and has been used in many state-of-the-art models, including image classification, language translation, and speech recognition. In addition to improving generalization, label smoothing improves model calibration and can significantly improve beam search [21].

Consider applying label smoothing to text recognition: since the CTC component of the CRNN model comes from the direction of speech recognition, label smoothing is a general method of improving generalization ability by adding label noise, which has the effect of penalizing low-entropy output distributions (i.e., overconfident predictions). In the classification process, the commonly used one-hot encoding has poor generalization ability and places too much trust in the labels, assuming that the differences between categories are large, which is actually difficult to achieve [22]. To address the issues with one-hot encoding, label smoothing was proposed [21], and the calculation formula is shown in (1).
Label Smoothing = onehot × (1 − ε) + ε/c,    (1)
In Equation (1), onehot is the one-hot encoding variable of the label, such as [0, 1] or [1, 0, 0]; ε is a hyper-parameter greater than 0 and less than 1; and c is the number of categories. The default value of the parameter ε is 0.1. When the one-hot code is [0, 1], it becomes [0.05, 0.95] after label smoothing according to the formula. Adding label smoothing is achieved by adding a regularization term to the CTC objective function, consisting of the KL divergence between the predictive distribution, P, of the network and the uniform distribution, U, over the labels [23].
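Before turning to the combined objective, note that Equation (1) is a one-liner; the following NumPy sketch reproduces the worked example above:

```python
import numpy as np

def smooth_labels(onehot: np.ndarray, eps: float = 0.1) -> np.ndarray:
    c = onehot.shape[-1]                   # number of categories
    return onehot * (1.0 - eps) + eps / c  # Equation (1)

print(smooth_labels(np.array([0.0, 1.0])))  # -> [0.05 0.95]
```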
L(θ_online) ≜ (1 − α) L_CTC + α ∑_{t=1}^{T} D_KL(P_t ‖ U),    (2)
The adjustable parameter, α, in Equation (2) is used to balance the weights of the regularization term and the CTC loss. From Equation (2), it can be intuitively seen that the whole loss function after adding label smoothing contains a CTC part and a KL divergence part. Both parts are governed by α: when α is 0, the loss reduces to L_CTC, and when α is 1, it becomes D_KL(P_t ‖ U). The CRNN model diagram after adding label smoothing is shown in Figure 4.
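One possible PyTorch realization of Equation (2) is sketched below. It is an assumption-laden illustration, not the paper's implementation: log_probs are taken to be the log-softmax outputs of the recurrent layers, and the KL term is averaged over time steps and batch rather than summed.

```python
import math
import torch
import torch.nn.functional as F

def ctc_with_label_smoothing(log_probs, targets, input_lens, target_lens,
                             alpha: float = 0.005):
    # log_probs: (T, N, C) log-softmax outputs; index 0 is the CTC blank.
    ctc = F.ctc_loss(log_probs, targets, input_lens, target_lens,
                     blank=0, zero_infinity=True)
    # D_KL(P_t || U) = sum_c P_t(c) * (log P_t(c) + log C); it penalizes
    # low-entropy (overconfident) per-time-step distributions.
    C = log_probs.size(-1)
    kl = (log_probs.exp() * (log_probs + math.log(C))).sum(-1).mean()
    return (1 - alpha) * ctc + alpha * kl
```

With α = 0 this reduces to plain CTC and with α = 1 to the pure KL term, matching the limiting cases noted above.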
Figure 4. Schematic diagram of the CRNN + label smoothing model (CNN → RNN → CTC with label smoothing → vision prediction).
3.2. Language Model

CTC-based models support end-to-end training and do not require pre-alignment of data; only input and output sequences are required for training. The main drawbacks of the CTC model are that it still assumes conditional independence between outputs, and that it only has acoustic modeling capability, lacking language modeling capability [24]. Enhanced language modeling usually requires the acquisition of linguistic features, that is, inferring the class of a character by considering the context between characters rather than the glyphic features of the character; such features are usually paired with the visual features extracted by visual models. The language model BCN was proposed in ABINet [25], where its purpose is to iteratively correct and check letters by fusing visual and language model features over n checking iterations. Considering the actual running time and the limited semantic information extracted after repeated iterations, only the language model BCN is added to the CRNN network, without the fusion-and-iteration mechanism. From Figure 5 it can be seen that the output of the CRNN is used as input to the language model, providing prior knowledge for the language model to acquire semantic information, and the gradient is back-propagated as an auxiliary task for CRNN recognition so that the output of the CRNN contains more semantic information.
Figure 5. Language model BCN. Character-position encodings serve as attention queries and the vision prediction as keys and values; multi-head attention with diagonal masks is followed by a feed-forward network and a linear layer that outputs character probabilities.
The prediction results of the visual model are used to provide prior knowledge for the language model, and text recognition is performed at both the visual and the contextual semantic level to obtain more comprehensive information about the text. Each layer of the BCN is a series of multi-head attention and feed-forward networks with residual connections and layer normalization. The network takes as input a sequential encoding of character positions rather than the characters themselves, and the character probability vector from the vision prediction is passed directly into the multi-head attention module. In the multi-head attention mechanism, the diagonal attention mask is designed to avoid seeing the current character and to achieve simultaneous access to the information to the left and right of the character, combining both sides to make a prediction.
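A minimal sketch of such a diagonal mask, assuming (as in common transformer implementations) that the mask is added to the attention logits before the softmax:

```python
import torch

def bcn_diagonal_mask(seq_len: int) -> torch.Tensor:
    # Position i may attend to every position except i itself, so the cloze-style
    # prediction for character i sees only its left and right context.
    mask = torch.zeros(seq_len, seq_len)
    mask.fill_diagonal_(float("-inf"))  # -inf logits become zero attention weight
    return mask
```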
4. Experiments

4.1. Datasets

In the field of text recognition, the demand for data volume is very large. Compared with the training data volume of a few thousand or a few hundred in the field of text detection, the training data volume in the field of text recognition is in the millions. The datasets used in this paper are all public datasets, which can be divided into synthetic datasets and real datasets. Since the training of network models requires a large amount of data as support, most text recognition models are trained using synthetic datasets; real datasets are used for the evaluation of the training results of text recognition models.

1. Synthetic datasets for training
MJSynth [26] is a synthetic plain English text dataset. The dataset contains 3000 folders
with about 9 million images, rendering the text onto natural images and then performing
a random transformation. The words in each image in the dataset are labeled and the
character orientation is mainly horizontal. SynthText [27] is also a synthetic text dataset, but
the difference is that SynthText is designed mainly for text detection, so the text is rendered
onto the complete natural image, consisting of 800,000 scene images. To accommodate text
recognition, the words are cropped according to the word annotation bounding boxes in
the experiments, and a total of about 7 million text images are cropped.
2. Real datasets for evaluation
Depending on the format of the text in the image, the real dataset can be divided into
regular and irregular text. Regular datasets include IIIT5k-Words (IIIT5k), SVT (Street View
Text), and IC13 (ICDAR2013). Irregular datasets include IC15 (ICDAR2015), SVTP (Street
View Text Perspective), and CUTE80.
IIIT5k-Words (IIIT5k) [28] consists of 3000 test images collected from the Internet.
The text in the images is mostly regular text, and, for each image, two dictionaries of
different sizes, 1000 words and 50 words, are matched. Each dictionary consists of real
annotations and other commonly used words; the images in SVT [29] are mainly cropped
from 249 Google Street View images and contain 647 low-resolution, noisy text
images, most of which are horizontal text images. Each image matches a 50-word dictionary;
the vast majority of text images in ICDAR 2013 (IC13) [30] are from IC03, with new images
added for data augmentation, for a total of 1015 text images, most of which are regular text
images, with some text images blurred due to uneven illumination.
ICDAR 2015 [31] from Challenge 4 of the ICDAR 2015 Robust Reading Competition,
called incidental scene text, consists mainly of plain English text of multi-directional scenes.
This dataset consists of some randomly taken street view images with low resolution, and
most of the text in the figures is relatively small and blurred, so it is relatively difficult to
detect. ICDAR 2015 divides the training and test sets into 1000 and 500 images, each of
which contains multi-directional text, which is labeled in word units using a rectangular
box of 4 points; SVTP [32] was selected from the side view of Google Street View. The
dataset consists of 645 cropped text images, most of which have distortion factors such as
low resolution, noise, and blur. Each image provides a dictionary of 50 words. CUTE80
(CUTE) [33] contains 288 high-resolution text images cropped from the original dataset.
The text in this dataset is mainly curved and directional text, and no associated dictionaries
are provided.
In this paper, limited by the experimental equipment and the actual running time, we use the synthetic dataset MJ in the training phase; the dataset ST could be added for training if experimental conditions allow.
4.2. Experimental Results

Word recognition accuracy is used as the evaluation index for the merits of text recognition algorithms; it is calculated as in Equation (6):
Accuracy = N/M × 100%,    (6)
In Equation (6), M represents the total number of samples to be recognized, and N represents the number of samples recognized completely correctly.
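In other words, Equation (6) is exact-match word accuracy; a minimal sketch:

```python
def word_accuracy(predictions: list[str], ground_truths: list[str]) -> float:
    # N: samples recognized completely correctly; M: total number of samples.
    n = sum(p == g for p, g in zip(predictions, ground_truths))
    return 100.0 * n / len(ground_truths)
```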
4.2.1. Comparison of Experimental Results between the Actual Run Baseline Model and
the Original Baseline Model
In order to evaluate the text recognition performance of the improved CRNN network
model objectively, considering that the test data in the original CRNN paper did not include
all six datasets, the CRNN test data involved in the comparison are from the replication
of Benchmark [16], and the CRNN-base represents the results obtained from the actual
experiments on the Linux experimental platform. From Table 1, we can see that CRNN-base gains 0.05%, 3.83%, 3.01%, 6.53%, 5.07%, and 0.75% over CRNN on IC13, SVT, IIIT5K, IC15, SVTP, and CUTE, respectively, and the experimental data are real and reliable.
Table 1. Comparison table of CRNN baseline experiments (accuracy of English dataset (%)).
4.2.2. Comparison of Adding Label Smoothing with the Baseline Model Approach
The experimental comparison table after adding label smoothing is shown in Table 2.
Considering that label smoothing is mostly applied in the classification task and the default
value of α is 0.1, the optimal value of α can be taken as 0.01 when adding label smoothing
in other tasks. Although text recognition can also be considered a multi-classification
problem, the applicability of label smoothing in the field of text recognition for the value of
α needs to be considered. The optimal value of the label smoothing parameter, α, needs to be determined, and this paper uses a GridSearch method to verify it around 0.01 versus 0.1. GridSearch tunes parameters by exhaustive search: it traverses all candidate parameter values in turn, tries every possibility, and selects the best-performing value as the final result. The specific experimental results are shown in Table 2.
Table 2. Comparison table of α values (accuracy of English dataset (%); "+" indicates an increase over CRNN-base and "−" indicates a decrease below CRNN-base).
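The search itself is simple to express. In the sketch below, train_and_eval is a hypothetical stand-in for training the model with a given α and returning validation accuracy, and the candidate values are illustrative, bracketing the defaults 0.1 and 0.01 discussed above:

```python
def grid_search_alpha(train_and_eval, candidates=(0.1, 0.05, 0.01, 0.005, 0.001)):
    # Exhaustively try every candidate and keep the best-performing one.
    scores = {alpha: train_and_eval(alpha) for alpha in candidates}
    best = max(scores, key=scores.get)
    return best, scores
```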
From the experimental results of GridSearch, it can be seen that, when the default
value of 0.1 is used for label smoothing, there is a decrease in the accuracy rate based on
the CRNN-base, indicating that the parameter default value is not ideal in the field of text
recognition; therefore, the algorithms and modules in other fields, when applied to new
fields, require parameter adjustment to find the best parameters. From the experimental
results, it can be seen that, except for an α default value of 0.1, other values have different
degrees of growth. In particular, the best effect is achieved when α = 0.005, where the gains on the irregular datasets exceed those on the regular datasets, indicating stronger applicability to irregular text; on IC13, SVT, IIIT5K, IC15, SVTP, and CUTE, the improvements over the CRNN-base model are 3.64%, 5.01%, 3.70%, 6.22%, 5.63%, and 7.15%,
respectively, which is significant. As a result, the field of label smoothing applications
has been extended from classification, machine translation, image segmentation, and
speech recognition to text recognition. By implementing regularization, the model is
prevented from predicting labels too confidently during training to improve generalization
ability, and can also be tuned with parameters to achieve significant improvements in
recognition accuracy.
4.2.3. Comparison of Adding Language Model with the Baseline Model Approach
The experimental comparison table after adding the BCN language model is shown
in Table 3. From the experimental results, it can be seen that adding the language model
to extract semantic information and adding label smoothing can achieve the effect of
improving the accuracy rate, and there are different degrees of improvement in both regular
and irregular texts, with more growth points in the irregular datasets than in the regular
datasets. In IC13, SVT, IIIT5K, IC15, SVTP, and CUTE, the increases over the base model
are 2.35%, 3.24%, 2.62%, 4.71%, 4.39%, and 5.21%, respectively. From the experimental
results, it can be seen that there is a problem of incomplete information acquisition when
considering the text information from the visual model alone for recognition. The addition
of the language model not only considers the visual aspect of a single character’s glyphic
features but also infers the category of the character from the context between characters,
broadening the access to information and improving the overall effect.
Table 3. Comparison table after adding the language model (accuracy of English dataset (%)).
5. Conclusions
In this paper, we propose an improved CRNN for scene text recognition. The improved
CRNN model is used for sequence recognition of text in images, trained on synthetic
datasets, and tested on six public datasets. Experimental results show that the improved
CRNN outperforms the original CRNN network in accuracy and compares favorably with other methods. The main contribution of this paper is to improve
the original CRNN network and propose a new idea of text recognition based on a neural
network. The CRNN network model retains the overall architecture of the original network,
and the use of label smoothing in predicting output results can effectively improve the
generalization ability of the model, improve the anti-interference ability of the model,
prevent the generation of over-fitting, and thus improve the recognition accuracy. The
smoothing loss function in speech recognition is introduced into the field of text recognition, and the CTC loss function is redefined. A
language model is added to fuse sequence information with language information for
text recognition, increasing the information acquisition channels and ultimately achieving
the goal of improving the accuracy of text recognition. With the addition of the language
model, the overall recognition accuracy is improved, but the number of parameters also
increases and the network run time becomes longer. Therefore, network optimization for
the improved network model is needed in future work to reduce the number of parameters
and running time. In addition, the public dataset is mainly in English, and CRNN has
achieved certain results in both Chinese and English recognition, but real-life text images
can be mixed with multiple languages, and further research is needed for the recognition
of mixed text.
Author Contributions: Conceptualization, W.Y.; methodology, W.Y.; software, W.Y.; validation, W.Y.
and M.I.; formal analysis, W.Y.; investigation, W.Y., A.H. and M.I.; resource, A.H. and M.I.; data
curation, W.Y.; writing—original draft preparation, W.Y.; writing—review and editing, M.I. All
authors have read and agreed to the published version of the manuscript.
Funding: This work has been supported by the Natural Science Foundation of China (62166043,
U2003207).
Data Availability Statement: Publicly available datasets were analyzed in this study. Our datasets
can be obtained from [https://github.com/clovaai/deep-text-recognition-benchmark] (11 December
2020), and [https://github.com/FangShancheng/ABINet] (10 June 2021).
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Liu, C.; Chen, X.; Luo, C.; Jin, L.; Xue, Y.; Liu, Y. A deep learning approach for natural scene text detection and recognition. Chin.
J. Graph. 2021, 26, 1330–1367. [CrossRef]
2. Shi, B.; Bai, X.; Yao, C. An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to
Scene Text Recognition. arXiv 2015, arXiv:1507.05717. [CrossRef] [PubMed]
3. Graves, A.; Fernández, S.; Gomez, F.; Schmidhuber, J. Connectionist temporal classification: Labelling unsegmented sequence
data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA,
USA, 25–29 June 2006; pp. 369–376.
4. Liu, Y.; Wang, Y.; Shi, H. A Convolutional Recurrent Neural-Network-Based Machine Learning for Scene Text Recognition
Application. Symmetry 2023, 15, 849. [CrossRef]
5. Lei, Z.; Zhao, S.; Song, H.; Shen, J. Scene text recognition using residual convolutional recurrent neural network. Mach. Vis. Appl.
2018, 29, 861–871. [CrossRef]
6. Bengio, Y.; Simard, P.; Frasconi, P. Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw.
1994, 5, 157–166. [CrossRef] [PubMed]
7. Graves, A.; Mohamed, A.; Hinton, G. Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE
International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 6645–6649.
8. Shi, B.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Robust scene text recognition with automatic rectification. In Proceedings of the 2016
IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4168–4176.
9. Lee, C.-Y.; Osindero, S. Recursive recurrent nets with attention modeling for OCR in the wild. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 2231–2239.
10. Liu, W.; Chen, C.; Wong, K.-Y.K.; Su, Z.; Han, J. Star-net: A spatial attention residue network for scene text recognition. BMVC
2016, 2, 7.
11. Wang, J.; Hu, X. Gated recurrent convolution neural network for OCR. Adv. Neural Inf. Process. Syst. 2017, 30, 334–343.
12. Borisyuk, F.; Gordo, A.; Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of
the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp.
71–79.
13. Baek, J.; Kim, G.; Lee, J.; Park, S.; Han, D.; Yun, S.; Oh, S.J.; Lee, H. What is wrong with scene text recognition model comparisons?
Dataset and model analysis. In Proceedings of the 2019 IEEE/CVF international Conference on Computer Vision, Seoul, Republic
of Korea, 27 October–2 November 2019; pp. 4715–4723.
14. Qiao, Z.; Zhou, Y.; Yang, D.; Zhou, Y.; Wang, W. Seed: Semantics enhanced encoder-decoder framework for scene text recognition.
In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June
2020; pp. 13528–13537.
15. Atienza, R. Vision transformer for fast and efficient scene text recognition. In Proceedings of the Document Analysis and
Recognition—ICDAR 2021: 16th International Conference, Lausanne, Switzerland, 5–10 September 2021; Proceedings, Part I 16.
Springer International Publishing: Cham, Switzerland, 2021; pp. 319–334.
16. Baek, J.; Matsui, Y.; Aizawa, K. What if we only use real datasets for scene text recognition? Toward scene text recognition with
fewer labels. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN,
USA, 20–25 June 2021; pp. 3113–3122.
17. Zhang, M.; Ma, M.; Wang, P. Scene text recognition with cascade attention network. In Proceedings of the 2021 International
Conference on Multimedia Retrieval, New York, NY, USA, 21–24 August 2021; pp. 385–393.
18. Bhunia, A.K.; Sain, A.; Chowdhury, P.N.; Song, Y.-Z. Text is text, no matter what: Unifying text recognition using knowledge
distillation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada,
10–17 October 2021; pp. 983–992.
19. Liu, C.; Yang, C.; Yin, X.C. Open-Set Text Recognition via Character-Context Decoupling. In Proceedings of the 2022 IEEE/CVF
Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4523–4532.
20. Liu, M.; Zhou, L. A cervical cell classification method based on migration learning and label smoothing strategy. Mod. Comput.
2022, 28, 1–9+32.
21. Müller, R.; Kornblith, S.; Hinton, G.E. When does label smoothing help? Adv. Neural Inf. Process. Syst. 2019, 32, 422.
22. Zhao, L. Research on User Behavior Recognition Based on CNN and LSTM. Master’s Thesis, Nanjing University of Information
Engineering, Nanjing, China, 2021. [CrossRef]
23. Kim, S.; Seltzer, M.L.; Li, J.; Zhao, R. Improved training for online end-to-end speech recognition systems. arXiv 2017,
arXiv:1711.02212.
24. Qin, C. Research on End-to-End Speech Recognition Technology. Ph.D. Thesis, Strategic Support Force Information Engineering
University, Zhengzhou, China, 2020. [CrossRef]
25. Fang, S.; Xie, H.; Wang, Y.; Mao, Z.; Zhang, Y. Read like humans: Autonomous, bidirectional and iterative language modeling for
scene text recognition. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville,
TN, USA, 20–25 June 2021; pp. 7098–7107.
26. Cheng, L.; Yan, J.; Chen, M.; Lu, Y.; Li, Y.; Hu, L. A multi-scale deformable convolution network model for text recognition. In
Proceedings of the Thirteenth International Conference on Graphics and Image Processing (ICGIP 2021); SPIE: Paris, France, 2022;
Volume 12083, pp. 627–635.
27. Gupta, A.; Vedaldi, A.; Zisserman, A. Synthetic data for text localisation in natural images. In Proceedings of the 2016 IEEE
Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2315–2324.
28. Mishra, A.; Alahari, K.; Jawahar, C.V. Top-down and bottom-up cues for scene text recognition. In Proceedings of the 2012 IEEE
Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 2687–2694.
29. Wang, K.; Babenko, B.; Belongie, S. End-to-end scene text recognition. In Proceedings of the 2011 International Conference on
Computer Vision, Barcelona, Spain, 6–13 November 2011; pp. 1457–1464.
30. Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; Bigorda, L.G.; Mestre, S.R.; Mas, J.; Mota, D.F.; Almazan, J.A.; de las Heras, L.P.
ICDAR 2013 robust reading competition. In Proceedings of the 2013 12th International Conference on Document Analysis and
Recognition, Washington, DC, USA, 25–28 August 2013; pp. 1484–1493.
31. Phan, T.Q.; Shivakumara, P.; Tian, S.; Tan, C.L. Recognizing text with perspective distortion in natural scenes. In Proceedings of
the 2013 IEEE International Conference on Computer Vision, Sydney, NSW, Australia, 1–8 December 2013; pp. 569–576.
32. Risnumawan, A.; Shivakumara, P.; Chan, C.S.; Tan, C.L. A robust arbitrary text detection system for natural scene images. Expert
Syst. Appl. 2014, 41, 8027–8048. [CrossRef]
33. Cheng, Z.; Xu, Y.; Bai, F.; Niu, Y.; Pu, S.; Zhou, S. Aon: Towards arbitrarily-oriented text recognition. In Proceedings of the 2018
IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5571–5579.