Visual Attention Models for Scene Text Recognition
Suman K. Ghosh, Ernest Valveny and Andrew D. Bagdanov
Abstract—In this paper we propose an approach to lexicon-free recognition of text in scene images. Our approach relies on an LSTM-based soft visual attention model learned from convolutional features. A set of feature vectors is derived from an intermediate convolutional layer corresponding to different areas of the image. This permits encoding of spatial information into the image representation. In this way, the framework is able to learn how to selectively focus on different parts of the image. At every time step the recognizer emits one character using a weighted combination of the convolutional feature vectors according to the learned attention model. Training can be done end-to-end using only word-level annotations. In addition, we show that modifying the beam search algorithm by integrating an explicit language model leads to significantly better recognition results. We validate the performance of our approach on the standard SVT, ICDAR'03 and MS-COCO scene text datasets, showing state-of-the-art performance in unconstrained text recognition.

1. Introduction

The increasing ability to capture images in any condition and situation poses many challenges and opportunities for extracting visual information from images. One such challenge is the detection and recognition of text "in the wild". Text in natural images is high-level semantic information that can aid automatic image understanding and retrieval. However, robust reading of text in uncontrolled environments is very different from text recognition in document images and much more challenging due to multiple factors such as difficult acquisition conditions, low resolution, font variability, complex backgrounds, different lighting conditions, blur, etc. Therefore, OCR techniques used in document images do not generalize well to recognition of scene text.

The problem of end-to-end scene text recognition is usually divided into two different tasks: word detection and word recognition. The goal of the word detection stage is to generate bounding boxes around potential words in the images. Subsequently, the words in these bounding boxes are recognized in the word recognition stage. This paper is focused on this second stage, word recognition.

Existing word recognition methods can be broadly divided into dictionary-based methods, which use some kind of predefined lexicon to guide the recognition, and unconstrained methods, which are able to recognize any word.

Dictionary-based scene text recognition. Traditionally, scene text recognition systems use character recognizers in a sequential way by localizing characters using a sliding window [9], [12], [18] and then grouping responses by arranging the character windows from left to right as words. A variety of techniques have been used to classify character bounding boxes, including random ferns [18], integer programming [14] and Convolutional Neural Networks (CNNs) [9]. These methods often use the lexical constraints imposed by a fixed lexicon while grouping the character hypotheses into words.

In contrast to sequential character recognizer models, holistic fixed-length representations have been proposed in [1], [4], [7], [8], [9], [13]. In [1], [4], [13], a holistic signature derived from a set of training images is used to learn a joint embedding space between images and words. The first attempt using CNN features was made by Jaderberg et al. in [9], where a sliding window over CNN features is used for robust scene text recognition using a fixed lexicon. Later, the same authors also proposed a fixed-length representation [7] using convolutional features trained on a synthetic dataset of 9 million images [8].

Unconstrained scene text recognition. Though most of the works in scene text recognition focus on fixed-lexicon recognition, a few attempts at unconstrained text recognition have also been made.

Bissacco et al. in [3] rely on sequential character classifiers. They use a massive number of annotated character bounding boxes to learn character classifiers. Binarization and sliding window methods are used to generate character proposals, followed by a text/background classifier. Finally, character probabilities given by the character classifiers are used in a beam search to recognize words. They also integrate a static character n-gram language model in every step of the beam search to incorporate an underlying language model.

Though CNN models have achieved great success in lexicon-based text recognition, word recognition in unconstrained scenarios requires modeling the underlying …
Figure 2. The proposed encoder-decoder framework with attention model.

… an LSTM-based decoder generates a sequence of alphanumeric symbols as output, one at every time step, terminating when a special stop symbol is output by the LSTM. Below we describe the details of each of the components of the framework.

Encoder: The encoder uses a convolutional neural network to extract a set of features from the image. Specifically, we make use of the CNN model proposed by Jaderberg et al. [7] for scene text recognition; however, we do not use the fully connected layer as a fixed-length representation, as is common in previous works. Instead, we take the features produced by the last convolutional layer. In this way we can produce a set of feature vectors, each of them linked to a specific spatial location of the image through its corresponding receptive field. This preserves spatial information about the image and reduces model complexity. Through the attention model, the decoder is able to use this spatial information to selectively focus on the most relevant parts of the image at every step.

Thus, given an input image of a cropped word, the encoder generates a set of feature vectors:

\Psi = \{ x_i : i = 1 \ldots K \},   (1)

where x_i denotes the feature vector corresponding to the i-th part of the image. Each x_i corresponds to a spatial location in the image and contains the activations of all feature maps at that location in the last convolutional layer of the CNN.
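To make the encoder output concrete, the following is a minimal NumPy sketch of how the activations of the last convolutional layer can be reshaped into the set Ψ of K spatial feature vectors. The array shapes and the function name conv_maps_to_feature_set are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conv_maps_to_feature_set(conv_maps):
    """Flatten a (H, W, D) block of last-layer convolutional activations into
    a set Psi of K = H * W feature vectors x_i, one per spatial cell (Eq. 1)."""
    H, W, D = conv_maps.shape
    return conv_maps.reshape(H * W, D)

# Hypothetical example: a 4 x 13 grid of 512-dimensional activations
psi = conv_maps_to_feature_set(np.random.randn(4, 13, 512))
print(psi.shape)  # (52, 512): K = 52 vectors, each tied to one receptive field
```

Each row of psi plays the role of one x_i in the attention model described next.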
Attention model: For the attention model, we adapt the soft attention model of [19] for image captioning, originally introduced by [2] for neural machine translation. In [19], slightly better results are obtained using the hard version of the model, which focuses, at every time step, on a single feature vector. However, we argue that, in the case of text recognition, the soft version is more appropriate, since a single character will usually span more than one spatial cell of the image corresponding to each of the feature vectors. The soft version of the model can combine several feature vectors with different weights into the final representation.

As shown in Figure 2, the attention model generates, at every time step t, a vector ẑ_t that is the input to the LSTM decoder. This vector ẑ_t can be expressed as a weighted combination of the set Ψ of feature vectors x_i extracted from the image:

\hat{z}_t = \sum_{i=1}^{K} \beta_{t,i} x_i   (2)

Thus, the vector ẑ_t encodes the relative importance of each part of the image in order to predict the next character of the underlying word. At every time step t, and for each location i, a positive weight β_{t,i} is assigned such that \sum_i \beta_{t,i} = 1. These weights are obtained as the softmax output of a Multi-Layer Perceptron (denoted as Φ) that uses the set of feature vectors Ψ and the hidden state of the LSTM decoder at the previous time step, h_{t-1}. More formally:

\alpha_{t,i} = \Phi(x_i, h_{t-1})   (3)

\beta_{t,i} = \frac{\exp(\alpha_{t,i})}{\sum_{j=1}^{K} \exp(\alpha_{t,j})}   (4)

This model is smooth and differentiable and thus it can be learned using standard back-propagation.
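The attention computation of Eqs. (2)-(4) can be summarized in a few lines. The sketch below is a minimal NumPy illustration; the single-hidden-layer parameterization of Φ (weights W_x, W_h, w_a) and all shapes are assumptions made for illustration, not the exact architecture used in the paper.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())           # subtract max for numerical stability
    return e / e.sum()

def soft_attention_step(psi, h_prev, W_x, W_h, w_a):
    """One soft-attention step (Eqs. 2-4).
    psi: (K, D) feature vectors x_i; h_prev: (H,) decoder state h_{t-1};
    W_x: (D, A), W_h: (H, A), w_a: (A,) parameterize the scoring MLP Phi."""
    alpha = np.tanh(psi @ W_x + h_prev @ W_h) @ w_a   # Eq. (3): one score per location
    beta = softmax(alpha)                             # Eq. (4): weights sum to 1
    z_t = beta @ psi                                  # Eq. (2): weighted sum of the x_i
    return z_t, beta
```

Because every operation here is differentiable, the attention weights can be trained jointly with the rest of the network by back-propagation, as stated above.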
Decoder: Our decoder is a Long Short-Term Memory (LSTM) network [5] which produces one symbol from the given symbol set L at every time step. The output of the LSTM is a vector y_t of |L| character probabilities, which represents the probability of emitting each of the characters in the symbol set L at time t. It depends on the output vector of the soft attention model ẑ_t, the hidden state at the previous step h_{t-1} and the output of the LSTM at the previous step y_{t-1}. We follow the notation introduced in [19], where the network is described by:

\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} =
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
T \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}   (5)

c_t = f_t \odot c_{t-1} + i_t \odot g_t   (6)

h_t = o_t \odot \tanh(c_t),   (7)

where T is the matrix of weights learned by the network and i_t, f_t, c_t, o_t, and h_t are the input gate, forget gate, memory, output gate and hidden state of the LSTM, respectively. In the above definition, ⊙ denotes element-wise multiplication and E is an embedding of the output character probabilities that is also learned by the network. σ and tanh denote the activation functions that are applied after the multiplication by the matrix of weights.

Finally, to compute the output character probability y_t, a deep output layer is added that takes as input the character probability at the previous step, the current LSTM hidden state, and the current feature vector. The output character probability is:

P(y_t \mid \Psi, y_{t-1}) \propto \exp\left( L_0 (E y_{t-1} + L_h h_t + L_z \hat{z}_t) \right)   (8)

where L_0, L_h and L_z are the parameters of the deep output layer that are learned using back-propagation.
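For concreteness, the following sketch performs one decoder step as in Eqs. (5)-(8) in plain NumPy. Treating T as a single matrix acting on the concatenated inputs, omitting bias terms, and the specific shapes given in the docstring are simplifying assumptions for illustration; they are not claimed to match the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def lstm_decoder_step(y_prev, h_prev, c_prev, z_t, E, T, L0, Lh, Lz):
    """One decoder step (Eqs. 5-8), biases omitted for brevity.
    y_prev: (|L|,) previous character distribution; h_prev, c_prev: (H,);
    z_t: (D,) attention glimpse; E: (M, |L|) character embedding;
    T: (4H, M + H + D); L0: (|L|, M), Lh: (M, H), Lz: (M, D) deep-output weights."""
    H = h_prev.shape[0]
    pre = T @ np.concatenate([E @ y_prev, h_prev, z_t])   # Eq. (5)
    i_t = sigmoid(pre[0:H])            # input gate
    f_t = sigmoid(pre[H:2 * H])        # forget gate
    o_t = sigmoid(pre[2 * H:3 * H])    # output gate
    g_t = np.tanh(pre[3 * H:4 * H])    # candidate memory
    c_t = f_t * c_prev + i_t * g_t                 # Eq. (6)
    h_t = o_t * np.tanh(c_t)                       # Eq. (7)
    y_t = softmax(L0 @ (E @ y_prev + Lh @ h_t + Lz @ z_t))   # Eq. (8), normalized
    return y_t, h_t, c_t
```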
3. Inference

We use beam search over LSTM outputs to perform word inference. We first introduce the basic procedure, and then describe how we extend it to incorporate language models.

… beam search any alternative that does not correspond to any partial branch of the trie.
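Since the beam search formulation is only partially described above, the following is a generic sketch of how a character-level beam search can be combined with an additive n-gram language model score at every step, in the spirit of this section. The callables step_log_probs and lm_log_prob, the interpolation weight lm_weight, the length cap and the stop symbol are all hypothetical names and choices; a lexicon-constrained variant would additionally discard any candidate prefix that is not a partial branch of a trie built from the dictionary.

```python
import heapq

def beam_search(step_log_probs, lm_log_prob, beam_width=5, lm_weight=0.5,
                eos='</s>', max_len=32):
    """Character-level beam search with a weak n-gram LM added at each step.
    step_log_probs(prefix) -> {char: log P(char | image, prefix)} from the decoder;
    lm_log_prob(prefix, char) -> log P(char | prefix) from an external n-gram model."""
    beams = [(0.0, '')]                 # (accumulated score, partial word)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for ch, lp in step_log_probs(prefix).items():
                new_score = score + lp + lm_weight * lm_log_prob(prefix, ch)
                if ch == eos:
                    finished.append((new_score, prefix))
                else:
                    candidates.append((new_score, prefix + ch))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates)   # keep the best hypotheses
    finished.extend(beams)
    return max(finished)[1] if finished else ''
```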
4.2. Baseline performance analysis

In this section we analyze the impact on performance of all the components of the proposed model. We start with a baseline that consists of a simple one-layer LSTM network as decoder, without any attention or explicit language model. As we are interested mainly in the impact of the attention model, we use a simple version in which CNN features from the encoder are fed to the LSTM only at the first time step. At every step the output character is determined based on the output of the previous step and the previous hidden state.
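As a point of comparison with the attention-based decoder sketched earlier, a minimal sketch of this baseline is given below; the callable lstm_step, the hidden size and the length cap are illustrative assumptions. The only difference from the attentive decoder is that the image feature enters the recurrence once, at t = 0, instead of through a per-step glimpse ẑ_t.

```python
import numpy as np

def baseline_decode(cnn_feature, lstm_step, start_prob, hidden_size,
                    max_len=32, eos_index=0):
    """Baseline decoder sketch: the CNN feature is fed only at t = 0; afterwards
    each character depends only on the previous output and the previous state.
    lstm_step(x, y_prev, h_prev, c_prev) -> (y, h, c) is assumed to be a decoder
    step such as the one sketched earlier (with a zero glimpse after step 0)."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    y = start_prob                       # distribution for the start symbol
    x = cnn_feature                      # image information enters here only
    chars = []
    for _ in range(max_len):
        y, h, c = lstm_step(x, y, h, c)
        x = np.zeros_like(cnn_feature)   # no image input after the first step
        k = int(np.argmax(y))
        if k == eos_index:
            break
        chars.append(k)
    return chars
```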
In an effort to evaluate each of our contributions, we trained the baseline system and our model with exactly the same training data. For this purpose we randomly sampled one million training samples from the Synth90k [8] dataset. For validation we used 300,000 samples randomly taken from the same Synth90k dataset.

We present the results for each component of the framework described above in Table 1. The attention model outperforms the baseline by a significant margin (around 7%). These results also confirm the advantage of using an explicit language model in addition to the implicit conditional character probabilities learned by the LSTM model. Using the language model improves accuracy by another 7%. We also see that further constraining the inference with a dictionary does not improve the result much, probably because the language model is learned from the same 90k dictionary proposed by Jaderberg et al. in [8].

In comparison with other related works on unconstrained text recognition, it is noteworthy that with only one million training samples our complete framework can learn a better model than Jaderberg et al. [6] and obtain results that are close to other state-of-the-art methods that use the whole 9-million-sample training dataset (see Table 2).

Methods                                           SVT
Baseline (LSTM, no attention)                     61.7
Proposed (LSTM + attention model)                 68.16
Proposed (LSTM + attention model + LM)            75.57
Proposed (LSTM + attention model + LM + dict)     76.04

TABLE 1. Impact of the different components of our framework with respect to the baseline. We compare the baseline (LSTM with no attention model) with all the variants of the proposed method.

4.3. Comparison with state of the art

In this section we compare our results with other related works on scene text recognition. The results of this comparison are shown in Table 2 for SVT and ICDAR'03 and in Table 3 for the COCO-Text dataset. First, we discuss results on unconstrained text recognition, which is the main focus of our work. Then, we analyze results for lexicon-based recognition.

Unconstrained text recognition: Apart from our method, Jaderberg et al. [6], Lee et al. [10] and Bissacco et al. [3] are the only methods which are capable of performing totally unconstrained recognition of scene text. Among these methods, our visual attention based model performs significantly better than Bissacco et al. [3] and Jaderberg et al. [6] on both the SVT and ICDAR'03 datasets. Our model also performs as well as Lee et al. [10] on the SVT dataset and outperforms them by 3% on the ICDAR'03 dataset, which is significant given the high recognition rates.

If we further compare our model with that of Lee et al. [10], which also uses variants of RNN architectures and an attention model on top of CNN features, we find that they use recursive CNN features. They report that this gives an 8% increase in accuracy over the baseline; this success is due to the recurrent nature of these features, which implicitly model the conditional probability of character sequences, so recursive CNN features perform better than traditional convolutional features. However, the RNN architecture they use improves only 4% over the baseline. In contrast, our method relies on traditional CNN features, which can possibly encode the presence of individual characters (as shown in [6]) from a lower convolutional layer while preserving local spatial characteristics, which reduces the complexity of the model. In addition, as reported in Table 1, our combination of LSTM and soft attention model achieves a much larger margin, 14%, over the baseline. These results show that a combination of local convolutional features with context-based attention performs better than, or comparably to, previous state-of-the-art results.

We provide the results on the COCO-Text dataset in Table 3. Since this is the most recently released dataset in this domain, there are no published results comparable with our work. To make a valid comparison we used two neural network based approaches by Jaderberg et al. [9], as they have made their models available online. We also fine-tuned the models on the COCO-Text dataset, which leads to a significant improvement (last row in Table 3). We can see that our simplest model is comparable to Jaderberg's results, while including the explicit language model leads to a significant improvement by a large margin.

Lexicon-based recognition: For SVT-50 we can observe that our method obtains a result similar to the best of the methods [8] specifically designed to work in a lexicon-based scenario. Comparing with methods for unconstrained text recognition, only the method of Lee et al. [10] outperforms our best setting. But as we have already discussed, part of this better performance can be explained by the use of the more complex recursive CNN features.

Concerning ICDAR'03-50 and ICDAR'03-full, our results, although they do not beat the current state of the art, are very competitive and comparable to the best performing methods.
Methods                 SVT-50   SVT    ICDAR'03-50   ICDAR'03-full   ICDAR'03
Lexicon-based
Almazan et al. [1]      89.2     -      -             -               -
Lee et al. [11]         80.0     -      88.0          76.0            -

TABLE 2. Comparison with the state of the art on SVT and ICDAR'03.

Methods                                           Accuracy
Charnet [9]                                       24.72
Dictnet [9]                                       26.79
Proposed (LSTM + attention model)                 24.11
Proposed (LSTM + attention model + LM)            33.67
Proposed (LSTM + attention model + LM + FT)       43.86

TABLE 3. Performance of our methods on the recently released COCO-Text dataset. We compare different variants of our method: using only the attention model, integrating the explicit language model, and also fine-tuning the model on the COCO-Text dataset.
5. Conclusions

In this paper we proposed an LSTM-based visual attention model for scene text recognition. The model uses convolutional features from a standard CNN as input to an LSTM network that selectively attends to parts of the image at each time step in order to recognize words without resorting to a fixed lexicon. We also propose a modified beam search strategy that is able to incorporate weak language models (n-grams) to improve recognition accuracy. Experimental results demonstrate that our approach outperforms or performs comparably to state-of-the-art approaches that use lexicons to constrain inferred output words. The experiments also show that context plays an important part in the case of real data; thus, using an explicit language model always helps to improve the results.

In the future we plan to extend the attention model to the text detection task, which will lead to an end-to-end framework for text recognition from images. Moreover, in our current framework convolutional features are taken from one single layer, which can lead to poorer results when the text is either too big or too small. This can be dealt with by combining features from multiple layers.
References

[1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 12, pp. 2552–2566, 2014.
[2] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014. [Online]. Available: http://arxiv.org/abs/1409.0473
[3] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, "PhotoOCR: Reading text in uncontrolled conditions," in 2013 IEEE International Conference on Computer Vision, Dec 2013, pp. 785–792.
[4] A. Gordo, "Supervised mid-level features for word image representation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[6] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep structured output learning for unconstrained text recognition," ICLR 2015, 2014.
[7] ——, "Reading text in the wild with convolutional neural networks," CoRR, vol. abs/1412.1842, 2014.
[8] ——, "Synthetic data and artificial neural networks for natural scene text recognition," in NIPS Deep Learning Workshop 2014, 2014.
[9] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in ECCV 2014, 2014.
[10] C. Lee and S. Osindero, "Recursive recurrent nets with attention modeling for OCR in the wild," CoRR, vol. abs/1603.03101, 2016. [Online]. Available: http://arxiv.org/abs/1603.03101
[11] S. Lee, M. S. Cho, K. Jung, and J. H. Kim, "Scene text extraction with edge constraint and text collinearity," in Proc. ICPR, 2010.
[12] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3538–3545.
[13] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin, "Label embedding: A frugal baseline for text recognition," International Journal of Computer Vision, vol. 113, no. 3, pp. 193–207, 2015.
[14] D. L. Smith, J. Field, and E. Learned-Miller, "Enforcing similarity constraints with integer programming for better scene text recognition," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 73–80.
[15] L. P. Sosa, S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition. IEEE Press, 2003, pp. 682–687.
[16] B. Su and S. Lu, "Accurate scene text recognition based on recurrent neural network," in Asian Conference on Computer Vision. Springer, 2014, pp. 35–48.
[17] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, "COCO-Text: Dataset and benchmark for text detection and recognition in natural images," arXiv preprint arXiv:1601.07140, 2016.
[18] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1457–1464.
[19] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," CoRR, vol. abs/1502.03044, 2015. [Online]. Available: http://arxiv.org/abs/1502.03044
[20] C. Yao, X. Bai, B. Shi, and W. Liu, "Strokelets: A learned multi-scale representation for scene text recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.