Visual Attention Models for Scene Text Recognition
Suman K. Ghosh, Ernest Valveny and Andrew D. Bagdanov
Abstract—In this paper we propose an approach to lexicon-free recognition of text in scene images. Our approach relies on an LSTM-based soft visual attention model learned from convolutional features. A set of feature vectors is derived from an intermediate convolutional layer corresponding to different areas of the image. This permits encoding of spatial information into the image representation. In this way, the framework is able to learn how to selectively focus on different parts of the image. At every time step the recognizer emits one character using a weighted combination of the convolutional feature vectors according to the learned attention model. Training can be done end-to-end using only word-level annotations. In addition, we show that modifying the beam search algorithm by integrating an explicit language model leads to significantly better recognition results. We validate the performance of our approach on the standard SVT, ICDAR'03 and MS-COCO scene text datasets, showing state-of-the-art performance in unconstrained text recognition.

1. Introduction

The increasing ability to capture images in any condition and situation poses many challenges and opportunities for extracting visual information from images. One such challenge is the detection and recognition of text "in the wild". Text in natural images is high-level semantic information that can aid automatic image understanding and retrieval. However, robust reading of text in uncontrolled environments is very different from text recognition in document images and much more challenging due to multiple factors such as difficult acquisition conditions, low resolution, font variability, complex backgrounds, different lighting conditions, blur, etc. Therefore, OCR techniques used in document images do not generalize well to recognition of scene text.

The problem of end-to-end scene text recognition is usually divided into two different tasks: word detection and word recognition. The goal of the word detection stage is to generate bounding boxes around potential words in the images. Subsequently, the words in these bounding boxes are recognized in the word recognition stage. This paper is focused on this second stage, word recognition.

Existing word recognition methods can be broadly divided into dictionary-based methods, which use some kind of predefined lexicon to guide the recognition, and unconstrained methods, which are able to recognize any word.

Dictionary-based scene text recognition. Traditionally, scene text recognition systems use character recognizers in a sequential way by localizing characters using a sliding window [9], [12], [18] and then grouping responses by arranging the character windows from left to right as words. A variety of techniques have been used to classify character bounding boxes, including random ferns [18], integer programming [14] and Convolutional Neural Networks (CNNs) [9]. These methods often use the lexical constraints imposed by a fixed lexicon while grouping the character hypotheses into words.

In contrast to sequential character recognizer models, holistic fixed-length representations have been proposed in [1], [4], [7], [8], [9], [13]. In [1], [4], [13], a holistic signature derived from a set of training images is used to learn a joint embedding space between images and words. The first attempt using CNN features was made by Jaderberg et al. in [9], where a sliding window over CNN features is used for robust scene text recognition using a fixed lexicon. Later, the same authors also proposed a fixed-length representation [7] using convolutional features trained on a synthetic dataset of 9 million images [8].

Unconstrained scene text recognition. Though most of the works in scene text recognition focus on fixed-lexicon recognition, a few attempts at unconstrained text recognition have also been made.

Bissacco et al. in [3] rely on sequential character classifiers. They use a massive number of annotated character bounding boxes to learn character classifiers. Binarization and sliding window methods are used to generate character proposals, followed by a text/background classifier. Finally, character probabilities given by the character classifiers are used in a beam search to recognize words. They also integrate a static character n-gram language model in every step of the beam search to incorporate an underlying language model.

Though CNN models have achieved great success in lexicon-based text recognition, word recognition in unconstrained scenarios requires modeling the underlying …
Figure 2. The proposed encoder-decoder framework with attention model.

… an LSTM-based decoder generates a sequence of alphanumeric symbols as output, one at every time step, terminating when a special stop symbol is output by the LSTM. Below we describe the details of each of the components of the framework.

Encoder: The encoder uses a convolutional neural network to extract a set of features from the image. Specifically, we make use of the CNN model proposed by Jaderberg et al. [7] for scene text recognition; however, we do not use the fully connected layer as a fixed-length representation, as is common in previous works. Instead, we take the features produced by the last convolutional layer. In this way we can produce a set of feature vectors, each of them linked to a specific spatial location of the image through its corresponding receptive field. This preserves spatial information about the image and reduces model complexity. Through the attention model, the decoder is able to use this spatial information to selectively focus on the most relevant parts of the image at every step.

Thus, given an input image of a cropped word, the encoder generates a set of feature vectors:

\Psi = \{ x_i : i = 1 \ldots K \},   (1)

where x_i denotes the feature vector corresponding to the i-th part of the image. Each x_i corresponds to a spatial location in the image and contains the activations of all feature maps at that location in the last convolutional layer of the CNN.
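To make the encoder output concrete, the following is a minimal NumPy sketch of how the activations of the last convolutional layer can be reshaped into the set Ψ of K spatial feature vectors. The array shapes and the function name conv_maps_to_feature_set are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def conv_maps_to_feature_set(conv_maps):
    """Flatten a (H, W, D) block of last-layer convolutional activations into
    a set Psi of K = H * W feature vectors x_i, one per spatial cell (Eq. 1)."""
    H, W, D = conv_maps.shape
    return conv_maps.reshape(H * W, D)

# Hypothetical example: a 4 x 13 grid of 512-dimensional activations
psi = conv_maps_to_feature_set(np.random.randn(4, 13, 512))
print(psi.shape)  # (52, 512): K = 52 vectors, each tied to one receptive field
```

Each row of psi plays the role of one x_i in the attention model described next.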
Attention model: For the attention model, we adapt the soft attention model of [19] for image captioning, originally introduced by [2] for neural machine translation. In [19], slightly better results are obtained using the hard version of the model, which focuses, at every time step, on a single feature vector. However, we argue that, in the case of text recognition, the soft version is more appropriate, since a single character will usually span more than one spatial cell of the image corresponding to each of the feature vectors. The soft version of the model can combine several feature vectors with different weights into the final representation.

As shown in Figure 2, the attention model generates, at every time step t, a vector ẑ_t that is the input to the LSTM decoder. This vector ẑ_t can be expressed as a weighted combination of the set Ψ of feature vectors x_i extracted from the image:

\hat{z}_t = \sum_{i=1}^{K} \beta_{t,i} x_i   (2)

Thus, the vector ẑ_t encodes the relative importance of each part of the image in order to predict the next character of the underlying word. At every time step t, and for each location i, a positive weight β_{t,i} is assigned such that \sum_i \beta_{t,i} = 1. These weights are obtained as the softmax output of a Multi-Layer Perceptron (denoted as Φ) that uses the set of feature vectors Ψ and the hidden state of the LSTM decoder at the previous time step, h_{t-1}. More formally:

\alpha_{t,i} = \Phi(x_i, h_{t-1})   (3)

\beta_{t,i} = \frac{\exp(\alpha_{t,i})}{\sum_{j=1}^{K} \exp(\alpha_{t,j})}   (4)

This model is smooth and differentiable and thus it can be learned using standard back-propagation.
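The attention computation of Eqs. (2)-(4) can be summarized in a few lines. The sketch below is a minimal NumPy illustration; the single-hidden-layer parameterization of Φ (weights W_x, W_h, w_a) and all shapes are assumptions made for illustration, not the exact architecture used in the paper.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())           # subtract max for numerical stability
    return e / e.sum()

def soft_attention_step(psi, h_prev, W_x, W_h, w_a):
    """One soft-attention step (Eqs. 2-4).
    psi: (K, D) feature vectors x_i; h_prev: (H,) decoder state h_{t-1};
    W_x: (D, A), W_h: (H, A), w_a: (A,) parameterize the scoring MLP Phi."""
    alpha = np.tanh(psi @ W_x + h_prev @ W_h) @ w_a   # Eq. (3): one score per location
    beta = softmax(alpha)                             # Eq. (4): weights sum to 1
    z_t = beta @ psi                                  # Eq. (2): weighted sum of the x_i
    return z_t, beta
```

Because every operation here is differentiable, the attention weights can be trained jointly with the rest of the network by back-propagation, as stated above.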
Decoder: Our decoder is a Long Short-Term Memory (LSTM) network [5] which produces one symbol from the given symbol set L at every time step. The output of the LSTM is a vector y_t of |L| character probabilities, which represents the probability of emitting each of the characters in the symbol set L at time t. It depends on the output vector of the soft attention model ẑ_t, the hidden state at the previous step h_{t-1} and the output of the LSTM at the previous step y_{t-1}. We follow the notation introduced in [19], where the network is described by:

\begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix} =
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
T \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}   (5)

c_t = f_t \odot c_{t-1} + i_t \odot g_t   (6)

h_t = o_t \odot \tanh(c_t),   (7)

where T is the matrix of weights learned by the network and i_t, f_t, c_t, o_t, and h_t are the input gate, forget gate, memory, output gate and hidden state of the LSTM, respectively. In the above definition, ⊙ denotes element-wise multiplication and E is an embedding of the output character probabilities that is also learned by the network. σ and tanh denote the activation functions that are applied after the multiplication by the matrix of weights.

Finally, to compute the output character probability y_t, a deep output layer is added that takes as input the character probability at the previous step, the current LSTM hidden state, and the current feature vector. The output character probability is:

P(y_t \mid \Psi, y_{t-1}) \propto \exp\left( L_0 (E y_{t-1} + L_h h_t + L_z \hat{z}_t) \right)   (8)

where L_0, L_h and L_z are the parameters of the deep output layer that are learned using back-propagation.
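For concreteness, the following sketch performs one decoder step as in Eqs. (5)-(8) in plain NumPy. Treating T as a single matrix acting on the concatenated inputs, omitting bias terms, and the specific shapes given in the docstring are simplifying assumptions for illustration; they are not claimed to match the authors' implementation.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def lstm_decoder_step(y_prev, h_prev, c_prev, z_t, E, T, L0, Lh, Lz):
    """One decoder step (Eqs. 5-8), biases omitted for brevity.
    y_prev: (|L|,) previous character distribution; h_prev, c_prev: (H,);
    z_t: (D,) attention glimpse; E: (M, |L|) character embedding;
    T: (4H, M + H + D); L0: (|L|, M), Lh: (M, H), Lz: (M, D) deep-output weights."""
    H = h_prev.shape[0]
    pre = T @ np.concatenate([E @ y_prev, h_prev, z_t])   # Eq. (5)
    i_t = sigmoid(pre[0:H])            # input gate
    f_t = sigmoid(pre[H:2 * H])        # forget gate
    o_t = sigmoid(pre[2 * H:3 * H])    # output gate
    g_t = np.tanh(pre[3 * H:4 * H])    # candidate memory
    c_t = f_t * c_prev + i_t * g_t                 # Eq. (6)
    h_t = o_t * np.tanh(c_t)                       # Eq. (7)
    y_t = softmax(L0 @ (E @ y_prev + Lh @ h_t + Lz @ z_t))   # Eq. (8), normalized
    return y_t, h_t, c_t
```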
3. Inference

We use beam search over LSTM outputs to perform word inference. We first introduce the basic procedure, and then describe how we extend it to incorporate language models.

… beam search any alternative that does not correspond to any partial branch of the trie.
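Since the beam search formulation is only partially described above, the following is a generic sketch of how a character-level beam search can be combined with an additive n-gram language model score at every step, in the spirit of this section. The callables step_log_probs and lm_log_prob, the interpolation weight lm_weight, the length cap and the stop symbol are all hypothetical names and choices; a lexicon-constrained variant would additionally discard any candidate prefix that is not a partial branch of a trie built from the dictionary.

```python
import heapq

def beam_search(step_log_probs, lm_log_prob, beam_width=5, lm_weight=0.5,
                eos='</s>', max_len=32):
    """Character-level beam search with a weak n-gram LM added at each step.
    step_log_probs(prefix) -> {char: log P(char | image, prefix)} from the decoder;
    lm_log_prob(prefix, char) -> log P(char | prefix) from an external n-gram model."""
    beams = [(0.0, '')]                 # (accumulated score, partial word)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beams:
            for ch, lp in step_log_probs(prefix).items():
                new_score = score + lp + lm_weight * lm_log_prob(prefix, ch)
                if ch == eos:
                    finished.append((new_score, prefix))
                else:
                    candidates.append((new_score, prefix + ch))
        if not candidates:
            break
        beams = heapq.nlargest(beam_width, candidates)   # keep the best hypotheses
    finished.extend(beams)
    return max(finished)[1] if finished else ''
```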
4.2. Baseline performance analysis

In this section we analyze the impact on performance of all the components of the proposed model. We start with a baseline that consists of a simple one-layer LSTM network as decoder, without any attention or explicit language model. As we are interested mainly in the impact of the attention model, we use a simple version in which CNN features from the encoder are fed to the LSTM only at the first time step. At every step the output character is determined based on the output of the previous step and the previous hidden state.
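As a point of comparison with the attention-based decoder sketched earlier, a minimal sketch of this baseline is given below; the callable lstm_step, the hidden size and the length cap are illustrative assumptions. The only difference from the attentive decoder is that the image feature enters the recurrence once, at t = 0, instead of through a per-step glimpse ẑ_t.

```python
import numpy as np

def baseline_decode(cnn_feature, lstm_step, start_prob, hidden_size,
                    max_len=32, eos_index=0):
    """Baseline decoder sketch: the CNN feature is fed only at t = 0; afterwards
    each character depends only on the previous output and the previous state.
    lstm_step(x, y_prev, h_prev, c_prev) -> (y, h, c) is assumed to be a decoder
    step such as the one sketched earlier (with a zero glimpse after step 0)."""
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    y = start_prob                       # distribution for the start symbol
    x = cnn_feature                      # image information enters here only
    chars = []
    for _ in range(max_len):
        y, h, c = lstm_step(x, y, h, c)
        x = np.zeros_like(cnn_feature)   # no image input after the first step
        k = int(np.argmax(y))
        if k == eos_index:
            break
        chars.append(k)
    return chars
```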
In an effort to evaluate each of our contributions, we trained the baseline system and our model with exactly the same training data. For this purpose we randomly sampled one million training samples from the Synth90k [8] dataset. For validation we used 300,000 samples randomly taken from the same Synth90k dataset.

We present the results for each component of the framework described above in Table 1. The attention model outperforms the baseline by a significant margin (around 7%). These results also confirm the advantage of using an explicit language model in addition to the implicit conditional character probabilities learned by the LSTM model. Using the language model improves accuracy by another 7%. We also see that further constraining the inference with a dictionary does not improve the result much, probably because the language model is learned from the same 90k dictionary proposed by Jaderberg et al. in [8].

In comparison with other related works on unconstrained text recognition, it is noteworthy that with only one million training samples our complete framework can learn a better model than Jaderberg et al. [6] and obtain results that are close to other state-of-the-art methods that use the whole 9-million-sample training dataset (see Table 2).

Methods                                           SVT
Baseline (LSTM, no attention)                     61.7
Proposed (LSTM + attention model)                 68.16
Proposed (LSTM + attention model + LM)            75.57
Proposed (LSTM + attention model + LM + dict)     76.04

TABLE 1. Impact of the different components of our framework with respect to the baseline. We compare the baseline (LSTM with no attention model) with all the variants of the proposed method.

4.3. Comparison with state of the art

In this section we compare our results with other related works on scene text recognition. The results of this comparison are shown in Table 2 for SVT and ICDAR'03 and in Table 3 for the COCO-Text dataset. First, we discuss results on unconstrained text recognition, which is the main focus of our work. Then, we analyze results for lexicon-based recognition.

Unconstrained text recognition: Apart from our method, Jaderberg et al. [6], Lee et al. [10] and Bissacco et al. [3] are the only methods which are capable of performing totally unconstrained recognition of scene text. Among these methods, our visual attention based model performs significantly better than Bissacco et al. [3] and Jaderberg et al. [6] on both the SVT and ICDAR'03 datasets. Our model also performs as well as Lee et al. [10] on the SVT dataset and outperforms them by 3% on the ICDAR'03 dataset, which is significant given the high recognition rates.

If we further compare our model with that of Lee et al. [10], which also uses variants of RNN architectures and an attention model on top of CNN features, we find that they use recursive CNN features. They report that this gives an 8% increase in accuracy over the baseline; this success is due to the recurrent nature of these features, which implicitly model the conditional probability of character sequences, so recursive CNN features perform better than traditional convolutional features. However, the RNN architecture they use improves only 4% over the baseline. In contrast, our method relies on traditional CNN features, which can possibly encode the presence of individual characters (as shown in [6]) from a lower convolutional layer while preserving local spatial characteristics, which reduces the complexity of the model. In addition, as reported in Table 1, our combination of LSTM and soft attention model achieves a much larger margin, 14%, over the baseline. These results show that a combination of local convolutional features with context-based attention performs better than, or comparably to, previous state-of-the-art results.

We provide the results on the COCO-Text dataset in Table 3. Since this is the most recently released dataset in this domain, there are no published results comparable with our work. To make a valid comparison we used two neural network based approaches by Jaderberg et al. [9], as they have made their models available online. We also fine-tuned the models on the COCO-Text dataset, which leads to a significant improvement (last row in Table 3). We can see that our simplest model is comparable to Jaderberg's results, while including the explicit language model leads to a significant improvement by a large margin.

Lexicon-based recognition: For SVT-50 we can observe that our method obtains a result similar to the best of the methods [8] specifically designed to work in a lexicon-based scenario. Comparing with methods for unconstrained text recognition, only the method of Lee et al. [10] outperforms our best setting. But as we have already discussed, part of this better performance can be explained by the use of the more complex recursive CNN features.

Concerning ICDAR'03-50 and ICDAR'03-full, our results, although they do not beat the current state of the art, are very competitive and comparable to the best performing methods.
Methods                 SVT-50   SVT    ICDAR'03-50   ICDAR'03-full   ICDAR'03
Lexicon-based
Almazan et al. [1]      89.2     -      -             -               -
Lee et al. [11]         80.0     -      88.0          76.0            -

TABLE 2. Comparison with the state of the art on SVT and ICDAR'03.

Methods                                           Accuracy
Charnet [9]                                       24.72
Dictnet [9]                                       26.79
Proposed (LSTM + attention model)                 24.11
Proposed (LSTM + attention model + LM)            33.67
Proposed (LSTM + attention model + LM + FT)       43.86

TABLE 3. Performance of our methods on the recently released COCO-Text dataset. We compare different variants of our method: using only the attention model, integrating the explicit language model, and also fine-tuning the model on the COCO-Text dataset.
5. Conclusions

In this paper we proposed an LSTM-based visual attention model for scene text recognition. The model uses convolutional features from a standard CNN as input to an LSTM network that selectively attends to parts of the image at each time step in order to recognize words without resorting to a fixed lexicon. We also propose a modified beam search strategy that is able to incorporate weak language models (n-grams) to improve recognition accuracy. Experimental results demonstrate that our approach outperforms or performs comparably to state-of-the-art approaches that use lexicons to constrain inferred output words. The experiments also show that context plays an important part in the case of real data; thus, using an explicit language model always helps to improve the results.

In the future we plan to extend the attention model to the text detection task, which will lead to an end-to-end framework for text recognition from images. Moreover, in our current framework convolutional features are taken from one single layer, which can lead to poorer results when the text is either too big or too small. This can be dealt with by combining features from multiple layers.
References

[1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 12, pp. 2552–2566, 2014.
[2] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014. [Online]. Available: http://arxiv.org/abs/1409.0473
[3] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, "PhotoOCR: Reading text in uncontrolled conditions," in 2013 IEEE International Conference on Computer Vision, Dec 2013, pp. 785–792.
[4] A. Gordo, "Supervised mid-level features for word image representation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[6] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep structured output learning for unconstrained text recognition," ICLR 2015, 2014.
[7] ——, "Reading text in the wild with convolutional neural networks," CoRR, vol. abs/1412.1842, 2014.
[8] ——, "Synthetic data and artificial neural networks for natural scene text recognition," in NIPS Deep Learning Workshop 2014, 2014.
[9] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in ECCV 2014, 2014.
[10] C. Lee and S. Osindero, "Recursive recurrent nets with attention modeling for OCR in the wild," CoRR, vol. abs/1603.03101, 2016. [Online]. Available: http://arxiv.org/abs/1603.03101
[11] S. Lee, M. S. Cho, K. Jung, and J. H. Kim, "Scene text extraction with edge constraint and text collinearity," in Proc. ICPR, 2010.
[12] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3538–3545.
[13] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin, "Label embedding: A frugal baseline for text recognition," International Journal of Computer Vision, vol. 113, no. 3, pp. 193–207, 2015.
[14] D. L. Smith, J. Field, and E. Learned-Miller, "Enforcing similarity constraints with integer programming for better scene text recognition," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 73–80.
[15] L. P. Sosa, S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition. IEEE Press, 2003, pp. 682–687.
[16] B. Su and S. Lu, "Accurate scene text recognition based on recurrent neural network," in Asian Conference on Computer Vision. Springer, 2014, pp. 35–48.
[17] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, "COCO-Text: Dataset and benchmark for text detection and recognition in natural images," arXiv preprint arXiv:1601.07140, 2016.
[18] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1457–1464.
[19] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," CoRR, vol. abs/1502.03044, 2015. [Online]. Available: http://arxiv.org/abs/1502.03044
[20] C. Yao, X. Bai, B. Shi, and W. Liu, "Strokelets: A learned multi-scale representation for scene text recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.