
2017 14th IAPR International Conference on Document Analysis and Recognition

Visual attention models for scene text recognition

Suman K. Ghosh and Ernest Valveny
Computer Vision Center, Dept. Ciències de la Computació
Universitat Autònoma de Barcelona
08193 Bellaterra (Barcelona), Spain
Email: sghosh,ernest@cvc.uab.es

Andrew D. Bagdanov
Media Integration and Communication Center (MICC)
Università di Firenze, Firenze
Email: bagdanov@cvc.uab.es

DOI: 10.1109/ICDAR.2017.158

Abstract—In this paper we propose an approach to lexicon-free recognition of text in scene images. Our approach relies on an LSTM-based soft visual attention model learned from convolutional features. A set of feature vectors is derived from an intermediate convolutional layer corresponding to different areas of the image. This permits encoding of spatial information into the image representation. In this way, the framework is able to learn how to selectively focus on different parts of the image. At every time step the recognizer emits one character using a weighted combination of the convolutional feature vectors according to the learned attention model. Training can be done end-to-end using only word-level annotations. In addition, we show that modifying the beam search algorithm by integrating an explicit language model leads to significantly better recognition results. We validate the performance of our approach on the standard SVT, ICDAR'03 and MS-COCO scene text datasets, showing state-of-the-art performance in unconstrained text recognition.

1. Introduction

The increasing ability to capture images in any condition and situation poses many challenges and opportunities for extracting visual information from images. One such challenge is the detection and recognition of text "in the wild". Text in natural images is high-level semantic information that can aid automatic image understanding and retrieval. However, robust reading of text in uncontrolled environments is very different from text recognition in document images and much more challenging due to multiple factors such as difficult acquisition conditions, low resolution, font variability, complex backgrounds, different lighting conditions, blur, etc. Therefore, OCR techniques used in document images do not generalize well to recognition of scene text.

The problem of end-to-end scene text recognition is usually divided into two different tasks: word detection and word recognition. The goal of the word detection stage is to generate bounding boxes around potential words in the images. Subsequently, the words in these bounding boxes are recognized in the word recognition stage. This paper is focused on the second stage, word recognition.

Existing word recognition methods can be broadly divided into dictionary-based methods, which use some kind of predefined lexicon to guide the recognition, and unconstrained methods, which are able to recognize any word.

Dictionary-based scene text recognition. Traditionally, scene text recognition systems use character recognizers in a sequential way by localizing characters with a sliding window [9], [12], [18] and then grouping responses by arranging the character windows from left to right as words. A variety of techniques have been used to classify character bounding boxes, including random ferns [18], integer programming [14] and Convolutional Neural Networks (CNNs) [9]. These methods often use the lexical constraints imposed by a fixed lexicon while grouping the character hypotheses into words.

In contrast to sequential character recognizer models, holistic fixed-length representations have been proposed in [1], [4], [7], [8], [9], [13]. In [1], [4], [13], a holistic signature derived from a set of training images is used to learn a joint embedding space between images and words. The first attempt using CNN features was made by Jaderberg et al. in [9], where a sliding window over CNN features is used for robust scene text recognition with a fixed lexicon. Later, the same authors also proposed a fixed-length representation [7] using convolutional features trained on a synthetic dataset of 9 million images [8].

Unconstrained scene text recognition. Though most of the works in scene text recognition focus on fixed-lexicon recognition, a few attempts at unconstrained text recognition have also been made.

Bissacco et al. in [3] rely on sequential character classifiers. They use a massive number of annotated character bounding boxes to learn character classifiers. Binarization and sliding window methods are used to generate character proposals, followed by a text/background classifier. Finally, the character probabilities given by the character classifiers are used in a beam search to recognize words. They also integrate a static character n-gram language model in every step of the beam search to incorporate an underlying language model.

Though CNN models have achieved great success in lexicon-based text recognition, word recognition in unconstrained scenarios requires modeling the underlying character-level language model. Jaderberg et al. in [6] proposed to use two separate CNNs, one modeling character unigram sequences and another n-gram language statistics. They additionally use a Conditional Random Field to model the interdependence of characters (n-grams). However, this significantly increases the computational complexity. In addition, to detect the presence of character n-grams in word images as neural activations, character n-grams are used as output nodes, leading to a huge output layer (10k output units for n = 4).

In contrast to the above strategies, our approach neither recognizes individual characters in the word image nor uses any holistic representation to recognize the word. Rather, it uses an LSTM-based visual attention model on top of CNN features (based on [19]) to focus attention on relevant parts of the image at every step and infer a character present in the image (see Figure 1). Thus, the system does not require explicit character segmentation and is able to recognize any word, without the help of any predefined dictionary. The visual attention model can be trained using only word bounding boxes and does not need explicit character bounding boxes at training time.

Figure 1. Overall scheme of the proposed recognition framework. Given a cropped word image, a set of spatially localized features is obtained using a CNN. Then, an LSTM decoder is combined with an attention model to generate the sequence of characters. At every time step the attention model weights the set of feature vectors to make the LSTM focus on a specific part of the image.

Recently, visual attention models have gained a lot of attention and have been used for machine translation [2], image captioning [19] and also text recognition [10]. In [19] the attention model is combined with an LSTM on top of CNN features. The LSTM outputs one caption word at every step, focusing on a specific part of the image driven by the attention model. In our work, we mainly follow this attention model, adapted to the particular case of text recognition. Although the work of [10] also makes use of a soft attention model for text recognition in the wild, there are significant differences with respect to our work. Firstly, their model relies on recursive CNN features to model the dependencies between characters. Instead, we use traditional, much simpler CNN features, and it is the visual attention model that learns to selectively attend to parts of the image and the dependencies between them. Secondly, Lee et al. [10] use the features from the fully connected layer, while we use features from an earlier convolutional layer, thus preserving the local spatial characteristics of the image and reducing the model complexity. This also allows the model to focus on a subset of features corresponding to a certain area of the image and learn the underlying interdependencies. Thirdly, we use an LSTM instead of a plain RNN, which has been shown to learn long-term dependencies better than traditional RNNs.

Our contributions with respect to the state-of-the-art. In summary, the contributions of our work are:

• We introduce an LSTM-based visual attention model on top of CNN features for unconstrained scene text recognition. This model is able to selectively attend to specific parts of word images, allowing it to model inter-character dependencies as needed and thus to implicitly model the underlying language.

• We show that weak explicit language models (in the form of prefix probabilities) can significantly boost the final recognition result without having to resort to a fixed lexicon. For that, we modify the beam search to take the language model into account. Additionally, the beam search can also incorporate a lexicon whenever it is available.

• We experimentally validate that our approach with weak language modeling outperforms the state-of-the-art in unconstrained scene text recognition and performs comparably to lexicon-based approaches, with a model complexity lower than similar approaches.

The rest of the paper is organized as follows. In Section 2 we present our attention-based recognition approach, and in Section 3 the inference procedure. In Section 4 we experimentally validate the model on a variety of standard and public benchmark datasets. We conclude in Section 5 with a summary of our contributions and a discussion of future research directions.

2. Visual attention for scene text recognition
Our recognition approach is based on an encoder-decoder framework for sequence-to-sequence learning. An overall scheme of the framework is illustrated in Figure 2. The encoder takes an image of a cropped word as input and encodes this image as a sequence of convolutional features. The attention model in between the encoder and the decoder drives, at every step, the focus of attention of the decoder towards a specific part of the sequence of features. Then, an LSTM-based decoder generates a sequence of alphanumeric symbols as output, one at every time step, terminating when a special stop symbol is output by the LSTM. Below we describe the details of each of the components of the framework.

Figure 2. The proposed encoder-decoder framework with attention model.

Encoder: The encoder uses a convolutional neural network to extract a set of features from the image. Specifically, we make use of the CNN model proposed by Jaderberg et al. [7] for scene text recognition; however, we do not use the fully connected layer as a fixed-length representation, as is common in previous works. Instead, we take the features produced by the last convolutional layer. In this way we can produce a set of feature vectors, each of them linked to a specific spatial location of the image through its corresponding receptive field. This preserves spatial information about the image and reduces model complexity. Through the attention model, the decoder is able to use this spatial information to selectively focus on the most relevant parts of the image at every step.

Thus, given an input image of a cropped word, the encoder generates a set of feature vectors:

    \Psi = \{ x_i : i = 1 \ldots K \},    (1)

where x_i denotes the feature vector corresponding to the i-th part of the image. Each x_i corresponds to a spatial location in the image and contains the activations of all feature maps at that location in the last convolutional layer of the CNN.
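To make the construction of Ψ concrete, here is a minimal sketch (not from the paper) of turning a convolutional feature map into the K spatially localized feature vectors of equation (1). The random array stands in for real CNN activations; the 512 × 4 × 13 shape is chosen to match the Dictnet feature map reported later in Section 4.

```python
import numpy as np

# Hypothetical output of the last convolutional layer for one cropped
# word image: 512 feature maps of spatial size 4 x 13 (as in Section 4).
conv_features = np.random.randn(512, 4, 13).astype(np.float32)

def to_feature_vectors(fmap):
    """Flatten a (C, H, W) feature map into K = H*W feature vectors.

    Each row x_i collects the activations of all C feature maps at one
    spatial location, so it stays tied to one receptive field of the image.
    """
    c, h, w = fmap.shape
    return fmap.reshape(c, h * w).T        # shape (K, C) = (52, 512)

psi = to_feature_vectors(conv_features)    # the set {x_i : i = 1..K}
print(psi.shape)                           # (52, 512)
```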
Attention model: For the attention model, we adapt the soft attention model of [19] for image captioning, originally introduced by [2] for neural machine translation. In [19] slightly better results are obtained using the hard version of the model, which focuses, at every time step, on a single feature vector. However, we argue that, in the case of text recognition, the soft version is more appropriate, since a single character will usually span more than one spatial cell of the image corresponding to each of the feature vectors. The soft version of the model can combine several feature vectors with different weights into the final representation.

As shown in Figure 2, the attention model generates, at every time step t, a vector ẑ_t that is the input to the LSTM decoder. This vector ẑ_t can be expressed as a weighted combination of the set Ψ of feature vectors x_i extracted from the image:

    \hat{z}_t = \sum_{i=1}^{K} \beta_{t,i} \, x_i    (2)

Thus, the vector ẑ_t encodes the relative importance of each part of the image in order to predict the next character of the underlying word. At every time step t, and for each location i, a positive weight β_{t,i} is assigned such that \sum_i \beta_{t,i} = 1. These weights are obtained as the softmax output of a multi-layer perceptron (denoted Φ) that takes the set of feature vectors Ψ and the hidden state of the LSTM decoder at the previous time step, h_{t-1}. More formally:

    \alpha_{t,i} = \Phi(x_i, h_{t-1})    (3)

    \beta_{t,i} = \frac{\exp(\alpha_{t,i})}{\sum_{j=1}^{K} \exp(\alpha_{t,j})}    (4)

This model is smooth and differentiable, and thus it can be learned using standard back-propagation.
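As an illustration only, the following numpy sketch mirrors equations (2)-(4): a small MLP Φ scores each feature vector against the previous hidden state, the scores are normalized with a softmax, and ẑ_t is the resulting weighted sum. The weight matrices are random stand-ins for parameters that would actually be learned end-to-end.

```python
import numpy as np

K, C, H = 52, 512, 256          # number of locations, feature dim, LSTM state dim
rng = np.random.default_rng(0)

# Stand-ins for learned parameters of the attention MLP (Phi).
W_x = rng.standard_normal((128, C))
W_h = rng.standard_normal((128, H))
v = rng.standard_normal(128)

def attention_step(psi, h_prev):
    """Compute alpha_{t,i}, beta_{t,i} and z_t for one time step (Eqs. 2-4)."""
    # alpha_{t,i} = Phi(x_i, h_{t-1}): a one-hidden-layer MLP score per location.
    hidden = np.tanh(psi @ W_x.T + h_prev @ W_h.T)       # (K, 128)
    alpha = hidden @ v                                    # (K,)
    # beta_t = softmax(alpha_t): positive weights that sum to 1.
    beta = np.exp(alpha - alpha.max())
    beta /= beta.sum()
    # z_t = sum_i beta_{t,i} * x_i
    z_t = beta @ psi                                      # (C,)
    return beta, z_t

psi = rng.standard_normal((K, C))        # feature vectors from the encoder
h_prev = np.zeros(H)                     # LSTM hidden state at t-1
beta, z_t = attention_step(psi, h_prev)
print(beta.sum(), z_t.shape)             # ~1.0 over 52 weights, (512,)
```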
Decoder: Our decoder is a Long Short-Term Memory (LSTM) network [5] which produces one symbol from the given symbol set L at every time step. The output of the LSTM is a vector y_t of |L| character probabilities, representing the probability of emitting each of the characters in the symbol set L at time t. It depends on the output vector of the soft attention model ẑ_t, the hidden state at the previous step h_{t-1} and the output of the LSTM at the previous step y_{t-1}. We follow the notation introduced in [19], where the network is described by:

    \begin{pmatrix} i_t \\ f_t \\ o_t \\ g_t \end{pmatrix}
    = \begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \end{pmatrix}
      T \begin{pmatrix} E y_{t-1} \\ h_{t-1} \\ \hat{z}_t \end{pmatrix}    (5)

    c_t = f_t \odot c_{t-1} + i_t \odot g_t    (6)

    h_t = o_t \odot \tanh(c_t),    (7)

where T is the matrix of weights learned by the network and i_t, f_t, c_t, o_t and h_t are the input, forget, memory, output and hidden state of the LSTM, respectively. In the above definition, ⊙ denotes element-wise multiplication and E is an embedding of the output character probabilities that is also learned by the network. σ and tanh denote the activation functions applied after the multiplication by the matrix of weights.

Finally, to compute the output character probability y_t, a deep output layer is added that takes as input the character probability at the previous step, the current LSTM hidden state and the current feature vector. The output character probability is:

    P(y_t \mid \Psi, y_{t-1}) \propto \exp\big( L_0 ( E y_{t-1} + L_h h_t + L_z \hat{z}_t ) \big),    (8)

where L_0, L_h and L_z are the parameters of the deep output layer, learned using back-propagation.
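The sketch below, again with random stand-in parameters and an illustrative symbol-set size, walks through one decoder step in the spirit of equations (5)-(8): the gates are computed from the concatenation [E y_{t-1}; h_{t-1}; ẑ_t], the cell and hidden states are updated, and the deep output layer produces normalized character probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)
C, H, L = 512, 256, 37            # feature dim, hidden dim, symbol set size (illustrative)
E = rng.standard_normal((64, L))  # learned character embedding (stand-in)
T = rng.standard_normal((4 * H, 64 + H + C))            # joint weight matrix of Eq. (5)
L0 = rng.standard_normal((L, 64))
Lh = rng.standard_normal((64, H))
Lz = rng.standard_normal((64, C))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def decoder_step(y_prev, h_prev, c_prev, z_t):
    """One LSTM step (Eqs. 5-7) followed by the deep output layer (Eq. 8)."""
    x = np.concatenate([E @ y_prev, h_prev, z_t])        # [E y_{t-1}; h_{t-1}; z_t]
    pre = T @ x
    i, f, o, g = np.split(pre, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g                             # Eq. (6)
    h_t = o * np.tanh(c_t)                               # Eq. (7)
    logits = L0 @ (E @ y_prev + Lh @ h_t + Lz @ z_t)     # Eq. (8), before normalization
    p_t = np.exp(logits - logits.max())
    p_t /= p_t.sum()
    return p_t, h_t, c_t

y_prev = np.eye(L)[0]                                    # previous character (one-hot)
p_t, h_t, c_t = decoder_step(y_prev, np.zeros(H), np.zeros(H), rng.standard_normal(C))
print(p_t.shape, p_t.sum())                              # (37,) 1.0
```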
3. Inference

We use beam search over LSTM outputs to perform word inference. We first introduce the basic procedure, and then describe how we extend it to incorporate language models.

3.1. The basic inference procedure

Once the model is trained, we use a beam search to approximately maximize the following score function over every possible word w = [c_1, …, c_N]:

    S(w, x) = \sum_{t=1}^{N} \log P(c_t \mid c_{t-1}),    (9)

where c_N is a special symbol signifying the end of a word, which immediately stops the beam search.

The beam search keeps track at every step of the top N most probable sequences of characters. For every active branch of the beam search, given the previous character of the sequence, c_{t-1}, the output character probability y_t of the LSTM is used to obtain P(c_t | c_{t-1}) for all characters c_t in the symbol set L.
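A compact sketch of this basic procedure is given below, assuming a `step` callback (a placeholder, not an interface defined in the paper) that wraps the attention model and LSTM and returns log-probabilities for every symbol in L given the characters decoded so far.

```python
import numpy as np

def beam_search(step, symbols, eos, beam_width=5, max_len=25):
    """Approximately maximize S(w, x) = sum_t log P(c_t | c_{t-1}) (Eq. 9).

    `step(prefix)` must return an array of log-probabilities, one per symbol,
    for the next character given the characters decoded so far.
    """
    beam = [([], 0.0)]                       # (prefix, accumulated log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            log_p = step(prefix)             # log P(c_t | prefix), shape (|L|,)
            for idx, sym in enumerate(symbols):
                cand = (prefix + [sym], score + float(log_p[idx]))
                # The end-of-word symbol immediately closes a branch.
                (finished if sym == eos else candidates).append(cand)
        if not candidates:
            break
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]       # keep the top-N active branches
    finished.sort(key=lambda c: c[1], reverse=True)
    best = finished[0] if finished else beam[0]
    return "".join(best[0]).rstrip(eos), best[1]

# Toy usage with a uniform dummy model (replace `step` with the real decoder).
symbols = list("abcdefghijklmnopqrstuvwxyz") + ["$"]     # "$" as the stop symbol
dummy = lambda prefix: np.log(np.full(len(symbols), 1.0 / len(symbols)))
print(beam_search(dummy, symbols, eos="$"))
```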
3.2. Incorporating language models and a lexicon

Text is strongly contextual: there are strict constraints imposed by the grammar of the language. For example, no word in English contains more than two consecutive occurrences of the same letter. Leveraging such knowledge can positively impact the final recognition output. Although the LSTM implicitly learns some dependences between consecutive characters, we show that adding an explicit language model that takes longer dependencies into account gives a significant boost to recognition accuracy.

In this work we use a standard n-gram based language model during inference to leverage the language prior. The character n-gram model gives the probability of a character conditioned on the k previous characters, where k is a parameter of the model:

    \Theta(c_k \mid c_{k-1}, c_{k-2}, \ldots, c_1) = \frac{\#(c_1 c_2 \ldots c_k)}{\#(c_1 c_2 \ldots c_{k-1})},    (10)

where #(c_1, …, c_n) is the number of occurrences of a particular substring in a training corpus.

Finally, the score function in equation 9 can be modified to take the n-gram language model into account as:

    S(w, x) = \sum_{t=1}^{N} \big[ \log P(c_t \mid c_{t-1}) + \alpha \log \Theta(c_t \mid c_{t-1}, c_{t-2}, \ldots, c_1) \big]    (11)

At every step we fix the parameter k of the language model to the number of previously generated characters, in order to take into account the longest possible sequence.
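As a rough sketch under the assumption of a plain count-based estimate, the following shows how Θ in equation (10) can be built from a word list and how the combined score of equation (11) would re-weight a beam-search branch; the toy lexicon, the per-step LSTM log-probabilities and the α value are placeholders.

```python
import math
from collections import Counter

class CharPrefixLM:
    """Count-based prefix model Theta(c_k | c_1 ... c_{k-1}) in the spirit of Eq. (10)."""

    def __init__(self, words, floor=1e-6):
        self.prefix_counts = Counter()
        self.n_words = len(words)
        self.floor = floor
        for w in words:
            for k in range(1, len(w) + 1):
                self.prefix_counts[w[:k]] += 1    # #(c_1 ... c_k) for every prefix

    def log_theta(self, prefix, c):
        num = self.prefix_counts.get(prefix + c, 0)
        den = self.prefix_counts.get(prefix, 0) if prefix else self.n_words
        return math.log(num / den) if num and den else math.log(self.floor)

def combined_score(lstm_log_probs, word, lm, alpha=0.3):
    """S(w, x) of Eq. (11): LSTM log-probabilities plus the alpha-weighted prefix LM."""
    return sum(lp + alpha * lm.log_theta(word[:t], c)
               for t, (lp, c) in enumerate(zip(lstm_log_probs, word)))

lm = CharPrefixLM(["door", "dome", "doom"])       # toy training lexicon
print(combined_score([-0.1, -0.2, -0.3, -0.2], "door", lm))
```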
Although our method is originally designed for unconstrained text recognition, it can also leverage a lexicon whenever one is available. The use of a lexicon D can be integrated by modifying the beam search so that all active sequences that do not correspond to any valid word are automatically removed from the beam. This can be implemented efficiently by storing the lexicon in a trie structure and automatically removing from the beam search any alternative that does not correspond to a partial branch of the trie.
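A minimal illustration (not the authors' code) of this lexicon-constrained decoding: the lexicon D is loaded into a trie, and any beam branch whose character sequence is not a prefix of some lexicon word is discarded.

```python
class TrieNode:
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children, self.is_word = {}, False

def build_trie(lexicon):
    """Store the lexicon D as a character trie."""
    root = TrieNode()
    for word in lexicon:
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True
    return root

def is_valid_prefix(root, prefix):
    """True if `prefix` matches a partial branch of the trie."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return False
    return True

def prune_beam(beam, root):
    """Drop active sequences that cannot be completed into a lexicon word."""
    return [(prefix, score) for prefix, score in beam
            if is_valid_prefix(root, "".join(prefix))]

trie = build_trie(["door", "dome", "dog"])
beam = [(list("do"), -0.4), (list("da"), -0.9), (list("doo"), -0.7)]
print([("".join(p), s) for p, s in prune_beam(beam, trie)])   # keeps "do" and "doo"
```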
4. Experimental Results

4.1. Datasets and experimental protocols

We evaluate the performance of the proposed method using the following standard datasets.

Street View Text (SVT) dataset: this dataset contains 647 cropped word images downloaded from Google Street View. Results using the predefined lexicons of 50 words per image defined by Wang et al. in [18] are referred to as SVT-50.

ICDAR'03 text dataset: this dataset contains 251 full images and 860 cropped word images [15]. We use the same protocol as [1], [10], [18] and evaluate on cropped word images whose ground-truth text contains only alphanumeric characters and at least three characters.

MS-COCO Text dataset [17]: this is a recently published dataset. It is challenging because none of its images were captured specifically with text recognition in mind. It is also much bigger than previous scene text datasets.

Synth90k dataset: this dataset is used only for training [8]. It contains 9 million synthetically generated text images. We use the official training partition, as in other works such as [10].

Evaluation protocol: We use the standard evaluation protocol adopted in most previous work on text recognition in scene images [6], [10], [18]. The accepted metric is word-level accuracy in percentage. SVT and ICDAR'03 are used for evaluation. For lexicon-based recognition, we use the same sets of 50 words per image for the SVT and ICDAR'03 datasets, as proposed by Wang et al. [18].

Implementation details: The CNN encoder used in this work is the Dictnet model of Jaderberg et al. [8]. Their deep convolutional network consists of four convolutional layers and two fully connected layers. In this work we use features from the last convolutional layer. Thus, the feature map is of size 4 × 13 and, therefore, the LSTM takes input in the form of 52 × 512.

For lexicon-based recognition we do not use the lexicon-based inference explained in Section 3.2. Instead, we take the output of unconstrained recognition and find the closest word in the lexicon using the Levenshtein edit distance. For lexicon-based inference on the unconstrained datasets (SVT and ICDAR'03) we use the 90k-word lexicon provided by Jaderberg et al. in [8]. The explicit language model is also learned from this 90k-word lexicon.
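For the lexicon-based evaluation just described, a simple sketch of mapping an unconstrained prediction to its nearest lexicon entry by Levenshtein edit distance is shown below; the dynamic-programming distance is a standard textbook implementation, not taken from the paper, and the toy lexicon is illustrative.

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def closest_word(prediction, lexicon):
    """Replace the unconstrained prediction by the nearest lexicon entry."""
    return min(lexicon, key=lambda w: levenshtein(prediction, w))

print(closest_word("d0or", ["door", "dome", "dog"]))   # -> "door"
```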
The parameter α (see equation 11), which weights the language model with respect to the LSTM character probability, is established empirically. In our experiments we found the best results with α between 0.25 and 0.3.
4.2. Baseline performance analysis

In this section we analyze the impact on performance of all the components of the proposed model. We start with a baseline that consists of a simple one-layer LSTM network as decoder, without any attention or explicit language model. As we are interested mainly in the impact of the attention model, we use a simple version in which CNN features from the encoder are fed to the LSTM only at the first time step. At every step the output character is determined based on the output of the previous step and the previous hidden state.

In an effort to evaluate each of our contributions, we trained the baseline system and our model with exactly the same training data. For this purpose we randomly sampled one million training samples from the Synth90k [8] dataset. For validation we used 300,000 samples randomly taken from the same Synth90k dataset.

We present the results for each component of the framework as described above in Table 1. The attention model outperforms the baseline by a significant margin (around 7%). These results also confirm the advantage of using an explicit language model in addition to the implicit conditional character probabilities learned by the LSTM model: using the language model improves accuracy by another 7%. We also see that further constraining the inference with a dictionary does not improve the result much, probably because the language model is learned from the same 90k dictionary proposed by Jaderberg et al. in [8].

In comparison with other related works on unconstrained text recognition, it is noteworthy that with only one million training samples our complete framework can learn a better model than Jaderberg et al. [6] and obtain results that are close to other state-of-the-art methods that use the whole 9-million-sample training dataset (see Table 2).

Methods                                          SVT
Baseline (LSTM, no attention)                    61.7
Proposed (LSTM + attention model)                68.16
Proposed (LSTM + attention model + LM)           75.57
Proposed (LSTM + attention model + LM + dict)    76.04

TABLE 1. Impact of the different components of our framework with respect to the baseline. We compare the baseline (LSTM with no attention model) with all the variants of the proposed method.

4.3. Comparison with state of the art

In this section we compare our results with other related works on scene text recognition. The results of this comparison are shown in Table 2 for SVT and ICDAR'03 and in Table 3 for the COCO dataset. First, we discuss results on unconstrained text recognition, which is the main focus of our work. Then, we analyze results for lexicon-based recognition.

Unconstrained text recognition: apart from our method, Jaderberg et al. [6], Lee et al. [10] and Bissacco et al. [3] are the only methods capable of performing totally unconstrained recognition of scene text. Among these methods, our visual-attention-based model performs significantly better than Bissacco et al. [3] and Jaderberg et al. [6] on both the SVT and ICDAR'03 datasets. Our model also performs as well as Lee et al. [10] on the SVT dataset and outperforms them by 3% on the ICDAR'03 dataset, which is significant given the high recognition rates.

If we further compare our model with that of Lee et al. [10], which also uses variants of RNN architectures and an attention model on top of CNN features, we find that they use recursive CNN features. They report that this gives an 8% increase in accuracy over their baseline; this success is due to the recurrent nature of the CNN features, which implicitly model the conditional probability of character sequences, so recursive CNN features perform better than traditional convolutional features. However, the RNN architecture they use improves only 4% over the baseline. In contrast, our method relies on traditional CNN features (which possibly encode the presence of individual characters, as shown in [6]) from a lower convolutional layer preserving local spatial characteristics, which reduces the complexity of the model. In addition, as reported in Table 1, our combination of LSTM and soft attention model achieves a much larger margin, 14%, over the baseline. These results show that a combination of local convolutional features with the context-based attention performs better than, or comparably to, the previous state-of-the-art results.

We provide the results on the COCO-Text dataset in Table 3. As this is the most recently released dataset in this domain, there are no published results directly comparable with our work. To make a valid comparison we used two neural-network-based approaches by Jaderberg et al. [9], as they have made their models available online. We also fine-tuned these models on the COCO dataset, which leads to a significant improvement (last row in Table 3). We can see that our simplest model is comparable to Jaderberg's results, while including the explicit language model leads to an improvement by a large margin.

Lexicon-based recognition: For SVT-50 we can observe that our method obtains a result similar to the best of the methods [8] specifically designed to work in a lexicon-based scenario. Comparing with methods for unconstrained text recognition, only the method of Lee et al. [10] outperforms our best setting. But, as we have already discussed, part of this better performance can be explained by the use of the more complex recursive CNN features.

Concerning ICDAR'03-50 and ICDAR'03-full, our results, although they do not beat the current state of the art, are very competitive and comparable to the best performing methods.
Methods                                          SVT-50  SVT    ICDAR'03-50  ICDAR'03-full  ICDAR'03
Lexicon-based:
Almazan et al. [1]                               89.2    -      -            -              -
Lee et al. [11]                                  80.0    -      88.0         76.0           -
Yao et al. [20]                                  75.9    -      88.5         80.3           -
Rodriguez-Serrano et al. [13]                    70.0    -      -            -              -
Jaderberg et al. [7]                             86.1    -      96.2         91.5           -
Su and Lu [16]                                   83.0    -      92.0         82.0           -
Gordo et al. [4]                                 90.7    -      -            -              -
*DICT Jaderberg et al. [8]                       95.4    80.7   98.7         98.6           93.1
Unconstrained:
Bissacco et al. [3]                              90.4    78.0   -            -              -
Jaderberg et al. [6]                             93.2    71.7   97.8         97.0           89.6
Lee et al. [10]                                  96.3    80.7   97.9         97.0           88.7
Proposed (LSTM + attention model)                91.7    75.1   93.4         91.0           89.3
Proposed (LSTM + attention model + LM)           95.2    80.4   95.7         94.1           92.6
Proposed (LSTM + attention model + LM + dict)    95.4    -      96.2         95.7           -

TABLE 2. Scene text recognition accuracy. "50" and "Full" denote the lexicon size used for constrained text recognition as defined in [18]. Results are divided into lexicon-based and unconstrained (lexicon-free) approaches. *DICT [8] is not lexicon-free due to incorporating ground-truth labels during training.

Methods                                          Accuracy
Charnet [9]                                      24.72
Dictnet [9]                                      26.79
Proposed (LSTM + attention model)                24.11
Proposed (LSTM + attention model + LM)           33.67
Proposed (LSTM + attention model + LM + FT)      43.86

TABLE 3. Performance of our methods on the recently released COCO-Text dataset. We compare different variants of our method: using only the attention model, integrating the explicit language model, and also fine-tuning the model on the COCO-Text dataset.
5. Conclusions

In this paper we proposed an LSTM-based visual attention model for scene text recognition. The model uses convolutional features from a standard CNN as input to an LSTM network that selectively attends to parts of the image at each time step in order to recognize words without resorting to a fixed lexicon. We also propose a modified beam search strategy that is able to incorporate weak language models (n-grams) to improve recognition accuracy. Experimental results demonstrate that our approach outperforms or performs comparably to state-of-the-art approaches that use lexicons to constrain the inferred output words. The experimental results also show that context plays an important part in the case of real data; thus, using an explicit language model always helps to improve the result.

In the future we plan to extend the attention model to the text detection task, which will lead to an end-to-end framework for text recognition from images. Moreover, in our current framework convolutional features are taken from one single layer, which can lead to poorer results when the text is either too big or too small. This can be dealt with by combining features from multiple layers.
References

[1] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 12, pp. 2552–2566, 2014.
[2] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," CoRR, vol. abs/1409.0473, 2014. [Online]. Available: http://arxiv.org/abs/1409.0473
[3] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, "PhotoOCR: Reading text in uncontrolled conditions," in 2013 IEEE International Conference on Computer Vision, Dec. 2013, pp. 785–792.
[4] A. Gordo, "Supervised mid-level features for word image representation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
[5] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[6] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Deep structured output learning for unconstrained text recognition," ICLR 2015, 2014.
[7] ——, "Reading text in the wild with convolutional neural networks," CoRR, vol. abs/1412.1842, 2014.
[8] ——, "Synthetic data and artificial neural networks for natural scene text recognition," in NIPS Deep Learning Workshop 2014, 2014.
[9] M. Jaderberg, A. Vedaldi, and A. Zisserman, "Deep features for text spotting," in ECCV 2014, 2014.
[10] C. Lee and S. Osindero, "Recursive recurrent nets with attention modeling for OCR in the wild," CoRR, vol. abs/1603.03101, 2016. [Online]. Available: http://arxiv.org/abs/1603.03101
[11] S. Lee, M. S. Cho, K. Jung, and J. H. Kim, "Scene text extraction with edge constraint and text collinearity," in Proc. ICPR, 2010.
[12] L. Neumann and J. Matas, "Real-time scene text localization and recognition," in 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012, pp. 3538–3545.
[13] J. A. Rodriguez-Serrano, A. Gordo, and F. Perronnin, "Label embedding: A frugal baseline for text recognition," International Journal of Computer Vision, vol. 113, no. 3, pp. 193–207, 2015.
[14] D. L. Smith, J. Field, and E. Learned-Miller, "Enforcing similarity constraints with integer programming for better scene text recognition," in 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011, pp. 73–80.
[15] L. P. Sosa, S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in Proceedings of the Seventh International Conference on Document Analysis and Recognition. IEEE Press, 2003, pp. 682–687.
[16] B. Su and S. Lu, "Accurate scene text recognition based on recurrent neural network," in Asian Conference on Computer Vision. Springer, 2014, pp. 35–48.
[17] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Belongie, "COCO-Text: Dataset and benchmark for text detection and recognition in natural images," arXiv preprint arXiv:1601.07140, 2016.
[18] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in 2011 International Conference on Computer Vision. IEEE, 2011, pp. 1457–1464.
[19] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, and Y. Bengio, "Show, attend and tell: Neural image caption generation with visual attention," CoRR, vol. abs/1502.03044, 2015. [Online]. Available: http://arxiv.org/abs/1502.03044
[20] C. Yao, X. Bai, B. Shi, and W. Liu, "Strokelets: A learned multi-scale representation for scene text recognition," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
