
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

TextBoxes: A Fast Text Detector with a Single Deep Neural Network

Minghui Liao,∗ Baoguang Shi,∗ Xiang Bai,† Xinggang Wang, Wenyu Liu
School of Electronic Information and Communications, Huazhong University of Science and Technology
{mhliao, xbai, xgwang, liuwy}@hust.edu.cn and shibaoguang@gmail.com

∗ Authors contribute equally. † Corresponding author.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper presents an end-to-end trainable fast scene text detector, named TextBoxes, which detects scene text with both high accuracy and efficiency in a single network forward pass, involving no post-processing except for a standard non-maximum suppression. TextBoxes outperforms competing methods in terms of text localization accuracy and is much faster, taking only 0.09s per image in a fast implementation. Furthermore, combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition tasks.

Introduction

Scene text is one of the most common visual objects in natural scenes. It frequently appears on road signs, license plates, product packages, etc. Reading scene text facilitates many useful applications, such as image-based geolocation. Despite its similarity to traditional OCR, scene text reading is much more challenging, due to the large variations in both foreground text and background objects, as well as uncontrollable lighting conditions.

Owing to these inevitable challenges and complexities, traditional text detection methods tend to involve multiple processing steps, e.g. character/word candidate generation (Neumann and Matas 2012; Jaderberg et al. 2016), candidate filtering, and grouping. They often end up struggling to get each module working properly, requiring much effort in tuning parameters and designing heuristic rules, which also slows down detection. Inspired by recent developments in object detection (Liu et al. 2016; Ren et al. 2015), we propose to detect text by directly predicting word bounding boxes via a single neural network that is end-to-end trainable.

Our key contribution in this paper is a fast and accurate text detector called TextBoxes, which is based on a fully convolutional network (LeCun et al. 1998). TextBoxes directly outputs the coordinates of word bounding boxes at multiple network layers by jointly predicting text presence and coordinate offsets to default boxes (Liu et al. 2016). The final outputs are the aggregation of all boxes, followed by a standard non-maximum suppression process. To handle the large variation in aspect ratios of words, we design several novel, inception-style (Szegedy et al. 2015) output layers that utilize both irregular convolutional kernels and default boxes. Our detector delivers both high accuracy and high efficiency with only a single forward pass on single-scale inputs, and even higher accuracy with multiple passes on multi-scale inputs.

Furthermore, we argue that word recognition is helpful for distinguishing text from backgrounds, especially when words are confined to a given set, i.e. a lexicon. We adopt a successful text recognition algorithm, CRNN (Shi, Bai, and Yao 2015), in conjunction with TextBoxes. The recognizer not only provides extra recognition outputs, but also regularizes text detection with its semantic-level awareness, thus further boosting the accuracy of word spotting considerably. The combination of TextBoxes and CRNN yields state-of-the-art performance on word spotting and end-to-end text recognition tasks, and appears to be a simple yet effective solution to robust text reading in the wild.

To summarize, the contributions of this paper are threefold: First, we design an end-to-end trainable neural network model for scene text detection. Second, we propose a word spotting/end-to-end recognition framework that effectively combines detection and recognition. Third, our model achieves highly competitive results while keeping its computational efficiency.

Related Works

Intuitively, scene text reading can be divided into two sub-tasks: text detection and text recognition. The former aims to localize text in images, mostly in the form of word bounding boxes; the latter transcribes cropped word images into machine-interpretable character sequences. We cover both tasks in this paper but pay more attention to detection.

Based on the basic detection target, previous methods for text detection can be roughly categorized into three categories:

Figure 1: TextBoxes architecture. TextBoxes is a 28-layer fully convolutional network. Among them, 13 layers are inherited from VGG-16, and 9 extra convolutional layers are appended after the VGG-16 layers. Text-box layers are connected to 6 of the convolutional layers. At every map location, a text-box layer predicts a 72-d vector, comprising the text presence scores (2-d) and offsets (4-d) for each of 12 default boxes. A non-maximum suppression is applied to the aggregated outputs of all text-box layers.

1) Character-based: Individual characters are first detected and then grouped into words (Neumann and Matas 2012; Pan, Hou, and Liu 2011; Yao et al. 2012; Huang, Qiao, and Tang 2014). For example, (Neumann and Matas 2012) locates characters by classifying Extremal Regions, after which the detected characters are grouped by an exhaustive search method;

2) Word-based: Words are directly detected in a manner similar to general object detection (Jaderberg et al. 2016; Zhong et al. 2016; Gomez-Bigorda and Karatzas 2016). (Jaderberg et al. 2016) proposes an R-CNN-based (Girshick et al. 2014) framework. First, word candidates are generated with class-agnostic proposal generators. Then the proposals are classified by a random forest classifier. Finally, a convolutional neural network for bounding box regression is adopted to refine the bounding boxes. (Gupta, Vedaldi, and Zisserman 2016) improves over the YOLO network (Redmon et al. 2016), while still adopting the filtering and regression steps to further remove false positives;

3) Text-line-based: Text lines are detected and then broken into words. For example, (Zhang et al. 2015) proposes to detect text lines utilizing their symmetric characteristics. Furthermore, (Zhang et al. 2016) localizes text lines with fully convolutional networks (Long, Shelhamer, and Darrell 2015).

TextBoxes is word-based. In contrast to (Jaderberg et al. 2016), which comprises three detection steps, each of which further includes more than one algorithm, TextBoxes enjoys a much simpler pipeline: we only need to train one network end-to-end.

TextBoxes is inspired by SSD (Liu et al. 2016), a recent development in object detection. SSD aims to detect general objects in images but fails on words that have extreme aspect ratios. We propose text-box layers in TextBoxes to solve this problem, which significantly improves the performance.

We adopt a text recognizer called CRNN (Shi, Bai, and Yao 2015) in conjunction with TextBoxes for word spotting and end-to-end recognition. CRNN directly outputs character sequences given input images and is also end-to-end trainable. Besides, we use the confidence scores of CRNN to regularize the detection outputs of TextBoxes. Note that it is also possible to adopt other recognizers, such as (Jaderberg et al. 2016).

Detecting text with TextBoxes

Architecture

The architecture of TextBoxes is depicted in Fig. 1. It inherits the popular VGG-16 architecture (Simonyan and Zisserman 2014), keeping the layers from conv1_1 through conv4_3. The last two fully-connected layers of VGG-16 are converted into convolutional layers by parameter down-sampling (Liu et al. 2016). They are followed by a few extra convolutional and pooling layers, namely conv6 to pool11.

Multiple output layers, which we call text-box layers, are inserted after the last and some intermediate convolutional layers. Their outputs are aggregated and undergo a non-maximum suppression (NMS) process. Output layers are also convolutional. Altogether, TextBoxes consists of only convolutional and pooling layers, and is thus fully convolutional. It adapts to arbitrary-size images in both training and testing.
Text-box layers

Text-box layers are the key component of TextBoxes. A text-box layer simultaneously predicts text presence and bounding boxes, conditioned on its input feature map. At every map location, it outputs the classification scores and offsets to its associated default boxes in a convolutional manner. Suppose that the image and feature map sizes are respectively (w_im, h_im) and (w_map, h_map). On a map location (i, j) associated with a default box b0 = (x0, y0, w0, h0), the text-box layer predicts the values of (Δx, Δy, Δw, Δh, c), indicating that a box b = (x, y, w, h) is detected with confidence c, where

    x = x0 + w0 Δx,
    y = y0 + h0 Δy,
    w = w0 exp(Δw),
    h = h0 exp(Δh).        (1)

In the training phase, ground-truth word boxes are matched to default boxes according to box overlap, following the matching scheme in (Liu et al. 2016). Each map location is associated with multiple default boxes of different sizes. They effectively divide words by their scales and aspect ratios, allowing TextBoxes to learn specific regression and classification weights that handle words of similar size. Therefore, the design of default boxes is highly task-specific.
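As a concrete reading of Eq. 1, the following is a minimal NumPy sketch of decoding predicted offsets into absolute boxes; the function name and array layout are illustrative assumptions rather than part of the released implementation.

    import numpy as np

    def decode_boxes(defaults, offsets):
        """Decode predicted offsets into boxes following Eq. 1.

        defaults: (N, 4) array of default boxes (x0, y0, w0, h0),
                  centers and sizes in normalized image coordinates.
        offsets:  (N, 4) array of predicted (dx, dy, dw, dh).
        Returns an (N, 4) array of decoded boxes (x, y, w, h).
        """
        x0, y0, w0, h0 = defaults.T
        dx, dy, dw, dh = offsets.T
        x = x0 + w0 * dx        # shift center horizontally, scaled by default width
        y = y0 + h0 * dy        # shift center vertically, scaled by default height
        w = w0 * np.exp(dw)     # exponential scaling keeps width and height positive
        h = h0 * np.exp(dh)
        return np.stack([x, y, w, h], axis=1)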

Different from general objects, words tend to have large aspect ratios. Therefore, we include "long" default boxes that have large aspect ratios. Specifically, we define 6 aspect ratios for default boxes: 1, 2, 3, 5, 7, and 10. However, this makes the default boxes dense in the horizontal direction but sparse vertically, which leads to poor matching. To solve this issue, each default box is additionally placed with a vertical offset. The design of the default boxes is illustrated in Fig. 2.

Figure 2: Illustration of default boxes for a 4*4 grid. For better visualization, only one column of default boxes, with aspect ratios 1 and 5, is plotted. The remaining aspect ratios are 2, 3, 7, and 10, which are placed similarly. The black (aspect ratio: 5) and blue (aspect ratio: 1) default boxes are centered in their cells. The green (aspect ratio: 5) and red (aspect ratio: 1) boxes have the same aspect ratios but a vertical offset (half of the height of the cell) relative to the grid center.
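A minimal sketch of this default-box layout is given below, assuming one box per aspect ratio at each cell center plus a vertically offset copy (12 boxes per location in total, matching the 72-d text-box output); the scale parameter is a placeholder, since per-layer box scales follow the SSD configuration and are not spelled out here.

    import numpy as np

    ASPECT_RATIOS = (1, 2, 3, 5, 7, 10)

    def default_boxes(map_w, map_h, scale=0.1):
        """Generate default boxes (x, y, w, h) in normalized coordinates.

        For every grid cell we place len(ASPECT_RATIOS) boxes at the cell
        center and the same boxes shifted down by half a cell, giving
        2 * 6 = 12 default boxes per map location.
        """
        boxes = []
        for j in range(map_h):
            for i in range(map_w):
                cx, cy = (i + 0.5) / map_w, (j + 0.5) / map_h
                for ar in ASPECT_RATIOS:
                    w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                    boxes.append((cx, cy, w, h))                  # centered box
                    boxes.append((cx, cy + 0.5 / map_h, w, h))    # vertically offset box
        return np.array(boxes)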
Moreover, in text-box layers we adopt irregular 1*5 convolutional filters instead of the standard 3*3 ones. These inception-style (Szegedy et al. 2015) filters yield rectangular receptive fields, which better fit words with larger aspect ratios, while also avoiding the noisy signals that a square-shaped receptive field would bring in.
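A PyTorch-style sketch of such a text-box output layer is shown below, assuming 12 default boxes per location with 2 presence scores and 4 offsets each (72 output channels); the layer and padding choices are inferred from the descriptions above, not taken from released code.

    import torch.nn as nn

    class TextBoxLayer(nn.Module):
        """Convolutional output layer with an irregular 1x5 kernel."""

        def __init__(self, in_channels, num_defaults=12):
            super().__init__()
            # 2 text/non-text scores + 4 offsets per default box -> 72 channels
            out_channels = num_defaults * (2 + 4)
            self.pred = nn.Conv2d(in_channels, out_channels,
                                  kernel_size=(1, 5), padding=(0, 2))

        def forward(self, feature_map):
            # (batch, 72, H, W): scores and offsets for every map location
            return self.pred(feature_map)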
Learning

We adopt the same loss function as (Liu et al. 2016). Let x be the match indication matrix, c the confidences, l the predicted locations, and g the ground-truth locations. Specifically, for the i-th default box and the j-th ground truth, x_ij = 1 indicates a match and x_ij = 0 otherwise. The loss function is defined as:

    L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g)),        (2)

where N is the number of default boxes that match ground-truth boxes, and α is set to 1. We adopt the smooth L1 loss (Girshick 2015) for L_loc and a 2-class softmax loss for L_conf.
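A compact sketch of Eq. 2 in PyTorch follows, assuming the SSD-style matching step has already produced the positive mask and encoded regression targets; hard negative mining, used in the SSD loss, is omitted for brevity, and helper names such as pos_mask are illustrative.

    import torch.nn.functional as F

    def textboxes_loss(cls_logits, loc_preds, cls_targets, loc_targets,
                       pos_mask, alpha=1.0):
        """Eq. 2: (L_conf + alpha * L_loc) / N, with N = number of matched boxes.

        cls_logits:  (num_boxes, 2) text/non-text scores.
        loc_preds:   (num_boxes, 4) predicted offsets.
        cls_targets: (num_boxes,) 0 = background, 1 = text.
        loc_targets: (num_boxes, 4) encoded ground-truth offsets.
        pos_mask:    (num_boxes,) boolean mask of matched default boxes.
        """
        n = pos_mask.sum().clamp(min=1).float()
        l_conf = F.cross_entropy(cls_logits, cls_targets, reduction='sum')
        l_loc = F.smooth_l1_loss(loc_preds[pos_mask], loc_targets[pos_mask],
                                 reduction='sum')
        return (l_conf + alpha * l_loc) / n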
Multi-scale inputs

Even with the optimizations on default boxes and convolutional filters, it may still be difficult to robustly localize words of extreme aspect ratios and sizes. To further boost detection accuracy, we feed multiple rescaled versions of the input image to TextBoxes. An input image is rescaled into five scales (width*height): 300*300, 700*700, 300*700, 500*700, and 1600*1600. Note that some scales squeeze the image horizontally, so that some "long" words are shortened. Multi-scale inputs boost detection accuracy while slightly increasing the computational cost. On ICDAR 2013, they further improve the detection f-measure by 5 percent. Detecting with all five scales takes 0.73s per image, and 0.24s if we remove the last 1600*1600 scale. The running time is measured on a single Titan X GPU. Note that, different from testing, we only use single-scale inputs (300*300) for training.

Non-maximum suppression

Non-maximum suppression is applied to the aggregated outputs of all text-box layers. We adopt an extra non-maximum suppression for multi-scale inputs on the task of text localization.

Word spotting and end-to-end recognition

Word spotting is to localize specific words that are given in a lexicon. End-to-end recognition concerns both detection and recognition. Although both tasks can be achieved by simply connecting TextBoxes with a text recognizer, we propose to improve detection with recognition. We argue that a recognizer can help eliminate false-positive detections that are unlikely to be meaningful words, e.g. repetitive patterns. In particular, when a lexicon is present, a recognizer can effectively remove detected bounding boxes that do not match any of the given words.

We adopt the CRNN model (Shi, Bai, and Yao 2015) as our text recognizer. CRNN uses CTC (Graves et al. 2006) as its output layer, which estimates the sequence probability conditioned on the input image, i.e. p(w|I), where I is an input image and w represents a character sequence. We treat this probability as a matching score, which measures the compatibility of an image with a particular word. The detection score is then the maximum score among all words in a given lexicon:

    s = max_{w∈W} p(w|I),        (3)

where W is the given lexicon. If the task specifies no lexicon, we use a generic lexicon that consists of 90k English words.

We replace the original TextBoxes detection score with the one in Eq. 3. However, evaluating Eq. 3 on all boxes would be time-consuming. In practice, we first use TextBoxes to produce a redundant set of word candidates by detecting with a lower score threshold and a higher NMS overlap threshold, preserving about 35 bounding boxes per image with a high recall of 0.93 (with multi-scale inputs on ICDAR 2013). Then we apply Eq. 3 to all candidates to re-evaluate their scores, followed by a second score thresholding and NMS. When dealing with multi-scale inputs, we generate candidates separately on each scale and perform the above steps on the candidates of all scales. Here we also adopt a slightly different NMS scheme: a lower overlap threshold is employed for boxes that are recognized as the same word, so that stronger suppression is imposed on boxes of the same word.
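The candidate re-scoring around Eq. 3 could be sketched as follows; crnn_log_prob is a hypothetical stand-in for a CRNN/CTC scoring routine (not an actual API of the released CRNN code), and the threshold value is a placeholder rather than the one used in the paper.

    import math

    def rescore_candidates(candidates, lexicon, crnn_log_prob, score_thresh=0.5):
        """Re-score word candidates with a lexicon, following Eq. 3.

        candidates:    list of (box, crop) pairs, where crop is the image patch
                       inside a box produced by TextBoxes with a low score
                       threshold and a high NMS overlap threshold.
        crnn_log_prob: hypothetical function(crop, word) -> log p(word | crop).
        Returns (box, best_word, score) triples passing the second threshold.
        """
        kept = []
        for box, crop in candidates:
            # Eq. 3: the new detection score is the best match over the lexicon.
            best_word, best_score = None, 0.0
            for word in lexicon:
                p = math.exp(crnn_log_prob(crop, word))
                if p > best_score:
                    best_word, best_score = word, p
            if best_score >= score_thresh:
                kept.append((box, best_word, best_score))
        return kept  # followed by a second, word-aware non-maximum suppression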

Table 1: Text localization on ICDAR 2011 and ICDAR 2013. P, R and F refer to precision, recall and F-measure respectively. FCRNall+filts reported a time consumption of 1.27 seconds excluding its regression step, so we assume it takes more than 1.27 seconds.

Method | IC11 (IC13 Eval) P/R/F | IC11 (DetEval) P/R/F | IC13 (IC13 Eval) P/R/F | IC13 (DetEval) P/R/F | Time/s
Jaderberg (Jaderberg et al. 2016) | – | – | – | – | 7.3
MSERs-CNN (Huang, Qiao, and Tang 2014) | 0.88/0.71/0.78 | – | – | – | –
MMser (Zamberletti, Noce, and Gallo 2014) | – | – | 0.86/0.70/0.77 | – | 0.75
TextFlow (Tian et al. 2015) | 0.86/0.76/0.81 | – | 0.85/0.76/0.80 | – | 1.4
FCRNall+filts (Gupta, Vedaldi, and Zisserman 2016) | – | 0.92/0.75/0.82 | – | 0.92/0.76/0.83 | >1.27
FCN (Zhang et al. 2016) | – | – | 0.88/0.78/0.83 | – | 2.1
SSD (Liu et al. 2016) | – | – | 0.80/0.60/0.68 | 0.80/0.60/0.69 | 0.1
Fast TextBoxes | 0.86/0.74/0.80 | 0.88/0.74/0.80 | 0.86/0.74/0.80 | 0.88/0.74/0.81 | 0.09
TextBoxes | 0.88/0.82/0.85 | 0.89/0.82/0.86 | 0.88/0.83/0.85 | 0.89/0.83/0.86 | 0.73

Experiments

We verify the effectiveness of TextBoxes on three different tasks: text detection, word spotting, and end-to-end recognition.

Datasets

SynthText (Gupta, Vedaldi, and Zisserman 2016) contains 800k synthesized text images, created by blending rendered words with natural images. The synthesized images look realistic, as the location and transform of the text are carefully chosen with a learning algorithm. This dataset is used for pre-training our model.

ICDAR 2011 (IC11) (Shahab, Shafait, and Dengel 2011) The ICDAR 2011 dataset contains real-world images of high resolution. The test set of the ICDAR 2011 dataset is used to evaluate our model.

ICDAR 2013 (IC13) (Karatzas et al. 2013) The ICDAR 2013 dataset is similar to the ICDAR 2011 dataset. We use the training set of ICDAR 2013 for training when we run experiments on the ICDAR 2011 and ICDAR 2013 datasets. The ICDAR 2013 dataset provides 3 lexicons of different sizes for the word spotting and end-to-end recognition tasks. For each test image, it gives 100 words as a lexicon, called the strong lexicon. For the whole test set, it gives a lexicon containing hundreds of words, called the weak lexicon. It also gives a generic lexicon which contains 90k words.

Street View Text (SVT) (Wang and Belongie 2010) The SVT dataset is more challenging than the ICDAR datasets due to the lower resolution of its images. There are some unlabeled texts in the images, so we only use this dataset for word spotting, in which a lexicon containing 50 words is provided for each image.

Implementation details

TextBoxes is trained on 300*300 images using stochastic gradient descent (SGD). Momentum and weight decay are set to 0.9 and 5 × 10^-4 respectively. The learning rate is initially set to 10^-3 and decayed to 10^-4 after 40k training iterations. On all the datasets except SVT, we first train TextBoxes on SynthText for 50k iterations, then fine-tune it on the ICDAR 2013 training dataset for 2k iterations. On SVT, the fine-tuning is performed on the SVT training dataset. All training images are augmented online with random crops and flips, following the scheme in (Liu et al. 2016). All the experiments are carried out on a PC with one Titan X GPU. The whole training takes about 25 hours. Text recognition is performed with a pre-trained CRNN (Shi, Bai, and Yao 2015) model, which is implemented and released by the authors (https://github.com/bgshih/crnn).
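As a sketch of the optimization settings above, a PyTorch-style analogue of the stated schedule might look as follows; the stand-in model and the per-iteration stepping are illustrative, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Stand-in single-layer "model"; the real network is the full TextBoxes
    # architecture (VGG-16 base plus extra and text-box layers).
    model = nn.Conv2d(3, 72, kernel_size=(1, 5), padding=(0, 2))

    # SGD with the hyperparameters stated above.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)

    # Decay the learning rate from 1e-3 to 1e-4 after 40k iterations,
    # stepping the scheduler once per training iteration (not per epoch).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[40000], gamma=0.1)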
Text localization

TextBoxes is tested on ICDAR 2011 and ICDAR 2013 to evaluate its text localization performance. The results are summarized and compared with other methods in Table 1. Results are evaluated under two different evaluation protocols, DetEval (Wolf and Jolion 2006) and the ICDAR 2013 evaluation (Karatzas et al. 2013).

Since there is a trade-off between precision and recall, f-measure is the most informative single measure of detection performance. TextBoxes consistently outperforms competing methods in terms of f-measure. On ICDAR 2011, TextBoxes outperforms the second best method (Gupta, Vedaldi, and Zisserman 2016) by 4 percent. On ICDAR 2013, TextBoxes also outperforms competing methods by at least 2 percent.

Figure 3: Examples of text localization results. Green bounding boxes are correct detections; red boxes are false positives; red dashed boxes are false negatives.

Table 2: Word spotting and end-to-end results. The values in the table are F-measure. For ICDAR 2013, strong, weak and generic mean a small lexicon containing 100 words for each image, a lexicon containing all words in the whole test set, and a large lexicon, respectively. We use a lexicon containing 90k words as our generic lexicon. The methods marked by "*" are published on the ICDAR 2015 Robust Reading Competition website: http://rrc.cvc.uab.es

Method | IC11 spotting | SVT spotting | SVT-50 spotting | IC13 spotting (strong/weak/generic) | IC13 end-to-end (strong/weak/generic)
Alsharif (Alsharif and Pineau 2013) | – | – | 0.48 | – | –
Jaderberg (Jaderberg et al. 2016) | 0.76 | 0.56 | 0.68 | –/–/0.76 | –
FCRNall+filts (Gupta, Vedaldi, and Zisserman 2016) | 0.84 | 0.53 | 0.76 | –/–/0.85 | –
Deep2Text II+* | – | – | – | 0.85/0.83/0.80 | 0.82/0.79/0.77
SRC-B-TextProcessingLab* | – | – | – | 0.90/0.88/0.81 | 0.87/0.85/0.80
Adelaide ConvLSTMs* | – | – | – | 0.91/0.90/0.83 | 0.87/0.86/0.80
TextBoxes | 0.87 | 0.64 | 0.84 | 0.94/0.92/0.87 | 0.91/0.89/0.84

TextBoxes also ranks first in terms of testing speed, even in the multi-scale version, which takes only 0.73s per image. Meanwhile, a fast implementation of TextBoxes takes merely 0.09s per image, without much loss in accuracy.

To further verify the effectiveness of TextBoxes, we also report the results of SSD (Liu et al. 2016) for comparison in Table 1; SSD is the most relevant and the state-of-the-art detector for general objects. Here, SSD is trained using the same procedure as TextBoxes. SSD achieves competitive performance, but still falls short of the other state-of-the-art methods. In particular, we observe that SSD cannot achieve good results when detecting words with large aspect ratios, while TextBoxes performs much better, benefiting from the proposed text-box layers, which are designed to overcome the length variation of words.

Word spotting and end-to-end recognition

The performance of word spotting is evaluated on detection results that are refined by recognition, while the evaluation of end-to-end performance concerns both detection and recognition results. We test TextBoxes on ICDAR 2011, SVT, and ICDAR 2013.

As shown in Table 2, our method outperforms all the existing methods, including the most recent competition results published on the website. On ICDAR 2011 and ICDAR 2013, our method outperforms the second best method by at least 2 percent under all the evaluation protocols listed in Table 2. The performance gap on SVT is even larger: TextBoxes outperforms the leading method (Gupta, Vedaldi, and Zisserman 2016) by over 8 percent on both SVT and SVT-50. The reason is likely that TextBoxes is more robust to the low-resolution images in SVT, since TextBoxes is trained on relatively low-resolution images.

Coupled with a recognition model, TextBoxes achieves state-of-the-art performance on end-to-end recognition benchmarks.

Figure 4: Examples of word spotting results. Yellow words are recognition results. Words with fewer than 3 letters are ignored, following the evaluation protocol. The box colors have the same meaning as in Fig. 3.

On ICDAR 2013, TextBoxes breaks the records recently set by Adelaide ConvLSTMs* under all the lexicon settings. More specifically, TextBoxes generates about 35 proposals per image when using multi-scale inputs on ICDAR 2013, with a recall of 0.93. With a strong lexicon for the recognition model, 3.8 bounding boxes per image are retained on average, achieving a recall of 0.91 and a precision of 0.97. We employ the 90k-word lexicon for SVT and ICDAR 2011, and a 50-word per-image lexicon on SVT-50. Note that even though Jaderberg (Jaderberg et al. 2016) and FCRNall+filts (Gupta, Vedaldi, and Zisserman 2016) adopt a much smaller lexicon (50k words), their results are still inferior to our method.

Running speed

Most existing methods detect text in a multi-step manner, making them hard to run efficiently. Most of the computation of TextBoxes is spent on the convolutional forward passes, which are very fast when running on GPU devices. TextBoxes takes only 0.09s per image with 700*700 single-scale inputs, resulting in an f-measure of 0.80 on ICDAR 2013, which is still very competitive. When running on 5 input scales, TextBoxes achieves an f-measure of 0.85 on ICDAR 2013, taking 0.73 second per image with the batch size set to 1. We remove the 1600*1600 scale when testing on SVT, since the SVT image resolutions are relatively low; testing on the remaining scales takes merely 0.24 second per image.

The speed comparisons are listed in Table 1. (Jaderberg et al. 2016) adopts two proposal generation methods, a random forest classifier, and a CNN regression model. They each take 1-3 seconds, about 7s in total. (Gupta, Vedaldi, and Zisserman 2016) proposes a YOLO-like model called FCRN, followed by the same random forest classifiers and a CNN regression model. It takes 1.27s excluding the regression step, whose running time is not reported. TextBoxes achieves the highest detection accuracy while being the fastest among them.

Weaknesses

TextBoxes performs well in most situations. However, it still fails to handle some difficult cases, such as overexposure and large character spacing. Some failure cases are shown in Fig. 3 and Fig. 4.

Conclusion

We have presented TextBoxes, an end-to-end fully convolutional network for text detection, which is highly stable and efficient at generating word proposals against cluttered backgrounds. Comprehensive evaluations and comparisons on benchmark datasets clearly validate the advantages of TextBoxes in three related tasks: text detection, word spotting and end-to-end recognition. In the future, we are interested in extending TextBoxes to multi-oriented text, and in combining the detection and recognition networks into one unified framework.

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China (61222308, 61573160, 61572207 and 61503145), and the Open Project Program of the State Key Laboratory of Digital Publishing Technology (F2016001).

References

Alsharif, O., and Pineau, J. 2013. End-to-end text recognition with hybrid HMM maxout models. CoRR abs/1310.1811.

Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR.
Girshick, R. B. 2015. Fast R-CNN. In Proc. ICCV.
Gomez-Bigorda, L., and Karatzas, D. 2016. TextProposals: a text-specific selective search algorithm for word spotting in the wild. CoRR abs/1604.02619.
Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, 369–376.
Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proc. CVPR.
Huang, W.; Qiao, Y.; and Tang, X. 2014. Robust scene text detection with convolution neural network induced MSER trees. In Proc. ECCV.
Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2016. Reading text in the wild with convolutional neural networks. IJCV 116(1):1–20.
Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L. G.; Mestre, S. R.; Mas, J.; Mota, D. F.; Almazan, J. A.; and de las Heras, L. P. 2013. ICDAR 2013 robust reading competition. In Proc. ICDAR, 1484–1493.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; and Reed, S. E. 2016. SSD: Single shot multibox detector. In Proc. ECCV.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proc. CVPR.
Neumann, L., and Matas, J. 2012. Real-time scene text localization and recognition. In Proc. CVPR, 3538–3545.
Pan, Y.-F.; Hou, X.; and Liu, C.-L. 2011. A hybrid approach to detect and localize texts in natural scene images. IEEE T. Image Proc. 20(3):800–813.
Redmon, J.; Divvala, S. K.; Girshick, R. B.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proc. CVPR.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. NIPS.
Shahab, A.; Shafait, F.; and Dengel, A. 2011. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Proc. ICDAR, 1491–1496.
Shi, B.; Bai, X.; and Yao, C. 2015. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR abs/1507.05717.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proc. CVPR.
Tian, S.; Pan, Y.; Huang, C.; Lu, S.; Yu, K.; and Lim Tan, C. 2015. Text flow: A unified text detection system in natural scene images. In Proc. ICCV.
Wang, K., and Belongie, S. 2010. Word spotting in the wild. In Proc. ECCV, 591–604.
Wolf, C., and Jolion, J. 2006. Object count/area graphs for the evaluation of object detection and segmentation algorithms. IJDAR 8(4):280–296.
Yao, C.; Bai, X.; Liu, W.; Ma, Y.; and Tu, Z. 2012. Detecting texts of arbitrary orientations in natural images. In Proc. CVPR, 1083–1090.
Zamberletti, A.; Noce, L.; and Gallo, I. 2014. Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In Proc. ACCV, 91–105.
Zhang, Z.; Shen, W.; Yao, C.; and Bai, X. 2015. Symmetry-based text line detection in natural scenes. In Proc. CVPR, 2558–2567.
Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; and Bai, X. 2016. Multi-oriented text detection with fully convolutional networks. In Proc. CVPR.
Zhong, Z.; Jin, L.; Zhang, S.; and Feng, Z. 2016. DeepText: A unified framework for text proposal generation and text detection in natural images. CoRR abs/1605.07314.

