
Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17)

TextBoxes: A Fast Text Detector with a Single Deep Neural Network

Minghui Liao,∗ Baoguang Shi,∗ Xiang Bai,† Xinggang Wang, Wenyu Liu
School of Electronic Information and Communications, Huazhong University of Science and Technology
{mhliao, xbai, xgwang, liuwy}@hust.edu.cn and shibaoguang@gmail.com

∗ Authors contribute equally. † Corresponding author.
Copyright © 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

This paper presents an end-to-end trainable fast scene text detector, named TextBoxes, which detects scene text with both high accuracy and efficiency in a single network forward pass, involving no post-processing except for a standard non-maximum suppression. TextBoxes outperforms competing methods in terms of text localization accuracy and is much faster, taking only 0.09s per image in a fast implementation. Furthermore, combined with a text recognizer, TextBoxes significantly outperforms state-of-the-art approaches on word spotting and end-to-end text recognition tasks.

Introduction

Scene text is one of the most common visual objects in natural scenes. It frequently appears on road signs, license plates, product packages, etc. Reading scene text facilitates many useful applications, such as image-based geolocation. Despite its similarity to traditional OCR, scene text reading is much more challenging, due to the large variations in both foreground text and background objects, as well as uncontrollable lighting conditions.

Owing to these inevitable challenges and complexities, traditional text detection methods tend to involve multiple processing steps, e.g. character/word candidate generation (Neumann and Matas 2012; Jaderberg et al. 2016), candidate filtering, and grouping. They often end up struggling to get each module working properly, requiring much effort in tuning parameters and designing heuristic rules, which also slows down detection. Inspired by recent developments in object detection (Liu et al. 2016; Ren et al. 2015), we propose to detect text by directly predicting word bounding boxes via a single neural network that is end-to-end trainable.

Our key contribution in this paper is a fast and accurate text detector called TextBoxes, which is based on a fully convolutional network (LeCun et al. 1998). TextBoxes directly outputs the coordinates of word bounding boxes at multiple network layers by jointly predicting text presence and coordinate offsets to default boxes (Liu et al. 2016). The final outputs are the aggregation of all boxes, followed by a standard non-maximum suppression process. To handle the large variation in aspect ratios of words, we design several novel, inception-style (Szegedy et al. 2015) output layers that utilize both irregular convolutional kernels and default boxes. Our detector delivers both high accuracy and high efficiency with only a single forward pass on single-scale inputs, and even higher accuracy with multiple passes on multi-scale inputs.

Furthermore, we argue that word recognition is helpful for distinguishing text from backgrounds, especially when words are confined to a given set, i.e. a lexicon. We adopt a successful text recognition algorithm, CRNN (Shi, Bai, and Yao 2015), in conjunction with TextBoxes. The recognizer not only provides extra recognition outputs, but also regularizes text detection with its semantic-level awareness, thus further boosting the accuracy of word spotting considerably. The combination of TextBoxes and CRNN yields state-of-the-art performance on word spotting and end-to-end text recognition tasks, and appears to be a simple yet effective solution to robust text reading in the wild.

To summarize, the contributions of this paper are threefold: First, we design an end-to-end trainable neural network model for scene text detection. Second, we propose a word spotting/end-to-end recognition framework that effectively combines detection and recognition. Third, our model achieves highly competitive results while keeping its computational efficiency.

Related Works

Intuitively, scene text reading can be divided into two sub-tasks: text detection and text recognition. The former aims to localize text in images, mostly in the form of word bounding boxes; the latter transcribes cropped word images into machine-interpretable character sequences. We cover both tasks in this paper but pay more attention to detection.

Based on the basic detection target, previous methods for text detection can be roughly categorized into three categories:

Figure 1: TextBoxes architecture. TextBoxes is a 28-layer fully convolutional network. Among them, 13 layers are inherited from VGG-16, and 9 extra convolutional layers are appended after the VGG-16 layers. Text-box layers are connected to 6 of the convolutional layers. At every map location, a text-box layer predicts a 72-d vector, comprising the text presence scores (2-d) and offsets (4-d) for each of 12 default boxes. A non-maximum suppression is applied to the aggregated outputs of all text-box layers.

1) Character-based: Individual characters are first detected and then grouped into words (Neumann and Matas 2012; Pan, Hou, and Liu 2011; Yao et al. 2012; Huang, Qiao, and Tang 2014). For example, (Neumann and Matas 2012) locates characters by classifying Extremal Regions, after which the detected characters are grouped by an exhaustive search method;

2) Word-based: Words are directly detected in a manner similar to general object detection (Jaderberg et al. 2016; Zhong et al. 2016; Gomez-Bigorda and Karatzas 2016). (Jaderberg et al. 2016) proposes an R-CNN-based (Girshick et al. 2014) framework. First, word candidates are generated with class-agnostic proposal generators. Then the proposals are classified by a random forest classifier. Finally, a convolutional neural network for bounding box regression is adopted to refine the bounding boxes. (Gupta, Vedaldi, and Zisserman 2016) improves over the YOLO network (Redmon et al. 2016), while still adopting the filtering and regression steps to further remove false positives;

3) Text-line-based: Text lines are detected and then broken into words. For example, (Zhang et al. 2015) proposes to detect text lines utilizing their symmetric characteristics. Furthermore, (Zhang et al. 2016) localizes text lines with fully convolutional networks (Long, Shelhamer, and Darrell 2015).

TextBoxes is word-based. In contrast to (Jaderberg et al. 2016), which comprises three detection steps, each of which further includes more than one algorithm, TextBoxes enjoys a much simpler pipeline: we only need to train one network end-to-end.

TextBoxes is inspired by SSD (Liu et al. 2016), a recent development in object detection. SSD aims to detect general objects in images but fails on words that have extreme aspect ratios. We propose text-box layers in TextBoxes to solve this problem, which significantly improves the performance.

We adopt a text recognizer called CRNN (Shi, Bai, and Yao 2015) in conjunction with TextBoxes for word spotting and end-to-end recognition. CRNN directly outputs character sequences given input images and is also end-to-end trainable. Besides, we use the confidence scores of CRNN to regularize the detection outputs of TextBoxes. Note that it is also possible to adopt other recognizers, such as (Jaderberg et al. 2016).

Detecting text with TextBoxes

Architecture

The architecture of TextBoxes is depicted in Fig. 1. It inherits the popular VGG-16 architecture (Simonyan and Zisserman 2014), keeping the layers from conv1_1 through conv4_3. The last two fully-connected layers of VGG-16 are converted into convolutional layers by parameter down-sampling (Liu et al. 2016). They are followed by a few extra convolutional and pooling layers, namely conv6 to pool11.

Multiple output layers, which we call text-box layers, are inserted after the last and some intermediate convolutional layers. Their outputs are aggregated and undergo a non-maximum suppression (NMS) process. Output layers are also convolutional. Altogether, TextBoxes consists of only convolutional and pooling layers, and is thus fully convolutional. It adapts to arbitrary-size images in both training and testing.
Text-box layers

Text-box layers are the key component of TextBoxes. A text-box layer simultaneously predicts text presence and bounding boxes, conditioned on its input feature map. At every map location, it outputs the classification scores and offsets to its associated default boxes in a convolutional manner. Suppose that the image and feature map sizes are respectively (w_im, h_im) and (w_map, h_map). On a map location (i, j) associated with a default box b0 = (x0, y0, w0, h0), the text-box layer predicts the values of (Δx, Δy, Δw, Δh, c), indicating that a box b = (x, y, w, h) is detected with confidence c, where

    x = x0 + w0 Δx,
    y = y0 + h0 Δy,
    w = w0 exp(Δw),
    h = h0 exp(Δh).        (1)

In the training phase, ground-truth word boxes are matched to default boxes according to box overlap, following the matching scheme in (Liu et al. 2016). Each map location is associated with multiple default boxes of different sizes. They effectively divide words by their scales and aspect ratios, allowing TextBoxes to learn specific regression and classification weights that handle words of similar size. Therefore, the design of default boxes is highly task-specific.
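As a concrete reading of Eq. 1, the following is a minimal NumPy sketch of decoding predicted offsets into absolute boxes; the function name and array layout are illustrative assumptions rather than part of the released implementation.

    import numpy as np

    def decode_boxes(defaults, offsets):
        """Decode predicted offsets into boxes following Eq. 1.

        defaults: (N, 4) array of default boxes (x0, y0, w0, h0),
                  centers and sizes in normalized image coordinates.
        offsets:  (N, 4) array of predicted (dx, dy, dw, dh).
        Returns an (N, 4) array of decoded boxes (x, y, w, h).
        """
        x0, y0, w0, h0 = defaults.T
        dx, dy, dw, dh = offsets.T
        x = x0 + w0 * dx        # shift center horizontally, scaled by default width
        y = y0 + h0 * dy        # shift center vertically, scaled by default height
        w = w0 * np.exp(dw)     # exponential scaling keeps width and height positive
        h = h0 * np.exp(dh)
        return np.stack([x, y, w, h], axis=1)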

Different from general objects, words tend to have large aspect ratios. Therefore, we include "long" default boxes that have large aspect ratios. Specifically, we define 6 aspect ratios for default boxes: 1, 2, 3, 5, 7, and 10. However, this makes the default boxes dense in the horizontal direction but sparse vertically, which leads to poor matching. To solve this issue, each default box is additionally placed with a vertical offset. The design of the default boxes is illustrated in Fig. 2.

Figure 2: Illustration of default boxes for a 4*4 grid. For better visualization, only one column of default boxes, with aspect ratios 1 and 5, is plotted. The remaining aspect ratios are 2, 3, 7, and 10, which are placed similarly. The black (aspect ratio: 5) and blue (aspect ratio: 1) default boxes are centered in their cells. The green (aspect ratio: 5) and red (aspect ratio: 1) boxes have the same aspect ratios but a vertical offset (half of the height of the cell) relative to the grid center.
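A minimal sketch of this default-box layout is given below, assuming one box per aspect ratio at each cell center plus a vertically offset copy (12 boxes per location in total, matching the 72-d text-box output); the scale parameter is a placeholder, since per-layer box scales follow the SSD configuration and are not spelled out here.

    import numpy as np

    ASPECT_RATIOS = (1, 2, 3, 5, 7, 10)

    def default_boxes(map_w, map_h, scale=0.1):
        """Generate default boxes (x, y, w, h) in normalized coordinates.

        For every grid cell we place len(ASPECT_RATIOS) boxes at the cell
        center and the same boxes shifted down by half a cell, giving
        2 * 6 = 12 default boxes per map location.
        """
        boxes = []
        for j in range(map_h):
            for i in range(map_w):
                cx, cy = (i + 0.5) / map_w, (j + 0.5) / map_h
                for ar in ASPECT_RATIOS:
                    w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                    boxes.append((cx, cy, w, h))                  # centered box
                    boxes.append((cx, cy + 0.5 / map_h, w, h))    # vertically offset box
        return np.array(boxes)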
Moreover, in text-box layers we adopt irregular 1*5 convolutional filters instead of the standard 3*3 ones. These inception-style (Szegedy et al. 2015) filters yield rectangular receptive fields, which better fit words with larger aspect ratios, while also avoiding the noisy signals that a square-shaped receptive field would bring in.
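A PyTorch-style sketch of such a text-box output layer is shown below, assuming 12 default boxes per location with 2 presence scores and 4 offsets each (72 output channels); the layer and padding choices are inferred from the descriptions above, not taken from released code.

    import torch.nn as nn

    class TextBoxLayer(nn.Module):
        """Convolutional output layer with an irregular 1x5 kernel."""

        def __init__(self, in_channels, num_defaults=12):
            super().__init__()
            # 2 text/non-text scores + 4 offsets per default box -> 72 channels
            out_channels = num_defaults * (2 + 4)
            self.pred = nn.Conv2d(in_channels, out_channels,
                                  kernel_size=(1, 5), padding=(0, 2))

        def forward(self, feature_map):
            # (batch, 72, H, W): scores and offsets for every map location
            return self.pred(feature_map)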
Learning

We adopt the same loss function as (Liu et al. 2016). Let x be the match indication matrix, c the confidences, l the predicted locations, and g the ground-truth locations. Specifically, for the i-th default box and the j-th ground truth, x_ij = 1 indicates a match and x_ij = 0 otherwise. The loss function is defined as:

    L(x, c, l, g) = (1/N) (L_conf(x, c) + α L_loc(x, l, g)),        (2)

where N is the number of default boxes that match ground-truth boxes, and α is set to 1. We adopt the smooth L1 loss (Girshick 2015) for L_loc and a 2-class softmax loss for L_conf.
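A compact sketch of Eq. 2 in PyTorch follows, assuming the SSD-style matching step has already produced the positive mask and encoded regression targets; hard negative mining, used in the SSD loss, is omitted for brevity, and helper names such as pos_mask are illustrative.

    import torch.nn.functional as F

    def textboxes_loss(cls_logits, loc_preds, cls_targets, loc_targets,
                       pos_mask, alpha=1.0):
        """Eq. 2: (L_conf + alpha * L_loc) / N, with N = number of matched boxes.

        cls_logits:  (num_boxes, 2) text/non-text scores.
        loc_preds:   (num_boxes, 4) predicted offsets.
        cls_targets: (num_boxes,) 0 = background, 1 = text.
        loc_targets: (num_boxes, 4) encoded ground-truth offsets.
        pos_mask:    (num_boxes,) boolean mask of matched default boxes.
        """
        n = pos_mask.sum().clamp(min=1).float()
        l_conf = F.cross_entropy(cls_logits, cls_targets, reduction='sum')
        l_loc = F.smooth_l1_loss(loc_preds[pos_mask], loc_targets[pos_mask],
                                 reduction='sum')
        return (l_conf + alpha * l_loc) / n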
Multi-scale inputs

Even with the optimizations on default boxes and convolutional filters, it may still be difficult to robustly localize words of extreme aspect ratios and sizes. To further boost detection accuracy, we feed multiple rescaled versions of the input image to TextBoxes. An input image is rescaled into five scales (width*height): 300*300, 700*700, 300*700, 500*700, and 1600*1600. Note that some scales squeeze the image horizontally, so that some "long" words are shortened. Multi-scale inputs boost detection accuracy while slightly increasing the computational cost. On ICDAR 2013, they further improve the detection f-measure by 5 percent. Detecting with all five scales takes 0.73s per image, and 0.24s if we remove the last 1600*1600 scale. The running time is measured on a single Titan X GPU. Note that, different from testing, we only use single-scale inputs (300*300) for training.

Non-maximum suppression

Non-maximum suppression is applied to the aggregated outputs of all text-box layers. We adopt an extra non-maximum suppression for multi-scale inputs on the task of text localization.

Word spotting and end-to-end recognition

Word spotting is to localize specific words that are given in a lexicon. End-to-end recognition concerns both detection and recognition. Although both tasks can be achieved by simply connecting TextBoxes with a text recognizer, we propose to improve detection with recognition. We argue that a recognizer can help eliminate false-positive detections that are unlikely to be meaningful words, e.g. repetitive patterns. In particular, when a lexicon is present, a recognizer can effectively remove detected bounding boxes that do not match any of the given words.

We adopt the CRNN model (Shi, Bai, and Yao 2015) as our text recognizer. CRNN uses CTC (Graves et al. 2006) as its output layer, which estimates the sequence probability conditioned on the input image, i.e. p(w|I), where I is an input image and w represents a character sequence. We treat this probability as a matching score, which measures the compatibility of an image with a particular word. The detection score is then the maximum score among all words in a given lexicon:

    s = max_{w∈W} p(w|I),        (3)

where W is the given lexicon. If the task specifies no lexicon, we use a generic lexicon that consists of 90k English words.

We replace the original TextBoxes detection score with the one in Eq. 3. However, evaluating Eq. 3 on all boxes would be time-consuming. In practice, we first use TextBoxes to produce a redundant set of word candidates by detecting with a lower score threshold and a higher NMS overlap threshold, preserving about 35 bounding boxes per image with a high recall of 0.93 (with multi-scale inputs on ICDAR 2013). Then we apply Eq. 3 to all candidates to re-evaluate their scores, followed by a second score thresholding and NMS. When dealing with multi-scale inputs, we generate candidates separately on each scale and perform the above steps on the candidates of all scales. Here we also adopt a slightly different NMS scheme: a lower overlap threshold is employed for boxes that are recognized as the same word, so that stronger suppression is imposed on boxes of the same word.
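The candidate re-scoring around Eq. 3 could be sketched as follows; crnn_log_prob is a hypothetical stand-in for a CRNN/CTC scoring routine (not an actual API of the released CRNN code), and the threshold value is a placeholder rather than the one used in the paper.

    import math

    def rescore_candidates(candidates, lexicon, crnn_log_prob, score_thresh=0.5):
        """Re-score word candidates with a lexicon, following Eq. 3.

        candidates:    list of (box, crop) pairs, where crop is the image patch
                       inside a box produced by TextBoxes with a low score
                       threshold and a high NMS overlap threshold.
        crnn_log_prob: hypothetical function(crop, word) -> log p(word | crop).
        Returns (box, best_word, score) triples passing the second threshold.
        """
        kept = []
        for box, crop in candidates:
            # Eq. 3: the new detection score is the best match over the lexicon.
            best_word, best_score = None, 0.0
            for word in lexicon:
                p = math.exp(crnn_log_prob(crop, word))
                if p > best_score:
                    best_word, best_score = word, p
            if best_score >= score_thresh:
                kept.append((box, best_word, best_score))
        return kept  # followed by a second, word-aware non-maximum suppression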

Table 1: Text localization on ICDAR 2011 and ICDAR 2013. P, R and F refer to precision, recall and F-measure respectively. FCRNall+filts reported a time consumption of 1.27 seconds excluding its regression step, so we assume it takes more than 1.27 seconds.

Method | IC11 (IC13 Eval) P/R/F | IC11 (DetEval) P/R/F | IC13 (IC13 Eval) P/R/F | IC13 (DetEval) P/R/F | Time/s
Jaderberg (Jaderberg et al. 2016) | – | – | – | – | 7.3
MSERs-CNN (Huang, Qiao, and Tang 2014) | 0.88/0.71/0.78 | – | – | – | –
MMser (Zamberletti, Noce, and Gallo 2014) | – | – | 0.86/0.70/0.77 | – | 0.75
TextFlow (Tian et al. 2015) | 0.86/0.76/0.81 | – | 0.85/0.76/0.80 | – | 1.4
FCRNall+filts (Gupta, Vedaldi, and Zisserman 2016) | – | 0.92/0.75/0.82 | – | 0.92/0.76/0.83 | >1.27
FCN (Zhang et al. 2016) | – | – | 0.88/0.78/0.83 | – | 2.1
SSD (Liu et al. 2016) | – | – | 0.80/0.60/0.68 | 0.80/0.60/0.69 | 0.1
Fast TextBoxes | 0.86/0.74/0.80 | 0.88/0.74/0.80 | 0.86/0.74/0.80 | 0.88/0.74/0.81 | 0.09
TextBoxes | 0.88/0.82/0.85 | 0.89/0.82/0.86 | 0.88/0.83/0.85 | 0.89/0.83/0.86 | 0.73

Experiments

We verify the effectiveness of TextBoxes on three different tasks: text detection, word spotting, and end-to-end recognition.

Datasets

SynthText (Gupta, Vedaldi, and Zisserman 2016) contains 800k synthesized text images, created by blending rendered words with natural images. The synthesized images look realistic, as the location and transform of the text are carefully chosen with a learning algorithm. This dataset is used for pre-training our model.

ICDAR 2011 (IC11) (Shahab, Shafait, and Dengel 2011) The ICDAR 2011 dataset contains real-world images of high resolution. The test set of the ICDAR 2011 dataset is used to evaluate our model.

ICDAR 2013 (IC13) (Karatzas et al. 2013) The ICDAR 2013 dataset is similar to the ICDAR 2011 dataset. We use the training set of ICDAR 2013 for training when we run experiments on the ICDAR 2011 and ICDAR 2013 datasets. The ICDAR 2013 dataset provides 3 lexicons of different sizes for the word spotting and end-to-end recognition tasks. For each test image, it gives 100 words as a lexicon, called the strong lexicon. For the whole test set, it gives a lexicon containing hundreds of words, called the weak lexicon. It also gives a generic lexicon which contains 90k words.

Street View Text (SVT) (Wang and Belongie 2010) The SVT dataset is more challenging than the ICDAR datasets due to the lower resolution of its images. There are some unlabeled texts in the images, so we only use this dataset for word spotting, in which a lexicon containing 50 words is provided for each image.

Implementation details

TextBoxes is trained on 300*300 images using stochastic gradient descent (SGD). Momentum and weight decay are set to 0.9 and 5 × 10^-4 respectively. The learning rate is initially set to 10^-3 and decayed to 10^-4 after 40k training iterations. On all the datasets except SVT, we first train TextBoxes on SynthText for 50k iterations, then fine-tune it on the ICDAR 2013 training dataset for 2k iterations. On SVT, the fine-tuning is performed on the SVT training dataset. All training images are augmented online with random crops and flips, following the scheme in (Liu et al. 2016). All the experiments are carried out on a PC with one Titan X GPU. The whole training takes about 25 hours. Text recognition is performed with a pre-trained CRNN (Shi, Bai, and Yao 2015) model, which is implemented and released by the authors (https://github.com/bgshih/crnn).
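As a sketch of the optimization settings above, a PyTorch-style analogue of the stated schedule might look as follows; the stand-in model and the per-iteration stepping are illustrative, not the authors' implementation.

    import torch
    import torch.nn as nn

    # Stand-in single-layer "model"; the real network is the full TextBoxes
    # architecture (VGG-16 base plus extra and text-box layers).
    model = nn.Conv2d(3, 72, kernel_size=(1, 5), padding=(0, 2))

    # SGD with the hyperparameters stated above.
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                                momentum=0.9, weight_decay=5e-4)

    # Decay the learning rate from 1e-3 to 1e-4 after 40k iterations,
    # stepping the scheduler once per training iteration (not per epoch).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[40000], gamma=0.1)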
Text localization

TextBoxes is tested on ICDAR 2011 and ICDAR 2013 to evaluate its text localization performance. The results are summarized and compared with other methods in Table 1. Results are evaluated under two different evaluation protocols, DetEval (Wolf and Jolion 2006) and the ICDAR 2013 evaluation (Karatzas et al. 2013).

Since there is a trade-off between precision and recall, f-measure is the most informative single measure of detection performance. TextBoxes consistently outperforms competing methods in terms of f-measure. On ICDAR 2011, TextBoxes outperforms the second best method (Gupta, Vedaldi, and Zisserman 2016) by 4 percent. On ICDAR 2013, TextBoxes also outperforms competing methods by at least 2 percent.

Figure 3: Examples of text localization results. Green bounding boxes are correct detections; red boxes are false positives; red dashed boxes are false negatives.

Table 2: Word spotting and end-to-end results. The values in the table are F-measure. For ICDAR 2013, strong, weak and generic mean a small lexicon containing 100 words for each image, a lexicon containing all words in the whole test set, and a large lexicon, respectively. We use a lexicon containing 90k words as our generic lexicon. The methods marked by "*" are published on the ICDAR 2015 Robust Reading Competition website: http://rrc.cvc.uab.es

Method | IC11 spotting | SVT spotting | SVT-50 spotting | IC13 spotting (strong/weak/generic) | IC13 end-to-end (strong/weak/generic)
Alsharif (Alsharif and Pineau 2013) | – | – | 0.48 | – | –
Jaderberg (Jaderberg et al. 2016) | 0.76 | 0.56 | 0.68 | –/–/0.76 | –
FCRNall+filts (Gupta, Vedaldi, and Zisserman 2016) | 0.84 | 0.53 | 0.76 | –/–/0.85 | –
Deep2Text II+* | – | – | – | 0.85/0.83/0.80 | 0.82/0.79/0.77
SRC-B-TextProcessingLab* | – | – | – | 0.90/0.88/0.81 | 0.87/0.85/0.80
Adelaide ConvLSTMs* | – | – | – | 0.91/0.90/0.83 | 0.87/0.86/0.80
TextBoxes | 0.87 | 0.64 | 0.84 | 0.94/0.92/0.87 | 0.91/0.89/0.84

TextBoxes also ranks first in terms of testing speed, even in the multi-scale version, which takes only 0.73s per image. Meanwhile, a fast implementation of TextBoxes takes merely 0.09s per image, without much loss in accuracy.

To further verify the effectiveness of TextBoxes, we also report the results of SSD (Liu et al. 2016) for comparison in Table 1; SSD is the most relevant and the state-of-the-art detector for general objects. Here, SSD is trained using the same procedure as TextBoxes. SSD achieves competitive performance, but still falls short of the other state-of-the-art methods. In particular, we observe that SSD cannot achieve good results when detecting words with large aspect ratios, while TextBoxes performs much better, benefiting from the proposed text-box layers, which are designed to overcome the length variation of words.

Word spotting and end-to-end recognition

The performance of word spotting is evaluated on detection results that are refined by recognition, while the evaluation of end-to-end performance concerns both detection and recognition results. We test TextBoxes on ICDAR 2011, SVT, and ICDAR 2013.

As shown in Table 2, our method outperforms all the existing methods, including the most recent competition results published on the website. On ICDAR 2011 and ICDAR 2013, our method outperforms the second best method by at least 2 percent under all the evaluation protocols listed in Table 2. The performance gap on SVT is even larger: TextBoxes outperforms the leading method (Gupta, Vedaldi, and Zisserman 2016) by over 8 percent on both SVT and SVT-50. The reason is likely that TextBoxes is more robust to the low-resolution images in SVT, since TextBoxes is trained on relatively low-resolution images.

Coupled with a recognition model, TextBoxes achieves state-of-the-art performance on end-to-end recognition benchmarks.

Figure 4: Examples of word spotting results. Yellow words are recognition results. Words with fewer than 3 letters are ignored, following the evaluation protocol. The box colors have the same meaning as in Fig. 3.

On ICDAR 2013, TextBoxes breaks the records recently set by Adelaide ConvLSTMs* under all the lexicon settings. More specifically, TextBoxes generates about 35 proposals per image when using multi-scale inputs on ICDAR 2013, with a recall of 0.93. With a strong lexicon for the recognition model, 3.8 bounding boxes per image are retained on average, achieving a recall of 0.91 and a precision of 0.97. We employ the 90k-word lexicon for SVT and ICDAR 2011, and a 50-word per-image lexicon on SVT-50. Note that even though Jaderberg (Jaderberg et al. 2016) and FCRNall+filts (Gupta, Vedaldi, and Zisserman 2016) adopt a much smaller lexicon (50k words), their results are still inferior to our method.

Running speed

Most existing methods detect text in a multi-step manner, making them hard to run efficiently. Most of the computation of TextBoxes is spent on the convolutional forward passes, which are very fast when running on GPU devices. TextBoxes takes only 0.09s per image with 700*700 single-scale inputs, resulting in an f-measure of 0.80 on ICDAR 2013, which is still very competitive. When running on 5 input scales, TextBoxes achieves an f-measure of 0.85 on ICDAR 2013, taking 0.73 second per image with the batch size set to 1. We remove the 1600*1600 scale when testing on SVT, since the SVT image resolutions are relatively low; testing on the remaining scales takes merely 0.24 second per image.

The speed comparisons are listed in Table 1. (Jaderberg et al. 2016) adopts two proposal generation methods, a random forest classifier, and a CNN regression model. They each take 1-3 seconds, about 7s in total. (Gupta, Vedaldi, and Zisserman 2016) proposes a YOLO-like model called FCRN, followed by the same random forest classifiers and a CNN regression model. It takes 1.27s excluding the regression step, whose running time is not reported. TextBoxes achieves the highest detection accuracy while being the fastest among them.

Weaknesses

TextBoxes performs well in most situations. However, it still fails to handle some difficult cases, such as overexposure and large character spacing. Some failure cases are shown in Fig. 3 and Fig. 4.

Conclusion

We have presented TextBoxes, an end-to-end fully convolutional network for text detection, which is highly stable and efficient at generating word proposals against cluttered backgrounds. Comprehensive evaluations and comparisons on benchmark datasets clearly validate the advantages of TextBoxes in three related tasks: text detection, word spotting and end-to-end recognition. In the future, we are interested in extending TextBoxes to multi-oriented text, and in combining the detection and recognition networks into one unified framework.

Acknowledgements

This work was partly supported by the National Natural Science Foundation of China (61222308, 61573160, 61572207 and 61503145), and the Open Project Program of the State Key Laboratory of Digital Publishing Technology (F2016001).

References

Alsharif, O., and Pineau, J. 2013. End-to-end text recognition with hybrid HMM maxout models. CoRR abs/1310.1811.

Girshick, R.; Donahue, J.; Darrell, T.; and Malik, J. 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proc. CVPR.
Girshick, R. B. 2015. Fast R-CNN. In Proc. ICCV.
Gomez-Bigorda, L., and Karatzas, D. 2016. TextProposals: a text-specific selective search algorithm for word spotting in the wild. CoRR abs/1604.02619.
Graves, A.; Fernández, S.; Gomez, F.; and Schmidhuber, J. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proc. ICML, 369–376.
Gupta, A.; Vedaldi, A.; and Zisserman, A. 2016. Synthetic data for text localisation in natural images. In Proc. CVPR.
Huang, W.; Qiao, Y.; and Tang, X. 2014. Robust scene text detection with convolution neural network induced MSER trees. In Proc. ECCV.
Jaderberg, M.; Simonyan, K.; Vedaldi, A.; and Zisserman, A. 2016. Reading text in the wild with convolutional neural networks. IJCV 116(1):1–20.
Karatzas, D.; Shafait, F.; Uchida, S.; Iwamura, M.; i Bigorda, L. G.; Mestre, S. R.; Mas, J.; Mota, D. F.; Almazan, J. A.; and de las Heras, L. P. 2013. ICDAR 2013 robust reading competition. In Proc. ICDAR, 1484–1493.
LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11):2278–2324.
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; and Reed, S. E. 2016. SSD: Single shot multibox detector. In Proc. ECCV.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proc. CVPR.
Neumann, L., and Matas, J. 2012. Real-time scene text localization and recognition. In Proc. CVPR, 3538–3545.
Pan, Y.-F.; Hou, X.; and Liu, C.-L. 2011. A hybrid approach to detect and localize texts in natural scene images. IEEE T. Image Proc. 20(3):800–813.
Redmon, J.; Divvala, S. K.; Girshick, R. B.; and Farhadi, A. 2016. You only look once: Unified, real-time object detection. In Proc. CVPR.
Ren, S.; He, K.; Girshick, R.; and Sun, J. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Proc. NIPS.
Shahab, A.; Shafait, F.; and Dengel, A. 2011. ICDAR 2011 robust reading competition challenge 2: Reading text in scene images. In Proc. ICDAR, 1491–1496.
Shi, B.; Bai, X.; and Yao, C. 2015. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR abs/1507.05717.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; and Rabinovich, A. 2015. Going deeper with convolutions. In Proc. CVPR.
Tian, S.; Pan, Y.; Huang, C.; Lu, S.; Yu, K.; and Lim Tan, C. 2015. Text flow: A unified text detection system in natural scene images. In Proc. ICCV.
Wang, K., and Belongie, S. 2010. Word spotting in the wild. In Proc. ECCV, 591–604.
Wolf, C., and Jolion, J. 2006. Object count/area graphs for the evaluation of object detection and segmentation algorithms. IJDAR 8(4):280–296.
Yao, C.; Bai, X.; Liu, W.; Ma, Y.; and Tu, Z. 2012. Detecting texts of arbitrary orientations in natural images. In Proc. CVPR, 1083–1090.
Zamberletti, A.; Noce, L.; and Gallo, I. 2014. Text localization based on fast feature pyramids and multi-resolution maximally stable extremal regions. In Proc. ACCV, 91–105.
Zhang, Z.; Shen, W.; Yao, C.; and Bai, X. 2015. Symmetry-based text line detection in natural scenes. In Proc. CVPR, 2558–2567.
Zhang, Z.; Zhang, C.; Shen, W.; Yao, C.; Liu, W.; and Bai, X. 2016. Multi-oriented text detection with fully convolutional networks. In Proc. CVPR.
Zhong, Z.; Jin, L.; Zhang, S.; and Feng, Z. 2016. DeepText: A unified framework for text proposal generation and text detection in natural images. CoRR abs/1605.07314.

