
A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning

Pengfei Wang∗ (School of Artificial Intelligence, Xidian University, pfwang@stu.xidian.edu.cn)
Chengquan Zhang∗ (Department of Computer Vision Technology (VIS), Baidu Inc., zhangchengquan@baidu.com)
Fei Qi∗† (School of Artificial Intelligence, Xidian University, fred.qi@ieee.org)
Zuming Huang (Department of Computer Vision Technology (VIS), Baidu Inc., huangzuming@baidu.com)
Mengyi En (Department of Computer Vision Technology (VIS), Baidu Inc., enmengyi@baidu.com)
Junyu Han (Department of Computer Vision Technology (VIS), Baidu Inc., hanjunyu@baidu.com)
Jingtuo Liu (Department of Computer Vision Technology (VIS), Baidu Inc., liujingtuo@baidu.com)
Errui Ding (Department of Computer Vision Technology (VIS), Baidu Inc., dingerrui@baidu.com)
Guangming Shi (School of Artificial Intelligence, Xidian University, gmshi@xidian.edu.cn)

arXiv:1908.05498v1 [cs.CV] 15 Aug 2019

ABSTRACT

Detecting scene text of arbitrary shapes has been a challenging task over the past years. In this paper, we propose a novel segmentation-based text detector, namely SAST, which employs a context attended multi-task learning framework based on a Fully Convolutional Network (FCN) to learn various geometric properties for the reconstruction of the polygonal representation of text regions. Taking the sequential characteristics of text into consideration, a Context Attention Block is introduced to capture long-range dependencies of pixel information and obtain a more reliable segmentation. In post-processing, a Point-to-Quad assignment method is proposed to cluster pixels into text instances by integrating both high-level object knowledge and low-level pixel information in a single shot. Moreover, the polygonal representation of arbitrarily-shaped text can be extracted with the proposed geometric properties much more effectively. Experiments on several benchmarks, including ICDAR2015, ICDAR2017-MLT, SCUT-CTW1500, and Total-Text, demonstrate that SAST achieves better or comparable performance in terms of accuracy. Furthermore, the proposed algorithm runs at 27.63 FPS on SCUT-CTW1500 with a Hmean of 81.0% on a single NVIDIA Titan Xp graphics card, surpassing most of the existing segmentation-based methods.

KEYWORDS

FCN; Arbitrarily-shaped Text Detection; Real-time Segmentation

ACM Reference Format:
Pengfei Wang, Chengquan Zhang, Fei Qi, Zuming Huang, Mengyi En, Junyu Han, Jingtuo Liu, Errui Ding, and Guangming Shi. 2019. A Single-Shot Arbitrarily-Shaped Text Detector based on Context Attended Multi-Task Learning. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3343031.3350988

∗ Equal contribution. This work was done while Pengfei Wang was an intern at Baidu Inc.
† Corresponding author.

1 INTRODUCTION

Recently, scene text reading has attracted extensive attention in both academia and industry for its numerous applications, such as scene understanding, image and video retrieval, and robot navigation. As the prerequisite of textual information extraction and understanding, text detection is of great importance. Thanks to the surge of deep neural networks, various convolutional neural network (CNN) based methods have been proposed to detect scene text, continuously refreshing the performance records on standard benchmarks [1, 15, 28, 42]. However, text detection in the wild is still a challenging task due to the significant variations in size, aspect ratio, orientation, language, and shape, and even the complex background. In this paper, we seek an effective and efficient detector for text of arbitrary shapes.

To detect arbitrarily-shaped text, especially text in curved form, some segmentation-based approaches [23, 35, 37, 44] formulate text detection as a semantic segmentation problem. They employ a fully convolutional network (FCN) [27] to predict text regions, and apply several post-processing steps, such as connected component analysis, to extract the final geometric representation of scene text. Due to the lack of global context information, there are two common challenges for segmentation-based text detectors, as demonstrated in Fig. 1: (1) text instances lying close to each other are difficult to separate via semantic segmentation; (2) long text instances tend to fragment easily, especially when character spacing is wide or the background is complex, such as under the effect of strong illumination. In addition, most segmentation-based detectors have to output large-resolution predictions to precisely describe text contours, and thus suffer from time-consuming and redundant post-processing steps.
Some instance segmentation methods [4, 7, 45] attempt to embed high-level object knowledge or non-local information into the network to alleviate similar problems. Among them, Mask R-CNN [7], a proposal-based segmentation method that cascades a detection task (i.e., RPN [29]) and a segmentation task via RoIAlign [7], has outperformed proposal-free methods by a large margin. Recently, similar ideas [14, 24, 39] have been introduced to tackle the detection of text of arbitrary shapes. However, they all face a common challenge: the runtime grows considerably as the number of valid text proposals increases, due to the large number of overlapping computations in segmentation, especially when valid proposals are dense. In contrast, our approach is based on a single-shot view and an efficient multi-task mechanism.

Figure 1: Two common challenges for segmentation-based text detectors: (a) adjacent text instances are difficult to separate; (b) long text may break into fragments. The first row is the response for text regions from the segmentation branch. In the second row, cyan contours are the detection results from EAST [47], while red contours are from SAST.

Inspired by recent works [16, 22, 33] in general semantic instance segmentation, we aim to design a segmentation-based Single-shot Arbitrarily-Shaped Text detector (SAST), which integrates both high-level object knowledge and low-level pixel information in a single shot and detects scene text of arbitrary shapes with high accuracy and efficiency. Employing an FCN [27] model, various geometric properties of text regions, including the text center line (TCL), text border offset (TBO), text center offset (TCO), and text vertex offset (TVO), are learned simultaneously under a multi-task learning formulation. In addition to skip connections, a Context Attention Block (CAB) is introduced into the architecture to aggregate contextual information for feature augmentation. To address the problems illustrated in Fig. 1, we propose a point-to-quad method for text instance segmentation, which assigns labels to pixels by combining high-level object knowledge from the TVO and TCO maps. After clustering the TCL map into text instances, more precise polygonal representations of arbitrarily-shaped text are then reconstructed based on the TBO maps.

Experiments on public datasets demonstrate that the proposed method achieves better or comparable performance in terms of both accuracy and efficiency. The contributions of this paper are three-fold:

• We propose a single-shot text detector based on multi-task learning for text of arbitrary shapes, including multi-oriented, multilingual, and curved scene text, which is efficient enough for some real-time applications.
• The Context Attention Block aggregates contextual information to augment the feature representation without much extra computational cost.
• The point-to-quad assignment is robust and effective in separating text instances and alleviating the problem of fragments, and outperforms connected component analysis.

2 RELATED WORK

In this section, we review some representative segmentation-based and detection-based text detectors, as well as some recent progress in general semantic segmentation. A comprehensive review of recent scene text detectors can be found in [40, 49].

Segmentation-based Text Detectors. The greatest benefit of segmentation-based methods is the ability to detect both straight text and curved text in a unified manner. With an FCN [27], segmentation-based text detectors first classify text at the pixel level, followed by several post-processing steps to extract the final geometric representation of scene text, so the performance of this kind of detector is strongly affected by the robustness of the segmentation results. In PixelLink [2], positive pixels are joined into text instances by predicted positive links, and bounding boxes are extracted from the segmentation result directly. TextSnake [23] proposes a novel representation for arbitrarily-shaped text, treating a text instance as a sequence of overlapping disks lying along the text center line to describe the geometric properties of text instances of irregular shapes. PSENet [35] shrinks the original text instance into various scales, and gradually expands the kernels to the complete shapes of text instances through a progressive scale expansion algorithm. The main challenge of FCN-based methods is separating text instances that lie close to each other. Moreover, the runtime of the approaches above depends heavily on the employed post-processing step, which often involves several pipelines and tends to be rather slow.

Detection-based Text Detectors. Regarding scene text as a special type of object, several methods [10, 17, 18, 26, 43, 47] are based on Faster R-CNN [29], SSD [20], and DenseBox [12], and generate text bounding boxes by regressing box coordinates directly. TextBoxes [17] and RRD [18] adopt SSD as the base detector and adjust the anchor ratios and convolution kernel sizes to handle the variation of aspect ratios of text instances. He et al. [10] and EAST [47] perform direct regression to determine the vertex coordinates of quadrilateral text boundaries in a per-pixel manner without anchors or proposals, and conduct Non-Maximum Suppression (NMS) to obtain the final detection results. RRPN [26] generates inclined proposals with text orientation angle information and proposes a Rotation Region-of-Interest (RRoI) pooling layer to detect arbitrary-oriented text. Limited by the receptive field of CNNs and the relatively simple representations, such as rectangular bounding boxes or quadrangles, adopted to describe text, detection-based methods may fall short when dealing with more challenging text instances, such as extremely long text and arbitrarily-shaped text.
General Instance Segmentation. Instance segmentation is a challenging task, which involves both segmentation and classification. The most recent and successful two-stage representative is Mask R-CNN [7], which achieves amazing results on public benchmarks, but requires relatively long execution time due to the per-proposal computation and its deep stem network. Other frameworks rely mostly on pixel features generated by a single FCN forward pass, and employ post-processing such as graphical models, template matching, or pixel embedding to cluster pixels belonging to the same instance. More specifically, Non-local Networks [36] utilize a self-attention [34] mechanism to enable a pixel feature to perceive features from all other positions, while CCNet [13] harvests the contextual information from all pixels more efficiently by stacking two criss-cross attention modules, which considerably augments the feature representation. In the post-processing step, Liu et al. [22] present a pixel affinity scheme and cluster pixels into instances with a simple yet effective graph merge algorithm. InstanceCut [16] and the work of [41] intentionally predict object boundaries to facilitate the separation of object instances.

Our method, SAST, employs an FCN-based framework to predict the TCL, TCO, TVO, and TBO maps in parallel. With the efficient CAB and point-to-quad assignment, where high-level object knowledge is combined with low-level pixel information, SAST can detect text of arbitrary shapes with high accuracy and efficiency.

3 METHODOLOGY

In this section, we describe our SAST framework for detecting scene text of arbitrary shapes in detail.

3.1 Arbitrary Shape Representation

Bounding boxes, rotated rectangles, and quadrilaterals are used as classical representations in most detection-based text detectors, but they fail to precisely describe text instances of arbitrary shapes, as shown in Fig. 1 (b). Segmentation-based methods formulate the detection of arbitrarily-shaped text as a binary segmentation problem. Most of them directly extract the contours of the instance mask as the representation of text, which is easily affected by the completeness and consistency of the segmentation. PSENet [35] and TextSnake [23] attempted to progressively reconstruct the polygonal representation of detected text based on a shrunk text region, for which the post-processing is complex and tends to be slow. Inspired by those efforts, we aim to design an effective method for arbitrarily-shaped text representation.

In this paper, we extract the center line of the text region (the TCL map) and reconstruct the precise shape representation of text instances with a regressed geometric property, i.e., TBO, which indicates the offset between each pixel in the TCL map and a corresponding point pair on the upper and lower edges of its text region. More specifically, as depicted in Fig. 2, the representation strategy consists of two steps: text center point sampling and border point extraction. First, we sample n points at equidistant intervals from left to right on the center line region of a text instance. With a further operation, we can then determine the corresponding border point pairs for the sampled center line points using the information provided by the TBO map at the same locations. By linking all the border points clockwise, we obtain a complete text polygon representation. Instead of setting n to a fixed number, we assign it adaptively as the ratio of the center line length to the average length of the border offset pairs. Several experiments on curved text datasets prove that our method is efficient and flexible for arbitrarily-shaped text instances.

Figure 2: Arbitrary Shape Representation: (a) The text line in the TCL map; (b) Sample an adaptive number of points on the line; (c) Calculate the corresponding border point pairs with the TBO map; (d) Link all the border points as the final representation.
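To make the reconstruction step concrete, below is a minimal sketch of how a polygon might be rebuilt from sampled center-line points and their TBO offsets. The TBO channel order (dx_upper, dy_upper, dx_lower, dy_lower) and the helper names are assumptions for illustration, not the authors' released implementation.

```python
import numpy as np

def rebuild_polygon(center_points, tbo_map):
    """Sketch: rebuild a text polygon from sampled TCL points and TBO offsets.

    center_points: (n, 2) array of (x, y) points sampled equidistantly
                   along the text center line, left to right.
    tbo_map:       (H, W, 4) per-pixel offsets, assumed channel order
                   (dx_upper, dy_upper, dx_lower, dy_lower).
    """
    upper, lower = [], []
    for x, y in center_points.astype(int):
        dxu, dyu, dxl, dyl = tbo_map[y, x]
        upper.append((x + dxu, y + dyu))   # point on the upper border
        lower.append((x + dxl, y + dyl))   # point on the lower border
    # Link border points clockwise: upper edge left-to-right,
    # then lower edge right-to-left.
    return np.array(upper + lower[::-1])

def adaptive_num_points(center_line_len, tbo_offsets):
    # n is set adaptively: the ratio of the center-line length to the
    # average length of the border offset pairs (Sec. 3.1).
    avg_offset_len = np.linalg.norm(tbo_offsets.reshape(-1, 2, 2), axis=2).mean()
    return max(2, int(round(center_line_len / max(avg_offset_len, 1e-6))))
```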
3.2 Pipeline

The network architectures of FCN-based text detectors are limited to local receptive fields and short-range contextual information, which makes it hard for them to segment some challenging text instances. Thus, we design a Context Attention Block to integrate long-range pixel dependencies into a more representative feature. As a substitute for connected component analysis, we also propose point-to-quad assignment to cluster the pixels in the TCL map into text instances, where the TCL and TVO maps are used to restore the minimum quadrilateral bounding boxes of text instances as high-level information.

An overview of our framework is depicted in Fig. 3. It consists of three parts: a stem network, multi-task branches, and a post-processing part. The stem network is based on ResNet-50 [8] with FPN [19] and CABs to produce a context-enhanced representation. The TCL, TCO, TVO, and TBO maps are predicted for each text region as a multi-task problem. In post-processing, we segment text instances by point-to-quad assignment. Concretely, similar to EAST [47], the TVO map directly regresses the four vertices of the bounding quadrangle of a text region, and the detection results are considered high-level object knowledge. For each pixel in the TCL map, a corresponding offset vector from the TCO map points to a low-level center to which the pixel belongs. By computing the distance between the low-level centers and the high-level object centers of the detected bounding quadrangles, pixels in the TCL map are grouped into several text instances. In contrast to connected component analysis, this takes high-level object knowledge into account and proves to be more efficient. More details about the mechanism of point-to-quad assignment are discussed in Section 3.4. Finally, we sample an adaptive number of points on the center line of each text instance, calculate the corresponding points on the upper and lower borders with the help of the TBO map, and reconstruct the representation of arbitrarily-shaped scene text.

Figure 3: The pipeline of the proposed method: 1) Extract features from the input image, and learn the TCL, TBO, TCO, and TVO maps as a multi-task problem; 2) Achieve instance segmentation with the Text Instance Segmentation Module, where the mechanism of point-to-quad assignment is illustrated in (c); 3) Restore the polygonal representation of text instances of arbitrary shapes.
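As a minimal illustration of the first post-processing step, restoring candidate quadrangles from the TVO map at positive TCL pixels before NMS, the sketch below assumes a TVO channel order of (dx1, dy1, ..., dx4, dy4) and a simple (M, 9) output layout; neither is taken from the authors' released code.

```python
import numpy as np

def restore_quads(tcl_map, tvo_map, score_thresh=0.5):
    """Sketch: restore candidate bounding quadrangles from the TVO map,
    in the spirit of EAST-style per-pixel vertex regression.

    tcl_map: (H, W) TCL scores in [0, 1].
    tvo_map: (H, W, 8) per-pixel offsets to the four quad vertices,
             assumed channel order (dx1, dy1, ..., dx4, dy4).
    Returns an (M, 9) array: 8 quad coordinates plus the TCL score.
    """
    ys, xs = np.where(tcl_map > score_thresh)
    quads = []
    for x, y in zip(xs, ys):
        offsets = tvo_map[y, x].reshape(4, 2)   # 8 channels -> 4 (dx, dy) pairs
        verts = np.array([x, y], dtype=np.float32) + offsets
        quads.append(np.concatenate([verts.ravel(), [tcl_map[y, x]]]))
    if not quads:
        return np.zeros((0, 9), dtype=np.float32)
    return np.array(quads)   # duplicates are then suppressed by NMS
```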

3.3 Network Architecture

In this paper, we employ ResNet-50 as the backbone network, with the additional fully-connected layers removed. With the different levels of feature maps from the stem network gradually merged three times in the FPN manner, a fused feature map X is produced at 1/4 the size of the input image. We serially stack two CABs behind it to capture rich contextual information. Adding four branches behind the context-enhanced feature map X′′, the TCL and the other geometric maps are predicted in parallel, where we adopt a 1 × 1 convolution layer with the number of output channels set to {1, 2, 8, 4} for the TCL, TCO, TVO, and TBO maps, respectively. It is worth mentioning that all the output channels of the convolution layers in the FPN are set to 128, regardless of whether the kernel size is 1 or 3.
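The four prediction branches described above are simple enough to sketch directly. The following is a minimal PyTorch sketch under the stated configuration ({1, 2, 8, 4} output channels on 128-channel features, sigmoid only on the segmentation branch, per Sec. 4.2); the class name is ours, not from the released code.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Sketch of the four prediction branches in Sec. 3.3: 1x1 convolutions
    on the context-enhanced feature map X'' with {1, 2, 8, 4} output
    channels for TCL, TCO, TVO, and TBO."""
    def __init__(self, in_channels=128):
        super().__init__()
        self.tcl = nn.Conv2d(in_channels, 1, kernel_size=1)  # text/non-text
        self.tco = nn.Conv2d(in_channels, 2, kernel_size=1)  # offset to instance center
        self.tvo = nn.Conv2d(in_channels, 8, kernel_size=1)  # offsets to 4 vertices
        self.tbo = nn.Conv2d(in_channels, 4, kernel_size=1)  # offsets to upper/lower borders

    def forward(self, x):
        # Sigmoid only on the segmentation branch; the regression branches
        # are the raw outputs of the last convolution (Sec. 4.2).
        return (torch.sigmoid(self.tcl(x)), self.tco(x),
                self.tvo(x), self.tbo(x))
```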
Context Attention Block. The segmentation results of an FCN and the subsequent post-processing steps depend mainly on local information. The proposed CAB utilizes a self-attention mechanism [34] to aggregate contextual information and augment the feature representation; its details are illustrated in Fig. 4. To alleviate the huge computational overhead caused by the direct use of self-attention, the CAB only considers the similarity between each location in the feature map and the other locations in the same horizontal row or vertical column. The feature map X is the output of the ResNet-50 backbone, with size N × H × W × C. To collect contextual information horizontally, we apply three convolution layers to X in parallel to get {fθ, fϕ, fg} and reshape them into {N × H} × W × C; we then multiply fϕ by the transpose of fθ to get an attention map of size {N × H} × W × W, which is activated by a Sigmoid function. A horizontally enhanced contextual feature is integrated by multiplying the attention map with fg, and is finally resized back to N × H × W × C. Obtaining the vertical contextual information differs only slightly, in that {fθ, fϕ, fg} are reshaped to {N × W} × H × C at the beginning, as shown by the cyan boxes in Fig. 4. Meanwhile, a short-cut path is used to preserve local features. By concatenating the horizontal contextual map, the vertical contextual map, and the short-cut map, and reducing the channel number of X′ with a 1 × 1 convolutional layer, the CAB aggregates long-range pixel-wise contextual information in both the horizontal and vertical directions. Besides, the convolutional layers denoted by the purple and cyan boxes share weights. By serially connecting two CABs, each pixel can finally capture long-range dependencies from all pixels, as depicted at the bottom of Fig. 4, leading to a more powerful context-enhanced feature map X′′, which also helps alleviate the problems caused by the limited receptive field when dealing with more challenging text instances, such as long text.
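The row/column attention can be sketched compactly. The following is a simplified single-block sketch under stated assumptions (shared 1x1 convolutions for both directions, matching the weight-shared purple/cyan boxes of Fig. 4; Sigmoid-activated attention as in the paper); exact channel dimensions in the real CAB may differ.

```python
import torch
import torch.nn as nn

class AxialContextAttention(nn.Module):
    """Sketch of the CAB's row/column self-attention: each pixel attends
    only to pixels in its own row and its own column."""
    def __init__(self, c):
        super().__init__()
        self.theta = nn.Conv2d(c, c, 1)   # f_theta (keys)
        self.phi = nn.Conv2d(c, c, 1)     # f_phi (queries)
        self.g = nn.Conv2d(c, c, 1)       # f_g (values)

    def _attend(self, x):
        # x: (B, C, H, W); attend along the last (W) axis within each row.
        b, c, h, w = x.shape
        q = self.phi(x).permute(0, 2, 3, 1).reshape(b * h, w, c)
        k = self.theta(x).permute(0, 2, 3, 1).reshape(b * h, w, c)
        v = self.g(x).permute(0, 2, 3, 1).reshape(b * h, w, c)
        attn = torch.sigmoid(q @ k.transpose(1, 2))      # (B*H, W, W)
        out = attn @ v                                   # (B*H, W, C)
        return out.reshape(b, h, w, c).permute(0, 3, 1, 2)

    def forward(self, x):
        horizontal = self._attend(x)
        # Vertical attention reuses the same (weight-shared) convolutions
        # on the transposed map, then transposes back.
        vertical = self._attend(x.transpose(2, 3)).transpose(2, 3)
        return torch.cat([horizontal, vertical, x], dim=1)   # + short-cut
```

In the full block, a 1 × 1 convolution would reduce the concatenated 3C channels back to C (the X′ reduction described above), and two such blocks stacked in series let every pixel reach every other pixel.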
Figure 4: Context Attention Blocks: a single CAB module aggregates pixel-wise contextual information both horizontally and vertically, and long-range dependencies from all pixels can be captured by serially connecting two CABs.

3.4 Text Instance Segmentation

For most proposal-free text detectors of arbitrary shapes, morphological post-processing such as connected component analysis is adopted to achieve text instance segmentation, which does not explicitly incorporate high-level object knowledge and easily fails on complex scene text. In this section, we describe how to generate a text instance segmentation from the TCL, TCO, and TVO maps with high-level object information.
Point-to-Quad Assignment. As depicted in Fig. 3, the first step of text instance segmentation is detecting candidate text quadrangles based on the TCL and TVO maps. Similar to EAST [47], we binarize the TCL map, whose pixel values are in the range [0, 1], with a given threshold, and restore the corresponding quadrangle bounding boxes with the four vertex offsets provided by the TVO map. Of course, NMS is adopted to suppress overlapping candidates. The final quadrangle candidates, shown in Fig. 3 (b), can be considered to encode high-level knowledge. The second and last step of text instance segmentation is clustering the responses of the text region in the binarized TCL map into text instances. As Fig. 3 (c) shows, the TCO map is a pixel-wise prediction of offset vectors pointing to the centers of the bounding boxes to which the pixels in the TCL map should belong. Under the strong assumption that pixels in the TCL map belonging to the same text instance should point to the same object-level center, we cluster the TCL map into several text instances by assigning each response pixel to one of the quadrangle boxes generated in the first step. Moreover, we do not care whether the boxes predicted in the first step fully bound the text regions in the input image: pixels outside of a predicted box will still mostly be assigned to the correct text instances. Integrating high-level object knowledge with low-level pixel information, the proposed post-processing clusters each pixel in the TCL map to its best matching text instance efficiently, and helps not only to separate text instances that are close to each other, but also to alleviate fragments when dealing with extremely long text.
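A compact sketch of this assignment step, under assumed array conventions (not the authors' released code), might look as follows.

```python
import numpy as np

def point_to_quad_assign(tcl_mask, tco_map, quads):
    """Sketch of point-to-quad assignment: each positive TCL pixel predicts
    its instance center via TCO and is assigned to the candidate quad with
    the nearest center.

    tcl_mask: (H, W) boolean map of binarized TCL responses.
    tco_map:  (H, W, 2) per-pixel (dx, dy) offsets to the instance center.
    quads:    (M, 4, 2) candidate quadrangles surviving NMS.
    Returns an (H, W) int map: quad index per pixel, -1 for background.
    """
    labels = np.full(tcl_mask.shape, -1, dtype=np.int32)
    if len(quads) == 0:
        return labels
    quad_centers = quads.mean(axis=1)     # (M, 2) high-level object centers
    ys, xs = np.nonzero(tcl_mask)
    # Low-level centers: pixel position plus its predicted TCO offset.
    pixel_centers = np.stack([xs, ys], axis=1) + tco_map[ys, xs]
    # Distance from every pixel's predicted center to every quad center.
    d = np.linalg.norm(pixel_centers[:, None, :] - quad_centers[None, :, :], axis=2)
    labels[ys, xs] = d.argmin(axis=1)
    return labels
```

Note that a pixel lying outside its quad is still assigned correctly as long as its TCO vector points near the right center, which is what makes the scheme robust to under-sized quad candidates.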
3.5 Label Generation and Training Objectives

In this part, we discuss the generation of the TCL, TCO, TVO, and TBO maps. The TCL map is a shrunk version of the text region, and is a one-channel segmentation map for text/non-text. The other label maps, i.e., TCO, TVO, and TBO, are per-pixel offsets with reference to the pixels in the TCL map. For each text instance, we calculate the center and four vertices of the minimum enclosing quadrangle from its annotation polygon, as depicted in Fig. 5 (c) and (d). The TCO map holds the offset between pixels in the TCL map and the center of the bounding box, while the TVO map holds the offsets between the four vertices of the bounding box and the pixels in the TCL map. Hence, the channel numbers of the TCO and TVO maps are 2 and 8, respectively, because each point pair requires two channels to represent the offsets {Δxi, Δyi}. Meanwhile, TBO determines the upper and lower boundaries of text instances in the TCL map, and is thus a four-channel offset map.

Figure 5: Label Generation: (a) The text center region of a curved text is annotated in red; (b) The generation of the TBO map; (c, d) The four vertices (red stars in c) and the center point (red star in d) of the bounding box, to which the TVO and TCO maps refer.

Here are more details about the generation of a TBO map. Consider a quadrangle text annotation with vertices {V1, V2, V3, V4} in clockwise order, where V1 is the top-left vertex, as shown in Fig. 5 (b). The generation of TBO mainly contains two steps: first, we find a corresponding point pair on the top and bottom boundaries for each point in the TCL map, then we calculate the corresponding offset pair. With the average slope of the upper and lower sides of the quadrangle, the line crossing a point P0 in the TCL map can be determined, and the intersection points {P1, P2} of this line with the left and right edges of the bounding quadrangle are easy to calculate directly by algebraic methods. A pair of corresponding points {P_upper, P_lower} for P0 can then be determined from:

\[ \frac{P_0 - P_1}{P_2 - P_1} = \frac{P_{\text{upper}} - V_1}{V_2 - V_1} = \frac{P_{\text{lower}} - V_4}{V_3 - V_4}. \]

In the second step, the offsets between P0 and {P_upper, P_lower} are easily determined. Polygons with more than four vertices are treated as a series of connected quadrangles, and the TBO of a polygon can be generated gradually from its quadrangles as described above. For non-TCL pixels, the corresponding geometry attributes are set to 0 for convenience.
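The ratio equation above says that P0 splits the segment P1P2 in the same proportion t as P_upper splits the top edge and P_lower splits the bottom edge. A small worked sketch (with a hypothetical unit-square quad for the example) follows.

```python
import numpy as np

def tbo_point_pair(p0, p1, p2, quad):
    """Sketch: given a TCL point p0 and the intersections p1, p2 of its
    line with the left/right quad edges, interpolate the border point
    pair with the shared ratio t = |P0 - P1| / |P2 - P1| (Sec. 3.5)."""
    v1, v2, v3, v4 = [np.asarray(v, dtype=float) for v in quad]  # clockwise, v1 top-left
    p0, p1, p2 = (np.asarray(p, dtype=float) for p in (p0, p1, p2))
    t = np.linalg.norm(p0 - p1) / max(np.linalg.norm(p2 - p1), 1e-6)
    p_upper = v1 + t * (v2 - v1)       # same ratio along the top edge
    p_lower = v4 + t * (v3 - v4)       # same ratio along the bottom edge
    return p_upper - p0, p_lower - p0  # the two offset pairs stored in TBO

# Example: p0 at the middle of a unit-square quad gives offsets
# (0, -0.5) to the top edge and (0, +0.5) to the bottom edge.
offsets = tbo_point_pair((0.5, 0.5), (0.0, 0.5), (1.0, 0.5),
                         [(0, 0), (1, 0), (1, 1), (0, 1)])
```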
At the training stage, the whole network is trained in an end-to-end manner, and the loss of the model is formulated as:

\[ L_{\text{total}} = \lambda_1 L_{\text{tcl}} + \lambda_2 L_{\text{tco}} + \lambda_3 L_{\text{tvo}} + \lambda_4 L_{\text{tbo}}, \]

where L_tcl, L_tco, L_tvo, and L_tbo represent the losses of the TCL, TCO, TVO, and TBO maps; the first is a binary segmentation loss, while the others are regression losses. We train the segmentation branch by minimizing the Dice loss [27], and adopt the Smooth L1 loss [5] for the regression branches. The loss weights λ1, λ2, λ3, and λ4 trade off the four tasks, which are equally important in this work, so we determine the set of values {1.0, 0.5, 0.5, 1.0} by making the four loss gradient norms close during back-propagation.
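A minimal sketch of this objective is given below. Restricting the regression terms to TCL pixels via a mask is our assumption, consistent with the labels being defined only on the TCL map; the paper does not spell out the masking.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, eps=1e-6):
    """Dice loss for the TCL segmentation branch (as cited from [27])."""
    inter = (pred * target).sum()
    return 1.0 - 2.0 * inter / (pred.sum() + target.sum() + eps)

def sast_loss(preds, labels, mask, weights=(1.0, 0.5, 0.5, 1.0)):
    """Sketch of the total loss in Sec. 3.5 with the paper's weights."""
    tcl_p, tco_p, tvo_p, tbo_p = preds
    tcl_t, tco_t, tvo_t, tbo_t = labels
    l_tcl = dice_loss(tcl_p, tcl_t)
    m = mask.unsqueeze(1)   # (B, 1, H, W) TCL-region mask for regression terms
    l_tco = F.smooth_l1_loss(tco_p * m, tco_t * m)
    l_tvo = F.smooth_l1_loss(tvo_p * m, tvo_t * m)
    l_tbo = F.smooth_l1_loss(tbo_p * m, tbo_t * m)
    w1, w2, w3, w4 = weights
    return w1 * l_tcl + w2 * l_tco + w3 * l_tvo + w4 * l_tbo
```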
4 EXPERIMENTS

To compare the effectiveness of SAST with existing methods, we perform thorough experiments on four public text detection datasets, i.e., ICDAR 2015, ICDAR2017-MLT, SCUT-CTW1500, and Total-Text.

4.1 Datasets

The datasets used for the experiments in this paper are briefly introduced below.

SynthText. The SynthText dataset [6] is composed of 800,000 natural images, on which text in random colors, fonts, scales, and orientations is rendered carefully to have a realistic look. We use the dataset with word-level labels to pre-train our model.

ICDAR 2015. The ICDAR 2015 dataset [15] was collected for the ICDAR 2015 Robust Reading Competition, with 1,000 natural images for training and 500 for testing. The images were acquired using Google Glass, and the text appears accidentally in the scene. All the text instances are annotated with word-level quadrangles.

ICDAR2017-MLT. ICDAR2017-MLT [28] is a large-scale multi-lingual text dataset, which includes 7,200 training images, 1,800 validation images, and 9,000 test images. The dataset covers multi-oriented and multi-lingual aspects of scene text. The text regions in ICDAR2017-MLT are also annotated with quadrangles.

SCUT-CTW1500. SCUT-CTW1500 [42] is a challenging dataset for curved text detection. It consists of 1,000 training images and 500 test images, and the text instances are largely in English and Chinese. Different from traditional datasets, the text instances in SCUT-CTW1500 are labelled with polygons of 14 vertices.

Total-Text. Total-Text [1] is another curved text benchmark, which consists of 1,255 training images and 300 testing images with three different text orientations: horizontal, multi-oriented, and curved. The annotations are labelled at the word level.

Evaluation Metrics. The performance on ICDAR2015, Total-Text, SCUT-CTW1500, and ICDAR2017-MLT is evaluated using the protocols provided in [15], [1], [42], and [28], respectively.

4.2 Implementation Details

Training. ResNet-50 is used as the network backbone with weights pre-trained on ImageNet [3]. The skip connections are in FPN fashion with the output channel numbers of the convolutional layers set to 128, and the final output is at 1/4 the size of the input images. All upsampling operators are bilinear interpolation; the classification branch is activated with a sigmoid, while the regression branches, i.e., the TCO, TVO, and TBO maps, are the direct outputs of the last convolution layer. The training process is divided into two steps, i.e., warming-up and fine-tuning. In the warming-up step, we apply the Adam optimizer to train our model on the SynthText dataset with learning rate 1e-4 and a learning rate decay factor of 0.94. In the fine-tuning step, the learning rate is re-initialized to 1e-4 and the model is tuned on ICDAR 2015, ICDAR2017-MLT, SCUT-CTW1500, and Total-Text.

All the experiments are performed on a workstation with the following configuration. CPU: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz x16; GPU: NVIDIA TITAN Xp x4; RAM: 64GB. During training, we set the batch size to 8 per GPU in parallel.

Data Augmentation. We randomly crop the text image regions, then resize and pad them to 512 x 512. Especially for datasets labelled with curved polygons, we crop images without crossing text instances, to avoid destroying the polygon annotations. The cropped image regions are rotated randomly in 4 directions (0, 90, 180, and 270 degrees) and standardized by subtracting the RGB mean value of the ImageNet dataset. Text regions marked as "DO NOT CARE", or whose minimum edge length is less than 8 pixels, are ignored during training.

Testing. In the inference phase, unless otherwise stated, we set the longer side to 1536 for single-scale testing, and to 512, 768, 1536, and 2048 for multi-scale testing, while keeping the aspect ratio unchanged. A specified valid range is assigned to each testing scale, and detections from different scales are combined using NMS, which is inspired by SNIP [31].
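The SNIP-inspired fusion can be sketched as follows. The paper specifies the testing scales but not the exact valid ranges, so the ranges, the `resize` helper, and `nms_polygons` below are hypothetical placeholders used only to show the scheme.

```python
import numpy as np

def multi_scale_detect(image, detect_fn, scales=(512, 768, 1536, 2048),
                       valid_ranges=((0, 128), (64, 256), (128, 1024), (512, 1e9))):
    """Sketch of SNIP-inspired multi-scale testing: each scale keeps only
    detections whose size falls in its assigned valid range (illustrative
    values here), and the union is deduplicated by NMS."""
    all_polys = []
    h, w = image.shape[:2]
    for scale, (lo, hi) in zip(scales, valid_ranges):
        ratio = scale / max(h, w)            # resize the longer side to `scale`
        resized = resize(image, ratio)       # hypothetical resize helper
        for poly in detect_fn(resized):
            poly = np.asarray(poly) / ratio  # map back to original coordinates
            size = np.ptp(poly[:, 1])        # proxy for text size: polygon height
            if lo <= size < hi:
                all_polys.append(poly)
    return nms_polygons(all_polys)           # hypothetical polygon NMS
```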
4.3 Ablation Study

We conduct several ablation experiments to analyze SAST. The details are discussed as follows.

The Effectiveness of TBO, TCO, and TVO. To verify the efficiency of the Text Instance Segmentation Module (the TVO and TCO maps) and the Arbitrary Shape Representation Module (the TBO map) in SAST, we conduct experiments with the following configurations. 1) TCL + CC + Expand: a naive approach that predicts the center region of text, uses connected component analysis to achieve text instance segmentation, and expands the contours of the connected components by a shrinking rate as the final text geometric representation. 2) TCL + CC + TBO: instead of expanding the contours directly, we reconstruct the precise polygon of each text instance with the Arbitrary Shape Representation Module. 3) TCL + TVO + TCO + TBO: as a substitute for connected component analysis, we use the point-to-quad assignment of the Text Instance Segmentation Module, which incorporates high-level object knowledge with low-level information and assigns each pixel on the TCL map to its best matching text instance. The efficiency of the proposed method is demonstrated on SCUT-CTW1500, as shown in Tab. 1. It surpasses the first two configurations by 21.75% and 1.46% in Hmean, respectively. Meanwhile, the proposed point-to-quad assignment costs almost the same time as connected component analysis.

The Trade-off between Speed and Accuracy. There is a trade-off between speed and accuracy: mainstream segmentation methods maintain a high-resolution output, usually the same size as the input image, to achieve better results at a correspondingly high cost in time. Several experiments are conducted on the SCUT-CTW1500 benchmark. We compare the performance with different output resolutions, i.e., {1, 1/2, 1/4, 1/8} of the input size, and find a rational trade-off between speed and accuracy at the 1/4 scale of the input images. The detailed configurations and results are shown in Tab. 2. Note that the feature extractor in these experiments is not equipped with Context Attention Blocks.

The Effectiveness of Context Attention Blocks. We introduce CABs into the network architecture to capture long-range dependencies of pixel information. We conduct two experiments on SCUT-CTW1500, replacing the CABs with several stacked convolutional layers as the baseline, which has almost the same number of trainable variables. The input size is 512 x 512, and the output is at 1/4 of the input size. Tab. 3 shows the performance and speed of both experiments. The model with CABs achieves 80.97% in Hmean at a speed of 27.63 FPS, exceeding the baseline by 0.85% in Hmean at a slightly lower frame rate.
Table 1: Ablation study for the effectiveness of TBO, TCO, and TVO in the proposed method.

Method | Recall | Precision | Hmean | T (ms)
TCL + CC + expand | 55.51 | 61.55 | 58.37 | –
TCL + CC + TBO | 74.65 | 83.13 | 78.66 | 5.84
TCL + TVO + TCO + TBO | 76.69 | 83.89 | 80.12 | 6.26

Table 2: Ablation study for the trade-off between speed and accuracy on SCUT-CTW1500.

Method | Recall | Precision | Hmean | FPS
1s | 76.34 | 86.23 | 80.98 | 10.54
2s | 70.86 | 86.30 | 77.82 | 21.68
4s | 76.69 | 83.86 | 80.12 | 30.21
8s | 71.54 | 80.67 | 75.83 | 38.05

Table 3: Ablation study for the effectiveness of CABs on SCUT-CTW1500.

Method | Recall | Precision | Hmean | FPS
baseline | 76.69 | 83.86 | 80.12 | 30.21
with CABs | 77.05 | 85.31 | 80.97 | 27.63

Table 4: Evaluation on SCUT-CTW1500 for detecting text lines of arbitrary shapes.

Method | Recall | Precision | Hmean | FPS
CTPN [32] | 53.80 | 60.40 | 56.90 | –
EAST [47] | 49.10 | 78.70 | 60.40 | –
DMPNet [21] | 56.00 | 69.90 | 62.20 | –
CTD [42] | 65.20 | 74.30 | 69.50 | 15.20
CTD + TLOC [42] | 69.80 | 74.30 | 73.40 | 13.30
SLPR [48] | 70.10 | 80.10 | 74.80 | –
TextSnake [23] | 85.30 | 67.90 | 75.60 | 12.07
PSENet-4s [35] | 78.13 | 85.49 | 79.29 | 8.40
PSENet-2s [35] | 79.30 | 81.95 | 80.60 | –
PSENet-1s [35] | 79.89 | 82.50 | 81.17 | 3.90
TextField [38] | 79.80 | 83.00 | 81.40 | –
SAST | 77.05 | 85.31 | 80.97 | 27.63
SAST MS | 81.71 | 81.19 | 81.45 | –

Table 5: Evaluation on Total-Text for detecting text lines of arbitrary shapes.

Method | Recall | Precision | Hmean
SegLink [30] | 23.80 | 30.30 | 26.70
DeconvNet [1] | 33.00 | 40.00 | 36.00
EAST [47] | 36.20 | 50.00 | 42.00
TextSnake [23] | 74.50 | 82.70 | 78.40
TextField [38] | 79.90 | 81.20 | 80.60
SAST | 76.86 | 83.77 | 80.17
SAST MS | 75.49 | 85.57 | 80.21

4.4 Evaluation on Curved Text Benchmark

On SCUT-CTW1500 and Total-Text, we evaluate the performance of SAST for detecting text lines of arbitrary shapes. We fine-tune our model for about 10 epochs on the SCUT-CTW1500 and Total-Text training sets, respectively. In the testing phase, the number of vertices of the text polygons is counted adaptively, and we set the scale of the longer side to 512 for single-scale testing on both datasets.

The quantitative results are shown in Tab. 4 and Tab. 5. With the help of the efficient post-processing, SAST achieves 80.97% and 80.17% in Hmean on SCUT-CTW1500 and Total-Text, respectively, which is comparable to the state-of-the-art methods. In addition, multi-scale testing further improves Hmean to 81.45% and 80.21% on SCUT-CTW1500 and Total-Text. Visualizations of curved text detection are shown in Fig. 6 (a) and (b). As can be seen, the proposed text detector SAST handles curved text lines well.

4.5 Evaluation on ICDAR 2015

To verify the validity of SAST for detecting oriented text, we compare it with state-of-the-art methods on the ICDAR 2015 dataset, a standard oriented text dataset. Compared with previous arbitrarily-shaped text detectors [35, 38, 39], which detect text at the same size as the input image, SAST achieves better performance at a much faster speed. All the results are listed in Tab. 6. Specifically, for single-scale testing, SAST achieves 86.91% in Hmean, surpassing most competitors (pure detection methods without the assistance of a recognition task). Moreover, multi-scale testing increases Hmean by about 0.53%. Some detection results are shown in Fig. 6 (c), and indicate that SAST is also capable of detecting multi-oriented text accurately.

4.6 Evaluation on ICDAR2017-MLT

To demonstrate the generalization ability of SAST on multilingual scene text detection, we evaluate SAST on ICDAR2017-MLT. Similar to the training procedures above, the detector is fine-tuned for about 10 epochs from the SynthText pre-trained model. At single-scale testing, our proposed method achieves a Hmean of 68.76%, which increases to 72.37% with multi-scale testing. The quantitative results are shown in Tab. 7. The visualization of multilingual text detection is illustrated in Fig. 6 (d), which shows the robustness of the proposed method in detecting multilingual scene text.

Figure 6: Some qualitative results of the proposed method. From left to right: ICDAR2015, SCUT-CTW1500, Total-Text, and ICDAR17-MLT. Blue contours: ground truths; cyan contours: quads from the TVO map; red contours: final detection results.

Table 6: Evaluation on ICDAR 2015 for detecting oriented text.

Method | Recall | Precision | Hmean
DMPNet [21] | 68.22 | 73.23 | 70.64
SegLink [30] | 76.50 | 74.74 | 75.61
SSTD [9] | 73.86 | 80.23 | 76.91
WordSup [11] | 77.03 | 79.33 | 78.16
RRPN [26] | 77.13 | 83.52 | 80.20
EAST [47] | 78.33 | 83.27 | 80.72
He et al. [10] | 80.00 | 82.00 | 81.00
TextField [38] | 80.50 | 84.30 | 82.40
TextSnake [23] | 80.40 | 84.90 | 82.60
PixelLink [2] | 82.00 | 85.50 | 83.70
RRD [18] | 80.00 | 88.00 | 83.80
Lyu et al. [25] | 79.70 | 89.50 | 84.30
PSENet-4s [35] | 83.87 | 87.98 | 85.88
IncepText [39] | 84.30 | 89.40 | 86.80
PSENet-1s [35] | 85.51 | 88.71 | 87.08
PSENet-2s [35] | 85.22 | 89.30 | 87.21
SAST | 87.09 | 86.72 | 86.91
SAST MS | 87.34 | 87.55 | 87.44

Table 7: Evaluation on ICDAR2017-MLT for the generalization ability of SAST on multilingual scene text detection.

Method | Recall | Precision | Hmean
Lyu et al. [25] | 56.60 | 83.80 | 66.80
AF-RPN [46] | 66.00 | 75.00 | 70.00
PSENet-4s [35] | 67.56 | 75.98 | 71.52
PSENet-2s [35] | 68.35 | 76.97 | 72.40
Lyu et al. MS [25] | 70.60 | 74.30 | 72.40
PSENet-1s [35] | 68.40 | 77.01 | 72.45
SAST | 67.56 | 70.00 | 68.76
SAST MS | 66.53 | 79.35 | 72.37

4.7 Runtime

In this paper, we make a trade-off between speed and accuracy. The TCL, TVO, TCO, and TBO maps are predicted at 1/4 the size of the input images. With the proposed post-processing step, SAST can detect text of arbitrary shapes at real-time speed with a commonly used GPU. To demonstrate the runtime of the proposed method, we run testing on SCUT-CTW1500 with a workstation equipped with an NVIDIA TITAN Xp. The test image is resized to 512 x 512 and the batch size is set to 1 on a single GPU. Network inference and post-processing take 29.58 ms and 6.61 ms, respectively; the post-processing is written in Python (the NMS part in C++) and can be further optimized. SAST runs at 27.63 FPS with a Hmean of 80.97%, surpassing most of the existing arbitrarily-shaped text detectors in both accuracy and efficiency, as depicted in Tab. 4 (the speeds of different methods are quoted for reference only, as they might have been evaluated in different hardware environments).

5 CONCLUSION AND FUTURE WORK

In this paper, we propose an efficient single-shot arbitrarily-shaped text detector with Context Attention Blocks and a point-to-quad assignment mechanism, which integrates both high-level object knowledge and low-level pixel information to obtain text instances from a context-enhanced segmentation. Several experiments demonstrate that the proposed SAST is effective in detecting arbitrarily-shaped text, and also generalizes robustly to multilingual scene text datasets. Qualitative results show that SAST helps to alleviate some common challenges of segmentation-based text detectors, such as the problem of fragments and the separation of adjacent text instances. Moreover, with a commonly used GPU, SAST runs fast and may be sufficient for some real-time applications, e.g., augmented reality translation. However, it is difficult for SAST to detect some extreme cases, mainly very small text regions. In the future, we are interested in improving the detection of small text and developing an end-to-end text reading system for text of arbitrary shapes.

ACKNOWLEDGMENTS

This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61572387, Grant 61632019, Grant 61836008, and Grant 61672404, and the Foundation for Innovative Research Groups of the NSFC under Grant 61621005.
REFERENCES
[1] Chee Kheng Ch'ng and Chee Seng Chan. 2017. Total-Text: A comprehensive dataset for scene text detection and recognition. In Int. Conf. Doc. Anal. Recognit. (ICDAR), Vol. 1. IEEE, 935–942.
[2] Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. 2018. PixelLink: Detecting scene text via instance segmentation. In Proc. AAAI Conf. Artif. Intell. (AAAI).
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). IEEE, 248–255.
[4] Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, and Kevin P Murphy. 2017. Semantic instance segmentation via deep metric learning. arXiv:1703.10277
[5] R. Girshick. 2015. Fast R-CNN. In IEEE Int. Conf. Comp. Vis. (ICCV). 1440–1448.
[6] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2315–2324.
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2961–2969.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 770–778.
[9] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. 2017. Single shot text detector with regional attention. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 3047–3055.
[10] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. 2017. Deep direct regression for multi-oriented scene text detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 745–753.
[11] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. 2017. WordSup: Exploiting Word Annotations for Character Based Text Detection. In IEEE Int. Conf. Comp. Vis. (ICCV). 4950–4959.
[12] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. 2015. DenseBox: Unifying landmark localization with end to end object detection. arXiv:1509.04874
[13] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2018. CCNet: Criss-cross attention for semantic segmentation. arXiv:1811.11721
[14] Zhida Huang, Zhuoyao Zhong, Lei Sun, and Qiang Huo. 2019. Mask R-CNN with pyramid attention network for scene text detection. In Winter Conf. Appl. Comp. Vis. (WACV). IEEE, 764–772.
[15] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. 2015. ICDAR 2015 competition on robust reading. In Int. Conf. Doc. Anal. Recognit. (ICDAR). IEEE, 1156–1160.
[16] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. 2017. InstanceCut: from edges to instances with multicut. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). IEEE, 7322–7331.
[17] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. 2017. TextBoxes: A fast text detector with a single deep neural network. In Proc. AAAI Conf. Artif. Intell. (AAAI). 4161–4167.
[18] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. 2018. Rotation-sensitive regression for oriented scene text detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 5909–5918.
[19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2117–2125.
[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In Eur. Conf. Comp. Vis. (ECCV). Springer, 21–37.
[21] Yuliang Liu and Lianwen Jin. 2017. Deep matching prior network: Toward tighter multi-oriented text detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 1962–1969.
[22] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. 2018. Affinity derivation and graph merge for instance segmentation. In Eur. Conf. Comp. Vis. (ECCV). 686–703.
[23] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2018. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Eur. Conf. Comp. Vis. (ECCV). 20–36.
[24] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Eur. Conf. Comp. Vis. (ECCV). 67–83.
[25] Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, and Xiang Bai. 2018. Multi-oriented scene text detection via corner localization and region segmentation. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 7553–7563.
[26] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. 2018. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 20, 11 (2018), 3111–3122.
[27] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 4th Int. Conf. 3D Vision (3DV). IEEE, 565–571.
[28] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. 2017. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification (RRC-MLT). In Int. Conf. Doc. Anal. Recognit. (ICDAR), Vol. 1. IEEE, 1454–1459.
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Adv. Neural Inf. Process. Syst. (NIPS). 91–99.
[30] Baoguang Shi, Xiang Bai, and Serge Belongie. 2017. Detecting oriented text in natural images by linking segments. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2550–2558.
[31] Bharat Singh and Larry S Davis. 2018. An analysis of scale invariance in object detection (SNIP). In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 3578–3587.
[32] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. 2016. Detecting text in natural image with connectionist text proposal network. In Eur. Conf. Comp. Vis. (ECCV). Springer, 56–72.
[33] Jonas Uhrig, Eike Rehder, Björn Fröhlich, Uwe Franke, and Thomas Brox. 2018. Box2Pix: Single-shot instance segmentation by assigning pixels to object boxes. In IEEE Intell. Veh. Symp. (IV). IEEE, 292–299.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Adv. Neural Inf. Process. Syst. (NIPS). 5998–6008.
[35] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao. 2019. Shape Robust Text Detection With Progressive Scale Expansion Network. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 9336–9345.
[36] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 7794–7803.
[37] Yue Wu and Prem Natarajan. 2017. Self-organized text detection with minimal post-processing via border learning. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 5000–5009.
[38] Yongchao Xu, Yukang Wang, Wei Zhou, Yongpan Wang, Zhibo Yang, and Xiang Bai. 2019. TextField: Learning A Deep Direction Field for Irregular Scene Text Detection. IEEE Trans. Image Process. (2019). arXiv:1812.01393
[39] Qiangpeng Yang, Mengli Cheng, Wenmeng Zhou, Yan Chen, Minghui Qiu, and Wei Lin. 2018. IncepText: a new inception-text module with deformable PSROI pooling for multi-oriented scene text detection. In Int. Joint Conf. Artif. Intell. (IJCAI). IJCAI, 1071–1077.
[40] Qixiang Ye and David Doermann. 2015. Text detection and recognition in imagery: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 37, 7 (2015), 1480–1500.
[41] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. Learning a discriminative feature network for semantic segmentation. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). IEEE, 1857–1866.
[42] Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. 2017. Detecting curve text in the wild: New dataset and new solution. arXiv:1712.02170
[43] Chengquan Zhang, Borong Liang, Zuming Huang, Mengyi En, Junyu Han, Errui Ding, and Xinghao Ding. 2019. Look More Than Once: An Accurate Detector for Text of Arbitrary Shapes. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR).
[44] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. 2016. Multi-oriented text detection with fully convolutional networks. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 4159–4167.
[45] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 1529–1537.
[46] Zhuoyao Zhong, Lei Sun, and Qiang Huo. 2018. An Anchor-Free Region Proposal Network for Faster R-CNN based Text Detection Approaches. arXiv:1804.09003
[47] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: An efficient and accurate scene text detector. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 5551–5560.
[48] Yixing Zhu and Jun Du. 2018. Sliding line point regression for shape robust scene text detection. In Int. Conf. Pattern Recognit. (ICPR). 3735–3740.
[49] Yingying Zhu, Cong Yao, and Xiang Bai. 2016. Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science 10, 1 (2016), 19–36.
