
SPTS: Single-Point Text Spotting

Dezhi Peng (South China University of Technology), Xinyu Wang (Zhejiang University), Yuliang Liu∗ (Huazhong University of Science and Technology)

Jiaxin Zhang, Mingxin Huang (South China University of Technology), Songxuan Lai, Jing Li, Shenggao Zhu (Huawei Cloud Computing Technologies), Dahua Lin (Chinese University of Hong Kong)

Chunhua Shen (Zhejiang University), Xiang Bai (Huazhong University of Science and Technology), Lianwen Jin∗ (South China University of Technology)

arXiv:2112.07917v6 [cs.CV] 29 Aug 2022

ABSTRACT

Existing scene text spotting (i.e., end-to-end text detection and recognition) methods rely on costly bounding box annotations (e.g., text-line, word-level, or character-level bounding boxes). For the first time, we demonstrate that training scene text spotting models can be achieved with an extremely low-cost annotation of a single point for each instance. We propose an end-to-end scene text spotting method that tackles scene text spotting as a sequence prediction task. Given an image as input, we formulate the desired detection and recognition results as a sequence of discrete tokens and use an auto-regressive Transformer to predict the sequence. The proposed method is simple yet effective, achieving state-of-the-art results on widely used benchmarks. Most significantly, we show that the performance is not very sensitive to the position of the point annotation, meaning that it can be annotated much more easily, or even generated automatically, than a bounding box that requires precise positions. We believe such a pioneering attempt indicates a significant opportunity for scene text spotting applications at a much larger scale than previously possible. The code is available at https://github.com/shannanyinxiang/SPTS.

CCS CONCEPTS
• Computing methodologies → Scene understanding; Computer vision.

KEYWORDS
Scene text spotting, Transformer, Vision Transformer, Single-point representation

ACM Reference Format:
Dezhi Peng, Xinyu Wang, Yuliang Liu, et al. 2022. SPTS: Single-Point Text Spotting. In Proceedings of the 30th ACM International Conference on Multimedia (MM ’22), October 10–14, 2022, Lisboa, Portugal. ACM, New York, NY, USA, 10 pages. https://doi.org/10.1145/3503161.3547942

∗ Corresponding authors. Part of this work was done by Yuliang Liu at the Chinese University of Hong Kong.

1 INTRODUCTION
Over the last decades, modern Optical Character Recognition (OCR) algorithms have become able to read textual content from pictures of complex scenes, an incredible development that has attracted enormous interest from both academia and industry. The limitations of existing methods, particularly their poorer performance on arbitrarily shaped scene text, have been repeatedly identified [6, 19, 23]. This can be seen in the trend of worse predictions for instances with curved shapes, varied fonts, distortions, etc.

The focus of research in the OCR community has moved on from horizontal [20, 35] and multi-oriented text [24, 42, 43] to arbitrarily shaped text [23, 29] in recent years, accompanied by a shift in annotation format from horizontal rectangles, to quadrilaterals, and to polygons. The fact that regular bounding boxes are prone to include noise has been well studied in previous works (see Fig. 1), which have shown that character-level and polygonal annotations can effectively lift model performance [19, 21, 29]. Furthermore, many efforts have been made to develop more sophisticated representations to fit arbitrarily shaped text instances [8, 23, 27, 37, 44] (see Fig. 2). For example, TextDragon [8] utilizes character-level bounding boxes to generate centerlines for enabling the prediction of local geometry attributes, ABCNet [23] converts polygon annotations to Bezier curves for representing curved text instances, and TextSnake [27] describes text instances by a series of ordered disks centered at symmetric axes. However, these novel representations are primarily and carefully designed by experts based on prior knowledge, heavily relying on highly customized network architectures (e.g., specified Region of Interest (RoI) modules) and consuming more expensive annotations (e.g., character-level annotations), limiting their generalization ability for practical applications.

Figure 1: Different annotation styles and their time cost (for all the text instances in the sample image) measured by the LabelMe1 tool: (a) Rectangle (55s), (b) Quadrilateral (96s), (c) Character (581s), (d) Polygon (172s), (e) Single-Point (11s). Green areas are positive samples, while red dashed boxes are noises that may possibly be included. Single-point annotation is more than 50 times faster than character-level annotation.

To reduce the cost of data labeling, some researchers [1, 2, 12, 34] have explored training OCR models with coarse annotations in a weakly-supervised manner. These methods can mainly be separated into two categories, i.e., (1) bootstrapping labels to finer granularity and (2) training with partial annotations. The former usually derives character-level labels from word- or line-level annotations; thus, the models can enjoy the well-understood advantage of character-level supervision without introducing overhead costs. The latter is committed to achieving competitive performance with fewer training samples. However, both kinds of methods still rely on costly bounding box annotations.

One of the underlying problems that prevents replacing the bounding box with a simpler annotation format, such as a single point, is that most text spotters rely on RoI-like sampling strategies to extract the shared backbone features. For example, Mask TextSpotter requires mask prediction inside an RoI [19]; ABCNet [23] proposes BezierAlign, while TextDragon [8] introduces RoISlide to unify the detection and recognition heads. In this paper, inspired by the recent success of the sequence-based object detector Pix2Seq [4], we show that a text spotter can be trained with a single point, also termed the indicated point (see Fig. 1e). Thanks to such a concise form of annotation, labeling time can be significantly saved; e.g., it takes less than one-fiftieth of the time to label single points for the sample image shown in Fig. 1 compared with annotating character-level bounding boxes, which is extremely tortuous, especially for small and vague text instances. Another motivating factor in selecting point annotation is that a clean and efficient OCR pipeline can be developed, discarding the complex post-processing module and sampling strategies; thus, the ambiguity introduced by RoIs (see red dashed regions in Fig. 1) can be alleviated. To the best of our knowledge, this is the first attempt to simplify the bounding box to a single-point supervision signal in the OCR community.

Figure 2: Some recent representations of text instances: (a) TextDragon [8], (b) ABCNet [23], (c) TextSnake [27].

The main contributions of this work are summarized as follows:
• For the first time, we show that text spotters can be supervised by a simple yet effective single-point representation. Such a straightforward annotation paradigm can considerably reduce the labeling costs, making it possible to access large-scale OCR data in the future.
• We propose a new Transformer-based [36] scene text spotter, which forms text spotting as a language modeling task. Given an input image, our method predicts a discrete token sequence that includes both detection and recognition results. Benefiting from such a concise pipeline, the complex post-processing and sampling strategies designed based on prior knowledge can be discarded, showing great potential in terms of flexibility and generality.
• To evaluate the effectiveness of the proposed method, extensive experiments and ablations are conducted on four widely used OCR datasets, i.e., ICDAR 2013 [16], ICDAR 2015 [14], Total-Text [5], and SCUT-CTW1500 [25], involving both horizontal and arbitrarily shaped texts. The results show that the proposed SPTS can achieve state-of-the-art performance compared with existing approaches.

1.1 Related Work
In the past decades, a variety of scene text datasets using different annotation styles have been proposed, focusing on various scenarios, including horizontal text [15, 16] described by rectangles (Fig. 1a), multi-oriented text [14, 30] represented by quadrilaterals (Fig. 1b), and arbitrarily shaped text [5, 6, 25] labeled by polygons (Fig. 1d). These forms of annotations have facilitated the development of corresponding OCR algorithms. For example, earlier works [17] usually adapt generic object detectors to scene text spotting, where feature maps are shared between detection and recognition heads via RoI modules. These approaches follow the sampling mechanism designed for generic object detection and utilize rectangles to express text instances, thus performing worse on non-horizontal targets. Later, some methods [3, 10, 22] replace the rectangular bounding boxes with quadrilaterals by modifying the regular Region Proposal Network (RPN) to generate oriented proposals, enabling better performance for multi-oriented texts. Recently, with the presentation of curved scene text datasets [5, 6, 25], the research interest of the OCR community has shifted to more challenging arbitrarily shaped texts.

1 https://github.com/wkentaro/labelme


Figure 3: Overall framework of the proposed SPTS. The visual and contextual features are first extracted by a series of CNN and Transformer
encoders. Then, the features are auto-regressively decoded into a sequence that contains both localization and recognition information, which
is subsequently translated into point coordinates and text transcriptions. Only a point-level annotation is required for training.

Generally, there are two widely adopted solutions to the arbitrarily shaped text spotting task, i.e., segmentation-based [19, 32, 33, 40] and regression-based methods [8, 23, 38]. The former first predict masks to segment text instances; the features inside text regions are then sampled and grouped for further recognition. For example, Mask TextSpotter v3 [19] proposes a Segmentation Proposal Network (SPN) instead of the regular RPN to decouple neighboring text instances accurately, thus significantly improving the performance. In addition, regression-based methods usually parameterize the text instances as a sequence of coordinates and subsequently learn to predict them. For instance, ABCNet [23] converts polygons into Bezier curves, significantly improving performance on curved scene texts. Wang et al. [38] first localize the boundary points of text instances, then the features rectified by Thin-Plate-Spline are fed into the recognition branch, demonstrating promising accuracy on arbitrarily shaped instances. Moreover, Xing et al. [21] boost the text spotting performance by utilizing character-level annotations, where character bounding boxes, as well as type segmentation maps, are predicted simultaneously, enabling impressive performance. Even though different representations are adopted in the above methods to describe the text instances, they are all actually derived from one of the rectangular, quadrilateral, or polygonal bounding boxes. Such annotations must be carefully labeled by human beings and are thus quite expensive, limiting the scale of training datasets.

In this paper, we propose Single-Point Text Spotting (SPTS), which is, to the best of our knowledge, the first scene text spotter that does not rely on bounding box annotations at all. Specifically, each text instance is represented by a single point (see Fig. 1e) at a meager cost. The fact that this point does not need to be accurately marked further demonstrates the possibility of learning in a weakly supervised manner, considerably lowering the labeling cost.

2 METHODOLOGY
Most existing text spotting algorithms treat the problem as two sub-tasks, i.e., text detection and recognition, albeit the entire network might be end-to-end optimized. Customized modules such as BezierAlign [23], RoISlide [8], and RoIMasking [19] are required to bridge the detection and recognition modules, where backbone features are cropped and shared between detection and recognition heads. Under such designs, the recognition and detection modules are highly coupled. For example, the features fed to the recognition head are usually cropped from the ground-truth bounding box at the training stage since detection results are not good enough in the first iterations; thus, the recognition result is susceptible to interference from the detected bounding box during the test phase.

Recently, Pix2Seq [4] pioneered casting the generic object detection problem as a language modeling task, based on the intuitive assumption that if a deep model knows what and where the target is, it can be taught to tell the results via the desired sequence. Thanks to the concise pipeline, labels with different attributes such as location coordinates and object categories can be integrated into a single sequence, enabling an end-to-end trainable framework without task-specific modules (e.g., Region Proposal Networks and RoI pooling layers), which can thus be adapted to the text spotting task. Inspired by this, we propose Single-Point Text Spotting (SPTS). Unlike Pix2Seq, which is designed for object detection only and still requires bounding boxes for all instances, our SPTS tackles both text detection and recognition as an end-to-end sequence prediction task, using only the single-point location and text annotations. Compared with existing text spotting approaches, SPTS follows a much simpler and more concise pipeline where the input images are translated into a sequence containing location and recognition results, genuinely enabling text detection and recognition simultaneously.

Specifically, as shown in Fig. 3, each input image is first sequentially encoded by a CNN and a Transformer encoder [36] to extract visual and contextual features. Then, the captured features are decoded by a Transformer decoder [36], where tokens are predicted in an auto-regressive manner. Unlike previous algorithms, we further simplify the bounding box to the center of the text instance, the corner point located at the top left of the first character, or a random point within the text instance, as described in Fig. 7. Benefiting from such a simple yet effective representation, the modules carefully designed based on prior knowledge, such as the grouping strategies utilized in segmentation-based methods and the feature sampling blocks equipped in box-based text spotters, can be eschewed. Therefore, the recognition accuracy will not be limited by poor detection results, significantly improving the model robustness.

2.1 Sequence Construction
The fact that a sequence can carry information with multiple attributes naturally enables the text spotting task, where text instances are simultaneously localized and recognized. To express the target text instances by a sequence, it is required to convert the continuous descriptions (e.g., bounding boxes) to a discretized space. To this end, as shown in Fig. 4, we follow Pix2Seq [4] to build the target sequence; what distinguishes our method is that we further simplify the bounding box to a single point and use the variable-length transcription instead of the single-token object category.

Figure 4: Pipeline of the sequence construction.

Figure 5: Input and output sequences of the decoder.

Specifically, the continuous coordinates of the central point of the text instance are uniformly discretized into integers between [1, 𝑛𝑏𝑖𝑛𝑠], where 𝑛𝑏𝑖𝑛𝑠 controls the degree of discretization. For example, an image with a long side of 800 pixels requires only 𝑛𝑏𝑖𝑛𝑠 = 800 to achieve zero quantization error. Note that the central point of the text instance is obtained by averaging the upper and lower midpoints, as shown in Fig. 7a. So far, a text instance can thereby be represented by a sequence of three parts, i.e., [𝑥, 𝑦, 𝑡], where (𝑥, 𝑦) are the discretized coordinates and 𝑡 is the transcription text. Notably, the transcriptions are inherently discrete, i.e., each character represents a category and can thus be easily appended to the sequence. However, different from generic object detection, which has a relatively fixed vocabulary (each 𝑡 represents an object category, such as pedestrian), 𝑡 can be any natural language text of any length in our task, resulting in a variable length of the target sequence, which may further cause misalignment issues and can consume more computational resources. To eliminate such problems, we pad or truncate the texts to a fixed length 𝑙𝑡𝑟, where the <PAD> token is used to fill the vacancy for shorter text instances. In addition, like other language modeling methods, <SOS> and <EOS> tokens are inserted at the head and tail of the sequence, indicating the start and the end of a sequence, respectively. Therefore, given an image that contains 𝑛𝑡𝑖 text instances, the constructed sequence will include (2 + 𝑙𝑡𝑟) × 𝑛𝑡𝑖 discrete tokens, where the text instances are randomly ordered, following previous works [4]. Supposing there are 𝑛𝑐𝑙𝑠 categories of characters (e.g., 97 for English characters and symbols), the vocabulary size of the dictionary used to tokenize the sequence can be calculated as 𝑛𝑏𝑖𝑛𝑠 + 𝑛𝑐𝑙𝑠 + 3, where the extra three classes are for the <PAD>, <SOS>, and <EOS> tokens. Empirically, we set 𝑙𝑡𝑟 and 𝑛𝑏𝑖𝑛𝑠 to 25 and 1,000, respectively, in our experiments. Moreover, the maximum value of 𝑛𝑡𝑖 is set to 60, which means a sequence containing more than 60 text instances will be truncated.
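To make the construction concrete, the following is a minimal Python sketch of how such a target sequence could be assembled from single-point annotations. The token layout, helper names, and character set are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of SPTS-style target-sequence construction (assumed token layout).
import random

N_BINS = 1000   # number of coordinate bins (n_bins)
L_TR = 25       # fixed transcription length (l_tr) after padding/truncation
CHARS = list("abcdefghijklmnopqrstuvwxyz0123456789!?.,:;'()- ")    # assumed character set
CHAR2ID = {c: N_BINS + i for i, c in enumerate(CHARS)}             # character tokens follow the bins
PAD, SOS, EOS = (N_BINS + len(CHARS) + i for i in range(3))        # three special tokens

def discretize(x, y, w, h, n_bins=N_BINS):
    """Map a continuous point in a w x h image to coordinate-bin tokens."""
    return int(x / w * (n_bins - 1)), int(y / h * (n_bins - 1))

def build_sequence(instances, w, h):
    """instances: list of ((x, y), transcription) single-point annotations."""
    random.shuffle(instances)                    # text instances are randomly ordered
    seq = [SOS]
    for (x, y), text in instances[:60]:          # at most 60 instances per image
        xb, yb = discretize(x, y, w, h)
        chars = [CHAR2ID.get(c, PAD) for c in text.lower()[:L_TR]]
        chars += [PAD] * (L_TR - len(chars))     # pad to the fixed length l_tr
        seq += [xb, yb] + chars                  # 2 + l_tr tokens per instance
    return seq + [EOS]
```

For example, an image with two text instances yields 2 × (2 + 25) + 2 = 56 integer tokens under this layout.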
2.2 Model Training
Based on the constructed sequence, the input and output sequences of the Transformer decoder are shown in Fig. 5. Since SPTS is trained to predict tokens, it only needs to maximize the likelihood loss at training time, which can be written as:

maximize ∑_{i=1}^{L} w_i log P(s̃_i | I, s_{1:i}),    (1)

where I is the input image, s̃ is the output sequence, s is the input sequence, L is the length of the sequence, and w_i is the weight of the likelihood of the i-th token, which is empirically set to 1.
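With all weights set to 1, Eq. (1) reduces to a standard token-level cross-entropy over the shifted sequence. Below is a minimal PyTorch-style sketch, assuming a decoder that returns per-position logits; the function and argument names are illustrative.

```python
# Minimal sketch of the sequence likelihood objective (Eq. 1) as a
# token-level cross-entropy, assuming a decoder that returns logits
# of shape (batch, length, vocab_size).
import torch
import torch.nn.functional as F

def sequence_loss(logits, target_tokens, token_weights=None):
    """Negative log-likelihood of the target sequence.

    logits:        (B, L, V) decoder outputs conditioned on the input tokens s_{1:L}
    target_tokens: (B, L)    output sequence s~ (the input shifted by one position)
    token_weights: (B, L)    per-token weights w_i (default: all ones)
    """
    log_probs = F.log_softmax(logits, dim=-1)
    nll = -log_probs.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1)  # (B, L)
    if token_weights is None:
        token_weights = torch.ones_like(nll)
    return (token_weights * nll).sum() / token_weights.sum()
```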
2.3 Inference
At the inference stage, SPTS auto-regressively predicts tokens until the end-of-sequence token <EOS> occurs. The predicted sequence is subsequently divided into multiple segments, each of which contains 2 + 𝑙𝑡𝑟 tokens. Then, the tokens can be easily translated into point coordinates and transcriptions, yielding the text spotting results. In addition, the likelihood of all tokens in the corresponding segment is averaged and assigned as a confidence score to filter the original outputs, which effectively removes redundant and false-positive predictions.
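The decoding step can be summarized by a short sketch; the token layout mirrors the construction sketch above, and the confidence threshold is an illustrative choice rather than a value reported in the paper.

```python
# Illustrative parsing of a predicted token sequence into
# (point, transcription, confidence) triplets.
def parse_sequence(tokens, probs, w, h, id2char, pad_id,
                   l_tr=25, n_bins=1000, min_score=0.5):
    """tokens / probs: flat lists of predicted token ids and their likelihoods
    (the <SOS>/<EOS> specials are assumed to be stripped already);
    id2char maps character-token ids back to characters."""
    results = []
    step = 2 + l_tr                                   # tokens per text instance
    for i in range(0, len(tokens) - step + 1, step):
        seg, seg_probs = tokens[i:i + step], probs[i:i + step]
        x = seg[0] / (n_bins - 1) * w                 # undo coordinate discretization
        y = seg[1] / (n_bins - 1) * h
        text = "".join(id2char.get(t, "") for t in seg[2:] if t != pad_id)
        score = sum(seg_probs) / len(seg_probs)       # averaged token likelihood
        results.append(((x, y), text, score))
    return [r for r in results if r[2] >= min_score]  # assumed confidence threshold
```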
3 EXPERIMENTS
We report experimental results on four widely used benchmarks, including the horizontal dataset ICDAR 2013 [16], the multi-oriented dataset ICDAR 2015 [14], and the arbitrarily shaped datasets Total-Text [5] and SCUT-CTW1500 [25].

3.1 Datasets
Curved Synthetic Dataset 150k. It is admitted that the performance of text spotters can be improved by pre-training on synthesized samples. Following previous work [23], we use the 150k synthetic images generated by the SynthText [9] toolbox, which contain around one-third curved texts and two-thirds horizontal instances.
ICDAR 2013 [16] contains 229 training and 233 testing samples. The images are primarily captured in a controlled environment, where the text contents of interest are explicitly focused and horizontal.
ICDAR 2015 [14] consists of 1,000 training and 500 testing images that were incidentally captured, containing multi-oriented text instances presented in complicated backgrounds with strong variations in blur, distortion, etc.
Total-Text [5] includes 1,255 training and 300 testing images, where at least one curved sample is presented in each image and annotated with a polygonal bounding box at the word level.

SCUT-CTW1500 [25] is another widely used benchmark designed for spotting arbitrarily shaped scene text, involving 1,000 and 500 images for training and testing, respectively. The text instances are labeled by polygons at the text-line level.

Table 1: Comparison of the end-to-end recognition performance evaluated by the proposed point-based metric and the box-based metric. Results are reproduced using official codes.

Method          Total-Text (Box / Point)   SCUT-CTW1500 (Box / Point)
ABCNetv1 [23]   67.2 / 67.4                53.5 / 53.0
ABCNetv2 [26]   71.7 / 71.9                57.6 / 57.1

Figure 6: Illustration of the point-based evaluation metric. Diamonds are predicted points and circles represent ground truth; a distance matrix between them determines the matching.

Figure 7: Indicated points (red) using different positions: (a) Central, (b) Top-left, (c) Random.
3.2 Evaluation Protocol
The existing evaluation protocol for text spotting consists of two steps. First, the intersection over union (IoU) scores between ground-truth (GT) and detected boxes are calculated; only if the IoU score is larger than a designated threshold (usually set to 0.5) are the boxes matched. Then, the recognized content inside each matched bounding box is compared with the GT transcription; only if the predicted text is the same as the GT does it contribute to the end-to-end accuracy. However, in the proposed method, each text instance is represented by a single point; thus, the IoU-based evaluation metric cannot be used to measure the performance. Meanwhile, comparing the localization performance between bounding-box-based methods and the proposed point-based SPTS might be unfair; e.g., directly treating points inside a bounding box as true positives may overestimate the detection performance. To this end, we propose a new evaluation metric to ensure a relatively fair comparison with existing approaches, which mainly considers the end-to-end accuracy, as it reflects both detection and recognition performance (failed detections usually lead to incorrect recognition results). Specifically, as shown in Fig. 6, we modify the text instance matching rule by replacing the IoU metric with a distance metric, i.e., the predicted point that has the nearest distance to the central point of the GT box is selected, and the recognition results are measured by the same full-matching rules used in existing benchmarks. Only one predicted point, the one with the highest confidence, is matched to the ground truth; the others are marked as false positives.

To explore whether the proposed evaluation protocol can genuinely represent the model accuracy, Table 1 compares the end-to-end recognition accuracy of ABCNetv1 [23] and ABCNetv2 [26] on Total-Text [5] and SCUT-CTW1500 [25] under two metrics, i.e., the commonly used bounding box metric based on IoU and the proposed point-based metric. The results demonstrate that the point-based evaluation protocol reflects the performance well, where the differences between the values evaluated by the box-based and point-based metrics are no more than 0.5%. For example, the ABCNetv1 model achieves 53.5% and 53.0% on the SCUT-CTW1500 dataset under the two metrics, respectively. Therefore, we use the point-based metric to evaluate the proposed SPTS in the following experiments.
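One way to realize this matching rule is sketched below: predictions are visited in order of decreasing confidence and greedily assigned to the nearest unmatched ground-truth point, with an exact transcription match counted as a true positive. The function names and the greedy simplification are assumptions made for illustration, not the official evaluation script.

```python
# Illustrative point-based matching for end-to-end evaluation.
import math

def match_points(preds, gts):
    """preds: list of ((x, y), text, score); gts: list of ((x, y), text).
    Returns the number of true positives under the full-matching rule."""
    preds = sorted(preds, key=lambda p: p[2], reverse=True)   # highest confidence first
    matched, tp = set(), 0
    for (px, py), text, _ in preds:
        # nearest ground-truth centre point that is still unmatched
        cands = [(math.hypot(px - gx, py - gy), j, gt_text)
                 for j, ((gx, gy), gt_text) in enumerate(gts) if j not in matched]
        if not cands:
            break                                              # remaining preds are false positives
        dist, j, gt_text = min(cands)
        matched.add(j)
        if text == gt_text:                                    # exact transcription match
            tp += 1
    return tp
```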
3.3 Implementation Details
The model is first pretrained on a combined dataset that includes the Curved Synthetic Dataset 150k [23], MLT-2017 [31], ICDAR 2013 [16], ICDAR 2015 [14], and Total-Text [5] for 150 epochs, optimized by AdamW [28] with an initial learning rate of 5 × 10−4 that is linearly decayed to 1 × 10−5. After pretraining, the model is fine-tuned on the training split of each target dataset for another 200 epochs with a fixed learning rate of 1 × 10−5. The entire model is distributively trained on 32 NVIDIA V100 GPUs with a batch size of 32. Note that the effective batch size is 64 because two independent augmentations are performed on each image in a mini-batch, following [4, 11]. In addition, we utilize ResNet-50 as the backbone network, while both the Transformer encoder and decoder consist of 6 layers with eight heads. Regarding the architecture of the Transformer, we adopt the Pre-LN Transformer [41]. During training, the short side of the input image is randomly resized to a range from 640 to 896 (intervals of 32) while keeping the longer side shorter than 1,600 pixels. Random cropping and rotation are employed for data augmentation. At the inference stage, we resize the short edge to 1,000 while keeping the longer side shorter than 1,824 pixels, following previous works [23, 26].
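As a rough guide to the optimization schedule described above, a PyTorch-style sketch is given below; the model object and the number of steps per epoch are placeholders, and the exact decay granularity is an assumption.

```python
# Illustrative optimizer and linear learning-rate decay for pretraining
# (5e-4 linearly decayed to 1e-5), not the released training script.
import torch

def make_optimizer_and_scheduler(model, pretrain_epochs=150, steps_per_epoch=1000):
    optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
    total_steps = pretrain_epochs * steps_per_epoch
    # multiplicative factor goes from 1.0 down to 1e-5 / 5e-4 = 0.02
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer,
        lambda step: max(1e-5 / 5e-4, 1.0 - step / total_steps))
    return optimizer, scheduler
```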
3.4 Ablation Study
3.4.1 Ablation study of the position of the indicated point. In this paper, we propose to simplify the bounding box to a single point. Intuitively, all points in the region enclosed by the bounding box should be able to represent the target text instance. To explore the differences, we conduct ablation studies that use three different strategies to obtain the indicated points (see Fig. 7): the central point obtained by averaging the upper and lower midpoints, the top-left corner, and a random point inside the box. It should be noted that we use the corresponding ground truth here to calculate the distance matrix for evaluating the performance, i.e., the distance to the ground-truth top-left point is used for top-left, the distance to the ground-truth central point for central, and the closest distance to the ground-truth polygon for random.
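As an illustration, the three indicated-point variants could be derived from a quadrilateral annotation as follows; the vertex ordering (top-left, top-right, bottom-right, bottom-left) and the use of the axis-aligned bounding box for the random point are assumptions made for this sketch.

```python
# Illustrative derivation of the three indicated-point variants from a
# quadrilateral annotation [(x, y)] * 4 ordered tl, tr, br, bl (assumed).
import random

def central_point(quad):
    """Average of the upper-edge and lower-edge midpoints."""
    (x_tl, y_tl), (x_tr, y_tr), (x_br, y_br), (x_bl, y_bl) = quad
    upper_mid = ((x_tl + x_tr) / 2, (y_tl + y_tr) / 2)
    lower_mid = ((x_bl + x_br) / 2, (y_bl + y_br) / 2)
    return ((upper_mid[0] + lower_mid[0]) / 2, (upper_mid[1] + lower_mid[1]) / 2)

def top_left_point(quad):
    """The top-left corner of the annotation."""
    return quad[0]

def random_point(quad):
    """A random point inside the axis-aligned bounding box of the quad."""
    xs, ys = [p[0] for p in quad], [p[1] for p in quad]
    return (random.uniform(min(xs), max(xs)), random.uniform(min(ys), max(ys)))
```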

The results are shown in Table 2, where the results of central, top-left, and random are close on both datasets. This suggests that the performance is not very sensitive to the position of the point annotation.

Table 2: Ablation study of the position of the indicated point.

Position   E2E Total-Text (None / Full)   E2E CTW1500 (None / Full)
Central    74.2 / 82.4                    63.6 / 83.8
Top-left   71.6 / 79.7                    61.4 / 82.0
Random     73.2 / 80.8                    62.3 / 81.1

3.4.2 Comparison between different representations. The proposed SPTS can be easily extended to produce bounding boxes by modifying the point coordinates to bounding box locations during sequence construction. Here, we conduct ablations to explore the influence of using different representations of the text instances. Specifically, three variants are explored, including the Bezier-curve bounding box (SPTS-Bezier), the rectangular bounding box (SPTS-Rect), and the indicated point (SPTS-Point). Since we only focus on end-to-end performance here, to minimize the impact of detection results, each method uses its corresponding representation to match the GT box in the evaluation. That is to say, the single-point model (original SPTS) uses the evaluation metric introduced in Sec. 3.2, i.e., the distance between points; the predictions of SPTS-Rect are matched to the circumscribed rectangle of the polygonal annotations; and SPTS-Bezier adopts the original metric that matches polygon boxes. As shown in Table 3, SPTS-Point achieves the best performance on both the Total-Text and SCUT-CTW1500 datasets, outperforming the other two representations by a large margin. These experimental results suggest that a low-cost annotation, i.e., the indicated point, is capable of providing supervision for the text spotting task. The reason for the low performance of SPTS-Bezier and SPTS-Rect may be that longer sequences (e.g., SPTS-Bezier with 𝑁𝑝 = 16 vs. SPTS-Point with 𝑁𝑝 = 2) make them difficult to converge; thus, SPTS-Bezier and SPTS-Rect cannot achieve comparable accuracy under the same training schedule.

Table 3: Comparison with different shapes of bounding boxes. 𝑁𝑝 is the number of parameters required to describe the location of text instances by different representations.

Variants      Total-Text (None / Full)   SCUT-CTW1500 (None / Full)   𝑁𝑝
SPTS-Bezier   60.6 / 71.6                52.6 / 73.9                  16
SPTS-Rect     71.6 / 80.4                62.2 / 82.3                  4
SPTS-Point    74.2 / 82.4                63.6 / 83.8                  2

3.5 Comparison with Existing Methods on Scene Text Benchmarks
3.5.1 Horizontal-Text Dataset. Table 4 compares the proposed SPTS with existing methods on the widely used ICDAR 2013 [16] benchmark. Our method achieves state-of-the-art results with all three lexicons. It should be noted that the proposed SPTS only utilizes a single point for training, while the other approaches are fully trained with more costly bounding boxes.

Table 4: End-to-end recognition results on ICDAR 2013. "S", "W", and "G" represent recognition with the "Strong", "Weak", and "Generic" lexicon, respectively.

Method                       S      W      G
Bounding box-based methods
Jaderberg et al. [13]        86.4   -      -
Textboxes [20]               91.6   89.7   83.9
Deep Text Spotter [3]        89.0   86.0   77.0
Li et al. [17]               91.1   89.8   84.6
MaskTextSpotter [29]         92.2   91.1   86.5
Point-based method
SPTS (Ours)                  93.3   91.7   88.5

Table 5: End-to-end recognition results on ICDAR 2015. "S", "W", and "G" represent recognition with the "Strong", "Weak", and "Generic" lexicon, respectively.

Method                       S      W      G
Bounding box-based methods
FOTS [22]                    81.1   75.9   60.8
Mask TextSpotter [18]        83.0   77.7   73.5
CharNet [21]                 83.1   79.2   69.1
TextDragon [8]               82.5   78.3   65.2
Mask TextSpotter v3 [19]     83.3   78.1   74.2
MANGO [32]                   81.8   78.9   67.3
ABCNetV2 [26]                82.7   78.5   73.0
PAN++ [40]                   82.7   78.2   69.2
Point-based method
SPTS (Ours)                  77.5   70.2   65.8

Table 6: End-to-end recognition results on Total-Text. "None" represents lexicon-free. "Full" represents that we use all the words appearing in the test set.

Method                       None   Full
Bounding box-based methods
CharNet [21]                 66.6   -
ABCNet [23]                  64.2   75.7
PGNet [39]                   63.1   -
Mask TextSpotter [29]        65.3   77.4
Qin et al. [33]              67.8   -
Mask TextSpotter v3 [19]     71.2   78.4
MANGO [32]                   72.9   83.6
PAN++ [40]                   68.6   78.6
ABCNet v2 [26]               70.4   78.1
Point-based method
SPTS (Ours)                  74.2   82.4

Table 7: End-to-end recognition results on SCUT-CTW1500. "None" represents lexicon-free. "Full" represents that we use all the words appearing in the test set. ABCNet* means using the GitHub checkpoint2.

Method                       None   Full
Bounding box-based methods
TextDragon [8]               39.7   72.4
ABCNet [23]                  45.2   74.1
ABCNet* [23]                 53.2   76.0
MANGO [32]                   58.9   78.7
ABCNet v2 [26]               57.5   77.2
Point-based method
SPTS (Ours)                  63.6   83.8

2 https://github.com/aim-uofa/AdelaiDet/blob/master/configs/BAText/README.md


Figure 8: Qualitative results on the scene text benchmarks. Images are selected from Total-Text (first row), SCUT-CTW1500 (second row),
ICDAR 2013 (third row), and ICDAR 2015 (fourth row). Zoom in for best view.

3.5.2 Multi-Oriented Dataset. The quantitative results on the ICDAR 2015 [14] dataset are shown in Table 5. A performance gap between the proposed SPTS and state-of-the-art methods can still be found, which reveals a limitation of our method for the tiny texts that are often present in the ICDAR 2015 dataset. Because the sequence is directly decoded from the features of the entire image without dedicated RoI operations, tiny texts are difficult for our method to handle.

3.5.3 Arbitrarily Shaped Dataset. We further compare our method with existing approaches on the benchmarks containing arbitrarily shaped texts, including Total-Text [5] and SCUT-CTW1500 [25]. As shown in Table 6, SPTS achieves state-of-the-art performance using only extremely low-cost point annotations. Additionally, Table 7 shows that our method outperforms state-of-the-art approaches by a large margin on the challenging SCUT-CTW1500 dataset, which further demonstrates the potential of our method.

3.5.4 Summary. In summary, the proposed SPTS can achieve state-of-the-art performance compared with previous text spotters on several widely used benchmarks. Especially on the two curved datasets, i.e., Total-Text [5] and SCUT-CTW1500 [25], the proposed SPTS outperforms some recently proposed methods by a large margin. The reasons why our method achieves better accuracy on arbitrarily shaped texts might be: (1) The proposed SPTS discards the task-specific modules (e.g., RoI modules) designed based on prior knowledge; therefore, the recognition accuracy is decoupled from the detection results, i.e., SPTS can achieve acceptable recognition results even when the detected position is shifted. In contrast, the recognition heads of other methods heavily rely on the detection results, which is the main reason for their poor end-to-end accuracy; once the text instance cannot be perfectly localized, their recognition heads fail to work. (2) Although previous models are trained in an end-to-end manner, the interactions between their detection and recognition branches are limited. Specifically, the features fed to the recognition module are sampled based on the ground-truth position during training but from detection results at the inference stage, leading to feature misalignment, which is far more severe on curved text instances. By tackling the spotting task in a sequence modeling manner, the proposed SPTS eliminates such issues, thus showing more robustness on arbitrarily shaped datasets. The visualization results of SPTS on the four testing datasets are shown in Fig. 8.

Table 8: Comparison between the end-to-end recognition results of the SPTS and NPTS models.

Method   Total-Text (None / Full)   SCUT-CTW1500 (None / Full)   ICDAR 2013 (S / W / G)   ICDAR 2015 (S / W / G)
SPTS     74.2 / 82.4                63.6 / 83.8                  93.3 / 91.7 / 88.5       77.5 / 70.2 / 65.8
NPTS     64.7 / 71.9                55.4 / 74.3                  90.4 / 84.9 / 80.2       69.4 / 60.3 / 55.6

3.6 Extensions of SPTS
3.6.1 No-Point Text Spotting. The experiments suggest that the detection and recognition may have been decoupled. Based on these results, we further show that SPTS can converge even without the supervision of the single-point annotations. The No-Point Text Spotting (NPTS) model is obtained by removing the coordinates of the indicated points from the constructed sequence. Fig. 9 shows the qualitative results of NPTS, which indicate that the model may have learned the ability to implicitly find the locations of the text merely based on the transcriptions. The comparison between the end-to-end recognition results of the SPTS and NPTS models is presented in Tab. 8. The evaluation metric described in Sec. 3.2 is adapted for NPTS, where the distance matrix between the predicted and GT points is replaced with an edit distance matrix between the predicted and GT transcriptions. Despite the obvious gap between SPTS and NPTS, the preliminary results achieved by NPTS are still surprising and very encouraging, and worth studying in the future.

Figure 9: Qualitative results of the NPTS model on several scene text benchmarks. Images are selected from Total-Text (first row), SCUT-CTW1500 (second row), ICDAR 2013 (third row), and ICDAR 2015 (fourth row). Zoom in for best view.
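For the NPTS evaluation, the point distance is thus replaced by a transcription edit distance. A minimal sketch of computing such an edit distance matrix (standard Levenshtein dynamic programming, written here as an assumed helper rather than the released evaluation code) is given below.

```python
# Illustrative edit-distance matrix between predicted and GT transcriptions,
# used in place of the point-distance matrix when evaluating NPTS.
def edit_distance(a, b):
    """Standard Levenshtein distance between two strings (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def edit_distance_matrix(pred_texts, gt_texts):
    return [[edit_distance(p, g) for g in gt_texts] for p in pred_texts]
```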
3.6.2 Single-Point Object Detection. To demonstrate the generality of SPTS, we conduct experiments on the Pascal VOC [7] object detection task, where the model is trained with central points and the corresponding categories. All other settings are identical to the text spotting experiments. Some preliminary qualitative results on the validation set are shown in Fig. 10. The results suggest that a single point might be viable for providing extremely low-cost annotation for general object detection.

Figure 10: Qualitative object detection results on the Pascal VOC 2012 validation set under single-point supervision.

4 LIMITATION
One limitation of the proposed framework is that the training procedure requires a large amount of computing resources. For example, 150 epochs of pretraining on 160k scene text images and 200 epochs of fine-tuning require approximately 63 hours when the model is distributively trained on 32 NVIDIA V100 GPU cards. Moreover, owing to the auto-regressive decoding manner, the inference speed is only 1.0 fps on ICDAR 2013 using one NVIDIA V100 GPU.

Another limitation of SPTS is that it cannot handle extremely tiny text spotting well (as indicated in Sec. 3.5.2), because the feature representation of SPTS cannot reach a high enough resolution for extracting effective tiny-text features. This issue deserves further study in the future.

5 CONCLUSION
We propose SPTS, which is, to the best of our knowledge, a pioneering method that tackles scene text spotting using only the extremely low-cost single-point annotation. This successful attempt sheds brand-new insights that challenge the necessity of traditional box annotations in the field. SPTS is an auto-regressive Transformer-based framework that simply generates the results as sequential tokens, avoiding complex post-processing or exclusive sampling stages. Based on such a concise framework, extensive experiments demonstrate state-of-the-art performance of SPTS on various datasets. We further show that our method has the potential to be extended to no-point text spotting and generic object detection tasks.

ACKNOWLEDGEMENT
This research is supported in part by NSFC (Grant No. 61936003), GD-NSF (No. 2017A030312006, No. 2021A1515011870), and the Science and Technology Foundation of Guangzhou Huangpu Development District (Grant 2020GH17).

REFERENCES
[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. Character region awareness for text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 9365–9374.
[2] Christian Bartz, Haojin Yang, and Christoph Meinel. 2018. SEE: Towards semi-supervised end-to-end scene text recognition. In Proc. AAAI Conf. Artificial Intell. 6674–6681.
[3] Michal Busta, Lukas Neumann, and Jiri Matas. 2017. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proc. IEEE Int. Conf. Comp. Vis. 2204–2212.
[4] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. 2022. Pix2Seq: A language modeling framework for object detection. In Proc. Int. Conf. Learn. Represent.
[5] Chee Kheng Ch’ng and Chee Seng Chan. 2017. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proc. Int. Conf. Doc. Anal. and Recognit., Vol. 1. IEEE, 935–942.
[6] Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. 2019. ICDAR2019 robust reading challenge on arbitrary-shaped text-RRC-ArT. In Proc. Int. Conf. Doc. Anal. and Recognit. 1571–1576.
[7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303–338.
[8] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. 2019. TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting. In Proc. IEEE Int. Conf. Comp. Vis. 9076–9085.
[9] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 2315–2324.
[10] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. 2018. An end-to-end textspotter with explicit alignment and attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 5020–5029.
[11] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. 2020. Augment your batch: Improving generalization through instance repetition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 8129–8138.
[12] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. 2017. WordSup: Exploiting word annotations for character based text detection. In Proc. IEEE Int. Conf. Comp. Vis. 4940–4949.
[13] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2016. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116, 1 (2016), 1–20.
[14] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. 2015. ICDAR 2015 competition on robust reading. In Proc. Int. Conf. Doc. Anal. and Recognit. IEEE, 1156–1160.
[15] Dimosthenis Karatzas, S Robles Mestre, Joan Mas, Farshad Nourbakhsh, and P Pratim Roy. 2011. ICDAR 2011 robust reading competition-challenge 1: reading text in born-digital images (web and email). In Proc. Int. Conf. Doc. Anal. and Recognit. 1485–1490.
[16] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. 2013. ICDAR 2013 robust reading competition. In Proc. Int. Conf. Doc. Anal. and Recognit. IEEE, 1484–1493.
[17] Hui Li, Peng Wang, and Chunhua Shen. 2017. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proc. IEEE Int. Conf. Comp. Vis. 5238–5246.
[18] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu, and Xiang Bai. 2019. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Trans. Pattern Anal. Mach. Intell. (2019), 532–548.
[19] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. 2020. Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting. In Proc. Eur. Conf. Comp. Vis. 706–722.
[20] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. 2017. TextBoxes: A fast text detector with a single deep neural network. In Proc. AAAI Conf. Artificial Intell. 4161–4167.
[21] Xing Linjie, Tian Zhi, Huang Weilin, and R. Scott Matthew. 2019. Convolutional Character Networks. In Proc. IEEE Int. Conf. Comp. Vis. 9126–9136.
[22] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. 2018. FOTS: Fast oriented text spotting with a unified network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 5676–5685.
[23] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. 2020. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. Proc. IEEE Conf. Comp. Vis. Patt. Recogn. (2020), 9809–9818.
[24] Yuliang Liu and Lianwen Jin. 2017. Deep matching prior network: Toward tighter multi-oriented text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 1962–1969.
[25] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. 2019. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recogn. 90 (2019), 337–345.
[26] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. 2021. ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting. IEEE Trans. Pattern Anal. Mach. Intell. (2021), 1–1. https://doi.org/10.1109/TPAMI.2021.3107437
[27] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2018. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proc. Eur. Conf. Comp. Vis. 20–36.
[28] Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. Proc. Int. Conf. Learn. Representations (2018).
[29] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proc. Eur. Conf. Comp. Vis. 67–83.
[30] Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, et al. 2019. ICDAR 2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition–RRC-MLT-2019. In Proc. Int. Conf. Doc. Anal. and Recognit. 1582–1587.
[31] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. 2017. ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In Proc. Int. Conf. Doc. Anal. and Recognit., Vol. 1. IEEE, 1454–1459.
[32] Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. 2021. MANGO: A Mask Attention Guided One-Stage Scene Text Spotter. In Proc. AAAI Conf. Artificial Intell. 2467–2476.
[33] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. 2019. Towards unconstrained end-to-end text spotting. In Proc. IEEE Int. Conf. Comp. Vis. 4704–4714.
[34] Shangxuan Tian, Shijian Lu, and Chongshou Li. 2017. WeText: Scene text detection under weak supervision. In Proc. IEEE Int. Conf. Comp. Vis. 1492–1500.
[35] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. 2016. Detecting text in natural image with connectionist text proposal network. In Proc. Eur. Conf. Comp. Vis. Springer, 56–72.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proc. Advances in Neural Inf. Process. Syst., Vol. 30.
[37] Fangfang Wang, Yifeng Chen, Fei Wu, and Xi Li. 2020. TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection. In Proc. ACM Int. Conf. Multimedia. 111–119.
[38] Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, Yongchao Xu, Mengchao He, Yongpan Wang, and Wenyu Liu. 2020. All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting. In Proc. AAAI Conf. Artificial Intell. 12160–12167.
[39] Pengfei Wang, Chengquan Zhang, Fei Qi, Shanshan Liu, Xiaoqiang Zhang, Pengyuan Lyu, Junyu Han, Jingtuo Liu, Errui Ding, and Guangming Shi. 2021. PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network. In Proc. AAAI Conf. Artificial Intell. 2782–2790.
[40] Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Yang Zhibo, Tong Lu, and Chunhua Shen. 2021. PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text. IEEE Trans. Pattern Anal. Mach. Intell. (2021), 1–1. https://doi.org/10.1109/TPAMI.2021.3077555
[41] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the Transformer architecture. In Proc. Int. Conf. Mach. Learn. 10524–10533.
[42] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. 2012. Detecting texts of arbitrary orientations in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. IEEE, 1083–1090.
[43] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: An efficient and accurate scene text detector. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 5551–5560.
[44] Yiqin Zhu, Jianyong Chen, Lingyu Liang, Zhanghui Kuang, Lianwen Jin, and Wayne Zhang. 2021. Fourier contour embedding for arbitrary-shaped text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 3123–3131.

APPENDIX

A TINY TEXT SPOTTING
As discussed in Sec. 3.5.2 and Sec. 4, the proposed SPTS cannot accurately recognize tiny texts because it directly predicts the sequence based on low-resolution high-level features without RoI operations. Especially on the ICDAR 2015 dataset, there is still a performance gap (65.8 vs. 74.2) between our method and the state-of-the-art approaches. Quantitatively, if texts with an area smaller than 3,000 (after resizing) are ignored during evaluation, the F-measure with generic lexicons on ICDAR 2015 is significantly improved from 65.8 to 73.5. Furthermore, current state-of-the-art methods on ICDAR 2015 usually adopt larger image sizes during testing. For example, the short sides of the testing images are resized to 1,440 pixels while the long sides are kept shorter than 4,000 pixels in Mask TextSpotter v3 [19]. As shown in Tab. 9, the performance of SPTS on ICDAR 2015 with the larger testing size is much better than that with the smaller testing size, indicating that tiny text spotting is an important issue worth studying in the future.

Table 9: End-to-end recognition results on ICDAR 2015. "S", "W", and "G" represent recognition with the "Strong", "Weak", and "Generic" lexicon, respectively.

Method                       S      W      G
Bounding box-based methods
FOTS [22]                    81.1   75.9   60.8
Mask TextSpotter [18]        83.0   77.7   73.5
CharNet [21]                 83.1   79.2   69.1
TextDragon [8]               82.5   78.3   65.2
Mask TextSpotter v3 [19]     83.3   78.1   74.2
MANGO [32]                   81.8   78.9   67.3
ABCNetV2 [26]                82.7   78.5   73.0
PAN++ [40]                   82.7   78.2   69.2
Point-based method
SPTS (1000)                  77.5   70.2   65.8
SPTS (1440)                  79.5   74.1   70.2

B ORDER OF TEXT INSTANCES
As described in Sec. 2.1, the text instances are randomly ordered in the constructed sequence, following Pix2Seq [4]. In this section, we further investigate the impact of the order of text instances. The performance on Total-Text and SCUT-CTW1500 of different ordering strategies is presented in Tab. 10. "Area" and "Dist2ori" mean that text instances are sorted by area and by distance to the top-left origin in descending order, respectively. "Topdown" indicates that text instances are arranged from top to bottom. It can be seen that the random order adopted in SPTS achieves the best performance, which may be explained by the improved robustness due to the different sequences constructed for the same image at different iterations.

Table 10: Ablation study of different ordering strategies of text instances in the sequence construction.

Order      Total-Text (None / Full)   SCUT-CTW1500 (None / Full)
Area       70.7 / 79.2                59.0 / 75.3
Topdown    73.2 / 81.3                62.7 / 83.7
Dist2ori   72.1 / 81.8                61.1 / 79.6
Random     74.2 / 82.4                63.6 / 83.8
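The four ordering strategies compared in Tab. 10 could be implemented along the following lines; the instance fields and tie-breaking behavior are assumptions made for illustration.

```python
# Illustrative instance-ordering strategies for sequence construction.
import math
import random

def order_instances(instances, strategy="random"):
    """instances: list of dicts with keys 'point' (x, y) and 'area'."""
    if strategy == "area":        # largest instances first
        return sorted(instances, key=lambda t: t["area"], reverse=True)
    if strategy == "topdown":     # top of the image first
        return sorted(instances, key=lambda t: t["point"][1])
    if strategy == "dist2ori":    # farthest from the top-left origin first
        return sorted(instances, key=lambda t: math.hypot(*t["point"]), reverse=True)
    random.shuffle(instances)     # default: random order, re-drawn every iteration
    return instances
```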
C FURTHER DISCUSSION ON THE REPRESENTATION OF TEXT INSTANCES
In Sec. 3.4.2, the ablation experiments demonstrate that SPTS-Point, with fewer parameters for describing the location of text instances, outperforms SPTS-Rect and SPTS-Bezier. The longer sequence required by SPTS-Rect and SPTS-Bezier may make them difficult to converge. In Tab. 3, the results of SPTS-Rect and SPTS-Bezier are obtained using the same training schedule as SPTS-Point. To further explore their potential, we compare the SPTS-Bezier trained for 2× epochs with SPTS-Point in Tab. 11. Normally, SPTS-Bezier should perform better than SPTS-Point owing to its more detailed annotation. However, it can be seen that SPTS-Bezier with 2× epochs does not significantly outperform the counterpart with 1× epochs and is still inferior to SPTS-Point with 1× epochs. The reason may be the dramatically increased difficulty of converging the Transformer with longer decoded sequences. More training data and epochs may be necessary to address this problem.

Table 11: Further comparison of different representations of text instances.

Variants      Epochs   Total-Text (None / Full)   SCUT-CTW1500 (None / Full)   𝑁𝑝
SPTS-Bezier   1×       60.6 / 71.6                52.6 / 73.9                  16
SPTS-Bezier   2×       62.9 / 74.4                51.1 / 74.3                  16
SPTS-Point    1×       74.2 / 82.4                63.6 / 83.8                  2
