SPTS: Single-Point Text Spotting
Dezhi Peng, Xinyu Wang, Yuliang Liu
(a) Rectangle (55s). (b) Quadrilateral (96s). (c) Character (581s). (d) Polygon (172s). (e) Single-Point (11s).
Figure 1: Different annotation styles and their time costs (for all text instances in the sample image), measured with the LabelMe tool. Green areas are positive samples, while red dashed boxes indicate noise that may be included. Single-point annotation is more than 50 times faster than character-level annotation.
Figure 3: Overall framework of the proposed SPTS. The visual and contextual features are first extracted by a series of CNN and Transformer
encoders. Then, the features are auto-regressively decoded into a sequence that contains both localization and recognition information, which
is subsequently translated into point coordinates and text transcriptions. Only a point-level annotation is required for training.
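To make the pipeline summarized in the caption of Fig. 3 concrete, the following is a minimal PyTorch-style sketch of such an encoder-decoder spotter. The module layout, backbone choice, and hyper-parameters are illustrative assumptions rather than the authors' released implementation, and positional encodings for the encoder memory are omitted for brevity.

    import torch
    import torch.nn as nn
    import torchvision

    class SPTSLikeSpotter(nn.Module):
        """Sketch of a CNN + Transformer encoder-decoder that emits location/recognition tokens."""
        def __init__(self, vocab_size, d_model=256, nhead=8, num_layers=6, max_len=1024):
            super().__init__()
            backbone = torchvision.models.resnet50(weights=None)       # CNN backbone (illustrative)
            self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep feature map, drop pool/fc
            self.proj = nn.Conv2d(2048, d_model, kernel_size=1)        # project to model width
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
            self.token_embed = nn.Embedding(vocab_size, d_model)
            self.pos_embed = nn.Embedding(max_len, d_model)
            self.head = nn.Linear(d_model, vocab_size)

        def forward(self, images, tgt_tokens):
            # images: (B, 3, H, W); tgt_tokens: (B, T) decoder input beginning with <SOS>.
            feats = self.proj(self.cnn(images))                        # (B, d_model, h, w)
            memory = self.encoder(feats.flatten(2).transpose(1, 2))    # (B, h*w, d_model)
            pos = torch.arange(tgt_tokens.size(1), device=tgt_tokens.device)
            tgt = self.token_embed(tgt_tokens) + self.pos_embed(pos)
            mask = nn.Transformer.generate_square_subsequent_mask(tgt_tokens.size(1)).to(images.device)
            out = self.decoder(tgt, memory, tgt_mask=mask)             # causal (auto-regressive) decoding
            return self.head(out)                                      # (B, T, vocab_size) token logits

In such a setup, the decoder input during training would be the target sequence shifted right (teacher forcing) with a token-level cross-entropy loss on the logits, and at inference time tokens would be generated one at a time until <EOS> is produced.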
and regression-based methods [8, 23, 38]. The former first predicts masks to segment text instances, and the features inside the text regions are then sampled and grouped for further recognition. For example, Mask TextSpotter v3 [19] proposes a Segmentation Proposal Network (SPN) instead of the regular RPN to decouple neighboring text instances accurately, thus significantly improving performance. Regression-based methods, in contrast, usually parameterize text instances as a sequence of coordinates and learn to predict them. For instance, ABCNet [23] converts polygons into Bezier curves, significantly improving performance on curved scene texts. Wang et al. [38] first localize the boundary points of text instances; the features rectified by Thin-Plate-Spline are then fed into the recognition branch, demonstrating promising accuracy on arbitrarily shaped instances. Moreover, Xing et al. [21] boost text spotting performance by utilizing character-level annotations, where character bounding boxes as well as type segmentation maps are predicted simultaneously, enabling impressive performance. Although these methods adopt different representations to describe text instances, all of them are derived from rectangular, quadrilateral, or polygonal bounding boxes. Such annotations must be carefully labeled by humans and are therefore quite expensive, limiting the scale of training datasets.
In this paper, we propose Single-Point Text Spotting (SPTS), which is, to the best of our knowledge, the first scene text spotter that does not rely on bounding box annotations at all. Specifically, each text instance is represented by a single point (see Fig. 1e) at a meager annotation cost. The fact that this point does not need to be accurately placed further demonstrates the possibility of learning in a weakly supervised manner, considerably lowering the labeling cost.

2 METHODOLOGY

Most existing text spotting algorithms treat the problem as two sub-tasks, i.e., text detection and recognition, even though the entire network might be optimized end-to-end. Customized modules such as BezierAlign [23], RoISlide [8], and RoIMasking [19] are required to bridge the detection and recognition modules, where backbone features are cropped and shared between the detection and recognition heads. Under such designs, the recognition and detection modules are highly coupled. For example, the features fed to the recognition head are usually cropped from the ground-truth bounding box at the training stage, since detection results are not good enough in the first iterations; consequently, the recognition result is susceptible to interference from the detected bounding box during the test phase.
Recently, Pix2Seq [4] pioneered casting the generic object detection problem as a language modeling task, based on the intuitive assumption that if a deep model knows what and where the target is, it can be taught to express the result as the desired sequence. Thanks to this concise pipeline, labels with different attributes, such as location coordinates and object categories, can be integrated into a single sequence, enabling an end-to-end trainable framework without task-specific modules (e.g., Region Proposal Networks and RoI pooling layers), which can thus be adapted to the text spotting task. Inspired by this, we propose Single-Point Text Spotting (SPTS). Unlike Pix2Seq, which is designed only for object detection and still requires bounding boxes for all instances, SPTS tackles both text detection and recognition as an end-to-end sequence prediction task using only the single-point location and text annotations. Compared with existing text spotting approaches, SPTS follows a much simpler and more concise pipeline in which the input image is translated into a sequence containing both localization and recognition results, genuinely performing text detection and recognition simultaneously.
Specifically, as shown in Fig. 3, each input image is first sequentially encoded by a CNN and a Transformer encoder [36] to extract visual and contextual features. The captured features are then decoded by a Transformer decoder [36], where tokens are predicted in an auto-regressive manner. Unlike previous algorithms, we further simplify the bounding box to the center point of the text instance, the corner point at the top left of the first character, or a random point within the text instance, as described in Fig. 7. Benefiting from such a simple yet effective representation, the modules carefully designed based on prior knowledge, such as the grouping strategies used in segmentation-based methods and the feature sampling blocks of box-based text spotters, can be eschewed. Therefore, the recognition accuracy is not limited by poor detection results, significantly improving the robustness of the model.

2.1 Sequence Construction

The fact that a sequence can carry information with multiple attributes naturally suits the text spotting task, in which text instances are simultaneously localized and recognized. To express the target text instances as a sequence, the continuous descriptions (e.g., bounding boxes) must be converted into a discretized space. To this end, as shown in Fig. 4, we follow Pix2Seq [4] to build the target sequence; what distinguishes our method is that we further simplify the bounding box to a single point and use a variable-length transcription instead of the single-token object category.
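As a concrete illustration of this construction, the sketch below discretizes single-point annotations and their transcriptions into a token sequence. The bin count, transcription length, character set, and helper names are assumptions made for the example; only the overall recipe (quantize the point, pad the transcription, append <EOS>) follows the description above.

    import math

    N_BINS = 1000          # number of coordinate bins (assumed value)
    MAX_TRANSCRIPT = 25    # transcription padded/truncated to a fixed length (assumed value)
    CHARS = "abcdefghijklmnopqrstuvwxyz0123456789"
    # Token ids: [0, N_BINS) are coordinate bins, then characters, then special tokens.
    CHAR_OFFSET = N_BINS
    PAD = CHAR_OFFSET + len(CHARS)   # <PAD> token
    SOS = PAD + 1                    # <SOS> token
    EOS = PAD + 2                    # <EOS> token

    def quantize(v, size):
        """Map a continuous coordinate v in [0, size] to a discrete bin in [0, N_BINS)."""
        return min(N_BINS - 1, int(math.floor(v / size * N_BINS)))

    def build_target_sequence(instances, img_w, img_h):
        """instances: list of ((x, y), transcription) single-point annotations."""
        seq = []
        for (x, y), text in instances:
            seq += [quantize(x, img_w), quantize(y, img_h)]
            chars = [CHAR_OFFSET + CHARS.index(c) for c in text.lower() if c in CHARS][:MAX_TRANSCRIPT]
            seq += chars + [PAD] * (MAX_TRANSCRIPT - len(chars))   # pad transcription to fixed length
        return seq + [EOS]

    # Example with two Fig. 4-style annotations (coordinates are made up for illustration).
    target = build_target_sequence([((120.0, 36.5), "HAIGHT"), ((88.0, 90.2), "VINTAGE")], 640, 480)

A larger vocabulary (upper case, punctuation) or a different padding length would change only the constants, not the construction itself.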
[Figure 4 (content not recoverable from extraction): original point annotations, e.g., (x1, y1) HAIGHT, (x2, y2) ASHBURY, (x3, y3) VINTAGE, are discretized as ⌊x_i/w × n_bins⌋, ⌊y_i/h × n_bins⌋, and each transcription is padded with <PAD> tokens to a fixed length before being concatenated into the target sequence.]
Figure 5: Input and output sequences of the decoder.
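Following Fig. 5, the sketch below shows how the decoder input/output pair could be formed from such a target sequence, and how a predicted sequence could be translated back into point coordinates and transcriptions. It reuses the vocabulary constants from the previous sketch and is an illustrative reading of the figure, not the released code.

    # Assumes N_BINS, MAX_TRANSCRIPT, CHARS, CHAR_OFFSET, PAD, SOS, EOS as defined above.

    def make_decoder_pair(target_seq):
        # Teacher forcing: the decoder input is the target shifted right, starting with <SOS>;
        # the decoder output is supervised against the original target, which ends with <EOS>.
        return [SOS] + target_seq[:-1], target_seq

    def translate(pred_seq, img_w, img_h):
        # Convert a predicted token sequence back into (point, transcription) results.
        step = 2 + MAX_TRANSCRIPT
        results, i = [], 0
        while i + step <= len(pred_seq) and pred_seq[i] != EOS:
            xb, yb = pred_seq[i], pred_seq[i + 1]
            point = ((xb + 0.5) / N_BINS * img_w, (yb + 0.5) / N_BINS * img_h)
            text = "".join(CHARS[t - CHAR_OFFSET]
                           for t in pred_seq[i + 2:i + step] if CHAR_OFFSET <= t < PAD)
            results.append((point, text))
            i += step
        return results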
Table 2: Ablation study of the position of the indicated point.

  Position    E2E Total-Text (None / Full)    E2E CTW1500 (None / Full)
  Central     74.2 / 82.4                     63.6 / 83.8
  Top-left    71.6 / 79.7                     61.4 / 82.0
  Random      73.2 / 80.8                     62.3 / 81.1

Table 3: Comparison with different shapes of bounding boxes. N_p is the number of parameters required to describe the location of text instances by different representations.

  Variants      Total-Text (None / Full)    SCUT-CTW1500 (None / Full)    N_p
  SPTS-Bezier   60.6 / 71.6                 52.6 / 73.9                   16
  SPTS-Rect     71.6 / 80.4                 62.2 / 82.3                   4
  SPTS-Point    74.2 / 82.4                 63.6 / 83.8                   2

Table 4: End-to-end recognition results on ICDAR 2013. "S", "W", and "G" represent recognition with the "Strong", "Weak", and "Generic" lexicons, respectively.

  Method                        IC13 End-to-End (S / W / G)
  Bounding box-based methods
  Jaderberg et al. [13]         86.4 / -    / -
  TextBoxes [20]                91.6 / 89.7 / 83.9
  Deep TextSpotter [3]          89.0 / 86.0 / 77.0
  Li et al. [17]                91.1 / 89.8 / 84.6
  Mask TextSpotter [29]         92.2 / 91.1 / 86.5
  Point-based method
  SPTS (Ours)                   93.3 / 91.7 / 88.5

Table 6: End-to-end recognition results on Total-Text. "None" denotes lexicon-free; "Full" means all the words that appear in the test set are used.

  Method                        Total-Text End-to-End (None / Full)
  Bounding box-based methods
  CharNet [21]                  66.6 / -
  ABCNet [23]                   64.2 / 75.7
  PGNet [39]                    63.1 / -
  Mask TextSpotter [29]         65.3 / 77.4
  Qin et al. [33]               67.8 / -
  Mask TextSpotter v3 [19]      71.2 / 78.4
  MANGO [32]                    72.9 / 83.6
  PAN++ [40]                    68.6 / 78.6
  ABCNet v2 [26]                70.4 / 78.1
  Point-based method
  SPTS (Ours)                   74.2 / 82.4

Table 7: End-to-end recognition results on SCUT-CTW1500. "None" denotes lexicon-free; "Full" means all the words that appear in the test set are used. ABCNet* means using the GitHub checkpoint.

  Method                        SCUT-CTW1500 End-to-End (None / Full)
  Bounding box-based methods
  TextDragon [8]                39.7 / 72.4
  ABCNet [23]                   45.2 / 74.1
  ABCNet* [23]                  53.2 / 76.0
  MANGO [32]                    58.9 / 78.7
  ABCNet v2 [26]                57.5 / 77.2
  Point-based method
  SPTS (Ours)                   63.6 / 83.8
Figure 8: Qualitative results on the scene text benchmarks. Images are selected from Total-Text (first row), SCUT-CTW1500 (second row), ICDAR 2013 (third row), and ICDAR 2015 (fourth row). Zoom in for the best view.
3.5.2 Multi-Oriented Dataset. The quantitative results on the ICDAR 2015 [14] dataset are shown in Table 5. A performance gap between the proposed SPTS and state-of-the-art methods can still be found, which reveals a limitation of our method on the tiny texts that frequently appear in ICDAR 2015. Because the sequence is decoded directly from the features of the entire image, without dedicated RoI operations, tiny texts are difficult for our method to handle.

3.5.3 Arbitrarily Shaped Dataset. We further compare our method with existing approaches on benchmarks containing arbitrarily shaped text, including Total-Text [5] and SCUT-CTW1500 [25]. As shown in Table 6, SPTS achieves state-of-the-art performance using only extremely low-cost point annotations. Additionally, Table 7 shows that our method outperforms state-of-the-art approaches by a large margin on the challenging SCUT-CTW1500 dataset, which further demonstrates the potential of our method.

3.5.4 Summary. In summary, the proposed SPTS achieves state-of-the-art performance compared with previous text spotters on several widely used benchmarks. Especially on the two curved-text datasets, i.e., Total-Text [5] and SCUT-CTW1500 [25], SPTS outperforms some recently proposed methods by a large margin. The reasons why our method achieves better accuracy on arbitrarily shaped text may be as follows. (1) The proposed SPTS discards the task-specific modules (e.g., RoI modules) designed based on prior knowledge; therefore, the recognition accuracy is decoupled from the detection results, i.e., SPTS can achieve acceptable recognition results even if the detected position is shifted. In contrast, the recognition heads of other methods rely heavily on the detection results, which is the main reason for their poor end-to-end accuracy: once a text instance cannot be accurately localized, their recognition heads fail to work. (2) Although previous models are trained in an end-to-end manner, the interaction between their detection and recognition branches is limited. Specifically, the features fed to the recognition module are sampled based on the ground-truth position during training but on the detection results at inference time, leading to feature misalignment that is far more severe on curved text instances. By tackling the spotting task as sequence modeling, the proposed SPTS eliminates such issues and is thus more robust on arbitrarily shaped datasets. The visualization results of SPTS on the four test datasets are shown in Fig. 8.
Table 8: Comparison between the end-to-end recognition results of the SPTS and NPTS models.
Figure 10: Qualitative object detection results on the Pascal VOC 2012 validation set under single-point supervision.
REFERENCES
[1] Youngmin Baek, Bado Lee, Dongyoon Han, Sangdoo Yun, and Hwalsuk Lee. 2019. Character region awareness for text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 9365–9374.
[2] Christian Bartz, Haojin Yang, and Christoph Meinel. 2018. SEE: Towards semi-supervised end-to-end scene text recognition. In Proc. AAAI Conf. Artificial Intell. 6674–6681.
[3] Michal Busta, Lukas Neumann, and Jiri Matas. 2017. Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In Proc. IEEE Int. Conf. Comp. Vis. 2204–2212.
[4] Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. 2022. Pix2Seq: A language modeling framework for object detection. In Proc. Int. Conf. Learn. Represent.
[5] Chee Kheng Ch’ng and Chee Seng Chan. 2017. Total-Text: A comprehensive dataset for scene text detection and recognition. In Proc. Int. Conf. Doc. Anal. and Recognit., Vol. 1. IEEE, 935–942.
[6] Chee Kheng Chng, Yuliang Liu, Yipeng Sun, Chun Chet Ng, Canjie Luo, Zihan Ni, ChuanMing Fang, Shuaitao Zhang, Junyu Han, Errui Ding, et al. 2019. ICDAR2019 robust reading challenge on arbitrary-shaped text-RRC-ArT. In Proc. Int. Conf. Doc. Anal. and Recognit. 1571–1576.
[7] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. 2010. The pascal visual object classes (VOC) challenge. Int. J. Comput. Vis. 88, 2 (2010), 303–338.
[8] Wei Feng, Wenhao He, Fei Yin, Xu-Yao Zhang, and Cheng-Lin Liu. 2019. TextDragon: An End-to-End Framework for Arbitrary Shaped Text Spotting. In Proc. IEEE Int. Conf. Comp. Vis. 9076–9085.
[9] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 2315–2324.
[10] Tong He, Zhi Tian, Weilin Huang, Chunhua Shen, Yu Qiao, and Changming Sun. 2018. An end-to-end textspotter with explicit alignment and attention. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 5020–5029.
[11] Elad Hoffer, Tal Ben-Nun, Itay Hubara, Niv Giladi, Torsten Hoefler, and Daniel Soudry. 2020. Augment your batch: Improving generalization through instance repetition. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 8129–8138.
[12] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. 2017. WordSup: Exploiting word annotations for character based text detection. In Proc. IEEE Int. Conf. Comp. Vis. 4940–4949.
[13] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2016. Reading text in the wild with convolutional neural networks. Int. J. Comput. Vis. 116, 1 (2016), 1–20.
[14] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. 2015. ICDAR 2015 competition on robust reading. In Proc. Int. Conf. Doc. Anal. and Recognit. IEEE, 1156–1160.
[15] Dimosthenis Karatzas, S Robles Mestre, Joan Mas, Farshad Nourbakhsh, and P Pratim Roy. 2011. ICDAR 2011 robust reading competition-challenge 1: reading text in born-digital images (web and email). In Proc. Int. Conf. Doc. Anal. and Recognit. 1485–1490.
[16] Dimosthenis Karatzas, Faisal Shafait, Seiichi Uchida, Masakazu Iwamura, Lluis Gomez i Bigorda, Sergi Robles Mestre, Joan Mas, David Fernandez Mota, Jon Almazan Almazan, and Lluis Pere De Las Heras. 2013. ICDAR 2013 robust reading competition. In Proc. Int. Conf. Doc. Anal. and Recognit. IEEE, 1484–1493.
[17] Hui Li, Peng Wang, and Chunhua Shen. 2017. Towards end-to-end text spotting with convolutional recurrent neural networks. In Proc. IEEE Int. Conf. Comp. Vis. 5238–5246.
[18] Minghui Liao, Pengyuan Lyu, Minghang He, Cong Yao, Wenhao Wu, and Xiang Bai. 2019. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. IEEE Trans. Pattern Anal. Mach. Intell. (2019), 532–548.
[19] Minghui Liao, Guan Pang, Jing Huang, Tal Hassner, and Xiang Bai. 2020. Mask TextSpotter v3: Segmentation Proposal Network for Robust Scene Text Spotting. In Proc. Eur. Conf. Comp. Vis. 706–722.
[20] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. 2017. TextBoxes: A fast text detector with a single deep neural network. In Proc. AAAI Conf. Artificial Intell. 4161–4167.
[21] Xing Linjie, Tian Zhi, Huang Weilin, and R. Scott Matthew. 2019. Convolutional Character Networks. In Proc. IEEE Int. Conf. Comp. Vis. 9126–9136.
[22] Xuebo Liu, Ding Liang, Shi Yan, Dagui Chen, Yu Qiao, and Junjie Yan. 2018. FOTS: Fast oriented text spotting with a unified network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 5676–5685.
[23] Yuliang Liu, Hao Chen, Chunhua Shen, Tong He, Lianwen Jin, and Liangwei Wang. 2020. ABCNet: Real-time Scene Text Spotting with Adaptive Bezier-Curve Network. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 9809–9818.
[24] Yuliang Liu and Lianwen Jin. 2017. Deep matching prior network: Toward tighter multi-oriented text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 1962–1969.
[25] Yuliang Liu, Lianwen Jin, Shuaitao Zhang, Canjie Luo, and Sheng Zhang. 2019. Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recogn. 90 (2019), 337–345.
[26] Yuliang Liu, Chunhua Shen, Lianwen Jin, Tong He, Peng Chen, Chongyu Liu, and Hao Chen. 2021. ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting. IEEE Trans. Pattern Anal. Mach. Intell. (2021), 1–1. https://doi.org/10.1109/TPAMI.2021.3107437
[27] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2018. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Proc. Eur. Conf. Comp. Vis. 20–36.
[28] Ilya Loshchilov and Frank Hutter. 2018. Decoupled weight decay regularization. In Proc. Int. Conf. Learn. Represent.
[29] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Proc. Eur. Conf. Comp. Vis. 67–83.
[30] Nibal Nayef, Yash Patel, Michal Busta, Pinaki Nath Chowdhury, Dimosthenis Karatzas, Wafa Khlif, Jiri Matas, Umapada Pal, Jean-Christophe Burie, Cheng-lin Liu, et al. 2019. ICDAR 2019 Robust Reading Challenge on Multi-lingual Scene Text Detection and Recognition–RRC-MLT-2019. In Proc. Int. Conf. Doc. Anal. and Recognit. 1582–1587.
[31] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. 2017. ICDAR 2017 robust reading challenge on multi-lingual scene text detection and script identification-RRC-MLT. In Proc. Int. Conf. Doc. Anal. and Recognit., Vol. 1. IEEE, 1454–1459.
[32] Liang Qiao, Ying Chen, Zhanzhan Cheng, Yunlu Xu, Yi Niu, Shiliang Pu, and Fei Wu. 2021. MANGO: A Mask Attention Guided One-Stage Scene Text Spotter. In Proc. AAAI Conf. Artificial Intell. 2467–2476.
[33] Siyang Qin, Alessandro Bissacco, Michalis Raptis, Yasuhisa Fujii, and Ying Xiao. 2019. Towards unconstrained end-to-end text spotting. In Proc. IEEE Int. Conf. Comp. Vis. 4704–4714.
[34] Shangxuan Tian, Shijian Lu, and Chongshou Li. 2017. WeText: Scene text detection under weak supervision. In Proc. IEEE Int. Conf. Comp. Vis. 1492–1500.
[35] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. 2016. Detecting text in natural image with connectionist text proposal network. In Proc. Eur. Conf. Comp. Vis. Springer, 56–72.
[36] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In Proc. Advances in Neural Inf. Process. Syst., Vol. 30.
[37] Fangfang Wang, Yifeng Chen, Fei Wu, and Xi Li. 2020. TextRay: Contour-based Geometric Modeling for Arbitrary-shaped Scene Text Detection. In Proc. ACM Int. Conf. Multimedia. 111–119.
[38] Hao Wang, Pu Lu, Hui Zhang, Mingkun Yang, Xiang Bai, Yongchao Xu, Mengchao He, Yongpan Wang, and Wenyu Liu. 2020. All You Need Is Boundary: Toward Arbitrary-Shaped Text Spotting. In Proc. AAAI Conf. Artificial Intell. 12160–12167.
[39] Pengfei Wang, Chengquan Zhang, Fei Qi, Shanshan Liu, Xiaoqiang Zhang, Pengyuan Lyu, Junyu Han, Jingtuo Liu, Errui Ding, and Guangming Shi. 2021. PGNet: Real-time Arbitrarily-Shaped Text Spotting with Point Gathering Network. In Proc. AAAI Conf. Artificial Intell. 2782–2790.
[40] Wenhai Wang, Enze Xie, Xiang Li, Xuebo Liu, Ding Liang, Yang Zhibo, Tong Lu, and Chunhua Shen. 2021. PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text. IEEE Trans. Pattern Anal. Mach. Intell. (2021), 1–1. https://doi.org/10.1109/TPAMI.2021.3077555
[41] Ruibin Xiong, Yunchang Yang, Di He, Kai Zheng, Shuxin Zheng, Chen Xing, Huishuai Zhang, Yanyan Lan, Liwei Wang, and Tieyan Liu. 2020. On layer normalization in the Transformer architecture. In Proc. Int. Conf. Mach. Learn. 10524–10533.
[42] Cong Yao, Xiang Bai, Wenyu Liu, Yi Ma, and Zhuowen Tu. 2012. Detecting texts of arbitrary orientations in natural images. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. IEEE, 1083–1090.
[43] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: An efficient and accurate scene text detector. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 5551–5560.
[44] Yiqin Zhu, Jianyong Chen, Lingyu Liang, Zhanghui Kuang, Lianwen Jin, and Wayne Zhang. 2021. Fourier contour embedding for arbitrary-shaped text detection. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn. 3123–3131.