Figure 3: The pipeline of the proposed method: 1) extract features from the input image and learn the TCL, TBO, TCO, and TVO maps as a multi-task problem; 2) obtain instance segmentation with the Text Instance Segmentation Module, where the mechanism of point-to-quad assignment is illustrated in (c); 3) restore the polygonal representation of text instances of arbitrary shapes.
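The point-to-quad assignment named in the caption can be pictured with a small numerical sketch. The snippet below is only an illustration under assumed names and shapes (assign_points_to_quads, a binarized TCL mask, a per-pixel TCO offset map, and a list of candidate quad centers); it is not the paper's implementation, but it shows the basic idea of letting every center-region pixel vote for an instance center and attaching the pixel to the closest candidate quad.

```python
import numpy as np

def assign_points_to_quads(tcl_mask, tco_map, quad_centers):
    """Toy point-to-quad assignment (illustration only, not the paper's code).

    tcl_mask:     (H, W) bool, pixels predicted as text center region.
    tco_map:      (H, W, 2) float, predicted (dy, dx) offset from each pixel
                  to the center of the text instance it belongs to.
    quad_centers: (N, 2) float, centers of N candidate text quads
                  (e.g., recovered from the TVO map).
    Returns an (H, W) int map: quad index per TCL pixel, -1 elsewhere.
    """
    assignment = np.full(tcl_mask.shape, -1, dtype=np.int32)
    ys, xs = np.nonzero(tcl_mask)
    if len(ys) == 0 or len(quad_centers) == 0:
        return assignment
    # Each TCL pixel votes for an instance center: its own location plus its TCO offset.
    voted_centers = np.stack([ys, xs], axis=1) + tco_map[ys, xs]
    # Assign every pixel to the nearest candidate quad center.
    dists = np.linalg.norm(
        voted_centers[:, None, :] - quad_centers[None, :, :], axis=2)
    assignment[ys, xs] = np.argmin(dists, axis=1)
    return assignment

# Tiny example: two single-pixel "instances" whose TCO vectors point at
# candidate centers located at (y=2, x=2) and (y=2, x=7).
tcl = np.zeros((5, 10), dtype=bool)
tcl[2, 1] = tcl[2, 8] = True
tco = np.zeros((5, 10, 2), dtype=np.float32)
tco[2, 1] = (0.0, 1.0)   # votes for (2, 2)
tco[2, 8] = (0.0, -1.0)  # votes for (2, 7)
centers = np.array([[2.0, 2.0], [2.0, 7.0]])
print(assign_points_to_quads(tcl, tco, centers)[2])  # row 2: [-1  0 ... 1 -1]
```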
[Figure: structure of the Context Attention Block (CAB): 1×1 convolutions, reshape/transpose operations, matrix multiplication, and concatenation transform the input feature X into X′ and X′′.]
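To make the block's data flow concrete, here is a minimal self-attention-style sketch in plain numpy that is consistent with the components visible in the figure (1×1 convolutions, reshape and transpose, matrix multiplication, and a final concatenation of the attended features with the input). The function names, shapes, and the specific attention formulation are assumptions for illustration, not the exact CAB design.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution as a per-pixel linear map: (C_in, H, W) -> (C_out, H, W)."""
    c_in, h, w = x.shape
    return (weight @ x.reshape(c_in, h * w)).reshape(-1, h, w)

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def context_attention_block(x, wq, wk, wv):
    """Self-attention-style context block (illustrative sketch, not the exact CAB).

    x: (C, H, W) input feature map; wq, wk, wv: weights of three 1x1 convolutions.
    Returns the input concatenated with its context-attended version.
    """
    c, h, w = x.shape
    q = conv1x1(x, wq).reshape(-1, h * w)          # (Cq, HW)
    k = conv1x1(x, wk).reshape(-1, h * w)          # (Cq, HW)
    v = conv1x1(x, wv).reshape(-1, h * w)          # (C,  HW)
    attn = softmax(q.T @ k / np.sqrt(q.shape[0]))  # (HW, HW) pairwise weights
    context = (v @ attn.T).reshape(c, h, w)        # aggregate features by attention
    return np.concatenate([x, context], axis=0)    # concat(X, attended X)

# Example with random weights on a 16-channel 8x8 feature map.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8, 8)).astype(np.float32)
wq = rng.standard_normal((4, 16)) * 0.1
wk = rng.standard_normal((4, 16)) * 0.1
wv = rng.standard_normal((16, 16)) * 0.1
print(context_attention_block(x, wq, wk, wv).shape)  # (32, 8, 8)
```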
Figure 5: Label Generation: (a) the text center region of a curved text is annotated in red; (b) the generation of the TBO map; the four vertices (red stars in (c)) and the center point (red star in (d)) of the bounding box, to which the TVO and TCO maps refer.
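The geometry in this caption translates directly into per-pixel regression targets. The sketch below is an assumed, simplified label-generation routine: for every pixel inside a text center region it stores the offsets to the four vertices of the instance's bounding quad (an 8-channel, TVO-style target) and to the quad's center (a 2-channel, TCO-style target). The function name make_tvo_tco_labels and the array layout are illustrative, not the paper's code.

```python
import numpy as np

def make_tvo_tco_labels(hw, center_region_mask, quad_vertices):
    """Build per-pixel vertex/center offset targets for one text instance.

    hw:                 (H, W) output map size.
    center_region_mask: (H, W) bool, text center region of the instance.
    quad_vertices:      (4, 2) float, (x, y) of the bounding quad's vertices.
    Returns (tvo, tco): (H, W, 8) offsets to the 4 vertices and
                        (H, W, 2) offsets to the quad center.
    """
    h, w = hw
    tvo = np.zeros((h, w, 8), dtype=np.float32)
    tco = np.zeros((h, w, 2), dtype=np.float32)
    center = quad_vertices.mean(axis=0)                        # center of the bounding quad
    ys, xs = np.nonzero(center_region_mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float32)     # (N, 2) pixel coords as (x, y)
    # Offsets from each center-region pixel to the 4 vertices, flattened to 8 channels.
    tvo[ys, xs] = (quad_vertices[None, :, :] - pixels[:, None, :]).reshape(-1, 8)
    # Offsets from each center-region pixel to the quad center.
    tco[ys, xs] = center[None, :] - pixels
    return tvo, tco

# Tiny example: a 3-pixel center region inside a 10x10 map.
mask = np.zeros((10, 10), dtype=bool)
mask[5, 3:6] = True
quad = np.array([[2.0, 3.0], [8.0, 3.0], [8.0, 7.0], [2.0, 7.0]])  # (x, y) vertices
tvo, tco = make_tvo_tco_labels((10, 10), mask, quad)
print(tco[5, 4])  # offset from pixel (x=4, y=5) to the quad center (5.0, 5.0)
```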
Table 2: Ablation study for the trade-off between speed and accuracy on SCUT-CTW1500.
Method | Recall | Precision | Hmean | T (ms)
TCL + CC + expand | 55.51 | 61.55 | 58.37 | –
TCL + CC + TBO | 74.65 | 83.13 | 78.66 | 5.84
TCL + TVO + TCO + TBO | 76.69 | 83.89 | 80.12 | 6.26

Table 3: Ablation study for the effectiveness of CABs on SCUT-CTW1500.
Method | Recall | Precision | Hmean | FPS
1s | 76.34 | 86.23 | 80.98 | 10.54
2s | 70.86 | 86.30 | 77.82 | 21.68
4s | 76.69 | 83.86 | 80.12 | 30.21
8s | 71.54 | 80.67 | 75.83 | 38.05

Table 4: Evaluation on SCUT-CTW1500.
Method | Recall | Precision | Hmean | FPS
CTPN [32] | 53.80 | 60.40 | 56.90 | –
EAST [47] | 49.10 | 78.70 | 60.40 | –
DMPNet [21] | 56.00 | 69.90 | 62.20 | –
CTD [42] | 65.20 | 74.30 | 69.50 | 15.20
CTD + TLOC [42] | 69.80 | 74.30 | 73.40 | 13.30
SLPR [48] | 70.10 | 80.10 | 74.80 | –
TextSnake [23] | 85.30 | 67.90 | 75.60 | 12.07
PSENet-4s [35] | 78.13 | 85.49 | 79.29 | 8.40
PSENet-2s [35] | 79.30 | 81.95 | 80.60 | –
PSENet-1s [35] | 79.89 | 82.50 | 81.17 | 3.90
TextField [38] | 79.80 | 83.00 | 81.40 | –
SAST | 77.05 | 85.31 | 80.97 | 27.63
SAST MS | 81.71 | 81.19 | 81.45 | –

Table 5: Evaluation on Total-Text for detecting text lines of arbitrary shapes.
To evaluate the efficiency of the proposed method, we run testing on SCUT-CTW1500 with a workstation equipped with an NVIDIA TITAN Xp. The test image is resized to 512 × 512 and the batch size is set to 1 on a single GPU. Network inference and post-processing take 29.58 ms and 6.61 ms respectively; the post-processing is written in Python¹ and can be further optimized. SAST runs at 27.63 FPS with an Hmean of 80.97%, surpassing most of the existing arbitrarily-shaped text detectors in both accuracy and efficiency², as depicted in Tab. 4.

¹ The NMS part is written in C++.
² The speeds of different methods are listed for reference only, as they may have been evaluated in different hardware environments.

Table 6: Evaluation on ICDAR 2015 for detecting oriented text.
Method | Recall | Precision | Hmean
DMPNet [21] | 68.22 | 73.23 | 70.64
SegLink [30] | 76.50 | 74.74 | 75.61
SSTD [9] | 73.86 | 80.23 | 76.91
WordSup [11] | 77.03 | 79.33 | 78.16
RRPN [26] | 77.13 | 83.52 | 80.20
EAST [47] | 78.33 | 83.27 | 80.72
He et al. [10] | 80.00 | 82.00 | 81.00
TextField [38] | 80.50 | 84.30 | 82.40
TextSnake [23] | 80.40 | 84.90 | 82.60
PixelLink [2] | 82.00 | 85.50 | 83.70
RRD [18] | 80.00 | 88.00 | 83.80
Lyu et al. [25] | 79.70 | 89.50 | 84.30
PSENet-4s [35] | 83.87 | 87.98 | 85.88
IncepText [39] | 84.30 | 89.40 | 86.80
PSENet-1s [35] | 85.51 | 88.71 | 87.08
PSENet-2s [35] | 85.22 | 89.30 | 87.21
SAST | 87.09 | 86.72 | 86.91
SAST MS | 87.34 | 87.55 | 87.44

Table 7: Evaluation on ICDAR2017-MLT for the generalization ability of SAST on multilingual scene text detection.
Method | Recall | Precision | Hmean
Lyu et al. [25] | 56.60 | 83.80 | 66.80
AF-RPN [46] | 66.00 | 75.00 | 70.00
PSENet-4s [35] | 67.56 | 75.98 | 71.52
PSENet-2s [35] | 68.35 | 76.97 | 72.40
Lyu et al. MS [25] | 70.60 | 74.30 | 72.40
PSENet-1s [35] | 68.40 | 77.01 | 72.45
SAST | 67.56 | 70.00 | 68.76
SAST MS | 66.53 | 79.35 | 72.37

5 CONCLUSION AND FUTURE WORK
In this paper, we propose an efficient single-shot arbitrarily-shaped text detector together with Context Attention Blocks and a mechanism of point-to-quad assignment, which integrates both high-level object knowledge and low-level pixel information to obtain text instances from a context-enhanced segmentation. Several experiments demonstrate that the proposed SAST is effective in detecting arbitrarily-shaped text and is also robust in generalizing to multilingual scene text datasets. Qualitative results show that SAST helps to alleviate some common challenges of segmentation-based text detectors, such as fragmented detections and the separation of adjacent text instances. Moreover, with a commonly used GPU, SAST runs fast and may be sufficient for some real-time applications, e.g., augmented reality translation. However, it is difficult for SAST to detect some extreme cases, mainly very small text regions. In the future, we are interested in improving the detection of small text and developing an end-to-end text reading system for text of arbitrary shapes.

ACKNOWLEDGMENTS
This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grant 61572387, Grant 61632019, Grant 61836008, and Grant 61672404, and by the Foundation for Innovative Research Groups of the NSFC under Grant 61621005.
REFERENCES
[1] Chee Kheng Ch’ng and Chee Seng Chan. 2017. Total-Text: A comprehensive dataset for scene text detection and recognition. In Int. Conf. Doc. Anal. Recognit. (ICDAR), Vol. 1. IEEE, 935–942.
[2] Dan Deng, Haifeng Liu, Xuelong Li, and Deng Cai. 2018. PixelLink: Detecting scene text via instance segmentation. In Proc. AAAI Conf. Artif. Intell. (AAAI).
[3] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). IEEE, 248–255.
[4] Alireza Fathi, Zbigniew Wojna, Vivek Rathod, Peng Wang, Hyun Oh Song, Sergio Guadarrama, and Kevin P Murphy. 2017. Semantic instance segmentation via deep metric learning. arXiv:1703.10277
[5] R. Girshick. 2015. Fast R-CNN. In IEEE Int. Conf. Comp. Vis. (ICCV). 1440–1448.
[6] Ankush Gupta, Andrea Vedaldi, and Andrew Zisserman. 2016. Synthetic data for text localisation in natural images. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2315–2324.
[7] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2961–2969.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 770–778.
[9] Pan He, Weilin Huang, Tong He, Qile Zhu, Yu Qiao, and Xiaolin Li. 2017. Single shot text detector with regional attention. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 3047–3055.
[10] Wenhao He, Xu-Yao Zhang, Fei Yin, and Cheng-Lin Liu. 2017. Deep direct regression for multi-oriented scene text detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 745–753.
[11] Han Hu, Chengquan Zhang, Yuxuan Luo, Yuzhuo Wang, Junyu Han, and Errui Ding. 2017. WordSup: Exploiting word annotations for character based text detection. In IEEE Int. Conf. Comp. Vis. (ICCV). 4950–4959.
[12] Lichao Huang, Yi Yang, Yafeng Deng, and Yinan Yu. 2015. DenseBox: Unifying landmark localization with end to end object detection. arXiv:1509.04874
[13] Zilong Huang, Xinggang Wang, Lichao Huang, Chang Huang, Yunchao Wei, and Wenyu Liu. 2018. CCNet: Criss-cross attention for semantic segmentation. arXiv:1811.11721
[14] Zhida Huang, Zhuoyao Zhong, Lei Sun, and Qiang Huo. 2019. Mask R-CNN with pyramid attention network for scene text detection. In Winter Conf. Appl. Comp. Vis. (WACV). IEEE, 764–772.
[15] Dimosthenis Karatzas, Lluis Gomez-Bigorda, Anguelos Nicolaou, Suman Ghosh, Andrew Bagdanov, Masakazu Iwamura, Jiri Matas, Lukas Neumann, Vijay Ramaseshan Chandrasekhar, Shijian Lu, et al. 2015. ICDAR 2015 competition on robust reading. In Int. Conf. Doc. Anal. Recognit. (ICDAR). IEEE, 1156–1160.
[16] Alexander Kirillov, Evgeny Levinkov, Bjoern Andres, Bogdan Savchynskyy, and Carsten Rother. 2017. InstanceCut: From edges to instances with multicut. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). IEEE, 7322–7331.
[17] Minghui Liao, Baoguang Shi, Xiang Bai, Xinggang Wang, and Wenyu Liu. 2017. TextBoxes: A fast text detector with a single deep neural network. In Proc. AAAI Conf. Artif. Intell. (AAAI). 4161–4167.
[18] Minghui Liao, Zhen Zhu, Baoguang Shi, Gui-song Xia, and Xiang Bai. 2018. Rotation-sensitive regression for oriented scene text detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 5909–5918.
[19] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2017. Feature pyramid networks for object detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2117–2125.
[20] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C Berg. 2016. SSD: Single shot multibox detector. In Eur. Conf. Comp. Vis. (ECCV). Springer, 21–37.
[21] Yuliang Liu and Lianwen Jin. 2017. Deep matching prior network: Toward tighter multi-oriented text detection. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 1962–1969.
[22] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. 2018. Affinity derivation and graph merge for instance segmentation. In Eur. Conf. Comp. Vis. (ECCV). 686–703.
[23] Shangbang Long, Jiaqiang Ruan, Wenjie Zhang, Xin He, Wenhao Wu, and Cong Yao. 2018. TextSnake: A flexible representation for detecting text of arbitrary shapes. In Eur. Conf. Comp. Vis. (ECCV). 20–36.
[24] Pengyuan Lyu, Minghui Liao, Cong Yao, Wenhao Wu, and Xiang Bai. 2018. Mask TextSpotter: An end-to-end trainable neural network for spotting text with arbitrary shapes. In Eur. Conf. Comp. Vis. (ECCV). 67–83.
[25] Pengyuan Lyu, Cong Yao, Wenhao Wu, Shuicheng Yan, and Xiang Bai. 2018. Multi-oriented scene text detection via corner localization and region segmentation. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 7553–7563.
[26] Jianqi Ma, Weiyuan Shao, Hao Ye, Li Wang, Hong Wang, Yingbin Zheng, and Xiangyang Xue. 2018. Arbitrary-oriented scene text detection via rotation proposals. IEEE Trans. Multimedia 20, 11 (2018), 3111–3122.
[27] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. 2016. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 4th Int. Conf. 3D Vision (3DV). IEEE, 565–571.
[28] Nibal Nayef, Fei Yin, Imen Bizid, Hyunsoo Choi, Yuan Feng, Dimosthenis Karatzas, Zhenbo Luo, Umapada Pal, Christophe Rigaud, Joseph Chazalon, et al. 2017. ICDAR2017 robust reading challenge on multi-lingual scene text detection and script identification – RRC-MLT. In Int. Conf. Doc. Anal. Recognit. (ICDAR), Vol. 1. IEEE, 1454–1459.
[29] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In Adv. Neural Inf. Process. Syst. (NIPS). 91–99.
[30] Baoguang Shi, Xiang Bai, and Serge Belongie. 2017. Detecting oriented text in natural images by linking segments. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 2550–2558.
[31] Bharat Singh and Larry S Davis. 2018. An analysis of scale invariance in object detection – SNIP. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 3578–3587.
[32] Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao. 2016. Detecting text in natural image with connectionist text proposal network. In Eur. Conf. Comp. Vis. (ECCV). Springer, 56–72.
[33] Jonas Uhrig, Eike Rehder, Björn Fröhlich, Uwe Franke, and Thomas Brox. 2018. Box2Pix: Single-shot instance segmentation by assigning pixels to object boxes. In IEEE Intell. Veh. Symp. (IV). IEEE, 292–299.
[34] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Adv. Neural Inf. Process. Syst. (NIPS). 5998–6008.
[35] Wenhai Wang, Enze Xie, Xiang Li, Wenbo Hou, Tong Lu, Gang Yu, and Shuai Shao. 2019. Shape robust text detection with progressive scale expansion network. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 9336–9345.
[36] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. 2018. Non-local neural networks. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 7794–7803.
[37] Yue Wu and Prem Natarajan. 2017. Self-organized text detection with minimal post-processing via border learning. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 5000–5009.
[38] Yongchao Xu, Yukang Wang, Wei Zhou, Yongpan Wang, Zhibo Yang, and Xiang Bai. 2019. TextField: Learning a deep direction field for irregular scene text detection. IEEE Trans. Image Process. (2019). arXiv:1812.01393
[39] Qiangpeng Yang, Mengli Cheng, Wenmeng Zhou, Yan Chen, Minghui Qiu, and Wei Lin. 2018. IncepText: A new inception-text module with deformable PSROI pooling for multi-oriented scene text detection. In Int. Joint Conf. Artif. Intell. (IJCAI). IJCAI, 1071–1077.
[40] Qixiang Ye and David Doermann. 2015. Text detection and recognition in imagery: A survey. IEEE Trans. Pattern Anal. Mach. Intell. 37, 7 (2015), 1480–1500.
[41] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang. 2018. Learning a discriminative feature network for semantic segmentation. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). IEEE, 1857–1866.
[42] Liu Yuliang, Jin Lianwen, Zhang Shuaitao, and Zhang Sheng. 2017. Detecting curve text in the wild: New dataset and new solution. arXiv:1712.02170
[43] Chengquan Zhang, Borong Liang, Zuming Huang, Mengyi En, Junyu Han, Errui Ding, and Xinghao Ding. 2019. Look More Than Once: An accurate detector for text of arbitrary shapes. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR).
[44] Zheng Zhang, Chengquan Zhang, Wei Shen, Cong Yao, Wenyu Liu, and Xiang Bai. 2016. Multi-oriented text detection with fully convolutional networks. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 4159–4167.
[45] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, and Philip HS Torr. 2015. Conditional random fields as recurrent neural networks. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 1529–1537.
[46] Zhuoyao Zhong, Lei Sun, and Qiang Huo. 2018. An anchor-free region proposal network for Faster R-CNN based text detection approaches. arXiv:1804.09003
[47] Xinyu Zhou, Cong Yao, He Wen, Yuzhi Wang, Shuchang Zhou, Weiran He, and Jiajun Liang. 2017. EAST: An efficient and accurate scene text detector. In IEEE Conf. Comp. Vis. Patt. Recognit. (CVPR). 5551–5560.
[48] Yixing Zhu and Jun Du. 2018. Sliding line point regression for shape robust scene text detection. In Int. Conf. Pattern Recognit. (ICPR). 3735–3740.
[49] Yingying Zhu, Cong Yao, and Xiang Bai. 2016. Scene text detection and recognition: Recent advances and future trends. Frontiers of Computer Science 10, 1 (2016), 19–36.