Abstract
With the development of deep neural networks, the demand for large amounts of annotated training data has become a performance bottleneck in many fields of research and application. Image synthesis, which can generate annotated images automatically and at no annotation cost, has therefore attracted increasing attention. In this paper, we propose to synthesize scene text images from 3D virtual worlds, which provide precise scene descriptions, editable illumination and visibility, and realistic physics. Unlike previous methods, which paste rendered text onto static 2D images, our method renders the 3D virtual scene and the text instances as a whole. In this way, real-world variations, including complex perspective transformations, varied illumination, and occlusions, can be realized in the synthesized scene text images. Moreover, the same text instances can be captured from various viewpoints by randomly moving and rotating the virtual camera, which acts as the human eye. Experiments on standard scene text detection benchmarks using the generated synthetic data demonstrate the effectiveness and superiority of the proposed method.
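The viewpoint idea in the abstract can be illustrated with a small geometric sketch. The code below is not the authors' rendering pipeline: it assumes a simple pinhole camera, and every name (`rotation`, `project`, `random_views`), the focal length, the image centre, and the sampling ranges are hypothetical choices made only for illustration. It shows how one 3D text quadrilateral yields a different 2D corner annotation for each randomly moved and rotated camera.

```python
import math
import random


def rotation(yaw, pitch):
    """World-to-camera rotation: yaw about the y axis, then pitch about the x axis."""
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    cos_p, sin_p = math.cos(pitch), math.sin(pitch)
    r_yaw = [[cos_y, 0.0, sin_y], [0.0, 1.0, 0.0], [-sin_y, 0.0, cos_y]]
    r_pitch = [[1.0, 0.0, 0.0], [0.0, cos_p, -sin_p], [0.0, sin_p, cos_p]]
    # Matrix product r_pitch @ r_yaw, written out with plain lists.
    return [[sum(r_pitch[i][k] * r_yaw[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def project(point, cam_pos, rot, focal=500.0, u0=320.0, v0=240.0):
    """Project a 3D world point to pixel coordinates; return None if it is behind the camera."""
    d = [point[i] - cam_pos[i] for i in range(3)]
    x, y, z = (sum(rot[i][k] * d[k] for k in range(3)) for i in range(3))
    if z <= 1e-6:
        return None
    return (focal * x / z + u0, focal * y / z + v0)


def random_views(quad, n_views, rng):
    """Sample camera poses (assumed ranges) and return the projected quad for each visible view."""
    views = []
    for _ in range(n_views):
        cam_pos = (rng.uniform(-1.0, 1.0), rng.uniform(-0.5, 0.5), 0.0)
        rot = rotation(rng.uniform(-0.3, 0.3), rng.uniform(-0.2, 0.2))
        corners = [project(p, cam_pos, rot) for p in quad]
        if all(c is not None for c in corners):
            views.append(corners)
    return views


# A hypothetical text region: a 2 x 0.5 rectangle placed 5 units in front of the origin.
text_quad = [(-1.0, -0.25, 5.0), (1.0, -0.25, 5.0), (1.0, 0.25, 5.0), (-1.0, 0.25, 5.0)]
views = random_views(text_quad, 5, random.Random(0))
```

Each entry of `views` is a list of four pixel coordinates, so the same text instance receives a distinct perspective-distorted annotation per viewpoint. A full renderer would additionally handle illumination and occlusion, which this sketch leaves out.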
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61733007). Xiang BAI was supported by the National Program for Support of Top-notch Young Professionals and the Program for HUST Academic Frontier Youth Team (Grant No. 2017QYTD08).
Cite this article
Liao, M., Song, B., Long, S. et al. SynthText3D: synthesizing scene text images from 3D virtual worlds. Sci. China Inf. Sci. 63, 120105 (2020). https://doi.org/10.1007/s11432-019-2737-0
DOI: https://doi.org/10.1007/s11432-019-2737-0