Abstract
The presence of non-text components in a document image degrades the output of an optical character recognition (OCR)-based document analysis system. Separating text from non-text has therefore become an essential task in document image processing. To address this, the present work develops a simple two-stage method for separating text and non-text components in images of handwritten scientific documents. Before the actual process begins, connected components are extracted from the document pages. In the first stage, some commonly occurring components are identified and separated out as graphics. In the second stage, the remaining components pass through feature extraction and subsequent classification. Evaluated on handwritten scientific document images, the system classifies 87.16% of components correctly as text or non-text.
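The pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the stage-one rules, the features, or the classifier, so the aspect-ratio rule (`looks_like_long_line`, `min_aspect`) and the caller-supplied `classify` function are hypothetical stand-ins. The input is assumed to be an already binarized page, with 1 marking foreground pixels.

```python
from collections import deque


def connected_components(image):
    """Return the 8-connected foreground components as lists of (row, col) pixels."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels):
    """Return (top, left, bottom, right) of a component."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)


def looks_like_long_line(pixels, min_aspect=5.0):
    """Hypothetical stage-one rule: very elongated components count as graphics."""
    top, left, bottom, right = bounding_box(pixels)
    height = bottom - top + 1
    width = right - left + 1
    return max(height, width) / min(height, width) >= min_aspect


def separate(image, classify):
    """Stage one filters by rule; stage two defers to a caller-supplied classifier."""
    labels = []
    for pixels in connected_components(image):
        if looks_like_long_line(pixels):
            labels.append("non-text")
        else:
            labels.append(classify(pixels))
    return labels


# Toy page: a horizontal rule on the first row and a small blob below it.
page = [
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
]
# Placeholder classifier (an assumption) that labels every remaining component as text.
labels = separate(page, lambda pixels: "text")
```

Running the cheap rule first means the classifier only sees components that the rule could not decide, which is the division of labour the two stages suggest. In a real system `classify` would compute features from each component and apply a trained model.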
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Bhowmik, S., Kundu, S., De, B.K., Sarkar, R., Nasipuri, M. (2019). A Two-Stage Approach for Text and Non-text Separation from Handwritten Scientific Document Images. In: Chandra, P., Giri, D., Li, F., Kar, S., Jana, D. (eds) Information Technology and Applied Mathematics. Advances in Intelligent Systems and Computing, vol 699. Springer, Singapore. https://doi.org/10.1007/978-981-10-7590-2_3
DOI: https://doi.org/10.1007/978-981-10-7590-2_3
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7589-6
Online ISBN: 978-981-10-7590-2
eBook Packages: Intelligent Technologies and Robotics (R0)