Document Image Classification using Deep Learning
Dr. D. Balakrishnan, Assistant Professor, Department of Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar (d.balakrishnan@klu.ac.in)
Tharun Narra (99220042096), Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil (99220042096@klu.ac.in)
Harsha Vardhan Nalleboina (99220042099), Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil (99220042099@klu.ac.in)
Devi Prasad Ponnapula (99220042100), Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil (99220042100@klu.ac.in)
Rajendra Reddy Bijjam (99220042130), Computer Science and Engineering, Kalasalingam Academy of Research and Education, Krishnankoil (99220042130@klu.ac.in)
Abstract: This project develops a CNN-based classifier that sorts scanned TIFF document images into sixteen types, including invoices, resumes, letters, and scientific papers. The dataset is first cleaned by removing non-TIFF files and corrupted images so that only good-quality data is used for training. The preprocessing pipeline then resizes every image to a standard size (224x224 pixels) and rescales the pixel values to improve model performance. The model is built with TensorFlow and Keras: several convolutional and pooling layers extract features, and subsequent dense layers classify the document. An 80-20 train-validation split is used, with categorical cross-entropy as the loss and the Adam optimizer. Model performance is monitored over ten epochs, and plots of training and validation accuracy are used to judge how well the model learns. Finally, a Gradio-based interface lets a user upload a TIFF document image and returns a prediction along with a confidence score. The interface exposes the trained model for real-time classification in a web format, putting it within reach of practical applications. The project demonstrates a simple, systematic approach to automated document classification that is potentially applicable to digital archiving and document management, both of which require fast and accurate classification.

Keywords: Deep learning, Document classification, Convolutional neural networks, Image preprocessing, TIFF images, TensorFlow, Keras, Gradio interface, Model evaluation, Image classification

I. INTRODUCTION

Prior work proposes a document classification framework based on deep transfer learning and feature reduction. The study combines pre-trained models (e.g. DenseNet121, VGG19) with classifiers such as logistic regression and k-nearest neighbors. PCA and LDA are used to find the best trade-off between accuracy and efficiency, with results such as 97.83% accuracy for the DenseNet121-LDA-LR combination on 546 images. The authors recommend the approach for inclusion in ERP systems, where it can support document processing tasks such as OCR. [1]

A multimodal deep learning model for classifying digitized documents uses CNNs for image features and RNNs for text features. Because the hybrid model performs at least as well as single-modality models, the approach shows promise for financial, healthcare, and legal applications. It reaches an accuracy of 94.84% on a test set of 9,125 documents in seven categories. [2]

Another study outlines a framework for understanding unstructured financial documents that combines robotic process automation (RPA) with a multimodal approach. RPA is used alongside a pre-trained deep learning model for classification and key information extraction (KIE) of multilingual documents. Document-understanding models such as LayoutXLM are applied to the tasks where they are most relevant, here in banking. The framework was shown to significantly improve KIE accuracy, especially under key-value labeling, and to reduce business processing time by more than 30% in some applications. [3]
Document classification remains a long-standing problem in computer science. The paper "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization" offers a multimodal solution: CNN-based deep learning architectures are trained on digitized documents, fusing information from the text and image modalities to improve accuracy. Textual information is classified by an RNN, while image processing is performed by a CNN, and the two results are combined in a fusion layer for classification. The experimental results indicate that this multimodal approach significantly outperformed single-modality methods, with higher accuracy on the documents considered, which come from industries such as finance, healthcare, and law. The model could be applied in automated document management systems, where it promises to improve both the efficiency and the accuracy of document classification. [4]

The paper "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction" presents a framework that classifies document images using deep transfer learning and feature reduction. Combining pre-trained deep learning models such as DenseNet121 and VGG19 with machine learning classifiers gave up to 97.83% accuracy on the small dataset of scanned documents used in the study. The framework includes dimensionality reduction techniques such as PCA and LDA to increase processing speed without degrading performance. It has potential applications in enterprise resource planning systems, especially for document management and preprocessing in OCR systems. [5]

The paper "A Document Image Classification System Fusing Deep and Machine Learning Models" studies an array of document classification methods, with particular attention to digitization in university document management systems. The techniques are based on OCR and deep learning, in some cases assisted by machine learning-based ensemble methods, and a fusion model of EfficientNetB3 and ExtraTree classifiers achieves a 94.45% F-score. The analysis shows that combining content-based features with image-based features contributes greatly to classification accuracy. The work addresses the needs of document management and aims to minimize manual document verification. [6]

ALGORITHMS USED:

The implementation uses a Convolutional Neural Network (CNN), a deep learning architecture that is particularly effective for image classification because it captures hierarchies of spatial information within images. The CNN model is implemented with TensorFlow and Keras and involves several kinds of layers: convolutional, pooling, flattening, and dense.

In the initial convolutional layers, the model applies filters to the input image to extract low-level features such as edges and textures. The deeper layers capture the more complex patterns needed to distinguish between resumes, invoices, scientific reports, and other document types. Each convolutional layer is immediately followed by a pooling layer, which reduces the spatial dimensions of the feature maps. This reduces computation and also helps limit overfitting. After three convolution-pooling blocks, the features are flattened into a single vector and passed to dense layers, ending in a softmax output layer with 16 nodes, each representing one of the document classes under consideration.

Categorical cross-entropy is used as the loss function because this is a multi-class classification problem, and the Adam optimizer is used because it adjusts learning rates dynamically during training. Training consists of feeding batches of images through the network and updating the weights to minimize the error between the predicted classes and the true classes. The validation data is used to monitor how well the model generalizes. Finally, a Gradio interface uses the trained CNN to let users upload images and receive predictions, serving as a simple practical application of document classification.

II. LITERATURE SURVEY

The paper addresses document image classification using deep convolutional neural networks (DCNNs) with intra-domain transfer learning and stacked generalization. It uses VGG16-based DCNNs whose weights are transferred from ImageNet to document images, and it divides each document into regions, with a separate model trained for each region. By stacking the predictions of these region-specific models into hybrid models, an accuracy of 92.21% was achieved on the RVL-CDIP dataset, a new benchmark result at the time. The approach improves training efficiency through inter-domain and intra-domain transfer learning in document image classification. [7]

This work is an extensive study of the feasibility of using deep CNNs for document image classification and retrieval. It compares CNN-based features with the traditional handcrafted features used for these purposes and establishes the superiority of the former. The experiments show that CNNs pre-trained on non-document images transfer well to document classification and that, given enough training data, region-specific CNN models are no longer necessary. The best accuracy on large document datasets was achieved by holistic CNN approaches. [8]

The article introduces a multimodal neural network model for document classification that integrates text and image modalities. It uses visual features extracted from images with MobileNetV2 and textual content obtained through Tesseract OCR and represented with FastText embeddings. On the RVL-CDIP and Tobacco3482 datasets, the model improves on single-modality baselines by about 3% in classification accuracy. The integration of text and image modalities is shown to enable finer-grained document classification even when the text produced by OCR is imperfect. The paper concludes with a discussion of the practical applications of multimodal learning in document processing. [9]

In this work, the authors explore CNNs for document image classification, adapting the CNN architecture and data augmentation methods to document-specific features rather than to general image datasets. The findings show that performance on the RVL-CDIP dataset improves with shear transformations and larger input images. The authors also study CNN design parameters such as depth, width, and input size, and illustrate how CNNs can learn region-wise layout features for document classification. State-of-the-art results on RVL-CDIP were achieved by tuning the CNN architecture and data preprocessing. [10]

III. EXISTING SYSTEM

1. AlexNet
Description: AlexNet brought CNNs to prominence when it won the ImageNet 2012 challenge. It consists of five convolutional layers and three fully connected layers, with progressively deeper layers performing feature extraction.
Application: Commonly used as a baseline for document classification, since it handles image data adequately.
Limitations: The input size is fixed at 227x227, so it tends to perform poorly on documents with highly complex layouts.

2. VGGNet
Description: Networks 16 to 19 layers deep that rely mainly on small (3x3) convolutions. They are effective at capturing fine-grained features.
Application: Mostly used for large scans of detailed documents, where small nuances such as font details or layout need to be captured.
Limitations: The deep architecture is computationally expensive, and it lacks later innovations such as residual learning that ease training instability.

3. ResNet (Residual Networks)
Description: Uses residual connections to simplify the training of very deep architectures such as ResNet-50 and ResNet-152, which capture features of increased complexity and abstraction.
Application: A strong choice for mixed, heterogeneous document datasets, because its deep, feature-rich layers let the network represent more complex document structures.
Limitations: Computational demands are high, and very large datasets may be needed to avoid overfitting.

4. Inception Networks (GoogLeNet, InceptionV3)
Description: Uses "inception modules" that run parallel convolutions of different sizes, including 1x1, 3x3, and 5x5, to capture features at multiple scales.
Application: Helps with document images that have diverse layouts and contain elements such as varying font sizes and mixed content.
Limitations: The architecture is highly complex, which makes it hard to engineer and to fine-tune appropriately.

5. EfficientNet
Description: EfficientNet is notable for balancing accuracy against computational cost. It systematically scales the depth, width, and resolution of a network.
Application: Suited to document imaging tasks in which high accuracy must be achieved with efficient use of resources.
Limitations: EfficientNet requires very careful scaling to reach its peak performance, so it may demand extra customization, with attention to fine details, for document-specific tasks.

IV. PROPOSED SYSTEM

The TIFF document image classification system categorizes images into 16 classes that cover a broad spectrum of document types, such as invoices, resumes, and letters. The complete workflow, comprising data cleaning and preprocessing modules, model training, and Gradio-based deployment, allows classification to be carried out accurately with minimal user intervention. The system consists of the following modules:

1. Data Cleaning Module:
The rules for cleaning the TIFF-only dataset are:
I. elimination of all non-TIFF files;
II. an integrity check of all TIFF files, ensuring that every document is readable;
III. removal of all corrupted files and files that could not be opened, so that only validated, good-quality data is used for training.

2. Data Preprocessing Module:
I. resizing all images to 224x224 pixels, a standard input size for CNNs;
II. normalizing the pixel values by rescaling them into the range 0 to 1;
III. splitting the data 80%-20% into training and validation sets to support evaluation.

3. Deep Learning Model:
The CNN consists of three convolutional layers that extract features from the documents in consecutive stages, followed by a flattening layer that connects them to an output layer with 16 class outputs.

4. Training and Evaluation Module:
The model was trained with categorical cross-entropy loss and the Adam optimizer, to balance learning speed against final model performance.

5. Gradio's Prediction Interface:
This provides an easily accessible web interface through which users can upload TIFF images for classification. It returns the predicted document type together with a confidence score in near real time.

Fig. 1.1: Flowchart

V. METHODOLOGY

The paper describes a CNN-based document classification scheme that works on a dataset of TIFF images covering sixteen types of documents, including invoices, resumes, and letters. After the dataset was prepared using several techniques, the work moved on to model building and then to deployment behind a user-friendly interface.

1. Data Preprocessing: The first step filters out all non-TIFF files and corrupted images to obtain clean data for the task. The second step resizes all images to 224x224 pixels and normalizes their pixel values. The last step splits the data 80-20 into training and validation sets, which are used to assess the model.

2. Model Architecture: The CNN consists of several layers. Convolutional layers extract features such as edges and layouts from the document images; the pooling layers then reduce the spatial dimensions, which mitigates overfitting and cuts the computational load. Finally, fully connected dense layers take the feature vectors from the previous layers and feed a softmax output layer with sixteen nodes.

3. Training and Validation: Categorical cross-entropy loss is used with the Adam optimizer, whose adaptive learning rate improves convergence. Training and validation accuracy were monitored over ten epochs so that the model's performance could be evaluated.

4. Deployment: Gradio provides direct access to the model, so that uploaded images are classified in real time and returned with confidence scores. This enables near-instant classification and ties the model more closely to the requirements of its real application. The result is a practical document classification system whose accuracy and ease of use make it suitable for document management and archiving applications.
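The data-cleaning rules in Section IV (drop non-TIFF files, drop unreadable files) could be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the function names `looks_like_tiff` and `clean_dataset` are hypothetical, and checking the 4-byte header is an assumed, cheap stand-in for the integrity check. A full check of corrupted image data would need an image library to decode the file.

```python
from pathlib import Path

# Classic TIFF headers: a byte-order mark ("II" little-endian or "MM"
# big-endian) followed by the magic number 42 in that byte order.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")


def looks_like_tiff(path):
    """Return True if the file can be opened and starts with a TIFF header."""
    try:
        with open(path, "rb") as handle:
            return handle.read(4) in TIFF_MAGIC
    except OSError:
        # Files that cannot be opened are treated as corrupted.
        return False


def clean_dataset(root):
    """Split the files under root into (kept, rejected) lists of paths."""
    kept, rejected = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        has_tiff_suffix = path.suffix.lower() in {".tif", ".tiff"}
        if has_tiff_suffix and looks_like_tiff(path):
            kept.append(path)
        else:
            rejected.append(path)
    return kept, rejected
```

Returning the rejected paths rather than deleting them keeps the sketch non-destructive; whether the original pipeline deleted or merely skipped such files is not stated in the paper.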
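The pixel rescaling and the 80-20 split from the preprocessing step can be illustrated with a short standard-library sketch. The helper names, the fixed random seed, and the assumption of 8-bit pixel values (0 to 255) are illustrative choices that the paper does not specify. Resizing to 224x224 is omitted here because it requires an image library.

```python
import random


def rescale_pixels(row):
    """Map 8-bit pixel values (0..255) into the range 0..1."""
    return [value / 255.0 for value in row]


def train_val_split(items, val_fraction=0.2, seed=42):
    """Shuffle a copy of items and split it into (train, val) lists."""
    shuffled = list(items)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

For example, `train_val_split(paths)` on 100 cleaned file paths yields 80 training paths and 20 validation paths with no overlap.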
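The architecture described in Sections IV and V (three convolution-pooling blocks, a flattening layer, dense layers, and a 16-way softmax) might look roughly like the Keras sketch below. The filter counts (32, 64, 128), the 128-unit hidden dense layer, the 3x3 kernels, the 2x2 pooling, and the three input channels are all assumptions, since the paper does not report them. TensorFlow is imported inside `build_model` so that the small shape helper can be used without it.

```python
NUM_CLASSES = 16         # from the paper
INPUT_SIZE = 224         # from the paper
FILTERS = (32, 64, 128)  # assumption: filter counts are not given in the paper


def spatial_size_after_blocks(size, n_blocks, kernel=3, pool=2):
    """Side length of the feature map after n conv+pool blocks."""
    for _ in range(n_blocks):
        size = size - kernel + 1  # "valid" convolution with stride 1
        size = size // pool       # non-overlapping max pooling
    return size


def build_model():
    """Build and compile a small CNN matching the description in the text."""
    import tensorflow as tf  # imported lazily; only needed to build the model

    layers = tf.keras.layers
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(INPUT_SIZE, INPUT_SIZE, 3)))
    for n_filters in FILTERS:
        model.add(layers.Conv2D(n_filters, 3, activation="relu"))
        model.add(layers.MaxPooling2D(2))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))  # assumed hidden size
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(
        optimizer="adam",
        loss="categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model
```

Under these assumptions the feature map shrinks from 224 to 26 pixels per side over the three blocks, so the flattened vector has 26 x 26 x 128 = 86,528 entries. This illustrates why pooling matters for the computational load mentioned in the text.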
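To make the training objective concrete, the softmax output and categorical cross-entropy loss named in the text can be written out in a few lines of plain Python. This is a didactic sketch of the standard definitions, not the TensorFlow implementation. The small `eps` floor is an added safeguard against taking the logarithm of zero.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(one_hot, probabilities, eps=1e-12):
    """Loss for one sample: minus the log-probability of the true class."""
    return -sum(
        target * math.log(max(p, eps))
        for target, p in zip(one_hot, probabilities)
    )
```

With 16 classes, a model that outputs equal scores for every class has a loss of ln(16), roughly 2.77, so a training loss well below that value indicates the network has learned something beyond chance.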
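The deployment step could be wired up along the following lines. This is a hedged sketch rather than the authors' interface: the helper names are hypothetical, the class labels are supplied by the caller because the paper does not list all sixteen, and the preprocessing simply mirrors the 224x224, 0-to-1 scheme described earlier. NumPy and Gradio are imported lazily. Whether the Gradio image component accepts TIFF uploads in a given browser has not been verified here.

```python
def top_prediction(probabilities, labels):
    """Return (label, confidence) for the highest-probability class."""
    best = max(range(len(labels)), key=lambda i: probabilities[i])
    return labels[best], float(probabilities[best])


def make_predict_fn(model, labels):
    """Wrap a trained Keras model as a function that Gradio can call."""
    import numpy as np  # imported lazily; only needed at prediction time

    def predict(image):
        # The upload arrives as a PIL image; apply the training preprocessing.
        resized = image.convert("RGB").resize((224, 224))
        batch = np.asarray(resized, dtype="float32")[None, ...] / 255.0
        probabilities = model.predict(batch)[0]
        # gr.Label accepts a mapping from class name to confidence.
        return {label: float(p) for label, p in zip(labels, probabilities)}

    return predict


def launch_demo(model, labels):
    """Start a local web page for uploading an image and viewing the result."""
    import gradio as gr  # imported lazily so the helpers above work without it

    demo = gr.Interface(
        fn=make_predict_fn(model, labels),
        inputs=gr.Image(type="pil"),
        outputs=gr.Label(num_top_classes=3),
    )
    demo.launch()
```

`top_prediction` is shown separately because the predicted type plus confidence score is what the paper reports to the user; the `gr.Label` output presents the same information for the top few classes.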
VI. RESULT

The experimental results show that the CNN-based document classification model categorizes TIFF document images into one of sixteen categories. High accuracy is achieved on both the training and validation datasets. The model was evaluated over ten epochs of training and showed steady improvement in accuracy and loss. The validation accuracy closely tracks the training accuracy, which indicates that the model generalizes to unseen data and does not easily overfit, because the convolutional and pooling layers extract meaningful features from the document images. After training, the model was tested through a Gradio interface, which allowed users to upload TIFF images and receive classifications with confidence scores. It worked correctly, processing each document image quickly in real time. Its strength was in correctly distinguishing visually distinct document types, such as resumes and scientific reports, but it sometimes failed on documents with similar layouts or few distinguishing features.

The system is also suitable for digital archiving and automated document management applications in which large volumes of documents need to be classified efficiently. Although the model could be improved further by using more training data or by fine-tuning for accuracy on challenging classes, it is already an effective and reliable tool for document classification. The Gradio integration further makes it possible for non-technical users to use the model through a simple interface.

VII. CONCLUSION

In conclusion, the project has implemented a document classification model using convolutional neural networks that classifies TIFF documents into sixteen types, such as invoices, CVs, and scientific papers. The approach requires no text extraction; its accuracy comes from the combination of careful data preparation, model design, and user-centered deployment. Because the CNN identifies document types from layout features without text extraction, it imposes no particular input requirements, tolerates variation in document quality and layout, and remains simple to use. The CNN generalizes across a variety of document formats and types, which makes it useful in applications that require document classification and accurate digital archiving. In addition, the Gradio interface gives end users real-time predictions and confidence scores directly from uploaded images, so people with no programming or computer science background can use the system. This serves as a proof of concept that large-scale document classification can be automated. The project also shows the promise of CNNs for large-scale document classification and document intake pipelines. Future work could modify the model or refine the class granularity; the main route to higher accuracy is to focus on the document types that most closely resemble one another. Further work could also broaden the range of supported document types and target high-volume document processing workloads.

VIII. REFERENCES

[1] A. Jadli, M. Hain, and A. Hasbaoui, "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction," International Journal of Advanced Trends in Computer Science and Engineering, vol. 10, no. 2, pp. 549-557, Mar.-Apr. 2021, doi: 10.30534/ijatcse/2021/141022021.
[2] R. Abkrakhmanov, A. Elubaeva, T. Turymbetov, V. Nakhipova, S. Turmaganbetova, and Z. Ikram, "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization," International Journal of Advanced Computer Science and Applications, vol. 14, no. 7, pp. 720-728, 2023, doi: 10.14569/IJACSA.2023.0140779.
[3] S. Cho, J. Moon, J. Bae, J. Kang, and S. Lee, "A Framework for Understanding Unstructured Financial Documents Using RPA and Multimodal Approach," Electronics, vol. 12, no. 4, p. 939, 2023, doi: 10.3390/electronics12040939.
[4] R. Abkrakhmanov, A. Elubaeva, T. Turymbetov, V. Nakhipova, S. Turmaganbetova, and Z. Ikram, "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization," International Journal of Advanced Computer Science and Applications, vol. 14, no. 7, pp. 720-726, 2023.
[5] A. Jadli, M. Hain, and A. Hasbaoui, "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction," Int. J. Adv. Trends Comput. Sci. Eng., vol. 10, no. 2, pp. 549-557, 2021.
[6] S. I. Omurca, E. Ekinci, S. Sevim, E. B. Edinç, S. Eken, and A. Sayar, "A Document Image Classification System Fusing Deep and Machine Learning Models," Applied Intelligence, vol. 53, pp. 15295–15310, 2023.
[7] A. Das, S. Roy, U. Bhattacharya, and S. K. Parui, "Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks," 24th Int. Conf. on Pattern Recognition (ICPR), Beijing, China, 2018.
[8] A. W. Harley, A. Ufkes, and K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," arXiv preprint arXiv:1502.07058, 2015.
[9] N. Audebert, C. Herold, K. Slimani, and C. Vidal, "Multimodal Deep Networks for Text and Image-Based Document Classification," arXiv preprint arXiv:1907.06370, 2019.
[10] C. Tensmeyer and T. Martinez, "Analysis of Convolutional Neural Networks for Document Image Classification," arXiv preprint arXiv:1708.03273, 2017.