
Document Image Classification Using Deep Learning[1]-3

The document presents a framework for classifying scanned TIFF document images using deep learning techniques, specifically convolutional neural networks (CNNs) and transfer learning. It achieves high accuracy rates by utilizing pre-trained models and feature reduction methods, demonstrating effectiveness in various applications such as document management and automated classification. The system is designed to be user-friendly, allowing real-time predictions through a web interface, and shows significant improvements in processing speed and accuracy for document classification tasks.

Document Image Classification using Deep Learning

Dr. D. Balakrishnan
Assistant Professor, Department of Computer Science and Engineering
Kalasalingam Academy of Research and Education, Krishnankoil, Virudhunagar
d.balakrishnan@klu.ac.in

Tharun Narra (99220042096)
Computer Science and Engineering
Kalasalingam Academy of Research and Education, Krishnankoil
99220042096@klu.ac.in

Harsha Vardhan Nalleboina (99220042099)
Computer Science and Engineering
Kalasalingam Academy of Research and Education, Krishnankoil
99220042099@klu.ac.in

Devi Prasad Ponnapula (99220042100)
Computer Science and Engineering
Kalasalingam Academy of Research and Education, Krishnankoil
99220042100@klu.ac.in

Rajendra Reddy Bijjam (99220042130)
Computer Science and Engineering
Kalasalingam Academy of Research and Education, Krishnankoil
99220042130@klu.ac.in

Abstract: This project develops a CNN-based document classifier, a deep learning approach to classifying scanned TIFF document images into sixteen types, including but not limited to invoices, resumes, letters, and scientific papers. The dataset is cleaned by removing non-TIFF files and corrupted images, so that only good-quality data is used for training. The image preprocessing pipeline resizes all images to a standard size (224x224 pixels) and rescales the pixel values to improve model performance. The model is built with TensorFlow and Keras from several convolutional and pooling layers for feature extraction, followed by dense layers that classify the document. An 80-20 train-validation split was used, with categorical cross-entropy as the loss and the Adam optimizer to improve learning.

Model performance is monitored over ten epochs, and plots of the training and validation accuracy are used to judge how well the model learns. Finally, a Gradio-based, user-friendly interface lets a user upload a TIFF document image and returns a prediction along with a confidence score. The interface exposes the trained model for real-time classification in a web format, putting the model within reach of practical applications. The project demonstrates a simple, systematic approach to automated document classification, potentially applicable to digital archiving and document management, which require fast and accurate document classification.

Keywords: Deep learning, Document classification, Convolutional neural networks, Image preprocessing, TIFF images, TensorFlow, Keras, Gradio interface, Model evaluation, Image classification

I. INTRODUCTION

One research paper proposes a document classification framework based on deep transfer learning and feature reduction techniques. The study used various pre-trained models (e.g. DenseNet121, VGG19) along with classifiers such as logistic regression and k-nearest neighbors to achieve strong performance. The PCA and LDA reduction techniques help find the best trade-off between accuracy and efficiency, leading to results such as 97.83% by DenseNet121-LDA-LR on 546 images. The work recommends this approach for inclusion in ERP system environments, where it can support document processing tasks such as OCR. [1]

A multimodal deep learning model developed for classifying digitized documents uses CNNs for image features and RNNs for text features. Because these hybrid models do not drop below the accuracy achieved by single-modality models, the approach offers a lot of promise in financial, healthcare, and legal applications. The reported accuracy is 94.84% on a test set of 9,125 documents separated into seven categories. [2]

Another paper outlines a framework for understanding unstructured financial documents that combines robotic process automation (RPA) with a multimodal approach. RPA is used alongside a pre-trained deep learning model for classification and key information extraction (KIE) of multilingual documents. Document-understanding models such as LayoutXLM make the approach more accurate on the tasks it targets, chiefly those in banking. The framework was also shown to significantly improve the accuracy of KIE, especially under key-value labeling, and to reduce business processing time by more than 30% in some applications. [3]
Document classification remains a long-standing problem in computer science. The paper "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization" offers a multimodal solution: it trains CNN-based deep learning architectures on digitized documents and fuses information from the text and image modalities to improve accuracy. Textual information is classified by an RNN, while image processing is performed by a CNN. The two results are combined in a fusion layer for classification. The experimental results indicate that this multimodal approach significantly outperformed single-modality methods, with higher accuracy on documents from different industries such as finance, healthcare, and legal domains. The model could be applied in automated document management systems, where it promises to improve both efficiency and accuracy in document classification. [4]

The paper "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction" presents a framework that classifies document images using deep transfer learning and feature reduction. Combining pre-trained deep learning models such as DenseNet121 and VGG19 with machine learning classifiers yielded up to 97.83% accuracy on the small dataset of scanned documents used in the study. The framework includes dimensionality reduction techniques such as PCA and LDA to increase processing speed without degrading performance. It has potential applications in enterprise resource planning systems, especially for document management and preprocessing in OCR systems. [5]

The research paper "A Document Image Classification System Fusing Deep and Machine Learning Models" studied an array of document classification methods, with particular attention to digitization in university document management systems. The techniques employed are based on OCR and deep learning, in some cases assisted by ensemble methods, and a fusion model of EfficientNetB3 and ExtraTree classifiers achieved a 94.45% F-score. The analysis showed that combining content-based features with image-based features contributes greatly to classification accuracy. The system addresses the needs of document management and reduces human work in document verification. [6]

ALGORITHMS USED:

The code implements a Convolutional Neural Network (CNN), a deep learning architecture that is especially effective for image classification because it captures hierarchies of spatial information within images. The CNN model was implemented with TensorFlow and Keras. It uses several kinds of layers: convolutional, pooling, flattening, and dense layers.

In the initial convolutional layers, the model applies filters to the input image to extract low-level features such as edges and textures. The deeper layers capture the more complex patterns needed to distinguish between document types such as resumes, invoices, and scientific reports. Each convolutional layer is immediately followed by a pooling layer, which reduces the spatial dimensions of the feature maps. This lowers the computational requirements and also helps limit overfitting. After three convolution-pooling blocks, the features are flattened into a single vector and passed to dense layers, ending in a softmax output layer with 16 nodes, one for each document class.

Categorical cross-entropy is used as the loss function because this is a multi-class classification problem, and the Adam optimizer is used because it dynamically adjusts the learning rates during training. Training feeds batches of images through the network and updates the weights to minimize the error between the predicted classes and the true classes. The validation data is used to monitor the generalization accuracy of the model. Finally, a Gradio interface uses the trained CNN to let users upload images and get a prediction, so it serves as a simple practical application of document classification.

II. LITERATURE SURVEY

The paper addresses document image classification using deep convolutional neural networks (DCNNs) with intra-domain transfer learning and stacked generalization. It uses VGG16-based DCNNs whose weights are transferred from ImageNet and fine-tuned on document images, and it divides each document into regions, with a separate model trained for each region. By stacking the predictions of these region-specific models to form hybrid models, an accuracy of 92.21% was achieved on the RVL-CDIP dataset, a new benchmark result at the time. The approach improves training efficiency through inter-domain and intra-domain transfer learning in document image classification. [7]

This paper presents an extensive study of the feasibility of using deep CNNs for document image classification and retrieval. It compares CNN-based features with the traditional handcrafted ones used for such purposes and establishes their superiority. The experiments show that CNNs pre-trained on non-document images transfer well to document classification tasks, and that, given enough training data, region-specific CNN models are no longer necessary. The best accuracy on large document datasets was achieved with holistic CNN approaches. [8]

The article introduces a multimodal neural network model for document classification that integrates text and image modalities. It exploits both visual features extracted from images via MobileNetV2 and textual content processed
through Tesseract OCR and FastText embeddings. On the RVL-CDIP and Tobacco3482 datasets, the model improves classification accuracy by about 3% over single-modality baselines. Integrating the text and image modalities is shown to enable finer-grained document classification, even when the text produced by OCR is imperfect. The paper concludes with a discussion of the practical applications of multimodal learning in document processing. [9]

In this work, the authors explore Convolutional Neural Networks (CNNs) for document image classification, mainly by adapting the CNN architecture and data augmentation methods to document-specific features rather than those of general image datasets. The findings demonstrate that performance on the RVL-CDIP dataset improves with shear transformations and larger input images. The authors also study CNN design parameters such as depth, width, and input size, and illustrate how CNNs may learn region-wise layout features for document classification. Tuning the CNN architecture and data preprocessing in this way achieves state-of-the-art results on RVL-CDIP. [10]

III. EXISTING SYSTEM

1. AlexNet
Description: AlexNet brought CNNs to prominence when it won the ImageNet 2012 challenge. It consists of five convolutional layers and three fully connected layers, with progressively deeper layers performing feature extraction.
Application: Commonly used as a baseline for document classification, since it handles image data well enough.
Limitations: The input size is fixed at 227x227, so it tends to perform poorly on documents with highly complex layouts.

2. VGGNet
Description: Networks 16 to 19 layers deep that rely mainly on small (3x3) convolutions. They are effective at capturing fine-grained features.
Application: Mostly used for processing large, detailed document scans in which small nuances such as font details or layout need to be captured.
Limitations: The deep architecture is computationally expensive, and it lacks later innovations such as residual learning that ease training instability.

3. ResNet (Residual Networks)
Description: ResNet uses very deep architectures such as ResNet-50 and ResNet-152 to capture features of increasing complexity and abstraction, and simplifies their training with residual connections.
Application: A good choice for mixed, heterogeneous document datasets, because its deep, feature-rich layers let the network represent more complex document structures.
Limitations: Computationally demanding; may need very large datasets to avoid overfitting.

4. Inception Networks (GoogLeNet, InceptionV3)
Description: Uses "inception modules" that run parallel convolutions of different sizes (1x1, 3x3, and 5x5) to capture features at multiple scales.
Application: Suited to document images with diverse layouts that contain elements such as varying font sizes and mixed content.
Limitations: A complex architecture that is hard to engineer and fine-tune appropriately.

5. EfficientNet
Description: EfficientNet balances accuracy against computational cost by systematically scaling the depth, width, and resolution of the network.
Application: Appropriate for document imaging tasks that need high accuracy within a limited resource budget.
Limitations: EfficientNet requires careful scaling to reach its peak performance, so it may demand extra customization for document-specific tasks. The scaling settings therefore need close attention.

IV. PROPOSED SYSTEM

The TIFF document image classification system is developed to categorize images into 16 classes that cover a broad spectrum of document types, such as invoices, resumes, letters, and others. The complete workflow, comprising data cleaning and preprocessing modules, model training, and Gradio-based deployment, means that classification can be completed accurately with minimal user intervention.
Viewed from the outside, the system consists of the following modules:

1. Data Cleaning Module:
The rules for cleaning the dataset so that it contains only TIFF files are:
I. elimination of all non-TIFF files;
II. an integrity check of all TIFF files, ensuring that every document is readable;
III. removal of all corrupted files, i.e. files that could not be opened, so that the data is validated for quality training.

2. Data Preprocessing Module:
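As a rough illustration of what the data preprocessing module does, the sketch below resizes an image to 224x224, rescales 8-bit pixel values to the range 0 to 1, and makes an 80-20 split. It works on plain Python lists instead of an image library so that it stays self-contained. The function names, the nearest-neighbour resize, the maximum value of 255, and the fixed shuffle seed are illustrative assumptions, not the authors' code.

```python
import random

TARGET_SIZE = (224, 224)  # (height, width) stated in the paper


def resize_nearest(image, size=TARGET_SIZE):
    """Nearest-neighbour resize of a 2D list of pixel values (assumed method)."""
    src_h, src_w = len(image), len(image[0])
    dst_h, dst_w = size
    return [
        [image[r * src_h // dst_h][c * src_w // dst_w] for c in range(dst_w)]
        for r in range(dst_h)
    ]


def rescale(image, max_value=255):
    """Map pixel values from [0, max_value] to [0, 1]."""
    return [[pixel / max_value for pixel in row] for row in image]


def train_val_split(items, val_fraction=0.2, seed=0):
    """Shuffle reproducibly, then hold out val_fraction of the items."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(round(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


if __name__ == "__main__":
    tiny = [[0, 255], [128, 64]]            # stand-in for a decoded grayscale scan
    prepared = rescale(resize_nearest(tiny))
    train, val = train_val_split(range(100))
    print(len(prepared), len(prepared[0]), len(train), len(val))
```

In a real pipeline the resize would normally be delegated to an image library; the pure-Python version is used here only to keep the example free of dependencies.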
I. resizing all images to 224x224 pixels, a standard input size for CNNs, and normalizing the pixel values by scaling them to the range 0 to 1;
II. splitting the data 80%-20% into training and test sets to support evaluation.

3. Deep Learning Model:
The convolutional neural network (CNN) essentially employs:
I. three convolutional layers, which extract features from documents through consecutive layers;
II. a final flattening layer that connects these to a classification stage with 16 class outputs.

4. Training and Evaluation Module:
The model was trained with categorical cross-entropy loss and the Adam optimizer, which balances learning speed against final model performance.

5. Gradio Prediction Interface:
This provides an easily accessible web interface that lets users upload any TIFF image for classification.
It returns the predicted document type together with a confidence score in near real time.

Fig. 1.1: Flowchart

V. METHODOLOGY

The paper describes a CNN-based document classification scheme that works on a dataset of TIFF images covering sixteen types of documents, including invoices, resumes, and letters. After the dataset is prepared, which involves several techniques, the work moves on to model building and development and then to interfacing the model with a user-friendly front end.
1. Data Preprocessing: The first step filters out all non-TIFF files and corrupted images to obtain clean data for the task. The second step resizes all images to 224x224 pixels and normalizes their pixel values around a mean and variance. The last step splits the data 80-20 into training and validation sets, by which the ability of the model is assessed.

2. Model Architecture: The CNN consists of several layers. Convolutional layers extract features such as edges and layouts from the document images; pooling layers then reduce the spatial dimensions to mitigate overfitting and cut the computational load. Finally, fully connected dense layers feed the feature vectors obtained from the previous layers into a sixteen-way softmax classifier.
3. Training and Validation: Categorical cross-entropy loss is used with the Adam optimizer, whose adaptive learning rate improves convergence. Training and validation accuracy were monitored over ten epochs to evaluate how well the model was learning.
4. Deployment: Gradio provides direct access to the model, so that uploaded images are classified in real time along with confidence scores. This enables instantaneous classification and ties the model more closely to the requirements of its real application.
The result is a viable document classification system, with high accuracy and ease of use, that can be turned into applications for document management and archiving.

VI. RESULT

The experimental results show that the CNN-based document classification model categorizes TIFF document images into one of sixteen categories. High accuracy is achieved on both the training and validation datasets. The model was evaluated over ten epochs of training and showed steady improvement in accuracy and loss. The validation accuracy closely tracks the training accuracy. This shows that the model generalizes well to unseen data and does not easily overfit, because the convolutional and pooling layers extract meaningful features from the document images.
The model was tested using a Gradio interface after training, which allowed users to upload TIFF images and receive classifications with confidence scores. It worked correctly, processing each document image quickly in real time. Its strengths were in correctly distinguishing visually distinct types of documents, such as resumes and scientific reports, but it sometimes failed on documents with similar layouts or minimal distinguishing features.
The system is also suitable for use in various digital archiving and automated document management applications where large volumes of documents need to be classified
efficiently. Though the model could still be improved by using more training data or by fine-tuning for accuracy on challenging classes, it currently represents an effective and reliable tool for document classification. The Gradio integration further makes it possible for non-technical users to use the model's functionality through a simple interface.

VII. CONCLUSION

In conclusion, the project has implemented a robust and accurate document classification model that uses convolutional neural networks to classify TIFF documents into sixteen types, such as invoices, CVs, and scientific papers. The approach requires no text extraction, and its accuracy rests on the combination of sound data preparation, model design, and user-centered deployment. The model performs well at identifying document types even though features such as the field sets inside each layout are complex. Because the CNN does not involve text extraction, it places no particular constraints on its input, and it tolerates variation in document quality and layout while remaining simple to use. The CNN model generalizes across the various formats and types of documents, which makes it helpful in applications where document classification and accurate digital archiving are required. In addition, the Gradio interface gives the end user real-time prediction and scoring directly from uploaded images. This makes the system highly applicable and useful for non-programmers, who need no knowledge of computer science, and serves as a proof of concept that mass document classification can be automated. The project also shows the promise of CNNs for large-scale document classification and document intake pipelines. Future work could modify the model or the granularity of the classes; the main route to higher accuracy is to focus on the document types that are most similar to each other. Future work also aims to widen the range of supported documents and to reduce high-volume documentation desk work.

VIII. REFERENCES

[1] A. Jadli, M. Hain, and A. Hasbaoui, "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction," International Journal of Advanced Trends in Computer Science and Engineering, vol. 10, no. 2, pp. 549-557, Mar.-Apr. 2021, doi: 10.30534/ijatcse/2021/141022021.

[2] R. Abkrakhmanov, A. Elubaeva, T. Turymbetov, V. Nakhipova, S. Turmaganbetova, and Z. Ikram, "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization," International Journal of Advanced Computer Science and Applications, vol. 14, no. 7, pp. 720-728, 2023, doi: 10.14569/IJACSA.2023.0140779.

[3] S. Cho, J. Moon, J. Bae, J. Kang, and S. Lee, "A Framework for Understanding Unstructured Financial Documents Using RPA and Multimodal Approach," Electronics, vol. 12, no. 4, p. 939, 2023, doi: 10.3390/electronics12040939.

[4] R. Abkrakhmanov, A. Elubaeva, T. Turymbetov, V. Nakhipova, S. Turmaganbetova, and Z. Ikram, "A Novel 2D Deep Convolutional Neural Network for Multimodal Document Categorization," International Journal of Advanced Computer Science and Applications, vol. 14, no. 7, pp. 720-726, 2023.

[5] A. Jadli, M. Hain, and A. Hasbaoui, "An Improved Document Image Classification using Deep Transfer Learning and Feature Reduction," Int. J. Adv. Trends Comput. Sci. Eng., vol. 10, no. 2, pp. 549-557, 2021.

[6] S. I. Omurca, E. Ekinci, S. Sevim, E. B. Edinç, S. Eken, and A. Sayar, "A Document Image Classification System Fusing Deep and Machine Learning Models," Applied Intelligence, vol. 53, pp. 15295-15310, 2023.

[7] A. Das, S. Roy, U. Bhattacharya, and S. K. Parui, "Document Image Classification with Intra-Domain Transfer Learning and Stacked Generalization of Deep Convolutional Neural Networks," 24th Int. Conf. on Pattern Recognition (ICPR), Beijing, China, 2018.

[8] A. W. Harley, A. Ufkes, and K. G. Derpanis, "Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval," arXiv preprint arXiv:1502.07058, 2015.

[9] N. Audebert, C. Herold, K. Slimani, and C. Vidal, "Multimodal Deep Networks for Text and Image-Based Document Classification," arXiv preprint arXiv:1907.06370, 2019.

[10] C. Tensmeyer and T. Martinez, "Analysis of Convolutional Neural Networks for Document Image Classification," arXiv preprint arXiv:1708.03273, 2017.
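The cleaning rules listed for the data cleaning module (drop non-TIFF files, check that each TIFF is readable, remove corrupted files) could be sketched roughly as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and "readable" is approximated by checking the four-byte header of a classic TIFF file, which is a weaker test than fully decoding the image with an imaging library.

```python
import os

# First four bytes of a classic TIFF file: little-endian ("II") or big-endian ("MM").
TIFF_MAGIC = (b"II*\x00", b"MM\x00*")
TIFF_EXTENSIONS = (".tif", ".tiff")


def is_tiff_name(path):
    """Rule I: keep only files with a TIFF extension."""
    return path.lower().endswith(TIFF_EXTENSIONS)


def has_tiff_header(path):
    """Rule II (approximation): the file opens and starts with a TIFF header."""
    try:
        with open(path, "rb") as handle:
            return handle.read(4) in TIFF_MAGIC
    except OSError:
        return False


def clean_dataset(root):
    """Rule III: delete files that fail either check; return the kept paths."""
    kept = []
    for folder, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            if is_tiff_name(path) and has_tiff_header(path):
                kept.append(path)
            else:
                os.remove(path)  # destructive: run on a copy of the dataset
    return sorted(kept)
```

Because `clean_dataset` deletes files in place, a cautious variant would move rejected files to a quarantine folder instead of removing them.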

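The architecture described in the paper (three convolution-pooling blocks, a flattening step, dense layers, and a 16-way softmax output, trained with categorical cross-entropy and Adam) might look like the Keras sketch below. The paper does not give filter counts, kernel sizes, the width of the hidden dense layer, or the number of input channels, so the values used here are assumptions chosen only for illustration. TensorFlow is imported inside `build_cnn` so that the small shape helper can be run without it.

```python
NUM_CLASSES = 16              # stated in the paper
INPUT_SHAPE = (224, 224, 3)   # 224x224 from the paper; 3 channels is an assumption
FILTERS = (32, 64, 128)       # assumed; the paper gives no filter counts


def build_cnn():
    """Three conv-pool blocks, then flatten, dense, and a softmax output."""
    # Lazy import: only needed when the model is actually built.
    from tensorflow.keras import layers, models

    model = models.Sequential()
    model.add(layers.Input(shape=INPUT_SHAPE))
    for n_filters in FILTERS:
        model.add(layers.Conv2D(n_filters, (3, 3), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))  # assumed hidden width
    model.add(layers.Dense(NUM_CLASSES, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model


def spatial_size_after_blocks(size=224, blocks=3, kernel=3, pool=2):
    """Side length after each unpadded convolution followed by pooling."""
    sizes = []
    for _ in range(blocks):
        size = (size - kernel + 1) // pool
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    print(spatial_size_after_blocks())
```

Under these assumed settings each block shrinks the feature map (224 to 111, then 54, then 26 pixels per side), which is the reduction in spatial dimensions that the paper credits with lowering the computational load. Training for the ten epochs mentioned in the paper would then be a call to the model's `fit` method with the training and validation sets.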