
Journal of Emerging Trends in Computing and Information Sciences
Vol. 4, No. 6, June 2013, ISSN 2079-8407
©2009-2013 CIS Journal. All rights reserved.

http://www.cisjournal.org

Optical Character Recognition Techniques: A Survey


Sukhpreet Singh
M.Tech Student, Dept. of Computer Engineering, YCOE, Talwandi Sabo, Punjab, India.

ABSTRACT
This paper presents a literature review of English OCR techniques. An English OCR system is essential for converting the large body of published English books into editable computer text files. Recent research in this area has produced new methodologies to cope with the complexity of English writing styles. However, these algorithms have not yet been tested on the complete English alphabet. Hence, a system is required that can handle all classes of English text and identify characters within those classes.

Keywords: OCR, templates, English alphabets, alphanumerics

1. INTRODUCTION
Optical Character Recognition [1]-[5] is a process that converts text present in a digital image into editable text. It allows a machine to recognize characters through optical mechanisms. The output of the OCR should ideally match the input in formatting. The process involves some pre-processing of the image file and then the acquisition of important knowledge about the written text.

That knowledge or data can then be used to recognize characters. OCR [1] is becoming an important part of modern research-based computer applications. Especially with the advent of Unicode and the support of complex scripts on personal computers, the importance of this application has increased.

The current study focuses on exploring possible techniques for developing an OCR [2] system for the English language when noise is present in the signal. A detailed analysis of the English writing system has been carried out in order to understand the core challenges. Existing OCR systems are also studied to follow the latest research in this field. The emphasis was on finding a workable segmentation technique and diacritic handling for English strings, and on building a recognition module for these ligatures. A complete methodology for developing an English OCR system is proposed, and a testing application has also been built. Test results are reported and compared with previous work in this area.

2. MOTIVATION
In recent years, a trend toward digitizing (historic) paper-based documents such as books and newspapers has emerged. The aim is to preserve these documents and make them fully accessible, searchable and processable in digital form. Knowledge contained in paper-based documents is far more valuable to today's digital world when it is available in digital form.

The first step toward transforming a paper-based archive into a digital archive is to scan the documents. The next step is to apply an OCR (Optical Character Recognition) process, meaning that the scanned image of each document is translated into machine-processable text. Due to the print quality of the documents and the error-prone pattern-matching techniques of the OCR process, OCR errors occur. Modern OCR processors achieve character recognition rates of up to 99% on high-quality documents. Assuming an average word length of 5 characters, this still means that one out of 20 words is defective; thus, at least 5% of all processed words will contain OCR errors. On historic documents this error rate will be even higher, because the print quality is likely to be lower.

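The arithmetic behind the 5% figure is easy to verify. A back-of-envelope check in Python, assuming character errors are independent:

```python
# With independent character errors, a word is read correctly
# only if every one of its characters is correct.
p_char = 0.99  # assumed per-character recognition rate
L = 5          # assumed average word length
p_word = p_char ** L
print(f"word accuracy   ~ {p_word:.3f}")      # ~0.951
print(f"word error rate ~ {1 - p_word:.3f}")  # ~0.049, about 1 word in 20
```
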
After the OCR process finishes, several post-processing steps are necessary depending on the application, e.g. tagging the documents with meta-data (author, year, etc.) or proof-reading them to correct OCR errors and spelling mistakes. Data which contains spelling mistakes or OCR errors is difficult to process. For example, a standard full-text search will not retrieve misspelled versions of a query string. To satisfy an application's demanding requirements toward zero errors, a step that corrects these errors is a very important part of the post-processing chain.

A post-processing error correction system can be manual, semi-automatic or fully automatic. A semi-automatic post-correction system detects errors automatically and proposes corrections to human correctors, who then have to choose the correct proposal. A fully-automatic post-correction system performs both the detection and the correction of errors on its own. Because semi-automatic or manual corrections require a lot of human effort and time, fully-automatic systems become necessary to perform a full correction.

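As a concrete illustration of the detect-and-propose flow, here is a minimal sketch of a dictionary-based corrector. The tiny lexicon, the propose_corrections helper and the 0.6 similarity cutoff are invented for the example; a production system would use a full dictionary plus confusion statistics from the OCR engine:

```python
from difflib import get_close_matches

# Hypothetical lexicon; real systems load a full dictionary.
LEXICON = ["optical", "character", "recognition", "document", "image"]

def propose_corrections(token, n=3):
    """Flag a suspect OCR token and return ranked correction proposals.
    A human picks one in semi-automatic mode; a fully automatic
    system would simply take the top-ranked candidate."""
    if token.lower() in LEXICON:
        return []  # known word, nothing to correct
    return get_close_matches(token.lower(), LEXICON, n=n, cutoff=0.6)

print(propose_corrections("rec0gnition"))  # ['recognition']
```
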
3. DESIGN OF OCR
Various approaches used for the design of OCR systems are discussed below.

Matrix Matching [6]: Matrix Matching converts each character into a pattern within a matrix and then compares the pattern with an index of known characters. Its recognition is strongest on monotype and uniform single-column pages.

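A minimal sketch of the idea, with invented 3x3 templates; real systems store one bitmap per character and font at full resolution:

```python
import numpy as np

# Each known character is a binarized bitmap; an input glyph is
# assigned to the template whose pixels agree with it the most.
TEMPLATES = {
    "I": np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]]),
    "L": np.array([[1, 0, 0], [1, 0, 0], [1, 1, 1]]),
}

def match(glyph):
    """Return the label of the best-overlapping template."""
    scores = {c: int(np.sum(glyph == t)) for c, t in TEMPLATES.items()}
    return max(scores, key=scores.get)

noisy_i = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 1]])
print(match(noisy_i))  # 'I' (8 of 9 pixels agree)
```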

Fuzzy Logic [6]: Fuzzy logic is a multi-valued logic that allows intermediate values to be defined between conventional evaluations like yes/no, true/false or black/white. It attempts to bring a more human-like way of logical thinking into the programming of computers. Fuzzy logic is used when answers do not have a distinct true or false value and uncertainty is involved.

Feature Extraction [3]-[6]: This method defines each character by the presence or absence of key features, including height, width, density, loops, lines, stems and other character traits. Feature extraction is well suited to OCR of magazines, laser print and other high-quality images.

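A sketch of a few such features for a binarized glyph; the basic_features helper is hypothetical, and production systems add loops, stems, crossings and similar traits:

```python
import numpy as np

def basic_features(glyph):
    """Height, width and ink density of a binarized glyph (1 = ink).
    Assumes the glyph contains at least one ink pixel."""
    ys, xs = np.nonzero(glyph)
    height = int(ys.max() - ys.min() + 1)
    width = int(xs.max() - xs.min() + 1)
    density = glyph.sum() / glyph.size
    return np.array([height, width, density])
```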

Structural Analysis [6]: Structural Analysis identifies characters by examining their sub-features: the shape of the image and its sub-vertical and horizontal histograms. Its character-repair capability makes it well suited to low-quality text and newsprint.

Neural Networks [6]: This strategy simulates the way the human neural system works. It samples the pixels in each image and matches them to a known index of character pixel patterns. Its ability to recognize characters through abstraction makes it well suited to faxed documents and damaged text. Neural networks are ideal for specific types of problems, such as processing stock market data or finding trends in graphical patterns.

3.1 Structure of OCR Systems
A diagrammatic representation of the structure of an OCR system is given in figure 1.

Fig 1: Diagrammatic Structure of the OCR System (adapted from [6])

3.2 Stages in Design of OCR Systems
The various stages of OCR system design are given in figure 2.

Fig 2: Stages in OCR Design (adapted from [6])

3.3 Reasons for Poor Performance of OCR Systems
Existing OCR systems generally show poor performance on documents such as old books, where print and paper quality are inferior due to aging; copied materials, such as photocopies and faxed documents, where print quality is inferior to the original; and newspapers, which are generally printed on low-quality paper.

For such degraded documents, system recognition accuracy drops to 80-90%. If the OCR system is to be used in the banking and corporate sectors, this accuracy rate is not up to the mark.

4. RELATED WORK
Claudiu et al. (2011) [1] investigated how simple pre-processing of training data yields experts whose errors are less correlated than those of different nets trained on the same or bootstrapped data; committees that simply average the expert outputs therefore considerably improve recognition rates. Their committee-based classifiers of isolated handwritten characters were the first to reach parity with human performance and can be used as basic building blocks of any OCR system (all results were achieved by software running on powerful yet cheap gaming cards).

Georgios et al. (2010) [2] presented a methodology for off-line handwritten character recognition. The proposed methodology relies on a new feature extraction technique based on recursive subdivisions of the character image, such that the resulting sub-images at each iteration have balanced (approximately equal) numbers of foreground pixels, as far as this is possible. Feature extraction is followed by a two-stage classification scheme based on the level of granularity of the feature extraction method. Classes with high values in the confusion matrix are merged at a certain level, and for each group of merged classes, granularity features from the level that best distinguishes them are employed. Two handwritten character databases (CEDAR and CIL) as well as two handwritten digit databases (MNIST and CEDAR) were used to demonstrate the effectiveness of the proposed technique.

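One plausible reading of the balanced-subdivision step in [2] is sketched below; the helper names are invented, and details such as split-point rounding differ in the original:

```python
import numpy as np

def balanced_split(counts):
    """Index that splits a 1-D foreground histogram into two parts
    holding (approximately) equal amounts of ink."""
    csum = np.cumsum(counts)
    return int(np.searchsorted(csum, csum[-1] / 2))

def subdivide(glyph):
    """One iteration of the recursive subdivision: cut the glyph at the
    row and column that balance the foreground mass, giving four
    sub-images with roughly equal numbers of foreground pixels."""
    r = balanced_split(glyph.sum(axis=1))  # row histogram
    c = balanced_split(glyph.sum(axis=0))  # column histogram
    return [glyph[:r, :c], glyph[:r, c:], glyph[r:, :c], glyph[r:, c:]]
```
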
Sankaran et al. (2012) [3] presented a novel recognition approach that results in a 15% decrease in word error rate on heavily degraded Indian-language document images. OCRs perform considerably well on good-quality documents but fail easily in the presence of degradations. Classical OCR approaches also perform poorly on complex scripts such as those used for Indian languages. Sankaran et al. addressed these issues by proposing to recognize character n-gram images, which are basically groupings of consecutive character/component segments. Their approach is unique in that the character n-grams are used as a primitive for recognition rather than for post-processing.

By exploiting the additional context present in the character n-gram images, they enable better disambiguation between confusing characters in the recognition phase. The labels obtained from recognizing the constituent n-grams are then fused to obtain a label for the word that emitted them. Their method is inherently robust to degradations such as cuts and merges, which are common in digital libraries of scanned documents. They also present a reliable and scalable scheme for recognizing character n-gram images. Tests on English and Malayalam document images show considerable improvement in recognition for heavily degraded documents.

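The n-gram grouping primitive of [3] is easy to picture; the function below is an invented sketch, with string labels standing in for the segment sub-images:

```python
def character_ngrams(segments, n=2):
    """Group consecutive character/component segments into n-gram
    units, the recognition primitive proposed in [3]."""
    return [tuple(segments[i:i + n]) for i in range(len(segments) - n + 1)]

print(character_ngrams(["d", "o", "c"]))  # [('d', 'o'), ('o', 'c')]
```
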
Jawahar et al. (2012) [4] proposed a recognition scheme for the Indian script of Devanagari. Recognition accuracy for Devanagari script is not yet comparable to that of its Roman counterparts, mainly due to the complexity of the script, writing style, etc. Their solution uses a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM). The approach does not require word-to-character segmentation, which is one of the most common reasons for a high word error rate. Jawahar et al. reported a reduction of more than 20% in word error rate and over 9% in character error rate compared with the best available OCR system.

Zhang et al. (2012) [5] discussed how misty, foggy or hazy weather conditions lead to image color distortion and reduce the resolution and contrast of the observed object in outdoor scene acquisition. In order to detect and remove haze, the article proposes a novel, effective algorithm for visibility enhancement from a single gray or color image. Since the haze can be considered to concentrate mainly in one component of the multilayer image, the haze-free image is reconstructed through haze-layer estimation based on an image filtering approach using both a low-rank technique and an overlap averaging scheme. Using parallel analysis with Monte Carlo simulation of the coarse atmospheric veil obtained with a median filter, a refined smooth haze layer is acquired that has less texture while retaining depth changes. With the dark channel prior, the normalized transmission coefficient is calculated to restore the fog-free image. Experimental results show that the proposed algorithm is a simple and efficient method for clarity improvement and contrast enhancement from a single foggy image; moreover, it is comparable with state-of-the-art methods and in some cases gives better results.

Badawy et al. (2012) [6] discussed automatic license plate recognition (ALPR), the extraction of vehicle license plate information from an image or a sequence of images. The extracted information can be used, with or without a database, in many applications, such as electronic payment systems (toll payment, parking-fee payment) and freeway and arterial monitoring systems for traffic surveillance. ALPR uses either a color, black-and-white, or infrared camera to take images.

Ntirogiannis et al. (2013) [7] studied document image binarization, which is of great importance in the document image analysis and recognition pipeline since it affects further stages of the recognition process. The evaluation of a binarization method aids in studying its algorithmic behavior, as well as verifying its effectiveness, by providing qualitative and quantitative indications of its performance. The paper presents a pixel-based binarization evaluation methodology for historical handwritten/machine-printed document images. In the proposed evaluation scheme, the recall and precision evaluation measures are modified using a weighting scheme that diminishes any potential evaluation bias.

Yang et al. (2012) [8] proposed a novel adaptive binarization method based on a wavelet filter, which shows performance comparable to other similar methods while processing faster, making it more suitable for real-time processing and applicable to mobile devices. The proposed method is evaluated on complex scene images from the ICDAR 2005 Robust Reading Competition, and the experimental results support the approach.

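For flavor, a generic locally adaptive binarizer in the spirit of the methods discussed in [7] and [8]; this is a textbook mean-threshold baseline, not a reimplementation of either paper, and the window and bias values are arbitrary:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_binarize(gray, window=15, bias=0.95):
    """Mark a pixel as foreground (ink) if it is darker than `bias`
    times the mean gray level of its local window."""
    local_mean = uniform_filter(gray.astype(float), size=window)
    return (gray < bias * local_mean).astype(np.uint8)
```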

Sumetphong et al. (2012) [9] proposed a novel technique for recognizing the broken Thai characters found in degraded Thai text documents by modeling the task as a set-partitioning problem (SPP). The technique searches for the optimal set-partition of the connected components in which each subset yields a reconstructed Thai character. Given the non-linear nature of the objective function needed for optimal set-partitioning, they designed an algorithm called Heuristic Incremental Integer Programming (HIIP), which employs integer programming (IP) with an incremental approach, using heuristics to hasten convergence. To generate corrected Thai words, they adopt a probabilistic generative approach based on a Thai dictionary corpus. The technique has been applied successfully to a Thai historical document and a poor-quality Thai fax document, with promising accuracy rates of over 93%.

AlSalman et al. (2012) [10] noted that Braille recognition is the ability to detect and recognize Braille characters embossed on a Braille document. The result is used in several applications such as embossing, printing and translating. However, the performance of these applications is affected by poor-quality imaging due to several factors, such as scanner quality, scan resolution, lighting, and the type of embossed document.

Mutholib et al. (2012) [11] observed that the Android platform has gained popularity in recent years in terms of market share and the number of available applications. The Android operating system is built on a modified Linux kernel with built-in services such as email, a web browser, and map applications. In their paper, automatic number plate recognition (ANPR) was designed and implemented on the Android mobile phone platform. First, a graphical user interface (GUI) for capturing images with the built-in camera was developed to acquire Malaysian car plate numbers. Second, pre-processing of the raw image was performed using contrast enhancement, filtering and straightening. Next, optical character recognition (OCR) using a neural network was employed to extract the text and numbers. The proposed ANPR algorithm was implemented and simulated using the Android SDK on a computer. Preliminary results showed that the system recognizes almost 88% of plate characters. Future research includes optimizing the system for mobile phone implementation with limited CPU and memory resources, as well as geo-tagging the images using GPS coordinates and an online database for various mobile applications.

Chi et al. (2012) [12] noted that, because of the possible presence of carbon copies and seals, images of financial documents such as Chinese bank checks quite often suffer from bleed-through effects, which affect the performance of automatic financial document processing such as seal verification and OCR. Chi et al. presented an effective algorithm to deal with bleed-through effects in images of financial documents. Double-sided images scanned simultaneously are used as inputs, and the bleed-through effect is detected and removed after registration of the recto- and verso-side images.

Ramakrishnan et al. (2012) [13] observed that many machine learning algorithms rely on feature descriptors to access information about image appearance, so using an appropriate descriptor is crucial for an algorithm to succeed. Although domain- and task-specific feature descriptors may give excellent performance, they currently have to be hand-crafted, a difficult and time-consuming process. In contrast, general-purpose descriptors (such as SIFT) are easy to apply and have proved successful for a variety of tasks, including classification, segmentation and clustering. Unfortunately, most general-purpose feature descriptors are targeted at natural images and may perform poorly on document analysis tasks. Ramakrishnan et al. therefore proposed a method for automatically learning feature descriptors tuned to a given image domain. The method works by first extracting the independent components of the images and then building a descriptor by pooling these components over multiple overlapping regions. They tested the proposed method on several document analysis tasks and datasets, and showed that it outperforms existing general-purpose feature descriptors.

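A rough sketch of the descriptor-learning outline in [13], using an off-the-shelf ICA implementation; the pooling stage is omitted and all names are invented, so this illustrates the idea rather than the authors' method:

```python
import numpy as np
from sklearn.decomposition import FastICA

def learn_filters(patches, n_components=16):
    """Learn domain-tuned filters from an (n_patches, n_pixels) matrix
    of training patches by extracting independent components."""
    ica = FastICA(n_components=n_components, random_state=0)
    ica.fit(patches)
    return ica.components_  # one learned filter per row

def describe(patch_vec, filters):
    """Descriptor for one flattened patch: its filter responses.
    [13] additionally pools such responses over overlapping regions."""
    return filters @ patch_vec
```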

Chattopadhyay et al. (2012) [14] presented a low-complexity video OCR system that can be deployed on an embedded platform. The novelty of the proposed method is its low use of processing cycles and memory while still achieving a recognition accuracy of 84.23%, which is higher than the usual video OCR recognition accuracy. Moreover, the proposed method can recognize about 180 characters per frame on average in 26.34 milliseconds.

Malakar et al. (2012) [15] described how the extraction of text lines from document images is one of the important steps in an Optical Character Recognition (OCR) system. In handwritten document images, the presence of skewed, touching or overlapping text lines makes this process a real challenge for the researcher. Their technique extracts 87.09% and 89.35% of text lines successfully from the two databases studied.

Sankaran et al. (2012) [16] proposed a recognition scheme for the Indian script of Devanagari. Recognition accuracy for Devanagari script is not yet comparable to that of its Roman counterparts, mainly due to the complexity of the script, writing style, etc. Their solution uses a Recurrent Neural Network known as Bidirectional Long Short-Term Memory (BLSTM) and does not require word-to-character segmentation, one of the most common reasons for a high word error rate. They reported a reduction of more than 20% in word error rate and over 9% in character error rate compared with the best available OCR system.

Gur et al. (2012) [17] discussed text recognition and retrieval, a well-known problem. Automated optical character recognition (OCR) tools do not supply a complete solution, and in most cases human inspection is required. The authors suggest a novel text recognition algorithm based on fuzzy logic rules that rely on statistical data about the analyzed font. The approach combines letter statistics and correlation coefficients in a set of fuzzy rules, enabling the recognition of distorted letters that might not be retrieved otherwise. The authors focused on Rashi fonts associated with commentaries of the Bible, which are actually handwritten calligraphy.

Devlin et al. (2012) [18] discussed how, when performing handwriting recognition on natural-language text, the use of a word-level language model (LM) is known to significantly improve recognition accuracy. The most common type of language model, the n-gram model, decomposes sentences into short, overlapping chunks. The authors propose a new type of language model, used in addition to the standard n-gram LM, which takes the likelihood score from a statistical machine translation system as a reranking feature. In general terms, each OCR hypothesis is automatically translated into another language, and a feature score is created based on how "difficult" it was to perform the translation. Intuitively, the difficulty of translation correlates with how well-formed the input sentence is. On an Arabic handwriting recognition task, Devlin et al. obtained a 0.4% absolute improvement in word error rate (WER) on top of a powerful 5-gram LM.

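Reranking with a language-model feature can be sketched as follows; the toy bigram table and weight are invented, and the MT-based feature of [18] is replaced by a plain n-gram term for brevity:

```python
# Toy bigram log-probabilities; a real system uses a smoothed 5-gram LM.
BIGRAM_LOGPROB = {("the", "cat"): -1.0, ("the", "cot"): -4.0}

def lm_score(words):
    """Log-probability of a hypothesis under the toy bigram model."""
    return sum(BIGRAM_LOGPROB.get(bg, -10.0) for bg in zip(words, words[1:]))

def rerank(hypotheses, weight=0.5):
    """Pick the OCR hypothesis with the best combined recognizer + LM score."""
    return max(hypotheses,
               key=lambda h: h["ocr_logprob"] + weight * lm_score(h["words"]))

best = rerank([{"words": ["the", "cat"], "ocr_logprob": -2.5},
               {"words": ["the", "cot"], "ocr_logprob": -2.0}])
print(best["words"])  # ['the', 'cat'] wins once the LM term is added
```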

Al-Khaffaf et al. (2012) [19] presented the current status of Decapod's English font reconstruction. The Potrace algorithm and the parameters that affect glyph shape are examined. The visual fidelity of Decapod's font reconstruction is shown and compared to Adobe ClearScan, and the font reconstruction details of the two methods are presented. The experiment demonstrates the capabilities of the two methods in reconstructing the font for a synthetic book typeset each time with one of six English fonts, three serif and three sans-serif. For both typefaces, Decapod is able to create a reusable TTF font that is embedded in the generated PDF document.

Rhead et al. (2012) [20] considered real-world UK number plates and related them to ANPR. The work considers aspects of the relevant legislation and standards when applying them to real-world number plates. The varied manufacturing techniques and specifications of component parts are noted, and the varied fixing methodologies and fixing locations are discussed, as well as their impact on image capture.

5. CONCLUSION
This paper has presented a review of related work on English OCR techniques. The available techniques were studied in order to identify the best among them. It is found that the techniques which provide better results tend to be slow, while the fast techniques mostly provide less accurate results. It is also found that OCR techniques based on neural networks provide more accurate results than the other techniques.

REFERENCES
[1] D. C. Cireşan, U. Meier, L. M. Gambardella and J. Schmidhuber, "Convolutional Neural Network Committees for Handwritten Character Classification", International Conference on Document Analysis and Recognition (ICDAR), IEEE, 2011.

[2] G. Vamvakas, B. Gatos and S. J. Perantonis, "Handwritten character recognition through two-stage foreground sub-sampling", Pattern Recognition, Vol. 43, Issue 8, August 2010.

[3] S. Dutta, N. Sankaran, P. Sankar K. and C. V. Jawahar, "Robust Recognition of Degraded Documents Using Character N-Grams", IEEE, 2012.

[4] N. Sankaran and C. V. Jawahar, "Recognition of Printed Devanagari Text Using BLSTM Neural Network", IEEE, 2012.

[5] Y.-Q. Zhang, Y. Ding, J.-S. Xiao, J. Liu and Z. Guo, "Visibility enhancement using an image filtering approach", EURASIP Journal on Advances in Signal Processing, 2012.

[6] W. Badawy, "Automatic License Plate Recognition (ALPR): A State of the Art Review", 2012.

[7] K. Ntirogiannis, B. Gatos and I. Pratikakis, "A Performance Evaluation Methodology for Historical Document Image Binarization", 2013.

[8] J. Yang, K. Wang, J. Li, J. Jiao and J. Xu, "A fast adaptive binarization method for complex scene images", 19th IEEE International Conference on Image Processing (ICIP), pp. 1889-1892, IEEE, 2012.

[9] C. Sumetphong and S. Tangwongsan, "An Optimal Approach towards Recognizing Broken Thai Characters in OCR Systems", International Conference on Digital Image Computing: Techniques and Applications (DICTA), IEEE, 2012.

[10] A. AlSalman et al., "A novel approach for Braille images segmentation", International Conference on Multimedia Computing and Systems (ICMCS), IEEE, 2012.

[11] A. Mutholib, T. S. Gunawan and M. Kartiwi, "Design and implementation of automatic number plate recognition on android platform", International Conference on Computer and Communication Engineering (ICCCE), IEEE, 2012.

[12] B. Chi and Y. Chen, "Reduction of Bleed-through Effect in Images of Chinese Bank Items", International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2012.

[13] K. Ramakrishnan and E. Bart, "Learning domain-specific feature descriptors for document images", 10th IAPR International Workshop on Document Analysis Systems (DAS), IEEE, 2012.

[14] T. Chattopadhyay, R. Jain and B. B. Chaudhuri, "A novel low complexity TV video OCR system", 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012.

[15] S. Malakar et al., "Text line extraction from handwritten document pages using spiral run length smearing algorithm", International Conference on Communications, Devices and Intelligent Systems (CODIS), IEEE, 2012.

[16] N. Sankaran and C. V. Jawahar, "Recognition of printed Devanagari text using BLSTM Neural Network", 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012.

[17] E. Gur and Z. Zelavsky, "Retrieval of Rashi Semi-Cursive Handwriting via Fuzzy Logic", International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2012.

[18] J. Devlin et al., "Statistical Machine Translation as a Language Model for Handwriting Recognition", International Conference on Frontiers in Handwriting Recognition (ICFHR), IEEE, 2012.

[19] H. S. M. Al-Khaffaf et al., "On the performance of Decapod's digital font reconstruction", 21st International Conference on Pattern Recognition (ICPR), IEEE, 2012.

[20] M. Rhead et al., "Accuracy of automatic number plate recognition (ANPR) and real world UK number plate problems", IEEE International Carnahan Conference on Security Technology (ICCST), IEEE, 2012.
