
Tesseract 2


2. Architecture and Data Structures


A quick tour of the Tesseract Code

Ray Smith, Google Inc.

Tesseract Blends Old and New OCR Technology - DAS2016 Tutorial - Santorini - Greece
A Note about the Coordinate System
● The pixel edges are aligned with integer coordinates.
● (0, 0) is at bottom-left.
● Width = right - left => no silly +1/-1.
[Figure: pixel grid with both axes labelled 0, 1, 2 at the pixel edges; origin at bottom-left.]
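
The edge-aligned convention above can be illustrated with a small box type. This is a minimal sketch, not Tesseract's TBOX: the names EdgeBox and PixelBox are hypothetical and exist only to show why no +1/-1 correction is needed.

```cpp
#include <cassert>

// Hypothetical box type for illustration only; not Tesseract's TBOX.
// Coordinates name pixel edges, with (0, 0) at the bottom-left corner.
struct EdgeBox {
  int left;
  int bottom;
  int right;
  int top;

  int width() const { return right - left; }   // no +1/-1 correction
  int height() const { return top - bottom; }
  int area() const { return width() * height(); }
};

// Box covering the single pixel in column x, row y (rows counted from the
// bottom). Its edges are x..x+1 and y..y+1, so width and height are both 1.
inline EdgeBox PixelBox(int x, int y) { return EdgeBox{x, y, x + 1, y + 1}; }
```

Under this convention a 2x2 pixel grid spans edges 0 to 2 on each axis, so its width is simply 2 - 0 = 2.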

Tesseract System Architecture
Nominally a pipeline, but not really, as there is a lot of re-visiting of old decisions.

[Figure: system architecture diagram. Surviving labels: LSTM Line Recognizer; Word OK?; Yes.]

Tesseract Word Recognizer

The ‘C’ Legacy
● Large chunks of the code written originally in C.
● Major rewrite in ~1991 with new C++ code.
● C->C++ migration gradual over time since.
● Most formerly global functions are now members of a convenience class named after the directory they live in (for thread compatibility).
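
The thread-compatibility motivation can be sketched as follows. This is an illustration under assumed names, not Tesseract code: g_noise_size, IsNoiseGlobal, and SketchTextord are hypothetical, and the value 7 is arbitrary.

```cpp
#include <cassert>

// Before: C-style global state. Every caller in the process shares one
// setting, so two recognizers with different settings cannot coexist.
int g_noise_size = 7;  // hypothetical parameter, arbitrary value
bool IsNoiseGlobal(int height) { return height < g_noise_size; }

// After: the same function as a member of a class. Each instance carries its
// own state, so separate instances can be used from separate threads.
class SketchTextord {
 public:
  explicit SketchTextord(int noise_size) : noise_size_(noise_size) {}
  bool IsNoise(int height) const { return height < noise_size_; }

 private:
  int noise_size_;
};
```

Two SketchTextord instances constructed with different thresholds give different answers for the same input, which the global version cannot do without mutating shared state.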

Directory Structure ~ Functional Architecture
Class             Directory
TessBaseAPI       API
Tesseract         ccmain
Wordrec           wordrec
Textord           textord
Classify          classify
Dict              dict
CCStruct          ccstruct
CUtil             cutil
CCUtil            ccutil
(no class shown)  cube, lstm

Key Data Structures = Page Hierarchy
Layout          (old) Layout  Core page outlines  Normalized outlines  Results
                                                                       PAGE_RES
WorkingPartSet  TO_BLOCK      BLOCK                                    BLOCK_RES
ColPartition    TO_ROW        ROW                                      ROW_RES
BLOBNBOX                      WERD                TWERD                WERD_RES, WERD_CHOICE
                              C_BLOB              TBLOB                BLOB_CHOICE
                              C_OUTLINE           TESSLINE
                                                  EDGEPT
                                                  TPOINT

Software Engineering - Building Blocks

Coordinates: TBOX, ICOORD, FCOORD
Containers: GenericVector, ELIST, CLIST
Text: STRING, UNICHARSET

Key Parts of the Call Hierarchy
TessBaseAPI::Recognize
  Tesseract::SegmentPage
    Tesseract::AutoPageSeg
    Textord::TextordPage
  Tesseract::recog_all_words
    Tesseract::classify_word_and_language
      Tesseract::chop_word_main
        Wordrec::SegSearch
          Classify::AdaptiveClassifier
          LanguageModel::UpdateState
      Tesseract::LSTMRecognizeWord

Tesseract’s List Implementation
● Predates STL
● Allows control over ownership of list elements
● Uses macros instead of templates
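
The three points above can be sketched with a toy intrusive list. This is a simplified stand-in, not the real ELIST code: Link, RawList, SKETCH_LISTIZE, and Blob are hypothetical names. The macro generates a typed wrapper where a template would be used today, and the generated class owns its elements.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical names throughout; a simplified stand-in for the ELIST idea.
struct Link {  // base class that every list element inherits from
  Link* next = nullptr;
};

// Untyped singly linked list core, shared by all element types.
class RawList {
 public:
  void push_front(Link* e) {
    e->next = head_;
    head_ = e;
    ++size_;
  }
  Link* pop_front() {  // returns nullptr when the list is empty
    Link* e = head_;
    if (e != nullptr) {
      head_ = e->next;
      e->next = nullptr;
      --size_;
    }
    return e;
  }
  std::size_t size() const { return size_; }

 private:
  Link* head_ = nullptr;
  std::size_t size_ = 0;
};

// Macro that generates a typed, owning list class named T_LIST.
#define SKETCH_LISTIZE(T)                                         \
  class T##_LIST {                                                \
   public:                                                        \
    T##_LIST() = default;                                         \
    T##_LIST(const T##_LIST&) = delete;                           \
    T##_LIST& operator=(const T##_LIST&) = delete;                \
    ~T##_LIST() { while (T* e = pop_front()) delete e; }          \
    void push_front(T* e) { raw_.push_front(e); }                 \
    T* pop_front() { return static_cast<T*>(raw_.pop_front()); }  \
    std::size_t size() const { return raw_.size(); }              \
                                                                  \
   private:                                                       \
    RawList raw_;                                                 \
  };

struct Blob : Link {
  explicit Blob(int h) : height(h) {}
  int height;
};
SKETCH_LISTIZE(Blob)  // defines Blob_LIST
```

Elements still in a Blob_LIST are deleted with the list; an element taken out with pop_front() becomes the caller's responsibility, which is the kind of ownership control the bullet refers to.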

List Example
ccstruct/blobbox.h:
  class BLOBNBOX : public ELIST_LINK {
    …
  };
  // Defines classes:
  // BLOBNBOX_LIST: a list of BLOBNBOX
  // BLOBNBOX_IT: list iterator
  ELISTIZEH(BLOBNBOX)

ccstruct/blobbox.h:
  // Implementation of some of the
  // list functions.
  ELISTIZE(BLOBNBOX)

ccstruct/blobbox.h:
  float Textord::filter_noise_blobs(
      BLOBNBOX_LIST *src_list,      // original list
      BLOBNBOX_LIST *noise_list,    // noise list
      BLOBNBOX_LIST *small_list) {  // small blobs
    BLOBNBOX_IT src_it(src_list);   // iterators
    BLOBNBOX_IT noise_it(noise_list);
    BLOBNBOX_IT small_it(small_list);
    for (src_it.mark_cycle_pt(); !src_it.cycled_list();
         src_it.forward()) {
      blob = src_it.data();
      if (blob->bounding_box().height() < textord_max_noise_size)
        noise_it.add_after_then_move(src_it.extract());
      else if (blob->enclosed_area() >=
               blob->bounding_box().area() * textord_noise_area_ratio)
        small_it.add_after_then_move(src_it.extract());
    }

http://www.apache.org/licenses/LICENSE-2.0
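
For readers more familiar with the standard library, the same filtering pattern can be approximated with std::list. This sketch swaps in std::list::splice for the extract() / add_after_then_move() pair; it is not Tesseract code. SketchBlob, FilterNoise, and the threshold parameters are hypothetical stand-ins for BLOBNBOX and the textord_* variables.

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Hypothetical stand-in for BLOBNBOX with only the fields the filter needs.
struct SketchBlob {
  int height;         // bounding box height
  int box_area;       // bounding box area
  int enclosed_area;  // area enclosed by the outline
};

// Moves blobs out of src into noise or small without copying them.
// Blobs that match neither test stay in src.
void FilterNoise(std::list<SketchBlob>* src, std::list<SketchBlob>* noise,
                 std::list<SketchBlob>* small, int max_noise_size,
                 double noise_area_ratio) {
  for (auto it = src->begin(); it != src->end();) {
    auto next = std::next(it);  // remember the successor before moving the node
    if (it->height < max_noise_size) {
      noise->splice(noise->end(), *src, it);  // transfer node to noise list
    } else if (it->enclosed_area >= it->box_area * noise_area_ratio) {
      small->splice(small->end(), *src, it);  // transfer node to small list
    }
    it = next;
  }
}
```

As with extract(), splice relinks the existing node instead of copying the element, so each blob lives in exactly one list at any time.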
TessBaseAPI : Simple example

Main API class provides initialization, image input, text/hOCR/PDF output:


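TessBaseAPI api;
api.Init(NULL, "eng");                // NULL datapath; language "eng"
Pix* pix = pixRead("phototest.tif");  // load the image with Leptonica
api.SetImage(pix);
char* text = api.GetUTF8Text();       // runs recognition; caller owns the buffer
printf("%s\n", text);
delete [] text;
pixDestroy(&pix);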

TessBaseAPI : Multipage example
// filename and fout (an already-open FILE*) are assumed to be set up by the caller.
TessBaseAPI api;
api.Init(NULL, "eng");
tesseract::TessResultRenderer* renderer =
    new tesseract::TessPDFRenderer(api.GetDatapath());
api.ProcessPages(filename, NULL, 0, renderer);
const char* data;
inT32 data_len;
if (renderer->GetOutput(&data, &data_len)) {
  fwrite(data, 1, data_len, fout);
  fclose(fout);
}
delete renderer;  // the renderer was allocated with new above
ResultIterator for getting the real details
ResultIterator* it = api.GetIterator();
if (it != NULL) {  // NULL when there are no recognition results
  do {
    int left, top, right, bottom;
    if (it->BoundingBox(RIL_WORD, &left, &top, &right, &bottom)) {
      char* text = it->GetUTF8Text(RIL_WORD);
      printf("%s %d %d %d %d\n", text, left, top, right, bottom);
      delete [] text;
    }
  } while (it->Next(RIL_WORD));
  delete it;
}
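
The boxes printed above are in image coordinates, with the origin at the top-left, whereas the coordinate note earlier puts (0, 0) at the bottom-left. A minimal sketch of the conversion follows; TopDownBox, BottomUpBox, and ToBottomUp are hypothetical helpers and not part of the Tesseract API, and image_height is assumed to be the height in pixels of the image passed to SetImage.

```cpp
#include <cassert>

// Hypothetical helper types for illustration; not part of the Tesseract API.
struct TopDownBox {   // origin at top-left, y grows downwards
  int left, top, right, bottom;
};
struct BottomUpBox {  // origin at bottom-left, y grows upwards
  int left, bottom, right, top;
};

// Flips the vertical axis; horizontal coordinates are unchanged.
inline BottomUpBox ToBottomUp(const TopDownBox& b, int image_height) {
  return BottomUpBox{b.left, image_height - b.bottom, b.right,
                     image_height - b.top};
}
```

The flip preserves the box height: for a top-down box with top 20 and bottom 40 in a 100-pixel-high image, the bottom-up box has bottom 60 and top 80.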

Thanks for Listening!

Questions?
