Tesseract 2
Tesseract Blends Old and New OCR Technology - DAS2016 Tutorial - Santorini - Greece
A Note about the Coordinate System
● The pixel edges are aligned with integer coordinates.
● (0, 0) is at bottom-left.
● Width = right - left => no silly +1/-1.
[Figure: pixel grid with both axes labelled at integer pixel edges (0, 1, 2), origin at bottom-left]
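The convention above can be sketched in a few lines of C++. This is only an illustration: `Box`, `FlipY` and `FromTopLeft` are hypothetical names introduced here, not Tesseract types or functions, and the conversion assumes the other system also names pixel edges rather than pixel centres.

```cpp
#include <cassert>

// Hypothetical box type for illustration only.
// Coordinates name pixel edges, with (0, 0) at the bottom-left of the image.
struct Box {
  int left, bottom, right, top;
  int width() const { return right - left; }   // no +1/-1 needed
  int height() const { return top - bottom; }
  int area() const { return width() * height(); }
};

// Converts a y edge coordinate between a top-left-origin system and a
// bottom-left-origin system. The same formula works in both directions.
inline int FlipY(int y, int image_height) { return image_height - y; }

// Builds a bottom-left-origin Box from top-left-origin edge coordinates.
// The old bottom edge becomes the new bottom, the old top the new top.
inline Box FromTopLeft(int left, int top, int right, int bottom,
                       int image_height) {
  return Box{left, FlipY(bottom, image_height), right,
             FlipY(top, image_height)};
}
```

Because edges rather than pixel centres carry the coordinates, a box covering a single pixel has left and right edges one unit apart, so its width is 1 without any correction term.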
Tesseract System Architecture
Nominally a pipeline, but not really, as there is a lot of re-visiting of
old decisions.
Tesseract Word Recognizer
The ‘C’ Legacy
● Large chunks of the code written originally in C.
● Major rewrite in ~1991 with new C++ code.
● C->C++ migration gradual over time since.
● Most global functions now live in convenience classes that mirror the directory structure (for thread compatibility).
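A minimal sketch of why moving global functions into classes helps with threads, assuming a made-up counter as the state being migrated. `g_blob_count`, `count_blob_global` and `Recognizer` are hypothetical names for illustration and do not come from the Tesseract code base.

```cpp
#include <cassert>

// C style: one global shared by every caller, so two recognizers running
// in different threads would overwrite each other's state.
static int g_blob_count = 0;
void count_blob_global() { ++g_blob_count; }

// After migration: the former global becomes a data member and the former
// global function becomes a member function. Each instance owns its state.
class Recognizer {  // hypothetical class, for illustration only
 public:
  void CountBlob() { ++blob_count_; }
  int blob_count() const { return blob_count_; }

 private:
  int blob_count_ = 0;
};
```

With the class form, one instance per thread is enough to keep the state separate; no locking is needed for data that is never shared.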
Directory Structure ~ Functional Architecture
Class          Directory
TessBaseAPI    api
Tesseract      ccmain
Wordrec        wordrec
Textord        textord
Classify       classify
Dict           dict
CCStruct       ccstruct
CUtil          cutil
CCUtil         ccutil
Other directories shown: cube, lstm
Key Data Structures = Page Hierarchy
Layout
(old) Layout
Core page outlines: C_OUTLINE
Normalized outlines: TESSLINE, EDGEPT, TPOINT
Results: PAGE_RES
Software Engineering - Building Blocks
Coordinates: ICOORD, FCOORD
Containers
Text: STRING, UNICHARSET
Key Parts of the Call Hierarchy
TessBaseAPI::Recognize
  Tesseract::SegmentPage
    Tesseract::AutoPageSeg
    Textord::TextordPage
  Tesseract::recog_all_words
    Tesseract::classify_word_and_language
      Tesseract::chop_word_main
        Wordrec::SegSearch
          Classify::AdaptiveClassifier
          LanguageModel::UpdateState
      Tesseract::LSTMRecognizeWord
Tesseract’s List Implementation
● Predates STL
● Allows control over ownership of list elements
● Uses macros instead of templates
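The three points above can be illustrated with a miniature macro-generated intrusive list. This is a sketch under stated assumptions, not Tesseract's ELIST: `Link`, `ListBase`, `MAKE_LIST` and `Blob` are hypothetical names, and the ownership rule (the list deletes whatever it still holds when destroyed) is chosen here only to show how a hand-written list can make ownership explicit.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of a macro-generated intrusive list.
// Elements derive from Link so the list needs no separate node objects.
struct Link {
  Link* next = nullptr;
};

// Untyped list core: stores Link pointers only.
class ListBase {
 public:
  bool empty() const { return head_ == nullptr; }
  void push_front(Link* e) {
    e->next = head_;
    head_ = e;
  }
  // Unlinks and returns the first element; ownership passes to the caller.
  Link* pop_front() {
    Link* e = head_;
    if (e != nullptr) {
      head_ = e->next;
      e->next = nullptr;
    }
    return e;
  }

 private:
  Link* head_ = nullptr;
};

// Generates a typed wrapper class CLASSNAME_LIST without using templates.
// The generated list owns its elements: anything still linked when the
// list is destroyed is deleted. Copying is disabled.
#define MAKE_LIST(CLASSNAME)                               \
  class CLASSNAME##_LIST {                                 \
   public:                                                 \
    CLASSNAME##_LIST() {}                                  \
    ~CLASSNAME##_LIST() {                                  \
      while (CLASSNAME* e = pop_front()) delete e;         \
    }                                                      \
    bool empty() const { return base_.empty(); }           \
    void push_front(CLASSNAME* e) { base_.push_front(e); } \
    CLASSNAME* pop_front() {                               \
      return static_cast<CLASSNAME*>(base_.pop_front());   \
    }                                                      \
                                                           \
   private:                                                \
    CLASSNAME##_LIST(const CLASSNAME##_LIST&);             \
    void operator=(const CLASSNAME##_LIST&);               \
    ListBase base_;                                        \
  };

// Example element type; the macro call defines Blob_LIST.
struct Blob : public Link {
  explicit Blob(int h) : height(h) {}
  int height;
};
MAKE_LIST(Blob)
```

Moving an element between two lists is then a `pop_front` on one followed by a `push_front` on the other, with no copy of the element and no ambiguity about which list will delete it.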
List Example
ccstruct/blobbox.h:
class BLOBNBOX : public ELIST_LINK {
  …
};
// Defines classes:
//   BLOBNBOX_LIST: a list of BLOBNBOX
//   BLOBNBOX_IT: list iterator
ELISTIZEH(BLOBNBOX)

ccstruct/blobbox.cpp:
// Implementation of some of the list functions.
ELISTIZE(BLOBNBOX)

textord/tordmain.cpp:
float Textord::filter_noise_blobs(
    BLOBNBOX_LIST *src_list,      // original list
    BLOBNBOX_LIST *noise_list,    // noise list
    BLOBNBOX_LIST *small_list) {  // small blobs
  BLOBNBOX_IT src_it(src_list);   // iterators
  BLOBNBOX_IT noise_it(noise_list);
  BLOBNBOX_IT small_it(small_list);
  for (src_it.mark_cycle_pt(); !src_it.cycled_list();
       src_it.forward()) {
    BLOBNBOX *blob = src_it.data();
    if (blob->bounding_box().height() < textord_max_noise_size)
      noise_it.add_after_then_move(src_it.extract());
    else if (blob->enclosed_area() >=
             blob->bounding_box().area() * textord_noise_area_ratio)
      small_it.add_after_then_move(src_it.extract());
  }
  …
}
http://www.apache.org/licenses/LICENSE-2.0
TessBaseAPI : Simple example
TessBaseAPI : Multipage example
TessBaseAPI api;
api.Init(NULL, "eng");
// Render every page of the input into a single PDF.
tesseract::TessResultRenderer* renderer =
    new tesseract::TessPDFRenderer(api.GetDatapath());
api.ProcessPages(filename, NULL, 0, renderer);
// Fetch the rendered bytes and write them to fout, an already-open FILE*.
const char* data;
inT32 data_len;
if (renderer->GetOutput(&data, &data_len)) {
  fwrite(data, 1, data_len, fout);
  fclose(fout);
}
delete renderer;
ResultIterator for getting the real details
ResultIterator* it = api.GetIterator();
if (it != NULL) {  // no iterator is returned if nothing was recognized
  do {
    int left, top, right, bottom;
    if (it->BoundingBox(RIL_WORD, &left, &top, &right, &bottom)) {
      char* text = it->GetUTF8Text(RIL_WORD);  // caller owns the string
      printf("%s %d %d %d %d\n", text, left, top, right, bottom);
      delete [] text;
    }
  } while (it->Next(RIL_WORD));
  delete it;
}
Thanks for Listening!
Questions?