Unconstrained Offline Handwritten Word
Abstract—The state-of-the-art methods usually integrate linguistic knowledge into the recognizer, which makes models more complicated and hard to build for resource-lacking languages. This letter proposes a new method for unconstrained offline handwritten word recognition by combining position embeddings with residual networks (ResNets) and bidirectional long short-term memory (BiLSTM) networks. First, ResNets are used to extract abundant features from the input image. Then, position embeddings are used as indices of the character sequence corresponding to a word. By combining the ResNets features with each position embedding, the model generates different inputs for the BiLSTM networks. Finally, the state sequence of the BiLSTM is used to recognize the corresponding characters. Without additional language resources, the proposed model achieved the best character error rate on two public corpora: the 2017 ICDAR competition on word-level information extraction in historical handwritten records and the RIMES public dataset.

Index Terms—Position embedding, residual networks, bidirectional long short-term memory network, off-line handwritten word recognition.

I. INTRODUCTION

OFF-LINE handwritten word or sentence recognition is still a very challenging problem, especially for languages that lack language resources. Traditionally, segmentation is one of the key tasks for word recognition [1]–[4]. Models based on the hidden Markov model (HMM) or the neural network hidden Markov model (NN-HMM) have been successfully applied to segmentation-free word recognition [5], [6]. The main issues of these traditional methods are overfitting and long-distance dependency [7], [8].

In recent years, deep learning has been introduced for recognition tasks. Outstanding performance has been reached on handwriting recognition [9]–[12] and scene text recognition [13]–[15]. In these tasks, a convolutional neural network (CNN) is usually used to extract low/mid/high-level image features automatically. For example, Xie et al. [16] presented a multi-spatial-context fully convolutional recurrent network (MC-FCRN). Jaderberg et al. [17] developed a character sequence model by using a CNN with multiple position-sensitive character classifiers; when an image contains a long sequence of characters, they need to build a large number of classifiers. To improve the capability of handling misalignment between inputs and target labels, the long short-term memory (LSTM) [18] network combined with connectionist temporal classification (CTC) [19] is used for sequence labeling. Based on the CTC model, Zhan et al. [20] used ResNets [21] to extract features; a recurrent neural network (RNN) was used to model the contextual information and predict recognition sequences in Zhan's model. Shi et al. [22] proposed a novel end-to-end scene text recognition architecture, which uses a convolutional recurrent neural network (CRNN) with CTC. As an important deep learning mechanism, the attention model has been successfully applied to text recognition [23], [24]. Shi et al. [25] proposed a flexible rectification mechanism based on the spatial transformer network (STN) for irregular text recognition, and Wojna et al. [26] presented an end-to-end approach with a spatial attention mask for scene text recognition.

To deal with the diversity of writing styles and the similarities between characters, neural networks for handwriting recognition usually rely on constructing additional features and lexicons. Chherawala et al. [27] achieved promising results with features such as histograms, direction distributions, and profiles. Almazán et al. [28] proposed a word spotting and recognition method by embedding both word images and text strings in a common vectorial subspace. Based on Almazán's work, Poznanski et al. [29] presented a CNN-N-Gram method that estimates a word's n-gram frequency profile by constructing a set of attributes. They utilized canonical correlation analysis (CCA) to match the predicted profiles to the true profiles of all words in a big lexicon. Their system has been applied on several handwriting recognition benchmarks and achieved a clear performance gain. The issue of this method is that it requires the construction of a large number of linguistic features, such as unigrams, bigrams, and trigrams. In [28] and [29], the recognition tasks are performed to some extent like retrieval systems, which match the word label against existing dictionaries; such approaches are named lexicon-driven methods. In order to avoid constructing a large number of linguistic features and reduce the dependency on
B. ResNets

In this letter, 101-layer ResNets are employed to learn the features of an image. The network architecture is based on the earlier work of He et al. [21]. Considering an input sample x (here x may have multiple color channels) and an output vector y, a building block of ResNets is defined as:

$$y = \mathcal{F}(x, \{W_i\}) + x \qquad (1)$$

where $\mathcal{F}(x, \{W_i\})$ denotes the residual mapping to be learned [21].
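For concreteness, the following is a minimal PyTorch sketch of such a building block. It shows a simplified two-convolution residual unit with an identity shortcut; it is not the exact bottleneck configuration of the 101-layer ResNets used in the letter.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual building block y = F(x, {W_i}) + x, after He et al. [21].
    A simplified two-convolution sketch; the 101-layer ResNets of the
    letter use deeper bottleneck blocks."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x                       # identity shortcut
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))    # residual mapping F(x, {W_i})
        return self.relu(out + residual)   # y = F(x, {W_i}) + x
```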
PE is then used to distinguish the part of the features belonging to a given character contained in a handwritten word. Though there are many ways to combine a PE with the output of the ResNets, the simple and efficient way of concatenation is used. Let F_g denote the global feature vector output by the ResNets, which is the same for all characters in the given handwritten word. Then the feature of each character can be expressed as

$$F_c^{(i)} = F_g \oplus P_i, \quad i = 1, 2, \ldots, K \qquad (3)$$

where ⊕ denotes the concatenation operation. There are usually two ways to determine the value of each PE [33], [34]: one is to use randomly generated values, and the other is to learn them dynamically with the model. This letter learns the PEs during recognizer training. As shown in Fig. 1, the character feature vector F_c^{(i)} is the ith input of the BiLSTM, and the ith hidden state corresponds to the ith classification vector used by the multilayer perceptrons and a softmax classifier.
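To illustrate how (3) feeds the BiLSTM, here is a minimal PyTorch sketch. The letter does not publish an implementation; the layer sizes (`feat_dim`, `pe_dim`, `hidden`) and the single linear classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PEWordRecognizer(nn.Module):
    """Sketch of Eq. (3): concatenate the global feature F_g with each of the
    K learned position embeddings P_i to form the BiLSTM inputs. Sizes are
    illustrative, not those of the letter."""
    def __init__(self, feat_dim=2048, pe_dim=128, hidden=256, num_labels=80, K=10):
        super().__init__()
        self.K = K
        self.pe = nn.Embedding(K, pe_dim)   # P_1..P_K, learned with the model
        self.bilstm = nn.LSTM(feat_dim + pe_dim, hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_labels)  # MLP + softmax head

    def forward(self, f_g):
        # f_g: (batch, feat_dim) global feature vector from the ResNets
        batch = f_g.size(0)
        pos = self.pe.weight.unsqueeze(0).expand(batch, -1, -1)  # (batch, K, pe_dim)
        f_c = torch.cat([f_g.unsqueeze(1).expand(-1, self.K, -1), pos], dim=-1)
        states, _ = self.bilstm(f_c)         # (batch, K, 2*hidden) hidden states
        return self.classifier(states)       # (batch, K, num_labels) logits
```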
Given a handwritten word containing the character sequence C = {c_1, c_2, ..., c_n}, n ≤ K, we assume A is the set of all character labels that the language needs to predict. The standard output label corresponding to the ith state is given as:

$$c_i = \begin{cases} \text{char}, & \text{char} \in A,\; i \le n \\ \varnothing, & n < i \le K \end{cases} \qquad (4)$$

where ∅ represents the "null" label, which is just a placeholder used when the real length of the given word is shorter than the maximum length K. Assuming that the K prediction results are denoted as S = {s_1, s_2, ..., s_K}, the conditional probability is defined as:

$$p(C \mid I) = \prod_{i=1}^{n} p(s_i = c_i \mid I) \prod_{j=n+1}^{K} p(s_j = \varnothing \mid I) \qquad (5)$$

where p(·) is the probability output of the softmax classifiers. We only extract the characters before the first ∅ label as the word prediction result. Here n is the number of characters contained in the ground-truth label sequence. Given the training dataset X = {I^(d), C^(d)}, d = 0, 1, 2, ..., |X|, where I^(d) is the dth input image and C^(d) is the ground-truth label sequence, let L be the label set of the model, including the character set A and the ∅ label. y_i is the one-hot encoding of the character c_i, i.e., a vector of dimension |L|. The elements of y_i can be expressed as

$$y_{ij} = \begin{cases} 1, & c_i = L[j] \\ 0, & c_i \neq L[j] \end{cases} \qquad (6)$$

Here L[j] denotes the jth class of the label set L. The loss function is defined as:

$$\mathcal{L} = -\frac{1}{|X|} \sum_{d=1}^{|X|} \frac{1}{n+1} \left( \sum_{i=1}^{n} \sum_{j=1}^{|L|} y_{ij}^{(d)} \ln p_{ij}^{(d)} + \sum_{j=1}^{|L|} y_{n+1,j}^{(d)} \ln p_{n+1,j}^{(d)} \right) \qquad (7)$$

where p_{ij} is the probability of c_i = L[j]. In (7), the second term of the cross-entropy corresponds to the "null" of the handwritten character sequence. Since, in the prediction stage, we end the sequence when the first "null" label is encountered, we only count the loss of the first "null" label in the training stage.
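A per-sample sketch of this loss in PyTorch, following the reconstruction of (7) above (one cross-entropy term per character plus a single "null" term, normalized by n + 1; the function name and shapes are ours):

```python
import torch
import torch.nn.functional as F

def word_loss(logits, targets, null_idx):
    """Sketch of Eq. (7) for one sample. logits: (K, |L|) classifier outputs;
    targets: (n,) ground-truth character indices; null_idx: index of the
    'null' label in L. Assumes n < K so a 'null' slot exists."""
    n = targets.size(0)
    log_p = F.log_softmax(logits, dim=-1)                 # ln p_ij
    char_loss = -log_p[torch.arange(n), targets].sum()    # characters c_1..c_n
    null_loss = -log_p[n, null_idx]                       # first 'null' label only
    return (char_loss + null_loss) / (n + 1)
```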
D. The Model Training

In order to overcome overfitting, we add L2 regularization with a weight decay of 0.0001 to the loss function. The network is trained by stochastic gradient descent (SGD) with the momentum set to 0.9. The learning rate is initially set to 0.1 and is divided by 10 every 50 iterations. Considering the huge parameter space of the complex networks, data augmentation is employed to expand the training samples: each input image is rotated, sheared, and zoomed, with parameter ranges of [−5°, +5°], [−0.5, +0.5], and [0.8, 1.2], respectively. In this way, each input image generates 12 additional images. Prediction-side data augmentation is performed in the same way: during the prediction stage, the 13 images are separately passed through the proposed network, the softmax results of the 13 images are averaged, and the sequence of characters with the highest confidence is taken as the predicted result of the test image.
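A sketch of this prediction-side averaging, assuming a hypothetical list `augment_fns` of the 12 rotate/shear/zoom transforms and a `model` that maps an image batch to (batch, K, |L|) logits:

```python
import torch

def predict_with_tta(model, image, augment_fns):
    """Pass the original image plus its 12 augmented variants through the
    network and average the softmax outputs (13 images in total)."""
    model.eval()
    with torch.no_grad():
        variants = [image] + [f(image) for f in augment_fns]      # 13 images
        probs = torch.stack([model(v.unsqueeze(0)).softmax(-1)    # (1, K, |L|)
                             for v in variants]).mean(dim=0)
        # highest-confidence label per position; the predicted word is the
        # sequence of labels before the first 'null'
        return probs.argmax(dim=-1).squeeze(0)                    # (K,)
```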
In the training process, neither segmentation techniques nor language resources are used. We only use an official lexicon in the post-processing stage, which further improves the recognition performance without requiring any additional language resource and keeps the recognition system simple. In this step, we first perform lexicon-free recognition and then select the closest word from the lexicon according to the edit distance metric. The maximal edit distance is set to 7 to limit the search complexity.
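A sketch of this post-processing step; the Levenshtein implementation is standard, and returning the raw prediction when no lexicon word lies within distance 7 is our reading of the search limit:

```python
def lexicon_postprocess(prediction, lexicon, max_dist=7):
    """Replace the lexicon-free prediction by the closest lexicon word,
    but only if that word is within edit distance max_dist."""
    def edit_distance(a, b):              # Levenshtein distance, O(|a||b|)
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                 # deletion
                               cur[j - 1] + 1,              # insertion
                               prev[j - 1] + (ca != cb)))   # substitution
            prev = cur
        return prev[-1]

    best_word, best_dist = prediction, max_dist + 1
    for word in lexicon:
        d = edit_distance(prediction, word)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word
```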
III. EXPERIMENTS

A. Dataset

The experiments in this letter are conducted on two benchmarks. For the 2017 ICDAR IEHHR competition [35], 125 pages of the Esposalles dataset [36], [37] are used for handwriting recognition and named entity recognition (NER). This dataset consists of historical handwritten marriage records from the archives of the Cathedral of Barcelona, written in old Catalan. The training set is composed of 968 marriage records with 31501 isolated word images; the test set is composed of 253 marriage records. Since the test set is not publicly available, we randomly divide the training word images into five equal parts and perform 5-fold cross-validation. The performance of our previous competition system on the test set is also given for comparison: in this competition, we won first place and improved the performance from the organizer's baseline of 70.18% to 91.97% on the test set.

The RIMES dataset [38] was used in the ICDAR 2011 competition as an isolated word recognition task [39]. A dictionary composed of more than 5000 words is also provided. In this letter, we conduct experiments on the training and test sets.

B. Experimental Results

In this letter, we use the same character error rate (CER) measure as in [29], which is based on the Levenshtein distance.
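As a sketch, one common way to compute CER from the Levenshtein distance is shown below, reusing the `edit_distance` helper from the previous sketch; the exact normalization used in [29] is not restated in the letter, so normalizing by the total number of ground-truth characters is an assumption.

```python
def character_error_rate(predictions, ground_truths):
    """Total Levenshtein distance between predicted and ground-truth words,
    divided by the total number of ground-truth characters."""
    total_dist = sum(edit_distance(p, g)
                     for p, g in zip(predictions, ground_truths))
    total_chars = sum(len(g) for g in ground_truths)
    return total_dist / total_chars
```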
TABLE I
COMPARISON TO EXISTING METHODS IN CHARACTER ERROR RATE (%) ON THE RIMES AND ESPOSALLES DATASETS

∗ The result is obtained on the unpublished test set, which is not directly comparable to our result. † The results listed in the 2nd to 6th rows are reported in [29].
TABLE II
COMPARISON TO DIFFERENT VARIANTS OF THE FULL SYSTEM IN CHARACTER ERROR RATE (%)
Fig. 2. Recognition examples. Characters on the left are the ground truth and those on the right are the predictions; wrongly recognized characters are shown in red.
REFERENCES

[1] M.-Y. Chen, A. Kundu, and J. Zhou, "Off-line handwritten word recognition using a hidden Markov model type stochastic network," IEEE Trans. Pattern Anal. Mach. Intell., vol. 16, no. 5, pp. 481–496, May 1994.
[2] C.-L. Liu, H. Sako, and H. Fujisawa, "Effects of classifier structures and training regimes on integrated segmentation and recognition of handwritten numeral strings," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1395–1407, Nov. 2004.
[3] M. Kumar, M. Jindal, and R. Sharma, "Segmentation of isolated and touching characters in offline handwritten Gurmukhi script recognition," Int. J. Inf. Technol. Comput. Sci., vol. 6, no. 2, pp. 58–63, 2014.
[4] Y. Wang, X. Ding, and C. Liu, "Topic language model adaption for recognition of homologous offline handwritten Chinese text image," IEEE Signal Process. Lett., vol. 21, no. 5, pp. 550–553, May 2014.
[5] T.-H. Su, T.-W. Zhang, D.-J. Guan, and H.-J. Huang, "Off-line recognition of realistic Chinese handwriting using segmentation-free strategy," Pattern Recognit., vol. 42, no. 1, pp. 167–182, 2009.
[6] Z.-R. Wang, J. Du, W.-C. Wang, J.-F. Zhai, and J.-S. Hu, "A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition," Int. J. Doc. Anal. Recognit., vol. 21, pp. 241–251, 2018.
[7] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 545–552.
[8] T. Liu and J. Lemeire, "Efficient and effective learning of HMMs based on identification of hidden states," Math. Probl. Eng., vol. 2017, 2017, Art. no. 7318940.
[9] J. Sueiras, V. Ruiz, A. Sanchez, and J. F. Velez, "Offline continuous handwriting recognition using sequence to sequence neural networks," Neurocomputing, vol. 289, pp. 119–128, 2018.
[10] X. Xiao, L. Jin, Y. Yang, W. Yang, J. Sun, and T. Chang, "Building fast and compact convolutional neural networks for offline handwritten Chinese character recognition," Pattern Recognit., vol. 72, pp. 72–81, 2017.
[11] X.-Y. Zhang, F. Yin, Y.-M. Zhang, C.-L. Liu, and Y. Bengio, "Drawing and recognizing Chinese characters with recurrent neural network," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 4, pp. 849–862, Apr. 2018.
[12] Q. Wang and Y. Lu, "A sequence labeling convolutional network and its application to handwritten string recognition," in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 2950–2956.
[13] R. Wang, N. Sang, and C. Gao, "Scene text identification by leveraging mid-level patches and context information," IEEE Signal Process. Lett., vol. 22, no. 7, pp. 963–967, Jul. 2015.
[14] X. Bai, C. Yao, and W. Liu, "Strokelets: A learned multi-scale mid-level representation for scene text recognition," IEEE Trans. Image Process., vol. 25, no. 6, pp. 2789–2802, Jun. 2016.
[15] B. Su and S. Lu, "Accurate recognition of words in scenes without character segmentation using recurrent neural network," Pattern Recognit., vol. 63, pp. 397–405, 2017.
[16] Z. Xie, Z. Sun, L. Jin, H. Ni, and T. Lyons, "Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 8, pp. 1903–1917, Aug. 2018.
[17] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman, "Synthetic data and artificial neural networks for natural scene text recognition," in NIPS Workshop Deep Learn., 2014.
[18] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[19] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks," in Proc. 23rd Int. Conf. Mach. Learn., 2006, pp. 369–376.
[20] H. Zhan, Q. Wang, and Y. Lu, "Handwritten digit string recognition by combination of residual network and RNN-CTC," in Proc. Int. Conf. Neural Inf. Process., 2017, pp. 583–591.
[21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[22] B. Shi, X. Bai, and C. Yao, "An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
[23] F. Bai, Z. Cheng, Y. Niu, S. Pu, and S. Zhou, "Edit probability for scene text recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1508–1516.
[24] Z. Cheng, F. Bai, Y. Xu, G. Zheng, S. Pu, and S. Zhou, "Focusing attention: Towards accurate text recognition in natural images," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 5086–5094.
[25] B. Shi, M. Yang, X. Wang, P. Lyu, C. Yao, and X. Bai, "ASTER: An attentional scene text recognizer with flexible rectification," IEEE Trans. Pattern Anal. Mach. Intell., to be published.
[26] Z. Wojna et al., "Attention-based extraction of structured information from street view imagery," in Proc. 14th IAPR Int. Conf. Doc. Anal. Recognit., 2017, pp. 844–850.
[27] Y. Chherawala, P. P. Roy, and M. Cheriet, "Combination of context-dependent bidirectional long short-term memory classifiers for robust offline handwriting recognition," Pattern Recognit. Lett., vol. 90, pp. 58–64, 2017.
[28] J. Almazán, A. Gordo, A. Fornés, and E. Valveny, "Word spotting and recognition with embedded attributes," IEEE Trans. Pattern Anal. Mach. Intell., vol. 36, no. 12, pp. 2552–2566, Dec. 2014.
[29] A. Poznanski and L. Wolf, "CNN-N-Gram for handwriting word recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2305–2314.
[30] B. Stuner, C. Chatelain, and T. Paquet, "Cascading BLSTM networks for handwritten word recognition," in Proc. 23rd Int. Conf. Pattern Recognit., 2016, pp. 3416–3421.
[31] A. Ul-Hasan, S. B. Ahmed, F. Rashid, F. Shafait, and T. M. Breuel, "Offline printed Urdu Nastaleeq script recognition with bidirectional LSTM networks," in Proc. 12th Int. Conf. Doc. Anal. Recognit., 2013, pp. 1061–1065.
[32] D. Ko, C. Lee, D. Han, H. Ohk, K. Kang, and S. Han, "Approach for machine-printed Arabic character recognition: The state-of-the-art deep-learning method," Electron. Imag., vol. 2018, no. 2, pp. 1–8, 2018.
[33] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin, "Convolutional sequence to sequence learning," in Proc. Int. Conf. Mach. Learn., 2017, pp. 1243–1252.
[34] A. Vaswani et al., "Attention is all you need," in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[35] A. Fornés et al., "ICDAR2017 competition on information extraction in historical handwritten records," in Proc. 14th IAPR Int. Conf. Doc. Anal. Recognit., 2017, vol. 1, pp. 1389–1394.
[36] D. Fernández-Mota, J. Almazán, N. Cirera, A. Fornés, and J. Lladós, "BH2M: The Barcelona historical, handwritten marriages database," in Proc. 22nd Int. Conf. Pattern Recognit., 2014, pp. 256–261.
[37] V. Romero et al., "The Esposalles database: An ancient marriage license corpus for off-line handwriting recognition," Pattern Recognit., vol. 46, no. 6, pp. 1658–1669, 2013.
[38] E. Augustin, M. Carré, E. Grosicki, J.-M. Brodin, E. Geoffrois, and F. Prêteux, "RIMES evaluation campaign for handwritten mail processing," in Proc. Int. Workshop Frontiers Handwriting Recognit., 2006, pp. 231–235.
[39] E. Grosicki and H. El-Abed, "ICDAR 2011 French handwriting recognition competition," in Proc. Int. Conf. Doc. Anal. Recognit., 2011, pp. 1459–1463.
[40] J. I. Toledo, S. Dey, A. Fornés, and J. Lladós, "Handwriting recognition by attribute embedding and recurrent neural networks," in Proc. 14th IAPR Int. Conf. Doc. Anal. Recognit., 2017, vol. 1, pp. 1038–1043.
[41] B. Stuner, C. Chatelain, and T. Paquet, "Self-training of BLSTM with lexicon verification for handwriting recognition," in Proc. 14th IAPR Int. Conf. Doc. Anal. Recognit., 2017, pp. 633–638.