Lip Reading Using Deep Learning in Turkish Language
Corresponding Author:
Hadi Pourmousa
Department of Management Information Systems, Faculty of Economics and Administrative Science
Atatürk University
Erzurum, Türkiye
Email: hadi.pourmousa14@ogr.atauni.edu.tr
1. INTRODUCTION
The lip-reading process, which tries to understand what the speaker is saying from lip movements [1],
is one of the most important fields of human action recognition [2]. Lip reading, which is performed using only
visual information, is an impressive skill in noisy environments or when there is no audio stream [3]-[5].
Accordingly, the main purpose of lip-reading studies is to detect and understand spoken expressions from images
alone, without an audio stream [6]-[8].
In recent years, lip-reading studies have started to increase with the spread of deep learning [6].
The lip-reading process is applied at the letter, number, syllable, word, and sentence level [6], but most studies
have worked at the alphabet, word, and sentence level [9].
The most important step in lip-reading studies is to acquire the mouth images of the speaker with
some predefined point coordinates. The movements of these points are then analyzed by classification
methods, such as the k-nearest neighbor classifier, hidden Markov models, and artificial neural networks, to
understand what the speaker is saying [10] (a minimal sketch of this extraction step is given at the end of this
section). Automatic lip reading is used in different fields. However, there are many challenges to be overcome
for the Turkish language: guttural sounds (K, G, Ğ); multiple dialects (kardeş (brother) is pronounced gardaş);
words and verbs made unrecognizable by shortening (gidiyorum (I am going) becomes gidiyom); and the
absence of lip movement in some words (iyi (good), 40, and 2). Different datasets in different languages have
been created for lip reading. Some of the important ones are presented in Table 1.
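The paper does not include code for the mouth-region extraction step described above; the following is only a minimal sketch, assuming dlib's 68-point facial landmark model (in which points 48-67 outline the lips) and OpenCV for frame handling:

```python
# Minimal sketch of mouth-region extraction (not the authors' code).
# Assumes dlib's 68-point landmark model file has been downloaded separately.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame, margin=10):
    """Return the mouth region of the first detected face, or None if no face is found."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    # Landmarks 48-67 outline the outer and inner lips.
    points = np.array([(shape.part(i).x, shape.part(i).y) for i in range(48, 68)],
                      dtype=np.int32)
    x, y, w, h = cv2.boundingRect(points)
    return frame[y - margin:y + h + margin, x - margin:x + w + margin]
```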
2. RELATED WORK
In recent years, lip-reading studies have been carried out using deep learning techniques, datasets
in different languages have been created, and the number of studies is increasing. Until today, most of the
studies have been carried out in English, and several English datasets have been created, unlike other languages.
However, datasets for other languages have now begun to appear, even if they are small. In Turkish, a dataset
of 70 speakers covering 20 numbers was created by Pourmousa and Özen [29] and was trained and tested
with a convolutional neural network. In that study, 56.25% success was achieved, and it was reported that the
absence of lip movement in some Turkish numbers and the similarity of lip movements between some numbers
significantly affected the result.
Atila and Sabaz [6] created two new Turkish datasets, one with 111 words and the other with 113
sentences. In this study, pre-trained models and the bidirectional long short-term memory (Bi-LSTM) method
were used to perform the classification. GoogleNet, ResNet-101, ResNet-50, ResNet-18, NASNet-Large,
Xception, DarkNet53, DarkNet19, AlexNet, SqueezeNet, and DenseNet201 were used as pre-trained models,
and their success was investigated. The highest success was achieved when ResNet-18 and Bi-LSTM were used
together: 84.5% at the word level and 88.55% at the sentence level.
Nambeesan et al. [30] developed a lip-reading system on the MIRACL-VC1 dataset by using the long
short-term memory method. In this study, lip regions were extracted from all frames of a word video and
recorded sequentially in a single photograph. As a result of the study, 85.5% accuracy was obtained.
Sarhan et al. [31] developed a hybrid lip-reading model based on deep convolutional neural networks for
lip reading from videos. The proposed model consists of preprocessing, encoder, and decoder stages. As a result
of the study, 92% success was achieved for unseen speakers and 99% for overlapped speakers.
Ma et al. [32] developed a lip-reading system using convolutional neural network on lip reading in
the wild (LRW) and LRW-1000 dataset. In this study, 88.5% accuracy was obtained on the LRW dataset and
46.6% on the LRW-1000 dataset. Elrefaei et al. [28] proposed a lip-reading method for the Arabic language. In
that study, 22 people participated and recorded videos of 10 words with their smartphones. As a result,
79% accuracy was obtained. Noda et al. [5] used a convolutional neural network as the feature-extraction
mechanism for visual speech recognition. The network was trained using images of the speaker's mouth
region. The proposed system was evaluated on an audio-visual speech dataset containing 300 Japanese words
from six different speakers, and 58% success was achieved.
3. DATASET
In this section, the details of the dataset are explained. First, the characteristics of the words in the
dataset and how they were collected are described; then, how the dataset was created and the pre-processing
applied to it are explained. Finally, an example of the input given to the system is presented.
Figure 3. Recording of lip movement as frames
Figure 4. Sequential recording of lip frames in one image
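The paper does not state the exact packing layout behind Figure 4; the sketch below illustrates the idea under the assumption of a 4×4 grid of 56×56 lip crops, chosen only so that one word video yields a single 224×224 image (both numbers are assumptions, not values taken from the paper):

```python
# Sketch of packing the lip crops of one word video into a single image
# (the idea behind Figure 4). The 4x4 grid and 56x56 cell size are assumptions.
import cv2
import numpy as np

def frames_to_grid(lip_frames, cell=(56, 56), grid=(4, 4)):
    """Tile the lip crops of one video into a single grid image (assumes at least one crop)."""
    n = grid[0] * grid[1]
    frames = list(lip_frames)[:n]
    frames += [frames[-1]] * (n - len(frames))   # repeat the last crop if the clip is short
    cells = [cv2.resize(f, cell) for f in frames]
    rows = [np.hstack(cells[r * grid[1]:(r + 1) * grid[1]]) for r in range(grid[0])]
    return np.vstack(rows)
```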
The dataset prepared as described above is divided into training and validation sets. The test
set is completely separate and consists of data not found in the training and validation sets. The number
of samples in each class is not equal, because some words were said incorrectly or the video contained an
error while the word was being said. Therefore, the training set contains at least 295 samples per class, with
25 per class for validation and 20 per class for testing.
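Assuming the three splits are stored in separate class-labelled folders (the directory names below are hypothetical), they could be loaded as 224×224 grayscale images with Keras as follows; the test split is left unshuffled so that later predictions line up with their labels:

```python
# Loading the training/validation/test splits; directory names are hypothetical.
import tensorflow as tf

def load_split(path, shuffle):
    return tf.keras.utils.image_dataset_from_directory(
        path,
        color_mode="grayscale",   # single-channel input, as described in the method section
        image_size=(224, 224),
        batch_size=32,            # batch size is an assumption
        shuffle=shuffle,
    )

train_ds = load_split("dataset/adjectives/train", shuffle=True)
val_ds = load_split("dataset/adjectives/validation", shuffle=False)
test_ds = load_split("dataset/adjectives/test", shuffle=False)
```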
features can be obtained from each location [35]. The pooling layer, which is usually added after a convolution
layer, is an important step in convolution-based systems that reduces the dimensionality of the feature maps [48].
The two main purposes of this layer are, first, to reduce the number of parameters or weights and thereby the
computational cost, and second, to control overfitting by reducing the spatial size of the data [49],
[50]. Generally, two types of pooling are used: maximum pooling and average pooling (Figure 6).
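As a small numerical illustration of these two operations (not tied to the study's data), applying 2×2 max and average pooling to a 4×4 feature map gives:

```python
# Tiny demonstration of the two common pooling operations on a 4x4 feature map.
import numpy as np
import tensorflow as tf

fmap = np.array([[1, 3, 2, 1],
                 [4, 6, 5, 2],
                 [7, 8, 3, 0],
                 [2, 1, 9, 4]], dtype=np.float32).reshape(1, 4, 4, 1)

max_pool = tf.keras.layers.MaxPooling2D(pool_size=2)(fmap)      # keeps the largest value in each 2x2 window
avg_pool = tf.keras.layers.AveragePooling2D(pool_size=2)(fmap)  # averages each 2x2 window

print(max_pool.numpy().squeeze())  # [[6. 5.] [8. 9.]]
print(avg_pool.numpy().squeeze())  # [[3.5 2.5] [4.5 4. ]]
```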
5. RESEARCH METHOD
In this study, convolutional neural networks, which are highly successful in image processing and
classification, were used, because all videos are divided into frames and recorded on a single image. Since the
images do not contain many features, they are given to the system in grayscale to increase the speed of the study.
As seen in the model presented in Figure 7, the input was first changed to 224×224×1: all images were resized
and converted to grayscale to increase the speed of the system.
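The exact layer configuration of Figure 7 is not reproduced in this excerpt; the block below is therefore only a generic small CNN with the stated 224×224×1 grayscale input, a stand-in for the proposed model rather than a replication of it:

```python
# Generic small CNN with the stated 224x224x1 grayscale input;
# this is a stand-in, not the layer configuration of Figure 7.
import tensorflow as tf

num_classes = 30  # number of word classes in one sub-dataset (assumed placeholder)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(224, 224, 1)),
    tf.keras.layers.Rescaling(1.0 / 255),                 # scale pixel values to [0, 1]
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])
```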
Figure 8. Training and validation accuracy and training and validation loss for adjectives dataset
A large dataset plays an important role in making a deep learning system more successful. The
dataset of this study is very small, and most of the data were derived from other data. The confusion
matrix of the adjectives dataset is presented in Figure 10. 100% success was achieved for most adjectives,
such as “Other”, “White”, “Correct”, “Economic”, “Easy”, “Small”, “Possible”, “Important”, “Black”, and
“International”, which the system predicted completely correctly. The system performed worst on the adjectives
“Slow” (0% correct), “continually” (20%), and “OK” (25%). In addition, for adjectives such as “Great”,
“Beautiful”, “Necessary”, “Clean”, and “High”, the system made mostly correct predictions, with 50% or
more success.
The system made more errors on words with similar lip movements (Figure 11). The lip
movements of the adjectives “Beautiful” (Figure 11(a)) and “Easy” (Figure 11(b)) are similar; therefore,
the system predicted “Easy” for 33% of the “Beautiful” samples. Also, the lip movement of “continually” is
similar to those of “Beautiful” and “High”, and thus the system nearly failed on it, with only 20% correct
guesses. To eliminate these errors and enable the system to learn better, the dataset and the number of epochs
must be increased.
The model was run again with 50 epochs and the Adam optimizer for the nouns dataset. Training and
validation accuracy and loss for the nouns dataset are shown in Figure 12. Training accuracy was 78.67% and
validation accuracy was 81.21%. The system showed lower training and validation accuracy than on the
adjectives dataset.
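Given the reported settings (Adam optimizer, 50 epochs), a minimal training and evaluation loop might look like the sketch below; it reuses the `model`, `train_ds`, `val_ds`, and `test_ds` objects from the earlier sketches rather than the authors' actual pipeline:

```python
# Minimal training/evaluation sketch consistent with the reported settings
# (Adam optimizer, 50 epochs). `model`, `train_ds`, `val_ds`, and `test_ds`
# are the hypothetical objects defined in the earlier sketches.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(train_ds, validation_data=val_ds, epochs=50)

test_loss, test_acc = model.evaluate(test_ds)   # held-out recordings not seen in training
print(f"test accuracy: {test_acc:.2%}")
```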
Figure 12. Training and validation accuracy and training and validation loss for nouns dataset
The success rate of the nouns dataset was 71.88% on the test set (Figure 13). As with the
adjectives dataset, the training and validation sets are likely to be similar to each other, whereas the test set
consists of completely different data and is given to the system only at test time. For comparison, Faisal and
Manzoor [52] achieved 62% success on a word dataset in the Urdu language.
The confusion matrix of the nouns dataset is presented in Figure 14. The system showed 100% success
for 17 nouns. However, it failed completely (0%) on some words such as “Car”, “Section”, “Technology”,
and “Country”, and partially failed on the nouns “Money” and “Part” with less than 50% correct guesses. For
the remaining nouns, 50% or more success was achieved.
The model was run again with 50 epochs and the Adam optimizer for the verbs dataset. Training and
validation accuracy and loss for the verbs dataset are shown in Figure 15. Training accuracy was 81.16% and
validation accuracy was 78.74%. The system showed better training and validation performance than on the
adjectives and nouns datasets.
Figure 15. Training and validation accuracy and training and validation loss for verbs dataset
The success rate of the verbs dataset was 79.69% on the test set (Figure 16). As with the
adjectives dataset, the training and validation sets are likely to be similar to each other, whereas the test set
consists of completely different data and is given to the system only at test time. For comparison, Ma et al. [32]
achieved 88.5% success at the word level on the LRW dataset, and Noda et al. [5] achieved 58% success on a
Japanese word dataset. The confusion matrix of the verbs dataset is presented in Figure 17. As seen in the
confusion matrix, the model showed 100% success for 9 verbs and 50% or more success for all other verbs. The
verbs dataset performed better than the other datasets in the testing phase.
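The per-class accuracies quoted for Figures 10, 14, and 17 come from confusion matrices; a standard way to compute such a matrix from the test predictions, again assuming the `model` and unshuffled `test_ds` from the sketches above, is:

```python
# Normalized confusion matrix from test-set predictions; `model` and `test_ds`
# are the hypothetical objects defined in the earlier sketches.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

y_true = np.concatenate([y.numpy() for _, y in test_ds])   # integer labels, in dataset order
y_pred = np.argmax(model.predict(test_ds), axis=1)         # predicted class per sample

cm = confusion_matrix(y_true, y_pred, normalize="true")    # diagonal = per-class accuracy
ConfusionMatrixDisplay(cm, display_labels=test_ds.class_names).plot(xticks_rotation="vertical")
plt.show()
```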
There is lip movement in all of the verbs, the lip movements of the verbs are not similar to each other, and
the number of frames is high, so the system learned better than on the other datasets. In addition to the developed
convolutional neural network model, pre-trained models were also trained and tested on the datasets. In this
study, the pre-trained VGG and Inception-V3 models were trained on the datasets and their success rates were
examined. However, the VGG model showed only 10% success on the adjectives dataset, while the Inception-V3
model showed 64.06% success on the verbs dataset. Table 3 summarizes the results, the number of repetitions,
and the success rates.
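The excerpt does not detail how VGG and Inception-V3 were adapted to the single-channel inputs; one plausible setup, sketched below under the assumption of a frozen ImageNet backbone with a small classification head (and the grayscale channel simply repeated three times), is:

```python
# Hedged sketch of the pre-trained-model experiments; the exact setup
# (classification head, freezing, preprocessing) is not given in the paper.
import tensorflow as tf

BACKBONES = {
    "vgg16": tf.keras.applications.VGG16,
    "inception_v3": tf.keras.applications.InceptionV3,
}

def transfer_model(backbone_name, num_classes, input_shape=(224, 224, 1)):
    """Frozen ImageNet backbone with a small classification head on top."""
    inputs = tf.keras.Input(shape=input_shape)
    rgb = tf.keras.layers.Concatenate()([inputs] * 3)      # repeat the grayscale channel to mimic RGB
    base = BACKBONES[backbone_name](include_top=False, weights="imagenet",
                                    input_shape=input_shape[:2] + (3,))
    base.trainable = False                                 # keep the pre-trained features fixed
    x = tf.keras.layers.GlobalAveragePooling2D()(base(rgb))
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

model = transfer_model("inception_v3", num_classes=30)     # class count is a placeholder
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
```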
7. CONCLUSION
In this study, a visual lip-reading system was proposed for the Turkish language. A convolutional neural
network model was proposed and trained, and pre-trained models were also used to increase success. The proposed
model was trained and tested on the adjectives, nouns, and verbs datasets and achieved 75%, 71.88%, and
79.69% success, respectively. In the adjectives dataset, the lip movements of adjectives such as “beautiful”, “easy”,
and “continually” are similar to each other; in the nouns dataset, the lip movements of words such as “money”
and “piece” are similar, as are those of “animal” and “car”. These similarities reduced the success rate. In addition,
one of the biggest problems of the study was that the dataset was small, was not recorded in the required
environment, and most of the participants did not look at the camera correctly. Most of the studies conducted at
the word level in this area have achieved a success rate of less than 75%; therefore, the result can be considered
good when the success achieved at the word level is compared with other studies. Most people did not send videos
because they did not trust the process, and thus the very small dataset was one of the major limitations of the study.
In addition, some videos were not recorded in the required environment, or the speakers were not looking toward
the camera; therefore, their lip movements could not be understood by the system. For this reason, future studies
can examine how much the success rate is affected by keeping the camera fixed in one place and having speakers
say words while looking at the camera from different angles. In addition, a Turkish sentence dataset can be
collected and its success rate compared with the number and word datasets. Other pre-trained models can also
be run on these datasets.
REFERENCES
[1] Y. Li, Y. Takashima, T. Takiguchi, and Y. Ariki, “Lip reading using a dynamic feature of lip images and convolutional neural
networks,” in 2016 IEEE/ACIS 15th International Conference on Computer and Information Science (ICIS), Jun. 2016, pp. 1–6,
doi: 10.1109/ICIS.2016.7550888.
[2] S. Agrawal, V. R. Omprakash, and Ranvijay, “Lip reading techniques: A survey,” in 2016 2nd International Conference on Applied
and Theoretical Computing and Communication Technology (iCATccT), 2016, pp. 753–757, doi:
10.1109/ICATCCT.2016.7912100.
[3] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, “Lip reading sentences in the wild,” in 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Jul. 2017, pp. 3444–3453, doi: 10.1109/CVPR.2017.367.
[4] X. Chen, J. Du, and H. Zhang, “Lipreading with DenseNet and resBi-LSTM,” Signal, Image and Video Processing, vol. 14, no. 5,
pp. 981–989, Jul. 2020, doi: 10.1007/s11760-019-01630-1.
[5] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata, “Lipreading using convolutional neural network,” in Interspeech
2014, Sep. 2014, pp. 1149–1153, doi: 10.21437/Interspeech.2014-293.
[6] Ü. Atila and F. Sabaz, “Turkish lip-reading using Bi-LSTM and deep learning models,” Engineering Science and Technology, an
International Journal, vol. 35, p. 101206, Nov. 2022, doi: 10.1016/j.jestch.2022.101206.
[7] A. Garg, J. Noyola, and S. Bagadia, “Lip reading using CNN and LSTM,” Technical Report, Stanford University, 2016. [Online].
Available: https://cs231n.stanford.edu/reports/2016/pdfs/217_Report.pdf.
[8] B. Martinez, P. Ma, S. Petridis, and M. Pantic, “Lipreading using temporal convolutional networks,” in 2020 IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 6319–6323, doi:
10.1109/ICASSP40776.2020.9053841.
[9] T. Ozcan and A. Basturk, “Lip reading using convolutional neural networks with and without pre-trained models,” Balkan Journal
of Electrical and Computer Engineering, vol. 7, no. 2, pp. 195–201, Apr. 2019, doi: 10.17694/bajece.479891.
[10] A. Yargic and M. Dogan, “A lip reading application on MS Kinect camera,” in 2013 IEEE INISTA, Jun. 2013, pp. 1–5, doi:
10.1109/INISTA.2013.6577656.
[11] J. R. Movellan, “Visual speech recognition with stochastic networks,” in Advances in Neural Information Processing Systems, 1994,
pp. 851–858.
[12] O. Vanegas, K. Tokuda, and T. Kitamura, “Location normalization of HMM-based lip-reading: experiments for the M2VTS
database,” in Proceedings 1999 International Conference on Image Processing, 1999, pp. 343–347 vol.2, doi:
10.1109/ICIP.1999.822914.
[13] I. Matthews, T. F. Cootes, J. A. Bangham, S. Cox, and R. Harvey, “Extraction of visual features for lipreading,” IEEE Transactions
on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 198–213, 2002, doi: 10.1109/34.982900.
[14] A. Vorwerk, X. Wang, D. Kolossa, S. Zeiler, and R. Orglmeister, “WAPUSK20 - a database for robust audiovisual speech
recognition,” in Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC’10), 2010,
pp. 3016–3019.
[15] C. Neti et al., “Audio visual speech recognition,” Workshop Final Report, pp. 1-84, 2000.
[16] A. Ortega et al., “AV@CAR: a Spanish multichannel multimodal corpus for in-vehicle automatic audio-visual speech recognition,”
in Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC’04), 2004.
[17] E. K. Patterson, S. Gurbuz, Z. Tufekci, and J. N. Gowdy, “CUAVE: A new audio-visual database for multimodal human-computer
interface research,” in IEEE International Conference on Acoustics Speech and Signal Processing, 2002, pp. 2017-2020, doi:
10.1109/ICASSP.2002.5745028.
[18] C. Petr, Ž. Miloš, K. Zdeněk, K. Jakub, Z. Jan, and M. Luděk, “Design and recording of Czech speech corpus for audio-visual
continuous speech recognition,” in Auditory-Visual Speech Processing Workshop 2005, 2005, pp. 1–4.
[19] S. Cox, R. Harvey, Y. Lan, J. Newman, and B.-J. Theobald, “The challenge of multispeaker lip-reading,” in Audio Visual Speech
Processing AVSP, Brisbane, 2008, 2008, pp. 179–184.
[20] S. Tamura et al., “CENSREC-1-AV: an audio-visual corpus for noisy bimodal speech recognition,” International Conference on
Audio-Visual Speech Processing, pp. 1-4, 2010.
[21] A. G. Chitu, K. Driel, and L. J. M. Rothkrantz, “Automatic lip reading in the Dutch language using active appearance models on
high speed recordings,” in Text, Speech and Dialogue, Springer Berlin Heidelberg, 2010, pp. 259–266.
[22] I. Anina, Z. Zhou, G. Zhao, and M. Pietikainen, “OuluVS2: A multi-view audiovisual database for non-rigid mouth
motion analysis,” in 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG),
May 2015, pp. 1–5, doi: 10.1109/FG.2015.7163155.
[23] V. Estellers and J.-P. Thiran, “Multi-pose lipreading and audio-visual speech recognition,” EURASIP Journal on Advances in Signal
Processing, vol. 2012, no. 1, p. 51, Dec. 2012, doi: 10.1186/1687-6180-2012-51.
[24] A. Rekik, A. B. -Hamadou, and W. Mahdi, “A new visual speech recognition approach for RGB-D cameras,” in Lecture Notes in
Computer Science, Springer International Publishing, 2014, pp. 21–28.
[25] V. Verkhodanova, A. Ronzhin, I. Kipyatkova, D. Ivanko, A. Karpov, and M. Železný, “HAVRUS corpus: high-speed recordings
of audio-visual Russian speech,” in Speech and Computer, Springer International Publishing, 2016, pp. 338–345.
[26] S. Petridis, J. Shen, D. Cetin, and M. Pantic, “Visual-only recognition of normal, whispered and silent speech,” in 2018 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 6219–6223, doi:
10.1109/ICASSP.2018.8461596.
[27] S. Yang et al., “LRW-1000: a naturally-distributed large-scale benchmark for lip reading in the wild,” in 2019 14th IEEE
International Conference on Automatic Face & Gesture Recognition (FG 2019), 2019, pp. 1–8, doi: 10.1109/FG.2019.8756582.
[28] L. A. Elrefaei, T. Q. Alhassan, and S. S. Omar, “An Arabic visual dataset for visual speech recognition,” Procedia Computer
Science, vol. 163, pp. 400–409, 2019, doi: 10.1016/j.procs.2019.12.122.
[29] H. Pourmousa and Ü. Özen, “Lip reading using CNN for Turkish numbers,” Journal of Business in The Digital Age, vol. 5, no. 2,
pp. 155-160, Sep. 2022, doi: 10.46238/jobda.1100903.
[30] A. S. Nambeesan, C. Payyappilly, E. J.C, J. J. P, and M. S. Alex, “Lip reading using facial feature extraction and deep learning,”
International Journal of Innovative Science and Research Technology, vol. 6, no. 7, pp. 92–96, 2021.
[31] A. M. Sarhan, N. M. Elshennawy, and D. M. Ibrahim, “HLR-Net: a hybrid lip-reading model based on deep convolutional neural
networks,” Computers, Materials & Continua, vol. 68, no. 2, pp. 1531–1549, 2021, doi: 10.32604/cmc.2021.016509.
[32] P. Ma, B. Martinez, S. Petridis, and M. Pantic, “Towards practical lipreading with distilled and efficient models,” in ICASSP 2021
- 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun. 2021, pp. 7608–7612, doi:
10.1109/ICASSP39728.2021.9415063.
[33] K. Sarıgül, “En çok kullanılan 1000 türkçe kelime,” Türkçe Öğretimi. Accessed: Oct. 21, 2021. [Online]. Available:
https://www.turkceogretimi.com/tavsiyeler/en-cok-kullanilan-1000-turkce-kelime
[34] K. Sarıgül, “Türkçede en çok kullanılan 200 fiil,” Türkçe Öğretimi. Accessed: Oct. 21, 2021. [Online]. Available:
https://www.turkceogretimi.com/tavsiyeler/turkcede-en-cok-kullanilan-200-fiil
[35] Y. LeCun et al., “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551,
Dec. 1989, doi: 10.1162/neco.1989.1.4.541.
[36] Z. Li, F. Liu, W. Yang, S. Peng, and J. Zhou, “A survey of convolutional neural networks: analysis, applications, and prospects,”
IEEE Transactions on Neural Networks and Learning Systems, vol. 33, no. 12, pp. 6999–7019, Dec. 2022, doi:
10.1109/TNNLS.2021.3084827.
[37] R. Yamashita, M. Nishio, R. K. G. Do, and K. Togashi, “Convolutional neural networks: an overview and application in radiology,”
Insights into Imaging, vol. 9, no. 4, pp. 611–629, Aug. 2018, doi: 10.1007/s13244-018-0639-9.
[38] A. Kamilaris and F. X. P. -Boldú, “A review of the use of convolutional neural networks in agriculture,” Journal of Agricultural
Science, vol. 156, no. 3, pp. 312–322, 2018, doi: 10.1017/S0021859618000436.
[39] K. Ryczko, K. Mills, I. Luchak, C. Homenick, and I. Tamblyn, “Convolutional neural networks for atomistic systems,”
Computational Materials Science, vol. 149, pp. 134–142, Jun. 2018, doi: 10.1016/j.commatsci.2018.03.005.
[40] T. Guo, J. Dong, H. Li, and Y. Gao, “Simple convolutional neural network on image classification,” in 2017 IEEE 2nd International
Conference on Big Data Analysis (ICBDA), Mar. 2017, pp. 721–724, doi: 10.1109/ICBDA.2017.8078730.
[41] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. F.-Fei, “Large-scale video classification with convolutional
neural networks,” in 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725–1732, doi:
10.1109/CVPR.2014.223.
[42] Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods
in Natural Language Processing (EMNLP), 2014, pp. 1746–1751, doi: 10.3115/v1/D14-1181.
[43] S. Hershey et al., “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), Mar. 2017, pp. 131–135, doi: 10.1109/ICASSP.2017.7952132.
[44] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D. Yu, “Convolutional neural networks for speech recognition,”
IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 10, pp. 1533–1545, Oct. 2014, doi:
10.1109/TASLP.2014.2339736.
[45] K. Simonyan and A. Zisserman, “Very deep convolutional networks for Large-Scale image recognition,” Computer Vision and
Pattern Recognition, Sep. 2014.
[46] C. Affonso, A. L. D. Rossi, F. H. A. Vieira, and A. C. P. de L. F. de Carvalho, “Deep learning for biological image classification,”
Expert Systems with Applications, vol. 85, pp. 114–122, Nov. 2017, doi: 10.1016/j.eswa.2017.05.039.
[47] F. Güven, “Using text representation and deep learning methods for Turkish text classification,” Master Thesis, Çukurova
University, Adana, Turkey, 2019.
[48] H. Gholamalinezhad and H. Khosravi, “Pooling methods in deep neural networks, a review,” Computer Vision and Pattern
Recognition, Sep. 2020.
[49] D. T. Tran, A. Iosifidis, and M. Gabbouj, “Improving efficiency in convolutional neural networks with multilinear filters,” Neural
Networks, vol. 105, pp. 328–339, Sep. 2018, doi: 10.1016/j.neunet.2018.05.017.
[50] H. Wu and J. Zhao, “Deep convolutional neural network model based chemical process fault diagnosis,” Computers & Chemical
Engineering, vol. 115, pp. 185–197, Jul. 2018, doi: 10.1016/j.compchemeng.2018.04.009.
[51] A. M. Karim, “A new framework by using deep learning techniques for data processing,” Ph.D. Thesis, Department of Computer
Engineering, Ankara Yıldırım Beyazıt University, Ankara, Turkey, 2018.
[52] M. Faisal and S. Manzoor, “Deep learning for lip reading using audio-visual information for Urdu language,” Computer Vision and
Pattern Recognition, Feb. 2018.
BIOGRAPHIES OF AUTHORS
Dr. Üstün Özen is a full Professor and Senior Lecturer in the Management Information
Systems Department at Atatürk University, Erzurum, Türkiye. He also serves as department
chair. His research interests focus on social media analysis, data analytics, and health
informatics. He has published several journal and conference papers, and books. He can be contacted at
email: uozen@atauni.edu.tr.