
CLASSIFICATION OF SPEECH USING K-NN MODEL

1Dr. P. V. Rama Raju, 2G. Naga Raju, 3K. N. V. Satyanarayana, 4R. Devi Priya, 4S. Krishna Sree, 4T. S. S. Prasad

1Professor & HOD, 2,3Asst. Professor, 4B. Tech Students
Department of ECE, SRKR Engineering College (A), Bhimavaram, India.

ABSTRACT: Throughout the history of human existence, speech has been the most dominant and convenient method of communication. However, many problems can arise from speech miscommunication: a noisy communication medium, word pronunciations, speakers' accents, and language barriers, among others. These issues can affect the understanding and interpretation of the spoken words. This is troublesome for normal-hearing listeners, but it is even worse for listeners with hearing impairments. Hearing aids are used to help persons with hearing loss. One problem with hearing aids is that they generally amplify all signals received through their microphones. Digital hearing aids, however, improve hearing by reducing background noise and significantly improving sound quality. Even so, hearing aids do not improve overall hearing for the hearing impaired when the challenge, especially for the elderly, is recognizing high-frequency sounds, which, like most consonants, are typically low in energy. In this paper, speech was classified using the K-Nearest Neighbour (K-NN) method, a non-parametric approach, in order to distinguish vowels from consonants. The K-NN model was built in MATLAB by parsing the phoneme data in the TIMIT training database and generating the corresponding Mel Frequency Cepstrum Coefficients (MFCC) for each phoneme. The trained K-NN classifier model was generated in MATLAB from these phoneme MFCCs. The classifier was then tested on the TIMIT test database to determine its performance. Results showed that the classifier achieved an accuracy ranging from 84% to 96%.

1. INTRODUCTION:
The great practical difference between the word, written or spoken, and the visual image is that we cannot read the former unless we have been initiated into the mystery of language, whereas visual images can be made intelligible to anyone who has eyes. Human visual perception is a far more complex and selective process than that which a film records. However, unlike humans, who are limited to the visual band of the electromagnetic (EM) spectrum, imaging machines cover almost the entire EM spectrum, ranging from gamma rays to radio waves. They can also operate on images generated by sources that humans are not accustomed to associating with images, including ultrasound, electron microscopy, parametric imaging and computer-generated images. Thus, digital image processing encompasses a wide and varied field of applications.
A fundamental distinctive unit of a spoken language is the phoneme; the phoneme is distinctive in the sense that it is a speech sound class that differentiates the words of a language. Mute people are usually deprived of normal communication with other people in society. It has been observed that they often find it difficult to interact with hearing people through gestures, as only very few gestures are recognized by most people. Since people with hearing impairment cannot speak like hearing people, they have to depend on some form of visual communication most of the time; hence, sign language evolved. Sign language became the principal means of communication in the deaf community. Like any other language, it has its own grammar and vocabulary, but it uses the visual modality for exchanging information. The trouble arises when deaf or mute people try to express themselves to hearing people with the aid of this sign language grammar, because hearing people are usually unaware of it. As a result, the communication of a deaf-mute person is often limited to his or her family or the deaf-mute community. The study described in this project intends to produce a system that helps deaf-mute people by recognizing sign language from static images of the palm side of the right hand and converting it into text. The project uses the contourlet transform for feature extraction, and a K-NN classifier [3][4] is used for sign recognition and classification according to the sign gesture viewed. In this paper, we take up one of the social challenges and offer a futuristic solution for deaf and mute people communicating with hearing people. The aim of this task is therefore to prepare an intelligent system which can work dynamically as a translator between sign language and spoken language, and can make the communication between people with hearing impairment and hearing people both efficient and effective. The recognized signs are presented as voice and text output.
2. LITERATURE SURVEY:
Speech is produced by the movement of the speech production organs located in the top half of the human body, known as articulators. These include the lips, teeth, tongue, lungs, trachea, glottis, larynx, pharynx, oral cavity and nasal cavity. During speech production, the shape of the vocal tract varies due to the movement of the articulators in the oral cavity, namely the tongue, jaws, lips and velum. If there is any abnormality in the movement of these articulators, a speech impediment occurs. Speech impediments can have a severe impact on the ability to hear the intended spoken words, but this is made even worse for persons with hearing impairments. Over the years, much has been put into place for persons with hearing impediments. One such device is the hearing aid. Hearing aids have gone through five major periods: the acoustic era, the carbon hearing aid era, the vacuum tube era, the transistor era, and the most recent, the microelectronics era. All of these have contributed significantly to achieving a tiny wearable device which can fit in the canal, in the ear, or behind the ear, increasing the quality of speech delivered to a person's ears. Many users of hearing instruments are not satisfied with the quality they receive, as these hearing aids simply amplify all the sounds in the environment rather than only what the person has trouble hearing. Advanced algorithms and more powerful signal processing have made it possible to produce better hearing aid units. One method of enhancing the hearing aid successfully is to classify speech into phonemes and correctly manipulate the phonemes which are not being heard clearly. Figure 1 shows the basic layout of a speech recognition system in which a phoneme classifier can be embedded.

(1) Acoustic wave block: the input speech from the speaker.
(2) Front-end analysis. This block: (i) extracts acoustic features from the input speech wave; (ii) outputs a compact, efficient set of parameters that represent the input speech properties; (iii) uses one of three processing techniques to capture the necessary properties of the acoustic input wave. These techniques are: (a) Linear Predictive Coding (LPC), (b) Mel Frequency Cepstral Coefficients (MFCC), and (c) Perceptual Linear Prediction (PLP). For this study, MFCC is used, as it is best designed to capture the positions and widths of the formants (precisely those resonant frequencies of the vocal tract when pronouncing a vowel that are acoustically perceivable), and it offers an easy interpretation and a compact representation.
(3) Acoustic pattern recognition. This block: (i) measures the similarity between the input speech and a reference pattern or model (obtained during training); (ii) determines, as its output, the reference or model which best matches the input speech.
Acoustic model: the incoming speech features from the front-end part are modelled by this unit. Several speech models exist: the Hidden Markov Model (HMM), the Gaussian Mixture Model (GMM) and the K-Nearest Neighbour (K-NN) model, to name a few of the most popular acoustic models.
HMM has been the most popular model used in speech recognition processing. However, GMM-based models, which include HMMs, fail to capture long-term (i.e., longer than one sentence) temporal dependencies in acoustic features. These weaknesses are the natural result of using statistical models (HMMs) that generalize easily: speaker properties are aggregated into the statistics of the model, and information is therefore lost. By contrast, in a non-parametric model (such as K-NN), all the information from the training data is retained, not just statistical approximations. Keeping all the information from the training data means retaining fine phonetic detail, possibly resulting in more accurate speech classification and recognition. K-NN is therefore used in this study.
K-NN is a non-parametric learning algorithm used in various types of classification and prediction problems due to its simplicity and versatility. Non-parametric approaches calculate a density estimate based only on the genuine characteristics of the data, generally according to information derived from the close neighbourhood of the query point itself. The nearest-neighbour classification rule is a simple but effective classifier which associates a sample with the a posteriori distribution of its nearest classes.
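
To make the idea concrete, below is a minimal MATLAB sketch of K-NN classification over MFCC features. It is only an illustration, not the paper's actual code: the feature matrix X (one row of MFCCs per training phoneme) and the label vector Y are assumed to exist, and fitcknn comes from the Statistics and Machine Learning Toolbox.

    % X: M-by-13 matrix of MFCC feature vectors, one row per training phoneme.
    % Y: M-by-1 cell array (or categorical array) of phoneme class labels.
    mdl = fitcknn(X, Y, 'NumNeighbors', 5, 'Distance', 'euclidean');

    % Classify a query feature vector q (1-by-13 MFCCs from a test frame):
    label = predict(mdl, q);

Because K-NN stores the training rows themselves, no statistical summary of the data is ever formed; each query is compared directly against the retained examples.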
Mel Frequency Cepstrum Coefficient:
The first step in any automatic speech recognition system is to extract features, i.e., to identify the components of the audio signal that are good for identifying the linguistic content while discarding everything else, which carries information such as background noise and emotion.
The main point to understand about speech is that the sounds generated by a human are filtered by the shape of the vocal tract, including the tongue, teeth, etc. This shape determines what sound comes out. If we can determine the shape accurately, this should give us an accurate representation of the phoneme being produced. The shape of the vocal tract manifests itself in the envelope of the short-time power spectrum, and the job of the MFCCs is to accurately represent this envelope.

Pre-Emphasis:
The speech/voice signal x(n) is sent through a high-pass filter:
y(n) = x(n) - a * x(n-1)   (2)
where y(n) is the output signal and the value of a is usually between 0.9 and 1.0. The Z-transform of this filter is:
H(z) = 1 - a * z^(-1)   (3)
The goal of pre-emphasis is to compensate for the high-frequency part of the spectrum that is suppressed during the human sound production mechanism. Moreover, it can also amplify the importance of the high-frequency formants.
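
As a concrete illustration, the filter of equation (2) is a single line in MATLAB. This is only a sketch: x is an assumed input speech vector and a = 0.95 is an assumed typical value.

    a = 0.95;                    % pre-emphasis coefficient, usually 0.9-1.0
    y = filter([1 -a], 1, x);    % implements y(n) = x(n) - a*x(n-1)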
Frame Blocking:
The input speech signal is segmented into frames of 15~20 ms with an overlap of 50% of the frame size. Usually the frame size (in terms of sample points) is a power of two in order to facilitate the use of the Fast Fourier Transform (FFT); if this is not the case, the frame is zero-padded to the nearest power-of-two length. If the sample rate is 16 kHz and the frame size is 256 sample points, the frame duration is 256/16000 = 0.016 s = 16 ms. Additionally, with a 50% overlap of 128 points, the frame rate is 16000/(256-128) = 125 frames per second. Overlapping is used to preserve continuity between frames.
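
A minimal MATLAB sketch of this segmentation, using the 256-point frames and the 128-point hop worked out above (y is the pre-emphasized signal from the previous step; variable names are illustrative):

    fs  = 16000;                               % sample rate (Hz)
    N   = 256;                                 % frame size: 16 ms at 16 kHz
    hop = N/2;                                 % 50% overlap -> 128-point hop
    y   = y(:);                                % ensure a column vector
    nFrames = floor((length(y) - N)/hop) + 1;
    frames  = zeros(N, nFrames);               % one frame per column
    for k = 1:nFrames
        frames(:, k) = y((k-1)*hop + (1:N));   % slice the k-th frame
    end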
Hamming Window:
Each frame is multiplied by a Hamming window in order to keep the continuity of the first and last points of the frame. If the signal in a frame is denoted by x(n), n = 0, ..., N-1, then the signal after Hamming windowing is x(n) * w(n)   (4), where w(n) is the Hamming window defined by:
w(n) = 0.54 - 0.46 * cos(2πn/(N-1)),  0 ≤ n ≤ N-1
(see Figure 3, a plot of the Hamming window).
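
A sketch of equation (4) applied to the frame matrix built in the previous step; the element-wise product relies on MATLAB implicit expansion (R2016b or later) to window every column at once:

    n = (0:N-1)';                          % N = 256, as above
    w = 0.54 - 0.46*cos(2*pi*n/(N-1));     % Hamming window of equation (4)
    winFrames = frames .* w;               % window each frame (column)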
Fast Fourier Transform:
Spectral analysis shows that different timbres in speech signals correspond to different energy distributions over frequency [22]. Therefore, an FFT is performed to obtain the magnitude frequency response of each frame. When an FFT is performed on a frame, it is assumed that the signal within the frame is periodic and continuous when wrapping around. If this is not the case, the FFT can still be performed, but the discontinuity at the frame's first and last points is likely to introduce undesirable effects in the frequency response. To deal with this problem, each frame is multiplied by a Hamming window, as above, to increase its continuity at the first and last points.
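
In MATLAB, fft applied to a matrix transforms each column independently, so the magnitude frequency response of every windowed frame can be obtained at once (a sketch continuing the variables above):

    spec    = fft(winFrames);              % 256-point FFT of each frame
    magSpec = abs(spec(1:N/2+1, :));       % keep the non-negative frequencies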
Triangular Bandpass Filters:
The magnitude frequency response is multiplied by a set of 40 triangular bandpass filters to obtain the log energy of each triangular bandpass filter. The positions of these filters are equally spaced along the mel frequency scale. For centre frequencies from 133.33 Hz to 1 kHz there are 13 linearly spaced, 50%-overlapping filters, while for centre frequencies from 1 kHz to 8 kHz there are 27 logarithmically spaced overlapping filters [23].
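
The paper states only the filter counts and frequency ranges, so the exact spacing rule below is an assumption; a sketch of the 40 centre frequencies consistent with that description:

    fLin = linspace(133.33, 1000, 13);         % 13 linearly spaced centres
    fLog = 1000 * (8000/1000).^((1:27)/27);    % 27 log-spaced centres up to 8 kHz
    fc   = [fLin, fLog];                       % 40 filter centres in total

Each triangular filter then spans its two neighbouring centres, and the log energy E_k of the k-th filter is the logarithm of the triangle-weighted sum of the magnitude spectrum.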
Discrete Cosine Transform:
In this step, the Discrete Cosine Transform (DCT) is applied to the outputs of the N triangular bandpass filters (figure 4) to obtain L mel-scale cepstral coefficients. The formula for the DCT is:
C(n) = Σ_{k=1..N} E_k * cos(n * (k - 0.5) * π/N),  n = 0, 1, ..., L-1   (5)
where N is the number of triangular bandpass filters, E_k is the log energy of the k-th filter, and L is the number of mel-scale cepstral coefficients. In this study, N = 40 and L = 13, so the cosine argument is n * (k - 0.5) * π/40. Since an FFT has already been performed, the DCT transforms the frequency domain into a time-like domain called the quefrency domain. The obtained features are similar to the cepstrum and are therefore referred to as the mel-scale cepstral coefficients, or MFCCs. MFCCs alone can be used as the features for speech recognition.
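
A direct MATLAB transcription of equation (5), assuming E is the 40-by-nFrames matrix of log filterbank energies from the previous step (Nf is used instead of N to avoid clashing with the frame size above):

    Nf = 40;  L = 13;                      % number of filters and coefficients
    C  = zeros(L, size(E, 2));             % one column of MFCCs per frame
    for n = 0:L-1
        k = 1:Nf;
        C(n+1, :) = cos(n*(k - 0.5)*pi/Nf) * E;   % equation (5), row n
    end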
TIMIT:
TIMIT is a corpus of phonemically and lexically transcribed speech of American English speakers of different sexes and dialects, in which each transcribed element has been delineated in time. TIMIT was designed to further acoustic-phonetic knowledge and automatic speech recognition systems. It was commissioned by the Defense Advanced Research Projects Agency (DARPA) and worked on by many sites, including Texas Instruments (TI) and the Massachusetts Institute of Technology (MIT), hence the corpus' name. The TIMIT corpus of read speech is designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Although it was primarily designed for speech recognition, it is also widely used in speaker recognition studies, since it is one of the few databases with a relatively large number of speakers. It is a single-session database recorded in a sound booth with a fixed wideband headset. TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. The corpus includes time-aligned orthographic, phonetic and word transcriptions, as well as a 16-bit, 16 kHz speech waveform file for each utterance. Corpus design was a joint effort among the Massachusetts Institute of Technology (MIT), SRI International (SRI) and Texas Instruments, Inc. (TI) [24].

5. METHODOLOGY:

The methodological approach chosen included the following steps (a compact MATLAB sketch of steps 4-9 appears after the list):


1. Acquiring the TIMIT database
2. Acquiring MATLAB
3. Speech Selection
4. Pre-Emphasis
5. Analysing TIMIT data into individual phonemes
6. Feature Extraction using MFCC
7. Tabulating the MFCC
8. Creating the KNN
9. Testing the new Query data
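
A compact MATLAB sketch of steps 4-9 is shown below. Every name in it is illustrative rather than from the paper: extractMFCC is a hypothetical helper chaining the pre-emphasis, framing, windowing, FFT, filterbank and DCT stages sketched earlier, and trainWavs/trainLabels are assumed cell arrays parsed from the TIMIT phoneme data.

    X = [];  Y = {};
    for i = 1:numel(trainWavs)
        c = extractMFCC(trainWavs{i}, 16000);   % steps 4-6: 13-by-nFrames MFCCs
        X = [X; mean(c, 2)'];                   % step 7: one feature row per phoneme
        Y = [Y; trainLabels(i)];                % matching phoneme class label
    end
    mdl  = fitcknn(X, Y, 'NumNeighbors', 5);    % step 8: create the K-NN model
    pred = predict(mdl, mean(extractMFCC(queryWav, 16000), 2)');   % step 9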

6. RESULTS:

Analysing the TIMIT sound files:

Parsing the TIMIT sound files produced individual phonemes which resembled those of the phonetic alphabet.

Feature Extraction:
The spectral envelopes are sufficient to represent the differences between phonemes, so phonemes can be recognized through their MFCCs.

KNN Method:
MATLAB provides an app that was used to generate the associated K-NN model.

TESTING:
This gives an accuracy of 83.4% over all the phonemes of the speech; however, given that two of these phonemes produced an error, the actual accuracy of the K-NN model created is 89.7% for this randomly selected speech file.
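
For reference, an accuracy figure of this kind is simply the fraction of correctly labelled phonemes; a one-line sketch, assuming pred and truth are matching categorical label vectors for the test file:

    acc = 100 * mean(pred == truth);   % percent of phonemes classified correctly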

INPUT 1: HELLO
INPUT 2: APPLE

INPUT 3: PEOPLE
INPUT 4: BANANA

7. CONCLUSION:

Based on the research done and the results obtained, the following conclusions were reached:
 Using the K-NN model developed with MATLAB and the TIMIT database, English vowel and consonant phonemes were successfully differentiated.
 The accuracy attained by our project ranged from about 84% to 96%.
 As a statistical classifier, K-NN is a simpler method than the well-known HMM classifier model, and it classified speech with an acceptable level of accuracy.

8. REFERENCES:

[1] McLoughlin I., 2009, Applied Speech and Audio Processing: With MATLAB Examples, Cambridge University Press, Cambridge, United Kingdom.
[2] Yunusova Y., Weismer G., Westbury J., Lindstrom M., "Articulatory movements during vowels in speakers with dysarthria and in normal controls", Journal of Speech, Language, and Hearing Research, 2008, 51:596-611.
[3] Prasad B., Prasanna S.R.M. (Eds.), Speech, Audio, Image and Biomedical Signal Processing using Neural Networks, Springer-Verlag, Berlin, 2008, pp. 239-264.
[4] Moore B.C., Glasberg B.R., "Modeling binaural loudness", J. Acoust. Soc. Am., 2007, 121:1604-1612.
[5] Strom K.E., "The HR 2006 Dispenser Survey", Hear Rev., 2006, 13:16-39.
[6] Potamianos G., p. 800, 2006, ISBN 9780080448541.
[7] Rizwan M., Anderson D.V., "Using k-Nearest Neighbour and speaker ranking for phoneme prediction", 13th International Conference on Machine Learning and Applications (ICMLA), 2014, pp. 383-387.
[8] Jadhav S., Kava S., "Voice Activated Calculator", International Journal of Emerging Technology and Advanced Engineering, 2013.
[9] Fogerty D., Humes L.E., "The role of vowel and consonant fundamental frequency, envelope, and temporal fine structure cues to the intelligibility of words and sentences", J. Acoust. Soc. Am., 2012, 131:1490-1501.

ABOUT AUTHORS:

Dr. P. V. Rama Raju

Presently working as a Professor and HOD of the Department of Electronics and Communication Engineering, S.R.K.R. Engineering College, AP, India. His research interests include Biomedical Signal Processing, Signal Processing, Image Processing, VLSI Design, and Antennas and Microwave Anechoic Chambers Design. He is the author of several research studies published in national and international journals and conference proceedings.

G. Naga Raju

Presently working as an Assistant Professor in the Dept. of ECE, S.R.K.R. Engineering College, Bhimavaram, India. He received his B.Tech degree from S.R.K.R. Engineering College, Bhimavaram, in 2012, and his M.Tech degree with a Computer Electronics specialization from the Govt. College of Engineering, Pune University, in 2004. His current research interests include Image Processing, Digital Security Systems, Signal Processing, Biomedical Signal Processing, and VLSI Design.

K. N. V. Satyanarayana

Presently working as an Assistant Professor in the Dept. of Electronics and Communication, S.R.K.R. Engineering College, Bhimavaram, India.

R. Devi Priya

Presently pursuing a Bachelor of Engineering degree in Electronics & Communication Engineering, S.R.K.R. Engineering College, AP, India.
Email ID: devipriyaragireddy@gmail.com

S. Krishna Sree

Presently pursuing a Bachelor of Engineering degree in Electronics & Communication Engineering, S.R.K.R. Engineering College, AP, India.
Email ID: krishnasree8374@gmail.com

T. Sohith Sai Prasad

Presently pursuing a Bachelor of Engineering degree in Electronics & Communication Engineering, S.R.K.R. Engineering College, AP, India.
Email ID: sohithsaiprasad1@gmail.com
