Design A Text-Prompt Speaker Recognition System Using LPC-Derived Features
Abstract: Humans are integrated ever more closely with computers, and computers are taking over many services that used to be based on face-to-face contact between humans. This has prompted active development in the field of biometric systems. The use of biometric information is widely known for both person identification and security applications. This paper is concerned with the use of speaker features for protection against unauthorized access. A speaker recognition system for 6304 speech samples is presented that relies on LPC-derived features. A vocabulary of 46 speech samples is built for 10 speakers, where each authorized person is asked to utter every sample 10 times. Two different modes are considered for recognizing individuals according to their speech samples. In the closed-set speaker identification mode, all tested LPC-derived features are found to outperform the raw LPC coefficients, and identification rates of 84% to 97% are achieved. Applying preprocessing steps to the speech signals (preemphasis, DC-offset removal, frame blocking, overlapping, normalization and windowing) improves the representation of the speech features, and up to 100% identification rate is obtained using weighted Linear Predictive Cepstral Coefficients (LPCC). In the open-set speaker verification mode of the proposed system model, the system randomly selects a pass phrase of eight samples from its database each time a speaker is presented to the system. Up to 213 text-prompt trials from 23 different speakers (authorized and unauthorized), i.e., 1704 samples, are recorded in order to study the system behavior and to generate the optimal threshold at which speakers are accepted or rejected when compared with the training references of the authorized speakers constructed in the first mode. The best obtained speaker verification rate is greater than 99%.
Keywords: Speaker Recognition, Speaker Identification, Speaker Verification, Biometric, Text-prompt, LPC-derived features, LSF.
1. Introduction
As everyday life becomes more and more computerized, automated security systems are becoming more and more important. Today, most personal banking tasks can be performed over the Internet, and soon they will also be performed on mobile devices such as cell phones and PDAs. The key task of an automated security system is to verify that users are in fact who they claim to be [1]. As the level of security breaches and transaction fraud increases, the need for highly secure identification and personal verification technologies becomes apparent. Biometric-based solutions are able to provide confidential financial transactions and personal data privacy [2]. The need for biometrics can be found in federal, state and local governments, in the military, and in commercial applications [1, 3]. A biometric system is essentially a pattern recognition system that establishes the authenticity of a specific physiological or behavioral characteristic possessed by a user. Such systems are typically based on a single biometric feature of humans, but several hybrid systems also exist [2, 4, 5, 1, 6]. The human voice can serve as a key for any secured object, and it is not easy to lose or forget it. This technique can be used to verify the identity claimed by people accessing systems; that is, it enables control of access to various services by voice [3, 7]. Speaker recognition has received the attention of researchers working in the field of signal processing for many years. The technology has been developed to the point where it can be used in a number of applications, such as voice dialing, banking over a telephone network, person authentication, remote access to computers, command and control systems, network security and protection, entry and access control systems, data access/information retrieval, monitoring, etc. [8, 5, 9, 10, 11].
Table-1: The recorded speech samples

  Data Sets     | Speech Samples
  1) Digits     | 0 ... 9
  2) Characters | A ... Z
  3) Words      | Accept, Reject, Open, Close, Help, Computer, Yes, No, Copy, Paste
For practical purposes, these data sets are very interesting because the similarities between several samples (especially the letters) give rise to important problems in speech recognition. In the closed-set speaker identification mode, up to 4600 samples were collected from different persons, whereas 1704 samples were recorded in the open-set speaker verification mode.
Figure (1): Block-Diagram of the proposed Speaker Recognition System Model
In the open-set speaker verification mode of the proposed system model, the system randomly selects a pass phrase of eight samples from its database each time a speaker is presented to the system. Up to 213 text-prompt trials from different speakers (i.e., authorized and unauthorized) are recorded (i.e., 1704 samples) in order to study the system behavior. In fact, each element of a generated text-prompt sentence, as shown in Table-2, is a random number between 1 and 46 that corresponds to a sample in the vocabulary of Table-1. This is done in order to study the system behavior and to generate the optimal threshold at which speakers are accepted or rejected when compared with the training references of authorized speakers constructed in the first mode [1].
Table-2: Examples of Randomly Text-Prompt Sentences generated by the System
Table-2 illustrates five examples of text-prompt sentences generated by the system, where column Si (i = 1, 2, ..., 8) stands for sample number i; the eight samples in each row compose one sentence [1].
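For illustration, the prompt-generation step can be sketched in a few lines of Python (a minimal sketch under our own naming; the paper does not state whether a sample may repeat within one prompt, so this sketch simply draws eight independent random indices):

```python
import random

VOCABULARY_SIZE = 46  # digits, characters and words of Table-1
PROMPT_LENGTH = 8     # samples S1 ... S8 of one pass phrase

def generate_prompt():
    """Draw one text-prompt sentence as eight random numbers between
    1 and 46, each indexing a sample of the vocabulary (cf. Table-2)."""
    return [random.randint(1, VOCABULARY_SIZE) for _ in range(PROMPT_LENGTH)]
```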
3.3. Preprocessing
The basic idea behind speech preprocessing is to generate a signal whose fine structure is as close as possible to that of the original speech signal, while providing data reduction and easing the subsequent analysis [11]. A number of processing techniques adopted in this system model are applied in the following sequence:
$$\text{No. of Test Samples} = (10 - R) \times \text{No. of Samples} \times \text{No. of Speakers} \qquad (3)$$

where R is the number of repetitions of each sample used for training.
Preemphasis
Usually the digital speech signal s[n] is preemphasized first. This is achieved by passing the signal through a high-pass filter, which emphasizes the high frequencies relative to the low frequencies and hence compensates for the band-limiting of the input signal by a low-pass filter in the recording process. The most commonly used preemphasis filter is given by the following transfer function [12, 13, 10, 14]:

$$H(z) = 1 - \alpha z^{-1} \qquad (4)$$

where \(\alpha\), which controls the slope of the filter, typically lies in the range 0.9 < \(\alpha\) < 1.0. The filter is simply implemented as a first-order differentiator:

$$\tilde{s}[n] = s[n] - \alpha\, s[n-1] \qquad (5)$$

For the proposed system model, \(\alpha\) is set to 0.95 [1].

The Removal of DC Offset
DC offset occurs when hardware, such as a sound card, adds DC current to a recorded audio signal. This current produces a recorded waveform that is not centered on the baseline. Removing the DC offset is therefore the process of forcing the mean of the input signal to the baseline by adding a constant value to the samples in the sound file. An illustrative example of removing the DC offset from a waveform file is shown in Fig. (2) [1].
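As a minimal Python/NumPy sketch of these two steps (the function names and the NumPy-based implementation are ours; the paper does not prescribe an implementation):

```python
import numpy as np

def remove_dc(signal: np.ndarray) -> np.ndarray:
    """Force the signal mean back to the baseline by subtracting it,
    i.e., adding a constant offset to every sample."""
    return signal - np.mean(signal)

def preemphasize(signal: np.ndarray, alpha: float = 0.95) -> np.ndarray:
    """First-order differentiator of Eq. (5): s~[n] = s[n] - alpha * s[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])
```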
Frame-Blocking
This is the process of blocking, or splitting, the input speech samples into equal durations of N samples in order to carry out frame-wise analysis. The selection of the frame length is a crucial parameter for successful spectral analysis, owing to the trade-off between time and frequency resolution: the frame should be long enough for adequate frequency resolution, but short enough to capture the local spectral properties. Typically a frame length of 10-30 milliseconds is used. The signal for the i-th frame is given by [15, 14, 10, 12]:

$$x_i[n] = s[iM + n], \quad 0 \le n \le N-1 \qquad (6)$$

where M is the frame shift in samples. In this work, a frame length of N = 256 samples, with a duration of 23.2 milliseconds, is used [1].

Overlapping
Usually adjacent frames are overlapped: each frame is shifted forward along the signal by a fixed amount, typically 30-50% of the frame length. The purpose of the overlapping is to avoid losing information, since each speech sound of the input sequence is then approximately centered at some frame [1, 15, 13, 16].

Normalization
The frames of speech are normalized to make their power equal to unity. This step is very important, since the extracted frames have different intensities due to speaker loudness, speaker distance from the microphone and the recording level. The normalization is done by dividing each sample by the square root of the sum of squares of all samples in the segment:

$$s_{norm}[n] = \frac{s[n]}{\sqrt{\sum_{k=0}^{N-1} s^2[k]}} \qquad (7)$$
where s[n] is the speech sample, N is the number of samples in the segment (256 here), and the subscript norm refers to normalization [1].

Windowing
The purpose of windowing is to reduce the effect of spectral leakage (a type of distortion in spectral analysis) that results from the framing process. Windowing involves multiplying a speech signal x(n) by a finite-duration window w(n), which yields a set of speech samples weighted by the shape of the window, as stated by the following equation [1, 15, 13, 17, 12]:

$$x_w(n) = x(n)\, w(n), \quad 0 \le n \le N-1 \qquad (8)$$

where N is the size of the window or frame. Many different windowing functions exist; Table-3 lists the window functions used in our experiments, and their shapes are illustrated in Fig. (3) [1].
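The framing, normalization and windowing steps can likewise be sketched as follows (again our own naming; the Kaiser beta parameter is our assumption, since the paper does not state it):

```python
import numpy as np

def frame_signal(signal: np.ndarray, frame_len: int = 256, overlap: float = 0.5) -> np.ndarray:
    """Split the signal into overlapping frames, Eq. (6); 50% overlap here."""
    step = int(frame_len * (1.0 - overlap))
    n_frames = 1 + max(0, (len(signal) - frame_len) // step)
    return np.stack([signal[i * step : i * step + frame_len] for i in range(n_frames)])

def normalize_frame(frame: np.ndarray) -> np.ndarray:
    """Scale the frame to unit power, Eq. (7)."""
    return frame / np.sqrt(np.sum(frame ** 2))

def make_window(kind: str, N: int = 256) -> np.ndarray:
    """The window functions of Table-3."""
    if kind == "rectangular":
        return np.ones(N)
    if kind == "hamming":
        return np.hamming(N)
    if kind == "kaiser":
        return np.kaiser(N, 6.0)  # beta = 6.0 is our assumption
    raise ValueError(kind)

# Windowing, Eq. (8): multiply each normalized frame by the window, e.g.
# windowed = normalize_frame(frame) * make_window("hamming")
```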
Table-3: Rectangular, Hamming and Kaiser Window-Function
Figure (2): Removal of DC offset from a Waveform file (a) Exhibits DC offset, (b) After the removal of DC offset

Figure (3): Rectangular, Hamming and Kaiser Window-Function of 256 Samples Length [1]
In linear predictive (LP) analysis, the current speech sample is approximated as a linear combination of past samples:

$$\hat{s}[n] = \sum_{k=1}^{p} a[k]\, s[n-k] \qquad (9)$$

where \(\hat{s}[n]\) is the approximation of the present output, s[n-k] are past outputs, p is the prediction order, and {a[k]}, k = 1, ..., p, are the model parameters, called the predictor coefficients, that need to be determined so that the average prediction error (or residual) is as small as possible [10, 19].

The prediction error for the nth sample is given by the difference between the actual sample and its predicted value [1, 13, 20, 10]:

$$e[n] = s[n] - \sum_{k=1}^{p} a[k]\, s[n-k] \qquad (10)$$

Equivalently,

$$s[n] = \sum_{k=1}^{p} a[k]\, s[n-k] + e[n] \qquad (11)$$

When the prediction residual e[n] is small, the predictor of Eq. (9) approximates s[n] well. The total squared prediction error is given by

$$E = \sum_{n} e^2[n] = \sum_{n} \Big( s[n] - \sum_{k=1}^{p} a[k]\, s[n-k] \Big)^2 \qquad (12)$$
Minimization of the error is achieved by setting the partial derivatives of E with respect to the model parameters {a[k]} to zero:

$$\frac{\partial E}{\partial a[k]} = 0, \quad k = 1, \ldots, p \qquad (13)$$
By writing out Eq. (13) for k = 1, ..., p, the problem of finding the optimal predictor coefficients reduces to solving the so-called Yule-Walker (AR) equations. Depending on the choice of the error-minimization interval in Eq. (12), there are two methods for solving the AR equations: the covariance method and the autocorrelation method [13, 10, 19]. The two methods do not differ greatly, but the autocorrelation method is preferred, since it is computationally more efficient and always guarantees a stable filter. In matrix form, the AR equations read:
$$\begin{bmatrix} R(0) & R(1) & \cdots & R(p-1) \\ R(1) & R(0) & \cdots & R(p-2) \\ \vdots & \vdots & \ddots & \vdots \\ R(p-1) & R(p-2) & \cdots & R(0) \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_p \end{bmatrix} = \begin{bmatrix} R(1) \\ R(2) \\ \vdots \\ R(p) \end{bmatrix} \qquad (14)$$
where R is a special type of matrix called a Toeplitz matrix (symmetric, with all elements along each diagonal equal; this structure facilitates the solution of the Yule-Walker equations for the LP coefficients {a_k} through computationally fast algorithms such as the Levinson-Durbin algorithm), a is the vector of the LPC coefficients, and v is the autocorrelation vector on the right-hand side. Both the matrix R and the vector v are completely defined by the p + 1 autocorrelation samples R(0), ..., R(p). The autocorrelation sequence of s[n] is defined as [1, 21, 10, 19, 13]:
$$R[k] = \frac{1}{N} \sum_{n=0}^{N-1-k} s[n]\, s[n+k] \qquad (15)$$
Due to the redundancy in the Yule-Walker (AR) equations, there exists an efficient algorithm for finding the solution, known as the Levinson-Durbin recursion [1, 10, 19, 20, 13]:

$$E^{(0)} = R(0) \qquad (16)$$

$$k_i = \frac{R(i) - \sum_{j=1}^{i-1} a_j^{(i-1)} R(i-j)}{E^{(i-1)}}, \quad 1 \le i \le p \qquad (17)$$

$$a_i^{(i)} = k_i, \qquad a_j^{(i)} = a_j^{(i-1)} - k_i\, a_{i-j}^{(i-1)}, \quad 1 \le j \le i-1 \qquad (18)$$

$$E^{(i)} = (1 - k_i^2)\, E^{(i-1)} \qquad (19)$$

where k_i are the Partial Correlation (PARCOR) coefficients, a_j^{(i)} is the jth predictor (LPC) coefficient after i iterations, and E^{(i)} is the prediction error after i iterations. The Levinson-Durbin procedure takes the autocorrelation sequence as its input and produces the coefficients a[k], k = 1, ..., p. The time complexity of the procedure is O(p^2), as opposed to the standard Gaussian elimination method, whose complexity is O(p^3). Equations (16)-(19) are solved recursively for i = 1, 2, ..., p, where p is the order of the LPC analysis, and the final solution is given as [13, 1, 10, 20, 19]:

$$a_j = a_j^{(p)}, \quad 1 \le j \le p \qquad (20)$$

Partial Correlation Coefficients (PARCOR)
Several alternative representations can be derived from the LPC coefficients when the autocorrelation method is used. The Levinson-Durbin algorithm produces the quantities {k_i}, i = 1, 2, ..., p (lying in the range -1 <= k_i <= 1), which are known as the reflection or PARCOR coefficients [13, 1].

Log Area Ratio (LAR)
A further parameter set can be derived from the PARCOR coefficients by taking the logarithm of the area ratio, yielding the log area ratios (LARs) {g_i}, defined as [19, 20, 22, 10, 1, 13]:

$$g_i = \log\left(\frac{1 - k_i}{1 + k_i}\right), \quad 1 \le i \le p \qquad (21)$$

Arcsin Reflection Coefficients (ASRC)
An alternative to the log area ratios are the arcsin reflection coefficients, computed simply by taking the inverse sine of the reflection coefficients [10, 1, 13]:

$$\arcsin_i = \sin^{-1}(k_i), \quad 1 \le i \le p \qquad (22)$$

Linear Predictive Cepstral Coefficients (LPCC)
An important fact is that the cepstrum can also be derived directly from the LPC parameter set. The relationship between the cepstrum coefficients c_n and the prediction coefficients a_k is given by the following recursion [1, 9, 13]:

$$c_1 = a_1, \qquad c_n = a_n + \sum_{k=1}^{n-1} \left(1 - \frac{k}{n}\right) a_k\, c_{n-k}, \quad 1 < n \le p \qquad (23)$$

where p is the prediction order. It is usually said that the cepstrum derived in this way represents a smoothed version of the spectrum. As in LPC analysis, increasing the number of coefficients yields more detail [10, 4]. Because the low-order cepstral coefficients are sensitive to the overall spectral slope and the high-order cepstral coefficients are sensitive to noise (and other forms of noise-like variability), it has become a standard technique to weight the cepstral coefficients with a tapered window so as to minimize these sensitivities and improve the performance of these coefficients [19, 14, 13, 1]. To achieve robustness for large values of n, a more general weighting of the form

$$\hat{c}_n = w_n\, c_n \qquad (24)$$

must be considered, where

$$w_n = 1 + \frac{P}{2} \sin\left(\frac{\pi n}{P}\right), \quad 1 \le n \le P \qquad (25)$$

This weighting function truncates the computation and de-emphasizes c_n around n = 1 and around n = P [19].

Line Spectral Frequencies (LSFs)
Another representation of the LP parameters of the all-pole spectrum is the set of line spectral frequencies (LSFs), or line spectrum pairs (LSPs) [23, 21]. Originally proposed for the compression of speech and other audio signals, and the most widely used representation of LPC parameters for quantization and coding, they have also been applied with good results to speaker recognition [23, 24, 10, 1]. The LSFs are the roots of the following polynomials:

$$P(z) = B(z) + z^{-(p+1)} B(z^{-1}) \qquad (26)$$

$$Q(z) = B(z) - z^{-(p+1)} B(z^{-1}) \qquad (27)$$

where B(z) = 1/H(z) = 1 - A(z) is the inverse LPC filter. The roots of P(z) and Q(z) are interleaved and occur in complex-conjugate pairs, so that only p/2 roots are retained for each of P(z) and Q(z) (p roots in total). Moreover, the root magnitudes are known to be unity, so only their angles (frequencies) are needed. Each root of B(z) corresponds to one root in each of P(z) and Q(z). Therefore, if the frequencies of such a pair of roots are close, the original root of B(z) likely represents a formant; otherwise, it represents a wide-bandwidth feature of the spectrum. These correspondences provide an intuitive interpretation of the LSP coefficients [13].
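To make the chain from autocorrelation to the LPC-derived feature sets concrete, the following Python/NumPy sketch implements Eqs. (15)-(27). It is an illustration under our own naming, not the authors' code:

```python
import numpy as np

def autocorrelation(frame: np.ndarray, p: int) -> np.ndarray:
    """Biased autocorrelation R[0..p] of one frame, Eq. (15)."""
    N = len(frame)
    return np.array([np.dot(frame[:N - k], frame[k:]) / N for k in range(p + 1)])

def levinson_durbin(R: np.ndarray, p: int):
    """Solve the Yule-Walker equations, Eqs. (16)-(20).
    Returns the LPC coefficients a_1..a_p and the PARCOR coefficients k_1..k_p."""
    a = np.zeros(p + 1)                               # a[j] holds a_j; a[0] unused
    k = np.zeros(p + 1)
    E = R[0]                                          # Eq. (16)
    for i in range(1, p + 1):
        acc = R[i] - np.dot(a[1:i], R[i - 1:0:-1])    # R(i) - sum_j a_j R(i-j)
        k[i] = acc / E                                # Eq. (17)
        a_new = a.copy()
        a_new[i] = k[i]                               # Eq. (18)
        for j in range(1, i):
            a_new[j] = a[j] - k[i] * a[i - j]
        a = a_new
        E = (1.0 - k[i] ** 2) * E                     # Eq. (19)
    return a[1:], k[1:]                               # Eq. (20)

def lpc_to_cepstrum(a: np.ndarray) -> np.ndarray:
    """LPCC recursion, Eq. (23)."""
    p = len(a)
    c = np.zeros(p + 1)                               # c[0] unused
    for n in range(1, p + 1):
        c[n] = a[n - 1] + sum((1 - k / n) * a[k - 1] * c[n - k] for k in range(1, n))
    return c[1:]

def weight_cepstrum(c: np.ndarray) -> np.ndarray:
    """Raised-sine cepstral lifter, Eqs. (24)-(25)."""
    P = len(c)
    n = np.arange(1, P + 1)
    return (1 + (P / 2) * np.sin(np.pi * n / P)) * c

def lar(k: np.ndarray) -> np.ndarray:
    """Log area ratios, Eq. (21)."""
    return np.log((1 - k) / (1 + k))

def asrc(k: np.ndarray) -> np.ndarray:
    """Arcsin reflection coefficients, Eq. (22)."""
    return np.arcsin(k)

def lsf(a: np.ndarray) -> np.ndarray:
    """Line spectral frequencies, Eqs. (26)-(27): with B(z) = 1 - sum a_k z^{-k},
    form P(z) and Q(z) and keep the root angles in (0, pi)."""
    b = np.concatenate(([1.0], -a))
    P_poly = np.concatenate((b, [0.0])) + np.concatenate(([0.0], b[::-1]))
    Q_poly = np.concatenate((b, [0.0])) - np.concatenate(([0.0], b[::-1]))
    angles = np.concatenate((np.angle(np.roots(P_poly)), np.angle(np.roots(Q_poly))))
    return np.sort(angles[(angles > 0) & (angles < np.pi)])
```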
Table-4: Identification Rate (%) for the LP-based Coefficients

  Feature | Euclidean Distance (E.D.) | City-block Distance (C.D.)
  LPC     | 84.173                    | 84.217
  PARCOR  | 95.173                    | 95.260
  LAR     | 94.000                    | 94.652
  ASRC    | 94.826                    | 95.608
  LPCC    | 97.087                    | 97.521
  LSF     | 95.695                    | 95.782
The distance between a test feature vector and a reference vector is measured either by the Euclidean distance

$$E.D. = \sqrt{\sum_{i=1}^{N} (a_i - b_i)^2} \qquad (28)$$

or by the city-block distance

$$C.D. = \sum_{i=1}^{N} |a_i - b_i| \qquad (29)$$
where A and B are two vectors, such that A = [a_1 a_2 ... a_N] and B = [b_1 b_2 ... b_N].
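A sketch of the two distance measures and the nearest-neighbor decision rule follows (our own naming; `references` is assumed to map each enrolled speaker to a stored reference vector):

```python
import numpy as np

def euclidean(a: np.ndarray, b: np.ndarray) -> float:
    """Euclidean distance, Eq. (28)."""
    return float(np.sqrt(np.sum((a - b) ** 2)))

def city_block(a: np.ndarray, b: np.ndarray) -> float:
    """City-block (Manhattan) distance, Eq. (29)."""
    return float(np.sum(np.abs(a - b)))

def identify(test_vec: np.ndarray, references: dict, distance=city_block):
    """Nearest-neighbor decision rule: return the enrolled speaker whose
    reference vector lies closest to the test vector."""
    return min(references, key=lambda spk: distance(test_vec, references[spk]))
```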
It is clear from Table-4 and its corresponding chart, Fig. (4), that all tested LPC-derived features outperform the raw LPC coefficients, which yield only about 84% identification rate.
4. Experimental Results
Many experiments and test conditions were carried out to measure the performance of the proposed system under different criteria concerning preemphasis, frame overlapping, LPC order, window type, cepstral weighting and the text-prompt speaker verification. The identification rate is defined as the ratio of correctly identified speakers to the total number of test samples, which corresponds to a nearest-neighbor decision rule:
$$\text{Identification Rate} = \frac{\text{No. of Correctly Identified Speakers}}{\text{Total No. of Samples Tested}} \times 100\% \qquad (30)$$
Figure (5) clearly shows the improvement in identification rates across all LPC-based systems, to the range of 93% to 98%, after applying the preemphasis step to the speech signal.
Table-6: Identification Rates (%) for LP-based Coefficients with different Window Types

  Feature | Rectangular | Hamming
  LPC     | 97.5652     | 94.9130
  PARCOR  | 99.8696     | 99.4783
  LAR     | 99.3478     | 99.4783
  ASRC    | 99.6957     | 99.4783
  LPCC    | 99.9565     | 99.9565
  LSF     | 99.8261     | 99.9130
Figure (6): Effect of LPC Predictor Order (P =15, 30, 45) on Identification Rates
It is clearly seen from the results of Table-5 and Fig. (6) that increasing the predictor order P, combined with overlapping successive frames, positively influences most identification rates. Therefore, the predictor order P is set to 45 for the subsequent experiments.
The successful-decision rate in Table-7 corresponds to accepting registered persons and rejecting non-registered ones over all trials. The variation of the False Acceptance Rate (FAR) and the False Rejection Rate (FRR) with different threshold values is shown in Fig. (7), where the Crossover Error Rate (CER) is reached at a threshold of approximately 17.15, which is the most suitable security threshold, giving a 99.53% successful-decision rate.
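The threshold sweep behind Fig. (7) can be reproduced with a sketch of the following form (a minimal illustration under our own naming; `genuine_scores` and `impostor_scores` are assumed to be the distances of authorized and unauthorized trials to their claimed references):

```python
import numpy as np

def far_frr(genuine_scores, impostor_scores, thresholds):
    """For each threshold t, accept a trial when its distance falls below t.
    FAR = fraction of impostor trials accepted; FRR = fraction of genuine
    trials rejected. The crossover (CER) is where the two curves meet."""
    genuine = np.asarray(genuine_scores)
    impostor = np.asarray(impostor_scores)
    far = np.array([np.mean(impostor < t) for t in thresholds])
    frr = np.array([np.mean(genuine >= t) for t in thresholds])
    return far, frr

# Crossover threshold, cf. the value of about 17.15 reported above:
# t_cer = thresholds[np.argmin(np.abs(far - frr))]
```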
Figure (7): FAR and FRR Performance Curve for different threshold levels using city-block distance
5. Conclusion
A speaker recognition system for 6304 speech samples, relying on LPC-derived features, has been presented, and acceptable results have been obtained. In the closed-set speaker identification mode, all tested LPC-derived features outperform the raw LPC coefficients, with identification rates of 84% to 97%. An improvement of the identification rates of the LPC-based systems to the range of 97% to 99% is obtained by applying the preprocessing steps (preemphasis, DC-offset removal, frame blocking, 50% overlap of successive frames, normalization and windowing) to the speech signal and by increasing the predictor order P. According to the speaker identification tests performed, LPCC exhibits the best results among the LPC-based coefficients, and its accuracy can be further improved by weighting the cepstral coefficients, giving identification rates close to 100%. The open-set speaker verification mode was also evaluated on 213 trials (text-prompt sentences randomly generated by the system) from 23 persons (1704 samples). The obtained verification rate, greater than 99%, shows that the proposed system model is quite suitable.
References
[1] Mustafa D. Al-Hassani, "Identification Techniques using Speech Signals and Fingerprints", Ph.D. Thesis, Department of Computer Science, Al-Nahrain University, Baghdad, Iraq, September 2006.
[2] Tiwalade O. Majekodunmi and Francis E. Idachaba, "A Review of the Fingerprint, Speaker Recognition, Face Recognition and Iris Recognition Based Biometric Identification Technologies", Proceedings of the World Congress on Engineering (WCE), Vol. II, London, U.K., 2011.
[3] M. Eriksson, "Biometrics: Fingerprint based identity verification", M.Sc. Thesis, Department of Computer Science, Umeå University, August 2001.
[4] Yuan Yujin, Zhao Peihua and Zhou Qun, "Research of Speaker Recognition Based on Combination of LPCC and MFCC", Electronic Information Engineering, Training and Experimental Center, Handan College, China, 2010.
[5] Anil K. Jain and Arun Ross, "Introduction to Biometrics", Springer Science+Business Media, LLC, USA, 2008.
[6] S. Gunnam, "Fingerprint Recognition and Analysis System", mini-thesis presented to Dr. David P. Beach, Department of Electronics and Computer Technology, Indiana State University, Terre Haute, April 2004.
[7] E. Hjelmås, "Biometric Systems: A Face Recognition Approach", Department of Informatics, University of Oslo, Oslo, Norway, 2000.
[8] Valentin Andrei, Constantin Paleologu and Corneliu Burileanu, "Implementation of a Real-Time Text Dependent Speaker Identification System", University Politehnica of Bucharest, Romania, 2011.
[9] E. Karpov, "Real-Time Speaker Identification", M.Sc. Thesis, Department of Computer Science, University of Joensuu, Finland, January 2003.
[10] T. Kinnunen, "Spectral Features for Automatic Text-Independent Speaker Recognition", Ph.D. Thesis, Department of Computer Science, University of Joensuu, Finland, December 2003.
[11] T. Chen, "The Past, Present, and Future of Speech Processing", IEEE Signal Processing Magazine, No. 5, May 1998.
[12] Biswajit Kar, Sandeep Bhatia and P. K. Dutta, "Audio-Visual Biometric Based Speaker Identification", International Conference on Computational Intelligence and Multimedia Applications, India, 2007.
[13] Antonio M. Peinado and Jose C. Segura, "Speech Recognition Over Digital Channels: Robustness and Standards", John Wiley & Sons Ltd, University of Granada, Spain, 2006.
[14] B. R. Wildermoth, "Text-Independent Speaker Recognition using Source Based Features", M.Sc. Thesis, Griffith University, Australia, January 2001.
[15] Ch. Srinivasa Kumar and P. Mallikarjuna Rao, "Design of an Automatic Speaker Recognition System Using MFCC, Vector Quantization and LBG Algorithm", International Journal on Computer Science and Engineering (IJCSE), Vol. 3, No. 8, August 2011.
[16] Ciira wa Maina and John MacLaren Walsh, "Log Spectra Enhancement Using Speaker Dependent Priors for Speaker Verification", Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, 2011.
[17] Ning Wang, P. C. Ching and Tan Lee, "Robust Speaker Verification Using Phase Information of Speech", Department of Electronic Engineering, The Chinese University of Hong Kong, 2010.
[18] Wai C. Chu, "Speech Coding Algorithms: Foundation and Evolution of Standardized Coders", John Wiley & Sons, Inc., California, USA, 2003.
[19] L. Rabiner and B.-H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, Inc., Englewood Cliffs, New Jersey, 1993.
[20] Yasir A.-M. Taleb, "Statistical and Wavelet Approaches for Speaker Identification", M.Sc. Thesis, Department of Computer Engineering, Al-Nahrain University, Iraq, June 2003.
[21] N. Batri, "Robust Spectral Parameter Coding in Speech Processing", M.Sc. Thesis, Department of Electrical Engineering, McGill University, Montreal, Canada, May 1998.
[22] J. P. Campbell, "Speaker Recognition: A Tutorial", Proceedings of the IEEE, Vol. 85, No. 9, 1997.
[23] A. K. Khandani and F. Lahouti, "Intra-frame and Inter-frame Coding of Speech LSF Parameters Using a Trellis Structure", Department of Electrical and Computer Engineering, University of Waterloo, Ontario, Canada, June 2000.
[24] J. Rothweiler, "A Root Finding Algorithm for Line Spectral Frequencies", Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-99), March 15-19, U.S.A., 1999.
[25] S. E. Umbaugh, "Computer Vision and Image Processing", Prentice-Hall, Inc., U.S.A., 1998.
[26] R. C. Gonzalez and R. E. Woods, "Digital Image Processing", Second Edition, Prentice-Hall, Inc., New Jersey, U.S.A., 2002.