Identification of Speaker From Disguised Voice Using MFCC Feature Extraction, Chi Square and Classification Technique
https://doi.org/10.1007/s11277-024-11542-0
Mahesh K. Singh
Department of ECE, Aditya University, Surampalem, India
mahesh.092002.ece@gmail.com
Abstract
The purpose of this manuscript is to show that certain acoustic features can be used to recognize the disguised speech of unknown speakers. As the name implies, forensic speaker identification entails the use of scientific techniques to ascertain an unknown speaker's identity during an inquiry. This study aims to provide a voice recognition method that works well. To distinguish between speech and background noise in each frame, chi-square tests are utilized; the estimated background noise is continuously updated for this purpose. Chi-square noise estimates are obtained once background noise has initially been reduced. The observed signal distribution and the estimated noise distribution are then compared using a second chi-square test. For a frame to be labelled as noise, the two chi-square test scores must be close together. Mel-frequency cepstrum coefficient (MFCC) features are grouped as three-dimensional features, and the correlation coefficient characteristics of speech are coupled with the MFCC feature extraction technique. Feature-based classification is performed with support vector machine (SVM) classifiers and the k-nearest neighbor (k-NN) classification technique. Classification results show that applying these unique features in an SVM classifier boosts classification accuracy.
It is assumed that a caller who uses a disguised voice is attempting to hide their genuine identity. Because of the risk disguise poses to investigators, forensic investigations involving Indian suspects are more likely to encounter this issue. Such behavior may diminish a person's sense of value [1]. By changing his voice in a number of different ways, a speaker might deceive a listener or an automated system; voice disguise is regarded as one of the most restricting factors for speaker recognition. Some crimes are more likely to be perpetrated in broad daylight than others. A speaker is additionally expected to adopt a disguise if the suspect wants to prevent a listener who is familiar with their speech from recognizing them. In circumstances like kidnapping, extortion, or harassing phone calls, a disguise is more likely [2, 3].
The ability of human auditors to correctly identify the speakers in recorded conversations has been evaluated. Confident listeners performed well on the disguise test in both relaxed and tense situations, scoring 79% and 98%, respectively. In both relaxed and tense circumstances, individuals who were unfamiliar with the speaker performed 0.5 points worse than those in covert listening situations [4, 5]. Although individuals who did not know the speakers or their language scored lower, the data patterns did not change. Examples of disguise include covering one's mouth with a handkerchief or changing the pitch of one's voice. Additionally, formant frequencies can be changed by moving articulators like the tongue or lips [6]. The dictionary describes a disguise as altered or distorted speech, regardless of the reason. One author further differentiates between electronic and non-electronic disguises, both of which are referred to as "deliberate." Altering one's accent or voice quality and phonatory alterations like whispering and falsetto are examples of non-electronic deceptions [7]. Another typical non-electronic technique is to obstruct the movement of the voice organs by inserting a pencil between one's teeth. Health, age, emotions, and the distortions caused by recording and transmission technology are just a few of the variables that might affect a person's natural voice. Although these effects are not considered disguises, they could nonetheless cause a person's voice to be mistakenly identified [8].
Technical disguises can be produced using voice scrambling and voice modification tools, which allow precise pitch and frequency adjustment of the speech stream. A technique based on mean values and correlation coefficients was put forth to make it easier to detect voices that have been electronically disguised [9]. After rigorous testing, this system was able to recognize more than 90% of the voices in diverse speech datasets that had been disguised in different ways. Like many other forms of recognition, speaker recognition systems consist of two distinct parts: feature extraction and classification. Classification itself has two main parts, pattern matching and decision-making, as shown in Fig. 1.
To provide speaker-specific information, a combination of phonological, phonetic, semantic, and acoustic alterations must be considered [10]. As a speaker's communicative intent and dialogue engagement change, changes are made to the speech signal at the semantic level. A speaker's word choice and sentence structure can reveal something about their financial class or educational background. Ultimately, the phonological representation of the communication objective is what matters most: much about someone's original language and geographic area can be gleaned from their voice and sentence length. The articulators of the vocal tract, which include the jaw, tongue, velum, and vocal cords, collaborate to produce a phonetic representation [11]. To produce the same phoneme, for instance, a speaker can employ different combinations of articulator movements. The spectral characteristics of audible speech are the primary emphasis at the acoustic level. The length and shape of the vocal folds, for example, affect the fundamental and resonant frequencies of the voice [12].
One study described a technique for identifying a person's voice even while they hide their identity with a mask. Without any additional presentation or aids, only 65% of participants correctly identified speakers with disguised voices, compared to 90% accuracy with undisguised voices. Falsetto phonation as a form of voice masking has also been put to the test [13]: falsetto phonation was shown to be far less effective for distinguishing people from one another than natural speech, with recognition rates declining from 97% to 4% when falsetto was used to disguise speech. Research has also examined the impact of common vocal techniques, such as changing one's voice tone or pinching one's nose, on forensic speaker identification systems; falsetto speech might be a contributing issue here. The results showed that when the reference populations contained speech samples with similar voice disguises, using three alternative voice disguises had no impact on a system's performance [14]. The impacts of the three types of disguise under examination were more severe and diverse if the control group contained only people who spoke normally [6].
The speaker identification task is to identify, from a list of previously enrolled speakers, the speaker who recorded the input speech sample. A system working with a collection of recognized voices operates in one of two ways: open-set mode or closed-set mode. When working in closed-set mode, the system assumes that the voice to be identified must come from a group of voices that are previously known [15]; it operates in open-set mode under all other conditions. If the closed-set speaker's identity is known, identification might be considered a multiple-class categorization problem [16]. In open-set mode, speakers who are not part of the recognized voices must also be handled. This formulation can be used, for example, to identify a criminal from an enormous pool of previously known suspects; speech evidence is an example of this kind of use. An investigation of the differences in listening capacities between phonetic experts and non-experts is described in [17]. The main objective of the inquiry was to classify the speaker. Participants in both groups were requested to select the speaker's voice from a selection of five foils in an activity known as "direct identification."
According to the findings of that study, people who had previously studied phonetic speaker identification performed significantly better than those who had never studied the subject [18]. Verifying the identity of a speaker is made possible through the use of voice samples; this procedure is used to verify the authority of a speaker [19, 20] and is referred to as speaker verification. The situation can be described as a true-false decision. Because of the open-set nature of the task at hand, the open-set difficulty is commonly mentioned when discussing this method, since the goal is to verify the claimed speaker's voice. For the time being, verification is the most profitable activity, and voice recognition systems rely heavily on it [1].
The technology of speaker recognition has several practical applications [8], for example:
Security: Speaker recognition systems can be used for many different purposes, such as transaction authentication and phone verification for banking access.
Personalization: Personalized caller greetings are now a feature of intelligent answering machines thanks to improvements in voice recognition technology. Conversational systems can be designed that are specific to each user's requirements; based on their profile, these systems can recognize the user and point them toward their goal in the shortest period of time [15].
The term "speaker recognition" refers to a variety of techniques used to recognize the source of a speech sample. It is possible to analyze what someone says by looking at their own distinctive vocal qualities and the words they use to express their thoughts and feelings [1]. Samples from the same speaker, however, show a wide range of differences, because a speaker is unable to recite the same words identically over and over again; just as a person's signature differs from trial to trial, it also differs from person to person. The work of a speaker recognition engineer can be categorized into three distinct specialties, depending on the specifics of the research at hand [4, 5].
Speaker recognition is sometimes considered a catch-all term for its various subcategories of speaker identification and verification. Overall, it describes how to recognize someone from their speech by assessing these qualities alone [7]. To identify the speaker, the system does not analyze the language, remember what the person looks like, or use any other method. The term can be employed when it is not clear whether the process is verification or identification [9, 10].
Open-set speaker identification combines the verification task of an open-set setting and the speaker identification task of a closed-set setting into a single challenge. In a closed-set environment, the system can identify enrolled speakers, but in an open set it must also be able to reject "unregistered speakers." The use of speaker verification to restrict access to financial services over the phone is one possible security application. It is essential to use methods that segment and group speakers when several talkers are present [12, 13]. Voice recognition and speaker recognition applications commonly assume that a single specific speaker's speech is being processed. Before the recognition process can begin, the speech must therefore be split into chunks containing the voice of each of the several speakers, to guarantee that the intended speaker's voice is not confused with the voices of other speakers. This technique is called segmentation: the goal is to figure out who the speakers are in the incoming audio and then partition the audio into homogeneous pieces [6, 15].
Multi-speaker audio has recently become more prominent in popular web searches and consumer electronics products, which has heightened interest in this activity. Audio archives can be indexed using speaker segmentation and grouping, making it easier to find the recordings you need. Text-independent methods are also available for automatic speaker recognition [18, 19]. For a text-dependent system to learn, users must repeatedly enter the same text, but with text-independent recognition they can speak any phrase, even one that has not been used before. Typically, the "target speaker," the one who speaks for the model in question, is the one who makes the identity assertion. Test samples are compared to stored speech models in order to identify the speaker of a given phrase or sentence. In the speaker verification process, only the model of the claimed identity is examined. The decision may also differ depending on the system in use: an open-set recognizer can reject the user if the test sample does not match one of the previously recorded speech models, whereas verification tasks accept or reject a person's assertion of identity [15, 17].
In a closed-set identification task, the identity of the model that most closely fits a test sample is selected; a simple comparison suffices. In open-set applications, a threshold may additionally be necessary to ensure that the match is genuine [12, 14]. For an open-set application, the cost of an error must be taken into account throughout the selection process to account for the exclusion of some speakers. It costs a bank less to inconvenience a genuine customer with a false rejection than to deal with the aftermath of accepting an impostor who wants to withdraw money. The effectiveness of a speaker recognition system can vary with the setting [3, 7]. Comparing the system's identification of speakers against those who have already been recognized is one way of judging whether or not a system is accurate. A false acceptance of a non-target speaker and a false rejection of a target voice are the two sorts of faults that can occur in speaker verification systems [8, 17].
It has been shown that poor recording quality and a lack of comparable vocabulary were the most common causes of conclusions with no confidence or low confidence; decisions were also influenced by disguised and high-pitched voices. Far-field microphones are under investigation to see whether they may increase speaker recognition reliability, with the aim of reducing the number of errors that occur throughout the speaker recognition procedure due to the instrument [7, 9]. According to a number of experts, the application of speaker recognition in forensic situations should be approached with utmost caution, and speaker recognition researchers have a vital role to play in disseminating this information.
A method for automatic speaker recognition must use acoustic parameters closely associated with the speech qualities that differentiate speakers. The connections that have previously been established between the speech signal and the shapes and movements of the vocal tract must be taken into account when determining which parameters to utilize. Most methods of speaker identification require the extraction of information from the speaker's speech in order to function properly. Besides phonetics, prosody, and lexical information, high-level information includes attributes like dialect, accent, and the manner and context in which the speaker speaks; only humans are capable of recognizing and analyzing these characteristics at this time [5, 7]. Good parameters should show a lot of variation across speakers, while variation within a speaker should be kept to a minimum [6, 9].
The data samples were collected at the Signal Processing (Acoustics) Lab of Thapar University, Patiala, Punjab, India, from students aged 20-25 years. For the study, 400 people of various genders, races, religions, and ages, mostly of north Indian origin, were used as test subjects and controls. Each voice sample was recorded using a digital recorder of the highest quality. A variety of acoustical and perceptual factors influenced the recorded voice samples used to create the disguised audio; therefore, each speaker's voice samples were meticulously collected. Three control samples were also taken from each participant to see if there was a noticeable difference between the person's disguised voice and their natural voice. The disguise techniques listed below are all feasible depending on the conditions of disguise that various persons selected: placing a hand or towel over the mouth, normal voice, differences in voice pitch, state of being extremely cold, sore throat, paan or tobacco chewing, voice box constriction, and pinching the nose. The block diagram for acoustic analysis of non-electronic disguised voice is shown in Fig. 2.
The following equipment was used:
Audacity software
Computerized Speech Lab
Premium headphones
Data cable
1. Prior to the collection of a voice sample, each subject was provided with a pre-recorded sample of standard speech.
2. Every speaker was given the transcript and told to recite it four times: once in a controlled state and three times in a disguised state of their choosing. Each of the 400 speakers thus provided four samples.
3. Audio samples were gathered using the Audacity recorder.
4. A signed consent form from each participant, as well as a recording of their voice, was acquired for further investigation. Additionally, each speaker signed a declaration to ensure that their voice samples would be protected and could be used for research.
5. Each speaker's full name, date of birth, gender, and place of residence were meticulously recorded, and these data have been preserved to this day.
6. Subsequently, various software was used to compare the auditory and perceptual similarities and differences across all of the recorded samples from each participant.
By applying a variety of voice masking techniques to the same speaker, researchers were able to uncover differences within the speaker's own speech. Investigations like these show that acoustic characteristics can be utilised to identify a disguised speaker even if the method of disguise is unknown. To conduct this part of the research, a group of ten speakers from the same age range was selected and given a text to read aloud. Using the method outlined above, each student was required to recite the same passage 17 times using a range of vocal disguises. The acoustic variance of each speaker was determined by comparing their disguised recordings to their corresponding control samples.
There are a variety of recording devices, each of which has its own format for audio files. Files in an unsuitable format must be converted to the appropriate format before they can be used for spectrographic analysis. The recordings used here have the following properties:
File format: '.wav', recorded with the Audacity software
Bit depth: 32 bits
Channel: mono
Sampling rate: 8000 Hz
Twenty terms that appeared in both the disguised and normal speech samples of the respective speakers were found after comparing the aural similarities between the two. The procedure was as follows: collect the speech samples using the Audacity software; open the disguised and control voice files in separate windows to examine their attributes; convert all files to the same format to ensure consistency; and listen to each file at least three or four times to identify the clue words common to all four recordings. For spectrographic analysis, at least two windows of the same programme must contain all the clue words from each file, as shown in Fig. 3.
The spectrographic method of speaker recognition employs a device that displays voice signals. To convert sound into visuals, Potter of the Bell Telephone Laboratory developed an electromechanical acoustic spectrograph: an apparatus that can monitor the fluctuating energy-frequency distribution as a speech wave moves through the atmosphere. To identify people by their voiceprints, spectrographic impressions of their utterances are employed, much like fingerprints, and law enforcement can use them to help identify suspicious callers. It was once thought to be an impenetrable method of verifying an individual's identity, yet the "voiceprint" approach of spectrographic analysis-based voice identification has been in legal ambiguity for quite some time. By examining the features and bandwidth, a qualified examiner may be able to determine the resemblance of two samples.
This section illustrates feature extraction using the MFCC technique in operation. Pre-emphasis, which is a high-pass filter, should be applied first. The following time-domain equation represents the input t[n] with the pre-emphasis coefficient $\alpha$, which ranges from 0.8 to 1.0:

$k[n] = t[n] - \alpha\, t[n-1]$  (1)
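As a minimal sketch of Eq. (1), assuming NumPy and an illustrative coefficient of 0.95 from the stated 0.8-1.0 range (the exact value used in the paper is not specified):

```python
import numpy as np

def pre_emphasize(t, alpha=0.95):
    """Apply the pre-emphasis high-pass filter k[n] = t[n] - alpha*t[n-1] (Eq. 1)."""
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(t[0], t[1:] - alpha * t[:-1])
```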
A frame shift is the difference in milliseconds (often 25 ms) between the left margins of two consecutive frames. To produce the resultant signal at time n, the voice signal t[n] is multiplied by the window u[n]:

$k[n] = t[n] \cdot u[n]$  (2)
One of the simplest types of window is the rectangular window, whose signal suddenly stops at its edges. That abruptness, however, causes discontinuities, which make Fourier analysis more challenging to apply. When collecting MFCC features, a Hamming window is therefore utilised to prevent discontinuities; this window tapers the signal values toward zero at the boundary positions. The rectangular window of length L is depicted mathematically as

$u_L[n] = \begin{cases} 1 & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$  (3)

and the Hamming window as

$G[n] = \begin{cases} 0.54 - 0.46\cos\left(\frac{2\pi n}{L}\right) & 0 \le n \le L-1 \\ 0 & \text{otherwise} \end{cases}$  (4)
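A minimal sketch of framing and Hamming windowing (Eqs. 2-4), assuming the 8000 Hz sampling rate and 25 ms frames mentioned in this paper; the 10 ms frame shift and the use of NumPy's hamming window (which divides by L − 1 rather than L) are assumptions:

```python
import numpy as np

def frame_and_window(signal, fs=8000, frame_ms=25, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window (Eq. 4).

    Assumes len(signal) >= one frame length.
    """
    frame_len = int(fs * frame_ms / 1000)      # 200 samples at 8 kHz
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    window = np.hamming(frame_len)             # 0.54 - 0.46*cos(2*pi*n/(L-1))
    frames = np.stack([signal[i * shift : i * shift + frame_len]
                       for i in range(n_frames)])
    return frames * window                     # Eq. 2: k[n] = t[n] * u[n]
```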
The FFT offers several computational benefits over direct evaluation of the DFT; one constraint of the radix-2 FFT, however, is that the frame length N must be a power of two. A mathematical representation of the DFT is:

$V_i(k) = \sum_{n=1}^{N} v_i(n)\, e^{-j2\pi kn/N}, \quad 1 \le k \le K$  (5)
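To make Eq. (5) concrete, the sketch below evaluates the DFT sum directly and checks it against NumPy's FFT; indices run from 0 to N − 1 here, the usual programming convention:

```python
import numpy as np

def dft(v):
    """Direct evaluation of Eq. (5): V(k) = sum_n v(n) * exp(-j*2*pi*k*n/N)."""
    N = len(v)
    n = np.arange(N)
    return np.array([np.sum(v * np.exp(-2j * np.pi * k * n / N))
                     for k in range(N)])

v = np.random.randn(256)                   # 256 is a power of two, as the FFT prefers
assert np.allclose(dft(v), np.fft.fft(v))  # the FFT gives the same result, faster
```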
A Mel filter bank has 10 linear filters below 1000 Hz and logarithmically spaced filters above 1000 Hz. Each filter collects the energy within its band. The filters are triangular bandpass filters, and their placement is determined by the Mel scale frequency mapping:

$\mathrm{mel}(f) = 2595 \times \log_{10}\!\left(1 + \frac{f}{700}\right)$  (7)
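A sketch of a triangular Mel filter bank derived from Eq. (7); the choice of 26 filters and a 512-point FFT is illustrative, not taken from the paper:

```python
import numpy as np

def hz_to_mel(f):
    # mel(f) = 2595 * log10(1 + f/700)  (Eq. 7)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=26, n_fft=512, fs=8000):
    """Triangular bandpass filters spaced uniformly on the Mel scale."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):                 # rising slope
            fbank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):                # falling slope
            fbank[i - 1, k] = (right - k) / max(right - center, 1)
    return fbank
```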
The log Mel spectrum is converted back to the time domain by applying the Discrete Cosine Transform (DCT). The MFCCs can be obtained with the following formula:

$c[n] = \sum_{k=0}^{N-1} \log\!\left(\left|\sum_{m=0}^{N-1} x[m]\, e^{-j2\pi km/N}\right|\right) \cos\!\left(\frac{k(n-0.5)\pi}{N}\right)$  (8)
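Combining the previous steps, a minimal sketch of Eq. (8) computed as log Mel energies followed by a DCT; retaining 13 coefficients is a common assumption, not a value stated in the paper:

```python
import numpy as np
from scipy.fftpack import dct

def mfcc_from_frames(frames, fbank, n_ceps=13):
    """Windowed frames -> magnitude spectrum -> log Mel energies -> DCT (Eq. 8)."""
    spectrum = np.abs(np.fft.rfft(frames, n=512, axis=1))   # magnitude spectrum
    energies = np.maximum(spectrum @ fbank.T, 1e-10)        # floor to avoid log(0)
    return dct(np.log(energies), type=2, axis=1, norm='ortho')[:, :n_ceps]
```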
The overall energy of a frame can be expressed as the sum of the powers of all the frame's samples taken within a given window of time, say from sample t1 to sample t2:

$\mathrm{Energy} = \sum_{t=t_1}^{t_2} x^2(t)$  (9)
The delta and double-delta features are determined by analyzing the differences between frames. For a cepstral value c(t), the delta feature d(t) at time instant t is:

$d(t) = \frac{\sum_{n=1}^{N} n\,\bigl(c(t+n) - c(t-n)\bigr)}{2\sum_{n=1}^{N} n^2}$  (10)
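A sketch of the delta computation in Eq. (10), with a window of N = 2 frames on each side (an assumed but common choice); frame edges are handled by repeating the first and last frames:

```python
import numpy as np

def delta(ceps, N=2):
    """Delta features d(t) per Eq. (10); ceps has shape (frames, coefficients)."""
    padded = np.pad(ceps, ((N, N), (0, 0)), mode='edge')   # repeat edge frames
    denom = 2 * sum(n * n for n in range(1, N + 1))
    num = sum(n * (padded[N + n : N + n + len(ceps)]       # c(t + n)
                   - padded[N - n : N - n + len(ceps)])    # c(t - n)
              for n in range(1, N + 1))
    return num / denom
```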
Since MFCC features vary with utterance length, statistical moments are used to recover speech vectors of the same dimension. For an N-frame speech signal, the MFCC vectors are written with j as the feature component index and L as the number of components:

$O_j = \{o_{1j}, o_{2j}, \ldots, o_{Nj}\}, \quad j = 1, 2, \ldots, L$  (11)
Two statistical moments are employed in this study. The correlation coefficient between distinct MFCC components is calculated after subtracting the means $F_j$ of the related MFCC features $O_j$, as given in Eqs. (12) and (13):

$F_j = F(O_j), \quad j = 1, 2, \ldots, L$  (12)

$CR_{jj'} = \frac{\mathrm{cov}(O_j, O_{j'})}{\sqrt{\mathrm{var}(O_j)}\,\sqrt{\mathrm{var}(O_{j'})}}, \quad 1 \le j < j' \le L$  (13)
The correlation coefficients are then combined into the statistical-moment feature vector of the MFCC vectors:

$U_{\mathrm{MFCC}} = (CR_{12}, CR_{13}, \ldots, CR_{L-1\,L})$  (14)
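A sketch of Eqs. (12)-(14): np.corrcoef subtracts the means and normalizes by the standard deviations internally, so it yields exactly the CR coefficients of Eq. (13):

```python
import numpy as np

def correlation_moments(mfcc):
    """U_MFCC = (CR_12, CR_13, ..., CR_{L-1,L}) per Eq. (14).

    mfcc has shape (N frames, L components); output length is L*(L-1)/2.
    """
    L = mfcc.shape[1]
    corr = np.corrcoef(mfcc.T)   # pairwise correlation between components
    return np.array([corr[j, k] for j in range(L) for k in range(j + 1, L)])
```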
A text-dependent speaker identification method is one of the fundamental components of this technology. This can be rationally explained by the fact that the acoustic feature coefficients of the disguised voices are identical to those of the actual voices.

When applied to the speech data in the current study, the chi-square test is performed on the correlation frame derived by MFCC feature extraction, testing the hypothesis p (a speech signal is present in the frame) for each sub-band k. The noise histogram roughly approximates the noise probability density function from the prior frame and yields the vector of expected values e:

$e = (e_1, e_2, \ldots, e_N)$ for $N$ values  (15)
The same method and the identical value of N are used to obtain the vector of observed values o from the present signal:

$o = (o_1, o_2, \ldots, o_N)$ for $N$ values  (16)
The chi-square test is then conducted on these bins, with the following chi-square statistic:

$X^2 = \sum_{i=1}^{N} \frac{(o_i - e_i)^2}{e_i}$  (17)
The threshold for the generated chi-square statistic is chosen according to the permitted error probability; this value, against which the outcome of the chi-square statistic is compared, can be found in standard chi-square tables. If the computed statistic turns out to be higher than the threshold value, the hypothesis is rejected; if not, it is accepted.
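A minimal sketch of the speech/noise decision described above: Eq. (17) is computed over the histogram bins and compared against a threshold. The threshold must be looked up in a chi-square table for the chosen error probability and degrees of freedom, so it is an input here rather than a value from the paper:

```python
import numpy as np

def chi_square_stat(observed, expected):
    """X^2 = sum_i (o_i - e_i)^2 / e_i  (Eq. 17)."""
    o = np.asarray(observed, dtype=float)
    e = np.asarray(expected, dtype=float)
    return np.sum((o - e) ** 2 / e)

def frame_is_noise(observed_hist, noise_hist, threshold):
    """Label a frame as noise when its observed distribution stays close to the
    estimated noise distribution, i.e. the statistic stays below the threshold."""
    return chi_square_stat(observed_hist, noise_hist) <= threshold
```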
The subjects had to modify their normal voices for one of the voice samples. Across the 400 test speakers, the following disguise techniques were employed: whispering (6%), pinching nostrils (10%), protruding lips (6%), obstacle in mouth (8%), mimicry (11%), state of cold (6%), pretending anger (16%), covering mouth (12%), changing accent (10%), raising pitch (8%), and lowering pitch (7%), as shown in Table 1. Participants most often chose to conceal themselves by pretending anger or by covering their mouths externally with a hand or other object, followed by raising or lowering the pitch of their voice relative to typical values. Changing one's natural accent or tone, protruding the lips, a sore throat, and a cold were among the least chosen techniques.
The use of voice disguise hampered speech quality. Both males and females were able to maintain a good level of speech quality under normal voice conditions. A high negative correlation of −0.932 (males) and −0.952 (females) was established between the speech quality of respondents when speaking in a disguised versus a control voice. Female subjects exhibited more diversity in speech quality while using voice disguise than male subjects. Disguised voice samples from both men and women, including those produced by modifying pitch, pinching the nostrils, a cold, constricting the vocal tract, covering the mouth, protruding the lips, throat infection, and whispering, usually resulted in lower speech quality. This indicates a significant difference between the two samples in speech quality. For both male and female voice samples, a substantial link in speech quality was found between the samples disguised by tugging the cheeks, feigning anger, or modifying the accent/tone and their control counterparts, as shown in Table 2.
The chi-square value for speech quality was calculated from the correlation coefficients of all of the normal and disguised voice samples, leading to rejection of the null hypothesis in favor of the alternative, which holds that the differences in voice quality can be traced back to the sample used to record the audio. As shown in Table 3, the chi-square test was significant when used to test for disguise through a constricted tract, lowered pitch, pinching of the nose, and mouth covering. The test rejected the null hypothesis and confirmed that changes in speech quality are strongly dependent on the type of speech sample used. The chi-square values were calculated by Eq. (17), with the correlation coefficient of the normal voice taken as the expected coefficient (e) and that of the disguised voice as the observed coefficient (o).
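As a worked check against Table 3, take whispering: the normal-voice coefficient e = 1.00 is the expected value and the disguised coefficient o = 0.89 the observed one, so

$X^2 = \frac{(0.89 - 1.00)^2}{1.00} = 0.0121$,

which matches the tabulated value.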
The calculation process uses MFCC statistical moments such as correlation coefficients, acoustic features, and vectors. The classifiers used to recognize voices affected by non-natural disguise are based on these acoustic data. An SVM and a k-NN classifier are used to study the speaker recognition system, and Table 4 presents the findings. Figure 4 shows the detection rates of the classifiers: the SVM classifier identifies speakers better than any other, while the k-NN classifier algorithm also aids speaker recognition.
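A minimal sketch of the classification stage, assuming one correlation-moment vector (Eq. 14) per utterance with speaker labels; the data below are random placeholders, and the RBF kernel, C = 10, and k = 5 are illustrative choices rather than the paper's settings:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Placeholder data: 400 utterances, 78 correlation moments (13 MFCCs -> 13*12/2),
# 40 hypothetical speakers with 10 utterances each; real features come from Eq. (14).
X = np.random.randn(400, 78)
y = np.repeat(np.arange(40), 10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

for name, clf in [("SVM", SVC(kernel="rbf", C=10.0)),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    clf.fit(X_tr, y_tr)
    print(name, "accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```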
Table 1 Disguise methods used by different speakers (total speakers = 400)

S. No.  Disguise method     No. of speakers  Speakers (%)
1       Whispering          24               6
2       Pinching nostrils   40               10
3       Protruding lips     24               6
4       Obstacle in mouth   32               8
5       Mimicry             44               11
6       State of cold       24               6
7       Pretending anger    64               16
8       Covering mouth      48               12
9       Changing accent     40               10
10      Raising pitch       32               8
11      Lowering of pitch   28               7
Table 3 Statistical analysis using the chi-square test on normal and disguised voices

S. No.  Disguised method    Correlation coefficient  Chi-square value
1       Normal voice        1.00                     0
2       Whispering          0.89                     0.0121
3       Pinching nostrils   −0.56                    2.4336
4       Protruding lips     0.78                     0.0484
5       Obstacle in mouth   0.89                     0.0121
6       Mimicry             0.93                     0.0049
7       State of cold       −0.59                    2.5281
8       Pretending anger    0.92                     0.0064
9       Covering mouth      −0.48                    2.1904
10      Changing accent     0.79                     0.0441
11      Raising pitch       −0.87                    3.4969
12      Lowering of pitch   0.66                     0.1156
Table 4 Comparison of the proposed classification efficiency (%) with the existing technique

Disguised method   Neural network (existing) [1]   SVM (proposed)   k-NN (proposed)

Fig. 4 Existing and proposed classification methods for the different disguise methods

5 Conclusion

The usage of voice disguise is one of the most common problems that continues to be a challenge for specialists. When dealing with normal or ideal voice recognition, that is, a voice that is constant and free from any type of distortion brought on by noise, changes in emotional and physical state, intentional disguise, and other factors, the identification of the speaker is easier and produces a more conclusive opinion. When trying to identify a person based on a disguised voice sample, difficulties arise. The chi-square test is the inspiration for this novel approach: rather than relying on typical heuristic concepts, it deviates from the norm by looking for deviations from the noise distribution when making its speech/noise determination. Over a wide variety of SNRs and noise types, the proposed method was found to produce the most accurate speech/noise categorization.
Declarations
Conflict of interest The authors declare that they have no conflict of interest.
References
1. Nair, A. M., & Savithri, S. P. (2021). Classification of pitch and gender of speakers for forensic speaker
recognition from disguised voices using novel features learned by deep convolutional neural net-
works. Traitement du Signal, 38(1).
2. Zhang, C., & Tan, T. (2008). Voice disguise and automatic speaker recognition. Forensic Science
International, 175(2–3), 118–122.
3. Singh, M. K., Singh, A. K., & Singh, N. (2018). Multimedia analysis for disguised voice and classifi-
cation efficiency. Multimedia Tools and Applications, 78(20), 29395–29411.
4. Ahmed, B., & Holmes, P. H. (2004). A voice activity detector using the chi-square test. In 2004 IEEE
international conference on acoustics, speech, and signal processing (Vol. 1, pp. I-625). IEEE.
5. Perrot, P., & Chollet, G. (2008). The question of disguised voice. Journal of the Acoustical Society of
America, 123(5), 3878.
6. Singh, M. K. (2023). A text independent speaker identification system using ANN, RNN, and CNN
classification technique. Multimedia Tools and Applications, 1–13.
7. Rodman, R. (1998). Speaker recognition of disguised voices: A program for research. In Proceedings
of the consortium on speech technology in conjunction with the conference on speaker by man and
machine: Direction for forensic applications (pp. 9–22). COST 250.
8. Singh, M. K. (2023). Feature extraction and classification efficiency analysis using machine learning
approach for speech signal. Multimedia Tools and Applications, 1–16.
9. Wu, H., Wang, Y., & Huang, J. (2014). Identification of electronic disguised voices. IEEE Transactions
on Information Forensics and Security, 9(3), 489–500.
10. Reich, A. R., Moll, K. L., & Curtis, J. F. (1976). Effects of selected vocal disguises upon spectro-
graphic speaker identification. The Journal of the Acoustical Society of America, 60(4), 919–925.
11. Singh, M. K., Singh, A. K., & Singh, N. (2018). Multimedia analysis for disguised voice and classification efficiency. Multimedia Tools and Applications, 78(20), 29395–29411.
12. Nandan, D., Singh, M. K., Kumar, S., & Yadav, H. K. (2022). Speaker identification based on physical
variation of speech signal. Traitement du Signal, 39(2).
13. Farrús, M. (2018). Voice disguise in automatic speaker recognition. ACM Computing Surveys (CSUR),
51(4), 1–22.
14. Wolf, J. J. (1972). Efficient acoustic parameters for speaker recognition. The Journal of the Acoustical
Society of America, 51(6B), 2044–2056.
15. Liang, H., Lin, X., Zhang, Q., & Kang, X. (2017). Recognition of spoofed voice using convolu-
tional neural networks. In 2017 IEEE global conference on signal and information processing
(GlobalSIP) (pp. 293–297). IEEE.
16. Wang, L., Liang, H., Lin, X., & Kang, X. (2018). Revealing the processing history of pitch-shifted
voice using CNNs. In 2018 IEEE international workshop on information forensics and security
(WIFS) (pp. 1–7). IEEE.
17. Satorra, A., & Bentler, P. M. (2001). A scaled difference chi-square test statistic for moment structure
analysis. Psychometrika, 66(4), 507–514.
18. Yao, L. (2020). A compressed deep convolutional neural networks for face recognition. In 2020 IEEE
5th international conference on cloud computing and big data analytics (ICCCBDA) (pp. 144–149).
IEEE.
19. Lakshmi, P. A., Veerapandu, G., Gamini, S., & Singh, M. K. (2022). CNN classification of multi-scale ensemble OCT for macular image analysis. International Journal of Electrical and Electronics Research, 10(4), 858–861. https://doi.org/10.37391/IJEER.100417
20. Yang, H., Yang, Z., & Huang, Y. (2019). Steganalysis of VoIP streams with CNN-LSTM network. In Proceedings of the ACM workshop on information hiding and multimedia security (pp. 204–209).
Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under
a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted
manuscript version of this article is solely governed by the terms of such publishing agreement and applicable
law.