Speech Analysis in Forensic Science

This document discusses speech analysis in forensic science. It begins with definitions and a brief history of speech analysis, noting its use as far back as ancient Greece and Rome. It then discusses forensic phonetics, how individual speech differs based on both organic and learned factors like anatomy, physiology, pronunciation, and accent. Key aspects of individual speech include vocal tract size and shape, laryngeal vibration, fricative sources, and use of the nasal cavity. The document also covers topics like forensic phonetics applications, speaker recognition, voice identification, speech and voiceprint analysis, and their use in forensic investigations.


SPEECH ANALYSIS IN FORENSIC SCIENCE
CONTENT

1. Definition
2. History
3. Forensic phonetics
4. Factors affecting an individual's speech
5. Forensic phonetics importance
6. Issues and solutions
7. Phonetic parameters
8. Speaker recognition
   High / low
   Verification / identification
   Naïve / technical
9. Voice identification
   Subjective / objective
10. Speech recognition
11. Voice print identification
12. Truth verification
13. Earwitness
14. Application
15. Articles

WHAT IS FORENSICS?


Human beings have many characteristics that make it possible to distinguish
one individual from another. Some individuating characteristics, such as facial
features, vocal qualities and behaviour, can be perceived very rapidly.

The voice is the very emblem of the speaker, indelibly woven into the fabric of
speech. In this sense, each of our utterances of spoken language carries not only
its own message but, through one's accent, tone of voice and habitual voice
quality, an audible declaration of our membership of a particular regional group,
of our individual physical and psychological identity, and of our momentary
mood. Thus the voice of an individual has its own distinct, distinguishable
characteristics.

HISTORY
Early history:

Saslove and Yarmey (1980) quote a translation of some writings by the pre-Socratic
philosopher Heraclitus: "Eyes and ears are bad witnesses for men, since their souls
lack understanding." The Roman rhetorician Quintilian (1899 translation), by
contrast, held that earwitnesses can be helpful: "The voice of the speaker is as
easily distinguished by the ear as the face is by the eye." Hoffman (1940) quotes
the maxim that a good speaker must be a good man. It appears that speaker
identification existed as a recognized entity from the time people began writing
down their opinions about human behaviours and capabilities.

Semi-modern times:

Aural-perceptual testimony can be traced back even earlier, to the year 1660, when
voice identification was offered in the case of William Hulet in Great Britain. Yarmey
(1995) commented on a quotation from Jeremy Bentham, who said "witnesses are the eyes
and ears of justice", and points out that sometimes these witnesses are accurate,
complete and trustworthy; sometimes they are not. Before the 19th century ended, there
was even debate about whether voices could be recognized over the telephone. In the
20th century, the infant son of Charles Lindbergh, then an international hero, was
kidnapped in 1932 and found murdered. Lindbergh identified the kidnapper by his voice
more than two years later; during the ransom negotiations he had heard the kidnapper's
voice twice: once over the telephone and again in person.

World War II and Immediately after:

In 1944, when word of the bomb plot against Hitler crackled around the world, people
were anxious to know whether Hitler had been killed or was essentially unharmed.
British and/or U.S. agencies contracted Dr. Mack Steer to compare telephone voices
intercepted from within Germany with speeches Hitler had given earlier. Using
aural-perceptual procedures, Steer and his associates identified the intercepted
voice as similar to Hitler's and concluded that Adolf Hitler was alive; later
intelligence proved them correct. Gray and Kopp (1944) wrote an in-house report
entitled "Voiceprint Identification". During the war, the Germans captured huge
numbers of loyal Soviet soldiers and civilians in their drive towards the cities of
Moscow, Leningrad and Stalingrad, and after the war was over, many of these prisoners
were returned to the USSR.

At this time, Stalin wanted to know who among this mass of people had been loyal and
who had not, and how he could tell which was which. The prisoners were split up,
put in jail, and assigned to a number of projects; one of the groups was asked to
create a workable speaker identification procedure (Solzhenitsyn, 1968). When they
were assigned to determine which of five suspects had committed a crime against the
state, they could eliminate three of the five, but could not differentiate between
the remaining two because of their similar speech characteristics.

In the decades that followed, a trend developed in which the police applied whatever
procedures they could get their hands on. It was during this period that personnel
employed by many of these agencies began to assume that earwitness line-ups (or
voice parades, where a person who has heard, but not seen, an individual attempts to
identify them by listening to their voice) would be just as effective as visual
identification. Unfortunately, they did not realize that there are substantial
differences between the two approaches. For example, they were not aware that memory
for heard acoustic signals can be quite variable and is not the same as visual
memory. Police departments were therefore quite variable in how they developed
earwitness identification, and its use tended not to be particularly rewarding.
These concerns persist to the present day.

WHAT IS FORENSIC PHONETICS? WHAT DOES IT DO?


The word forensic has two general meanings, one having to do with courts of
justice and the other with public debate or discussion. It is the first meaning
that is dealt with here. The field of forensic science pertains to the
investigation of criminal evidence, including fingerprints, DNA, serology and
voiceprints.

Phonetics is mainly concerned with speech: it studies how people speak, how
speech is transmitted acoustically, and how it is perceived.

Forensic phonetics is the use of the principles of phonetics for legal purposes, and the
extension of phonetic research to investigations relevant to legal situations.
Forensic phonetics can be defined as a professional specialty based on the utilization
of current knowledge about communicative processes, including the development of
specialized techniques and procedures, for the purpose of meeting certain needs of
legal groups and law enforcement agencies.
Forensic phonetics consists of two general areas:
i. The first area involves the electro-acoustical analysis of speech and voice signals
that have been transmitted and stored. It focuses on problems such as the proper
transmission and storage of spoken exchanges, the authentication of tape
recordings, the enhancement of speech on tape recordings, speech decoding, etc.

ii. The second major area involves the analysis (both physical and perceptual) of
communicative behaviours. That is, it involves issues such as the identification of
speakers, obtaining information about the speaker's physical or psychological
state, and the analysis of speech.
Corsi (1979) said: "A person's voice is a complex signal which encodes various kinds of
information; among other things, it reflects the anatomy and physiology of the speaker."
HOW DOES ONE INDIVIDUAL’S SPEECH DIFFER FROM THAT OF
ANOTHER?
 Organic vs learned differences
The human vocal apparatus varies in size and shape, so resonant frequencies and the
rate of vocal fold vibration differ between individuals; an individual's physique
therefore clearly influences how that individual sounds. But speech is more than a
physical event. A child learns more than a language, acquiring a marked variety of
pronunciation, and this learning extends into early adulthood. The organic/learned
dichotomy fails to capture the complexity of speaker individuality: in recorded speech,
the effect of organic differences is convolved with the effect of what the speaker has
learnt, in terms of the linguistic system and the choices made from it at a given
moment. Ill health affecting the vocal organs, from minor colds through to cancer of
the larynx, changes the sound; in the shorter term, stress, fatigue and intoxication
also have an effect. Sociolinguistic variation in speech cannot be ignored in work on
speaker identification.
 Individuality in the physical mechanism: speech as anatomy made audible
Speech analysis generally adopts a source-filter theory of speech production: the
larynx is the source of acoustic energy, and the supralaryngeal vocal tract is the
filter, or resonator, which shapes that energy. The rate of vibration of the vocal
folds is determined by their mass and length, as is the shape of the glottal source
wave. Anatomical irregularities (nodules, oedema) may introduce irregularities into
the acoustic signal.
Variation in vocal tract shape and length underlies the clear distinction between
male, female and child voices. Laryngeal vibration is not the only source of acoustic
energy in speech: the energy for fricatives is generated by air turbulence, and the
precise acoustic property of this source depends upon the shape and size of the teeth.
This has been very useful in speaker identification work.
The vocal tract filter, consisting of the pharyngeal and oral cavities, can be
augmented by opening the velic port and coupling in the resonance of the nasal cavity.
Nasal sounds have often attracted attention as cues to speaker identity because the
shape and size of the nasal cavity are highly variable between speakers.
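The source-filter idea above can be sketched computationally: a glottal pulse train (the source) is passed through a cascade of resonators at formant frequencies (the filter). This is only an illustrative sketch; the f0 of 120 Hz, the formant frequencies 500/1500/2500 Hz and the bandwidths are assumed values, not taken from the text.

```python
import numpy as np

fs = 8000          # sampling rate (Hz)
f0 = 120           # fundamental frequency of the glottal source (Hz)
formants = [500, 1500, 2500]   # illustrative formant frequencies (Hz)
bandwidths = [80, 100, 120]    # illustrative formant bandwidths (Hz)

# Source: an impulse train at the rate of vocal fold vibration.
n = fs // 2                    # half a second of signal
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: a second-order resonator for each formant of the vocal tract.
def resonate(x, fc, bw):
    r = np.exp(-np.pi * bw / fs)             # pole radius from bandwidth
    theta = 2 * np.pi * fc / fs              # pole angle from frequency
    a1, a2 = 2 * r * np.cos(theta), -r * r   # recursive filter coefficients
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = x[i] + a1 * (y[i - 1] if i > 0 else 0.0) \
                    + a2 * (y[i - 2] if i > 1 else 0.0)
    return y

speech = source
for fc, bw in zip(formants, bandwidths):
    speech = resonate(speech, fc, bw)
# 'speech' now has energy concentrated near the formant frequencies;
# changing the formant values models a differently shaped vocal tract.
```

Changing `f0` models a different rate of laryngeal vibration, while changing `formants` models a different vocal tract shape, which is exactly the organic variation the text describes.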
 Individuality in the linguistic mechanism: speech as behaviour
An act of speaking usually conveys more than a bare message. Whatever an individual
intends by a message will be accompanied by signals that indicate the attitude of the
speaker, help regulate the flow of the conversation, reinforce a social relationship
between participants in an interaction, and present the speaker's self-image. To the
extent that the message is conveyed, the linguistic mechanism is exploited. Most
obviously, an individual has an accent, which means that their speech allows them to
be identified with a group. Accents are differentiated along parameters which the
speaker can control, specifically segmental phonology, prosody, and aspects of voice
quality. Similarly, on the geographical dimension, dialectologists have traditionally
delighted in separating smaller and smaller groups of speakers on the basis of
pronunciation.
 Individuality in phonetic implementation

There may be tight constraints imposed by a language on the auditory effects associated
with a consonant and with a following vowel, but the transition between the two
articulations, and hence the co-articulation of the two segments, may allow for
variation. The tongue position for the vowel might adapt relatively early, making the
consonant's secondary articulation highly dependent on the quality of the vowel, or
relatively late, producing more of the transition in the early part of the vowel. Such
between-speaker variation was demonstrated by Su, Li and Fu (1974) for nasal-plus-vowel
sequences and by Nolan (1983) for lateral-plus-vowel sequences.
 Speakers outside the normal range

Virtually by definition, the speakers who are most distinctive are those who lie
outside the normal range: usually people who have speech problems of varying degrees
of severity, or whose command of a language is non-native. Speech phenomena outside
the normal range are highly valuable in speaker identification. If, for instance,
acoustic measurement and phonetic analysis of accent lead to the conclusion that two
recordings are of the same male speaker, with no more than a 10% chance that they are
from different speakers, the presence of a similar stutter in both recordings could
reduce that to less than 0.2%. This is because roughly one man in 50 is a stutterer.
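The arithmetic above can be made explicit. A minimal sketch: the 10% figure and the 1-in-50 stutter rate are taken from the text, while the assumption that the two pieces of evidence are independent is mine.

```python
# Combining independent pieces of evidence in speaker identification.
# Chance the recordings are from different speakers, based on the
# acoustic/phonetic analysis alone (from the text: 10%).
p_diff = 0.10

# If the recordings really were from different speakers, a matching
# stutter would occur only about as often as stutterers occur in the
# male population (from the text: roughly 1 in 50).
p_stutter_match_given_diff = 1 / 50

# Assuming independence, the chance that two *different* speakers would
# match on both the acoustics and the stutter:
p_diff_after_stutter = p_diff * p_stutter_match_given_diff
print(f"{p_diff_after_stutter:.1%}")  # prints "0.2%"
```

This is the informal version of the likelihood-ratio reasoning used in forensic speaker comparison: each additional matching feature that is rare in the population multiplies down the probability of a coincidental match.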
 Between-speaker & within-speaker variation

Variation between speakers is larger than the variation within a speaker. The greater
the ratio of between-speaker to within-speaker variation, the easier the identification.
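That ratio can be computed directly for any acoustic parameter. A toy sketch for a single parameter (say, mean f0 in Hz); the speaker names and values are invented for illustration:

```python
import numpy as np

# Three repeated measurements of one parameter per speaker (invented).
samples = {
    "speaker_A": [118.0, 121.0, 119.5],
    "speaker_B": [142.0, 140.5, 143.0],
    "speaker_C": [165.0, 162.0, 164.5],
}

speaker_means = {s: np.mean(v) for s, v in samples.items()}
grand_mean = np.mean([x for v in samples.values() for x in v])

# Between-speaker variance: spread of speaker means around the grand mean.
between = np.mean([(m - grand_mean) ** 2 for m in speaker_means.values()])
# Within-speaker variance: spread of each speaker's samples around that
# speaker's own mean, averaged over speakers.
within = np.mean([np.var(v) for v in samples.values()])

ratio = between / within
# The larger this ratio, the more useful the parameter for telling
# these speakers apart.
print(round(ratio, 1))
```

With the invented values above, the speakers' means are far apart relative to their day-to-day scatter, so the ratio is large and identification on this parameter would be easy.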

 Distribution in speaker space


Voices are not distributed equally in speaker space. Some are closer together
than others (some voices sound more similar than others).

 Multidimensionality
Voices are multidimensional objects. A speaker's voice is potentially
characterisable in an exceedingly large number of different dimensions.

 Lack of control over variation


Variation is typical of speech, both between and within speakers. Between-speaker
variation in acoustic output occurs trivially because speakers differ (e.g., men
and women have vocal tracts of different dimensions). Within-speaker variation
might occur as a function of the speaker's differing emotional or physical state.

 Reduction in dimensionality
Since the ability to discriminate voices is clearly a function of the available
dimensions, any reduction in the number of available dimensions constitutes a
limitation. In addition to reduction, distortion is a problem in the real world;
it occurs commonly in telephone transmission (Rose & Simmons, 1996; Künzel, 2001).

 Representativeness of forensic data


This means how representative the actual observations are of the voices they came
from. The more representative the data are of their voices, the stronger the
estimate of the strength of the evidence will be, either for or against common origin.

IMPORTANCE
 Provides support for forensic work just as it does for clinical areas of speech and voice
 Can be used to assist military, industrial and security organizations
 The forensic phonetics interface is primarily with the criminal justice and judicial
systems
 Speaker identification.
 Transcription or content identification [phonetic expertise is used either
to create a transcript of a recording, to give evidence as to the reliability
of such a transcript, or to determine what was said when recordings are of
bad quality or when the voice is pathological or has a foreign accent].
 Speaker profiling [in the absence of a suspect, saying something about
the regional or socioeconomic accent of the offender's voice(s)].
 The construction of voice line-ups and tape authentication (determining
whether a tape has been tampered with).
 Language or accent identification (phonetic expertise is used to give
evidence as to the likely place of origin of a particular speaker).
 Analysis of forensic evidence is used in the investigation and prosecution
of civil and criminal proceedings. Often, it can help to establish the guilt
or innocence of possible suspects.
 Forensic evidence is also used to link crimes that are thought to be related
to one another. Linking crimes helps law enforcement authorities to narrow
the range of possible suspects and to establish patterns for crimes, which
are useful in identifying and prosecuting suspects.
 Forensic scientists also work on developing new techniques and
procedures for the collection and analysis of evidence, so that new
technology can be put to use.
 Phoneticians and speech scientists have a major role in contributing to these
areas.
 It is important for phoneticians and speech scientists to be closely aware of
the practices in their countries' legal systems.

Many personal identifying characteristics are based on physiological properties,
others on behaviour, and some combine physiological and behavioural properties.
Physiological characteristics may offer more intrinsic security since they are not
subject to the kinds of voluntary variation found in behavioural features. Voice is
an example of a biometric that combines physiological and behavioural characteristics.

The notion that an individual has a voice by which he can be recognized is a
natural one, based on our day-to-day experience of successfully recognizing people
by their speech alone, typically over the phone. The question is not whether
individuals can be recognized by their voices, but how this recognition can be most
effectively and reliably carried out in an objective way (Nolan, 1983).

SOME OF THE ISSUES AND SOLUTIONS


1. PROBLEMS WITH SPEECH FIDELITY
Without a doubt, intelligible, accurate speech is important to many groups, including
law enforcement agencies, other units within the criminal justice system, and the
courts. For example, the effectiveness of many detectives would be sharply reduced if
they could suddenly no longer utilize tape recordings for surveillance, during
interrogation, or for record keeping. Indeed, it is possible that analysis of signals
stored on tape recordings ranks among the more powerful tools that investigators
currently have at their disposal. On the other hand, any degradation of the
intelligibility or quality of these signals (by distortion, noise or interference) can
create problems for interrogations, investigations, and so on. With specialized
knowledge about the problem, plus effective processing procedures, the amount of
information captured and utilized can be substantial.
2. SOURCES OF DIFFICULTY

The tape recordings generated for law enforcement purposes are rarely high fidelity;
indeed, they are often of rather limited quality. The two main sources of difficulty
are distortion and noise. Both result primarily from the use of inadequate equipment,
poor recording techniques, or events occurring within the acoustic environment in
which the tape recordings were made.
Degradation of speech intelligibility by and within equipment can result from:

- Reduction of signal bandwidth
- Harmonic distortion
- System noise
- Intermittent masking / reduction / elimination of target sounds

For example, when noise is added to target speech recorded with an inexpensive or
very small tape recorder, the end result is speech that is not intelligible enough
for easy decoding, or perhaps any decoding at all.
Although very slow recording speeds permit a great deal of information to be
captured on small tape recorders, the end result again is speech that is not
intelligible enough for easy decoding.
Noise by itself can be a culprit, and it takes many forms, e.g. forensic noise
(competing speech, music, etc.). The more common types of noise can be:

- Broadband or narrowband
- Noise with a natural frequency or frequencies
- Steady-state or intermittent
- Noise resulting from friction sources (wind, clothes movement), radio transmission,
vehicle operation, explosions, etc.

3. REMEDIES
First, it should be acknowledged that a number of scientists and engineers have been
developing fairly sophisticated machine techniques designed to reconstruct degraded
speech. They employ approaches such as bandwidth compression, cross-channel
correlation, least-mean-squares analysis, all-pole models, group delay functions,
adaptive linear filtering, linear predictive coefficients, cepstrum techniques, and
deconvolution.
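One of the simplest ideas in this family, magnitude spectral subtraction, can be sketched as follows. This is a toy illustration rather than a forensic-grade tool: the signal, noise level, frame size and the availability of a noise-only recording are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)        # stand-in for the target speech
noise = 0.5 * rng.standard_normal(fs)      # broadband noise
noisy = clean + noise

# Estimate the average noise magnitude spectrum from noise-only material
# (in practice, pauses in the speech would be used).
frame = 256
noise_frames = 0.5 * rng.standard_normal((50, frame))
noise_mag = np.mean(np.abs(np.fft.rfft(noise_frames, axis=1)), axis=0)

# Subtract the noise estimate from each frame's magnitude spectrum,
# floor at zero, and resynthesize using the original phase.
out = np.zeros_like(noisy)
for start in range(0, len(noisy) - frame + 1, frame):
    spec = np.fft.rfft(noisy[start:start + frame])
    mag = np.maximum(np.abs(spec) - noise_mag, 0.0)
    out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), frame)

# 'out' keeps the strong tonal component while broadband noise energy is
# reduced; real systems add overlapping windows and smoothing to avoid
# the "musical noise" artefacts this crude version produces.
```

Each bin's magnitude can only shrink, so the processed signal never gains energy; the design question in practice is how aggressively to subtract without distorting the speech itself.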

FORENSIC PHONETIC PARAMETERS:


Auditory vs acoustic analysis:
Auditory analysis is concerned with comparing samples linguistically. It is part
of the traditional training of phoneticians, those who study how speech is
produced, transmitted acoustically, and perceived. Non-linguistic aspects include
voice quality, pitch range, etc.

Acoustic analysis: the border between linguistic and individual information is
most clearly crossed within the individual vocal tract, and the acoustic output of
a vocal tract is uniquely determined by its shape and size. Thus individuals with
differently shaped vocal tracts will output different acoustics for the same
linguistic sound.

Traditional vs automatic acoustic parameters:

Traditional: the acoustic cues that relate to differences between language sounds,
either within a language or between languages, can be called traditional acoustic
parameters; another term is natural parameters (Braun & Künzel, 1998). These
acoustic cues are affected by being produced by vocal tracts of different shapes
and sizes (Fant, 1960; Stevens, 2000). For example, pitch height is a linguistically
relevant feature because, in a tone language, it signals the differences between
tones, which helps to distinguish words; and fundamental frequency depends on the
speaker's vocal cords.
Automatic acoustic parameters concern how well computers can be programmed to
automatically recognize speakers and speech. Humans and computers recognize speech
patterns in different ways (Ladefoged, 2001); automatic recognition makes use of
acoustic parameters, some of which differ considerably from the traditional
acoustic parameters that humans use to recognize speech sounds.

Linguistic vs non-linguistic:

Linguistic: a linguistic parameter can be thought of as any sound feature that has
the potential to signal a contrast, either in the structure of a given language, or
across languages or dialects.
Non-linguistic: an example of a non-linguistic observation is that the voices in
both samples sound lower than average in pitch. This is non-linguistic behaviour
because, although it might signal the speaker's emotional state, a speaker's overall
level of pitch is not used to distinguish linguistic structure, i.e. to signal the
differences between words.
Linguistic and non-linguistic observations can be made both in auditory and in
acoustic terms.

SPEAKER RECOGNITION (VOICE RECOGNITION)


Speaker recognition might be defined as any activity whereby a speech sample is
attributed to a person based on its phonetic-acoustic or perceptual properties.
It is a general concept which subsumes speaker identification and speaker
verification, and relates to the overall process of recognizing a person from
their speech. Hecker (1971) suggests that speaker recognition is any
decision-making process that uses speaker-dependent features of the speech signal.
Atal (1976) suggests that speaker recognition is any decision-making process that
uses some features of the speech signal to determine whether a particular person
is the speaker of a given utterance.

Bases for speaker recognition:

The principal function associated with the transmission of a speech signal is to
convey a message. However, along with the message, additional kinds of information
are transmitted, including information about the gender, identity, emotional state
and health of the speaker. The source of all these kinds of information lies in
both physiological and behavioural characteristics. The shape of the vocal tract,
determined by the position of the articulators, creates a set of acoustic
resonances. The spectral peaks associated with these resonances are referred to as
speech formants. The locations in frequency and, to a lesser degree, the shapes of
the resonances distinguish one speech sound from another. In addition, the formant
locations and bandwidths, and spectral differences associated with the overall size
of the vocal tract, serve to distinguish the same sounds spoken by different
speakers. The shape of the nasal tract, which determines the quality of nasal
sounds, varies significantly from speaker to speaker. The fundamental frequency,
associated with the vibration of the glottis, also varies from individual to
individual.

Extraction of speaker characteristics from the speech signal:

The perceptual view classifies speech as containing low-level and high-level kinds
of information.

Low-level features of speech are associated with the periphery of the brain's
perception of speech. These features, such as formant locations and bandwidths,
pitch periodicity, and segmental timings, are easier to extract.

High-level features are associated with more central locations in the perception
mechanism, such as the perception of words and their meaning, syntax, prosody,
dialect and idiolect. Even so, it is not easy to extract stable and reliable
formant features explicitly from the speech signal.

There are two principal methods of short-term spectral analysis in use: filter-bank
analysis and LPC analysis.
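LPC (linear predictive coding) analysis models each speech sample as a weighted sum of the preceding samples; the weights describe the short-term spectral envelope, and hence the vocal tract filter. A minimal sketch of the autocorrelation method with the Levinson-Durbin recursion, demonstrated on a synthetic second-order signal whose true coefficients are known (the test signal and its coefficients are invented for the example):

```python
import numpy as np

def lpc(x, order):
    """Linear-prediction coefficients via the Levinson-Durbin recursion.

    Returns a with a[0] == 1, such that
    x[n] + a[1]*x[n-1] + ... + a[order]*x[n-order] ~ prediction error.
    """
    # Autocorrelation of the signal up to the model order.
    r = np.array([np.dot(x[:len(x) - k], x[k:]) for k in range(order + 1)])
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err                 # reflection coefficient
        a[1:i] = a[1:i] + k * a[i - 1:0:-1]
        a[i] = k
        err *= (1.0 - k * k)           # remaining prediction error
    return a

# Synthetic test: a second-order autoregressive signal with known
# coefficients 0.9 and -0.5 (assumed values for the demonstration).
rng = np.random.default_rng(1)
e = rng.standard_normal(20000)
x = np.zeros(20000)
for n_ in range(2, len(x)):
    x[n_] = 0.9 * x[n_ - 1] - 0.5 * x[n_ - 2] + e[n_]
a = lpc(x, 2)      # should recover approximately [1, -0.9, 0.5]
```

In speech analysis the same recursion is applied to short windowed frames, and the roots of the resulting polynomial give estimates of the formant frequencies and bandwidths.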

Other measurements that are often carried out correlate with prosody, such as pitch
and energy tracking. A pitch or periodicity measurement is relatively easy, but it
is meaningful only for voiced speech sounds, so it is necessary to have a detector
that can discriminate voiced from unvoiced sounds. This complication often makes it
difficult to obtain reliable pitch tracks over long-duration utterances. Long-term
average spectrum (LTAS) and fundamental frequency measurements have been used in
the past for speaker recognition, but since these measurements provide feature
averages over long durations, they are not capable of resolving detailed individual
differences.
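A pitch measurement with the voiced/unvoiced decision described above can be sketched with a simple autocorrelation detector. The search range, voicing threshold and test signals below are assumptions made for the illustration:

```python
import numpy as np

def pitch(x, fs, fmin=50.0, fmax=400.0, voicing_threshold=0.5):
    """Estimate f0 by autocorrelation; return None for unvoiced frames."""
    x = x - np.mean(x)
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    if r[0] == 0:
        return None
    r = r / r[0]                          # normalize so r[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(r[lo:hi]))   # best candidate period in samples
    if r[lag] < voicing_threshold:        # weak periodicity: call it unvoiced
        return None
    return fs / lag

fs = 8000
t = np.arange(fs // 4) / fs
voiced = np.sin(2 * np.pi * 200 * t)      # periodic, like a sustained vowel
unvoiced = np.random.default_rng(0).standard_normal(len(t))  # noise, like /s/
```

On the periodic signal the detector finds a strong autocorrelation peak at the period and reports about 200 Hz; on the noise it finds no strong peak and reports unvoiced, which is exactly the voiced/unvoiced discrimination the text says a pitch tracker needs.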

Bricker and Pruzansky (1976) recognize three major methods of speaker recognition:
by listening, by machine, and by visual inspection of spectrograms. Speaker
recognition by listening involves the study of how human listeners achieve the task
of associating a particular voice with a particular individual, and indeed to what
extent such a task can be performed.

Nolan (1997) explains that there are two classes of speaker recognition task:
identification and verification.

Speaker verification is a common task in speaker recognition, where "an identity
claim from an individual is accepted or rejected by comparing a sample of his
speech against a stored reference sample by the individual whose identity he is
claiming" (Nolan, 1983). For instance, access by telephone to a bank account or
other privileged information might be controlled by checking the claimed identity
of the caller. This might in principle be done naively by a human recipient of the
phone call who is familiar with the voice of the account holder; however, speaker
verification usually implies techniques by which a computer automatically compares
the voice of the caller to a stored reference sample of the speech of the person
whose identity is being claimed. The speaker is likely to be cooperative, willing
to produce and if necessary repeat a chosen utterance for comparison, and unlikely
to be adopting any voice disguise, although day-to-day variation in the voice will
have to be accommodated.

Speaker verification analysis is divided into spectrographic analysis (spectral and
temporal parameters) and perceptual analysis (listening experiments).

Speaker identification includes the usual forensic situation; in the standard use
of the term, the circumstances are in many ways rather more difficult. Speaker
identification is often thought of as differing from speaker verification in that
the task may be to match an unknown sample to one of a closed set of suspects, so
that all that is needed is to determine which suspect's speech is nearest to the
unknown sample. Speaker identification is the task in which an utterance from an
unknown speaker has to be attributed, or not, to one of a population of known
speakers for whom reference samples are available. Speaker identification methods
are mainly divided into three, as follows:

1. Speaker identification by listening:

Some studies on speaker identification by the listening method were reported early
on. Hecker (1971) reported that speaker recognition by listening appeared to be the
most accurate and reliable method at that time. Stevens (1968) reported that aural
identification of talkers based on utterances of single words or phrases is more
accurate than identification from spectrograms, with an average error rate of 6%
for listening against 21% for visual examination. Schwartz (1968) suggests that
listeners can identify speaker sex from isolated productions of /s/ and /ʃ/, but
not from /f/ and /θ/, and Ingemann (1968) reported that listeners are often able to
identify the sex of a speaker from hearing voiceless fricatives in isolation, sex
being best identified from the fricative /h/.

Lass (1976) reported that speaker sex identification judgments were 96% correct for
the voiced tape, 91% correct for the filtered tape, and 75% correct for the
whispered tape. These findings indicate that laryngeal fundamental frequency
appears to be a more important acoustic cue in speaker sex identification tasks
than the resonance characteristics of the speaker. Results of an experiment
conducted by Reich (1979) suggest that certain vocal disguises markedly interfere
with speaker identification by listening: the reduction in identification
performance caused by vocal disguise ranged from 22.0% (slow rate) to 32.9% (nasal)
for naive listeners, and from 11.3% (hoarse) to 20.3% (nasal) for sophisticated
listeners. Nasal disguise was the most effective.

Speaker identification by listening alone, then, is far from being 100% accurate.
It is an entirely subjective method, and an expert witness using only this method
would be unable to justify his conclusions in a court of law.
2. Speaker identification by machine:

In the years following identification by the aural mode, voice-processing
technology became quite popular. The simplest approach was to generate and examine
amplitude-frequency-time matrices of speech samples; the other approach was to
extract speaker-dependent parameters from the signals and analyze them by machine.
The objective methods can be further classified into the following:

(a) Semi-automatic method

(b) Automatic method

In the semi-automatic method, there is extensive involvement of the examiner with
the computer, whereas in the automatic method this contact is limited. The examiner
selects unknown and known samples (similar phonemes, syllables, words and phrases)
from the speech samples to be compared; the computer processes these samples,
extracts parameters and analyzes them according to a particular program, and the
examiner makes the interpretation.

In the automatic method, the computer does all the work and the participation of
the examiner is minimal. For the purpose of automatic identification, special
algorithms are used which differ based on the phonetic context. This method is used
very often in forensic science, but factors such as noise and distortion of the
voice and other samples need to be controlled; in such cases a combination of
subjective and objective methods should be used (Tosi, 1979). Under speaker
identification, three types of recognition test can be carried out:

- closed tests,
- open tests,
- discrimination tests

In a closed test, it is known that the speaker to be identified is among the
population of reference speakers, whilst in an open test, the speaker to be identified
may or may not be included in that population. Thus, in a closed test, only an
error of false identification may occur.

In an open test, there is the additional possibility of incorrectly eliminating all the
members of the reference population when in reality it included the test speaker.
In a discrimination test, the decision procedure has to ascertain whether or not
two samples of speech are similar enough to have been spoken by the same
speaker; errors of both false identification and false elimination are possible (Nolan,
1983).
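The three test types differ only in which errors are possible; a small sketch (with hypothetical function and label names, not taken from the text) makes the accounting explicit:

```python
# Illustrative scoring of one decision in a speaker recognition test.
# In a closed test the examiner must pick some reference speaker, so the
# only possible error is a false identification. In an open test the
# examiner may also eliminate everyone (decided_speaker = None), which
# adds the possibility of a false elimination.

def score_decision(true_speaker, decided_speaker, test_type):
    """Classify one decision as correct, a false identification,
    or a false elimination (the latter only in open tests)."""
    if decided_speaker == true_speaker:
        return "correct"
    if decided_speaker is None:
        # Eliminating every reference speaker is only possible
        # in an open test.
        assert test_type == "open"
        return "false elimination"
    return "false identification"
```

For example, deciding "B" when the true speaker is "A" is a false identification in either test type, while deciding no one matches when "A" is in fact present is a false elimination, possible only in the open test.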

3. Speaker identification by visual examination of spectrograms:

Stevens (1968) compared aural with visual examination of spectrograms
using a set of eight talkers, and a series of identification tests was carried out. The
average error rate for listening was 6% and for visual examination 21%. They
also observed that the mean error rate decreased from approximately 33.0% to 18.0% as
the duration of the speech sample increased from monosyllabic words to phrases
and sentences, and concluded that for visual identification, longer
utterances increase the probability of correct identification.

Bolt (1970) reported that speech spectrograms, when used for voice
identification, are not analogous to fingerprints, primarily because of fundamental
differences in the source of the patterns and in their interpretation. Studies
assessing the reliability of voice identification under practical conditions, whether
by experts or by explicit procedures, had not yet been made, and requirements for
such studies were outlined. Hecker (1971) reported that speaker recognition by
visual comparison of spectrograms was coming into use in criminology, but that
the validity of this method was still in question.

Reich (1976) reported that the examiners were able to match speakers with a
moderate degree of accuracy (56.67%) when there was no attempt to vocally
disguise either utterance. In spectrographic speaker identification nasal and slow
rate were the least effective disguises, while free disguise was the most effective.

A survey of 2,000 voice identification comparisons made by Federal Bureau
of Investigation (FBI) examiners (Koenig, 1986) was used to determine the
observed error rate of the spectrographic voice identification technique under
actual forensic conditions. The survey revealed that decisions were made in
34.8% of the comparisons, with a 0.31% false identification error rate and a 0.53%
false elimination error rate. These error rates are expected to represent the
minimum error rates under actual forensic conditions. Bolt (1972) concluded that
the scientific information available at that time was not adequate to provide valid
estimates of the degree of reliability of voice identification by examination of
spectrograms.

Hence, some scientists supported speaker identification by visual examination of
spectrograms while others did not support this technique.

In 1944, Gray and Kopp suggested using the spectrogram as a method of
speaker recognition. They coined the term "voiceprinting" for spectrograms used
for speaker identification purposes. Spectrum analysis has been used for speaker
identification (Tosi, 1979). Two types of spectra of the speech signal can be obtained.

The short-term spectrum

This represents the spectrum of a very short duration (about 25 ms) of the speech
signal; a one- or two-minute sample will generate a succession of short-term
spectra. The spectrogram is a simple example: a graphic display of successive
short-term spectra obtained with an acoustic spectrograph. The graph shows
time, frequency and intensity within each band of frequencies, portraying the
short-term spectra of the flow of speech sounds in ongoing speech. Each phoneme
presents a characteristic pattern on the spectrogram, which represents both
phonetic and talker-feature information.
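The 25 ms windowing described above can be sketched in a few lines; this is an illustrative computation (the function name, hop size and window choice are my own assumptions, not from the text):

```python
# Illustrative sketch: computing the succession of short-term spectra
# (the rows of a spectrogram) with ~25 ms analysis windows.
import numpy as np

def short_term_spectra(signal, sample_rate, win_ms=25, hop_ms=10):
    """Return a (frames x bins) magnitude matrix: each row is the
    spectrum of one short (win_ms) slice of the signal."""
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(win)          # taper each slice to reduce leakage
    frames = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# A one-second 440 Hz tone sampled at 8 kHz as a toy input
sr = 8000
t = np.arange(sr) / sr
spec = short_term_spectra(np.sin(2 * np.pi * 440 * t), sr)
```

With a 25 ms window at 8 kHz, each row has a frequency resolution of 40 Hz per bin, so the tone's energy concentrates in bin 11 (440 Hz) of every frame.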

The long-term spectrum

This is an average spectrum of speech signals of more than two minutes. For
continuous speech production, speakers make use of two sources, the glottal
and the fricative source, which can be used successively or simultaneously
during ongoing speech. The flow of the variable acoustic signal is produced by
the movement of the various articulators in combination with these two sources.
The resonance system gives a particular envelope to the sound output; this
envelope includes the phonetic variations and the individual characteristics of each
talker.
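A long-term average spectrum can be sketched as the mean of many short-term spectra, which smooths out phonetic detail and leaves the talker's overall spectral envelope (illustrative code; the window and hop values are assumptions, not from the text):

```python
# Illustrative sketch: a long-term average spectrum (LTAS) is the mean
# of the short-term spectra computed over the whole sample.
import numpy as np

def long_term_average_spectrum(signal, sample_rate, win_ms=25, hop_ms=10):
    win = int(sample_rate * win_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    window = np.hamming(win)
    spectra = [np.abs(np.fft.rfft(signal[s:s + win] * window))
               for s in range(0, len(signal) - win + 1, hop)]
    return np.mean(spectra, axis=0)   # average over all frames

# Toy input: a one-second 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
ltas = long_term_average_spectrum(np.sin(2 * np.pi * 440 * t), sr)
```

For real speech the averaging would run over minutes of signal, as the text describes; the tone here only demonstrates the mechanics.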

Speaker recognition comprises two sub-fields:

- Naïve speaker recognition
- Technical speaker recognition

Naïve speaker recognition:

This is the recognition of speakers by their voices using "normal everyday
abilities", performed by untrained observers in real-life conditions (Nolan,
1983), whether in the course of normal everyday life, for instance while
answering a telephone or hearing a voice in the next room, or in the more
dramatic circumstances of a crime.

Technical speaker recognition:

Usually called "speaker identification by expert", this probably first brings
to mind the use of machines, and indeed much work on the comparison of
recordings to establish identity does involve acoustic analysis by machine.
Technical speaker recognition comprises auditory forensic analysis and
computerized analysis, where acoustic forensic analysis and automatic
speaker recognition are parts of computerized analysis. Auditory forensic
analysis is predominantly concerned with comparing samples linguistically,
especially with respect to the phonetic quality and voice quality assumed to
underlie the speech. An auditory phonetic analysis provides a summary of the
similarities and differences between the samples of the sound system used.

- Acoustic forensic analysis:

There is a great amount of human involvement: to decide whether samples are
of good enough quality for analysis, to select comparable parts of the speech
samples for computerized acoustic analysis, and to evaluate the results that the
computer provides.

- Automatic speaker recognition:

A machine is used to recognize a person from a spoken phrase. It includes
verification and identification.

Pamela (2002) investigated the percentage of similarity between the acoustic
parameters of two speech samples in the Hindi language and reported that two
samples can be considered to be from the same speaker when not more than 60%
of the measures differ.

Ranganathan (2003) studied undisguised and disguised speech in the Tamil language
for statements and commands, and reported no significant difference between
disguised and undisguised speech for either condition. He also reported that the
best speech sounds for forensic evaluation were /ε/, /n/, /l/, /i/, /a/ and /p/.

Rose (2002) reported that one major difference between automatic speaker
verification/identification and forensic speaker identification is that in
verification and identification the set of speakers that constitutes the reference
sample is known, and therefore the acoustic properties of their speech are
known. In forensic speaker identification, the reference set is not known, and
consequently the acoustic properties of its speakers can only be estimated
(Broeders, 1995). Another difference is in the degree of control that can be
exercised over the samples to be compared. A high degree of control means a
high degree of comparability, which is conducive to efficient recognition (in the
case of speaker identification and verification).

In forensic speaker identification, the most important speaker-based problem is
voice disguise. Hecker (1971) stated that vocal characteristics which have their
origin in the tone generated by the larynx (including pitch, intensity, and
phonemic voicing patterns) are considered to make an important contribution to
the identifiability of a speaker. Wolf (1972) suggested that the fundamental
frequency is the easiest acoustic property to modify for purposes of disguising
the voice, and that it is important to know how much speaker identity is retained
when normal inter-subject differences in the laryngeal fundamental are
eliminated. Abberton (1976), presenting real and synthesized laryngographic
signals to listeners, found that the most important cue to speaker identity was
F0. However, certain results by Miller (1964) would tend to indicate that
articulatory characteristics, rather than these glottal source characteristics,
contribute more to speaker identification. Coleman (1973) in fact showed that
sufficient individuality exists in speech characteristics other than those
associated with the glottal sound source to support speaker-pair discrimination
with slightly better than 90% accuracy. This result could be interpreted as
indicating that the maximum reduction in speaker identifiability that might be
expected from attempts to disguise the voice by modifying the laryngeal tone
would be less than 10%; he also suggested that females might be better at
disguising their voices than males.

Itoh (1992) found that the vocal tract spectral envelope had the largest effect
on speaker identification.

The results of Lavner, Gath and Rosenhouse (2000) suggested that, on average,
the contribution of the vocal tract features of the vowel /a/ as cues to
familiar-speaker identification is more important than that of the glottal
source features, and that for each speaker a different group of acoustic
features serves as the cue to vocal identity.

The speech signal carries information from many sources, but not all of it is
relevant or important for speaker recognition. In speaker recognition, the first
crucial step is feature extraction, in which the speech signal of a given frame is
converted to a set of acoustic features, with the hope that these features
encapsulate the information necessary for recognition. Once these features are
computed, a back-end classifier is used to classify the input speech signal in
light of the extracted features and pre-trained models.
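The two-stage pipeline described here (a front end that extracts features per frame, then a back-end classifier matched against pre-trained models) can be sketched with deliberately simple features; the feature choice and the nearest-model matching below are illustrative assumptions, not the method of any study cited:

```python
# Hypothetical sketch of the feature-extraction + back-end pipeline.
# Front end: each frame becomes a tiny feature vector (log energy and
# zero-crossing rate). Back end: pick the stored model (a mean feature
# vector per speaker) nearest to the input's mean features.
import numpy as np

def extract_features(frames):
    """Per-frame features: [log energy, zero-crossing rate]."""
    feats = []
    for f in frames:
        energy = np.log(np.sum(f ** 2) + 1e-10)
        zcr = np.mean(np.abs(np.diff(np.sign(f)))) / 2
        feats.append([energy, zcr])
    return np.array(feats)

def recognize(frames, models):
    """Back end: return the model name nearest the mean feature vector."""
    mean_feat = extract_features(frames).mean(axis=0)
    return min(models, key=lambda name: np.linalg.norm(mean_feat - models[name]))

# Toy demo: two stored "models" and a loud, smooth input signal
frames = [np.ones(100) * 0.5, np.ones(100) * 0.5]
models = {"loud_smooth": np.array([np.log(25.0), 0.0]),
          "quiet": np.array([-23.0, 0.0])}
result = recognize(frames, models)
```

Real systems use richer features (e.g. cepstral coefficients) and statistical models rather than single mean vectors; the sketch only shows how the front end and back end divide the work.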

Reich (1981) reported that people could usually tell when a speaker is
attempting voice disguise. Indeed, it is very difficult to consistently disguise
one's voice over long periods of time. If the sample is short, the problem of
detecting disguise is severe; if the sample is long, there may be ways to
identify which parts are normal and which are not.

VOICE IDENTIFICATION
There are subjective and objective methods of voice identification. The
subjective procedures are based on either aural or visual comparisons of signals,
while in objective procedures a computer usually compares the visual
representation of an audio signal from one or more speakers. There are two
methods of examination:

 Aural
 Spectrographic

In aural examination of a recorded voice, a listener may use long-term memory
(LTM) or short-term memory (STM) processes to identify or eliminate an
unknown talker as being the same as a particular known one. The success of
aural recognition depends on factors such as the listener's familiarity with the
speaker, the homogeneity of the talkers involved, and the discriminating ability
of the listener.

McGehee (1937) studied memory decay for voices and found that correct
identifications declined over time. She also reported that male auditors could
be expected to perform better than female auditors. Bull and Clifford (1984),
in contrast, reported that females performed better than males in a speaker
identification task.

Another important and well-documented fact is that some voices are identified
better than others (Papcun, Kreiman and Davis, 1989; Rose and Duncan, 1995),
and it can therefore be assumed that some voices carry more individual
identifying content than others. It may also be the case that a voice is badly
identified because it has a wide range of variation that takes it into the ranges of
other voices (Rose and Duncan, 1995).

Spectrographic analysis becomes necessary unless the phonetician's perception
is totally free from the habits and biases ingrained by experience of his native
language and accent; with spectrograms it becomes easier to compare the
questioned speech sample with a suspect's sample. In a spectrographic study of
disguise, Endres, Bambach and Flösser (1971) found that individual formants
were shifted to higher or lower frequencies with respect to the normal voice;
only the first formant remained relatively stable.

Kuwabara and Takagi (1991) investigated the effects of modifying acoustic
features on speaker identification scores. The results showed that shifting the
formant frequencies affected speaker identification significantly, while formant
bandwidth and pitch modifications caused relatively smaller effects.

Studies of speaker identification by observation of speech spectrograms by
Bolt, Cooper et al. (1973) and Tosi et al. (1971, 1972) involved the effects of
five variables:

1) The number of speakers in the known set.

2) Open vs. closed tests: in an open test the observer was not told whether the
unknown speaker was represented in the known set, whereas in a closed test
he was told that the unknown speaker was represented in the set.

3) The context of the speech materials: whether the test words were spoken
in isolation or in sentences.

4) Certain characteristics of the speech transmission system.

5) Contemporary vs. non-contemporary voice samples: the voice samples to
be compared were recorded either on the same occasion (contemporary)
or on different occasions (non-contemporary).

Three main types of effect arise with telephone speech:

The physical setting in which a telephone call is made, generally referred to as
environmental effects, e.g. a context with a high degree of background noise
such as traffic. Recordings of telephone interactions will of course include
sounds resulting from any background activity, which may pose problems for
forensic analysis of speech since the background noise may obscure some of
the crucial information in the speech signal (Hirson and Howard, 1994).

Speaker effects, arising from speakers modifying their behavior as a result of
speaking into a telephone. Behavioral differences may be conscious or
subconscious depending on the individual and/or the interactional setting. It has
been shown experimentally that many individuals speak more loudly when
using a telephone, probably as a subconscious reflex to overcome
environmental factors such as background noise (Summers et al., 1988).

Technical effects of telephone recordings, referring to those features that result
from the act of transmitting speech through a handset and telephone line.
Energy below 300 Hz and above 3,400 Hz is either attenuated or removed
altogether (Kunzel, 2001; Rose, 2003). The loss of high-frequency components
is particularly destructive for forensic speaker identification, since a great deal
of potentially useful speaker-specific information (particularly voice quality
information) is encoded in the higher vowel formants (Nolan, 1983; Kunzel, 1995).
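The band limits above can be illustrated with a simple sketch; an idealized brick-wall filter is assumed here, whereas real telephone channels roll off gradually:

```python
# Illustrative sketch: approximating the telephone channel as an ideal
# 300-3,400 Hz band-pass filter via the FFT. Energy outside the band
# (e.g. the fundamental frequency region below 300 Hz) is removed.
import numpy as np

def telephone_bandpass(signal, sample_rate, low=300.0, high=3400.0):
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[(freqs < low) | (freqs > high)] = 0  # zero out-of-band energy
    return np.fft.irfft(spectrum, n=len(signal))

sr = 8000
t = np.arange(sr) / sr
voiced = np.sin(2 * np.pi * 120 * t)      # F0 region: removed by the channel
formant = np.sin(2 * np.pi * 1000 * t)    # within the band: passed through
out = telephone_bandpass(voiced + formant, sr)
```

After filtering, only the in-band 1,000 Hz component survives, which mirrors the text's point that low-frequency (and high-frequency) speaker information is lost in telephone transmission.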

Speaker identification is not an easy task because speaking through a telephone
alters some spectral and temporal parameters (Hirson and French, 1992). A study
done in India by Saravanan (1998) suggests that there is no significant difference
between normal speech and speech through a telephone on word duration, vowel
duration, burst duration, VOT, closure duration, frication duration and speed of
formant transition. However, spectral parameters like F0, intensity and formant
frequencies were significantly affected.

Sreedevi, Pooja and Mahima (2008) compared spectral and temporal
parameters between original and disguised voice using mobile phones in seven
normal native Hindi-speaking adult males. The acoustic analysis included spectral
parameters such as the first three formant frequencies of the vowels /a/, /i/ and
/u/; the temporal parameters included word duration, nasal murmur, frication
duration, burst duration, voice onset time and closure duration. The results
showed lowered formant frequencies in disguise, while no significant difference
was seen for most of the temporal parameters (except nasal murmur) across the
two conditions.

Kunzel (1995) indicated that in German a sample of 30 seconds was necessary
to attempt any type of speaker identification.

Bhuvaneswari (2005) reported that the maximum number of syllables required
for correct speaker identification in Kannada varied from 30.96 to 36.89
syllables, with an average of 18.47 syllables at an accuracy of 85.71%. She
concluded that in forensic practice, if speaker samples of the length of 37
syllables are available, then speaker identification could approach 95%
accuracy.

Neha M. (2008) studied the mean length of utterance used for speaker
identification (SPID) in a high-pitch disguise condition. The results showed no
significant difference between the normal and high-pitch disguise conditions in
the accuracy of identification, a significant difference between conditions in the
number of syllables required to correctly identify speakers, and that a minimum
of 5-6 syllables is required for SPID in the high-pitch disguise condition.

Karthikeyan (2008) studied the mean length of utterance used for SPID in a
hoarse-voice disguise condition. The results showed no significant difference
between the normal and hoarse-voice disguise conditions in the number of
syllables required to correctly identify speakers, with an average of 8.73
syllables required.

SPEECH RECOGNITION:
A speech recognizer is a device that converts an acoustic signal into another
form, such as writing, to be stored or used in some way: it accepts acoustic
signals as input and produces sequences of words as output. Observers
performing a recognition task (i.e. matching the spectrograms that represent the
same speaker) are instructed to examine features such as:

 Mean frequencies of vowel formants
 Formant bandwidths
 Gaps and types of vertical striations
 Slopes of formants
 Durations
 Characteristic features of fricatives and inter-formant energies
The process of speech recognition includes the capture of the speech utterance
(data acquisition), the analysis of the raw speech into a suitable set of
parameters (feature extraction), the comparison of these features against some
previously stored templates (reference features), and a decision-making process.
There is a large spectrum of possible recognition capabilities, including:

 Isolated-word recognizers for segments separated by pauses,
 Word-spotting algorithms that detect the occurrence of key words in
continuous speech,
 Connected-word recognizers that identify uninterrupted, but strictly
formatted, sequences of words (e.g. recognition of telephone numbers),
 Restricted speech-understanding systems that handle sentences relevant
to a specific task, and
 Task-independent continuous speech recognizers, which are the ultimate
goal in this field.
Early systems performed spectral analysis of the speech signal with a bank of
band-pass filters. The output of the filter bank was cross-correlated with stored
information and the best match selected. Such systems performed well when the
task was simple and a single speaker was involved.
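A toy sketch of that filter-bank-and-template scheme; the function names, band layout and similarity measure are illustrative assumptions, not a description of any historical system:

```python
# Hypothetical sketch of early template matching: a crude "filter bank"
# summarizes the spectrum as per-band energies, and the stored template
# whose energy profile correlates best with the input wins.
import numpy as np

def filterbank_energies(signal, sample_rate, n_bands=8):
    spectrum = np.abs(np.fft.rfft(signal))
    bands = np.array_split(spectrum, n_bands)   # equal-width frequency bands
    return np.array([np.sum(b ** 2) for b in bands])

def best_match(signal, sample_rate, templates):
    """Return the name of the stored template closest to the input."""
    e = filterbank_energies(signal, sample_rate)
    e = e / np.linalg.norm(e)                   # normalize away level
    return max(templates,
               key=lambda name: np.dot(e, templates[name] /
                                       np.linalg.norm(templates[name])))

# Toy demo: a 3,000 Hz tone should match the "high"-band template
sr = 8000
sig = np.sin(2 * np.pi * 3000 * np.arange(sr) / sr)
templates = {"low": np.eye(8)[0], "high": np.eye(8)[5]}
result = best_match(sig, sr, templates)
```

The limitation the text goes on to describe follows directly from this design: the templates are tied to one speaker's spectra and one task, so the scheme degrades when either varies.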

They were less successful when the same techniques were applied to make them
independent of task, vocabulary and speaker. This led researchers to look into
the basic problem of speech organization: the paradox that speech consists of a
continuous stream of sounds with no obvious breaks at the boundaries between
different sounds, and yet is perceived as a set of discrete symbols.

The problem is whether the segmentation of speech into linguistic units such as
words or sub-word units and their identification need to be looked upon as two
distinct processes or as a single integrated process.

There are four general viewpoints on speech recognition:

1) The acoustic signal viewpoint: since the speech signal is a waveform (or
vector of numbers), we can simply apply general signal analysis techniques
(e.g. Fourier frequency spectrum analysis, principal component analysis,
statistical decision procedures, and other mathematical schemes) to
establish the identity of the input.
2) The speech production viewpoint: we understand the communicative
'source' of the speech signal, and capture essential aspects of the way in
which speech was produced by the human vocal system (e.g. rate of
vibration, place and manner of articulation, coarticulatory movements, etc.).
3) The sensory reception viewpoint: suggests duplicating the human auditory
reception process by extracting parameters and classifying patterns as is
done in the ear, auditory nerves, and sensory feature detectors of the ear-
brain system.
4) The speech perception viewpoint: suggests we extract features and make
categorical distinctions that are experimentally established as being important
to human perception of speech (e.g. VOT, formant transitions, etc.).
Testing the performance of different algorithms and systems has become a
major issue in the development of ASR systems (Pallet, 1995), since systems
are costly and diverge widely in quality and task applicability.

 Progress has been achieved in developing speech recognition by
attempting to solve the above problems.
 It has led to the recognition approach and the knowledge-based
(cognitive) approach.
 It has also resulted in two distinct classes of automatic speech
recognition (ASR) systems, namely
o Isolated word recognition (IWR) and
o Continuous speech recognition (CSR).
 Another important classification of ASR systems is into
o Speech-to-text systems and
o Speech understanding systems
Applications:

o Personal telephone book with voice access
o Voice-activated domestic appliance systems (turn the TV/radio on and off,
change channels, etc.)
o Voice-controlled computers as the basis for solutions to a wide range of
problems encountered by the disabled
o Aircraft
o Military services
o Space

VOICE PRINT IDENTIFICATION

Voiceprint identification is a combination of aural (listening) and spectrographic
(instrumental) comparison of one or more known voices with an unknown voice for the
purpose of identification. It was developed by Bell Laboratories in the late 1940s for
military intelligence purposes.
There are two general factors involved in the process of human speech.
- The first factor in determining voice uniqueness lies in the sizes of the vocal
cavities, such as the throat, nasal and oral cavities, and in the shapes, length and
tension of the individual's vocal folds located in the larynx. The vocal cavities are
resonators, much like organ pipes, which reinforce some of the overtones
produced by the vocal cords and produce the formants, or voiceprint bars. The
likelihood that two people would have all their vocal cavities of the same size and
configuration, identically coupled, appears very remote.

- The second factor in determining voice uniqueness lies in the manner in which
the articulators, or speech muscles, are manipulated during speech. The articulators
include the lips, teeth, tongue, soft palate and jaw muscles, whose controlled
interplay produces intelligible speech. Intelligible speech is developed by the
random learning process of imitating others who are communicating. The
likelihood that two people could develop identical use patterns of their articulators
also appears very remote.
To facilitate the visual comparison of voices, a sound spectrograph is used to analyze
the complex speech waveform into a pictorial display referred to as a spectrogram.
The resonance of the speaker's voice is displayed in the form of vertical signal
impressions or markings for consonant sounds, and horizontal bars or formants for
vowel sounds. The spectrograms serve as a permanent record of the words spoken and
facilitate visual comparison of like words spoken by the unknown and known
speakers. The investigator should attempt to select a reasonably quiet environment
for controlled activities such as drug or other illegal operations being investigated. This
may require the recording of telephone conversations or face-to-face encounters under a
variety of acoustic conditions, in which someone is wearing a body recorder or
transmitting the conversation via radio frequency to a remote location; unfortunately,
in many cases the investigator cannot control the acoustic environment. Speech samples
obtained should contain exactly the same words and phrases as those in the questioned
sample, because only like speech sounds are used for comparison. If the caller's voice
was disguised, the suspect should give a normal sample and a disguised one matching
the questioned call. Recorded evidence submitted by mail should be wrapped in tinfoil
to protect it from possible contact with a magnetic field. Both an aural (listening)
and a visual examination and comparison are conducted. Visual comparison of
spectrograms involves, in general, the examination of spectrographic features of like
sounds as portrayed in spectrograms in terms of time, frequency and amplitude.
The following are the basic principles of spectrogram reading:

1. Identify the beginning and end of a sentence
2. Identify the beginning and end of a word
3. Identify the probable beginning and end of a phoneme
4. Identify the place of articulation
5. Identify the manner of articulation
6. Identify voicing
7. Identify frication
8. Identify aspiration/affrication
9. Identify the phoneme
10. Join phonemes to make a word in a language
Identify the beginning and end of a sentence: Sentences are usually separated by
pauses, so the beginning and end of a sentence can be identified from the pauses.
Identify the beginning and end of a word: It is difficult to find the beginning and end
of a word, although one can segment words when reading. A vowel in word-final
position is lengthened and has low intensity, so identifying the beginning and end of a
word starts from the low-intensity, lengthened vowel in word-final position.
Identify the probable beginning and end of a phoneme: If the phoneme of concern is a
vowel, semivowel, fricative, diphthong or nasal continuant, its beginning and end are
marked by the onset and offset of the resonance bars. For a voiced plosive, the
beginning is marked by the onset of the voice bars and the end by the beginning of the
resonance bars of the following vowel, with a burst if the plosive is released. An
unvoiced plosive is marked by silence followed by a burst, and an unvoiced fricative
by the onset and offset of frication.
Identify the manner of articulation: A vowel is characterized by resonance bars, as
vowels are open sounds. In a semivowel the articulatory movement from one vowel
position to another is fast, whereas in a diphthong it is slow. A fricative shows energy
at high frequencies; nasal continuants show damped resonance; laterals show high F2
and F3. For trills, burst-like striations are visible on the spectrogram, and for a tap, a
silence varying from 10 ms to 40 ms is seen. A plosive can be identified by a silence
followed by a burst, or by voice bars followed by a burst. Affricates are characterized
by silence (or voice bars), a burst and affrication.

Identify the place of articulation: Bilabial bursts tend to have a low-frequency
dominance, between 500 and 1,500 Hz. Palatals and velars are characterized by a
mid-frequency burst, between 1,500 and 4,000 Hz. Alveolars are associated with
high-frequency energy, above 4,000 Hz.

Release burst spectrum:

Bilabials – low frequency – 500-1,500 Hz
Alveolars – high frequency – above 4 kHz
Velars – mid frequency – 1.5-4 kHz
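These rule-of-thumb bands can be written as a minimal lookup; the band edges come from the text, while the function itself and its treatment of boundary values are my own illustrative choices (real burst classification requires spectral analysis, not a single peak frequency):

```python
# Minimal sketch of the burst-frequency rule of thumb above:
# map the dominant burst frequency (Hz) to a likely place of articulation.
def place_from_burst_peak(peak_hz):
    if 500 <= peak_hz <= 1500:
        return "bilabial"        # low-frequency dominance
    if 1500 < peak_hz <= 4000:
        return "velar/palatal"   # mid-frequency burst
    if peak_hz > 4000:
        return "alveolar"        # high-frequency energy
    return "indeterminate"       # below the bilabial range
```

For example, a burst peaking near 900 Hz suggests a bilabial, near 2,500 Hz a velar or palatal, and above 4 kHz an alveolar.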

FACTORS AFFECTING SPEAKER IDENTIFICATION

- Recording condition of the clue words:
o Samples recorded directly into recorders
o Samples recorded via telephone or in a noisy environment
- The number of known speakers included in each experimental trial
- Intra-speaker variations
- Awareness of the examiner

TRUTH VERIFICATION
Truster Pro is an innovative, highly advanced computerized system specially
designed to provide easy access to truth verification in a highly professional and
discreet manner. Conversations can be analysed in real time or off-line to provide
timely and reliable data for making the right decisions. It is based on the technology
of vocal stress analysis, calculated from a series of complex, sophisticated algorithms
that detect states of stress and then measure and grade them accordingly. Truster Pro
technology pinpoints the cause of stress and reports back with a determination as to
whether a speaker's stress is caused by a lie, by excitement, by exaggeration or by
cognitive conflict.
A lie detector is a tool designed for the purpose of determining one's level of
truthfulness. The basic idea in all of the existing machines today is to monitor
involuntary body reactions to determine and analyse the subject's state of fear, stress
and arousal. Other types of lie detectors use vocal stress to determine the level of
honesty by measuring the stress in the voice. These lie detectors use a technology
known as microtremor stress detection and analyse the stress indicative of deception.
Because there are many types of lies, there is no set voice pattern or frequency for
deceptive speech; however, there is a uniform appearance for truthful situations,
where the mainstream thought process is fluent and uninterrupted. The Truster Pro
system is a combination of three different vocal lie-detection modes:
THE ON-LINE MODE
1. This allows you to conduct interrogations where analysis is required in real time
and lets you focus on suspected portions and ask additional questions, if necessary.
2. A telephone conversation or normal discussion can be analysed in this mode.
THE INTERROGATION MODE
This is equivalent to the traditional polygraph system, providing quick,
computerized summaries and reports using all the familiar polygraph techniques of
interrogation.
THE OFF-LINE MODE
This mode analyses pre-recorded material to produce an in-depth psychological
structure view. Truster Pro's technology uses these psychological patterns to
distinguish between stress resulting from excitement or other emotion, confusion or
other cognitive stress, global stress resulting from the circumstances, and deceptive
stress.

Some courts will permit a witness to identify a speaker only if they can satisfy the
presiding jurist that they really know that person. This approach is a reasonable one,
as it is consistent with the relevant research: if a witness has been in close contact
with a speaker for a long period of time, he or she probably can recognize the
speaker's voice, and be fairly accurate in doing so. Moreover, many courts also
permit a qualified specialist to render an opinion after comparing a sample of the
unknown talker's speech to an appropriate recording; here the professional conducts
an examination and then decides whether two talkers are involved or only one. But
before either of these approaches is described, a third type of aural-perceptual
speaker identification should be considered: the earwitness lineup, or voice parade.

EAR WITNESS IDENTIFICATION

An earwitness lineup is a procedure in which a witness who has heard, but not seen, a
suspect attempts to pick his or her voice from a field of voices. The procedure must be
conducted in a manner that is scrupulously fair to both the earwitness and the suspect,
and accurate records must be kept of all phases of the procedure.
Validity: the witness must demonstrate that he or she has adequate hearing and
attended to the speaker's voice at a level that would permit identification. The witness
should be told that the suspect may or may not be present in the lineup and that only
one suspect will be evaluated at a time.
PROCEDURE
All samples should be recorded with as good acoustic fidelity as possible, and they should
be presented in an identical manner: the same or similar speech should be used, samples
should be of equal length, the ambient background should be parallel for all samples, and so
on. Utterances should be of neutral material. Between five and eight foils, or distractor
speakers, should be used. Each of them should be similar to the suspect with
respect to age and dialect, plus social, economic, and educational status. The suspect's
voice can be described to the foil talkers, but they must not have heard it. Once a tape-
recorded lineup has been developed, a series of two to three mock trials should be carried out
with four to six dispassionate individuals as listeners. If they consistently identify the suspect
as the target, the lineup tape must be restructured.
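The mock-trial check can be made quantitative. Under a genuinely fair lineup, each dispassionate listener should pick the suspect only at chance (one over the number of voices), so the probability that several listeners converge on the suspect by chance alone is a binomial tail. A minimal sketch in Python (the counts are hypothetical, not from the text):

```python
from math import comb

def chance_pick_prob(hits, listeners, voices):
    """Probability that at least `hits` of `listeners` pick one particular
    voice from a `voices`-sample lineup purely by chance."""
    p = 1.0 / voices
    return sum(comb(listeners, k) * p**k * (1 - p)**(listeners - k)
               for k in range(hits, listeners + 1))

# If 4 of 6 mock listeners pick the suspect from a 7-voice lineup
# (1 suspect + 6 foils), chance alone is a very unlikely explanation
# (probability well under 1%), so the lineup tape should be restructured.
print(chance_pick_prob(4, 6, 7))
```

A low tail probability here signals that something other than chance (i.e. a biased lineup) is driving the mock listeners' choices.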
PRESENTATION
The witness should then listen to the tape and be asked to identify one of the samples as
the person they originally heard. However, they should be reassured that it is not
necessary to make a selection. Two appropriate approaches are available for presentation
purposes.
The first is referred to as the serial approach. Here the suspect's speech sample is
embedded among several other samples. The samples are played in sequence and the
witness makes a judgment. The entire tape may be replayed as structured, or with the
order of the individual samples rearranged. This procedure may be applied either to
assist the witness in making a decision or to establish the reliability of the witness's
judgments.
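The mechanics of serial presentation can be sketched as below. Recording a seed lets the examiner document and reproduce the exact order used, or re-run with a different seed to produce the rearranged replay; the file names are hypothetical:

```python
import random

def build_serial_lineup(suspect_sample, foil_samples, seed):
    """Order the suspect's sample randomly among the foil samples.
    The seed is recorded so the presentation order can be documented
    and reproduced exactly, or rearranged by choosing a new seed."""
    rng = random.Random(seed)
    order = [suspect_sample] + list(foil_samples)
    rng.shuffle(order)
    return order

# Hypothetical labels: one suspect plus six foils (within the 5-8 foil range)
foils = [f"foil_{i}.wav" for i in range(1, 7)]
first_pass = build_serial_lineup("suspect.wav", foils, seed=1)
replay = build_serial_lineup("suspect.wav", foils, seed=2)  # rearranged order
print(first_pass)
print(replay)
```

The same samples appear in both passes; only the order differs, which is exactly what a rearranged replay requires.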
The second of the two procedures has been named the sequential approach. In this case
two small rooms are used, the first containing the witness (A) and a video camera
(E). The second contains a TV monitor (F) and observers (G). Each of the samples is
recorded on its own tape (C), and these tapes are provided to the witness one at a time
in any order. Once all of them have been played, the witness may request and replay any
or all of them, in any order desired.

Forensic significance:
Vocal cord activity: parameters associated with vocal cord activity,
pitch & phonation type, are considered very important in forensic
phonetics because they can be used extra-linguistically, as a
characteristic of the speaker.

Nasals & nasalization: nasal consonants like [m] or [ŋ] are
forensically relevant because their characteristics are assumed to be
among the strongest automatic speaker identification parameters. This
is because the relative rigidity of the nasal cavity ensures a low within-
speaker variation in the acoustic features associated with the cavity's
acoustic resonance, & its internal structure & dimensions are
complicated enough to contribute to relatively high between-speaker
variation.

Suprasegmentals: tone, intonation, pitch accent & stress accent are
forensically important because they are primarily signaled by pitch. The
acoustic correlate of pitch, called fundamental frequency (F0), is one of
the most robust acoustical parameters; it is transmitted undistorted
under most common adverse conditions, e.g. telephone
conversations or high ambient noise.
Phonemic structure: it is crucial for the description & comparison of
forensic speech samples because (a) forensic speech samples are samples of
language, so it makes sense to use an appropriate model for comparing
speech sounds in two or more samples on the linguistic level; (b)
phonemic information is logically necessary to establish
comparability between sounds in different samples; & (c) there is both
within- & between-speaker variation in phonemic structure, & the
phonemic conceptual framework provides a principled & precise way
of talking about it when it occurs in forensic speech samples.

Vocal tract length & formant frequencies: although speakers certainly
differ with respect to resting supralaryngeal vocal tract length, the
range of differences in the population is not great. This results in
low ratios of between- to within-speaker variation.
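The link between vocal tract length & formant frequencies is often illustrated with the idealized uniform-tube (quarter-wavelength) model, in which the nth resonance of a tube of length L closed at one end is Fn = (2n-1)c/4L. This is a textbook simplification, not a method claimed by this text; inverting it gives a rough length estimate from a measured formant:

```python
SPEED_OF_SOUND = 35000.0  # cm/s in warm, moist air (an assumed round figure)

def tube_length_from_formant(formant_hz, n):
    """Invert the uniform-tube model Fn = (2n-1)*c / (4*L) to estimate
    vocal tract length L (in cm) from the nth formant frequency."""
    return (2 * n - 1) * SPEED_OF_SOUND / (4.0 * formant_hz)

# A mean F3 of 2500 Hz corresponds to roughly the 17.5 cm often cited
# for an average adult male vocal tract.
print(tube_length_from_formant(2500.0, 3))  # 17.5
```

Because population vocal tract lengths cluster within a few centimetres of one another, estimates of this kind also cluster, which is one way to see why the between-speaker variation in the text is described as small.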

Differential effects of phonological environment: phonological
environment (part of the linguistic knowledge) is a source of variation
in vowel plots; it must be controlled for when comparing different forensic
samples.

High frequency formants: there is a general assumption that the
formants in the higher frequency region (F3 & above) reflect
individual characteristics more than those in the lower frequencies.
Thus Stevens (1971) mentions that mean F3 is a good indicator of a
speaker's vocal tract length, & Ladefoged (1993) mentions that F4 & F5 are
indicative of a speaker's voice quality. F-ratios (the ratio of between-speaker
to within-speaker variation) tend to be bigger for the higher formants
(Rose 1999), & consequently speaker recognition is more successful
when they are used. Mokhtari & Clermont (1996) showed that 'the
high spectral regions of vowel sounds contain more speaker-specific
information'.
Vowel acoustics: the acoustics of individual vowels in normal speech
(e.g. the /a/ & /e/ of hello) can be described in detail with respect to
the linguistic & articulatory events that they encode, & such
descriptions give an idea of the great number of potential acoustic features
available for the forensic comparison of speech samples.

Fundamental frequency: it is important forensically because it is robust,
it can be extracted with relative ease even from poor quality
recordings, & it is readily available: since it has to do with voicing,
& the majority of speech sounds are voiced, it is present in most of a sample.
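One common way to extract F0 is to find the lag at which a voiced frame best correlates with itself. The sketch below applies plain autocorrelation to a synthetic "voiced" frame; real pitch extractors are considerably more elaborate, & the sample rate and pitch range here are assumptions:

```python
import math

SR = 8000  # Hz; telephone-band sample rate (assumed)

def estimate_f0(frame, sr=SR, f0_min=60.0, f0_max=400.0):
    """Return the F0 whose period gives the strongest autocorrelation peak
    within a plausible human pitch range."""
    mean = sum(frame) / len(frame)
    x = [s - mean for s in frame]
    lo, hi = int(sr / f0_max), int(sr / f0_min)   # candidate period range
    best_lag, best_r = lo, float("-inf")
    for lag in range(lo, hi + 1):
        r = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        if r > best_r:
            best_lag, best_r = lag, r
    return sr / best_lag

# Synthetic 80 ms voiced frame: first three harmonics of a 100 Hz fundamental
t = [i / SR for i in range(int(0.08 * SR))]
frame = [sum(math.sin(2 * math.pi * 100 * k * ti) / k for k in (1, 2, 3))
         for ti in t]
print(estimate_f0(frame))  # 100.0
```

Because F0 is carried by the periodicity of the waveform rather than its fine spectral detail, this kind of estimate survives band-limiting and noise, which is exactly the robustness the text describes.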

Speech-specific perception: knowledge of the relationship between the
acoustic & perceptual categories in speech is essential for the
interpretation of the complex acoustic patterns of sounds in forensic
samples.

Phonetic quality & voice quality: the voice quality & phonetic
quality of forensic speech samples can differ in four ways: same voice
& phonetic quality; different voice & phonetic quality; same voice
but different phonetic quality; & same phonetic but different voice
quality. In naïve voice discrimination, voice quality is the
dominant factor, e.g. hello (// or /a/).

Affect: refers to the attitudes & feelings a speaker wishes to convey
(Nolan, 1983). How different emotions are actually signaled
linguistically is very complicated. Phonation-type features like
breathy or creaky voice, or pitch features, are used to signal affective
intent, & their use is speaker-specific.
Social intent: a speaker signals social-class, regional, & ethnicity
information in speech. The correlation of these linguistic features
with geographical & social background helps in speaker
identification.

Regulatory intent: knowledge of the dynamics of conversational
interaction is important in forensic phonetics, because the phonological
cues that speakers use to regulate conversations must be understood to
ensure comparability between samples.

Intrinsic indexical factors: this means information about the
characteristics of a speaker that is intrinsic, including age, sex, &
physique. It is forensically important because the speaker does not
have control over these factors.
