Received 5 March 2008; received in revised form 29 July 2008; accepted 27 August 2008
Abstract
The limitation in performance of current speech synthesis and speech recognition systems may result from the fact that these systems are not designed with respect to the human neural processes of speech production and perception. A neurocomputational model of speech production and perception is introduced which is organized with respect to human neural processes of speech production and perception. The production–perception model comprises an artificial computer-implemented vocal tract as a front-end module, which is capable of generating articulatory speech movements and acoustic speech signals. The structure of the production–perception model comprises motor and sensory processing pathways. Speech knowledge is collected during training stages which imitate early stages of speech acquisition. This knowledge is stored in artificial self-organizing maps. The current neurocomputational model is capable of producing and perceiving vowels, VC-, and CV-syllables (V = vowels and C = voiced plosives). Basic features of natural speech production and perception are predicted from this model in a straightforward way: production of speech items is feedforward and feedback controlled, and phoneme realizations vary within perceptually defined regions. Perception is less categorical in the case of vowels in comparison to consonants. Due to its human-like production–perception processing, the model can be discussed as a basic module for more technically relevant approaches to high-quality speech synthesis and high-performance speech recognition.
© 2008 Elsevier B.V. All rights reserved.
Keywords: Speech; Speech production; Speech perception; Neurocomputational model; Artificial neural networks; Self-organizing networks
doi:10.1016/j.specom.2008.08.002
production and perception (Heim et al., 2003; Okada and Hickok, 2006; Callan et al., 2006; Jardri et al., 2007) but only few among them introduce functional neural models which explain and emulate (i) the complex neural sensorimotor processes of speech production (Bailly, 1997; Guenther, 1994, 1995, 2006; Guenther et al., 2006) and (ii) the complex neural processes of speech perception including comprehension (McClelland and Elman, 1986; Gaskell and Marslen-Wilson, 1997; Luce et al., 2000; Grossberg, 2003; Norris et al., 2006; Hickok and Poeppel, 2004, 2007).

It is the aim of this paper to introduce a biologically motivated approach for speech recognition and synthesis, i.e. a computer-implemented neural model using artificial neural networks, capable of imitating human processes of speech production and speech perception. This production–perception model is based on neurophysiological and neuropsychological knowledge of speech processing (Kröger et al., 2008). The structure of the model and the process of collecting speech knowledge during speech acquisition training stages are described in detail in this paper. Furthermore it is described how the model is capable of producing vowels and CV-syllables and why the model is capable of perceiving vowels and consonants categorically.

2. The structure of the neurocomputational model

While the structure of this neurocomputational model is based on neurophysiological and neuropsychological facts (Kröger et al., 2008), the speech knowledge itself is gathered by training artificial neural networks which are part of this model (Kröger et al., 2006a,b). The organization of the model is given in Fig. 1. It comprises a cortical and a subcortical–peripheral part. The cortical part is subdivided with respect to neural processing within the frontal, the temporal, and the parietal cortical lobes. Functionally the model comprises a production and a perception part. In its current state the model excludes linguistic processing (mental grammar, mental lexicon, comprehension, conceptualization) but focuses on sensorimotor processes of speech production and on sublexical speech perception, i.e. sound and syllable identification and discrimination.

Fig. 1. Organization of the neurocomputational model. Boxes with black outline represent neural maps. Arrows indicate processing paths or neural mappings. Boxes without outline indicate processing modules. Grey letters and grey arrows indicate processing modules and neural mappings which are not computer-implemented in the current version of the model.

The production part is divided into feedforward and feedback control (see also Guenther, 2006). It starts with the phonemic representation of a speech item (speech sound, syllable, word, or utterance) and generates the appropriate time course of articulatory movements and the appropriate acoustic speech signal. The phonemic representation of a speech item is generated by higher level
linguistic modules (Levelt et al., 1999; Dell et al., 1999; Indefrey and Levelt, 2004) subsumed as widely distributed frontal–temporal procedural and declarative neural processing modules (Ullman, 2001; Indefrey and Levelt, 2004) which are not specified in detail in this model. Subsequently each phonologically specified syllable (i.e. a phonemic state; a neural activation pattern on the level of the phonemic map) is processed by the feedforward control module. In the case of a frequent syllable, the sensory states (auditory and somatosensory state) and the motor plan state of the syllable (which are already learned or trained during speech acquisition; see below) are activated via the phonetic map. The phonetic map (Fig. 1) can be interpreted as the central neural map constituting the mental syllabary (for the concept of mental syllabary, see Levelt and Wheeldon, 1994; Levelt et al., 1999). For each frequent syllable a phonemic state initiates the neural activation of a specific neuron within the phonetic map, which subsequently leads to activation patterns of the appropriate sensory states and the appropriate motor plan state. In the case of infrequent syllables the motor plan state is assembled within the motor planning module on the level of sub-syllabic units, e.g. syllable constituents like syllable onset and syllable rhyme or single speech sounds (Varley and Whiteside, 2001). This path is not implemented in our model at present. On the level of the motor plan map a high level motor state (motor plan) is activated for each speech item under production (current speech item). This high level motor state defines the temporal coordination of speech gestures or vocal tract action units (Goldstein et al., 2006; Saltzman and Munhall, 1989; for a general description of goal-directed action units, see Sober and Sabes, 2003; Todorov, 2004; Fadiga and Craighero, 2004). The motor plan of a speech item is processed by the motor execution module in order to define the spatio-temporal trajectories of articulator movements. Thus the motor execution module calculates the concrete specification of each speech gesture on the level of the primary motor map (cf. Ito et al., 2004; Sanguineti et al., 1997; Saltzman, 1979; Saltzman and Munhall, 1989; Saltzman and Byrd, 2000). For example, a labial closing gesture involves coordinated movement of at least the lower jaw and the lower and upper lips. Thus each of these articulators must be controlled synergetically for the realization of a speech gesture. Subsequently the movement of an articulator is executed by activating the motor units controlling this articulator via the neuromuscular processing module.

The (lower level) primary motor map comprises 10 articulatory parameters (Kröger et al., 2006b). Each articulatory parameter value is coded by two neurons with complementary activation (see below), leading to 20 neurons encoding the primary motor commands for each point in time. The conversion of physical parameter values (e.g. displacement of an articulator) into neuromotor activation patterns is done (i) by mapping the physical displacement range for each parameter onto a neural activation range [0, 1] (i.e. no activation to full activation of a neuron) and (ii) by defining two neurons for each parameter with complementary activation (a2 = 1 − a1) in order to hold the overall activation a (a = a1 + a2) constant (= 1) for each parameter value. The size of the (higher level) motor plan map depends on the length of the utterance under production. In the case of V-, CV-, and VC-items three vocalic higher level parameters (high–low, front–back, rounded–unrounded) and four higher level consonantal parameters (labial, apical, dorsal, exact closing position) are controlled. These vocalic parameters and the consonantal parameter closing position are encoded using two neurons with complementary activation each, while the three remaining consonantal parameters are encoded by one neuron each in order to reflect the activation of a specific vocal tract organ. Thus the motor plan map for V-, CV-, and VC-items consists of 11 neurons. Since a motor plan encodes a motor or sensory V-, CV-, or VC-item as a transition for C (encoded by four time labels) and a steady state portion for V (encoded by one time label), the (lower level) primary motor state of these items is encoded by five consecutive time labels. Thus the appropriate number of primary motor map neurons for a whole speech item is 5 × 20 = 100 neurons plus 10 neurons for coding five time intervals describing the temporal distance from label to label.

A computer-implemented numerical articulatory vocal tract model generates the time course of vocal tract geometries and subsequently the acoustic vocal tract model generates the acoustic speech signal. A three-dimensional articulatory–acoustic model is used here which is capable of generating high-quality articulatory and acoustic speech signals (Birkholz and Jackèl, 2004; Birkholz and Kröger, 2006, 2007; Birkholz et al., 2006, 2007; Kröger and Birkholz, 2007). These articulatory and acoustic signals are used for feedback control.

The articulatory and acoustic signals generated by feedforward control are continuously monitored or controlled. For this feedback control the articulatory and acoustic signals are converted into neural signals by auditory and somatosensory (i.e. tactile and proprioceptive) receptors. Somatosensory feedback signals (relative positions of articulators to each other and position and degree of vocal tract constrictions, see Saltzman and Munhall, 1989; Shadmehr and Mussa-Ivaldi, 1994; Tremblay et al., 2003; Nasir and Ostry, 2006) are used for controlling motor execution. In addition sensory (i.e. somatosensory and auditory) signals are converted into higher level cortical sensory states, which represent the current speech item. These auditory and somatosensory (feedback) states of a currently produced speech item are processed by comparing them with the appropriate prelearned auditory and somatosensory state, activated by feedforward control before the current speech item is produced. This comparison is done on the level of the somatosensory and auditory processing modules. If the prestored (or feedforward) sensory state and the feedback sensory states indicate a reasonable difference, an error signal is activated for correcting the motor plan during the ongoing feedforward control.
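To make the two-neuron complementary coding described above concrete, the following Python sketch shows one plausible way of converting a physical articulatory parameter value into a pair of complementary neuron activations and back. The parameter range and function names are illustrative assumptions and are not taken from the published implementation.

```python
def encode_parameter(value, lo, hi):
    """Map a physical parameter value (e.g. an articulator displacement)
    onto a complementary neuron pair (a1, a2) with a1 + a2 = 1."""
    a1 = (value - lo) / (hi - lo)      # normalize into the activation range [0, 1]
    a1 = min(max(a1, 0.0), 1.0)        # clip to [0, 1]
    return a1, 1.0 - a1                # complementary activation a2 = 1 - a1

def decode_parameter(a1, lo, hi):
    """Invert the mapping: recover the physical value from neuron a1."""
    return lo + a1 * (hi - lo)

# Example: a hypothetical displacement parameter ranging from -10 to +10 mm.
a1, a2 = encode_parameter(3.0, -10.0, 10.0)
print(a1, a2, decode_parameter(a1, -10.0, 10.0))   # 0.65 0.35 3.0
```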
The perception part of the model starts from an acoustic speech signal, generated by an external speaker (Fig. 1). This signal is converted into neural signals by auditory receptors and is further processed into a cortical higher level auditory signal via the same auditory pathway that is used for the feedback control of speech production (self-productions). Speech perception comprises two pathways (cf. Hickok and Poeppel, 2004, 2007). The auditory-to-meaning pathway (ventral stream) directly activates neural states within the mental lexicon by the high level cortical auditory state for a speech item (e.g. a word). This pathway is not included in our model, since high level mental lexical representations are out of the scope of this study. The auditory-to-motor pathway (dorsal stream) activates the phonetic state of the current speech item (e.g. sound or syllable) within the cortical frontal motor regions. This pathway is included in our model, and it will be shown below that this pathway is capable of modeling categorical perception of speech sounds and of modeling differences in categorical perception of vowels and consonants.

The structure of the neurocomputational model differentiates neural maps and neural mappings. Neural maps are ensembles of neurons which represent the phonemic, phonetic, sensory or motor speech states. These maps are capable of carrying states of different speech items by different neural activation patterns. These activations change from speech item to speech item under production or perception. Neural mappings represent the neural connections between the neurons of neural maps (Fig. 2). These connections can be excitatory or inhibitory. The degree of excitatory or inhibitory connection is described by link weight values. These values wij characterize the neural connection between each pair of neurons. They define the degree of activation of a connected neuron bj within a neural map 1 (comprising M neurons j = 1, . . . , M) resulting from the degree of activation of the neurons ai within a neural map 2 (comprising N neurons i = 1, . . . , N):

bj = actfunc( Σi=1..N ai · wij )   for j = 1, . . . , M     (1)

Here actfunc is the activation function (a sigmoid function in the case of our modeling; see Zell, 2003) which represents the total activation of neuron bj in map 1 as a function of the sum of activations from all neurons i within map 2. The link weight values wij are limited to the interval [−1, +1] (i.e. maximal inhibitory to maximal excitatory link weight value).
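Eq. (1) is a plain one-layer feedforward step. A minimal NumPy sketch of this computation, assuming a logistic sigmoid as actfunc and link weights drawn from [−1, +1], could look as follows (map sizes and array names are illustrative only):

```python
import numpy as np

def actfunc(x):
    """Sigmoid activation function, as assumed for the model."""
    return 1.0 / (1.0 + np.exp(-x))

N, M = 12, 225                        # neurons in map 2 (side layer) and map 1 (central layer)
rng = np.random.default_rng(0)

a = rng.random(N)                     # activations a_i of map 2, in [0, 1]
w = rng.uniform(-1.0, 1.0, (N, M))    # link weights w_ij in [-1, +1]

b = actfunc(a @ w)                    # Eq. (1): b_j = actfunc(sum_i a_i * w_ij)
print(b.shape)                        # (225,) -> one activation per map-1 neuron
```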
The link weight values reflect the whole knowledge inherent in the training data and thus the knowledge gathered during the training procedures. Link weight values are adjusted during training stages, i.e. during speech acquisition stages (see below). They are allowed to be modified continuously in order to reflect new knowledge gained over life time.

One-layer feedforward networks (Fig. 2) are of limited power and are used in our model exclusively for calculating articulatory joint-coordinate parameters from articulatory tract-variable parameters (cf. Kröger et al., 2006c). In this paper we will focus on the central phonetic map and the multilateral co-activation of phonemic states, sensory states, and motor plan states via the phonetic map. This multilateral co-activation is achieved by using self-organizing maps or networks (Kohonen, 2001 and Fig. 3). Each neuron of the central self-organizing map (i.e. the phonetic map) represents a speech item. Different phonetic submaps (i.e. different parts within the phonetic map) are defined for each class of speech items, i.e. for vowels, for CV-, and for VC-syllables. Multilateral co-activation of phonemic, sensory, and motor plan states for a speech item via the phonetic map means that an activated neuron of the phonetic map (representing a currently perceived or produced speech item) leads to a co-activation of neural activation patterns within the phonemic, motor plan, or sensory side
Fig. 5. Auditory state (right side) for a dorsal closing gesture (left side).
other. Three self-organizing maps (size: M = 15 × 15 = 225 neurons) form three phonetic submaps and are trained by using the three training sets described above. Training leads to an adjustment of link weight values wij between the N side layer neurons ai and the M central layer neurons bj. The side layers consist of the motor plan map (i = 1, . . . , K) and the sensory (auditory and somatosensory) maps (i = K + 1, . . . , N), while the central layer represents the phonetic map (j = 1, . . . , M). The link weights wij(tinit) are initialized using random values within the interval [0, 1], i.e. no activation to full activation (Eq. (2)). The adjustment of the link weights is done incrementally, i.e. step by step, using Hebbian learning (Eq. (3)). When a new stimulus I with I = (x0, . . . , xN) is presented, the winner neuron bwinner is identified in the central layer by calculating the minimum of the Euclidian norm between I and Wj, j = 1, . . . , M; i.e. winner = arg minj (||I − Wj||), where Wj is a vector containing the link weights of all links from the central layer neuron bj to the side layer neurons ai, i.e. Wj = (w1j, . . . , wNj). Once the winner neuron bwinner is identified, the link weights for a step t with tinit < t < tmax are updated as

wij(tinit) = rand(0, 1)     (2)
wij(t + 1) = wij(t) + Nwinner,j(t) · L(t) · (Ii − wij(t))     (3)

where 0 < L(t) < 1 is a constantly decreasing learning factor defined as

L(t) = 0.00001 + (Linit − 0.00001) · (1 − t/tmax)     (4)

and Nwinner,j(t) is a neighborhood kernel (see Eq. (5)). Only the link weights of the neurons in the neighborhood around the winner neuron are updated. A 1-neighborhood is defined as all 8 neurons around the winner neuron, if they exist. An (n + 1)-neighborhood contains all neurons of an n-neighborhood and their 1-neighbors, if they exist. Thus the neighborhood kernel Nwinner,j(t) is defined as

Nwinner,j(t) = 1 if bj ∈ r(t)-neighborhood, 0 if bj ∉ r(t)-neighborhood     (5)

with neighborhood radius r(t) around bwinner. The additional step-dependent function r(t) is introduced to get a constantly decreasing neighborhood radius (see Eq. (6)).

r(t) = 1.0 + (rinit − 1.0) · (1 − t/tmax)     (6)

For the babbling training an initial neighborhood radius rinit = 12 and an initial learning rate Linit = 0.8 are chosen.
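For readers who want to experiment with this training scheme, the following sketch implements the update rules of Eqs. (2)–(6) for a small self-organizing map. The square neighborhood follows the 1-neighborhood definition in the text; the map size, step count and the random stimulus source are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

N, SIDE = 12, 15                      # side-layer size and SOM edge length (15 x 15 = 225 neurons)
M = SIDE * SIDE
t_max, L_init, r_init = 50_000, 0.8, 12.0

rng = np.random.default_rng(1)
W = rng.random((M, N))                # Eq. (2): w_ij(t_init) = rand(0, 1)
grid = np.array([(j // SIDE, j % SIDE) for j in range(M)])   # 2D position of each map neuron

def train_step(W, stimulus, t):
    L = 0.00001 + (L_init - 0.00001) * (1.0 - t / t_max)      # Eq. (4): decreasing learning factor
    r = 1.0 + (r_init - 1.0) * (1.0 - t / t_max)              # Eq. (6): decreasing neighborhood radius
    winner = np.argmin(np.linalg.norm(W - stimulus, axis=1))  # minimum Euclidean norm ||I - W_j||
    dist = np.abs(grid - grid[winner]).max(axis=1)            # square-grid distance to the winner
    kernel = (dist <= r).astype(float)                        # Eq. (5): 1 inside the r(t)-neighborhood
    W += kernel[:, None] * L * (stimulus - W)                 # Eq. (3): Hebbian-style update
    return W

for t in range(t_max):
    stimulus = rng.random(N)          # placeholder for a motor-plan/sensory training item
    W = train_step(W, stimulus, t)
```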
Proto-vocalic and proto-syllabic test sets were defined for testing the proto-vocalic and proto-syllabic training results. The proto-vocalic test set comprises 270 proto-vocalic states which cover the language independent articulatory vowel space between the cardinal vowel qualities [i], [a], and [u]. This proto-vocalic test set is organized in the same way as the proto-vocalic training set but the test set exhibits a much lower density within the articulatory or auditory vowel space. This also results in different training and test items. Both proto-syllabic test sets are based on a set of 22 quasi-vocalic motor plan states covering the whole language independent articulatory vowel space. Both proto-syllabic test sets are organized in the same way as the proto-syllabic training sets but the test sets exhibit a lower density within the articulatory or auditory vowel space for the proto-vocalic starting or ending positions of the VC- or CV-proto-syllables. Both proto-syllabic test sets comprise 198 items. The test items were different from the training items defined above.

An estimation of the quality of the proto-vocalic and the proto-syllabic training results is done by calculating a mean error over all test set items for estimating an articulatory state of a test set item from its auditory state. The calculation of the error value for each test item comprises six steps: In a first step the motor plan state of a test item is applied to the motor execution module for calculating the appropriate articulatory patterns (i.e. the time course of articulatory parameters for a speech item) by using the feedforward part of the model. This calculated articulatory pattern is called initial articulatory pattern. In a second step the appropriate auditory state pattern is calculated by using the output of the three-dimensional articulatory–acoustic model for the initial articulatory pattern and by
applying this output to the auditory feedback pathway of the model. In a third step the motor plan state is recalculated from the auditory state pattern calculated in the second step. Note that the trained self-organizing network is used for this step. This step leads to an estimated motor plan state which results from the sensorimotor knowledge stored within the self-organizing network, i.e. which results from the learning or training procedure. In a fourth step the estimated articulatory pattern is calculated for the estimated motor plan states by reusing the feedforward part of the model. In a fifth step the estimated and initial articulatory patterns are compared. An error value is calculated for each test item which is the difference between estimated and initial articulatory pattern. This difference is normalized with respect to the initial articulatory pattern. In a sixth step the mean error over all test set items is calculated for the trained network.
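The six evaluation steps can be written as a short loop. In the sketch below the helper functions motor_execution, articulatory_to_auditory, and estimate_motor_plan_from_auditory are hypothetical stand-ins for the model's feedforward modules and for the trained self-organizing network; only the error bookkeeping follows the procedure described above.

```python
import numpy as np

def evaluation_error(test_items, motor_execution, articulatory_to_auditory,
                     estimate_motor_plan_from_auditory):
    """Mean normalized articulatory estimation error over a test set
    (steps one to six as described in the text)."""
    errors = []
    for motor_plan in test_items:
        initial_art = motor_execution(motor_plan)                 # step 1: initial articulatory pattern
        auditory = articulatory_to_auditory(initial_art)          # step 2: auditory state via vocal tract model
        est_plan = estimate_motor_plan_from_auditory(auditory)    # step 3: trained SOM estimates the motor plan
        estimated_art = motor_execution(est_plan)                 # step 4: estimated articulatory pattern
        diff = np.linalg.norm(estimated_art - initial_art)        # step 5: compare the two patterns ...
        errors.append(diff / np.linalg.norm(initial_art))         # ... normalized by the initial pattern
    return float(np.mean(errors))                                 # step 6: mean error over all test items
```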
500,000 training steps are sufficient for predicting associated articulatory states from the auditory states of the test items with a precision below 2% error rate on the primary motor level in the case of the proto-vocalic training (using the proto-vocalic training set), and 280,000 training steps are sufficient for predicting the articulatory states from the auditory states with a precision below 5% error rate in the case of both proto-syllabic trainings (using both proto-syllabic training sets). Thus the complete babbling training requires less than five minutes on standard PCs.

The resulting link weight values for the neurons connecting the self-organizing phonetic maps with the motor plan and auditory map are graphically displayed for the proto-vocalic training in Fig. 6 and for the proto-CV-syllabic training in Fig. 7. It appears that motor plan states are organized with respect to phonetic categories. In the case of the vocalic phonetic submap vocalic states are ordered continuously with respect to the motor plan parameters high–low and front–back. Experimental evidence for this kind of ordering is given by Obleser et al. (2006). In the case of the syllabic submap three regions occur which represent the gesture-performing articulator (labial, apical, and dorsal), i.e. an ordering occurs with respect to the motor-plan parameter gesture-performing articulator. This neural behavior resulting from self-organization of vocalic and consonantal or syllabic states with respect to phonetic categories (high–low, front–back, gesture-performing articulator) can be labeled as phonetotopy, in parallel to tonotopy for the cortical ordering of auditory states with respect to their fundamental frequency (Kandel et al., 2000, p. 609) or in parallel to somatotopy for the ordering of somatosensory states with respect to their location on the body surface (Kandel et al., 2000, p. 460f).

It should be kept in mind at this point that the general phonetic sensorimotor knowledge stored in these phonetic maps is knowledge of sensorimotor relations exclusively generated by the three-dimensional articulatory and acoustic vocal tract model. Thus it is important for the performance or quality of neurocomputational models of speech production and perception that these models comprise realistic articulatory and acoustic vocal tract models as front-end modules which are capable of generating high-quality articulatory and acoustic signals, since the signals generated by the articulatory–acoustic model are the basis for the calculation of all sensory signals.
Fig. 6. Motor plan and auditory link weight values after vocalic babbling and imitation training for each neuron within the vocalic phonetic map (15 × 15 neurons). Link weight values are given for two motor plan parameters within each neuron box: back–front (left bar) and low–high (right bar). Link weight values are given for three auditory parameters: bark-scaled F1, F2, and F3 (horizontal lines within each neuron box). The outlined boxes indicate the association of neurons with vowel phoneme categories. These associations are established during imitation training (see text).
Fig. 7. Motor plan and auditory link weight values after CV-syllabic babbling and imitation training for each neuron within the CV-phonetic map (15 × 15 neurons). Link weight values are given for five motor plan parameters within each neuron box. First three columns: vocal tract organ which performs the closing gesture (labial, apical, dorsal); last two columns: back–front value (fourth column) and low–high value (fifth column) of the vowel within the CV-sequence. Link weight values are given for three auditory parameters: bark-scaled F1, F2, and F3 (formant transitions within each neuron box). The outlined boxes indicate the association of neurons with consonant phoneme categories /b/, /d/, and /g/; each of these three regions comprises the appropriate consonant in all vocalic contexts. These associations are established during imitation training (see text).
After babbling training the neurocomputational model is capable of reproducing (or imitating) the motor plan state (i.e. the articulation) of any pre-linguistic speech item – in our case of any proto-vowel, proto-CV-syllable and proto-VC-syllable (with C = proto-consonantal closing gestures) – from their acoustic (or auditory) state patterns. Thus the neurocomputational model is now ready for language-specific imitation training. For imitation training the training sets comprise language-specific speech items; in our case vocalic and syllabic speech items. Besides the adjustment of link weights of the mapping between the phonetic map and the sensory maps and of the mapping between the phonetic map and the motor plan map, which is mainly done during babbling training, now in addition the link weights of the mapping between the phonetic map and the phonemic map are adjusted. Language-specific imitation training results in (i) specifying regions of typical phoneme realizations (phone regions) within the phonetic map, i.e. in specifying regions of neurons within the phonetic map which represent typical realizations of a phoneme or of a syllable phoneme chain (see Figs. 6 and 7), and in (ii) fine-tuning of the sensorimotor link weights already trained during babbling. This fine-tuning mainly occurs at the phone regions. Thus the knowledge which is gained during imitation is language dependent. In other words, during this training stage the neurocomputational model mainly learns to link neurons which represent different phonemes or phonemic descriptions of syllables with the motor plan states and with the sensory states of their appropriate typical realizations. In parallel to babbling training, imitation training can also be subdivided into training procedures for vowels, CV-, and VC-syllables.

The vowel imitation training set comprises a set of 100 acoustic vowel realizations per phoneme for a typical five vowel phoneme system /i/, /e/, /a/, /o/, and /u/ (e.g. Bradlow, 1995 and Cervera et al., 2001). A three-dimensional Gaussian distribution was chosen for each phoneme for distributing the 100 realizations per phoneme over the F1–F2–F3-space (Fig. 8 for the F1–F2-space). The distribution of the phoneme realizations in the acoustic vowel space (F1–F2–F3-space) is chosen as realistically as possible. The acoustic vowel realizations within the acoustic vowel space slightly overlap. These 500 vowel realizations are supposed to be realizations given by different external speakers, but matched with respect to the model's babbling vowel space. It should be noted that vowel phonemes normally are learned in the context of words during speech acquisition. This is replaced in this model by training of isolated vowels for reasons of simplicity. More complex training scenarios are beyond the scope of this paper.
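A training set with these properties is easy to generate. The sketch below draws 100 formant triples per phoneme from a three-dimensional Gaussian distribution; the mean formant values and standard deviations are rough illustrative numbers for a five-vowel system, not the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative mean formant values (F1, F2, F3) in Hz for /i/, /e/, /a/, /o/, /u/.
vowel_means = {
    "i": (280, 2250, 2900), "e": (400, 2000, 2600), "a": (750, 1300, 2500),
    "o": (450, 850, 2400),  "u": (320, 750, 2300),
}
sigma = np.array([40.0, 120.0, 150.0])   # assumed per-formant standard deviations

def make_vowel_training_set(n_per_phoneme=100):
    """Return (formants, labels): 100 Gaussian-distributed realizations per phoneme."""
    formants, labels = [], []
    for phoneme, mean in vowel_means.items():
        formants.append(rng.normal(mean, sigma, size=(n_per_phoneme, 3)))
        labels += [phoneme] * n_per_phoneme
    return np.vstack(formants), labels

X, y = make_vowel_training_set()
print(X.shape)    # (500, 3) -> 500 vowel realizations in F1-F2-F3 space
```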
During vowel imitation training each external acoustic (or auditory) vowel item is processed by the proto-vocalic
The production pathway (phonemic map → phonetic map → motor plan map → primary motor map → articulation) has been introduced in Section 2. The speech items which were trained in this study can be labeled as frequent syllables. The description of the processing of infrequent syllables is beyond the scope of this paper. Our training results given above indicate strong neural connections from a phonemic state within the phonemic map to a set of neurons within the phonetic map. Each of these sets of neurons within the phonetic map represents a region of phoneme realizations (phone regions) and thus represents production variability, since neighboring neurons within the phonetic map represent slightly different motor and sensory states (for natural variation in vowel realizations, see Perkell et al., 1993). If a phonemic speech item is activated (phonemic map), this leads to an activation of several neurons within the phonetic map (see the outlined boxes or phone regions for example for the vocalic phonetic map; Fig. 6). Thus in our model the maximally activated neuron within the phonetic map can differ from realization to realization. Therefore the motor plan and the subsequent articulatory realization of a phonemic item are allowed to vary within a perceptually acceptable region. These regions for phonemic items are the phoneme realization regions or phone regions; they are language-specific and are defined during imitation training (see Figs. 6 and 7).

Furthermore coarticulation is introduced in our neurocomputational model. Two sources of coarticulation are implemented in our model. Firstly, coarticulation results from the fact that the exact coordination of articulators for executing a speech gesture is controlled by the motor execution module and that a speech gesture is not encoded in all details on the motor plan level. That leads to variability in gesture execution with respect to context. For example the realization of /b/ in /ibi/ or /aba/ is different in our model. In /aba/ the lower jaw is more involved in the execution of the labial closing gesture than in /ibi/ because of the wide mouth opening occurring in /a/ in comparison to /i/. Because of this wide mouth opening in /a/ it would be ineffective to execute the closing gesture in /aba/ just by using the lips. It is more effective to add a synergetic elevation of the lower jaw. Thus, the lower jaw elevation and the lower lip elevation form a labial closing gesture in a synergetic way. Secondly, coarticulation results from the fact that gesture specifications can vary even on the level of the motor plan. For example lip protrusion is allowed to vary for a consonantal labial closing gesture since lip protrusion is a non-relevant phonemic feature in the case of a labial closing gesture in our target language. Since the labial closing gesture within a CV-syllable temporally overlaps with the following vocalic gesture (e.g. for a gesture for realizing an /i/ or /u/), our simulations show anticipatory lip protrusion on the motor execution level in /pu/ while lips are not protruded during the labial closure in /pi/.

In the case of language-specific perception of speech items it can easily be shown that the neurocomputational model trained thus far for vowels and simple CV- and VC-syllables is capable of producing categorical perception for vowels and in an even stronger way for consonants (i.e. voiced plosives in the case of our model). The auditory pathway for perception of external speech items (auditory receptors → auditory map → phonetic map → phonemic map) has already been introduced in Section 2 (auditory-to-motor pathway, see Hickok and Poeppel, 2000, 2004). Thus the phonetic map is not only a central neural representation in speech production but also in speech perception, at least for sublexical speech units like speech sounds and syllables. In order to show that the current neurocomputational production–perception model perceives vowels (for the five vowel system /i/, /e/, /a/, /o/, and /u/) and consonants (for the voiced plosives /b/, /d/, and /g/) in a speech-like categorical way, speech identification and discrimination experiments were carried out using the model. In order to be able to perform these experiments using the model, 20 different instances of the model were trained using (i) different sets of training data due to different randomization procedures for determining the vocalic items within all training sets, using (ii) a different ordering of training stimuli during each training stage, and using (iii) different sets of randomly generated initial link weight values for each of the 20 instances. The resulting 20 instances of the model are called virtual listeners.

Identification of an external acoustic stimulus is performed in our model by a virtual listener by identifying the most excited neuron within the phonemic map. Discrimination of two external acoustic stimuli is performed in our model by calculating the most activated neuron on the level of the phonetic map for each acoustic stimulus and subsequently by calculating the city block distance between these two neurons for each virtual listener. The phonetotopic ordering of speech items on the level of the phonetic map (see above) is a first hint that distance between speech items (states) on the level of this map indicates phonetic similarity or dissimilarity. Moreover we assume that the sensory resolution of two states (i.e. the capability for discrimination between these states) is governed by the spatial distance of these two states on the level of the phonetic map. This assumption holds for tonotopic ordering and thus for F0-discrimination of auditory stimuli (see the discussion of tonotopic cortical maps, Kandel et al., 2000, p. 609) and this assumption also holds for somatotopic ordering and thus for the spatial discrimination of tactile stimuli (see the discussion of somatotopic maps, Kandel et al., 2000, p. 460ff). Consequently it can be hypothesized that two stimuli can be discriminated if the distance of the activated neurons representing the stimuli on the level of the phonetic map exceeds a certain neuron distance within this map and it can be hypothesized that discrimination becomes stronger with increasing neuron distance.
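A minimal sketch of these two decision procedures is given below. The activation arrays would in practice come from the trained model; here they are random placeholders, and the discrimination threshold is an assumed value standing in for the "certain neuron distance" mentioned above.

```python
import numpy as np

PHONEMES = ["i", "e", "a", "o", "u"]   # illustrative phonemic-map layout (vowels only)
SIDE = 15                              # edge length of the 15 x 15 phonetic map

def identify(phonemic_activation):
    """Identification: pick the phoneme of the most excited phonemic-map neuron."""
    return PHONEMES[int(np.argmax(phonemic_activation))]

def discriminate(phonetic_act_1, phonetic_act_2, threshold=2):
    """Discrimination: compare the winner neurons of two stimuli on the phonetic map
    using the city-block (Manhattan) distance between their grid positions."""
    w1, w2 = int(np.argmax(phonetic_act_1)), int(np.argmax(phonetic_act_2))
    r1, c1 = divmod(w1, SIDE)
    r2, c2 = divmod(w2, SIDE)
    distance = abs(r1 - r2) + abs(c1 - c2)
    return distance >= threshold, distance   # discriminable if the neuron distance is large enough

# Example with random placeholder activations for two stimuli:
rng = np.random.default_rng(7)
print(identify(rng.random(len(PHONEMES))))
print(discriminate(rng.random(SIDE * SIDE), rng.random(SIDE * SIDE)))
```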
Vocalic and consonantal identification and discrimination tests were performed on the basis of quasi-continuous acoustic stimulus continua (for an introduction to speech
Fig. 9. Bark-scaled formant pattern for 13 vocalic stimuli (/i/-/e/-/a/-continuum) for the vocalic perceptual identification and discrimination tests.
Fig. 10. Bark-scaled formant pattern for 13 CV-stimuli (/ba/-/da/-/ga/-continuum) for the consonantal perceptual identification and discrimination tests.
Measured discrimination scores indicate the complete discrimination of two stimuli based on all available auditory information given by these stimuli, not just the linguistic, phonemic, or categorical information needed for (categorical) identification. It can be seen from Figs. 11 and 12 that measured discrimination rates are always higher than calculated discrimination rates. That is in agreement with identification and discrimination scores extracted from identification and discrimination experiments carried out with humans and can be interpreted in the way that acoustic speech stimuli always convey categorical (linguistic) and non-categorical (para-linguistic or non-linguistic extra) information. While measured and calculated discrimination scores are nearly identical in the case of consonants, it emerges from our modeling data that measured discrimination is better than calculated discrimination especially in the case of vowels. This is in agreement with results of natural speech perception (Fry et al., 1962; Eimas, 1963) and reflects the typical differences in categorical perception of consonants and vowels.

5. Discussion and conclusions

The experimental results presented in this paper indicate that a model of speech production and perception which is shaped with respect to basic neurophysiological facts is capable of embedding important features of speech production and speech perception in a straightforward way, even if the neurocomputational modeling is relatively basic, as it is here by using simple standard self-organizing networks. Typical and therefore important features of speech production and perception like production variability of phoneme realizations and categorical speech perception, and especially the fact of different degrees of categorical perception for consonants and vowels, occur in a straightforward way in this production–perception model. Since human speech production and perception easily outperforms speech synthesis and speech recognition systems at least in difficult conditions, it could be useful to include human-like speech processing routines into such technical speech processing systems. This may help to increase the quality and the level of performance of technical speech processing systems.

Furthermore this modeling study indicates the close relationship of speech production and speech perception. Speech perception theories such as the motor theory of speech perception (Liberman et al., 1967; Liberman and Mattingly, 1985) or the direct-realist theory (Fowler, 1986) have already postulated this close relationship. And recent experimental results provide support for this claim and suggest that the development of an integrative model of speech production and perception is highly desirable. For example perceptual feedback loops (also called self-monitoring processes) are known to activate parts of the speech perception mechanism during overt (external perceptual loop) as well as covert speech production (internal perceptual loop, cf. Indefrey and Levelt, 2004; Postma, 2000; Hartsuiker and Kolk, 2001). In addition imaging studies focusing on speech perception have demonstrated that perception is capable of activating parts of the speech production cortical networks (Fadiga et al., 2002; Wilson et al., 2004; Hickok and Poeppel, 2004, 2007).

Bidirectional mappings between phonemic and phonetic and between sensory and phonetic maps are introduced in our neural model in order to illustrate the close relationship between production and perception. The introduction of these bidirectional mappings is the basis for important features of the model like categorical perception. Physiologically a bidirectional mapping comprises two related unidirectional mappings since neurons always forward their firing pulses in one direction (Kandel et al., 2000). Thus physiologically bidirectional mappings are represented by two neural paths connecting the maps in both directions (see the separate arrows in Fig. 1). The phonetic map – which forms the central map for all bidirectional mappings in our model (see Fig. 1) – can be interpreted as the central part of the mental syllabary (Levelt and Wheeldon, 1994; Levelt et al., 1999). Neural cortico-cortical connections exist in both directions between this part of the frontal cortex and the sensory areas as well as between this part of the frontal cortex and those temporal regions which process phonemic information (Kandel et al., 2000).

Other computer-implemented models of speech production (Bailly, 1997; Guenther, 1994, 1995, 2006; Guenther et al., 2006) as well as the model introduced here reflect the relationship between perception and production by incorporating perceptual feedback control loops or by incorporating production–perception pathways for self-monitoring processes (Indefrey and Levelt, 2004). Dual stream models of speech perception have recently been published which introduce a ventral stream for passive auditory processing and a dorsal stream activating auditory-motor networks (e.g. Hickok and Poeppel, 2004, 2007), but passive models of speech perception that do not refer to production processes can also be found (McClelland and Elman, 1986; Gaskell and Marslen-Wilson, 1997; Luce et al., 2000; Norris et al., 2006). The model introduced here reflects the close relationship between speech production and speech perception since on the one hand our model comprises basic features of speech production models (cf. Guenther et al., 2006) and since on the other hand our model is capable of incorporating in addition the dual stream idea (Hickok and Poeppel, 2007) in a straightforward way (see the labels "ventral stream" and "dorsal stream" in Fig. 1).

Mirror neurons (visual and audio–visual mirror neuron system) appear to be one of the neural systems that are involved in the association of production and perception processes (Rizzolatti and Arbib, 1998; Studdert-Kennedy, 2002; Kohler et al., 2002; Fadiga and Craighero, 2004; Rizzolatti and Craighero, 2004; Wilson et al., 2004; Iacoboni, 2005; Wilson and Knoblich, 2005; Arbib, 2005). Systems of mirror neurons have been detected which code the abstract
meaning of goal-directed actions (e.g. grasping) and which are capable of co-activating motor and sensory (visual and audio–visual) representations of these actions by neural cortico-cortical associations. These visual and audio–visual mirror neuron systems also co-activate abstract concepts (preferably for action words) and thus are capable of associating higher order linguistic representations with goal-directed actions. A speech mirror neuron system ("mirror resonant system" after Fadiga and Craighero, 2004, p. 167; "auditory mirror neuron system" or "echo neurons" after Rizzolatti and Craighero, 2004, p. 185f) is postulated which is newer from the viewpoint of evolution in comparison to the mirror neuron system introduced above and which is directly linked with the capacity of humans to learn speech items by imitation. It can be assumed that this speech mirror neuron system co-activates in parallel motor representations, sensory representations, and phonemic representations of speech items. Given that from a phonetic viewpoint speech items also are built up by goal-directed actions (called speech gestures), which build up the motor plans for speech items in our model (see Section 2), it can be hypothesized that a mirror neuron layer also exists for the association of motor, sensory, and phonemic representations of speech gestures (see also Westerman and Miranda, 2004).

Self-organization is a central principle of learning and self-organizing maps are used for modeling cortical networks (Kohonen, 2001). Within our neurocomputational model artificial self-organizing neural networks are implemented since self-organizing neural networks are biologically plausible and have been used successfully for modeling semantic lexical networks (Ritter and Kohonen, 1989), for modeling semantic and phonological aspects during early lexical development (Li et al., 2004), and for modeling the generation and recognition of goal-directed movements (Bullock et al., 1993; Tani et al., 2004). A further argument for using self-organizing maps is their success in modeling the mapping between phonemic and phonetic aspects of speech production as demonstrated by the learning experiments for vowels and syllables described in this study.

In our current model different submaps are used for different classes of speech items (V, CV, VC) and separate training procedures were introduced for training these classes of speech items. This separation of the phonetic map into submaps as well as the separation of training procedures for different speech items was done in order to simplify the modeling of the speech acquisition procedure for these three classes of speech items from the computational viewpoint. But in principle all types of speech items (i.e. all types of syllables and words or word components) can be trained simultaneously by introducing just one comprehensive learning task and by using one single phonetic map. Recent preliminary experiments indicate that a comprehensive single phonetic map shapes different subregions representing different classes of speech items. The ordering of speech items within these subregions is similar to the phonetotopic ordering presented in this paper for the different submaps discussed here.

It is unclear whether the training sets used here constitute a representative natural model of babbling and imitation training during early stages of human speech acquisition. Our training sets comprise a widespread set of vocalic vocal tract positions and a widespread set of opening and closing movements. At least these sets comprise all vocal tract positions and all opening and closing movements which are physiologically possible. But it is conceivable that toddlers very quickly reduce their set of training items from all physiologically possible positions and movements towards a subset of positions and movements which are especially important for speech.

It should be noted that our neural modeling approach does not include modeling of temporal aspects of neural functioning. Rather the temporal aspects of production and perception are included in the speech items and thus in the sensory, motor, phonetic, and phonemic states. In our production–perception model sensory and motor states of vowels and syllables are processed as a whole. Our modeling approach thus is sufficient as long as only a description of the training and processing of syllables is wanted. In contrast a detailed temporal organization becomes important if speech items comprise more than one syllable. In this case processing delays must be introduced for all pathways postulated in the model (cf. Guenther et al., 2006) and temporal aspects of neural activity need to be considered (cf. Maass and Schmitt, 1999).

The two training stages identified by our modeling study distinguish between babbling (i.e. the build-up stage for sensorimotor representations of pre-linguistic proto-vocalic and proto-consonantal speech gestures) and imitation (i.e. the build-up stage for language-specific perceptual, motor, phonetic, and phonemic representations of speech items). A closer modeling of early stages of speech acquisition (Oller et al., 1999) is beyond the scope of this paper. Furthermore, in reality the two training stages introduced here overlap in time. This is partially realized in our approach, since babbling and imitation training items are applied in parallel during the imitation training stage after a short babbling training stage.

The next important step would be to introduce processes for building up the mental lexicon and for modeling the process of word segmentation and identification (cf. Batchelder, 2002; Werker and Yeung, 2005; Jusczyk, 1999; Brent, 1999). The representation of the mental lexicon of the target language is very important for including top-down processes of speech perception and thus for speech recognition. However, consideration of these processes currently goes beyond the scope of the current implementation of our model. But in general the model is open for integrating a mental lexicon.

Last but not least it has to be stated that the neurocomputational production–perception model developed thus far by no means is an alternative solution for high-performance speech recognition or speech synthesis systems. At
present the model described here is capable of producing and perceiving simple CV- and VC-syllables under ideal conditions. Concerning a further development of the model introduced here two different strategies are imaginable. On the one hand, this model can be further developed in order to handle more complex classes of speech items (words, sentences, or a whole discourse) under ideal and non-ideal conditions (e.g. different speakers, different emotional states, external noise). On the other hand, the organization of the neurocomputational model outlined in this paper could be integrated at least partially into the architecture of current or new speech recognition and speech synthesis systems.

Acknowledgement

This work was supported in part by the German Research Council Grant No. KR 1439/13-1.

References

Ackermann, H., Riecker, A., 2003. The contribution of the insula to motor aspects of speech production: a review and a hypothesis. Brain Lang. 89, 320–328.
Arbib, M.A., 2005. From monkey-like action recognition to human language: an evolutionary framework for neurolinguists. Behav. Brain Sci. 28, 105–167.
Bailly, G., 1997. Learning to speak: sensory-motor control of speech movements. Speech Comm. 22, 251–267.
Batchelder, E.O., 2002. Bootstrapping the lexicon: a computational model of infant speech segmentation. Cognition 83, 167–206.
Benson, R.R., Whalen, D.H., Richardson, M., Swainson, B., Clark, V.P., Lai, S., Liberman, A.M., 2001. Parametrically dissociating speech and nonspeech perception in the brain using fMRI. Brain Lang. 78, 364–396.
Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., Fissore, L., Laface, P., Mertins, A., Ris, C., Rose, R., Tyagi, V., Wellekens, C., 2007. Automatic speech recognition and speech variability: a review. Speech Comm. 49, 763–786.
Binder, J.R., Frost, J.A., Hammeke, T.A., Bellgowan, P.S.F., Springer, J.A., Kaufman, J.N., Possing, E.T., 2000. Human temporal lobe activation by speech and nonspeech sounds. Cereb. Cortex 10, 512–528.
Birkholz, P., Jackèl, D., 2004. Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system. In: Proc. Internat. Conf. on Speech and Language Processing (Interspeech 2004, Jeju, Korea), pp. 1125–1128.
Birkholz, P., Kröger, B.J., 2006. Vocal tract model adaptation using magnetic resonance imaging. In: Proc. 7th Internat. Seminar on Speech Production (Belo Horizonte, Brazil), pp. 493–500.
Birkholz, P., Kröger, B.J., 2007. Simulation of vocal tract growth for articulatory speech synthesis. In: Proc. 16th Internat. Congress of Phonetic Sciences (Saarbrücken, Germany), pp. 377–380.
Birkholz, P., Jackèl, D., Kröger, B.J., 2006. Construction and control of a three-dimensional vocal tract model. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2006, Toulouse, France), pp. 873–876.
Birkholz, P., Jackèl, D., Kröger, B.J., 2007. Simulation of losses due to turbulence in the time-varying vocal system. IEEE Trans. Audio Speech Lang. Process. 15, 1218–1225.
Blank, S.C., Scott, S.K., Murphy, K., Warburton, E., Wise, R.J.S., 2002. Speech production: Wernicke, Broca and beyond. Brain 125, 1829–1838.
Boatman, D., 2004. Cortical bases of speech perception: evidence from functional lesion studies. Cognition 92, 47–65.
Bookheimer, S.Y., Zeffiro, T.A., Blaxton, T.A., Gaillard, W., Theodore, W.H., 2000. Activation of language cortex with automatic speech tasks. Neurology 55, 1151–1157.
Bradlow, A.R., 1995. A comparative acoustic study of English and Spanish vowels. J. Acoust. Soc. Amer. 97, 1916–1924.
Brent, M.B., 1999. Speech segmentation and word discovery: a computational perspective. Trends Cognit. Sci. 3, 294–301.
Browman, C., Goldstein, L., 1989. Articulatory gestures as phonological units. Phonology 6, 201–251.
Browman, C., Goldstein, L., 1992. Articulatory phonology: an overview. Phonetica 49, 155–180.
Bullock, D., Grossberg, S., Guenther, F., 1993. A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm. J. Cognit. Neurosci. 5, 408–435.
Callan, D.E., Tsytsarev, V., Hanakawa, T., Callan, A.K., Katsuhara, M., Fukuyama, H., Turner, R., 2006. Song and speech: brain regions involved with perception and covert production. Neuroimage 31, 1327–1342.
Cervera, T., Miralles, J.L., Gonzales-Alvarez, J., 2001. Acoustical analysis of Spanish vowels produced by laryngectomized subjects. J. Speech Lang. Hear. Res. 44, 988–996.
Clark, R.A.J., Richmond, K., King, S., 2007. Multisyn: open-domain unit selection for the Festival speech synthesis system. Speech Comm. 49, 317–330.
Damper, R.I., Harnad, S.R., 2000. Neural network models of categorical perception. Percept. Psychophys. 62, 843–867.
Dell, G.S., Chang, F., Griffin, Z.M., 1999. Connectionist models of language production: lexical access and grammatical encoding. Cognit. Sci. 23, 517–541.
Eimas, P.D., 1963. The relation between identification and discrimination along speech and non-speech continua. Lang. Speech 6, 206–217.
Fadiga, L., Craighero, L., 2004. Electrophysiology of action representation. J. Clin. Neurophysiol. 21, 157–168.
Fadiga, L., Craighero, L., Buccino, G., Rizzolatti, G., 2002. Speech listening specifically modulates the excitability of tongue muscles: a TMS study. Eur. J. Neurosci. 15, 399–402.
Fowler, C.A., 1986. An event approach to the study of speech perception from a direct-realist perspective. J. Phonetics 14, 3–28.
Fry, D.B., Abramson, A.S., Eimas, P.D., Liberman, A.M., 1962. The identification and discrimination of synthetic vowels. Lang. Speech 5, 171–189.
Gaskell, M.G., Marslen-Wilson, W.D., 1997. Integrating form and meaning: a distributed model of speech perception. Lang. Cognit. Process. 12, 613–656.
Goldstein, L., Byrd, D., Saltzman, E., 2006. The role of vocal tract action units in understanding the evolution of phonology. In: Arbib, M.A. (Ed.), Action to Language via the Mirror Neuron System. Cambridge University Press, Cambridge, pp. 215–249.
Goldstein, L., Pouplier, M., Chen, L., Saltzman, E., Byrd, D., 2007. Dynamic action units slip in speech production errors. Cognition 103, 386–412.
Grossberg, S., 2003. Resonant neural dynamics of speech perception. J. Phonetics 31, 423–445.
Guenther, F.H., 1994. A neural network model of speech acquisition and motor equivalent speech production. Biological Cybernet. 72, 43–53.
Guenther, F.H., 1995. Speech sound acquisition, coarticulation, and rate effects in a neural model of speech production. Psychol. Rev. 102, 594–621.
Guenther, F.H., 2006. Cortical interaction underlying the production of speech sounds. J. Comm. Disorders 39, 350–365.
Guenther, F.H., Ghosh, S.S., Tourville, J.A., 2006. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain Lang. 96, 280–301.
Hartsuiker, R.J., Kolk, H.H.J., 2001. Error monitoring in speech production: a computational test of the perceptual loop theory. Cognit. Psychol. 42, 113–157.
Heim, S., Opitz, B., Müller, K., Friederici, A.D., 2003. Phonological processing during language production: fMRI evidence for a shared production-comprehension network. Cognit. Brain Res. 16, 285–296.
Hickok, G., Poeppel, D., 2000. Towards a functional neuroanatomy of speech perception. Trends Cognit. Sci. 4, 131–138.
Hickok, G., Poeppel, D., 2004. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92, 67–99.
Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nature Rev. Neurosci. 8, 393–402.
Hillis, A.E., Work, M., Barker, P.B., Jacobs, M.A., Breese, E.L., Maurer, K., 2004. Re-examining the brain regions crucial for orchestrating speech articulation. Brain 127, 1479–1487.
Huang, J., Carr, T.H., Cao, Y., 2001. Comparing cortical activations for silent and overt speech using event-related fMRI. Hum. Brain Mapp. 15, 39–53.
Iacoboni, M., 2005. Neural mechanisms of imitation. Curr. Opin. Neurobiol. 15, 632–637.
Indefrey, P., Levelt, W.J.M., 2004. The spatial and temporal signatures of word production components. Cognition 92, 101–144.
Ito, T., Gomi, H., Honda, M., 2004. Dynamical simulation of speech cooperative articulation by muscle linkages. Biological Cybernet. 91, 275–282.
Jardri, R., Pins, D., Bubrovszky, M., Despretz, P., Pruvo, J.P., Steinling, M., Thomas, P., 2007. Self awareness and speech processing: an fMRI study. Neuroimage 35, 1645–1653.
Jusczyk, P.W., 1999. How infants begin to extract words from speech. Trends Cognit. Sci. 3, 323–328.
Kandel, E.R., Schwartz, J.H., Jessell, T.M., 2000. Principles of Neural Science, fourth ed. McGraw-Hill, New York.
Kemeny, S., Ye, F.Q., Birn, R., Braun, A.R., 2005. Comparison of continuous overt speech fMRI using BOLD and arterial spin labeling. Hum. Brain Mapp. 24, 173–183.
Kohler, E., Keysers, C., Umilta, M.A., Fogassi, L., Gallese, V., Rizzolatti, G., 2002. Hearing sounds, understanding actions: action representation in mirror neurons. Science 297, 846–848.
Kohonen, T., 2001. Self-Organizing Maps. Springer, Berlin, New York.
Kröger, B.J., 1993. A gestural production model and its application to reduction in German. Phonetica 50, 213–233.
Kröger, B.J., Birkholz, P., 2007. A gesture-based concept for speech movement control in articulatory speech synthesis. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (Eds.), Verbal and Nonverbal Communication Behaviours, LNAI 4775. Springer-Verlag, Berlin, Heidelberg, pp. 174–189.
Kröger, B.J., Birkholz, P., Kannampuzha, J., Neuschaefer-Rube, C., 2006a. Modeling sensory-to-motor mappings using neural nets and a 3D articulatory speech synthesizer. In: Proc. 9th Internat. Conf. on Spoken Language Processing (Interspeech 2006 – ICSLP), pp. 565–568.
Kröger, B.J., Birkholz, P., Kannampuzha, J., Neuschaefer-Rube, C., 2006b. Learning to associate speech-like sensory and motor states during babbling. In: Proc. 7th Internat. Seminar on Speech Production (Belo Horizonte, Brazil), pp. 67–74.
Kröger, B.J., Birkholz, P., Kannampuzha, J., Neuschaefer-Rube, C., 2006c. Spatial-to-joint mapping in a neural model of speech production. In: DAGA-Proc. 32nd Annu. Meet. German Acoustical Society (Braunschweig, Germany), pp. 561–562. <http://www.speechtrainer.eu>.
Kröger, B.J., Lowit, A., Schnitker, R., 2008. The organization of a neurocomputational control model for articulatory speech synthesis. In: Esposito, A., Bourbakis, N., Avouris, N., Hatzilygeroudis, I. (Eds.), Verbal and Nonverbal Features of Human–Human and Human–Machine Interaction. Selected papers from COST Action 2102 International Workshop. Springer-Verlag, pp. 121–135.
Kuriki, S., Mori, T., Hirata, Y., 1999. Motor planning center for speech articulation in the normal human brain. Neuroreport 10, 765–769.
Latorre, J., Iwano, K., Furui, S., 2006. New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer. Speech Comm. 48, 1227–1242.
Levelt, W.J.M., Wheeldon, L., 1994. Do speakers have access to a mental syllabary? Cognition 50, 239–269.
Levelt, W.J.M., Roelofs, A., Meyer, A., 1999. A theory of lexical access in speech production. Behav. Brain Sci. 22, 1–75.
Liberman, A.M., Mattingly, I.G., 1985. The motor theory of speech perception revised. Cognition 21, 1–36.
Liberman, A.M., Harris, K.S., Hoffman, H.S., Griffith, B.C., 1957. The discrimination of speech sounds within and across phoneme boundaries. J. Exp. Psychol. 54, 358–368.
Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M., 1967. Perception of the speech code. Psychol. Rev. 74, 431–461.
Liebenthal, E., Binder, J.R., Spitzer, S.M., Possing, E.T., Medler, D.A., 2005. Neural substrates of phonemic perception. Cereb. Cortex 15, 1621–1631.
Li, P., Farkas, I., MacWhinney, B., 2004. Early lexical development in a self-organizing neural network. Neural Networks 17, 1345–1362.
Luce, P.A., Goldinger, S.D., Auer, E.T., Vitevitch, M.S., 2000. Phonetic priming, neighborhood activation, and PARSYN. Percept. Psychophys. 62, 615–625.
Maass, W., Schmitt, M., 1999. On the complexity of learning for spiking neurons with temporal coding. Inform. Comput. 153, 26–46.
McClelland, J.L., Elman, J.L., 1986. The TRACE model of speech perception. Cognit. Psychol. 18, 1–86.
Murphy, K., Corfield, D.R., Guz, A., Fink, G.R., Wise, R.J.S., Harrison, J., Adams, L., 1997. Cerebral areas associated with motor control of speech in humans. J. Appl. Physiol. 83, 1438–1447.
Nasir, S.M., Ostry, D.J., 2006. Somatosensory precision in speech production. Curr. Biology 16, 1918–1923.
Norris, D., Cutler, A., McQueen, J.M., Butterfield, S., 2006. Phonological and conceptual activation in speech comprehension. Cognit. Psychol. 53, 146–193.
Obleser, J., Boecker, H., Drzezga, A., Haslinger, B., Hennenlotter, A., Roettinger, M., Eulitz, C., Rauschecker, J.P., 2006. Vowel sound extraction in anterior superior temporal cortex. Hum. Brain Mapp. 27, 562–571.
Obleser, J., Wise, R.J.S., Dresner, M.A., Scott, S.K., 2007. Functional integration across brain regions improves speech perception under adverse listening conditions. J. Neurosci. 27, 2283–2289.
Okada, K., Hickok, G., 2006. Left posterior auditory-related cortices participate both in speech perception and speech production: neural overlap revealed by fMRI. Brain Lang. 98, 112–117.
Oller, D.K., Eilers, R.E., Neal, A.R., Schwartz, H.K., 1999. Precursors to speech in infancy: the prediction of speech and language disorders. J. Comm. Disorders 32, 223–245.
Perkell, J.S., Matthies, M.L., Svirsky, M.A., Jordan, M.I., 1993. Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: a pilot "motor equivalence" study. J. Acoust. Soc. Amer. 93, 2948–2961.
Poeppel, D., Guillemin, A., Thompson, J., Fritz, J., Bavelier, D., Braun, A.R., 2004. Auditory lexical decision, categorical perception, and FM direction discrimination differentially engage left and right auditory cortex. Neuropsychologia 42, 183–200.
Postma, A., 2000. Detection of errors during speech production: a review of speech monitoring models. Cognition 77, 97–131.
Raphael, L.J., Borden, G.J., Harris, K.S., 2007. Speech Science Primer: Physiology, Acoustics, and Perception of Speech, fifth ed. Lippincott Williams & Wilkins.
Riecker, A., Kassubek, J., Gröschel, K., Grodd, W., Ackermann, H., 2006. The cerebral control of speech tempo: opposite relationship between speaking rate and BOLD signal change at striatal and cerebellar structures. Neuroimage 29, 46–53.
Rimol, L.M., Specht, K., Weis, S., Savoy, R., Hugdahl, K., 2005. Processing of sub-syllabic speech units in the posterior temporal lobe: an fMRI study. Neuroimage 26, 1059–1067.
Ritter, H., Kohonen, T., 1989. Self-organizing semantic maps. Biological Cybernet. 61, 241–254.
Rizzolatti, G., Arbib, M.A., 1998. Language within our grasp. Trends Neurosci. 21, 188–194.
Rizzolatti, G., Craighero, L., 2004. The mirror neuron system. Annu. Rev. Neurosci. 27, 169–192.
Rosen, H.J., Ojemann, J.G., Ollinger, J.M., Petersen, S.E., 2000. Comparison of brain activation during word retrieval done silently and aloud using fMRI. Brain Cognit. 42, 201–217.
Saltzman, E., 1979. Levels of sensorimotor representation. J. Math. Psychol. 20, 91–163.
Saltzman, E., Byrd, D., 2000. Task-dynamics of gestural timing: phase windows and multifrequency rhythms. Hum. Movement Sci. 19, 499–526.
Saltzman, E., Munhall, K.G., 1989. A dynamic approach to gestural patterning in speech production. Ecol. Psychol. 1, 333–382.
Sanguineti, V., Laboissiere, R., Payan, Y., 1997. A control model of human tongue movements in speech. Biological Cybernet. 77, 11–22.
Scharenborg, O., 2007. Reaching over the gap: a review of efforts to link human and automatic speech recognition research. Speech Comm. 49, 336–347.
Scott, S.K., Blank, C.C., Rosen, S., Wise, R.J.S., 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123, 2400–2406.
Shadmehr, R., Mussa-Ivaldi, A., 1994. Adaptive representation of dynamics during learning of a motor task. J. Neurosci. 14, 3208–3224.
Shuster, L.I., Lemieux, S.K., 2005. An fMRI investigation of covertly and overtly produced mono- and multisyllabic words. Brain Lang. 93, 20–31.
Sober, S.J., Sabes, P.N., 2003. Multisensory integration during motor planning. J. Neurosci. 23, 6982–6992.
Sörös, R., Guttman Sakoloff, L., Bose, A., McIntosh, A.R., Graham, S.J., Stuss, D.T., 2006. Clustered functional MRI of overt speech production. Neuroimage 32, 376–387.
Studdert-Kennedy, M., 2002. Mirror neurons, vocal imitation, and the evolution of particulate speech. In: Stamenov, M.I., Gallese, V. (Eds.), Mirror Neurons and the Evolution of Brain and Language. Benjamin, Philadelphia, pp. 207–227.
Tani, J., Masato, I., Sugita, Y., 2004. Self-organization of distributed represented multiple behavior schemata in a mirror system: reviews of robot experiments using RNNPB. Neural Networks 17, 1273–1289.
Todorov, E., 2004. Optimality principles in sensorimotor control. Nature Neurosci. 7, 907–915.
Tremblay, S., Shiller, D.M., Ostry, D.J., 2003. Somatosensory basis of speech production. Nature 423, 866–869.
Ullman, M.T., 2001. A neurocognitive perspective on language: the declarative/procedural model. Nature Rev. Neurosci. 2, 717–726.
Uppenkamp, S., Johnsrude, I.S., Norris, D., Marslen-Wilson, W., Patterson, R.D., 2006. Locating the initial stages of speech-sound processing in human temporal cortex. Neuroimage 31, 1284–1296.
Vanlancker-Sidtis, D., McIntosh, A.R., Grafton, S., 2003. PET activation studies comparing two speech tasks widely used in surgical mapping. Brain Lang. 85, 245–261.
Varley, R., Whiteside, S., 2001. What is the underlying impairment in acquired apraxia of speech? Aphasiology 15, 39–49.
Werker, J.F., Yeung, H.H., 2005. Infant speech perception bootstraps word learning. Trends Cognit. Sci. 9, 519–527.
Westerman, G., Miranda, E.R., 2004. A new model of sensorimotor coupling in the development of speech. Brain Lang. 89, 393–400.
Wilson, M., Knoblich, G., 2005. The case for motor involvement in perceiving conspecifics. Psychol. Bull. 131, 460–473.
Wilson, S.M., Saygin, A.P., Sereno, M.I., Iacoboni, M., 2004. Listening to speech activates motor areas involved in speech production. Nature Neurosci. 7, 701–702.
Wise, R.J.S., Greene, J., Büchel, C., Scott, S.K., 1999. Brain regions involved in articulation. The Lancet 353, 1057–1061.
Zekveld, A., Heslenfeld, D.J., Festen, J.M., Schoonhoven, R., 2006. Top-down and bottom-up processes in speech comprehension. Neuroimage 32, 1826–1836.
Zell, A., 2003. Simulation neuronaler Netze. Oldenbourg Verlag, München, Wien.