Received 5 March 2008; received in revised form 29 July 2008; accepted 27 August 2008
Abstract
The limitation in performance of current speech synthesis and speech recognition systems may result from the fact that these systems are not designed with respect to the human neural processes of speech production and perception. A neurocomputational model of speech production and perception is introduced which is organized with respect to human neural processes of speech production and perception. The production–perception model comprises an artificial computer-implemented vocal tract as a front-end module, which is capable of generating articulatory speech movements and acoustic speech signals. The structure of the production–perception model comprises motor and sensory processing pathways. Speech knowledge is collected during training stages which imitate early stages of speech acquisition. This knowledge is stored in artificial self-organizing maps. The current neurocomputational model is capable of producing and perceiving vowels, VC-, and CV-syllables (V = vowels and C = voiced plosives). Basic features of natural speech production and perception are predicted from this model in a straightforward way: production of speech items is feedforward and feedback controlled, and phoneme realizations vary within perceptually defined regions. Perception is less categorical in the case of vowels in comparison to consonants. Due to its human-like production–perception processing, the model can be discussed as a basic module for more technically relevant approaches to high-quality speech synthesis and high-performance speech recognition.
© 2008 Elsevier B.V. All rights reserved.
Keywords: Speech; Speech production; Speech perception; Neurocomputational model; Artificial neural networks; Self-organizing networks
doi:10.1016/j.specom.2008.08.002
production and perception (Heim et al., 2003; Okada and Hickok, 2006; Callan et al., 2006; Jardri et al., 2007) but only few among them introduce functional neural models which explain and emulate (i) the complex neural sensorimotor processes of speech production (Bailly, 1997; Guenther, 1994, 1995, 2006; Guenther et al., 2006) and (ii) the complex neural processes of speech perception including comprehension (McClelland and Elman, 1986; Gaskell and Marslen-Wilson, 1997; Luce et al., 2000; Grossberg, 2003; Norris et al., 2006; Hickok and Poeppel, 2004, 2007).

It is the aim of this paper to introduce a biologically motivated approach for speech recognition and synthesis, i.e. a computer-implemented neural model using artificial neural networks, capable of imitating human processes of speech production and speech perception. This production–perception model is based on neurophysiological and neuropsychological knowledge of speech processing (Kröger et al., 2008). The structure of the model and the process of collecting speech knowledge during speech acquisition training stages are described in detail in this paper. Furthermore it is described how the model is capable of producing vowels and CV-syllables and why the model is capable of perceiving vowels and consonants categorically.

2. The structure of the neurocomputational model

While the structure of this neurocomputational model is based on neurophysiological and neuropsychological facts (Kröger et al., 2008), the speech knowledge itself is gathered by training artificial neural networks which are part of this model (Kröger et al., 2006a,b). The organization of the model is given in Fig. 1. It comprises a cortical and a subcortical–peripheral part. The cortical part is subdivided with respect to neural processing within the frontal, the temporal, and the parietal cortical lobes. Functionally the model comprises a production and a perception part. In its current state the model excludes linguistic processing (mental grammar, mental lexicon, comprehension, conceptualization) but focuses on sensorimotor processes of speech production and on sublexical speech perception, i.e. sound and syllable identification and discrimination.

Fig. 1. Organization of the neurocomputational model. Boxes with black outline represent neural maps. Arrows indicate processing paths or neural mappings. Boxes without outline indicate processing modules. Grey letters and grey arrows indicate processing modules and neural mappings which are not computer-implemented in the current version of the model.

The production part is divided into feedforward and feedback control (see also Guenther, 2006). It starts with the phonemic representation of a speech item (speech sound, syllable, word, or utterance) and generates the appropriate time course of articulatory movements and the appropriate acoustic speech signal. The phonemic representation of a speech item is generated by higher level
linguistic modules (Levelt et al., 1999; Dell et al., 1999; Indefrey and Levelt, 2004) subsumed as widely distributed frontal–temporal procedural and declarative neural processing modules (Ullman, 2001; Indefrey and Levelt, 2004) which are not specified in detail in this model. Subsequently each phonologically specified syllable (i.e. a phonemic state; a neural activation pattern on the level of the phonemic map) is processed by the feedforward control module. In the case of a frequent syllable, the sensory states (auditory and somatosensory state) and the motor plan state of the syllable (which are already learned or trained during speech acquisition; see below) are activated via the phonetic map. The phonetic map (Fig. 1) can be interpreted as the central neural map constituting the mental syllabary (for the concept of mental syllabary, see Levelt and Wheeldon, 1994; Levelt et al., 1999). For each frequent syllable a phonemic state initiates the neural activation of a specific neuron within the phonetic map, which subsequently leads to activation patterns of the appropriate sensory states and the appropriate motor plan state. In the case of infrequent syllables the motor plan state is assembled within the motor planning module on the level of sub-syllabic units, e.g. syllable constituents like syllable onset and syllable rhyme or single speech sounds (Varley and Whiteside, 2001). This path is not implemented in our model at present. On the level of the motor plan map a high level motor state (motor plan) is activated for each speech item under production (current speech item). This high level motor state defines the temporal coordination of speech gestures or vocal tract action units (Goldstein et al., 2006; Saltzman and Munhall, 1989; for a general description of goal-directed action units, see Sober and Sabes, 2003; Todorov, 2004; Fadiga and Craighero, 2004). The motor plan of a speech item is processed by the motor execution module in order to define the spatio-temporal trajectories of articulator movements. Thus the motor execution module calculates the concrete specification of each speech gesture on the level of the primary motor map (cf. Ito et al., 2004; Sanguineti et al., 1997; Saltzman, 1979; Saltzman and Munhall, 1989; Saltzman and Byrd, 2000). For example, a labial closing gesture involves coordinated movement of at least the lower jaw and the lower and upper lips. Thus each of these articulators must be controlled synergetically for the realization of a speech gesture. Subsequently the movement of an articulator is executed by activating the motor units controlling this articulator via the neuromuscular processing module.

The (lower level) primary motor map comprises 10 articulatory parameters (Kröger et al., 2006b). Each articulatory parameter value is coded by two neurons with complementary activation (see below), leading to 20 neurons encoding the primary motor commands for each point in time. The conversion of physical parameter values (e.g. displacement of an articulator) into neuromotor activation patterns is done (i) by mapping the physical displacement range for each parameter onto a neural activation range [0, 1] (i.e. no activation to full activation of a neuron) and (ii) by defining two neurons for each parameter with complementary activation (a2 = 1 − a1) in order to hold the overall activation a (a = a1 + a2) constant (= 1) for each parameter value. The size of the (higher level) motor plan map depends on the length of the utterance under production. In the case of V-, CV-, and VC-items three vocalic higher level parameters (high–low, front–back, rounded–unrounded) and four higher level consonantal parameters (labial, apical, dorsal, exact closing position) are controlled. These vocalic parameters and the consonantal parameter closing position are encoded using two neurons with complementary activation each, while the three remaining consonantal parameters are encoded by one neuron each in order to reflect the activation of a specific vocal tract organ. Thus the motor plan map for V-, CV-, and VC-items consists of 11 neurons. Since a motor plan encodes a motor or sensory V-, CV-, or VC-item as a transition for C (encoded by four time labels) and a steady state portion for V (encoded by one time label), the (lower level) primary motor state of these items is encoded by five consecutive time labels. Thus the appropriate number of primary motor map neurons for a whole speech item is 5 × 20 = 100 neurons plus 10 neurons for coding five time intervals describing the temporal distance from label to label.

A computer-implemented numerical articulatory vocal tract model generates the time course of vocal tract geometries and subsequently the acoustic vocal tract model generates the acoustic speech signal. A three-dimensional articulatory–acoustic model is used here which is capable of generating high-quality articulatory and acoustic speech signals (Birkholz and Jackèl, 2004; Birkholz and Kröger, 2006, 2007; Birkholz et al., 2006, 2007; Kröger and Birkholz, 2007). These articulatory and acoustic signals are used for feedback control.

The articulatory and acoustic signals generated by feedforward control are continuously monitored or controlled. For this feedback control the articulatory and acoustic signals are converted into neural signals by auditory and somatosensory (i.e. tactile and proprioceptive) receptors. Somatosensory feedback signals (relative positions of articulators to each other and position and degree of vocal tract constrictions, see Saltzman and Munhall, 1989; Shadmehr and Mussa-Ivaldi, 1994; Tremblay et al., 2003; Nasir and Ostry, 2006) are used for controlling motor execution. In addition sensory (i.e. somatosensory and auditory) signals are converted into higher level cortical sensory states, which represent the current speech item. These auditory and somatosensory (feedback) states of a currently produced speech item are processed by comparing them with the appropriate prelearned auditory and somatosensory state, activated by feedforward control before the current speech item is produced. This comparison is done on the level of the somatosensory and auditory processing modules. If the prestored (or feedforward) sensory state and the feedback sensory states indicate a reasonable difference, an error signal is activated for correcting the motor plan during the ongoing feedforward control.
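To make the two-neuron complementary coding described above concrete, the following Python sketch shows one plausible way of converting a physical articulatory parameter value into a pair of complementary neuron activations and back. The parameter range and function names are illustrative assumptions and are not taken from the published implementation.

```python
def encode_parameter(value, lo, hi):
    """Map a physical parameter value (e.g. an articulator displacement)
    onto a complementary neuron pair (a1, a2) with a1 + a2 = 1."""
    a1 = (value - lo) / (hi - lo)      # normalize into the activation range [0, 1]
    a1 = min(max(a1, 0.0), 1.0)        # clip to [0, 1]
    return a1, 1.0 - a1                # complementary activation a2 = 1 - a1

def decode_parameter(a1, lo, hi):
    """Invert the mapping: recover the physical value from neuron a1."""
    return lo + a1 * (hi - lo)

# Example: a hypothetical displacement parameter ranging from -10 to +10 mm.
a1, a2 = encode_parameter(3.0, -10.0, 10.0)
print(a1, a2, decode_parameter(a1, -10.0, 10.0))   # 0.65 0.35 3.0
```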
The perception part of the model starts from an acoustic speech signal, generated by an external speaker (Fig. 1). This signal is converted into neural signals by auditory receptors and is further processed into a cortical higher level auditory signal via the same auditory pathway that is used for the feedback control of speech production (self-productions). Speech perception comprises two pathways (cf. Hickok and Poeppel, 2004, 2007). The auditory-to-meaning pathway (ventral stream) directly activates neural states within the mental lexicon by the high level cortical auditory state for a speech item (e.g. a word). This pathway is not included in our model, since high level mental lexical representations are out of the scope of this study. The auditory-to-motor pathway (dorsal stream) activates the phonetic state of the current speech item (e.g. sound or syllable) within the cortical frontal motor regions. This pathway is included in our model, and it will be shown below that this pathway is capable of modeling categorical perception of speech sounds and of modeling differences in categorical perception of vowels and consonants.

The structure of the neurocomputational model differentiates neural maps and neural mappings. Neural maps are ensembles of neurons which represent the phonemic, phonetic, sensory or motor speech states. These maps are capable of carrying states of different speech items by different neural activation patterns. These activations change from speech item to speech item under production or perception. Neural mappings represent the neural connections between the neurons of neural maps (Fig. 2). These connections can be excitatory or inhibitory. The degree of excitatory or inhibitory connection is described by link weight values. These values wij characterize the neural connection between each pair of neurons. They define the degree of activation of a connected neuron bj within a neural map 1 (comprising M neurons j = 1, . . . , M) resulting from the degree of activation of the neurons ai within a neural map 2 (comprising N neurons i = 1, . . . , N):

bj = actfunc( Σi=1..N ai · wij )   for j = 1, . . . , M     (1)

Here actfunc is the activation function (a sigmoid function in the case of our modeling; see Zell, 2003) which represents the total activation of neuron bj in map 1 as a function of the sum of activations from all neurons i within map 2. The link weight values wij are limited to the interval [−1, +1] (i.e. maximal inhibitory to maximal excitatory link weight value).
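Eq. (1) is a plain one-layer feedforward step. A minimal NumPy sketch of this computation, assuming a logistic sigmoid as actfunc and link weights drawn from [−1, +1], could look as follows (map sizes and array names are illustrative only):

```python
import numpy as np

def actfunc(x):
    """Sigmoid activation function, as assumed for the model."""
    return 1.0 / (1.0 + np.exp(-x))

N, M = 12, 225                        # neurons in map 2 (side layer) and map 1 (central layer)
rng = np.random.default_rng(0)

a = rng.random(N)                     # activations a_i of map 2, in [0, 1]
w = rng.uniform(-1.0, 1.0, (N, M))    # link weights w_ij in [-1, +1]

b = actfunc(a @ w)                    # Eq. (1): b_j = actfunc(sum_i a_i * w_ij)
print(b.shape)                        # (225,) -> one activation per map-1 neuron
```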
The link weight values reflect the whole knowledge inherent in the training data and thus the knowledge gathered during the training procedures. Link weight values are adjusted during training stages, i.e. during speech acquisition stages (see below). They are allowed to be modified continuously in order to reflect new knowledge gained over life time.

One-layer feedforward networks (Fig. 2) are of limited power and are used in our model exclusively for calculating articulatory joint-coordinate parameters from articulatory tract-variable parameters (cf. Kröger et al., 2006c). In this paper we will focus on the central phonetic map and the multilateral co-activation of phonemic states, sensory states, and motor plan states via the phonetic map. This multilateral co-activation is achieved by using self-organizing maps or networks (Kohonen, 2001 and Fig. 3). Each neuron of the central self-organizing map (i.e. the phonetic map) represents a speech item. Different phonetic submaps (i.e. different parts within the phonetic map) are defined for each class of speech items, i.e. for vowels, for CV-, and for VC-syllables. Multilateral co-activation of phonemic, sensory, and motor plan states for a speech item via the phonetic map means that an activated neuron of the phonetic map (representing a currently perceived or produced speech item) leads to a co-activation of neural activation patterns within the phonemic, motor plan, or sensory side
Fig. 5. Auditory state (right side) for a dorsal closing gesture (left side).
other. Three self-organizing maps (size: M = 15 × 15 = 225 neurons) form three phonetic submaps and are trained by using the three training sets described above. Training leads to an adjustment of link weight values wij between the N side layer neurons ai and the M central layer neurons bj. The side layers consist of the motor plan map (i = 1, . . . , K) and the sensory (auditory and somatosensory) maps (i = K + 1, . . . , N), while the central layer represents the phonetic map (j = 1, . . . , M). The link weights wij(tinit) are initialized using random values within the interval [0, 1], i.e. no activation to full activation (Eq. (2)). The adjustment of the link weights is done incrementally, i.e. step by step, using Hebbian learning (Eq. (3)). When a new stimulus I with I = (x0, . . . , xN) is presented, the winner neuron bwinner is identified in the central layer by calculating the minimum of the Euclidian norm between I and Wj, j = 1, . . . , M; i.e. winner = arg minj (||I − Wj||), where Wj is a vector containing the link weights of all links from the central layer neuron bj to the side layer neurons ai, i.e. Wj = (w1j, . . . , wNj). Once the winner neuron bwinner is identified, the link weights for a step t with tinit < t < tmax are updated as

wij(tinit) = rand(0, 1)     (2)
wij(t + 1) = wij(t) + Nwinner,j(t) · L(t) · (Ii − wij(t))     (3)

where 0 < L(t) < 1 is a constantly decreasing learning factor defined as

L(t) = 0.00001 + (Linit − 0.00001) · (1 − t/tmax)     (4)

and Nwinner,j(t) is a neighborhood kernel (see Eq. (5)). Only the link weights of the neurons in the neighborhood around the winner neuron are updated. A 1-neighborhood is defined as all 8 neurons around the winner neuron, if they exist. An (n + 1)-neighborhood contains all neurons of an n-neighborhood and their 1-neighbors, if they exist. Thus the neighborhood kernel Nwinner,j(t) is defined as

Nwinner,j(t) = 1 if bj ∈ r(t)-neighborhood, 0 if bj ∉ r(t)-neighborhood     (5)

with neighborhood radius r(t) around bwinner. The additional step-dependent function r(t) is introduced to get a constantly decreasing neighborhood radius (see Eq. (6)).

r(t) = 1.0 + (rinit − 1.0) · (1 − t/tmax)     (6)

For the babbling training an initial neighborhood radius rinit = 12 and an initial learning rate Linit = 0.8 are chosen.
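For readers who want to experiment with this training scheme, the following sketch implements the update rules of Eqs. (2)–(6) for a small self-organizing map. The square neighborhood follows the 1-neighborhood definition in the text; the map size, step count and the random stimulus source are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

N, SIDE = 12, 15                      # side-layer size and SOM edge length (15 x 15 = 225 neurons)
M = SIDE * SIDE
t_max, L_init, r_init = 50_000, 0.8, 12.0

rng = np.random.default_rng(1)
W = rng.random((M, N))                # Eq. (2): w_ij(t_init) = rand(0, 1)
grid = np.array([(j // SIDE, j % SIDE) for j in range(M)])   # 2D position of each map neuron

def train_step(W, stimulus, t):
    L = 0.00001 + (L_init - 0.00001) * (1.0 - t / t_max)      # Eq. (4): decreasing learning factor
    r = 1.0 + (r_init - 1.0) * (1.0 - t / t_max)              # Eq. (6): decreasing neighborhood radius
    winner = np.argmin(np.linalg.norm(W - stimulus, axis=1))  # minimum Euclidean norm ||I - W_j||
    dist = np.abs(grid - grid[winner]).max(axis=1)            # square-grid distance to the winner
    kernel = (dist <= r).astype(float)                        # Eq. (5): 1 inside the r(t)-neighborhood
    W += kernel[:, None] * L * (stimulus - W)                 # Eq. (3): Hebbian-style update
    return W

for t in range(t_max):
    stimulus = rng.random(N)          # placeholder for a motor-plan/sensory training item
    W = train_step(W, stimulus, t)
```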
Proto-vocalic and proto-syllabic test sets were defined for testing the proto-vocalic and proto-syllabic training results. The proto-vocalic test set comprises 270 proto-vocalic states which cover the language independent articulatory vowel space between the cardinal vowel qualities [i], [a], and [u]. This proto-vocalic test set is organized in the same way as the proto-vocalic training set but the test set exhibits a much lower density within the articulatory or auditory vowel space. This also results in different training and test items. Both proto-syllabic test sets are based on a set of 22 quasi-vocalic motor plan states covering the whole language independent articulatory vowel space. Both proto-syllabic test sets are organized in the same way as the proto-syllabic training sets but the test sets exhibit a lower density within the articulatory or auditory vowel space for the proto-vocalic starting or ending positions of the VC- or CV-proto-syllables. Both proto-syllabic test sets comprise 198 items. The test items were different from the training items defined above.

An estimation of the quality of the proto-vocalic and the proto-syllabic training results is done by calculating a mean error over all test set items for estimating an articulatory state of a test set item from its auditory state. The calculation of the error value for each test item comprises six steps: In a first step the motor plan state of a test item is applied to the motor execution module for calculating the appropriate articulatory patterns (i.e. the time course of articulatory parameters for a speech item) by using the feedforward part of the model. This calculated articulatory pattern is called initial articulatory pattern. In a second step the appropriate auditory state pattern is calculated by using the output of the three-dimensional articulatory–acoustic model for the initial articulatory pattern and by
applying this output to the auditory feedback pathway of the model. In a third step the motor plan state is recalculated from the auditory state pattern calculated in the second step. Note that the trained self-organizing network is used for this step. This step leads to an estimated motor plan state which results from the sensorimotor knowledge stored within the self-organizing network, i.e. which results from the learning or training procedure. In a fourth step the estimated articulatory pattern is calculated for the estimated motor plan states by reusing the feedforward part of the model. In a fifth step the estimated and initial articulatory patterns are compared. An error value is calculated for each test item which is the difference between estimated and initial articulatory pattern. This difference is normalized with respect to the initial articulatory pattern. In a sixth step the mean error over all test set items is calculated for the trained network.
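The six evaluation steps can be written as a short loop. In the sketch below the helper functions motor_execution, articulatory_to_auditory, and estimate_motor_plan_from_auditory are hypothetical stand-ins for the model's feedforward modules and for the trained self-organizing network; only the error bookkeeping follows the procedure described above.

```python
import numpy as np

def evaluation_error(test_items, motor_execution, articulatory_to_auditory,
                     estimate_motor_plan_from_auditory):
    """Mean normalized articulatory estimation error over a test set
    (steps one to six as described in the text)."""
    errors = []
    for motor_plan in test_items:
        initial_art = motor_execution(motor_plan)                 # step 1: initial articulatory pattern
        auditory = articulatory_to_auditory(initial_art)          # step 2: auditory state via vocal tract model
        est_plan = estimate_motor_plan_from_auditory(auditory)    # step 3: trained SOM estimates the motor plan
        estimated_art = motor_execution(est_plan)                 # step 4: estimated articulatory pattern
        diff = np.linalg.norm(estimated_art - initial_art)        # step 5: compare the two patterns ...
        errors.append(diff / np.linalg.norm(initial_art))         # ... normalized by the initial pattern
    return float(np.mean(errors))                                 # step 6: mean error over all test items
```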
500,000 training steps are sufficient for predicting associated articulatory states from the auditory states of the test items with a precision below 2% error rate on the primary motor level in the case of the proto-vocalic training (using the proto-vocalic training set), and 280,000 training steps are sufficient for predicting the articulatory states from the auditory states with a precision below 5% error rate in the case of both proto-syllabic trainings (using both proto-syllabic training sets). Thus the complete babbling training requires less than five minutes on standard PCs.

The resulting link weight values for the neurons connecting the self-organizing phonetic maps with the motor plan and auditory map are graphically displayed for the proto-vocalic training in Fig. 6 and for the proto-CV-syllabic training in Fig. 7. It appears that motor plan states are organized with respect to phonetic categories. In the case of the vocalic phonetic submap vocalic states are ordered continuously with respect to the motor plan parameters high–low and front–back. Experimental evidence for this kind of ordering is given by Obleser et al. (2006). In the case of the syllabic submap three regions occur which represent the gesture-performing articulator (labial, apical, and dorsal), i.e. an ordering occurs with respect to the motor-plan parameter gesture-performing articulator. This neural behavior resulting from self-organization of vocalic and consonantal or syllabic states with respect to phonetic categories (high–low, front–back, gesture-performing articulator) can be labeled as phonetotopy, in parallel to tonotopy for the cortical ordering of auditory states with respect to their fundamental frequency (Kandel et al., 2000, p. 609) or in parallel to somatotopy for the ordering of somatosensory states with respect to their location on the body surface (Kandel et al., 2000, p. 460f).

It should be kept in mind at this point that the general phonetic sensorimotor knowledge stored in these phonetic maps is knowledge of sensorimotor relations exclusively generated by the three-dimensional articulatory and acoustic vocal tract model. Thus it is important for the performance or quality of neurocomputational models of speech production and perception that these models comprise realistic articulatory and acoustic vocal tract models as front-end modules which are capable of generating high-quality articulatory and acoustic signals, since the signals generated by the articulatory–acoustic model are the basis for the calculation of all sensory signals.
Fig. 6. Motor plan and auditory link weight values after vocalic babbling and imitation training for each neuron within the vocalic phonetic map (15 × 15 neurons). Link weight values are given for two motor plan parameters within each neuron box: back–front (left bar) and low–high (right bar). Link weight values are given for three auditory parameters: bark-scaled F1, F2, and F3 (horizontal lines within each neuron box). The outlined boxes indicate the association of neurons with vowel phoneme categories. These associations are established during imitation training (see text).
Fig. 7. Motor plan and auditory link weight values after CV-syllabic babbling and imitation training for each neuron within the CV-phonetic map (15 × 15 neurons). Link weight values are given for five motor plan parameters within each neuron box. First three columns: vocal tract organ which performs the closing gesture (labial, apical, dorsal); last two columns: back–front value (fourth column) and low–high value (fifth column) of the vowel within the CV-sequence. Link weight values are given for three auditory parameters: bark-scaled F1, F2, and F3 (formant transitions within each neuron box). The outlined boxes indicate the association of neurons with consonant phoneme categories /b/, /d/, and /g/; each of these three regions comprises the appropriate consonant in all vocalic contexts. These associations are established during imitation training (see text).
After babbling training the neurocomputational model is capable of reproducing (or imitating) the motor plan state (i.e. the articulation) of any pre-linguistic speech item – in our case of any proto-vowel, proto-CV-syllable and proto-VC-syllable (with C = proto-consonantal closing gestures) – from their acoustic (or auditory) state patterns. Thus the neurocomputational model is now ready for language-specific imitation training. For imitation training the training sets comprise language-specific speech items; in our case vocalic and syllabic speech items. Besides the adjustment of link weights of the mapping between the phonetic map and the sensory maps and of the mapping between the phonetic map and the motor plan map, which is mainly done during babbling training, now in addition the link weights of the mapping between the phonetic map and the phonemic map are adjusted. Language-specific imitation training results in (i) specifying regions of typical phoneme realizations (phone regions) within the phonetic map, i.e. in specifying regions of neurons within the phonetic map which represent typical realizations of a phoneme or of a syllable phoneme chain (see Figs. 6 and 7), and in (ii) fine-tuning of the sensorimotor link weights already trained during babbling. This fine-tuning mainly occurs at the phone regions. Thus the knowledge which is gained during imitation is language dependent. In other words, during this training stage the neurocomputational model mainly learns to link neurons which represent different phonemes or phonemic descriptions of syllables with the motor plan states and with the sensory states of their appropriate typical realizations. In parallel to babbling training, imitation training can also be subdivided into training procedures for vowels, CV-, and VC-syllables.

The vowel imitation training set comprises a set of 100 acoustic vowel realizations per phoneme for a typical five vowel phoneme system /i/, /e/, /a/, /o/, and /u/ (e.g. Bradlow, 1995 and Cervera et al., 2001). A three-dimensional Gaussian distribution was chosen for each phoneme for distributing the 100 realizations per phoneme over the F1–F2–F3-space (Fig. 8 for the F1–F2-space). The distribution of the phoneme realizations in the acoustic vowel space (F1–F2–F3-space) is chosen as realistically as possible. The acoustic vowel realizations within the acoustic vowel space slightly overlap. These 500 vowel realizations are supposed to be realizations given by different external speakers, but matched with respect to the model's babbling vowel space. It should be noted that vowel phonemes normally are learned in the context of words during speech acquisition. This is replaced in this model by training of isolated vowels for reasons of simplicity. More complex training scenarios are beyond the scope of this paper.
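A training set with these properties is easy to generate. The sketch below draws 100 formant triples per phoneme from a three-dimensional Gaussian distribution; the mean formant values and standard deviations are rough illustrative numbers for a five-vowel system, not the values used in the study.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative mean formant values (F1, F2, F3) in Hz for /i/, /e/, /a/, /o/, /u/.
vowel_means = {
    "i": (280, 2250, 2900), "e": (400, 2000, 2600), "a": (750, 1300, 2500),
    "o": (450, 850, 2400),  "u": (320, 750, 2300),
}
sigma = np.array([40.0, 120.0, 150.0])   # assumed per-formant standard deviations

def make_vowel_training_set(n_per_phoneme=100):
    """Return (formants, labels): 100 Gaussian-distributed realizations per phoneme."""
    formants, labels = [], []
    for phoneme, mean in vowel_means.items():
        formants.append(rng.normal(mean, sigma, size=(n_per_phoneme, 3)))
        labels += [phoneme] * n_per_phoneme
    return np.vstack(formants), labels

X, y = make_vowel_training_set()
print(X.shape)    # (500, 3) -> 500 vowel realizations in F1-F2-F3 space
```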
During vowel imitation training each external acoustic (or auditory) vowel item is processed by the proto-vocalic
The production pathway (phonemic map → phonetic map → motor plan map → primary motor map → articulation) has been introduced in Section 2. The speech items which were trained in this study can be labeled as frequent syllables. The description of the processing of infrequent syllables is beyond the scope of this paper. Our training results given above indicate strong neural connections from a phonemic state within the phonemic map to a set of neurons within the phonetic map. Each of these sets of neurons within the phonetic map represents a region of phoneme realizations (phone regions) and thus represents production variability, since neighboring neurons within the phonetic map represent slightly different motor and sensory states (for natural variation in vowel realizations, see Perkell et al., 1993). If a phonemic speech item is activated (phonemic map), this leads to an activation of several neurons within the phonetic map (see the outlined boxes or phone regions for example for the vocalic phonetic map; Fig. 6). Thus in our model the maximally activated neuron within the phonetic map can differ from realization to realization. Therefore the motor plan and the subsequent articulatory realization of a phonemic item are allowed to vary within a perceptually acceptable region. These regions for phonemic items are the phoneme realization regions or phone regions; they are language-specific and are defined during imitation training (see Figs. 6 and 7).

Furthermore coarticulation is introduced in our neurocomputational model. Two sources of coarticulation are implemented in our model. Firstly, coarticulation results from the fact that the exact coordination of articulators for executing a speech gesture is controlled by the motor execution module and that a speech gesture is not encoded in all details on the motor plan level. That leads to variability in gesture execution with respect to context. For example the realization of /b/ in /ibi/ or /aba/ is different in our model. In /aba/ the lower jaw is more involved in the execution of the labial closing gesture than in /ibi/ because of the wide mouth opening occurring in /a/ in comparison to /i/. Because of this wide mouth opening in /a/ it would be ineffective to execute the closing gesture in /aba/ just by using the lips. It is more effective to add a synergetic elevation of the lower jaw. Thus, the lower jaw elevation and the lower lip elevation form a labial closing gesture in a synergetic way. Secondly, coarticulation results from the fact that gesture specifications can vary even on the level of the motor plan. For example lip protrusion is allowed to vary for a consonantal labial closing gesture since lip protrusion is a non-relevant phonemic feature in the case of a labial closing gesture in our target language. Since the labial closing gesture within a CV-syllable temporally overlaps with the following vocalic gesture (e.g. for a gesture for realizing an /i/ or /u/), our simulations show anticipatory lip protrusion on the motor execution level in /pu/ while lips are not protruded during the labial closure in /pi/.

In the case of language-specific perception of speech items it can easily be shown that the neurocomputational model trained thus far for vowels and simple CV- and VC-syllables is capable of producing categorical perception for vowels and in an even stronger way for consonants (i.e. voiced plosives in the case of our model). The auditory pathway for perception of external speech items (auditory receptors → auditory map → phonetic map → phonemic map) has already been introduced in Section 2 (auditory-to-motor pathway, see Hickok and Poeppel, 2000, 2004). Thus the phonetic map is not only a central neural representation in speech production but also in speech perception, at least for sublexical speech units like speech sounds and syllables. In order to show that the current neurocomputational production–perception model perceives vowels (for the five vowel system /i/, /e/, /a/, /o/, and /u/) and consonants (for the voiced plosives /b/, /d/, and /g/) in a speech-like categorical way, speech identification and discrimination experiments were carried out using the model. In order to be able to perform these experiments using the model, 20 different instances of the model were trained using (i) different sets of training data due to different randomization procedures for determining the vocalic items within all training sets, using (ii) a different ordering of training stimuli during each training stage, and using (iii) different sets of randomly generated initial link weight values for each of the 20 instances. The resulting 20 instances of the model are called virtual listeners.

Identification of an external acoustic stimulus is performed in our model by a virtual listener by identifying the most excited neuron within the phonemic map. Discrimination of two external acoustic stimuli is performed in our model by calculating the most activated neuron on the level of the phonetic map for each acoustic stimulus and subsequently by calculating the city block distance between these two neurons for each virtual listener. The phonetotopic ordering of speech items on the level of the phonetic map (see above) is a first hint that distance between speech items (states) on the level of this map indicates phonetic similarity or dissimilarity. Moreover we assume that the sensory resolution of two states (i.e. the capability for discrimination between these states) is governed by the spatial distance of these two states on the level of the phonetic map. This assumption holds for tonotopic ordering and thus for F0-discrimination of auditory stimuli (see the discussion of tonotopic cortical maps, Kandel et al., 2000, p. 609) and this assumption also holds for somatotopic ordering and thus for the spatial discrimination of tactile stimuli (see the discussion of somatotopic maps, Kandel et al., 2000, p. 460ff). Consequently it can be hypothesized that two stimuli can be discriminated if the distance of the activated neurons representing the stimuli on the level of the phonetic map exceeds a certain neuron distance within this map and it can be hypothesized that discrimination becomes stronger with increasing neuron distance.
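A minimal sketch of these two decision procedures is given below. The activation arrays would in practice come from the trained model; here they are random placeholders, and the discrimination threshold is an assumed value standing in for the "certain neuron distance" mentioned above.

```python
import numpy as np

PHONEMES = ["i", "e", "a", "o", "u"]   # illustrative phonemic-map layout (vowels only)
SIDE = 15                              # edge length of the 15 x 15 phonetic map

def identify(phonemic_activation):
    """Identification: pick the phoneme of the most excited phonemic-map neuron."""
    return PHONEMES[int(np.argmax(phonemic_activation))]

def discriminate(phonetic_act_1, phonetic_act_2, threshold=2):
    """Discrimination: compare the winner neurons of two stimuli on the phonetic map
    using the city-block (Manhattan) distance between their grid positions."""
    w1, w2 = int(np.argmax(phonetic_act_1)), int(np.argmax(phonetic_act_2))
    r1, c1 = divmod(w1, SIDE)
    r2, c2 = divmod(w2, SIDE)
    distance = abs(r1 - r2) + abs(c1 - c2)
    return distance >= threshold, distance   # discriminable if the neuron distance is large enough

# Example with random placeholder activations for two stimuli:
rng = np.random.default_rng(7)
print(identify(rng.random(len(PHONEMES))))
print(discriminate(rng.random(SIDE * SIDE), rng.random(SIDE * SIDE)))
```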
Vocalic and consonantal identification and discrimination tests were performed on the basis of quasi-continuous acoustic stimulus continua (for an introduction to speech
Fig. 9. Bark-scaled formant pattern for 13 vocalic stimuli (/i/-/e/-/a/-continuum) for the vocalic perceptual identification and discrimination tests.
Fig. 10. Bark-scaled formant pattern for 13 CV-stimuli (/ba/-/da/-/ga/-continuum) for the consonantal perceptual identification and discrimination tests.
Measured discrimination scores indicate the complete discrimination of two stimuli based on all available auditory information given by these stimuli, not just the linguistic, phonemic, or categorical information needed for (categorical) identification. It can be seen from Figs. 11 and 12 that measured discrimination rates are always higher than calculated discrimination rates. That is in agreement with identification and discrimination scores extracted from identification and discrimination experiments carried out with humans and can be interpreted in the way that acoustic speech stimuli always convey categorical (linguistic) and non-categorical (para-linguistic or non-linguistic extra) information. While measured and calculated discrimination scores are nearly identical in the case of consonants, it emerges from our modeling data that measured discrimination is better than calculated discrimination especially in the case of vowels. This is in agreement with results of natural speech perception (Fry et al., 1962; Eimas, 1963) and reflects the typical differences in categorical perception of consonants and vowels.

5. Discussion and conclusions

The experimental results presented in this paper indicate that a model of speech production and perception which is shaped with respect to basic neurophysiological facts is capable of embedding important features of speech production and speech perception in a straightforward way, even if the neurocomputational modeling is relatively basic, as it is here by using simple standard self-organizing networks. Typical and therefore important features of speech production and perception like production variability of phoneme realizations and categorical speech perception, and especially the fact of different degrees of categorical perception for consonants and vowels, occur in a straightforward way in this production–perception model. Since human speech production and perception easily outperforms speech synthesis and speech recognition systems at least in difficult conditions, it could be useful to include human-like speech processing routines into such technical speech processing systems. This may help to increase the quality and the level of performance of technical speech processing systems.

Furthermore this modeling study indicates the close relationship of speech production and speech perception. Speech perception theories such as the motor theory of speech perception (Liberman et al., 1967; Liberman and Mattingly, 1985) or the direct-realist theory (Fowler, 1986) have already postulated this close relationship. And recent experimental results provide support for this claim and suggest that the development of an integrative model of speech production and perception is highly desirable. For example perceptual feedback loops (also called self-monitoring processes) are known to activate parts of the speech perception mechanism during overt (external perceptual loop) as well as covert speech production (internal perceptual loop, cf. Indefrey and Levelt, 2004; Postma, 2000; Hartsuiker and Kolk, 2001). In addition imaging studies focusing on speech perception have demonstrated that perception is capable of activating parts of the speech production cortical networks (Fadiga et al., 2002; Wilson et al., 2004; Hickok and Poeppel, 2004, 2007).

Bidirectional mappings between phonemic and phonetic and between sensory and phonetic maps are introduced in our neural model in order to illustrate the close relationship between production and perception. The introduction of these bidirectional mappings is the basis for important features of the model like categorical perception. Physiologically a bidirectional mapping comprises two related unidirectional mappings since neurons always forward their firing pulses in one direction (Kandel et al., 2000). Thus physiologically bidirectional mappings are represented by two neural paths connecting the maps in both directions (see the separate arrows in Fig. 1). The phonetic map – which forms the central map for all bidirectional mappings in our model (see Fig. 1) – can be interpreted as the central part of the mental syllabary (Levelt and Wheeldon, 1994; Levelt et al., 1999). Neural cortico-cortical connections exist in both directions between this part of the frontal cortex and the sensory areas as well as between this part of the frontal cortex and those temporal regions which process phonemic information (Kandel et al., 2000).

Other computer-implemented models of speech production (Bailly, 1997; Guenther, 1994, 1995, 2006; Guenther et al., 2006) as well as the model introduced here reflect the relationship between perception and production by incorporating perceptual feedback control loops or by incorporating production–perception pathways for self-monitoring processes (Indefrey and Levelt, 2004). Dual stream models of speech perception have recently been published which introduce a ventral stream for passive auditory processing and a dorsal stream activating auditory-motor networks (e.g. Hickok and Poeppel, 2004, 2007), but passive models of speech perception that do not refer to production processes can also be found (McClelland and Elman, 1986; Gaskell and Marslen-Wilson, 1997; Luce et al., 2000; Norris et al., 2006). The model introduced here reflects the close relationship between speech production and speech perception since on the one hand our model comprises basic features of speech production models (cf. Guenther et al., 2006) and since on the other hand our model is capable of incorporating in addition the dual stream idea (Hickok and Poeppel, 2007) in a straightforward way (see the labels "ventral stream" and "dorsal stream" in Fig. 1).

Mirror neurons (visual and audio–visual mirror neuron system) appear to be one of the neural systems that are involved in the association of production and perception processes (Rizzolatti and Arbib, 1998; Studdert-Kennedy, 2002; Kohler et al., 2002; Fadiga and Craighero, 2004; Rizzolatti and Craighero, 2004; Wilson et al., 2004; Iacoboni, 2005; Wilson and Knoblich, 2005; Arbib, 2005). Systems of mirror neurons have been detected which code the abstract
meaning of goal-directed actions (e.g. grasping) and which are capable of co-activating motor and sensory (visual and audio–visual) representations of these actions by neural cortico-cortical associations. These visual and audio–visual mirror neuron systems also co-activate abstract concepts (preferably for action words) and thus are capable of associating higher order linguistic representations with goal-directed actions. A speech mirror neuron system ("mirror resonant system" after Fadiga and Craighero, 2004, p. 167; "auditory mirror neuron system" or "echo neurons" after Rizzolatti and Craighero, 2004, p. 185f) is postulated which is newer from the viewpoint of evolution in comparison to the mirror neuron system introduced above and which is directly linked with the capacity of humans to learn speech items by imitation. It can be assumed that this speech mirror neuron system co-activates in parallel motor representations, sensory representations, and phonemic representations of speech items. Given that from a phonetic viewpoint speech items also are built up by goal-directed actions (called speech gestures), which build up the motor plans for speech items in our model (see Section 2), it can be hypothesized that a mirror neuron layer also exists for the association of motor, sensory, and phonemic representations of speech gestures (see also Westerman and Miranda, 2004).

Self-organization is a central principle of learning and self-organizing maps are used for modeling cortical networks (Kohonen, 2001). Within our neurocomputational model artificial self-organizing neural networks are implemented since self-organizing neural networks are biologically plausible and have been used successfully for modeling semantic lexical networks (Ritter and Kohonen, 1989), for modeling semantic and phonological aspects during early lexical development (Li et al., 2004), and for modeling the generation and recognition of goal-directed movements (Bullock et al., 1993; Tani et al., 2004). A further argument for using self-organizing maps is their success in modeling the mapping between phonemic and phonetic aspects of speech production as demonstrated by the learning experiments for vowels and syllables described in this study.

In our current model different submaps are used for different classes of speech items (V, CV, VC) and separate training procedures were introduced for training these classes of speech items. This separation of the phonetic map into submaps as well as the separation of training procedures for different speech items was done in order to simplify the modeling of the speech acquisition procedure for these three classes of speech items from the computational viewpoint. But in principle all types of speech items (i.e. all types of syllables and words or word components) can be trained simultaneously by introducing just one comprehensive learning task and by using one single phonetic map. Recent preliminary experiments indicate that a comprehensive single phonetic map shapes different subregions representing different classes of speech items. The ordering of speech items within these subregions is similar to the phonetotopic ordering presented in this paper for the different submaps discussed here.

It is unclear whether the training sets used here constitute a representative natural model of babbling and imitation training during early stages of human speech acquisition. Our training sets comprise a widespread set of vocalic vocal tract positions and a widespread set of opening and closing movements. At least these sets comprise all vocal tract positions and all opening and closing movements which are physiologically possible. But it is conceivable that toddlers very quickly reduce their set of training items from all physiologically possible positions and movements towards a subset of positions and movements which are especially important for speech.

It should be noted that our neural modeling approach does not include modeling of temporal aspects of neural functioning. Rather the temporal aspects of production and perception are included in the speech items and thus in the sensory, motor, phonetic, and phonemic states. In our production–perception model sensory and motor states of vowels and syllables are processed as a whole. Our modeling approach thus is sufficient as long as only a description of the training and processing of syllables is wanted. In contrast a detailed temporal organization becomes important if speech items comprise more than one syllable. In this case processing delays must be introduced for all pathways postulated in the model (cf. Guenther et al., 2006) and temporal aspects of neural activity need to be considered (cf. Maass and Schmitt, 1999).

The two training stages identified by our modeling study distinguish between babbling (i.e. the build-up stage for sensorimotor representations of pre-linguistic proto-vocalic and proto-consonantal speech gestures) and imitation (i.e. the build-up stage for language-specific perceptual, motor, phonetic, and phonemic representations of speech items). A closer modeling of early stages of speech acquisition (Oller et al., 1999) is beyond the scope of this paper. Furthermore, in reality the two training stages introduced here overlap in time. This is partially realized in our approach, since babbling and imitation training items are applied in parallel during the imitation training stage after a short babbling training stage.

The next important step would be to introduce processes for building up the mental lexicon and for modeling the process of word segmentation and identification (cf. Batchelder, 2002; Werker and Yeung, 2005; Jusczyk, 1999; Brent, 1999). The representation of the mental lexicon of the target language is very important for including top-down processes of speech perception and thus for speech recognition. However, consideration of these processes currently goes beyond the scope of the current implementation of our model. But in general the model is open for integrating a mental lexicon.

Last but not least it has to be stated that the neurocomputational production–perception model developed thus far by no means is an alternative solution for high-performance speech recognition or speech synthesis systems. At
present the model described here is capable of producing and perceiving simple CV- and VC-syllables under ideal conditions. Concerning a further development of the model introduced here two different strategies are imaginable. On the one hand, this model can be further developed in order to handle more complex classes of speech items (words, sentences, or a whole discourse) under ideal and non-ideal conditions (e.g. different speakers, different emotional states, external noise). On the other hand, the organization of the neurocomputational model outlined in this paper could be integrated at least partially into the architecture of current or new speech recognition and speech synthesis systems.

Acknowledgement

This work was supported in part by the German Research Council Grant No. KR 1439/13-1.

References

Ackermann, H., Riecker, A., 2003. The contribution of the insula to motor aspects of speech production: a review and a hypothesis. Brain Lang. 89, 320–328.
Arbib, M.A., 2005. From monkey-like action recognition to human language: an evolutionary framework for neurolinguists. Behav. Brain Sci. 28, 105–167.
Bailly, G., 1997. Learning to speak: sensory-motor control of speech movements. Speech Comm. 22, 251–267.
Batchelder, E.O., 2002. Bootstrapping the lexicon: a computational model of infant speech segmentation. Cognition 83, 167–206.
Benson, R.R., Whalen, D.H., Richardson, M., Swainson, B., Clark, V.P., Lai, S., Liberman, A.M., 2001. Parametrically dissociating speech and nonspeech perception in the brain using fMRI. Brain Lang. 78, 364–396.
Benzeghiba, M., De Mori, R., Deroo, O., Dupont, S., Erbes, T., Jouvet, D., Fissore, L., Laface, P., Mertins, A., Ris, C., Rose, R., Tyagi, V., Wellekens, C., 2007. Automatic speech recognition and speech variability: a review. Speech Comm. 49, 763–786.
Binder, J.R., Frost, J.A., Hammeke, T.A., Bellgowan, P.S.F., Springer, J.A., Kaufman, J.N., Possing, E.T., 2000. Human temporal lobe activation by speech and nonspeech sounds. Cereb. Cortex 10, 512–528.
Birkholz, P., Jackèl, D., 2004. Influence of temporal discretization schemes on formant frequencies and bandwidths in time domain simulations of the vocal tract system. In: Proc. Internat. Conf. on Speech and Language Processing (Interspeech 2004, Jeju, Korea), pp. 1125–1128.
Birkholz, P., Kröger, B.J., 2006. Vocal tract model adaptation using magnetic resonance imaging. In: Proc. 7th Internat. Seminar on Speech Production (Belo Horizonte, Brazil), pp. 493–500.
Birkholz, P., Kröger, B.J., 2007. Simulation of vocal tract growth for articulatory speech synthesis. In: Proc. 16th Internat. Congress of Phonetic Sciences (Saarbrücken, Germany), pp. 377–380.
Birkholz, P., Jackèl, D., Kröger, B.J., 2006. Construction and control of a three-dimensional vocal tract model. In: Proc. Internat. Conf. on Acoustics, Speech, and Signal Processing (ICASSP 2006, Toulouse, France), pp. 873–876.
Birkholz, P., Jackèl, D., Kröger, B.J., 2007. Simulation of losses due to turbulence in the time-varying vocal system. IEEE Trans. Audio Speech Lang. Process. 15, 1218–1225.
Blank, S.C., Scott, S.K., Murphy, K., Warburton, E., Wise, R.J.S., 2002. Speech production: Wernicke, Broca and beyond. Brain 125, 1829–1838.
Boatman, D., 2004. Cortical bases of speech perception: evidence from functional lesion studies. Cognition 92, 47–65.
Bookheimer, S.Y., Zeffiro, T.A., Blaxton, T.A., Gaillard, W., Theodore, W.H., 2000. Activation of language cortex with automatic speech tasks. Neurology 55, 1151–1157.
Bradlow, A.R., 1995. A comparative acoustic study of English and Spanish vowels. J. Acoust. Soc. Amer. 97, 1916–1924.
Brent, M.B., 1999. Speech segmentation and word discovery: a computational perspective. Trends Cognit. Sci. 3, 294–301.
Browman, C., Goldstein, L., 1989. Articulatory gestures as phonological units. Phonology 6, 201–251.
Browman, C., Goldstein, L., 1992. Articulatory phonology: an overview. Phonetica 49, 155–180.
Bullock, D., Grossberg, S., Guenther, F., 1993. A self-organizing neural model of motor equivalent reaching and tool use by a multijoint arm. J. Cognit. Neurosci. 5, 408–435.
Callan, D.E., Tsytsarev, V., Hanakawa, T., Callan, A.K., Katsuhara, M., Fukuyama, H., Turner, R., 2006. Song and speech: brain regions involved with perception and covert production. Neuroimage 31, 1327–1342.
Cervera, T., Miralles, J.L., Gonzales-Alvarez, J., 2001. Acoustical analysis of Spanish vowels produced by laryngectomized subjects. J. Speech Lang. Hear. Res. 44, 988–996.
Clark, R.A.J., Richmond, K., King, S., 2007. Multisyn: open-domain unit selection for the Festival speech synthesis system. Speech Comm. 49, 317–330.
Damper, R.I., Harnad, S.R., 2000. Neural network models of categorical perception. Percept. Psychophys. 62, 843–867.
Dell, G.S., Chang, F., Griffin, Z.M., 1999. Connectionist models of language production: lexical access and grammatical encoding. Cognit. Sci. 23, 517–541.
Eimas, P.D., 1963. The relation between identification and discrimination along speech and non-speech continua. Lang. Speech 6, 206–217.
Fadiga, L., Craighero, L., 2004. Electrophysiology of action representation. J. Clin. Neurophysiol. 21, 157–168.
Fadiga, L., Craighero, L., Buccino, G., Rizzolatti, G., 2002. Speech listening specifically modulates the excitability of tongue muscles: a TMS study. Eur. J. Neurosci. 15, 399–402.
Fowler, C.A., 1986. An event approach to the study of speech perception from a direct-realist perspective. J. Phonetics 14, 3–28.
Fry, D.B., Abramson, A.S., Eimas, P.D., Liberman, A.M., 1962. The identification and discrimination of synthetic vowels. Lang. Speech 5, 171–189.
Gaskell, M.G., Marslen-Wilson, W.D., 1997. Integrating form and meaning: a distributed model of speech perception. Lang. Cognit. Process. 12, 613–656.
Goldstein, L., Byrd, D., Saltzman, E., 2006. The role of vocal tract action units in understanding the evolution of phonology. In: Arbib, M.A. (Ed.), Action to Language via the Mirror Neuron System. Cambridge University Press, Cambridge, pp. 215–249.
Goldstein, L., Pouplier, M., Chen, L., Saltzman, E., Byrd, D., 2007. Dynamic action units slip in speech production errors. Cognition 103, 386–412.
Grossberg, S., 2003. Resonant neural dynamics of speech perception. J. Phonetics 31, 423–445.
Guenther, F.H., 1994. A neural network model of speech acquisition and motor equivalent speech production. Biological Cybernet. 72, 43–53.
Guenther, F.H., 1995. Speech sound acquisition, coarticulation, and rate effects in a neural model of speech production. Psychol. Rev. 102, 594–621.
Guenther, F.H., 2006. Cortical interaction underlying the production of speech sounds. J. Comm. Disorders 39, 350–365.
Guenther, F.H., Ghosh, S.S., Tourville, J.A., 2006. Neural modeling and imaging of the cortical interactions underlying syllable production. Brain Lang. 96, 280–301.
Hartsuiker, R.J., Kolk, H.H.J., 2001. Error monitoring in speech production: a computational test of the perceptual loop theory. Cognit. Psychol. 42, 113–157.
Heim, S., Opitz, B., Müller, K., Friederici, A.D., 2003. Phonological processing during language production: fMRI evidence for a shared production-comprehension network. Cognit. Brain Res. 16, 285–296.
Hickok, G., Poeppel, D., 2000. Towards a functional neuroanatomy of speech perception. Trends Cognit. Sci. 4, 131–138.
Hickok, G., Poeppel, D., 2004. Dorsal and ventral streams: a framework for understanding aspects of the functional anatomy of language. Cognition 92, 67–99.
Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nature Rev. Neurosci. 8, 393–402.
Hillis, A.E., Work, M., Barker, P.B., Jacobs, M.A., Breese, E.L., Maurer, K., 2004. Re-examining the brain regions crucial for orchestrating speech articulation. Brain 127, 1479–1487.
Huang, J., Carr, T.H., Cao, Y., 2001. Comparing cortical activations for silent and overt speech using event-related fMRI. Hum. Brain Mapp. 15, 39–53.
Iacoboni, M., 2005. Neural mechanisms of imitation. Curr. Opin. Neurobiol. 15, 632–637.
Indefrey, P., Levelt, W.J.M., 2004. The spatial and temporal signatures of word production components. Cognition 92, 101–144.
Ito, T., Gomi, H., Honda, M., 2004. Dynamical simulation of speech cooperative articulation by muscle linkages. Biological Cybernet. 91, 275–282.
Jardri, R., Pins, D., Bubrovszky, M., Despretz, P., Pruvo, J.P., Steinling, M., Thomas, P., 2007. Self awareness and speech processing: an fMRI study. Neuroimage 35, 1645–1653.
Jusczyk, P.W., 1999. How infants begin to extract words from speech. Trends Cognit. Sci. 3, 323–328.
Kandel, E.R., Schwartz, J.H., Jessell, T.M., 2000. Principles of Neural Science, fourth ed. McGraw-Hill, New York.
Kemeny, S., Ye, F.Q., Birn, R., Braun, A.R., 2005. Comparison of continuous overt speech fMRI using BOLD and arterial spin labeling. Hum. Brain Mapp. 24, 173–183.
Kohler, E., Keysers, C., Umilta, M.A., Fogassi, L., Gallese, V., Rizzolatti, G., 2002. Hearing sounds, understanding actions: action representation in mirror neurons. Science 297, 846–848.
Kohonen, T., 2001. Self-Organizing Maps. Springer, Berlin, New York.
Kröger, B.J., 1993. A gestural production model and its application to reduction in German. Phonetica 50, 213–233.
Kröger, B.J., Birkholz, P., 2007. A gesture-based concept for speech movement control in articulatory speech synthesis. In: Esposito, A., Faundez-Zanuy, M., Keller, E., Marinaro, M. (Eds.), Verbal and Nonverbal Communication Behaviours, LNAI 4775. Springer-Verlag, Berlin, Heidelberg, pp. 174–189.
Kröger, B.J., Birkholz, P., Kannampuzha, J., Neuschaefer-Rube, C., 2006a. Modeling sensory-to-motor mappings using neural nets and a 3D articulatory speech synthesizer. In: Proc. 9th Internat. Conf. on Spoken Language Processing (Interspeech 2006 – ICSLP), pp. 565–568.
Kröger, B.J., Birkholz, P., Kannampuzha, J., Neuschaefer-Rube, C., 2006b. Learning to associate speech-like sensory and motor states during babbling. In: Proc. 7th Internat. Seminar on Speech Production (Belo Horizonte, Brazil), pp. 67–74.
Kröger, B.J., Birkholz, P., Kannampuzha, J., Neuschaefer-Rube, C., 2006c. Spatial-to-joint mapping in a neural model of speech production. In: DAGA-Proc. 32nd Annu. Meet. German Acoustical Society (Braunschweig, Germany), pp. 561–562. <http://www.speechtrainer.eu>.
Kröger, B.J., Lowit, A., Schnitker, R., 2008. The organization of a neurocomputational control model for articulatory speech synthesis. In: Esposito, A., Bourbakis, N., Avouris, N., Hatzilygeroudis, I. (Eds.), Verbal and Nonverbal Features of Human–Human and Human–Machine Interaction. Selected papers from COST Action 2102 International Workshop. Springer-Verlag, pp. 121–135.
Kuriki, S., Mori, T., Hirata, Y., 1999. Motor planning center for speech articulation in the normal human brain. Neuroreport 10, 765–769.
Latorre, J., Iwano, K., Furui, S., 2006. New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer. Speech Comm. 48, 1227–1242.
Levelt, W.J.M., Wheeldon, L., 1994. Do speakers have access to a mental syllabary? Cognition 50, 239–269.
Levelt, W.J.M., Roelofs, A., Meyer, A., 1999. A theory of lexical access in speech production. Behav. Brain Sci. 22, 1–75.
Liberman, A.M., Mattingly, I.G., 1985. The motor theory of speech perception revised. Cognition 21, 1–36.
Liberman, A.M., Harris, K.S., Hoffman, H.S., Griffith, B.C., 1957. The discrimination of speech sounds within and across phoneme boundaries. J. Exp. Psychol. 54, 358–368.
Liberman, A.M., Cooper, F.S., Shankweiler, D.P., Studdert-Kennedy, M., 1967. Perception of the speech code. Psychol. Rev. 74, 431–461.
Liebenthal, E., Binder, J.R., Spitzer, S.M., Possing, E.T., Medler, D.A., 2005. Neural substrates of phonemic perception. Cereb. Cortex 15, 1621–1631.
Li, P., Farkas, I., MacWhinney, B., 2004. Early lexical development in a self-organizing neural network. Neural Networks 17, 1345–1362.
Luce, P.A., Goldinger, S.D., Auer, E.T., Vitevitch, M.S., 2000. Phonetic priming, neighborhood activation, and PARSYN. Percept. Psychophys. 62, 615–625.
Maass, W., Schmitt, M., 1999. On the complexity of learning for spiking neurons with temporal coding. Inform. Comput. 153, 26–46.
McClelland, J.L., Elman, J.L., 1986. The TRACE model of speech perception. Cognit. Psychol. 18, 1–86.
Murphy, K., Corfield, D.R., Guz, A., Fink, G.R., Wise, R.J.S., Harrison, J., Adams, L., 1997. Cerebral areas associated with motor control of speech in humans. J. Appl. Physiol. 83, 1438–1447.
Nasir, S.M., Ostry, D.J., 2006. Somatosensory precision in speech production. Curr. Biology 16, 1918–1923.
Norris, D., Cutler, A., McQueen, J.M., Butterfield, S., 2006. Phonological and conceptual activation in speech comprehension. Cognit. Psychol. 53, 146–193.
Obleser, J., Boecker, H., Drzezga, A., Haslinger, B., Hennenlotter, A., Roettinger, M., Eulitz, C., Rauschecker, J.P., 2006. Vowel sound extraction in anterior superior temporal cortex. Hum. Brain Mapp. 27, 562–571.
Obleser, J., Wise, R.J.S., Dresner, M.A., Scott, S.K., 2007. Functional integration across brain regions improves speech perception under adverse listening conditions. J. Neurosci. 27, 2283–2289.
Okada, K., Hickok, G., 2006. Left posterior auditory-related cortices participate both in speech perception and speech production: neural overlap revealed by fMRI. Brain Lang. 98, 112–117.
Oller, D.K., Eilers, R.E., Neal, A.R., Schwartz, H.K., 1999. Precursors to speech in infancy: the prediction of speech and language disorders. J. Comm. Disorders 32, 223–245.
Perkell, J.S., Matthies, M.L., Svirsky, M.A., Jordan, M.I., 1993. Trading relations between tongue-body raising and lip rounding in production of the vowel /u/: a pilot "motor equivalence" study. J. Acoust. Soc. Amer. 93, 2948–2961.
Poeppel, D., Guillemin, A., Thompson, J., Fritz, J., Bavelier, D., Braun, A.R., 2004. Auditory lexical decision, categorical perception, and FM direction discrimination differentially engage left and right auditory cortex. Neuropsychologia 42, 183–200.
Postma, A., 2000. Detection of errors during speech production: a review of speech monitoring models. Cognition 77, 97–131.
Raphael, L.J., Borden, G.J., Harris, K.S., 2007. Speech Science Primer: Physiology, Acoustics, and Perception of Speech, fifth ed. Lippincott Williams & Wilkins.
Riecker, A., Kassubek, J., Gröschel, K., Grodd, W., Ackermann, H., 2006. The cerebral control of speech tempo: opposite relationship between speaking rate and BOLD signal change at striatal and cerebellar structures. Neuroimage 29, 46–53.
Rimol, L.M., Specht, K., Weis, S., Savoy, R., Hugdahl, K., 2005. Processing of sub-syllabic speech units in the posterior temporal lobe: an fMRI study. Neuroimage 26, 1059–1067.
Ritter, H., Kohonen, T., 1989. Self-organizing semantic maps. Biological Cybernet. 61, 241–254.
Rizzolatti, G., Arbib, M.A., 1998. Language within our grasp. Trends Neurosci. 21, 188–194.
Rizzolatti, G., Craighero, L., 2004. The mirror neuron system. Annu. Rev. Neurosci. 27, 169–192.
Rosen, H.J., Ojemann, J.G., Ollinger, J.M., Petersen, S.E., 2000. Comparison of brain activation during word retrieval done silently and aloud using fMRI. Brain Cognit. 42, 201–217.
Saltzman, E., 1979. Levels of sensorimotor representation. J. Math. Psychol. 20, 91–163.
Saltzman, E., Byrd, D., 2000. Task-dynamics of gestural timing: phase windows and multifrequency rhythms. Hum. Movement Sci. 19, 499–526.
Saltzman, E., Munhall, K.G., 1989. A dynamic approach to gestural patterning in speech production. Ecol. Psychol. 1, 333–382.
Sanguineti, V., Laboissiere, R., Payan, Y., 1997. A control model of human tongue movements in speech. Biological Cybernet. 77, 11–22.
Scharenborg, O., 2007. Reaching over the gap: a review of efforts to link human and automatic speech recognition research. Speech Comm. 49, 336–347.
Scott, S.K., Blank, C.C., Rosen, S., Wise, R.J.S., 2000. Identification of a pathway for intelligible speech in the left temporal lobe. Brain 123, 2400–2406.
Shadmehr, R., Mussa-Ivaldi, A., 1994. Adaptive representation of dynamics during learning of a motor task. J. Neurosci. 14, 3208–3224.
Shuster, L.I., Lemieux, S.K., 2005. An fMRI investigation of covertly and overtly produced mono- and multisyllabic words. Brain Lang. 93, 20–31.
Sober, S.J., Sabes, P.N., 2003. Multisensory integration during motor planning. J. Neurosci. 23, 6982–6992.
Sörös, R., Guttman Sakoloff, L., Bose, A., McIntosh, A.R., Graham, S.J., Stuss, D.T., 2006. Clustered functional MRI of overt speech production. Neuroimage 32, 376–387.
Studdert-Kennedy, M., 2002. Mirror neurons, vocal imitation, and the evolution of particulate speech. In: Stamenov, M.I., Gallese, V. (Eds.), Mirror Neurons and the Evolution of Brain and Language. Benjamin, Philadelphia, pp. 207–227.
Tani, J., Masato, I., Sugita, Y., 2004. Self-organization of distributed represented multiple behavior schemata in a mirror system: reviews of robot experiments using RNNPB. Neural Networks 17, 1273–1289.
Todorov, E., 2004. Optimality principles in sensorimotor control. Nature Neurosci. 7, 907–915.
Tremblay, S., Shiller, D.M., Ostry, D.J., 2003. Somatosensory basis of speech production. Nature 423, 866–869.
Ullman, M.T., 2001. A neurocognitive perspective on language: the declarative/procedural model. Nature Rev. Neurosci. 2, 717–726.
Uppenkamp, S., Johnsrude, I.S., Norris, D., Marslen-Wilson, W., Patterson, R.D., 2006. Locating the initial stages of speech-sound processing in human temporal cortex. Neuroimage 31, 1284–1296.
Vanlancker-Sidtis, D., McIntosh, A.R., Grafton, S., 2003. PET activation studies comparing two speech tasks widely used in surgical mapping. Brain Lang. 85, 245–261.
Varley, R., Whiteside, S., 2001. What is the underlying impairment in acquired apraxia of speech? Aphasiology 15, 39–49.
Werker, J.F., Yeung, H.H., 2005. Infant speech perception bootstraps word learning. Trends Cognit. Sci. 9, 519–527.
Westerman, G., Miranda, E.R., 2004. A new model of sensorimotor coupling in the development of speech. Brain Lang. 89, 393–400.
Wilson, M., Knoblich, G., 2005. The case for motor involvement in perceiving conspecifics. Psychol. Bull. 131, 460–473.
Wilson, S.M., Saygin, A.P., Sereno, M.I., Iacoboni, M., 2004. Listening to speech activates motor areas involved in speech production. Nature Neurosci. 7, 701–702.
Wise, R.J.S., Greene, J., Büchel, C., Scott, S.K., 1999. Brain regions involved in articulation. The Lancet 353, 1057–1061.
Zekveld, A., Heslenfeld, D.J., Festen, J.M., Schoonhoven, R., 2006. Top-down and bottom-up processes in speech comprehension. Neuroimage 32, 1826–1836.
Zell, A., 2003. Simulation neuronaler Netze. Oldenbourg Verlag, München, Wien.