
A Fuzzy Decision Tree-Based Duration Model For Standard Yorùbá Text-To-Speech Synthesis

This paper presents a fuzzy decision tree based duration model for Standard Yoruba text-to-speech synthesis. The model aims to accurately time syllables to align with the intonation contour. It applies fuzzy decision trees to compute syllable durations and compares results to a classification and regression tree based model. Evaluation shows the fuzzy decision tree better extrapolates to new data and produces more natural sounding speech.


Computer Speech and Language 21 (2007) 325–349
www.elsevier.com/locate/csl

A fuzzy decision tree-based duration model for Standard Yorùbá text-to-speech synthesis

Ọdẹ́túnjí A. Ọdẹ́jọbí a,b,1, Shun Ha Sylvia Wong a,*, Anthony J. Beaumont a

a Computer Science, Aston University, Aston Triangle, Birmingham B4 7ET, UK
b Room 109, Computer Buildings, Computer Science and Engineering Department, Ọbáfẹ́mi Awólọ́wọ̀ University, Ilé-Ifẹ̀, Nigeria
Received 13 April 2005; received in revised form 13 June 2006; accepted 17 June 2006
Available online 10 August 2006

Abstract
In this paper, we present syllable-based duration modelling in the context of a prosody model for Standard Yorùbá (SY) text-to-speech (TTS) synthesis applications. Our prosody model is conceptualised around a modular holistic framework. This framework is implemented using the Relational Tree (R-Tree) techniques. An important feature of our R-Tree framework is its flexibility in that it facilitates the independent implementation of the different dimensions of prosody, i.e. duration, intonation, and intensity, using different techniques and their subsequent integration. We applied the Fuzzy Decision Tree (FDT) technique to model the duration dimension. In order to evaluate the effectiveness of FDT in duration modelling, we have also developed a Classification And Regression Tree (CART) based duration model using the same speech data. Each of these models was integrated into our R-Tree based prosody model.
We performed both quantitative (i.e. Root Mean Square Error (RMSE) and Correlation (Corr)) and qualitative (i.e. intelligibility and naturalness) evaluations on the two duration models. The results show that CART models the training data more accurately than FDT. The FDT model, however, shows a better ability to extrapolate from the training data since it achieved a better accuracy for the test data set. Our qualitative evaluation results show that our FDT model produces synthesised speech that is perceived to be more natural than that of our CART model. In addition, we also observed that the expressiveness of FDT is much better than that of CART. That is because the representation in FDT is not restricted to a set of piece-wise or discrete constant approximations. We, therefore, conclude that the FDT approach is a practical approach for duration modelling in SY TTS applications.
© 2006 Elsevier Ltd. All rights reserved.

* Corresponding author. Tel.: +44 121 204 3473; fax: +44 121 204 3681.
E-mail addresses: oodejobi@yahoo.com (Ọ.A. Ọdẹ́jọbí), s.h.s.wong@aston.ac.uk (S.H.S. Wong), a.j.beaumont@aston.ac.uk (A.J. Beaumont).
1 Supported by the Commonwealth Scholarship Commission, UK.
0885-2308/$ - see front matter © 2006 Elsevier Ltd. All rights reserved.
doi:10.1016/j.csl.2006.06.005

1. Introduction
The main problem with modern text-to-speech (TTS) synthesis systems is the poor quality of the generated speech sound. This poor quality results from the inability of traditional TTS systems to account for speech prosody at an acceptable level of accuracy when compared to what is obtained in natural speech. Prosody is an inherent super-segmental feature of human speech (Shen et al., 1993; Minematsu et al., 2003) which has generated a lot of research interest in TTS synthesis. That is because natural speech prosody comprises many dimensions and they interact in very complex ways. In addition to this complexity, information about speech prosody is not directly encoded in written text and has to be predicted.
Putatively, there are three dimensions of speech prosody: intonation, duration and intensity. The aim in TTS synthesis is to reproduce these three dimensions of speech as accurately as possible by implementing a model that predicts all prosody phenomena that are known to be perceptually relevant in each of the identified dimensions from an input text. We have previously reported on a modular holistic approach to prosody modelling in the context of TTS for tone languages (Ọdẹ́jọbí et al., 2004a). The prosody modelling framework is implemented using the R-Tree techniques (Ehrich and Foith, 1976). The construction of an R-Tree involves using algorithms based on tone phonological rules to generate an abstract structure, called the skeletal tree (S-Tree), which represents the intonation contour of an utterance. The dimensions of the perceptually significant points on the S-Tree are then computed to synthesise the prosody of the target utterance. A major attribute of our modelling paradigm is its flexibility. It enables the implementation and evaluation of various dimensions of prosody using different techniques independently. This provides a good test-bed for experimenting with various modelling techniques for each individual dimension of speech prosody with the aim of selecting the best approach. A computational model for realising the intonation dimension has already been developed and demonstrated using the Standard Yorùbá (SY) language (Ọdẹ́jọbí et al., 2004b). That model uses fuzzy logic to compute the numerical values of the peaks and valleys on the intonation contour of an utterance. In this paper, we present the modelling of the duration dimension for Standard Yorùbá TTS.
The duration modelling problem in SY differs from those in non-tone languages like English. SY is a syllable-timed language in which the syllable is the basic perceptual unit of an utterance. We have shown in a previous work that the locations of the peak and valley on the tone of an SY syllable encode the most perceptually significant points on the tone f0 curve (Ọdẹ́jọbí et al., 2004b). It is clear from the findings of research works on SY (Ladd, 2000) that the timing of such perceptually significant points is central to the intelligibility, quality and semantic interpretation of speech sound in tone languages. Therefore, the problem in SY duration modelling is not how to account for the duration of segments, but how to accurately align the f0 contour of syllables through the appropriate timing of the voiced portions of syllables with an utterance intonation contour. Within our prosody modelling framework, it is necessary to model the acoustic evolution of speech sound by accurately timing the sequence of discrete descriptions of speech sound. The model depends on the numerical data obtained from acoustic analysis of speech. The linguistic data obtained from perceptual experiments are also used to establish the sufficiency or relative potency of individual acoustic cues. Thus, our duration modelling relies on two kinds of data: (i) numerical and (ii) linguistic. The structure and pattern of these data determine the strategy for designing the model.
In order to exploit these two types of data, we applied the Fuzzy Decision Tree (FDT) technique to compute the relative duration of each syllable in an utterance. FDT has been applied to the modelling of various problems, such as power system security assessment (Boyen and Wehenkel, 1999), weather forecasting (Yuan and Shaw, 1995; Dong and Kothari, 2001) as well as in software quality models (Pedrycz and Sosnowski, 2001). Suarez and Lutsko (1999) have assessed the performance of the FDT technique in real world problems such as classification of diabetes data, breast cancer data, heart disease data as well as in waveform recognition. In speech processing, Mitra et al. (2002) applied FDT to the recognition of vowels produced by a group of male speakers in a Consonant–Vowel–Consonant context. The results of these applications suggest that FDTs are better at extrapolating from training data when compared with binary decision trees such as the Classification And Regression Tree (CART).
A survey of the literature on duration modelling, in the context of prosody modelling for TTS applications, suggests that CART is the most frequently used modelling technique. There is no reported work on the application of FDT to duration modelling. In this paper, we illustrate the application of FDTs to duration modelling and compare the results of our FDT-based model with those of a CART-based duration model. We demonstrate these duration models within the context of our R-Tree based prosody model using the Standard Yorùbá language as a case study.


In Section 2, we provide a brief overview of the Standard Yorùbá (SY) language. Section 3 provides a description of the data used for creating our duration models. Section 4 gives an overview of the factors affecting duration in SY speech and our SY duration modelling. Section 5 contains a review of the literature on duration modelling in TTS applications. The FDT and CART based duration models are discussed in Sections 6 and 7, respectively. An evaluation and discussion of our models is provided in Section 8. Section 9 concludes this paper.
2. A brief description of the Standard Yorùbá language
Yorùbá is one of the four major languages spoken in Africa and it has a speaker population of more than 30 million in West Africa alone (Crozier and Blench, 1976; Taylor, 2000). There are many dialects of the language, but all speakers can communicate effectively using Standard Yorùbá (SY). SY is used in language education, mass media and everyday communication. The present study is based on the SY language.
The SY alphabet has 25 letters, made up of 18 consonants (represented by the graphemes: b, d, f, g, gb, h, j, k, l, m, n, p, r, s, ṣ, t, w, y) and seven vowels (a, e, ẹ, i, o, ọ, u). Note that the consonant gb is a digraph, i.e. a consonant written in two letters. There are five nasalised vowels in the language (an, en, in, ọn, un) and two pure syllabic nasals (m, n). SY has three phonologically contrastive tones: High (H), Mid (M) and Low (L). Phonetically, however, there are two additional allotones or tone variants, namely rising (R) and falling (F) (Connell and Ladd, 1990; Akinlabí, 1993). A rising tone occurs when an L tone is followed by an H tone, while a falling tone occurs when an H tone is followed by an L tone. This situation normally occurs during assimilation, elision or deletion of a phonological object as a result of co-articulation in fluent speech.
A valid SY syllable can be formed from any combination of a consonant and a vowel or a consonant and a
nasalised vowel. When each of the 18 consonants is combined with a simple vowel, we will have a total of 126
CV type syllables. When each consonant is combined with a nasalised vowel, we have a total of 90 CVn type
syllables. SY also has two syllabic nasals n and m. Table 3 shows the distribution of the components of the
phonological structure of SY syllables.
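The syllable counts quoted above can be checked by enumerating the grapheme combinations. The following is a minimal sketch, not part of the original study: the inventories are the grapheme lists given in the text, while the function and variable names are illustrative assumptions. Treating bare oral vowels (V) and bare nasalised vowels (Vn) as syllables in their own right is also an assumption, taken from the syllable types named in Section 3; under it the totals agree with the 230 base syllables and 690 tone syllables reported in Table 3.

```python
# Grapheme inventories as listed in the text (names below are illustrative).
CONSONANTS = ["b", "d", "f", "g", "gb", "h", "j", "k", "l", "m", "n",
              "p", "r", "s", "ṣ", "t", "w", "y"]
ORAL_VOWELS = ["a", "e", "ẹ", "i", "o", "ọ", "u"]
NASAL_VOWELS = ["an", "en", "in", "ọn", "un"]
SYLLABIC_NASALS = ["m", "n"]
TONES = ["H", "M", "L"]


def base_syllables():
    """Return a dict mapping each syllable type to its list of base syllables."""
    return {
        "CV": [c + v for c in CONSONANTS for v in ORAL_VOWELS],
        "CVn": [c + v for c in CONSONANTS for v in NASAL_VOWELS],
        "V": list(ORAL_VOWELS),        # assumption: bare vowel syllables
        "Vn": list(NASAL_VOWELS),      # assumption: bare nasalised vowels
        "N": list(SYLLABIC_NASALS),
    }


inventory = base_syllables()
counts = {kind: len(items) for kind, items in inventory.items()}
total_base = sum(counts.values())          # base syllables, tone ignored
total_tone = total_base * len(TONES)       # each base syllable with H, M or L

print(counts)
print(total_base, total_tone)
```

The CV and CVn counts come out as 18 × 7 and 18 × 5, which is where the figures 126 and 90 in the paragraph above originate.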
Table 1
Segmental phonemes of Standard Yorùbá consonants

Manner of articulation | Place of articulation
                       | Bilabial | Alveolar | Palato-alveolar | Palatal | Velar | Labio-velar | Glottal
Stops                  | b        | t d      |                 |         | k g   | kp gb       |
Fricative              | f        | s        |                 |         |       |             |
Affricates             |          |          |                 |         |       |             |
Nasal                  |          | n        |                 |         |       |             |
Flap                   |          | r        |                 |         |       |             |
Lateral                |          | l        |                 |         |       |             |
Semi-vowel             |          |          |                 |         |       |             |

Table 2
Standard Yorùbá vowel system (columns: Oral vowels, Nasal vowels)


It should be noted that although a CVn syllable ends with a consonant, the consonant and its preceding vowel are the orthographic equivalent of a nasalised vowel. There are no closed syllables and no consonant clusters in the SY language. The SY consonant and vowel systems are shown in Tables 1 and 2, respectively. The phonetic attribute of the nucleus is important to the f0 curve of the syllable because we consider the tone to be anchored to the nucleus of the syllable.
3. Experimental data
There is no language resource developed for SY in the context of speech technology. We therefore developed a speech database for the purpose of this research. We selected four popular SY newspapers and three SY textbooks for creating our text corpus. The newspapers are: (i) Aláròyé, (ii) Alálàyé, (iii) Ìròyìn Yorùbá, and (iv) Akéde Àgbáyé. The three textbooks are two SY language education textbooks (Bámgbóṣé, 1990; Owólabí, 1998) and a book on SY culture (Ògúnbọ̀wálé, 1966). In addition to these, we also composed a short SY story and added the text to our SY text corpus. The purpose of composing the story is to add typical dialogue domain text to the already collected texts. It also allows us to compare the tonal and linguistic distributions in the different domains of SY text. The resulting corpus contains 95 sentences.
The analysis of the text database informed the selection of the text for our speech corpus. Out of the 690 SY syllables (cf. Table 3), we selected 456 syllables. These syllables were carefully selected to reflect the coverage of all syllable types in terms of phonetic and phonological distributions. For example, in the CV syllable type, the manner of articulation of the onset is considered. The onset consonants are selected from each manner of articulation class, i.e. stops, labio-velars, fricatives, affricates, sonorants or semivowels. The selected onset is combined with each vowel type, e.g. Close rounded, Half-closed front, etc., in order to select the syllable for each class of utterance. The same process is repeated for all the syllable types. The data set adequately represents all SY syllable types (i.e. CV, CVn, Vn, V and N).
Of the 95 sentences in our corpus, 60 sentences were selected for training our duration model. Forty of them are one-phrase sentences and each of the remaining 20 sentences contains two phrases. For the test data set, we have chosen 30 sentences from the remaining 35 sentences in our corpus. Twenty-one of them are one-phrase sentences and the remaining nine are two-phrase sentences. The distribution of syllable and tone types in these sentences is shown in Table 4. The occurrence counts for each syllable type in context in both training and test data sets are shown in Fig. 1.

Table 3
Phonological structure of SY syllables

Tone syllables (690)
    Base syllables (230)
        ONSET (18): Consonant
        RHYME (14)
            Nucleus: Vocalic, V (7); Non-vocalic, N (2)
            Coda: n (1)
    Tones (3): H, M, L

The numbers within parentheses indicate the total number of the specified unit.

Table 4
Statistics for the characteristics of the training and test data sets

Category       | Description      | Training set | Test set
Sentences      | One-phrase       | 40 (60%)     | 21 (67%)
Sentences      | Two-phrase       | 20 (40%)     | 9 (33%)
Words          | Total word count | 793          | 386
Syllable types | CV               | 1007         | 414
Syllable types | CVn              | 368          | 234
Syllable types | Vn               | 291          | 58
Syllable types | V                | 755          | 292
Syllable types | N                | 38           | 20
Tone types     | H tone           | 984          | 458
Tone types     | L tone           | 980          | 305
Tone types     | M tone           | 492          | 255
Both the training and test data sets contain semantically well-formed statement sentences and they are selected to reflect common, everyday use of SY. Within the training data set, the minimum number of syllables per sentence is 6 and the maximum is 24 syllables. The H and L tone syllables account for 40% each while the M tone syllables account for the remaining 20%. For the test data set, the minimum number of syllables per sentence is 4 and the maximum is 19 syllables. The H, L and M tone syllables account for 45%, 30% and 25%, respectively. Our training and test data sets contain statement sentences only because research on SY language intonation has shown that the mode of the sentences does not affect the intonation (Connell and Ladd, 1990). Since intonation has the closest proximity to sentence mode, we assume that its effects on the other dimensions of prosody, i.e. duration and intensity, if present, will be minimal.
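The tone percentages quoted in this paragraph follow directly from the tone-type counts in Table 4. The short sketch below recomputes them; the counts are copied from Table 4, while the function and variable names are illustrative assumptions rather than anything used in the original study.

```python
# Tone-type syllable counts copied from Table 4.
TONE_COUNTS = {
    "training": {"H": 984, "L": 980, "M": 492},
    "test": {"H": 458, "L": 305, "M": 255},
}


def tone_shares(counts):
    """Return each tone's share of the syllables as a whole-number percentage."""
    total = sum(counts.values())
    return {tone: round(100 * n / total) for tone, n in counts.items()}


shares = {name: tone_shares(c) for name, c in TONE_COUNTS.items()}
for name, share in shares.items():
    print(name, share)
```

Rounded to whole percentages, the training set gives 40/40/20 for H/L/M and the test set gives 45/30/25, matching the figures in the text.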
To obtain the timing information of each syllable in context for our duration modelling and evaluation experiments, we have recorded and annotated all of these 90 sentences. Of the 456 recorded syllables, 350 form the training set and 106 form the test set. All of the 456 syllables were read by six participants and their voices were recorded as discussed in the following section. However, only the recorded data of one adult male speaker was used for the duration modelling. The other data were used in experimentation in order to determine the factors that affect duration as well as in verification of our models.
3.1. Speech data recording and annotation
Six naïve adult native SY speakers, three females and three males, read the speech for the selected syllables and sentences in our training data set. The age of the speakers ranged from 21 to 36 years. An ANDREA NC-61 microphone was used for recording on a Pentium 4.2 GHz microcomputer system with an on-board sound card. The recording took place in a quiet laboratory environment. Two freeware software products were used in this experiment: Wavesurfer (Sjölander and Beskow, 2004) and Praat (Boersma and Weenink, 2004).
In all, 2100 syllables and 360 sentences were recorded for the training data set. Each of the speakers read 60 sentences and 350 syllables. For the test set, the recording was carried out using only one male speaker.
In order to achieve good recording, recorded speech signals were inspected for the following defects:
• distortion arising from clipping,
• aliasing (checked via spectral analysis),
• noise effects arising from low signal amplitude as a result of quantisation noise or poor signal-to-noise ratio (SNR),
• large amplitude variation, and
• transient contamination (due to extraneous noise).
Recorded speech that had any of the above listed defects was discarded and the sound recording repeated.
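The text does not say how the inspection was carried out. As an illustration only, two of the listed defects (clipping and poor SNR) could be screened automatically along the following lines. Everything in this sketch is an assumption: the 16-bit full-scale value, the availability of a noise-only segment for each take, the threshold values, and all function names.

```python
import math


def clipped_fraction(samples, full_scale=32767):
    """Fraction of samples whose magnitude reaches full scale (a clipping cue)."""
    if not samples:
        return 0.0
    hits = sum(1 for s in samples if abs(s) >= full_scale)
    return hits / len(samples)


def mean_power(samples):
    """Mean squared amplitude of a non-empty sample sequence."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(speech, noise):
    """SNR estimate in dB from a speech segment and a noise-only segment."""
    return 10.0 * math.log10(mean_power(speech) / mean_power(noise))


def needs_rerecording(speech, noise, max_clipped=0.001, min_snr_db=30.0):
    """Flag a take for re-recording; both thresholds are hypothetical defaults."""
    too_clipped = clipped_fraction(speech) > max_clipped
    too_noisy = snr_db(speech, noise) < min_snr_db
    return too_clipped or too_noisy
```

A check of this kind would only complement, not replace, the visual and auditory inspection described above, since aliasing and transient contamination are not covered.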

Fig. 1. Occurrence counts for syllable type in context.


During the speech annotation, the complete set of syllable and sentence recordings was annotated for only one male speaker, and that data was used for the duration modelling. For the other speakers, all syllable recordings were fully annotated and used to determine the factors affecting duration. In addition, some sentence recordings from these speakers were also annotated for verification purposes.
In the annotation of the syllable speech files, only one tier is specified, i.e. the syllable tier. The symbol * is used to annotate syllable boundaries. Each syllable is labelled with its letters, with its associated tone enclosed in parentheses. For example, the syllable ba is labelled as ba(H) where ba are the graphemes of the syllable and H represents the high tone associated with the syllable.
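A label in this format is straightforward to split into its graphemes and tone. The sketch below is a hypothetical helper, not the annotation tooling used in the study; it assumes that only the three phonological tones H, M and L appear in labels.

```python
import re

# Graphemes followed by a single tone letter in parentheses, e.g. "ba(H)".
# Assumption: only the phonological tones H, M and L are used in labels.
LABEL = re.compile(r"^(?P<graphemes>[^()\s]+)\((?P<tone>[HML])\)$")


def parse_label(label):
    """Split a syllable label such as 'ba(H)' into (graphemes, tone)."""
    match = LABEL.match(label.strip())
    if match is None:
        raise ValueError(f"not a syllable label: {label!r}")
    return match.group("graphemes"), match.group("tone")


print(parse_label("ba(H)"))
```

Labels that do not fit the pattern raise an error rather than being silently skipped, so malformed annotations surface early.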
Four tiers are specified for the sentence annotation: (i) syllable, (ii) word, (iii) phrase and (iv) sentence. When annotating the sentence speech files, the labelling order is: (i) sentence, (ii) phrase (if more than one), (iii) word and (iv) syllable. This labelling order simplifies the detection of boundaries in smaller linguistic units since their annotation is guided by the annotation of the larger tiers. For example, after annotation of the word tier in a two-syllable word, the beginning of the first syllable and the end of the last syllable can be easily determined. This approach also reduces annotation error since larger units are much easier to identify physically from speech sound signals and perceptually from listening to a replay of the sound segment. Fig. 2 shows an example of annotation data for the SY sentence "Ó pẹ́ kí ó tó dé, kò tètè lọ." (meaning "He came late, and did not go early.").
Both the spectrogram and the waveform are used to determine syllable and word boundaries. For certain types of syllables in the continuous speech for sentences, we found that boundaries between syllables were hard to determine. This occurred between VV pairs or between VCV pairs where C is a semi-vowel such as /y/ or /w/. In this situation, we employed listening tests in addition to speech spectrogram and waveform characteristics for syllable boundary detection. Where the boundaries were in doubt, we found the earliest reasonable position and the latest reasonable position, then placed the boundary half-way in between. For unvoiced plosives, i.e. /p/, /t/ and /k/, we placed the syllable boundary in the centre of the closure. We note that the voiced plosives show strong segmental effects on the f0 curves of the syllable in which they occur.
Fig. 2. Example annotation data for the sentence "Ó pẹ́ kí ó tó dé, kò tètè lọ." (axes: Frequency (Hz) against Time (s); syllable tier: O(H) pe(H) Ki(H) o(H) to(H) de(H), Ko(L) te(L) te(L) lo(M)).

3.2. Factors affecting syllable duration in SY
An ideal duration model in a TTS system should consider all contextual factors that affect duration and account for the timing of all speech sounds. However, the consensus in the literature is that this task is practically impossible (Campbell and Isard, 1991; Brinckmann and Trouvain, 2003) and there is the need to simplify the duration modelling problem to a manageable complexity without compromising crucial perceptual information. To establish the factors that affect SY syllable duration, we have conducted informal experiments on 60 recorded sentences and syllables from six native SY speakers (cf. Section 3). In these experiments, we considered nine factors that affect duration. If we represent the target syllable (i.e. syllable for which duration is to be computed) as $S_{tag}$ and the word in which it occurs as $W_{tag}$, the factors that we select for our duration data analysis are:
(1) The position of the syllable in $W_{tag}$, $S_{tag}^{PoW}$.
(2) The position of $W_{tag}$ in the sentence, $W_{tag}^{PoS}$.
(3) The length of $W_{tag}$, $W_{tag}^{len}$, calculated as the number of syllables it contains.
(4) The peak of the f0 curve on $S_{tag}$, $S_{tag}^{f0}$.
(5) The phonetic structure of the target syllable $S_{tag}$, i.e. $S_{tag}^{pho}$.
(6) The phonetic structure of the preceding syllable $S_{pre}$, $S_{pre}^{pho}$.
(7) The phonetic structure of the following syllable $S_{fol}$, $S_{fol}^{pho}$.
(8) The peak of the f0 curve on the preceding syllable $S_{pre}$, $S_{pre}^{f0}$.
(9) The peak of the f0 curve on the following syllable $S_{fol}$, $S_{fol}^{f0}$.

The f0 curve on each syllable is approximated by a third degree polynomial using the stylisation technique (d'Alessandro and Mertens, 1995). The results of our experiments show that the tones of the preceding and following syllables, in terms of the peak f0 values, do not affect the duration of the target syllable. Thus, we considered only the first seven factors listed above in our duration modelling.
4. SY syllable duration modelling
The duration of a syllable spoken in isolation differs from its duration when it occurs in the context of an utterance. This implies that the factors that affect the duration of syllables in the context of fluent speech actually modify the canonical duration. This modification can produce three effects on the duration of the canonical syllable: (i) decrease the duration, i.e. compress the syllable, (ii) leave the duration unchanged, or (iii) increase the duration, i.e. stretch the syllable.
Let g be the scaling factor for the duration of a canonical syllable. The value of g can be calculated using a multiplicative model which simply multiplies the value that each factor contributes to the change in the duration of a canonical syllable. Chen et al. (2003) have shown that such multiplicative models perform better than additive models. However, such a simplistic multiplicative model may introduce further problems, if the calculation is not restricted to the factors whose contribution is a positive non-infinite value.
Let $L_r$ and $L_c$ be the realised and the canonical duration of a syllable, respectively. Let $L_I$ be the amount of increase or decrease of a canonical duration of a syllable. If g is defined such that $-1 < g \le 1.0$, we can formulate the equation for computing $L_I$, given $L_c$, as:
$L_I = g L_c$    (1)
where g denotes the syllable duration modifier. $L_r$ can then be computed using:
$L_r = L_c + L_I$    (2)
In Eq. (1), g acts as a scaling factor for the duration of a syllable. If g = 0, it implies that the realised syllable duration is the same as the canonical duration of the syllable in question. That is the case when monosyllabic words are spoken in isolation or at the beginning of a short sentence. When g < 0, the realised duration is reduced by the factor specified by g. For example, if g = −0.5, it implies that the realised duration is 50% shorter (compressed to half of its canonical size) than the canonical duration. Likewise, if g > 0, the canonical duration of the syllable is increased. Our aim is to develop a model that predicts g by establishing a relationship between the set of factors that affect syllable duration and the duration of a syllable in the context of an utterance.
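Eqs. (1) and (2) can be made concrete with a few lines of code. The sketch below reads the admissible range of the modifier as −1 < g ≤ 1 and uses a canonical duration of 200 ms purely as an illustrative value; the function name is likewise an assumption.

```python
def realised_duration(canonical, g):
    """Apply Eqs. (1) and (2): L_I = g * L_c, then L_r = L_c + L_I."""
    if not -1.0 < g <= 1.0:
        raise ValueError("g must satisfy -1 < g <= 1")
    change = g * canonical      # Eq. (1): amount of increase or decrease
    return canonical + change   # Eq. (2): realised duration


# Hypothetical canonical duration of 200 ms under three modifier values.
for g in (0.0, -0.5, 0.25):
    print(g, realised_duration(200.0, g))
```

With g = 0 the syllable keeps its canonical 200 ms, with g = −0.5 it is compressed to 100 ms, and with g = 0.25 it is stretched to 250 ms, which mirrors the three effects described in the text.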
5. Contemporary approaches to duration modelling
Reported works in the literature (e.g. Allen, 1994; Bellegarda et al., 2001; Huckvale, 2002) all agree that the present knowledge about speech duration, and the state of the art in speech technology in general, is still rudimentary and that our understanding of duration patterns and the many sources of variability which affect them is still sparse (Möbius, 2003). Since the acoustic signal is best represented quantitatively, we need a computational model that is capable of capturing and relating the numerical (quantitative) and the rich qualitative knowledge underlying the linguistic structure of speech timing. To incorporate this knowledge into a TTS duration model, we need to recognise that speech is primarily a linguistic entity and its acoustic manifestation as waveforms is meant to communicate the embedded linguistic message. That explains the desire for duration models to capture the relationship between linguistic information and a huge range of values assigned to the durational features of speech. An engineering approach to this problem will be to design a model which employs a strategy in which durations are related to the perception of the acoustic waveform.
There are two principal classes of methods applied in the design of duration models for modern TTS systems: (i) rule-based methods and (ii) data-driven methods. The data-driven methods can be further divided into two groups: (a) iterative optimisation methods and (b) machine learning methods. The duration models resulting from these methods differ in structure as well as in the manner in which they use duration affecting factors to assign duration to units of utterance.
In rule-driven methods, a set of If-Then rules is designed based on the durational pattern observed in a study of natural speech waveforms (Hohne et al., 1983). These rules are used to modify the duration of segments with the aim of producing a good match between the natural and synthetic speech. Underpinning this approach is the idea that, by experimenting with a number of sentences and speakers, one could hopefully make a major improvement in the predicted duration and hence the quality of synthetic speech obtained. The Klatt (1987) duration model is perhaps the most popular rule-driven duration model. This model predicts segmental duration by starting from some intrinsic values. The intrinsic duration is modified by successively applying rules which are intended to reflect contextual factors, such as positional and prosodic factors, to lengthen or shorten the segment.
The Klatt duration model is specified by an equation which takes into account the inherent and minimum duration of a segment, measured in milliseconds. The percentage increase or decrease in the duration of segments is determined by applying If-Then rules. A major weakness of this model is that rule parameters are determined by a manual trial-and-error process. Manually exploring the effect of mutual interactions among linguistic features of different levels is a highly complex and error-prone process. Moreover, the model does not provide a systematic structure for determining how to include or exclude a factor that affects duration. Hence, the rule inference process usually involves a controlled experiment, in which only a limited number of contextual factors are examined. In addition, the application of this model in a syllable-based duration model, such as ours, will require that we treat syllables as segments. That will limit the flexibility of our model because syllable-sized durations are generally less variable than segment duration (Keller and Zellner, 1995). Using this approach will also introduce some inaccuracies in the representation of f0 anchor points which are crucial to the location of f0 peaks and valleys on the intonation contour of our prosody model.
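The equation itself is not reproduced in the text above. In the form in which Klatt's rule is commonly stated, the duration is DUR = MINDUR + (INHDUR − MINDUR) × PRCNT / 100, where PRCNT starts at 100 and each applicable rule scales it by its own percentage. The sketch below illustrates that form only; the function name, the segment values and the rule percentages are hypothetical and are not taken from Klatt's rule set or from this paper.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_percents):
    """Klatt-style rule: DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100.

    PRCNT starts at 100 and every applicable rule scales it by its own
    percentage (below 100 shortens the segment, above 100 lengthens it).
    """
    prcnt = 100.0
    for percent in rule_percents:
        prcnt = prcnt * percent / 100.0
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0


# Hypothetical segment: inherent 100 ms, minimum 40 ms.
print(klatt_duration(100.0, 40.0, []))           # no rule fires
print(klatt_duration(100.0, 40.0, [50.0]))       # one shortening rule
print(klatt_duration(100.0, 40.0, [50.0, 140.0]))  # shortening, then lengthening
```

Because only the portion of the duration above the minimum is scaled, no combination of shortening rules can push a segment below its minimum duration. The percentages themselves are exactly the parameters that have to be tuned by hand, which is the weakness discussed above.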
In iterative optimisation methods, a basic mathematical model that describes the duration pattern is first derived and then optimised using speech duration data. The Sum-Of-Product (SOP) method (van Santen, 1992, 1994) is a typical data-driven iterative optimisation method which applies both addition and multiplication to the computation of speech unit duration. The SOP model has been used in many TTS applications. It is particularly suitable for computing syllable segment duration. The idea underlying the design of this model is that the regularity in the interaction of factors affecting duration can be described by a class of simple arithmetic equations. An SOP model treats the factors that affect duration as independent variables in a formula that computes a dependent variable, i.e. duration. To achieve this, a monotonically increasing transformation function, F(), is used in conjunction with another function, D(), that can be decomposed as a sum of products of single factor parameters. The strength of the SOP model is that the number of parameters required in the model is sufficiently small, and the arithmetic operations of multiplication and addition are mathematically sufficiently well behaved, that parameters can be estimated even under conditions of severe frequency imbalance (van Santen, 1994; Chung and Huckvale, 2001). However, Bellegarda et al. (2001) have observed that the diagnosis of an N-variable function, on the basis of joint independence, requires the testing of (N − 1)-tuples of variables for independence of the Nth. Such diagnosis is not always successful, because, apart from requiring a considerable effort in generating the model, it has been shown that the sum-of-product function is not a generalised additive function and the choice of the usual log function for the monotonically increasing transformation function, F(), is probably not optimal (Bellegarda et al., 2001).
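To show what a sum of products of single factor parameters looks like in practice, the sketch below evaluates a function of the form D = Σ_i Π_j S_ij(f_j) for one context. It is a minimal illustration: the factor names, levels and parameter values are invented, the data layout is an assumption, and the transformation F() is left out.

```python
def sop_value(levels, terms):
    """Evaluate a sum-of-products function for one context.

    `terms` is a list of product terms. Each term maps a factor name to a
    table of per-level parameters; `levels` gives the level each factor
    takes in the context being evaluated.
    """
    total = 0.0
    for term in terms:
        product = 1.0
        for factor, table in term.items():
            product *= table[levels[factor]]
        total += product
    return total


# Hypothetical two-term model: (nucleus x position) + onset.
TERMS = [
    {"nucleus": {"a": 100.0, "i": 80.0},
     "position": {"medial": 1.0, "final": 1.5}},
    {"onset": {"b": 30.0, "none": 0.0}},
]

context = {"nucleus": "a", "position": "final", "onset": "b"}
print(sop_value(context, TERMS))
```

In a fitted SOP model the parameter tables would be estimated from duration data, and the result would be mapped back through the inverse of F() (for example, the exponential if F() is the log function).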
In machine learning approaches, the aim is to automatically generate a duration model from a large annotated speech corpus, usually with the aid of statistical methods, such as the Classification And Regression Tree (CART) (Lee and Oh, 1999; Chung, 2002), or automatic machine learning techniques, such as Artificial Neural Network (ANN) (Fletcher and McVeigh, 1993; Chen et al., 1998; Vainio, 2001), Bayesian Model (Goubanova and Taylor, 2000) or Hidden Markov Model (HMM) (Levinson, 1986; Donovan, 1996). CART is perhaps the most popular data-driven method for duration modelling in TTS applications. CARTs are particularly attractive because standard tools for their generation are widely available, and, in contrast to other data-driven methods, the computed regression tree is interpretable. An additional strength of CART is the ease with which trees may be built from duration data and also the speed of classification of new data. Lee and Oh (1999) have shown that CARTs can cope with complex confounding interaction between factors that affect duration because they make very few assumptions about the structure of the data.
CART embodies a binary branching tree with questions about the influencing factors at the nodes and predicted values at the leaves (Riley, 1992; Breiman et al., 1984). The tree itself contains yes/no questions about features and ultimately provides either a probability distribution, when predicting categorical values (classification tree), or a mean and standard deviation when predicting continuous values (regression tree). Well-defined techniques can be used to construct an optimal tree from a set of training data. Furthermore, CART induced trees can easily be converted into rules by viewing all the nodes which lead from the root to a leaf as the antecedent of a rule and the corresponding leaf as the consequence. Therefore, a major strength that the CART approach has over other data-driven methods is that CART output is more readable and often understandable by humans. This feature is particularly important when developing a duration model for a new language as it makes it possible to iteratively evaluate and improve the model.
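The conversion of a tree into rules described above can be illustrated with a minimal Python sketch. The dictionary layout, the question strings and the leaf labels are hypothetical and chosen only for illustration; they are not taken from the trees built in this paper.

```python
def tree_to_rules(node, conditions=()):
    """Return one (antecedent, consequence) pair per root-to-leaf path."""
    if "prediction" in node:  # a leaf: the path so far is the antecedent
        return [(list(conditions), node["prediction"])]
    question = node["question"]
    rules = tree_to_rules(node["yes"], conditions + (question,))
    rules += tree_to_rules(node["no"], conditions + ("NOT " + question,))
    return rules


# Hypothetical toy tree with yes/no questions at the internal nodes.
toy_tree = {
    "question": "word_position >= 0.8",
    "yes": {"prediction": "Increase"},
    "no": {
        "question": "syllable_type == CV",
        "yes": {"prediction": "Decrease"},
        "no": {"prediction": "Increase"},
    },
}

for antecedent, consequence in tree_to_rules(toy_tree):
    print("IF " + " AND ".join(antecedent) + " THEN " + consequence)
```

Each printed line corresponds to one path from the root to a leaf, which is what makes the tree easy to inspect by hand.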
However, it is well-known that CART is unable to accurately extrapolate from known to unknown
contexts (Riley, 1992). Furthermore, due to the way that a CART is structured, CART allows either a single
feature or a linear combination of features at each internal node. This makes CART, like other binary decision-tree algorithms, biased towards generality. Another well-known weakness of CART is its inability to handle sparse data.
van Santen (1994) has shown that SOP models can successfully handle the data sparsity problem. Therefore, a model that contains a mixture of probabilistic and prescriptive elements would be better suited for
our duration modelling. The Fuzzy Decision Tree (FDT) modelling technique meets this requirement (Janikow, 1998; Huang and Liang, 2002). The major strengths of fuzzy logic algorithms are that they are robust
and flexible and that they are able to cope well with interactions of linguistic attributes. Hence, they can be
easily tailored to cope with small disjuncts, which are associated with large degrees of attribute interaction
(Carvalho and Freitas, 2002). In light of the characteristics of FDT, we hypothesise that FDT could be a suitable technique for duration modelling in the context of TTS. In this paper, we describe our application of
FDT in duration modelling.
6. FDT in duration modelling
The motivation for selecting the fuzzy decision tree approach for our duration model is founded upon the hypothesis that the proportionate relationships among confounding factors that affect duration at various phonological levels can be captured by an appropriately-designed model. Such a model must, on one hand, establish a relationship between the linguistic levels and the qualitative description of duration phenomena. On the other hand, it must facilitate a transparent link between qualitative descriptions and the quantitative values that are responsible for the timing of speech waveforms.
The FDT is an appropriate model in this context because it does not impose any arithmetic or multiplicative restrictions (relationship), or any inherent linearity by way of empirical rules. Thus it is able to exploit a very important property of interaction between factors that affect duration, i.e. these interactions are often regular in the sense that the effect of one factor does not reverse that of another (van Santen, 1994; Campbell, 2000).


An FDT model facilitates the computation of a more globally optimal result because it has the ability to compute the relative effects of all child nodes corresponding to factors affecting duration on the duration of a syllable, before subsequently combining and aggregating them through the defuzzification process.
6.1. Problem formulation
We can formulate the duration modelling problem as a classification/regression problem. This is because a number of independent variables (i.e. the factors that affect duration) are used to compute the duration of a syllable.
Given a set of training samples composed of observed input/output pairs that consists of N labelled examples, $\{(x_n, y_n);\ n = 1, 2, \ldots, N\}$, our aim is to derive a general model which can be used to compute output values for any new set of inputs. The new inputs may be in the training set or test set. In the context of duration modelling, the input variables (i.e. the attributes) are the relevant parameters describing the factors affecting the duration of the unit of utterance in focus, e.g. syllable or phone, and the output would be a numerical value specifying the actual duration or modification to each speech unit in an utterance (i.e. g).
To formulate this problem as a fuzzy classification/regression problem, let $U = \{u_j\}$, $j = 1, \ldots, n$, represent the universe of objects that describe the factors affecting the duration of a syllable. Each of these $n$ objects is described by a collection of attributes $A = \{A_1, A_2, \ldots, A_r\}$. Each attribute $A_k$ measures some important feature of an object and can be limited to a set of $m$ linguistic terms $T(A_k) = \{T^k_1, T^k_2, \ldots, T^k_m\}$. $T(A_k)$ is the domain of the attribute $A_k$. Each numerical attribute $A_k$ can be defined as a linguistic variable which takes linguistic values from $T(A_k)$. Each linguistic value $T^k_j$ is also a fuzzy set defined over the range of the numerical values of the variable, i.e. its Universe of Discourse (UoD). The membership function $\mu_{T^k_j}$ indicates the degree to which object $u$'s attribute $A_k$ belongs to $T^k_j$. The membership of a linguistic value can be subjectively assigned or inferred by a membership function defined over its UoD.
6.2. Fuzzy Decision Tree (FDT) design
The potential of fuzzy decision trees in improving the robustness and generalisation in classification is due to the use of fuzzy reasoning. Underlying fuzzy reasoning is the concept of a fuzzy set. A fuzzy set is represented by a membership function which maps numerical data onto the closed interval [0, 1]. While in classical logic the results of the operations of conjunction and implication are unique, in fuzzy logic there is an infinite number of possibilities. When a crisp number from the universe of the class variable is sought and a number of other restrictions on fuzzy sets and operators are applied, rules can be evaluated individually and then combined. This approach, called local inference, gives a compositionally simple alternative, with good approximation characteristics, even when all the necessary conditions are not satisfied.
A fuzzy decision tree gives results within the closed interval [0, 1], as the possibility degree of an object matching a class. Fuzzy decision trees therefore provide a more robust way to avoid misclassification. Each path of a fuzzy decision tree, from the root to a leaf, forms a decision rule, which can be represented in the form: IF ($x_1$ IS $A_1$) AND ($x_2$ IS $A_2$) ... AND ($x_n$ IS $A_n$) THEN (class = $C_j$). In the case of our model, each $x_i$ represents a factor that affects duration and the $A_i$ are the constraints defined over the universe of discourse of the factor. The $C_j$ is the duration scaling class (i.e. Increase or Decrease).
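As an illustration of how the antecedent of such a rule could be evaluated, the sketch below combines the degrees to which the individual restrictions are satisfied. The use of the minimum operator for the fuzzy AND is an assumption made here for illustration (it is one common choice of conjunction), and the membership degrees are invented values, not measurements from our data.

```python
def rule_strength(memberships):
    """Degree to which a conjunctive antecedent holds.

    Assumption: fuzzy AND is modelled with the minimum operator.
    """
    return min(memberships)


# Hypothetical degrees to which one syllable satisfies each restriction.
antecedent = {
    "Word_Length IS Long": 0.7,
    "Position_Of_Word IS Final": 0.9,
}

strength = rule_strength(antecedent.values())
print("IF " + " AND ".join(antecedent) + " THEN class = Increase, degree", strength)
```

With these values the rule fires to degree 0.7, i.e. the weakest restriction bounds the strength of the whole antecedent.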
Our aim is to exploit the Fuzzy ID3 algorithm developed by Janikow (1998) for addressing the problem of duration modelling in the context of SY prosody modelling. The Fuzzy ID3 which we have adopted (Janikow, 1998) differs from the traditional ID3 algorithms (e.g. Quinlan, 1986) in that it creates a leaf node not only when all data belong to the same class, but also in the following cases: (i) if the proportion of a data set of a class $C_k$ is greater than or equal to a threshold, (ii) if the number of elements in a data set is less than a threshold, or (iii) if there are no more attributes for classification. More than one class name may be assigned to one leaf node. In addition to these, the fuzzy sets of all attributes are defined depending on the pattern of the data. Each attribute is processed as a linguistic variable using fuzzy restrictions such as X1 IS Low, X1 IS Medium, etc. Our FDT duration model implementation follows the steps in the literature (e.g. Yuan and Shaw, 1995; Olaru and Wehenkel, 2003):

(1) Fuzzify the training data.
(2) Build a set of fuzzy decision trees.
(3) Obtain an optimal tree using pruning techniques.
(4) Apply the FDT for predicting duration.

In the following subsections, we describe how these steps are applied to the design of our FDT based duration model.
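The leaf-creation conditions of the adopted Fuzzy ID3 algorithm, listed in Section 6.2 as cases (i) to (iii), can be sketched as a single test. The function name, the threshold values and the dictionary of fuzzy example counts are hypothetical; the sketch only mirrors the three conditions and is not the FID3.3 implementation.

```python
def should_create_leaf(class_counts, remaining_attributes,
                       purity_threshold=0.9, min_examples=2.0):
    """Decide whether a node becomes a leaf.

    class_counts maps a class name to its (possibly fractional) example
    count in the node. Both thresholds are illustrative assumptions.
    """
    total = sum(class_counts.values())
    if total == 0 or not remaining_attributes:  # case (iii), plus an empty-node guard
        return True
    if total < min_examples:                    # case (ii): too few examples
        return True
    # case (i): one class dominates the node
    return max(class_counts.values()) / total >= purity_threshold


print(should_create_leaf({"Increase": 106.21, "Decrease": 0.22}, ["word_length"]))
print(should_create_leaf({"Increase": 50.0, "Decrease": 50.0}, ["word_length"]))
```

The first call returns True because one class accounts for almost all of the example count; the second returns False because the node is evenly split and attributes remain.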
6.2.1. Fuzzification of the input space
The FDT is an approximation structure that computes the degree of membership of the duration affecting factors to a particular syllable duration scaling class (i.e. Increase or Decrease). There are two types of data in our duration model: categorical and numerical. As shown in Table 5, four of the seven input variables in our duration model are numerical and are treated as continuous variables. The numerical data must be fuzzified into linguistic terms through the fuzzification process. The fuzzy membership functions used to fuzzify the numerical data are derived as follows.
We assume that these variables are factorable such that fuzzy subsets can be defined over their Universe of Discourse (UoD). Since all the factors are normalised, their UoD is defined over the closed interval [0, 1]. We first partition the UoD for each of the numerical variables into subranges, with each subrange labelled with a linguistic term. For simplicity, we restrict the number of linguistic terms to 3 for continuous input variables and to 2 for the output variables (i.e. Increase and Decrease). We used the trapezoidal function to model our membership functions because it is simple and there are algorithms for deriving and implementing them (Kosko, 1994). In addition, the trapezoidal membership function is frequently used in fuzzy theory to model relatively stable data such as syllable duration. The algorithm for generating the membership functions in our model is described as follows.
Assume that a factor that affects duration, A, has numerical value x. The numerical value of attribute A for all linguistic terms $u \in U$ can be represented by $\mu = X = \{x(u),\ u \in U\}$. We defined the trapezoidal function for each variable as a four-tuple (Mitaim and Kosko, 2001; Kosko, 1994) $(l_j, ml_j, mr_j, r_j)$, where $ml_j \le mr_j \in \mathbb{R}$. The variables $l_j > 0$ and $r_j > 0$ denote the distance of the support of a function to the left and right of $ml_j$ and $mr_j$, the centre of which is $m_j = \frac{1}{2}(ml_j + mr_j)$. The degree to which a crisp value $x$ belongs to the fuzzy set $u_j$, i.e. $\mu_{u_j}(x) \in [0, 1]$, is computed using the membership function:

$$
\mu_{u_j}(x) =
\begin{cases}
1.0 - \dfrac{ml_j - x}{l_j} & \text{if } ml_j - l_j \le x \le ml_j \\
1.0 & \text{if } ml_j \le x \le mr_j \\
1.0 - \dfrac{x - mr_j}{r_j} & \text{if } mr_j < x \le mr_j + r_j \\
0.0 & \text{otherwise}
\end{cases}
\tag{3}
$$

The graphical representation of Eq. (3) is shown in Fig. 3. The membership functions of each of the four
input variables are shown in Fig. 4. Fig. 5 depicts the membership function of the output variable described
in Table 6.
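The trapezoidal membership function of Eq. (3) can be written as a short Python sketch. The argument names follow the four-tuple (l, ml, mr, r) defined above; the numerical parameters in the usage lines are hypothetical and are not the values behind Fig. 4.

```python
def trapezoid(x, l, ml, mr, r):
    """Degree of membership of crisp value x in a trapezoidal fuzzy set.

    l and r are the widths of the rising and falling slopes; the plateau
    spans [ml, mr]. Requires l > 0, r > 0 and ml <= mr.
    """
    if ml <= x <= mr:                 # plateau
        return 1.0
    if ml - l <= x < ml:              # rising edge
        return 1.0 - (ml - x) / l
    if mr < x <= mr + r:              # falling edge
        return 1.0 - (x - mr) / r
    return 0.0                        # outside the support


# Hypothetical "Medium" term over a normalised universe of discourse [0, 1].
medium = dict(l=0.2, ml=0.4, mr=0.6, r=0.2)
for value in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(value, round(trapezoid(value, **medium), 3))
```

Three such functions with overlapping supports, one per linguistic term, would partition the normalised range of a numerical factor in the way described in this section.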

Table 5
Factors affecting syllable duration

| No. | Affecting factor | Type | Values/fuzzy terms |
| --- | --- | --- | --- |
| 1. | Length of word in which the syllable occurs | Numerical | Short, Medium, Long |
| 2. | Position of the syllable in the word | Numerical | Initial, Medial, Final |
| 3. | Position of the word in the sentence | Numerical | Initial, Medial, Final |
| 4. | Value of f0 peak of syllable | Numerical | Low, Mid, High |
| 5. | Structure of preceding syllable | Categorical | CV, V, CVn, Vn, N |
| 6. | Structure of target syllable | Categorical | Blank/pause, CV, V, CVn, Vn, N |
| 7. | Structure of following syllable | Categorical | Blank/pause, CV, V, CVn, Vn, N |

Fig. 3. Graphical representation of membership function.

Fig. 4. Membership function of continuous duration affecting factors. (a) Membership function for word length. (b) Membership function for position of syllable in word. (c) Membership function for position of word in sentence. (d) Membership function for peak f0 values of tone.

6.2.2. Building the FDT

We used the FID3.3 software developed by Janikow (2004) for building the FDT. The variables and parameters required for implementing our FDT-based duration model are defined in Table 7. From our training data set of 60 SY statement sentences (cf. Section 3.2), we generated a set of 250 data items. Each data item corresponds to a syllable in a sentence and it comprises the values of the seven factors listed in Table 5. The data set is split into two disjoint parts: 220 data items were used to build our FDT model and the remaining 30 were used for cross validation. Out of the 220 data items, we first built our FDT using 200 items. To obtain an optimum tree, the resulting tree was then pruned using the remaining 20 data items. The pruning process is described in Section 6.2.3. The algorithm depicted in Fig. 6 implements the tree building process.


Fig. 5. Membership function for the output.

Table 6
Syllable duration predicted

| No. | Predicted output | Fuzzy restrictions |
| --- | --- | --- |
|  | Degree to which the syllable is stretched or compressed | Increase, Decrease |

Table 7
A summary of our FDT variables, functions and parameters

| Variable/function/parameter | Description |
| --- | --- |
| $V_i$ | A variable to represent one, i.e. the $i$th, of the duration affecting factors |
| $V^i_p$ | A fuzzy term $p$ defined for variable $V_i$ (e.g. $V^{\mathrm{Word\_Length}}_{\mathrm{Short}}$) |
| $\mu()$ | The membership function $\mu_{v_i}(x)$ for variable $V_i$ defined over the crisp input $u$. It determines how the crisp value for variable $V_i$ satisfies the restriction [$V_i$ is $v^i_j$]. E.g. $\mu^{\mathrm{Word\_Length}}_{\mathrm{Long}}(x)$ determines the degree to which the value $x$ satisfies the fuzzy restriction [Word_Length IS Long]. The derivation of the membership functions is explained in Section 6.2.1 |
| $f_1()$ | An aggregation function that combines the level of satisfaction of the fuzzy restrictions of the conjunctive antecedent |
| $f_2()$ | A function that propagates the satisfaction of the antecedent to the consequence |
| $X^N_j$ | The membership of example $e_j$ in the node $N$. It is computed incrementally using $f_0$ and $f_1$ |
| $X^N$ | $\{X^N\}$ is the set of memberships in node $N$ for all training examples |
| $D_i$ | Fuzzy set for the input variable $V_i$. E.g. $D_i$ = {Short, Medium, Long} for $V_i$ = Word_Length |
| $\lvert D_i \rvert$ | Cardinality of fuzzy set $D_i$, i.e. the number of linguistic terms defined over $V_i$. For all of our input variables, $\lvert D_i \rvert = 3$ |
| $P^N_k$ | Example count for decision $V^c_k \in D_c$ in node $N$ |
| $P^{N \mid u^i\ \mathrm{unknown}}$ | The total count of examples in node $N$ with unknown values for $V_i$ |
| $P^{N \mid V^i_p}$ | The total count of examples in node $N$ with $V_i = V^i_p$ |
| $I^{N \mid V^i_p}$ | The information content in node $N$ with $V_i = V^i_p$ |
| $V^N$ | The set of attributes appearing on the path leading to node $N$ |
| $G^N_i$ | Information gain computed as $I^N - I^{S^N_{V_i}}$ |

A number of trees were generated by varying the parameters for running the FID3.3 program. The fuzzy decision tree shown in Fig. 7 illustrates the structure of such trees. Each nonterminal node of the FDT contains: (i) the attribute used to split the node (i.e. Attr) and (ii) the total example count for each decision (i.e. increase and decrease) in the node. The two values in the terminal nodes indicate the example counts for each of the two possible decisions.
The example count $X^N_j$ is computed as the membership of example $e_j$ in $N$. It implies the membership in the multidimensional fuzzy set defined by the fuzzy restrictions found in $F^N$. It is computed incrementally using the functions $\mu()$ (cf. Eq. (3)) and $f_1()$ as explained in Table 7.


Fig. 6. FDT building algorithm.

The information gain is used to determine the candidate input factors that will be used to partition the data set. To determine the factor that would create an optimal partition of the data, we compute the weighted information content for each factor affecting duration. To compute the information gain, we first compute the standard information content of a factor, $I^N$ (cf. Table 7). The weighted information content of the factor over the speech data, adjusting for missing values, i.e. $I^{S^N_{V_i}}$, is also computed. The difference of these two values (i.e. $I^N - I^{S^N_{V_i}}$) is the information gain for the factor under consideration. The data is partitioned based on the factor that has the highest information gain. This partitioning process is repeated until the remaining data items do not yield a unique classification.
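The information gain computation described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: information content is taken to be Shannon entropy over (possibly fractional) example counts, each child is weighted by its share of the summed child counts, and the adjustment for missing values is omitted. The function names and the counts are hypothetical.

```python
import math


def information(counts):
    """Entropy (in bits) of a list of possibly fractional class counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(node_counts, child_counts):
    """Node information minus the weighted information of its children.

    child_counts holds one list of class counts per fuzzy term of the
    candidate factor. Because fuzzy terms overlap, the child totals need
    not add up to the node total, so weights are normalised by their sum.
    """
    child_totals = [sum(c) for c in child_counts]
    grand_total = sum(child_totals)
    weighted = sum((t / grand_total) * information(c)
                   for c, t in zip(child_counts, child_totals) if t > 0)
    return information(node_counts) - weighted


# Hypothetical counts for [Increase, Decrease] in a node and its children.
print(information_gain([10.0, 10.0], [[10.0, 0.0], [0.0, 10.0]]))  # perfect split
print(information_gain([10.0, 10.0], [[5.0, 5.0], [5.0, 5.0]]))    # uninformative split
```

The factor whose partition gives the largest value of this difference would be chosen to split the node.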
As shown in Fig. 7, the position of word in sentence, $W^{PoS}_{tag}$, produced the highest information gain over the entire data set and it is at the root of the FDT. The path along the Final linguistic term defined over $W^{PoS}_{tag}$ leads directly to a terminal node whose example count for decreasing the syllable duration (i.e. Dec = 0.22) is far less than that for increasing the syllable duration (i.e. Inc = 106.21). This shows that the duration of a syllable in the word at the final position of a sentence has a very high degree of increase. This tree pattern confirms the well-known final lengthening phenomenon in SY (Connell and Ladd, 1990). The high degree of increase that the final syllable undergoes caused the partition to end for syllables at the final position as predicted by the FDT. The exact amount of the increase that the syllable duration will undergo is computed by the defuzzification process explained in Section 6.2.4. Note that the tree in Fig. 7 is built using only those duration affecting factors with numerical values.


Fig. 7. FDT for numerical values duration factor.

6.2.3. Fuzzy decision tree pruning

Since the FID3.3 program does not incorporate a pruning algorithm, the decision trees generated above vary in size and structure. This influences the performance of both the tree and the fuzzy rules that will be extracted from it. There is the need to prune the decision tree generated in order to achieve optimal performance. In order to evaluate the efficiency of the decision trees, we applied the T-measure developed by Mitra et al. (2002). The criteria underlying the T-measure are:

(1) The shallower the depth of the tree, the better it is since it will take less time to reach a decision.
(2) The presence of an unresolved terminal node is undesirable.
(3) The distribution of labelled leaf nodes at different depths affects the performance of the tree. A tree whose frequently-accessed leaf nodes are at shallower depths is more efficient in terms of time.
The T-measure for a decision tree is computed using Eq. (4).
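$$
T = \frac{2n - \sum_{i=1}^{N_{lnodes}} w_i d_i}{2n - 1}
\tag{4}
$$

$$
w_i =
\begin{cases}
\dfrac{N_i}{N} & \text{for a resolved leaf node} \\
\dfrac{2N_i}{N} & \text{otherwise}
\end{cases}
\tag{5}
$$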

where $n = 7$ is the number of attributes of a pattern, $d_i$ is the depth of a leaf node, $N_{lnodes}$ is the number of terminal (leaf/unresolved) nodes, $N = 200$ is the total number of patterns in the training set and $N_i$ is the total number of training patterns that percolate down to the $i$th leaf node. The value of T lies in the interval [0, 1). A value of 0 for T is undesirable and a value close to 1 signifies a good decision tree. Using this measure, we select the best decision tree among those generated by the FID3.3 software discussed above. Fig. 8 shows the resulting tree.
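A minimal Python sketch of the T-measure is given below, assuming the form T = (2n − Σ w_i d_i) / (2n − 1), with w_i = N_i / N for a resolved leaf and 2N_i / N otherwise, as in Eqs. (4) and (5). The list of leaves is a hypothetical example and does not describe any of the trees generated in this work.

```python
def t_measure(leaves, n_attributes, n_patterns):
    """T-measure of a decision tree.

    leaves is a list of (depth, patterns_reaching_leaf, resolved) tuples,
    one per terminal node. Unresolved leaves are penalised by a doubled weight.
    """
    weighted_depth = 0.0
    for depth, n_i, resolved in leaves:
        weight = n_i / n_patterns if resolved else 2 * n_i / n_patterns
        weighted_depth += weight * depth
    return (2 * n_attributes - weighted_depth) / (2 * n_attributes - 1)


# Hypothetical tree: three resolved leaves over N = 200 patterns, n = 7 attributes.
example_leaves = [(1, 100, True), (2, 60, True), (2, 40, True)]
print(round(t_measure(example_leaves, n_attributes=7, n_patterns=200), 4))
```

A shallow tree whose heavily used leaves sit near the root, as in this example, scores close to 1; marking a leaf as unresolved lowers the score.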


Fig. 8. Fuzzy decision tree for the duration model.

6.2.4. Applying FDT to duration modelling

The solution provided by FDT is based on estimates made at all leaf nodes of the tree. The final decision is obtained by a collection of alternative decision paths that branch out from the root node and end at a leaf node of the FDT (Suarez and Lutsko, 1999).
To compute the effect on syllable duration of a given factor, $e_s$, the FDT algorithm evaluates the succession of tests from the root node, following a path that is determined by the result of those tests at each of the internal nodes. Eventually this path leads to one terminal node, say $t_l$. The degree, $F^N_l$, to which the duration sample $e_s$ belongs to the leaf node $t_l$ is then computed. The $F^N_l$ values for all paths that start from the root to a leaf node are computed in this manner. The final prediction is made by combining or aggregating these values. For any given vector of factors that affect duration, the value of the predicted duration modifier is equal to the weighted average of the $F^N_l$ values given by each of the leaves. The weight of a given leaf in the average is the degree of membership of the example to the leaf in question. The computation of the final duration modification factor is achieved by the defuzzification process.
FID3.3 provides a number of defuzzification schemes for achieving this goal. They include: (i) the best majority class, (ii) centre of gravity, (iii) maximum majority class. We adopt the best majority class scheme in our model because it produces better accuracy.
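The weighted-average aggregation over leaves described above can be illustrated with the following sketch. It shows only the generic weighted average; it is not the FID3.3 code, it does not reproduce the best majority class scheme that we adopted, and the leaf values and memberships are invented for illustration.

```python
def aggregate_leaves(leaf_values, leaf_memberships):
    """Weighted average of leaf values.

    Each leaf is weighted by the degree to which the example belongs to it.
    """
    total = sum(leaf_memberships)
    if total == 0:
        raise ValueError("the example does not reach any leaf")
    weighted_sum = sum(v * m for v, m in zip(leaf_values, leaf_memberships))
    return weighted_sum / total


# Hypothetical duration modifiers at two leaves and the example's membership in each.
modifier = aggregate_leaves([0.5, -0.2], [0.75, 0.25])
print(round(modifier, 3))
```

Here the example belongs mostly to the first leaf, so the aggregated modifier lies closer to that leaf's value.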
7. CART duration model
In order to compare the performance of the FDT-based duration model with the standard CART method, we implemented a CART-based duration model using the Edinburgh University CART building software Wagon (Black et al., 1999). The development of a duration model based on CART involves building a tree by training it on the input (i.e. affecting factors)-output (i.e. syllable duration modifier) data collected in respect of speech duration. The tree building algorithm successively divides the feature space to minimise the prediction error in duration values. After the tree construction phase, a relatively large tree $T_{max}$ is obtained. Some branches of $T_{max}$ are successively pruned resulting in a sequence of trees. The best among these trees is selected using a test sample that is independent of the training sample. That results in a tree with optimal performance. The pruning process is done automatically.


Our CART model was built using the same set of training data and test data as in the development of the FDT. The sets are presented in the form $\{(x_n, y_n);\ n = 1, 2, \ldots, N\}$, where $x_n$ are the feature vectors of the corresponding affecting factors and $y_n$ are the scaling values for syllable duration. The content of the input file is shown in Fig. 9.
The variables PSylType, TSylType and FSylType correspond to the structure of the preceding, target and following syllables, respectively. The variables NumberOfSyllable, PositionOfSyllable, and PositionOfWord are the number of syllables in the word (word length), the position of the syllable in the word, and the position of the word in the sentence, respectively. The variable DegOfIncren is the dependent variable and it is the degree of stretch or compression of the target syllable. F0Value is the f0 peak of the target syllable. The f0 curve on each citation syllable was stylised using a third degree polynomial (d'Alessandro and Mertens, 1995). The peak of the stylised curve (i.e. the F0Value in our CART input description file) is taken as a numerical value to represent the tone of the syllable. That value (when compared with discrete tone types) gives more information about the tone on the syllable.
The tree building process starts with the tree consisting of only the root node $t_1$ containing all cases. The task is to find the optimal binary split of the data. For a real-valued feature $i$, all splits of the form $x_{ni} < s$ are tested, where $s$ denotes a predefined threshold value. For an M-valued categorical feature $i$, the splits have the form $x_i \in h$, where $h$ goes through all subsets of the set of all possible values of feature $i$. The best split across all features is selected and the data in the root node is split into left and right nodes, i.e. $(t_L, t_R)$. This procedure is applied recursively to all descendants until a stopping condition is fulfilled.
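The search for the best binary split on a single real-valued feature can be sketched as below. The sketch makes two assumptions that are not taken from Wagon: candidate thresholds are the observed feature values, and the quality of a split is the summed squared error around the mean of each side. The data values are invented.

```python
def squared_error(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_threshold(xs, ys):
    """Return (threshold, error) for the split x < s with the lowest error."""
    best = (None, float("inf"))
    for s in sorted(set(xs))[1:]:          # skip the smallest value: empty left side
        left = [y for x, y in zip(xs, ys) if x < s]
        right = [y for x, y in zip(xs, ys) if x >= s]
        error = squared_error(left) + squared_error(right)
        if error < best[1]:
            best = (s, error)
    return best


# Hypothetical word positions and duration scale factors.
positions = [0.1, 0.2, 0.8, 0.9]
scales = [-0.2, -0.2, 0.5, 0.5]
print(best_threshold(positions, scales))
```

Repeating this search over every feature and then recursing into the two resulting subsets gives the greedy tree-growing procedure described above.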
The CART tree is built in an incremental fashion. We set aside some of the training data for cross validation. The tree building process begins with a small stop value of 8. The stop value is the minimum number of
samples required in a tree partition before a split is attempted. The stop values are varied during each iteration
of the tree building process. During each iteration, the generated tree is pruned back to where it best matches
the set aside data. We have used the stop values 8, 9, 10 and 12 and found that the stop value of 9 gave an
optimum tree.
We expect our CART-based duration model to predict the value of the scale factors for the duration of a
syllable which is then used to compute the realised duration. The syllable duration is calculated by the
equation:
$$
\mathrm{Duration} = \mathrm{Duration}_c + \mathrm{Duration}_r \times \mathrm{PrecdScale}
\tag{6}
$$

where $\mathrm{Duration}_c$ and $\mathrm{Duration}_r$ are the canonical and realised duration, respectively. PrecdScale is the predicted scaling factor for compressing/stretching the syllable duration. For example, if PrecdScale is −0.25, the syllable is reduced by 25% of its original duration. If PrecdScale equals 1.0, the syllable duration is doubled (cf. Eq. (1)). A typical tree for the numerical factors that affect duration, generated by the CART, is shown in Fig. 10. The optimal tree generated by the CART algorithm, comprising all the duration affecting factors, is shown in Fig. 11.

Fig. 9. CART input description file.

Fig. 10. CART Tree for numerical duration affecting factors.

Fig. 11. Optimal CART for duration model.

8. Evaluation and discussion

In terms of theoretical computational complexity, the CART model should outperform our FDT model. That is because, in the worst case, i.e. with completely overlapping subsets, the complexity of building a balanced fuzzy decision tree will be $O(\|GS\|^2 \times \|a\|)$, where $\|GS\|$ is the number of learning instances used for building the tree and $\|a\|$ is the number of candidate attributes. This evaluates to $O(\|300\|^2 \times \|7\|) = O(6.3 \times 10^5)$ for our FDT model. That is significantly worse than $O(\|GS\| \log \|GS\| \times \|a\|)$, i.e. $O(\|300\| \log \|300\| \times \|7\|) = O(5.2 \times 10^3)$, which is the complexity of building a crisp decision tree. Also, the search for an optimal dichotomy will be significantly more demanding in FDT than in the crisp discretisation procedure (Boyen and Wehenkel, 1999; Olaru and Wehenkel, 2003).
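The two operation counts quoted above can be checked with a few lines of Python. The base-10 logarithm is an inference made here because it reproduces the quoted figure of about 5.2 × 10^3; the variable names are illustrative.

```python
import math

n_instances = 300   # learning instances, as quoted in the text
n_attributes = 7    # candidate attributes

# Worst-case cost of building a balanced fuzzy decision tree.
fuzzy_cost = n_instances ** 2 * n_attributes

# Cost of building a crisp decision tree (base-10 logarithm assumed).
crisp_cost = n_instances * math.log10(n_instances) * n_attributes

print(fuzzy_cost)          # 630000
print(round(crisp_cost))   # about 5202
```

The ratio between the two counts is roughly two orders of magnitude, which is the gap the comparison above refers to.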
However, the theoretical evaluation does not necessarily correlate with practical performance. To assess the practical performance of the models, we carried out qualitative and quantitative evaluations on both duration models. We have used both the training and test data sets discussed in Section 3 for our evaluations. For the quantitative evaluation, we applied the Root Mean Square Error (RMSE) (Hermes, 1998; Clark and Dusterhoff, 1999) and the Pearson's correlation of the actual versus predicted duration for the two models. The transcription accuracy and the Mean Opinion Scores (MOS) (Donovan, 2003; Sakurai et al., 2003) were used to evaluate the intelligibility and naturalness, respectively, in the qualitative evaluation.


8.1. Quantitative evaluation

The quantitative evaluation provides a performance index of how the model fits the data. A high correlation and low RMSE indicate a good fit. The results of the quantitative evaluation of the two duration models are shown in Fig. 12. When considering the quantitative evaluation results from individual syllable types, the FDT-based duration model produces lower RMSE and higher correlation for the CV and N type syllables from the training data set. For example, while the FDT model produced an RMSE of 15.11 ms and a correlation of 0.91 for the training set for CV type syllables (see Fig. 12(a) and (b)), the CART model produced an RMSE of 17.65 ms and a correlation value of 0.87 (see Fig. 12(c) and (d)). This pattern is repeated for the N type syllables where the FDT model (RMSE = 10.51 ms, Corr = 0.91) is better than the CART model (RMSE = 10.99 ms, Corr = 0.87). When the overall duration database is considered, the CART model (RMSE = 13.92 ms, Corr = 0.88) performs slightly better than the FDT model (RMSE = 14.12 ms, Corr = 0.87) in training data but the FDT model (RMSE = 17.59 ms, Corr = 0.79) outperforms the CART model (RMSE = 22.15 ms, Corr = 0.75) on test data. We observed that the difference in this quantitative performance is consistent but relatively small. Our results confirm the observations from Riley (1992) that CART is weak in extrapolating from known to unknown contexts accurately.
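For reference, the two quantitative measures used here can be computed as in the following sketch. The function names and the sample durations are hypothetical and are not drawn from our evaluation data.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson(actual, predicted):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    covariance = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return covariance / math.sqrt(var_a * var_p)


# Hypothetical syllable durations in milliseconds.
observed = [100.0, 120.0, 150.0, 180.0]
estimated = [110.0, 115.0, 140.0, 190.0]
print(round(rmse(observed, estimated), 2), round(pearson(observed, estimated), 3))
```

A lower RMSE and a correlation closer to 1 indicate that the predicted durations track the observed ones more closely.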
To put our evaluation results in the context of contemporary work on duration modelling for other languages, we have included the quantitative results of those models in Table 8. The results show that our
FDT and CART models compare well with other state-of-the-art models. However, it is well-known that
quantitative results need not correspond to perceptual quality of the synthesised speech. In order to establish
the practical performance of our models, we performed qualitative evaluations.

Fig. 12. Quantitative evaluation of the FDT and CART based duration models.


Table 8
Quantitative results for various duration models (based on test set results)
Language

Model type

RMSE (ms)

Corr.

American English (van Santen, 1994)

Korean (Lee and Oh, 1999)
Korean (Chung, 2002)
Czech (Batusek, 2002)
Mandarin (Chen et al., 2003)
Mandarin (Chen et al., 2003)
Mandarin (Lin et al., 2003)
Our SY FDT model
Our SY CART model

SOP
CART
CART
CART
Regression
Hybrid statistical/regression
Recurrent fuzzy neural network
FDT
CART

22.00
25.11
20.30
15.47
11.18
20.16
17.59
22.15

0.90
0.82
0.77
0.79

0.72
0.75

8.2. Qualitative evaluation

Our preliminary qualitative evaluation involves a measure of how closely the perceptual quality of the synthesised speech mimics that of natural speech in terms of intelligibility and naturalness. The same training and test sets used for the quantitative evaluation were also used for the qualitative evaluation.
Nineteen naïve adult native SY speakers were invited to participate in the qualitative tests. To ascertain their hearing ability, they were all subjected to an initial screening process. This process involves playing some natural speech sounds to them and asking them to write down what they heard. Those who failed to produce 100% accuracy in this test were excluded from the evaluation experiment. Other participants were removed because their responses were inconsistent. For example, some of them rated the quality of some synthetic speech higher than the natural speech. As a result, a total of seven participants were removed and 12 participated in the final qualitative evaluations. Each of the 12 participants took about 45 min to complete the evaluation. The intelligibility evaluation was done first, and after a 5 min break the naturalness evaluation followed.
In carrying out the qualitative evaluations, we used two kinds of stimuli: modified and unmodified (Wu and Chen, 2001; Sakurai et al., 2003). The unmodified stimuli are naturally produced utterances recorded without any modification to the acoustic data. The modified stimuli are versions of the same naturally produced utterances in which the duration data have been replaced by those generated by our FDT- and CART-based models. The duration tier manipulation for the modified stimuli was achieved using the Praat speech processing software. In all, 90 stimuli were created: 30 each for the natural (unmodified) speech, the FDT model and the CART model. The Pitch Synchronous Overlap and Add (PSOLA) method was then used to synthesise the utterances (Moulines and Charpentier, 1990).
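The duration substitution can be pictured as computing, for each syllable, the ratio between the model-predicted duration and the naturally produced one, and passing those ratios to the resynthesis tool as a relative-duration contour. The Python sketch below illustrates only that bookkeeping. The function name, the `edge` margin and the sample boundaries are assumptions introduced for illustration; the exact Praat steps used in the study are not described in the text.

```python
from typing import List, Tuple


def duration_tier_points(
    boundaries: List[float],
    predicted: List[float],
    edge: float = 0.001,
) -> List[Tuple[float, float]]:
    """Return (time, relative_duration) points, two per syllable.

    boundaries: syllable boundary times in seconds (n + 1 values for n syllables).
    predicted:  model-predicted duration of each syllable in seconds.
    edge:       hypothetical margin that keeps each pair of points just inside
                its syllable, so the ratio stays constant across the syllable.
    """
    if len(boundaries) != len(predicted) + 1:
        raise ValueError("need one more boundary than predicted durations")
    points = []
    for start, end, target in zip(boundaries, boundaries[1:], predicted):
        natural = end - start
        if natural <= 2 * edge:
            raise ValueError("syllable too short for the chosen edge margin")
        ratio = target / natural  # > 1 stretches, < 1 compresses
        points.append((start + edge, ratio))
        points.append((end - edge, ratio))
    return points


# Made-up example: three natural syllables and their predicted durations.
points = duration_tier_points([0.00, 0.20, 0.45, 0.60], [0.25, 0.20, 0.15])
for time, ratio in points:
    print(f"{time:.3f}\t{ratio:.3f}")
```

Placing two points per syllable is one possible way to approximate a step-like contour when the tool interpolates between points; other layouts are equally plausible.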
8.3. Intelligibility evaluation
For our intelligibility test, only the speech sounds synthesised using computed syllable durations were played to each participant. After a speech sound was played, the participant was asked to repeat what they heard. The transcription error, in terms of the number of syllables in the original sentence that were wrongly identified by the participant, was recorded. Our intelligibility evaluation is very rigorous in that the participants were

Table 9
Results for the intelligibility evaluation

| Data set | Duration model | Intelligibility score | Significance |
|---|---|---|---|
| Training set | FDT | 4.50 (0.63) | Not significant (p > 0.05) |
| | CART | 3.80 (0.67) | |
| Test set | FDT | 4.10 (0.71) | Not significant (p > 0.05) |
| | CART | 3.60 (0.58) | |

Standard deviations are shown in parentheses.


not only required to identify the tones on each synthesised utterance, but also required to accurately identify the syllables associated with each tone. We then converted the transcription error into an intelligibility score using Eq. (7).

\[
\text{Intelligibility} = \frac{T_{All} - T_{Wrong}}{T_{All}} \times 5.0 \qquad (7)
\]
where $T_{All}$ is the total number of syllables in a sentence and $T_{Wrong}$ is the number of syllables that had been wrongly identified.
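As a quick check of Eq. (7), the score can be computed directly from the two syllable counts. The snippet below is a minimal sketch; the function name and the example counts are illustrative assumptions rather than values from the study.

```python
def intelligibility_score(total_syllables: int, wrong_syllables: int) -> float:
    """Eq. (7): fraction of correctly identified syllables, scaled to 0-5."""
    if total_syllables <= 0:
        raise ValueError("a sentence must contain at least one syllable")
    if not 0 <= wrong_syllables <= total_syllables:
        raise ValueError("wrong_syllables must lie between 0 and total_syllables")
    return (total_syllables - wrong_syllables) / total_syllables * 5.0


# Made-up example: an 8-syllable sentence with 2 syllables misidentified.
print(intelligibility_score(8, 2))  # 3.75
```

A sentence with no errors scores 5.0 and a sentence with every syllable wrong scores 0.0, so the measure shares its upper bound with the 1–5 naturalness scale.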
The results of the intelligibility tests are shown in Table 9. For the FDT-based duration model, a transcription accuracy of 4.50 (SD 0.63) was obtained for the training set and 4.10 (SD 0.71) for the test set. The CART-based duration model produced a transcription accuracy of 3.80 (SD 0.67) for the training set and 3.60 (SD 0.58) for the test set.
We used the sign test (Anderson et al., 2002; Rana et al., 2005) to assess the statistical difference in the perceived quality of the synthetic speech produced by the two models. The results show that the intelligibility scores for these duration models are not significantly different (p > 0.05). These results indicate that the listeners have no significant preference for the intelligibility of the synthetic speech produced using the FDT-based model over that of the CART-based model.
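For readers unfamiliar with the sign test, the sketch below shows one common form of it: a two-sided exact test on paired scores, in which ties are discarded and the smaller count of signs is compared against a Binomial(n, 0.5) distribution. The paired scores are hypothetical and are not the data collected in this study, and the pairing unit (listener or stimulus) is an assumption.

```python
from math import comb
from typing import Sequence


def sign_test(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sided exact sign test p-value for paired scores a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have equal length")
    plus = sum(1 for x, y in zip(a, b) if x > y)
    minus = sum(1 for x, y in zip(a, b) if x < y)
    n = plus + minus  # ties are discarded
    if n == 0:
        return 1.0
    k = min(plus, minus)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-listener scores for 12 listeners (not the paper's data).
fdt = [4.5, 4.0, 5.0, 4.5, 3.5, 4.0, 4.5, 5.0, 4.0, 4.5, 3.5, 4.0]
cart = [4.0, 4.5, 4.5, 4.0, 4.0, 4.0, 4.0, 4.5, 4.5, 4.0, 3.5, 4.5]
p = sign_test(fdt, cart)
print(f"p = {p:.3f}")
```

With these made-up scores there are six positive and four negative differences, so the test finds no significant preference, which mirrors the kind of outcome reported in Table 9.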
8.4. Naturalness evaluation
For the naturalness test, the participants were asked to rate the naturalness of each utterance on a scale of 1–5, as shown in Table 10. The results for the training set (see Table 11) show that the naturalness of the unmodified speech, with a mean opinion score (MOS) of 5.0, is higher than that of the CART (3.4) and FDT (3.7) models. A sign test shows that the unmodified speech is highly preferred (p ≤ 0.001) by the listeners when compared with the speech generated using the CART-based duration model.
Although the FDT has a higher MOS, the naturalness of speech synthesised using the FDT-based duration model is not significantly (p > 0.05) better than that of the CART model. This indicates that, for the training data set, there is no evidence that the listeners preferred the synthetic speech generated using the FDT-based duration model over that generated using the CART-based model.
A similar pattern is repeated for the test set when the naturalness of the unmodified speech is compared with that of the synthesised speech produced by the two duration models (see Table 12). However, the naturalness of the synthetic speech generated by the FDT model is significantly (p ≤ 0.05) preferred over that

Table 10
Qualitative evaluation scores

| Value | Description |
|---|---|
| 5 | Perfect, indistinguishable from natural speech quality |
| 4 | Very good |
| 3 | Average |
| 2 | Poor |
| 1 | Weak or not acceptable |

Table 11
Results for naturalness evaluation (Training set)

| Comparison | Duration model | MOS score | Significance |
|---|---|---|---|
| Natural vs. CART | Natural | 5.0 (0.001) | Significant (p ≤ 0.001) |
| | CART | 3.4 (0.440) | |
| Natural vs. FDT | Natural | 5.0 (0.001) | Significant (p ≤ 0.001) |
| | FDT | 3.7 (0.130) | |
| FDT vs. CART | FDT | 3.7 (0.130) | Not significant (p > 0.05) |
| | CART | 3.4 (0.440) | |


Table 12
Results for naturalness evaluation (Test set)

| Comparison | Duration model | MOS score | Significance |
|---|---|---|---|
| Natural vs. CART | Natural | 4.9 (0.01) | Significant (p ≤ 0.001) |
| | CART | 3.1 (0.54) | |
| Natural vs. FDT | Natural | 4.8 (0.02) | Significant (p ≤ 0.001) |
| | FDT | 3.4 (0.39) | |
| FDT vs. CART | FDT | 3.4 (0.39) | Significant (p < 0.05) |
| | CART | 3.1 (0.54) | |

of the CART model. We can therefore conclude that, for the test data set, the synthetic speech generated using
the FDT based duration model is relatively more natural.
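A mean opinion score is simply the average of the listeners' ratings on the 1–5 scale of Table 10. The sketch below shows the calculation together with a sample standard deviation as a measure of spread. The ratings are hypothetical, and treating the spread as a sample standard deviation is an assumption, since the naturalness tables do not state what their parenthesised values represent.

```python
from statistics import mean, stdev
from typing import Iterable, Tuple


def mos(ratings: Iterable[int]) -> Tuple[float, float]:
    """Return (mean opinion score, sample standard deviation) for 1-5 ratings."""
    values = list(ratings)
    if len(values) < 2:
        raise ValueError("need at least two ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in values):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(values), stdev(values)


# Hypothetical ratings from 12 listeners for one system (not the paper's data).
score, spread = mos([4, 3, 4, 3, 3, 4, 4, 3, 4, 3, 4, 3])
print(f"MOS = {score:.1f} (SD {spread:.2f})")
```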
8.5. Discussion
The training and test data sets contain similar combinations of factors that affect duration (cf. Section 3). Our evaluation results show that, when compared with our FDT model, the CART model performs better for the training data set in the quantitative evaluation since it has a higher correlation and lower RMSE. That implies that CART models the duration data in the training data set more accurately. However, CART's lower accuracy on the test data set suggests that FDT is more capable of extrapolating from the training data to a new or unknown data set. On the other hand, our qualitative evaluations show that the FDT model performed better than the CART model on the training data set (although not significantly) and marginally, though significantly (p < 0.05), better than CART on the test data set (cf. Tables 11 and 12).
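The two objective statistics referred to here, RMSE and correlation, can be computed as in the following sketch. It assumes that the correlation is Pearson's product-moment coefficient between observed and predicted durations; the duration values are made up for illustration.

```python
from math import sqrt
from typing import Sequence


def rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error between observed and predicted durations."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return sqrt(squared / len(observed))


def pearson(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Pearson correlation between observed and predicted durations."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_o * var_p)


# Made-up syllable durations and model predictions.
observed = [180.0, 220.0, 150.0, 300.0]
predicted = [170.0, 230.0, 160.0, 290.0]
print(f"RMSE = {rmse(observed, predicted):.2f}")
print(f"r = {pearson(observed, predicted):.3f}")
```

A model can score well on both statistics for the training data and still generalise poorly, which is why the same statistics are compared on the held-out test set.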
Our results are in line with the findings that the objective performance statistics (i.e. RMSE and correlation) do not always predict the subjective perception judgements correctly (Brinckmann and Trouvain, 2003). The results of our analysis can be interpreted from two perspectives. First, it is well known that CART is very good at modelling training data accurately, but poor at extrapolating to unknown data (Riley, 1992). The results of our quantitative evaluations confirm this fact.
Second, based on our results, we can speculate that FDT is able to capture some salient aspects of the duration data that have greater perceptual significance. That speculation is based on the fact that our FDT model exploits linguistically meaningful terms in the partitioning of the factors affecting the duration variables. These linguistic terms are determined based on a subjective treatment of how the duration variables influence the perception of the speech sound. This leads to the inclusion of factors according to their degree of relevance to the overall duration pattern predicted by the FDT-based model. Hence, all of the potential factors that affect duration are taken into account, with different weights in the interval [0.0, 1.0]. On the other hand, CART performs a binary partitioning of the variables that represent the factors affecting duration. That results in an all-or-nothing situation whereby a factor is either included or rejected.
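The contrast between the two partitioning styles can be made concrete with a small sketch. It fuzzifies one duration factor, the length of the word containing the syllable, into the linguistic terms Short, Medium and Long, and compares the result with a single crisp threshold. The breakpoints (2, 4 and 6 syllables), the triangular and shoulder shapes, and the function names are all assumptions chosen for illustration; they are not the membership functions used in the paper.

```python
def triangular(x: float, left: float, peak: float, right: float) -> float:
    """Triangular membership: 0 at left and right, 1 at peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify_word_length(n_syllables: float) -> dict:
    """Degrees of membership in Short / Medium / Long (hypothetical breakpoints)."""
    short = max(0.0, min(1.0, (4 - n_syllables) / 2))  # 1 up to 2 syllables, 0 from 4
    medium = triangular(n_syllables, 2, 4, 6)
    long_ = max(0.0, min(1.0, (n_syllables - 4) / 2))  # 0 up to 4 syllables, 1 from 6
    return {"Short": short, "Medium": medium, "Long": long_}


def crisp_word_length(n_syllables: float, threshold: float = 4) -> str:
    """Binary CART-style split: the value falls entirely on one side."""
    return "Short" if n_syllables < threshold else "Long"


for n in (2, 3, 5):
    print(n, crisp_word_length(n), fuzzify_word_length(n))
```

A three-syllable word is simply "Short" under the crisp split, whereas the fuzzy partition assigns it a weight of 0.5 to both Short and Medium, so both terms contribute to the prediction.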
Furthermore, our corpus only contains short to medium length SY sentences (i.e. sentences with 6–24 syllables). Our results indicate that FDT performs slightly better than CART in modelling short to medium length SY sentences. Nonetheless, the present investigation is biased towards sentences of that kind. A comparison with the KLATT duration model suggests that the CART model tends to perform better when modelling duration for long sentences (i.e. >30 syllables) (Brinckmann and Trouvain, 2003), so our evaluation results may be different when long sentences are considered.
9. Conclusion
We have presented duration models based on a Fuzzy Decision Tree (FDT) and a Classification And Regression Tree (CART) in the context of prosody modelling for SY text-to-speech synthesis. Since the duration modelling is syllable-based, we first carried out a set of exploratory analytical experiments to determine the most important factors affecting the duration of SY syllables. The results of those experiments led to the selection of seven duration factors, which were then used to produce the duration models. The durations of citation versus contextual syllables are used to predict a scale factor by which the duration of a citation syllable is multiplied to reflect the perceptual quality of its contextual equivalent.
Results of our qualitative and quantitative evaluations show that CART models the training data more accurately than FDT. The FDT model, however, is better at extrapolating from the training data since it produced better accuracy for the test data set. Synthesised speech produced by the FDT duration model was also ranked higher on quality than that of the CART model. These results confirm the well-known fact that CART possesses very good interpolation but poor extrapolation capabilities (Breiman et al., 1984; Barbosa and Bailly, 1994). The good extrapolation capability of FDT makes it an ideal model for implementing duration for TTS applications, given the sparseness of duration data.
We also observed that the expressiveness of FDT is better than that of CART. This is because the representation in FDT is not restricted to a set of piecewise or discrete constant approximations. In addition, fuzzification of the input data imposes a continuity constraint at the boundaries of node splits in the FDT. This acts as a mechanism to limit the degree of overfitting of the FDT. Furthermore, fuzzification and global optimisation provide a continuous representation with the flexibility necessary to reproduce duration patterns at a finer granularity.
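The continuity remark can be illustrated with a single split. In the sketch below, a crisp stump returns one of two constant leaf values and therefore jumps at the split point, whereas a soft stump blends the two leaves according to a membership degree that rises linearly across the boundary. The leaf values, the split position and the `width` of the transition are hypothetical, and the sketch does not reproduce the tree induction or the global optimisation used for the FDT.

```python
def crisp_stump(x: float, split: float, left_value: float, right_value: float) -> float:
    """Piecewise-constant prediction: jumps at the split point."""
    return left_value if x < split else right_value


def fuzzy_stump(
    x: float, split: float, width: float, left_value: float, right_value: float
) -> float:
    """Soft split: membership in the right branch rises linearly from 0 to 1
    over [split - width, split + width]; the output blends both leaves."""
    if width <= 0:
        raise ValueError("width must be positive")
    mu_right = min(1.0, max(0.0, (x - (split - width)) / (2 * width)))
    mu_left = 1.0 - mu_right
    return mu_left * left_value + mu_right * right_value


# Hypothetical leaves: scale factor 0.8 left of the split, 1.2 right of it.
for x in (3.0, 3.9, 4.0, 4.1, 5.0):
    crisp = crisp_stump(x, 4.0, 0.8, 1.2)
    soft = fuzzy_stump(x, 4.0, 1.0, 0.8, 1.2)
    print(f"x={x:.1f}  crisp={crisp:.2f}  fuzzy={soft:.2f}")
```

Near the split, a small change in the input moves the crisp output by the full difference between the leaves, while the blended output changes only slightly, which is the sense in which fuzzification smooths the boundary.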
According to our qualitative and quantitative evaluations, CART produces better objective results than FDT, but FDT produces non-significantly better subjective results. This shows that neither model is precise enough to be clearly distinguished from the other. One may therefore speculate that, for both modelling techniques, the objective results are not well correlated with subjective scores. However, further work is required to confirm this speculation.
We can conclude, therefore, that the resulting fuzzy decision trees exhibit high comprehensibility, and that fuzzy set and approximate reasoning methods provide a natural means to deal with continuous domains, subjective linguistic terms and noisy data, which are characteristic of the duration dimension of the speech signal. When compared with the CART-based approach, our FDT-based duration model captures some salient aspects of the speech signal that have more perceptual significance. In this regard, the FDT model is more appropriate for modelling duration in the context of TTS applications for the Standard Yorùbá language.
Arguably, the data set used in the present study is relatively small and our corpus contains relatively short statement sentences. However, our data contain all the important duration-affecting factors. Furthermore, this is a preliminary investigation. In order to carry out a more extensive analysis and develop a more robust model, we plan to extend the scope of the sentences to include other domains and sentence modes in future.
References
Akinlabí, A., 1993. Underspecification and phonology of Yorùbá /r/. Linguistic Inquiry 24 (1), 139–160.
Allen, J.B., 1994. How do humans process and recognize speech? IEEE Transactions on Speech & Audio Processing 2 (4), 567–577.
Anderson, D.R., Sweeney, D.J., Williams, T.A., 2002. Statistics for business and economics, eighth ed. South-Western, United Kingdom.
Bámgbóṣé, A., 1990. Fonọ́lọ́jì àti Gírámà Yorùbá. University Press PLC, Ìbàdàn.
Barbosa, P.A., Bailly, G., 1994. Characterisation of rhythmic patterns for text-to-speech synthesis. Speech Communication 15, 127–137.
Batusek, R., 2002. A duration model for Czech text-to-speech synthesis. In: Proceedings of the First International Conference on Speech Prosody, Aix-en-Provence, pp. 167–170. Available from: http://www.ipl.univ-aix.fr/sp2002/pdf/bastusek.pdf. Visited: Sep 2004.
Bellegarda, J.R., Silverman, K.E.A., Lenzo, K., Anderson, V., 2001. Statistical prosody modelling: from corpus design to parameter estimation. IEEE Transactions on Speech & Audio Processing 9 (1), 52–66.
Black, A., Clark, R., King, S., Heiga, Z., Taylor, P., Caley, R., 1999. The Festival speech synthesis system: system documentation, version 1.4.0. Available from: http://www.cstr.ed.ac.uk/projects/festival/manual/festival-25.html#SEC112. Visited: Apr 2004.
Boersma, P., Weenink, D., 2004. Praat, doing phonetics by computer. Available from: http://www.fon.hum.uva.nl/praat/. Visited: Mar 2004.
Boyen, X., Wehenkel, L., 1999. Automatic induction of fuzzy decision trees and its application to power systems security assessment. Fuzzy Sets & Systems 102, 3–19.
Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression Trees. Wadsworth, CA, USA.
Brinckmann, C., Trouvain, J., 2003. The role of duration models and symbolic representation for timing in synthetic speech. International Journal of Speech Technology 6, 21–31.
Campbell, N., 2000. Timing in speech: a multilevel process. In: Prosody: Theory and Experiment. Kluwer, Dordrecht, pp. 281–334.
Campbell, N., Isard, S.D., 1991. Segmental durations in a syllable frame. Journal of Phonetics 19 (1), 37–47.
Carvalho, D.R., Freitas, A.A., 2002. A genetic-algorithm for discovering small-disjunct rules in data mining. Applied Soft Computing 2, 75–88.


Chen, S.-H., Hwang, S.-H., Wang, Y.-R., 1998. An RNN-based prosodic information synthesiser for Mandarin text-to-speech. IEEE Transactions on Speech & Audio Processing 6 (3), 226–239.
Chen, S.-H., Lai, W.H., Wang, Y.-R., 2003. A new duration modelling approach for Mandarin speech. IEEE Transactions on Speech & Audio Processing 11 (4), 308–320.
Chung, H., 2002. Duration models and the perceptual evaluation of spoken Korean. In: International Conference on Speech Prosody, Aix-en-Provence, France, pp. 219–222.
Chung, H., Huckvale, M., 2001. Linguistic factors affecting timing in Korean with application to speech synthesis. In: Proceedings of EuroSpeech'01, Aalborg, Denmark, pp. 815–818.
Clark, R.A.J., Dusterhoff, K.E., 1999. Objective methods for evaluating synthetic intonation. In: Proceedings of the Sixth European Conference on Speech Communication Technology, vol. 4, Budapest, pp. 1623–1626.
Connell, B., Ladd, D.R., 1990. Aspect of pitch realisation in Yorùbá. Phonology 7, 1–29.
Crozier, D.H., Blench, R.M., 1976. An Index of Nigerian Languages, second ed. Summer Institute of Linguistics, Dallas.
d'Alessandro, C., Mertens, P., 1995. Automatic pitch contour stylization using a model of tonal perception. Computer Speech & Language 9, 257–288.
Dong, M., Kothari, R., 2001. Look-ahead based fuzzy decision tree induction. IEEE Transactions on Fuzzy Systems 9 (3), 461–468.
Donovan, R.E., 1996. Trainable Speech Synthesis. PhD thesis, Cambridge University Engineering Department, Cambridge.
Donovan, R.E., 2003. Topics in decision tree based speech synthesis. Computer Speech & Language 17, 43–67.
Ehrich, R.W., Foith, J.P., 1976. Representation of random waveforms by relational trees. IEEE Transactions on Computers C-25 (7), 725–736.
Fletcher, J., McVeigh, A., 1993. Segment and syllable duration in Australian English. Speech Communication 13, 355–365.
Goubanova, O., Taylor, P., 2000. Using Bayesian belief networks for model duration in text-to-speech systems. In: Proceedings of ICSLP2000.
Hermes, D.J., 1998. Measuring the perceptual similarity of pitch contour. Journal of Speech Language and Hearing Research 41, 73–82.
Höhne, H.D., Coker, C., Levinson, S.E., Rabiner, L.R., 1983. On the temporal alignment of sentence of natural and synthetic speech. IEEE Transactions on Speech & Audio Processing ASPP-31 (4), 807–813.
Huang, H.-P., Liang, C.-C., 2002. Strategy-based decision making of a soccer robot system using a real-time self-organising fuzzy decision tree. Fuzzy Sets & Systems 127, 49–64.
Huckvale, M., 2002. Speech synthesis, speech simulation and speech science. In: Proceedings of the International Conference on Speech and Language Processing, Denver, pp. 1261–1264.
Janikow, C.Z., 1998. Fuzzy decision trees: issues and methods. IEEE Transactions on Systems, Man, & Cybernetics 28 (1), 1–14.
Janikow, C.Z., 2004. FID33 fuzzy decision tree. Available from: http://www.cs.umsl.edu/~janikow/fid/fid32/overview.htm. Visited: Jan 2005.
Keller, E., Zellner, B., 1995. A statistical timing model for French. In: XIIIème Cong. Int. Des. Sci. Phon., vol. 3, Stockholm, pp. 302–305.
Klatt, D.H., 1987. Review of text-to-speech conversion for English. Journal of the Acoustical Society of America 82 (3), 737–793.
Kosko, B., 1994. Fuzzy systems as universal approximators. IEEE Transactions on Computers 43 (11), 1329–1333.
Ladd, D.R., 2000. Tones and turning points: Bruce, Pierrehumbert, and the elements of intonation phonology. In: Horne, M. (Ed.), Prosody: Theory and Experiment. Studies presented to Gösta Bruce. Kluwer, Dordrecht, pp. 37–50.
Lee, S., Oh, Y.-H., 1999. Tree-based modelling of prosodic phrasing and segmental duration for Korean TTS systems. Speech Communication 28 (4), 283–300.
Levinson, S.E., 1986. Continuously variable duration Hidden Markov Models for speech analysis. In: Proceedings of IEEE ICASSP, pp. 1241–1244.
Lin, C.-H., Wu, R.-C., Chang, J.-Y., Liang, S.-F., 2003. A novel prosodic-information synthesizer based on recurrent fuzzy neural networks for Chinese TTS system. IEEE Transactions on Systems, Man, & Cybernetics B, 1–16.
Minematsu, N., Kita, R., Hirose, K., 2003. Automatic estimation of accentual attribute values of words for accent sandhi rules of Japanese text-to-speech conversion. IEICE Transactions on Information & Systems E86-D (3), 550–557.
Mitaim, S., Kosko, B., 2001. The shape of fuzzy sets in adaptive function approximation. IEEE Transactions on Fuzzy Systems 9 (4), 637–656.
Mitra, S., Konwar, K.M., Pal, S.K., 2002. Fuzzy decision tree, linguistic rules and fuzzy knowledge-based network: generation and evaluation. IEEE Transactions on Systems, Man, & Cybernetics 32 (4), 328–339.
Möbius, B., 2003. Rare events and closed domains: Two delicate concepts in speech synthesis. International Journal of Speech Technology 6, 57–71.
Moulines, E., Charpentier, F., 1990. Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones. Speech Communication 9 (5–6), 453–467.
Ògúnbọ̀wálé, P.O., 1966. Àṣà Ìbílẹ̀ Yorùbá. University Press Limited, Jericho, Ìbàdàn, Nigeria.
Olaru, C., Wehenkel, L., 2003. A complete fuzzy decision tree technique. Fuzzy Sets & Systems 138, 221–254.
Owólabí, K., 1998. Ìjìnlẹ̀ Ìtúpalẹ̀ Èdè Yorùbá: Fònẹ́tíìkì àti Fonọ́lọ́jì, first ed. Onibonoje Press & Book Industries (Nig.) Ltd., Ìbàdàn.
Ọdẹ́jọbí, O.A., Beaumont, A.J., Wong, S.H.S., 2004a. A computational model of intonation for Yorùbá text-to-speech synthesis: design and analysis. In: Sojka, P., Kopeček, I., Pala, K. (Eds.), Lecture Notes in Artificial Intelligence, Lecture Notes in Computer Science (LNAI 3206). Springer-Verlag, Berlin, pp. 409–416.
Ọdẹ́jọbí, O.A., Beaumont, A.J., Wong, S.H.S., 2004b. Experiments on stylisation of standard Yorùbá language tones, Technical Report KEG/2004/003. Aston University, Birmingham.


Pedrycz, W., Sosnowski, Z.A., 2001. The design of decision trees in the framework of granular data and their application to software quality models. Fuzzy Sets & Systems 123, 271–290.
Quinlan, J.R., 1986. Induction of decision trees. Machine Learning 1, 81–106.
Rana, D.S., Hurst, G., Shepstone, L., Pilling, J., Cockburn, J., Crawford, M., 2005. Voice recognition for radiology reporting: is it good enough? Clinical Radiology 60, 1205–1212.
Riley, M.D., 1992. Tree-based modelling of segmental durations. In: Bailly, G., Benoit, C., Sawallis, T.R. (Eds.), Talking Machines: Theories, Models and Designs. Elsevier, Amsterdam, pp. 265–273.
Sakurai, A., Hirose, K., Minematsu, N., 2003. Data-driven generation of f0 contours using a superpositional model. Speech Communication 40 (4), 535–549.
Shen, X.S., Lin, M., Yan, J., 1993. f0 turning point as an f0 cue to tonal contrast: a case study of Mandarin tones 2 and 3. Journal of the Acoustical Society of America 93 (4), 2241–2243.
Sjölander, K., Beskow, J., 2004. Wavesurfer 1.7. Available from: http://www.speech.kth.se/wavesurfer/. Visited: Jun 2004.
Suarez, A., Lutsko, J.F., 1999. Globally optimal fuzzy decision trees for classification and regression. IEEE Transactions on Pattern Analysis & Machine Intelligence 21 (12), 1297–1311.
Taylor, C., 2000. Typesetting African languages. Available from: http://www.ideography.co.uk/library/afrolingua.html. Visited: Apr 2004.
Vainio, M., 2001. Artificial neural network based prosody models for Finnish text-to-speech synthesis. PhD thesis, Department of Phonetics, University of Helsinki, Helsinki.
van Santen, J.P.H., 1992. Contextual effects on vowel duration. Speech Communication 11 (6), 513–546.
van Santen, J.P.H., 1994. Assignment of segmental duration in text-to-speech synthesis. Computer Speech & Language 8, 95–128.
Wu, C.-H., Chen, J.-H., 2001. Automatic generation of synthesis units and prosody information for Chinese concatenative synthesis. Speech Communication 35, 219–237.
Yuan, Y., Shaw, M.J., 1995. Induction of fuzzy decision trees. Fuzzy Sets & Systems 96, 125–139.
