
Precise Tone Generation For Vietnamese Text-To-Speech System

Vietnamese Text-To-Speech (VieTTS) system is a parametric and rule based speech synthesis system. Tone synthesis of Vietnamese is implemented by using fundamental frequency (F0) patterns and power pattern control. Applying power control for tone synthesis is effective and unique for Vietnamese compared to other tonal languages such as Chinese and Thai.

Uploaded by

thegioiyenbinh
Copyright
© Attribution Non-Commercial (BY-NC)
PRECISE TONE GENERATION FOR VIETNAMESE TEXT-TO-SPEECH SYSTEM

Tu Trong DO, Tomio TAKARA

Department of Information Engineering,


University of the Ryukyus
1 Senbaru, Nishihara, Okinawa, 903-0213 JAPAN

ABSTRACT

We propose a Vietnamese Text-To-Speech (VieTTS) system, which is a parametric, rule-based speech synthesis system. The fundamental speech units of this system are demisyllables with Level tone. VieTTS uses a source-filter model for speech production and a Log Magnitude Approximation (LMA) filter as the vocal tract filter. We chose the Hanoi dialect for VieTTS. Tone synthesis of Vietnamese is implemented by using fundamental frequency (F0) patterns and power pattern control. F0 is the most important factor in Vietnamese tone synthesis, and power control strongly affects the Broken and Drop tones. Applying power control to tone synthesis is effective and unique to Vietnamese compared to other tonal languages such as Chinese and Thai.

1. INTRODUCTION

In spite of the development of speech technology, there has been very little research on Vietnamese speech processing [1, 2, 3], particularly on speech synthesis. In this paper, a Vietnamese Text-To-Speech (VieTTS) system is proposed. VieTTS uses a source-filter model [4] for speech production and a Log Magnitude Approximation (LMA) filter [5] as the vocal tract filter.
Vietnamese is the official language of Vietnam. We chose the Hanoi dialect for VieTTS because it is mainly used for official activities such as education and broadcasting. Vietnamese is a tonal language with six tones: Level (ngang), Falling (huyền), Broken (ngã), Curve (hỏi), Rising (sắc), and Drop (nặng). Vietnamese has more tones than Chinese (four tones) and Thai (five tones). Tones are usually considered as the time patterns of pitch and are synthesized by using fundamental frequency (F0) patterns. In Vietnamese, the Broken and Drop tones are accompanied by a glottal stop [2, 6, 7], which is different from Chinese and Thai. Because of this feature, we propose power control for these two tones, so that they are synthesized by using not only F0 patterns but also power pattern control. The synthesized tones were evaluated by a listening test.

2. VIETNAMESE LANGUAGE'S OVERVIEW

The Vietnamese alphabet consists of 29 letters. In the alphabet, there are seven special characters with diacritic marks.
The six Vietnamese tones are shown in Table 1. Tone is a suprasegmental feature and affects the whole syllable. Different tones give words with the same phoneme structure different meanings. In the Vietnamese writing system, a tone is represented by a diacritic mark. There are a total of six tones, but when a syllable ends with an unvoiced consonant, only the Rising and Drop tones occur.
Among these tones, the Broken and Drop tones are accompanied by a glottal stop [6, 7] or by a glottal constriction [2]. This feature is examined in this paper in the analysis and synthesis of Vietnamese tones.

Table 1 The six Vietnamese tones

Name              Tone mark    Example
LEVEL (ngang)     unmarked     ma – ghost
FALLING (huyền)   grave        mà – that
BROKEN (ngã)      tilde        mã – horse
CURVE (hỏi)       hook above   mả – tomb
RISING (sắc)      acute        má – cheek
DROP (nặng)       dot below    mạ – rice seedling

3. VIETTS SYSTEM

This design is based on the general speech synthesis system [9]. The input is Vietnamese text, and the output is synthetic speech. The text analysis sub-system converts Vietnamese text into a sequence of mapped characters, and this sequence is then used to get information for synthesis. The speech synthesis sub-system generates speech from a pre-stored database under the control of synthesis rules. The database contains data for rules and demisyllable parameters in suitable formats. To make the system more generic, we use external definitions of interval marks, intervals, tone patterns, and a character table code.
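Because each tone is written with its own diacritic mark (Table 1), the tone of a written syllable can be read off its Unicode decomposition. The sketch below illustrates only that idea and is not the VieTTS text analysis, which is described as working on a sequence of mapped characters with an external character table code. The function name `tone_of` and the use of Unicode normalization are assumptions made for this illustration.

```python
import unicodedata

# Unicode combining marks for the five marked tones of Table 1.
# Vowel-quality marks (circumflex, breve, horn) are deliberately not listed.
TONE_MARKS = {
    "\u0300": "Falling",  # combining grave accent
    "\u0303": "Broken",   # combining tilde
    "\u0309": "Curve",    # combining hook above
    "\u0301": "Rising",   # combining acute accent
    "\u0323": "Drop",     # combining dot below
}


def tone_of(syllable):
    """Return the tone name of one written Vietnamese syllable (hypothetical helper)."""
    # NFD splits a precomposed letter into its base letter and combining marks.
    for ch in unicodedata.normalize("NFD", syllable):
        if ch in TONE_MARKS:
            return TONE_MARKS[ch]
    return "Level"  # an unmarked syllable carries the Level tone


for word in ["ma", "mà", "mã", "mả", "má", "mạ"]:
    print(word, tone_of(word))
```

Separating the tone mark from the base letters in this way matches the system's use of Level-tone speech units, with the tone applied afterwards by rule.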

0-7803-7663-3/03/$17.00 ©2003 IEEE I - 504 ICASSP 2003



3.1 Speech analysis and synthesis

The fundamental speech units of VieTTS are demisyllables, which are acquired by dividing a syllable in half with the cut point at the middle of the vowel. There are about 500 demisyllables in Vietnamese. As a speech database, Vietnamese demisyllables are collected and their sounds are prepared by recording on digital audio tape (DAT) at a 48 kHz sampling rate and 16-bit resolution. After that, they are down-sampled to 10 kHz for analysis. All speech units are recorded with Level tone, which is a kind of natural pitch level.

Fig. 1 VieTTS’s speech synthesis sub-system. [Block diagram; labels: Vietnamese Speech Parameters (Pitch, Amplitude, V/UV decision, Cepstrum), Pulse Generator, White Noise Generator, LMA Filter.]

Fig. 2 The average F0 contours of the six Vietnamese tones. [Plot; vertical axis: Log Frequency [Oct.], -1.0 to 1.0; horizontal axis: Time [norm-sec], 0 to 300; curves: Level, Falling, Broken, Curve, Rising, Drop.]

Cepstral analysis
VieTTS adopts short-time cepstral analysis. The frame length is 25.6 ms and the frame interval, or frame shift, is 10 ms. A time-domain Hamming window with a length of 25.6 ms is used in the analysis part.
The cepstrum is defined as the inverse Fourier transform of the short-time log-magnitude spectrum [10].
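The short-time cepstral analysis of Section 3.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the VieTTS code: the frame sizes follow from the 10 kHz analysis rate (25.6 ms = 256 samples, 10 ms = 100 samples), while the function names, the log floor, and the naive dependency-free DFT are choices made for this example. A real system would use an FFT routine.

```python
import cmath
import math

SAMPLE_RATE_HZ = 10_000  # analysis sampling rate given in the text
FRAME_LEN = 256          # 25.6 ms at 10 kHz
FRAME_SHIFT = 100        # 10 ms at 10 kHz


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (illustration only)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    twiddle = [cmath.exp(sign * 2j * math.pi * m / n) for m in range(n)]
    out = [sum(x[t] * twiddle[(k * t) % n] for t in range(n)) for k in range(n)]
    return [v / n for v in out] if inverse else out


def split_frames(signal, length=FRAME_LEN, shift=FRAME_SHIFT):
    """Cut a signal into overlapping frames; a trailing partial frame is dropped."""
    return [signal[i:i + length] for i in range(0, len(signal) - length + 1, shift)]


def real_cepstrum(frame, floor=1e-10):
    """Inverse DFT of the log-magnitude spectrum of a Hamming-windowed frame."""
    windowed = [s * w for s, w in zip(frame, hamming(len(frame)))]
    # The floor (an assumption) avoids log(0) for empty spectral bins.
    log_mag = [math.log(max(abs(v), floor)) for v in dft(windowed)]
    return [v.real for v in dft(log_mag, inverse=True)]
```

With this definition, c[0] is the mean of the log-magnitude spectrum, so multiplying a frame by a gain g adds ln g to c[0] and leaves the other coefficients unchanged. This is the property that makes c[0] a natural handle on signal power.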
Speech synthesis
Fig. 1 shows the structure of the speech synthesis sub-system in VieTTS. For voiced sounds, the excitation is an impulse train created by the pulse generator. These impulses have an interval equal to the pitch period. For unvoiced sounds, the excitation is random noise that has a flat spectrum. The voiced/unvoiced decision controls the switching between the two kinds of excitation generators.

3.2 System rules

Connection
A syllable is constructed from the corresponding demisyllables and a tone.

Interval
The interval rule is defined externally in the database. This makes the VieTTS system more generic and easier to modify. Currently, VieTTS has four kinds of interval marks.

Tones
Tone is strongly related to fundamental frequency (F0). The six Vietnamese tones are analyzed to get F0 patterns. The set of words for analyzing tones is selected under the following conditions: (i) the words are meaningful; (ii) all phonemes are voiced. We selected eight initial consonants, “b”, “d”, “g”, “l”, “m”, “n”, “ng / ngh”, and “v”, and two vowels, “a” and “i / y”. This gave 81 words for the analysis.
After analysis, the F0 contours of each tone are normalized to have the same length and the same pitch level. Fig. 2 shows the average F0 contours of the six Vietnamese tones. The zero level of log frequency here is equal to 128 Hz. As far as we know, contours of the six tones analyzed from multiple Vietnamese data have not been shown in any previous documentation.
Among the six tones, the Broken and Drop tones have a glottal stop feature [2, 6, 7]. Glottal stops in a speech synthesis system have been studied by Takara [11]. By observing the waveforms of Vietnamese tones, we found changes of power at the point at which this feature occurred. From this observation, we propose power control in the VieTTS system for the Broken and Drop tones. The hypothesis is then that a tone is realized by applying an F0 pattern and a power pattern control. Power control is implemented by changing the first cepstral coefficient c[0]. The c[0] parameters of the frames are weighted by some factors to change the signal's power. Fig. 3 and Fig. 4 explain the two power patterns for these two tones. The power control also effectively shortens the length of Drop tone words; a Drop tone word is usually shorter than the others. These two patterns are simple, but they showed very effective results when we evaluated the system with the listening test.
Power control is a new implementation in Vietnamese tone synthesis. This implementation is unique compared to other tonal languages such as Chinese and Thai, since these languages never adopt power control in their tone synthesis; only pitch is examined [12, 13].
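The tone rule of Section 3.2, an F0 pattern combined with a per-frame weighting of c[0], can be sketched as follows. This is a minimal illustration, not the VieTTS implementation. The function names, the step-shaped weight pattern, and the example numbers are hypothetical, since the exact weight curves of Fig. 3 and Fig. 4 are not recoverable from the text. Only the 128 Hz zero level of the log-frequency scale and the idea of weighting c[0] around an adjust point N are taken from the paper.

```python
F0_ZERO_LEVEL_HZ = 128.0  # zero of the log-frequency (octave) scale, from the text


def octaves_to_hz(f0_pattern_oct):
    """Convert an F0 pattern given in octaves relative to 128 Hz into Hz."""
    return [F0_ZERO_LEVEL_HZ * 2.0 ** v for v in f0_pattern_oct]


def step_weights(num_frames, adjust_point, factor):
    """Hypothetical power pattern: 1.0 before the adjust point, `factor` from it on."""
    return [1.0 if i < adjust_point else factor for i in range(num_frames)]


def apply_power_control(frames_cepstra, weights):
    """Return new per-frame cepstra in which only c[0] is scaled by its weight."""
    if len(frames_cepstra) != len(weights):
        raise ValueError("one weight per frame is required")
    return [[w * c[0]] + list(c[1:]) for c, w in zip(frames_cepstra, weights)]


# Usage with made-up values: four frames, power reduced from frame 2 onwards.
cepstra = [[2.0, 0.3, -0.1] for _ in range(4)]
weights = step_weights(4, adjust_point=2, factor=0.5)
controlled = apply_power_control(cepstra, weights)
f0_hz = octaves_to_hz([0.0, 1.0, -1.0])
```

Because c[0] is a log-domain quantity, a multiplicative weight below 1 lowers the power only when c[0] is positive. How VieTTS treats other cases is not stated in the paper, so the sketch simply follows the description that the c[0] values are weighted by factors.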


Fig. 3 Rule for power control in Broken tone. [Plot; vertical axis: Weighting factor, tick labels 1.00, 0.90, 0.85, 0.80; horizontal axis: Time [frame], with the adjust point at frame N.]

Fig. 4 Rule for power control in Drop tone. [Plot; vertical axis: Weighting factor, tick labels 1.00 and 0.10; horizontal axis: Time [frame], with the adjust point at frame N.]

Intonation
The intonation is implemented by applying a simple declination line in the log frequency domain [14].

4. EVALUATION AND DISCUSSION

4.1 Evaluation test

At present, the purpose of the evaluation is to test the tone intelligibility of synthetic Vietnamese syllables with generated tones and to confirm the effect of power control in Vietnamese tone synthesis. The VieTTS system is evaluated through a listening test.
Six types of speech are prepared for the listening test:
• Type 0: Original sounds
• Type 1: Analysis-synthesis sounds
• Type 2: Synthetic sounds: the average F0 pattern (Fig. 2) with power control
• Type 3: Synthetic sounds: the linear F0 pattern (Fig. 5) with power control
• Type 4: Synthetic sounds: the average F0 pattern without power control
• Type 5: Synthetic sounds: the linear F0 pattern without power control

All synthetic sounds use cepstra from speech units with Level tone. Power control here means the control of the c[0] coefficient. The linear F0 patterns [6, 7] are shown in Fig. 5, in which the horizontal axis is time in frames and the vertical axis is frequency in Hz.

Fig. 5 Linear F0 pattern for the listening test. [Plot; vertical axis: 70 to 210 Hz; horizontal axis: 0 to 40 frames; curves: Level, Falling, Broken, Curve, Rising, Drop.]

The word set includes two vowels, “a” and “i”, with initial consonants “m”, “b”, “d” and final consonants “m”, “n”. Sixty items are chosen, ten for each tone. In this list, all sounds are utterable, and most are meaningful words. Since there are 60 sounds for each speech type, the total number of sounds in each listening test is 360. Mixing all six sound types together puts the test in a more natural situation.
In the test, each sound is played once, in random order, and the listener has to choose which word he/she heard within a two-second period. After this period, a warning is displayed to force the listener to make a decision. There are five listeners, four males and one female. All listeners are from northern Vietnam and are in their twenties with normal hearing ability.

4.2 Result and discussion

After collecting the results from the five listeners, we removed the results of the two listeners with the highest and lowest correct rates. The overall result of the listening tests is shown in Fig. 6 and partly in Table 2. Fig. 6 shows the percentage of synthesized tones that are recognized correctly, while Table 2 contains the confusion matrices of this evaluation.
From Fig. 6, we see the correct rates as follows:
• The proposed method (type 2, average F0 with c[0] control) is acceptable, with a correct rate of around 95%.
• The analysis-synthesis sounds (type 1) score 2% lower than the original sounds (type 0). This is thought to be caused by noise introduced during the analysis and/or synthesis procedures.
• Control of c[0], or power control, is effective in Vietnamese tone synthesis. The average F0 with c[0] control (type 2) is 21% higher than the average F0 without c[0] control (type 4). The linear F0 with c[0] control (type 3) is also 17% higher than that without c[0] control (type 5).


• Linear F0 with c[0] control (type 3) does not have very low intelligibility. It is only 9% lower than the average F0 with c[0] control (type 2).

Fig. 6 Correct rate of tone synthesis. [Bar chart; vertical axis: Correct Rate [%], 50 to 100; horizontal axis: Sound type, Type 0 to Type 5.]

Table 2. Confusion matrices of tone synthesis.
Unit: %. Lt: Level tone, Ft: Falling tone, Bt: Broken tone, Ct: Curve tone, Rt: Rising tone, Dt: Drop tone. Rows are the synthesized tones and columns are the recognized tones.

Type     Tone    Lt    Ft    Bt    Ct    Rt    Dt
TYPE 2   Lt     100     0     0     0     0     0
         Ft      23    70     0     0     0     7
         Bt       0     0   100     0     0     0
         Ct       0     0     0   100     0     0
         Rt       0     0     3     0    97     0
         Dt       0     0     0     0     0   100
TYPE 4   Lt     100     0     0     0     0     0
         Ft      20    73     0     0     0     7
         Bt       3     0    60     0    37     0
         Ct       0     0     0   100     0     0
         Rt       0     0    20     0    80     0
         Dt       7    63     0     0     0    30

From the confusion matrices, the error rates of the Broken tone recognized as Rising tone were 0%, 3%, 37%, and 50% for type 2, type 3, type 4, and type 5, respectively. This shows that power control makes the Broken tone dramatically clearer. Similarly, the error rates of the Drop tone recognized as Falling tone were 0%, 0%, 63%, and 70% for type 2, type 3, type 4, and type 5, respectively. These results affirm the dramatic effectiveness of power control on Vietnamese tones.

5. CONCLUSION

We have introduced a Text-To-Speech system for Vietnamese, VieTTS, which is a rule-based synthesis system using a cepstral method whose speech units are demisyllables. The VieTTS system can synthesize speech from Vietnamese text with six precisely generated tones. We have introduced a Text-To-Speech system for the Hanoi dialect, which is used as the standard Vietnamese in this system.
Tone synthesis of Vietnamese is implemented. Four tones (Level, Falling, Curve, and Rising) are synthesized by using fundamental frequency (F0) patterns. To synthesize the other tones (Broken and Drop), which have glottal features, we used both F0 patterns and power pattern control. As a result, we found that F0 is the most important factor in Vietnamese tone synthesis, and power pattern control strongly affects the Broken and Drop tones.

Acknowledgement

This study is supported by a Grant-in-Aid for Scientific Research (C) of the Japan Society for the Promotion of Science, No. 12680419.

REFERENCES

[1] T. T. Doan, “Vietnamese Phonetics”, Hanoi National University Publishing, 1999. (in Vietnamese)
[2] M. Shimizu and M. Dantsuji, “A New Proposal of Laryngeal Features for the Tonal System of Vietnamese”, Proc. of ICSLP, 2, pp. 519-522, 2000.
[3] M. S. Han and K.-O. Kim, “Phonetic variation of Vietnamese tones in disyllabic utterances”, Jour. of Phonetics, 2, pp. 223-232, 1974.
[4] S. Furui, “Digital Speech Processing, Synthesis, and Recognition”, Second Edition, Marcel Dekker, Inc., pp. 30-31, 2001.
[5] S. Imai, “Log Magnitude Approximation (LMA) filter”, Trans. of IECE Japan, J63-A, 12, pp. 886-893, 1980. (in Japanese)
[6] T. T. Doan, “Vietnamese Phonetics”, Hanoi National University Publishing, pp. 100-111, 1999. (in Vietnamese)
[7] B. N. Ngo, “Elementary Vietnamese”, Tuttle Publishing, p. 27, 1999.
[8] B. N. Ngo, “Elementary Vietnamese”, Tuttle Publishing, p. 17, 1999.
[9] T. Takara and T. Kochi, “General Speech Synthesis System for Japanese Ryukyu Dialect”, Proc. of the 7th WestPRAC, pp. 173-176, 2000.
[10] S. Furui, “Digital Speech Processing, Synthesis, and Recognition”, Second Edition, Marcel Dekker, Inc., pp. 62-66, 2001.
[11] T. Takara, “Experimental Study on Perception of the Glottal Explosive of the Japanese Ryukyu Dialect”, Proc. of EuroSpeech'95, pp. 953-956, 1995.
[12] C.-H. Wu and J.-H. Chen, “Automatic generation of synthesis units and prosodic information for Chinese concatenative synthesis”, Jour. of Speech Communication, 35, pp. 219-237, 2001.
[13] P. Seresangtakul and T. Takara, “Analysis of Pitch Contour of Thai Tone Using Fujisaki's Model”, Proc. of ICASSP'02, 1, pp. 505-508, 2002.
[14] T. Takara and J. Oshiro, “Continuous Speech Synthesis by Rule of Ryukyu Dialect”, Trans. IEE of Japan, 108-C, 10, pp. 773-780, 1988. (in Japanese)
