Precise Tone Generation For Vietnamese Text-To-Speech System
[Figure: weighting factor (vertical axis) against time [frame] (horizontal axis), with the adjust point N marked.]
Fig. 4 Rule for power control in Drop tone.

Intonation
The intonation is implemented by applying a simple declination line in the log frequency domain [14].

4. EVALUATION AND DISCUSSION

4.1 Evaluation test
At present, the purpose of the evaluation is to test the tone intelligibility of synthetic Vietnamese syllables with generated tones and to assess the effect of power control in Vietnamese tone synthesis. The VieTTS system is evaluated through a listening test.
Six types of speech are prepared for the listening test:
• Type 0: Original sounds
• Type 1: Analysis-synthesis sounds
• Type 2: Synthetic sounds: the average F0 pattern (Fig. 2) with power control
• Type 3: Synthetic sounds: the linear F0 pattern (Fig. 5) with power control
• Type 4: Synthetic sounds: the average F0 pattern without power control
• Type 5: Synthetic sounds: the linear F0 pattern without power control

All synthetic sounds use cepstra from speech units with Level tone. Power control here means the control of the c[0] coefficient. Linear F0 patterns [6,7] are shown in Fig. 5, in which the horizontal axis is time in frames and the vertical axis is frequency in Hz.

[Fig. 5: linear F0 patterns of the six tones (Level, Falling, Broken, Curve, Rising, Drop); horizontal axis: time [frame]; vertical axis: frequency [Hz], with tick labels from 70 to 210.]

The word set includes two vowels, “a” and “i”, with initial consonants “m”, “b”, “d” and final consonants “m”, “n”. Sixty items are chosen, ten for each tone. In this list, all sounds are pronounceable, and most are meaningful words. Since there are 60 sounds for each speech type, each listening test contains 360 sounds in total. Mixing all six sound types together puts the test in a more natural situation.
In the test, each sound is played once, in random order, and the listener then has to choose which word he/she heard within a two-second period. After this period, a warning is displayed to force the listener to make his/her decision. There are five listeners, four males and one female. All listeners are from northern Vietnam, are in their twenties, and have normal hearing.

4.2 Result and discussion
After collecting the results from the five listeners, we removed the results of the two listeners with the highest and lowest correct rates. The overall result of the listening tests is shown in Fig. 6 and partly in Table 2. Fig. 6 shows the percentage of synthesized tones that are recognized correctly, while Table 2 gives the confusion matrices of this evaluation.
From Fig. 6, we see the correct rates as follows:
• The proposed method (type 2, average F0 with c[0] control) is acceptable, with a correct rate of around 95%.
• The analysis-synthesis sounds (type 1) score 2% lower than the original sounds (type 0). This is thought to be caused by noise introduced during the analysis and/or synthesis procedures.
• Control of c[0], or power control, is effective in Vietnamese tone synthesis. The average F0 with c[0] control (type 2) is 21% higher than the average F0 without c[0] control (type 4). The linear F0 with c[0] control (type 3) is also 17% higher than that without c[0] control (type 5).
• Linear F0 with c[0] control (type 3) does not have very low intelligibility. It is only 9% less than average F0 with c[0] control (type 2).

[Fig. 6: correct rate [%] (vertical axis, 50 to 100) for each sound type (Type 0 to Type 5).]

Hanoi dialect, which is used as the standard Vietnamese in this system.
Tone synthesis of Vietnamese is implemented. Four tones (Level, Falling, Curve and Rising) are synthesized by using fundamental frequency (F0) patterns. To synthesize the other tones (Broken and Drop), which have glottal features, we used both F0 patterns and power pattern control. As a result, we found that F0 is the most important factor in Vietnamese tone synthesis, and that power pattern control strongly affects the Broken and Drop tones.
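
The scoring procedure of Section 4.2 (discard the listeners with the highest and lowest correct rates, then compute the correct rate and a tone confusion matrix) can be illustrated with a minimal sketch. The function names, the data layout (a list of (intended tone, heard tone) pairs per listener) and the choice to pool the remaining listeners' responses before computing the rate are assumptions made for illustration only; the paper does not state how the rates were aggregated or how ties between listeners were handled.

```python
from collections import Counter

# Tone names as used in the paper; the ordering is an arbitrary choice here.
TONES = ["Level", "Falling", "Broken", "Curve", "Rising", "Drop"]


def correct_rate(responses):
    """Return the percentage of (intended, heard) pairs whose tones agree."""
    if not responses:
        raise ValueError("no responses to score")
    hits = sum(1 for intended, heard in responses if intended == heard)
    return 100.0 * hits / len(responses)


def trim_listeners(per_listener):
    """Drop the listener with the lowest and the one with the highest rate.

    per_listener maps a (hypothetical) listener id to that listener's
    list of (intended, heard) pairs.
    """
    if len(per_listener) < 3:
        raise ValueError("need at least three listeners to trim two")
    ranked = sorted(per_listener, key=lambda name: correct_rate(per_listener[name]))
    return {name: per_listener[name] for name in ranked[1:-1]}


def pool(per_listener):
    """Concatenate the responses of all listeners (assumed aggregation)."""
    return [pair for responses in per_listener.values() for pair in responses]


def confusion_matrix(responses):
    """Rows are intended tones, columns are heard tones, both in TONES order."""
    counts = Counter(responses)
    return [[counts[(intended, heard)] for heard in TONES] for intended in TONES]
```

In use, one such dictionary would be built per sound type (Type 0 to Type 5), so that `correct_rate(pool(trim_listeners(data)))` gives one value per type and `confusion_matrix` gives the corresponding tone confusions.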
Acknowledgement