
Unit 2

Sound/Audio System
Syllabus:
Concept of sound system, music and speech, speech analysis, speech
transformation

Prepared by: Er. Hemanta Bohora


Introduction
Sound is a type of energy that travels through matter in waves, usually through air, but
also through liquids and solids. At its core, sound is produced when an object vibrates,
causing the surrounding molecules to vibrate as well. This vibration travels in waves until
it reaches a listener's ear, where it’s perceived as sound.
• Definition: Sound is energy that moves in waves through a medium like air, water, or solids.
• Frequency: Determines pitch; higher frequency means a higher pitch, and lower frequency
means a lower pitch.
• Creation: It’s generated by vibrations in an object, which cause surrounding molecules to vibrate
and transmit the sound wave.
• Amplitude: Relates to volume; higher amplitude results in louder sounds, while lower amplitude
results in quieter sounds.
• Human Perception: Sound waves reach the ear, causing the eardrum to vibrate, which the
brain interprets as sound.
• Medium Requirement: Sound needs a medium (like air or water) to travel; it cannot move
through a vacuum.
• Field of Study (Acoustics): Acoustics is the science of sound, exploring its production,
transmission, and effects in various applications.
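The two properties above, frequency (pitch) and amplitude (loudness), can be made concrete with a short sketch that samples a pure tone. This is an illustrative example, not part of the original notes; the function name and the 8 kHz sample rate are arbitrary choices.

```python
import math

def sine_wave(freq_hz, amplitude, duration_s, sample_rate=8000):
    """Generate samples of a pure tone.

    freq_hz   -> perceived pitch (higher frequency = higher pitch)
    amplitude -> perceived loudness (higher amplitude = louder)
    """
    n = int(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * freq_hz * t / sample_rate)
            for t in range(n)]

# A 440 Hz tone (concert pitch A) at half amplitude, 10 ms long:
tone = sine_wave(440, 0.5, 0.01)
```

Changing `freq_hz` shifts the pitch up or down; changing `amplitude` makes the tone louder or quieter without altering its pitch.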
Speech Generation
• Speech Generation is the process of creating spoken language from
text or other forms of input.
• It encompasses technologies like text-to-speech (TTS) systems,
conversational AI, and voice cloning, and is critical in applications such
as virtual assistants, accessibility tools, and robotics.
• The field has seen significant advancements with the introduction of
neural network-based models like WaveNet and Tacotron, which
produce highly natural, expressive speech.
• Challenges remain in making speech generation more natural, context-
sensitive, and scalable, especially for multilingual or emotionally varied
speech. Despite these challenges, speech generation continues to
improve, enhancing human-computer interaction and accessibility
across numerous industries.
Techniques Used in Speech Generation
1. Concatenative Synthesis:
   - A popular method for generating speech where prerecorded human speech segments (like
     syllables or phonemes) are stitched together to form sentences.
   - Limitations: While this approach can produce highly natural-sounding speech, it requires a
     large database of recorded voice data and may sound robotic if not done correctly.
2. Parametric Synthesis:
   - Speech generation models that create audio waves based on parameters like pitch,
     duration, and voice quality. Formant synthesis and HMM-based synthesis are examples of
     parametric synthesis methods.
   - Limitations: Though efficient, the speech generated is often less natural-sounding
     compared to concatenative synthesis.
3. Neural Network-Based Models:
   - WaveNet (by DeepMind): A deep neural network that directly generates raw audio
     waveforms, achieving much higher naturalness in speech than traditional methods.
   - Tacotron and Tacotron 2: Text-to-speech systems that convert text into spectrograms,
     which are then turned into speech waveforms by another model (such as a vocoder).
   - Prosody Prediction: Neural networks are also employed to predict and generate
     natural-sounding prosody (intonation, pitch, rhythm) in generated speech.
4. End-to-End Models:
   - Recent advances involve end-to-end neural models that directly generate speech from
     text without separate steps like phonetic transcription, linguistic analysis, or
     waveform generation. These models learn the entire process in one unified pipeline.
   - Example: FastSpeech and FastSpeech 2 are end-to-end systems that produce speech with
     high efficiency and naturalness.
5. Voice Cloning:
   - Voice cloning generates speech that mimics a specific person's voice using a limited
     amount of recorded data. This is typically done with deep learning models that capture
     the unique characteristics of a person's voice and speech patterns.
   - Applications: Personalized virtual assistants, content creation, and accessibility
     tools.
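The core idea of concatenative synthesis (technique 1 above) can be sketched in a few lines: look up prerecorded unit waveforms and stitch them end to end. The phoneme labels and sample values below are made up purely for illustration; a real system would store thousands of recorded units and smooth pitch and duration at the joins.

```python
# Toy "unit database": each phoneme maps to a placeholder recorded waveform.
UNIT_DB = {
    "k":  [0.1, 0.3, 0.2],
    "ae": [0.5, 0.6, 0.5, 0.4],
    "t":  [0.2, 0.1],
}

def synthesize(phonemes):
    """Concatenate prerecorded unit waveforms into one output waveform."""
    out = []
    for p in phonemes:
        out.extend(UNIT_DB[p])  # a real system would crossfade/smooth here
    return out

wave = synthesize(["k", "ae", "t"])  # units for the word "cat"
```

The limitation noted in the text is visible even here: output quality depends entirely on having the right units recorded, and abrupt joins between units are what make naive concatenation sound robotic.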
Basic Notations:
- The lowest periodic spectral component of the speech signal is called the
  fundamental frequency. It is present in voiced sounds.
- A phone is the smallest speech unit, such as the m of mat and the b of bat in
  English.
- Allophones mark the variants of a phone. For example, the aspirated p of pit and
  the unaspirated p of spit are allophones of the English phoneme /p/.
- A morph is the smallest unit that carries a meaning by itself.
- A voiced sound is generated through the vocal cords; m, v, and l are examples
  of voiced sounds. The pronunciation of a voiced sound depends strongly on the
  individual speaker.
- During the generation of an unvoiced sound, the vocal cords are open; f and s
  are examples of unvoiced sounds.
- Vowels: speech sounds created by the relatively free passage of breath through
  the larynx and oral cavity. Examples: a, e, i, o, and u.
- Consonants: speech sounds produced by a partial or complete obstruction of the
  air stream by any of the various constrictions of the speech organs.
  Examples: m in mother, ch in chew.
Speech Analysis:
Speech analysis/input deals with the research areas which are as follows:

(1) Who?
- Human speech has certain characteristics determined by the speaker. Hence
speech analysis can serve to analyze who is speaking, i.e., to recognize a
speaker for identification and verification.
(2) What?
- Another main task of speech analysis is to analyze what has been said, i.e., to
recognize and understand the speech signal itself.
(3) How?
- Another area of speech analysis researches speech patterns with respect to
how a certain statement was said.
Figure: - Speech recognition system: task division into system components, using the
basic principle “Data Reduction through Property Extraction”
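The principle "data reduction through property extraction" can be illustrated with two classic frame-level features used in early speech analysis: short-time energy and zero-crossing rate. The function names and the sample frame below are illustrative choices, not from the original notes.

```python
def frame_energy(frame):
    """Short-time energy: high for voiced/loud frames, near zero for silence."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs that change sign.
    Tends to be higher for unvoiced (noise-like) sounds than voiced ones."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)
    return crossings / (len(frame) - 1)

# A short frame of (made-up) samples reduced to just two numbers:
frame = [0.2, -0.1, 0.3, -0.4, 0.1, 0.0, -0.2]
features = (frame_energy(frame), zero_crossing_rate(frame))
```

A whole frame of samples is reduced to a couple of descriptive properties, which is exactly the data reduction the figure's caption refers to: later recognition stages work on features, not raw samples.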

Speech Transmission:
- The area of speech transmission deals with efficient coding of the speech
signal to allow speech/sound transmission at low bit rates over networks.
- The goal is to provide the receiver with the same speech/sound quality as was
generated at the sender side.
Some Techniques for Speech Transmission:
(1) Pulse Code Modulation:
A straightforward technique for digitizing an analog signal is pulse code
modulation (PCM). It meets the quality demands of stereo audio signals at the data
rate used for CDs: 44,100 samples/s × 2 bytes/sample × 2 channels = 176,400 bytes/s.
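The CD data rate quoted above follows directly from the PCM parameters; a quick check:

```python
# CD-quality PCM data rate = sample_rate × bytes_per_sample × channels
sample_rate = 44_100     # samples per second (44.1 kHz)
bytes_per_sample = 2     # 16-bit resolution
channels = 2             # stereo

rate = sample_rate * bytes_per_sample * channels
print(rate)  # 176400 bytes/s, matching the figure in the text
```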
(2) Source Encoding:

Figure: - Component of a speech transmission system using source encoding


Source encoding exploits the fact that the original signal has certain
characteristics that can be used for compression before transmission.
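One simple example of exploiting a signal's characteristics, offered here as an illustration rather than the method the notes describe, is delta coding: successive speech samples are similar, so storing sample-to-sample differences yields small numbers that can be coded with fewer bits.

```python
def delta_encode(samples):
    """Store the first sample, then only sample-to-sample differences.
    Speech changes slowly, so the differences are small."""
    out = [samples[0]]
    out.extend(b - a for a, b in zip(samples, samples[1:]))
    return out

def delta_decode(deltas):
    """Invert delta_encode by running-summing the differences."""
    out = [deltas[0]]
    for d in deltas[1:]:
        out.append(out[-1] + d)
    return out

samples = [100, 102, 105, 104, 101]
encoded = delta_encode(samples)        # [100, 2, 3, -1, -3]
assert delta_decode(encoded) == samples  # lossless round trip
```

Real speech codecs (e.g., DPCM and its adaptive variants) build on exactly this idea: encode what is predictable cheaply, and spend bits only on the residual.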
(3) Recognition-Synthesis Method:

Figure: - Component of a recognition Synthesis for speech transmission

This method performs speech analysis at the sender and speech synthesis at the
receiver during reconstruction. The recognized speech elements are encoded as bits
and transmitted over the multimedia system; the data rate defines the quality.
Example:
Calculate the file size in bytes for a 60-second recording at 44.1 kHz, 8-bit
resolution, stereo sound.
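The example can be worked out from the same formula used for the PCM data rate, now multiplied by the duration:

```python
# File size = sample_rate × bytes_per_sample × channels × duration
sample_rate = 44_100    # 44.1 kHz
bytes_per_sample = 1    # 8-bit resolution
channels = 2            # stereo
duration = 60           # seconds

size_bytes = sample_rate * bytes_per_sample * channels * duration
print(size_bytes)  # 5292000 bytes, i.e. about 5.05 MB
```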
Sound Types and Their Number of Channels
- Mono: 1 channel
- Stereo: 2 channels
- Quadraphonic: 4 channels
- 5.1 Surround: 6 channels
