Automatic Speech Recognition

Automatic speech recognition is the task of getting a computer to understand spoken language, either by reacting appropriately or by converting speech to text. Humans do this through the ear and brain, which process the sound waves produced during articulation. Computers do it by digitizing the acoustic signal, analyzing it acoustically, and matching it against a phoneme dictionary and a language model. Multilingual speech recognition systems use techniques such as universal speech models, language identification (LID) classifiers, and monolingual speech recognizers with dynamic confidence scoring to recognize multiple languages. The end-to-end multilingual ASR system has client, frontend, and backend components, including an LID backend, a speech recognizer backend, a web search backend, and a voice synthesizer backend. HMM-

Uploaded by

Mayank Kulkarni

What is the task?

Getting a computer to understand spoken language
By "understand" we might mean:
- React appropriately
- Convert the input speech into another medium, e.g. text

How do humans do it?

Articulation produces sound waves, which the ear conveys to the brain for processing.

How do computers do it?

- Acoustic waveform (acoustic signal)
- Digitization
- Acoustic analysis of the speech signal
- Phoneme dictionary
- Language model
- Speech recognition
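The digitization and acoustic analysis steps can be sketched roughly as follows. This is only an illustration under stated assumptions: the signal is a synthetic 440 Hz tone rather than real speech, the sample rate, frame length, and hop are arbitrary choices, and log energy plus zero-crossing rate stand in for the richer features a real recognizer would compute. All function names are hypothetical.

```python
import math

def synthesize_tone(freq_hz, duration_s, sample_rate):
    """Digitize a pure tone by sampling a sine wave at a fixed rate."""
    n = int(duration_s * sample_rate)
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]

def frame_signal(samples, frame_len, hop_len):
    """Split the samples into overlapping fixed-length frames."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop_len)]

def frame_features(frame):
    """Return (log energy, zero-crossing rate) for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return math.log(energy + 1e-10), crossings / (len(frame) - 1)

sample_rate = 8000                                    # assumed sampling rate in Hz
samples = synthesize_tone(440.0, 0.1, sample_rate)    # 0.1 s of audio
frames = frame_signal(samples, frame_len=200, hop_len=80)  # 25 ms frames, 10 ms hop
features = [frame_features(f) for f in frames]        # one feature vector per frame
```

Each entry of `features` plays the role of one fixed-size acoustic vector that later stages (phoneme dictionary, language model) would consume.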

Multilingual Architecture

Multilingual speakers already outnumber monolingual speakers.
The capacity to transparently recognize multiple spoken languages is a desirable feature of ASR systems, e.g. OK Google, Siri.

Multilingual Techniques

- Universal Speech Model
- Language Identification (LID) classifiers
- Monolingual speech recognizers decode along with LID (confidence score)
- Dynamic confidence score and LID decision
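One way to picture combining recognizer confidence with the LID decision is sketched below. The slides do not say how the two scores are combined, so the linear interpolation, the `lid_weight` parameter, the function name, and all the numbers here are assumptions made purely for illustration.

```python
def pick_language(lid_posteriors, recognizer_results, lid_weight=0.5):
    """Combine LID posterior and recognizer confidence, return the best language.

    lid_posteriors: language -> probability from a (hypothetical) LID classifier
    recognizer_results: language -> (transcript, confidence) from each
        monolingual recognizer run in parallel
    """
    scored = {}
    for lang, (_transcript, confidence) in recognizer_results.items():
        lid_score = lid_posteriors.get(lang, 0.0)
        # Assumed combination rule: weighted average of the two scores.
        scored[lang] = lid_weight * lid_score + (1.0 - lid_weight) * confidence
    best = max(scored, key=scored.get)
    return best, recognizer_results[best][0], scored[best]

lid_posteriors = {"en": 0.7, "hi": 0.3}
recognizer_results = {
    "en": ("<English hypothesis>", 0.6),
    "hi": ("<Hindi hypothesis>", 0.4),
}
language, transcript, score = pick_language(lid_posteriors, recognizer_results)
```

With these made-up numbers the English recognizer wins, since both its LID posterior and its decoding confidence are higher.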

ASR Multilingual Design

The end-to-end multilingual speech recognition system consists of the following components:
1. Client
2. Frontend
-Recognize
-Recognize+Search+Synthesis
-Multi-recognize+Search+Synthesis
3. Backend
-LID Backend
-Speech Recognizer Backend
-Web Search Backend
-Voice Synthesizer Backend

Multirecognizer Module


Representation of Speech & Speech Signal

Grammar & Syntax
- How the occurrence of words in sequence is governed

Lexicon or Dictionary
- How a word is supposed to be pronounced as a sequence of unitary sounds

Acoustic-phonetics
- How a unitary sound and/or a sequence of unitary sounds is supposed to be produced with the articulatory apparatus
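The grammar and lexicon levels can be illustrated with two small lookup structures. This is a minimal sketch: the two-word vocabulary, the ARPAbet-style phone symbols, the single allowed word pair, and the function names are all illustrative assumptions, not taken from the slides.

```python
# Lexicon: each word maps to a sequence of unitary sounds (phones).
LEXICON = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

# Grammar: which word may follow which (a toy word-pair grammar).
GRAMMAR = {("speech", "recognition")}

def words_to_phones(words, lexicon):
    """Expand a word sequence into its phone sequence using the lexicon."""
    phones = []
    for word in words:
        if word not in lexicon:
            raise KeyError(f"word not in lexicon: {word}")
        phones.extend(lexicon[word])
    return phones

def is_allowed(words, grammar):
    """Check that every adjacent word pair is permitted by the grammar."""
    return all(pair in grammar for pair in zip(words, words[1:]))

phones = words_to_phones(["speech", "recognition"], LEXICON)
```

The acoustic-phonetic level would then attach an acoustic model to each phone symbol in `phones`.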

THE HIDDEN MARKOV MODEL

The input audio waveform from a microphone is converted into a sequence of fixed-size acoustic vectors $Y_{1:T} = y_1, \ldots, y_T$ in a process called feature extraction [3]. The decoder then attempts to find the sequence of words $w_{1:L} = w_1, \ldots, w_L$ which is most likely to have generated $Y$, i.e. the decoder tries to find
$$\hat{w} = \arg\max_{w} \{ P(w \mid Y) \}.$$
However, since $P(w \mid Y)$ is difficult to model directly, Bayes' rule is used to transform the above equation into the equivalent problem of finding
$$\hat{w} = \arg\max_{w} \{ p(Y \mid w) \, P(w) \},$$
where the denominator $p(Y)$ is dropped because it does not depend on $w$. The likelihood $p(Y \mid w)$ is given by the acoustic model and the prior $P(w)$ by the language model.
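The Bayes decision rule above can be sketched as a search over an explicit candidate list. This is a toy illustration: a real decoder searches a very large hypothesis space rather than two strings, and the candidate sentences, scores, and the `decode` function are invented for the example. Scores are combined in the log domain to avoid numerical underflow.

```python
import math

def decode(candidates, acoustic_loglik, lm_prob):
    """Return the candidate w maximizing log p(Y|w) + log P(w)."""
    def score(w):
        return acoustic_loglik[w] + math.log(lm_prob[w])
    return max(candidates, key=score)

candidates = ["recognize speech", "wreck a nice beach"]

# Made-up acoustic log-likelihoods log p(Y|w): the second fits slightly better.
acoustic_loglik = {"recognize speech": -10.0, "wreck a nice beach": -9.5}

# Made-up language model priors P(w): the first is far more plausible.
lm_prob = {"recognize speech": 1e-2, "wreck a nice beach": 1e-4}

best = decode(candidates, acoustic_loglik, lm_prob)
```

Here the language model prior outweighs the small acoustic advantage of the second hypothesis, so `best` is the first candidate.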


Architecture of HMM-Based Recognizer


The overall HMM-based speech recognition system includes:

- Feature Analysis
- Unit Matching System
- Lexical Decoding
- Syntactic Analysis
- Semantic Analysis


Phoneme and Topologies


Composite HMM for Viterbi Recognition (Pronunciation Dictionary)
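Viterbi recognition over an HMM can be sketched as below. The example assumes a three-state left-to-right topology with discrete observation symbols; the state names, probabilities, and observation alphabet are invented for illustration, whereas a real recognizer would use continuous acoustic vectors and a composite HMM built from the pronunciation dictionary.

```python
import math

def log(p):
    """Log probability, mapping zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log probability."""
    # best[t][s]: best log score of any path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]

states = ["s1", "s2", "s3"]
# Left-to-right topology: each state may loop or move to the next state.
trans = {"s1": {"s1": 0.6, "s2": 0.4, "s3": 0.0},
         "s2": {"s1": 0.0, "s2": 0.6, "s3": 0.4},
         "s3": {"s1": 0.0, "s2": 0.0, "s3": 1.0}}
emit = {"s1": {"a": 0.8, "b": 0.1, "c": 0.1},
        "s2": {"a": 0.1, "b": 0.8, "c": 0.1},
        "s3": {"a": 0.1, "b": 0.1, "c": 0.8}}
start = {"s1": 1.0, "s2": 0.0, "s3": 0.0}

log_trans = {p: {s: log(v) for s, v in row.items()} for p, row in trans.items()}
log_emit = {s: {o: log(v) for o, v in row.items()} for s, row in emit.items()}
log_start = {s: log(v) for s, v in start.items()}

path, log_prob = viterbi(["a", "a", "b", "b", "c"], states,
                         log_start, log_trans, log_emit)
```

For this observation sequence the recovered `path` stays in each state for as long as its dominant symbol is observed, then advances to the next state.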

