Speech Recognition

The document discusses speech recognition, including the challenges it faces, the disciplines involved, and the paradigms and algorithms used. It describes how speech recognition works by converting sounds into text using algorithms to analyze sounds and determine the most likely words. Challenges include style, environment, speaker characteristics, and task specifics. Relevant disciplines are signal processing, acoustics, pattern recognition, linguistics, computer science, and psychology. The paradigms involve word recognition models and higher-level processors. Algorithms include acoustic models, language models, hidden Markov models, n-grams, neural networks, and speaker diarization.

Uploaded by Dinesh Choudhary

08/09/2022

Speech Recognition
A program's capacity to convert spoken language into written language is known as speech
recognition, also known as automatic speech recognition (ASR), computer voice recognition,
or speech-to-text.
It functions by dissecting a speech recording into its separate sounds, analysing each sound,
using algorithms to determine which words are most likely to fit that sound in the target
language, and then transcribing those sounds into text.
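The decoding loop described above can be sketched as a toy: each sound segment gets a score for every candidate word, and the highest-scoring word is transcribed. The segments, vocabulary, and scores below are invented for illustration; real systems score thousands of hypotheses per segment.

```python
# Hypothetical acoustic scores: likelihood of each candidate word
# given one sound segment (values invented for illustration).
segment_scores = [
    {"hello": 0.8, "yellow": 0.2},   # segment 1
    {"world": 0.7, "word": 0.3},     # segment 2
]

def transcribe(segment_scores):
    """Pick the most likely word for each sound segment."""
    return [max(scores, key=scores.get) for scores in segment_scores]

print(transcribe(segment_scores))  # ['hello', 'world']
```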

Different Challenges in Speech Recognition


1. Style:
Conversational (or casual) speech or read speech; continuous speech or isolated words.
2. Environment:
Background noise, channel conditions, room acoustics.
3. Speaker characteristics:
Rate of speech, accent.
4. Task specifics:
Number of words in the vocabulary, language, and other constraints.

The disciplines that apply to most speech recognition problems:


1. Signal Processing:
The process of extracting relevant information from the speech signal in an efficient
and robust manner. This process characterises the time-varying properties of the
speech signal and applies various types of signal preprocessing and postprocessing
to make the representation of the speech signal robust.
2. Physics (acoustics):
The science of understanding the relationship between the physical speech signal
and the physiological mechanisms, that is, the human vocal tract mechanism, by
which speech is produced and perceived.
3. Pattern recognition:
The set of algorithms used to cluster data to create prototypical patterns and to
compare a pair of patterns on the basis of feature measurement.
4. Communication and information theory:
The methods for detecting the presence of a particular speech pattern, a set of
coding and decoding algorithms used to search a large but finite grid for the best
path corresponding to a “best” recognized sequence of words.
5. Linguistics:
The relationship between sounds (phonology), words in a language (syntax),
meaning of spoken words (semantics), and sense derived from the meaning
(pragmatics).
6. Physiology:
Understanding of the higher-order mechanisms within the human central nervous
system that account for speech production and perception in human beings.
7. Computer science:
The study of efficient algorithms for implementing, in software and hardware, the
various methods used in a practical speech recognition system.
8. Psychology:
The science of understanding the factors that enable a technology to be used by
human beings in practical tasks.

Paradigm for Speech Recognition

The paradigm consists of the following components:

Word recognition model:


First the speaker's spoken output is recognized.
Then the speech signal is decoded into a series of words that are meaningful
according to syntax, semantics, and pragmatics.

Higher-level processor:
The meaning of the recognized words is obtained.
The processor uses a dynamic knowledge representation to modify the syntax,
semantics, and pragmatics according to the context of what has previously been
recognized.
Feedback from the higher-level processor reduces the complexity of the recognition
model by limiting the search to valid input sentences from the user.
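The feedback loop above can be illustrated with a minimal sketch: the higher-level processor rejects candidate word sequences that violate the current grammar, shrinking the search space the recognizer must consider. The grammar table and candidate sentences are invented for illustration.

```python
# Hypothetical grammar: which words may follow each word
# (a stand-in for real syntactic/semantic constraints).
valid_next = {"order": {"the", "a"}, "the": {"pizza", "book"}}

def prune(candidates):
    """Keep only candidates whose consecutive word pairs the grammar allows."""
    def is_valid(words):
        return all(b in valid_next.get(a, set()) for a, b in zip(words, words[1:]))
    return [c for c in candidates if is_valid(c.split())]

print(prune(["order the pizza", "order pizza the", "the pizza"]))
# ['order the pizza', 'the pizza']
```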
Speech recognition algorithms/models
Various algorithms and computation techniques are used to recognize speech into text and
improve the accuracy of transcription. Both acoustic modelling and language modelling are
important parts of modern statistically based speech recognition algorithms. Hidden Markov
models (HMMs) are widely used in many systems. Language modelling is also used in many
other natural language processing applications such as document classification or statistical
machine translation.
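How the acoustic and language models combine can be sketched in a few lines: a statistical recognizer picks the transcription W that maximizes the acoustic score P(O|W) times the language-model score P(W), usually in log space. The hypotheses and probabilities below are invented; the classic "wreck a nice beach" confusion is used only as an example.

```python
import math

# Hypothetical scores for two competing transcriptions of the same audio.
acoustic = {"recognize speech": 0.4, "wreck a nice beach": 0.6}   # P(O|W)
language = {"recognize speech": 0.01, "wreck a nice beach": 0.0001}  # P(W)

def best_hypothesis(acoustic, language):
    """Return the transcription maximizing log P(O|W) + log P(W)."""
    return max(acoustic, key=lambda w: math.log(acoustic[w]) + math.log(language[w]))

print(best_hypothesis(acoustic, language))  # 'recognize speech'
```

Even though the acoustic model slightly prefers the wrong hypothesis, the language model's strong preference for the plausible word sequence wins.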

Acoustic Model:
An acoustic model is used in automatic speech recognition to represent the relationship
between an audio signal and the phonemes or other linguistic units that make up speech.
The model is learned from a set of audio recordings and their corresponding transcripts. It is
created by taking audio recordings of speech, and their text transcriptions, and using
software to create statistical representations of the sounds that make up each word.
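The phrase "statistical representations of the sounds" can be made concrete with a toy estimator: fit a mean and variance for a single acoustic feature of each phoneme from labeled frames. Real acoustic models use Gaussian mixtures or neural networks over many features; the phoneme labels and feature values here are invented.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Invented training data: (phoneme label, one acoustic feature) per frame.
frames = [("ah", 1.1), ("ah", 0.9), ("ee", 2.1), ("ee", 1.9)]

def fit_acoustic_model(frames):
    """Estimate (mean, variance) of the feature for each phoneme."""
    by_phoneme = defaultdict(list)
    for phoneme, feature in frames:
        by_phoneme[phoneme].append(feature)
    return {p: (mean(v), pvariance(v)) for p, v in by_phoneme.items()}

model = fit_acoustic_model(frames)
print(model["ah"])  # roughly (1.0, 0.01)
```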

Language model:
A language model is a probability distribution over sequences of words. Given such a
sequence of length m, a language model assigns a probability P(w1, ..., wm) to the whole
sequence. Language models generate probabilities by training on text corpora in one or
many languages.
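The probability P(w1, ..., wm) is usually factored by the chain rule into conditional probabilities of each word given its history. A minimal sketch, with an invented conditional-probability table and a `<s>` start marker:

```python
# Invented conditional probabilities P(word | previous word).
cond = {
    ("<s>", "the"): 0.5, ("the", "cat"): 0.2, ("cat", "sat"): 0.3,
}

def sequence_probability(words):
    """P(w1..wm) = product of P(wi | previous word), starting from <s>."""
    prob, prev = 1.0, "<s>"
    for w in words:
        prob *= cond.get((prev, w), 0.0)  # unseen pairs get probability 0
        prev = w
    return prob

print(sequence_probability(["the", "cat", "sat"]))  # 0.5 * 0.2 * 0.3
```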

Natural language processing (NLP):


While NLP isn’t necessarily a specific algorithm used in speech recognition, it is the area of
artificial intelligence which focuses on the interaction between humans and machines
through language, both speech and text. Many mobile devices incorporate speech
recognition into their systems to conduct voice search (e.g. Siri) or to provide more
accessibility around texting.

Hidden markov models (HMM):


Modern general-purpose speech recognition systems are based on hidden Markov models.
These are statistical models that output a sequence of symbols or quantities. HMMs are
used in speech recognition because a speech signal can be viewed as a piecewise
stationary signal or a short-time stationary signal. In a short time scale (e.g., 10
milliseconds), speech can be approximated as a stationary process.
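Decoding an HMM means recovering the most likely hidden state sequence (e.g. phonemes) from the observed short-time frames, typically with the Viterbi algorithm. A self-contained sketch over a two-state HMM, with all probabilities invented for illustration:

```python
# Tiny HMM: hidden states emit observation symbols "a" or "b".
states = ["s", "t"]
start = {"s": 0.6, "t": 0.4}
trans = {("s", "s"): 0.7, ("s", "t"): 0.3, ("t", "s"): 0.4, ("t", "t"): 0.6}
emit = {("s", "a"): 0.9, ("s", "b"): 0.1, ("t", "a"): 0.2, ("t", "b"): 0.8}

def viterbi(observations):
    """Return the most likely hidden state sequence for the observations."""
    probs = {s: start[s] * emit[(s, observations[0])] for s in states}
    paths = {s: [s] for s in states}
    for obs in observations[1:]:
        new_probs, new_paths = {}, {}
        for s in states:
            # Best previous state to transition from into s.
            prev = max(states, key=lambda p: probs[p] * trans[(p, s)])
            new_probs[s] = probs[prev] * trans[(prev, s)] * emit[(s, obs)]
            new_paths[s] = paths[prev] + [s]
        probs, paths = new_probs, new_paths
    return paths[max(states, key=probs.get)]

print(viterbi(["a", "a", "b"]))  # ['s', 's', 't']
```

This toy keeps full probabilities for clarity; real decoders work in log space over far larger state graphs.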

N-grams:
This is the simplest type of language model (LM), which assigns probabilities to sentences or
phrases. An N-gram is sequence of N-words. For example, “order the pizza” is a trigram or
3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain
word sequences are used to improve recognition and accuracy.
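A minimal bigram (2-gram) model in this spirit: probabilities come from counting word pairs in a corpus. The tiny corpus below is invented for illustration.

```python
from collections import Counter

corpus = "please order the pizza please order the book".split()

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of word pairs
unigrams = Counter(corpus[:-1])              # counts of preceding words

def bigram_probability(prev, word):
    """P(word | prev) estimated from counts."""
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_probability("order", "the"))  # 1.0 ("order" is always followed by "the")
print(bigram_probability("the", "pizza"))  # 0.5 ("the" precedes "pizza" or "book")
```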

Neural networks:
Primarily leveraged for deep learning algorithms, neural networks process training data by
mimicking the interconnectivity of the human brain through layers of nodes. Each node is
made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds
a given threshold, it “fires” or activates the node, passing data to the next layer in the
network. Neural networks learn this mapping function through supervised learning, adjusting
based on the loss function through the process of gradient descent. While neural networks
tend to be more accurate and can accept more data, this comes at a performance efficiency
cost as they tend to be slower to train compared to traditional language models.
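The single node described above can be written out directly: a weighted sum of inputs plus a bias, passed through a threshold activation. The weights and inputs are invented for illustration; real networks use smooth activations (so gradient descent can adjust the weights) and many layers of such nodes.

```python
def node_output(inputs, weights, bias, threshold=0.0):
    """One node: weighted sum plus bias, then a hard threshold activation."""
    total = sum(i * w for i, w in zip(inputs, weights)) + bias
    return 1 if total > threshold else 0  # "fires" only above the threshold

print(node_output([1.0, 0.5], [0.6, 0.4], bias=-0.5))  # 1  (0.6 + 0.2 - 0.5 > 0)
print(node_output([0.0, 0.0], [0.6, 0.4], bias=-0.5))  # 0  (bias alone is below threshold)
```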

Speaker Diarization (SD):


Speaker diarization algorithms identify and segment speech by speaker identity. This helps
programs better distinguish individuals in a conversation and is frequently applied in call
centres to distinguish customers from sales agents.
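The core labelling step of diarization can be illustrated with a toy: assign each speech segment to the closest speaker "voiceprint". The one-dimensional numbers below stand in for real speaker embeddings and are entirely invented.

```python
# Invented reference embeddings, one per known speaker.
speakers = {"agent": 0.2, "customer": 0.9}

def diarize(segment_embeddings):
    """Label each segment with the nearest speaker embedding."""
    return [min(speakers, key=lambda s: abs(speakers[s] - e))
            for e in segment_embeddings]

print(diarize([0.25, 0.85, 0.1]))  # ['agent', 'customer', 'agent']
```

Real diarization also has to discover how many speakers there are and where segment boundaries fall, typically by clustering embeddings rather than matching known references.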

From Acoustics to Linguistics


The ability to distinguish between speakers was not the only advancement made during this
time. Scientists started abandoning the notion that speech recognition had to be purely
acoustically based.

Instead, they moved more towards natural language processing (NLP). Instead of just using
sounds, scientists turned to algorithms to program systems with the rules of the English
language.

So, if you were speaking to a system that had trouble recognizing a word you said, it would
be able to give an educated guess by assessing its options against correct syntactic,
semantic, and tonal rules.
