Speech Recognition
Speech recognition, also known as automatic speech recognition (ASR), computer voice
recognition, or speech-to-text, is a program's ability to convert spoken language into
written text.
It works by breaking a speech recording down into its individual sounds, analysing each
sound, using algorithms to determine which words are most likely to match those sounds in
the target language, and then transcribing them into text.
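As a rough illustration, the open-source SpeechRecognition package for Python wraps this whole pipeline behind a single call. In the sketch below, the file name is a placeholder, and the audio is sent to Google's free web API, one of several back ends the package supports.

```python
# Minimal sketch: transcribing a WAV file with the open-source
# SpeechRecognition package (pip install SpeechRecognition).
# The file name "meeting.wav" is a placeholder.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("meeting.wav") as source:
    audio = recognizer.record(source)  # read the entire file into memory

# Send the audio to Google's free web API and print the best transcript.
try:
    print(recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("Speech was unintelligible")
```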
Higher-level processor:
The meaning of the recognized words is obtained at this stage. The processor uses a
dynamic knowledge representation to adjust the syntax, semantics and pragmatics
according to the context of what has previously been recognized. Feedback from the
higher-level processor reduces the complexity of the recognition model by limiting the
search to valid input sentences from the user.
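As a toy illustration of that feedback (all phrases and scores below are invented), contextual knowledge can discard hypotheses that are not valid sentences in the current dialogue:

```python
# Illustrative sketch: contextual feedback narrowing the recognizer's search.
# All phrases and scores here are invented for the example.

# Raw hypotheses from the low-level recognizer with acoustic scores.
hypotheses = [
    ("wreck a nice beach", 0.41),
    ("recognize speech", 0.39),
    ("wreck an ice peach", 0.20),
]

# The higher-level processor knows the current dialogue is about software,
# so only sentences valid in that context are kept as candidates.
valid_in_context = {"recognize speech", "recognize the speaker"}

candidates = [(s, p) for s, p in hypotheses if s in valid_in_context]
best = max(candidates, key=lambda sp: sp[1])
print(best[0])  # -> "recognize speech"
```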
Speech recognition algorithms/models
Various algorithms and computational techniques are used to convert speech into text and
to improve the accuracy of transcription. Both acoustic modelling and language modelling are
important parts of modern statistically based speech recognition algorithms. Hidden Markov
models (HMMs) are widely used in many systems. Language modelling is also used in many
other natural language processing applications such as document classification or statistical
machine translation.
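As a rough sketch of how an HMM decoder picks the most likely hidden state sequence for a series of acoustic observations, here is a minimal Viterbi implementation; the states, observations and probabilities are all invented for illustration:

```python
# Toy Viterbi decoding for an HMM with two phoneme-like states.
# All states, observations and probabilities are invented.

states = ["S1", "S2"]
start_p = {"S1": 0.6, "S2": 0.4}
trans_p = {"S1": {"S1": 0.7, "S2": 0.3},
           "S2": {"S1": 0.4, "S2": 0.6}}
emit_p = {"S1": {"low": 0.8, "high": 0.2},
          "S2": {"low": 0.3, "high": 0.7}}

observations = ["low", "high", "high"]  # e.g. quantized acoustic features

# viterbi[t][s] = best probability of any state path ending in s at time t
viterbi = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
backptr = [{}]

for t in range(1, len(observations)):
    viterbi.append({})
    backptr.append({})
    for s in states:
        prev, p = max(
            ((r, viterbi[t - 1][r] * trans_p[r][s]) for r in states),
            key=lambda rp: rp[1],
        )
        viterbi[t][s] = p * emit_p[s][observations[t]]
        backptr[t][s] = prev

# Trace back the most likely state sequence.
last = max(viterbi[-1], key=viterbi[-1].get)
path = [last]
for t in range(len(observations) - 1, 0, -1):
    path.append(backptr[t][path[-1]])
print(list(reversed(path)))  # -> ['S1', 'S2', 'S2']
```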
Acoustic Model:
An acoustic model is used in automatic speech recognition to represent the relationship
between an audio signal and the phonemes or other linguistic units that make up speech.
The model is learned from a set of audio recordings and their corresponding transcripts:
software uses these paired examples to build statistical representations of the sounds that
make up each word.
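For illustration, a common first step towards such representations is converting each recording into frame-level feature vectors such as MFCCs. The sketch below uses the librosa library; the file name is a placeholder:

```python
# Sketch: turning raw audio into the feature vectors an acoustic model
# is trained on, using the librosa library (pip install librosa).
# "utterance.wav" is a placeholder file name.
import librosa

signal, sample_rate = librosa.load("utterance.wav", sr=16000)

# 13 MFCCs per frame: a compact spectral summary commonly used
# as input features for acoustic models.
mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=13)
print(mfcc.shape)  # (13, number_of_frames)
```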
Language model:
A language model is a probability distribution over sequences of words. Given such a
sequence of length m, a language model assigns a probability P(w1, ..., wm) to the whole
sequence. Language models learn these probabilities by training on text corpora in one or
many languages.
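As a small worked example (with invented probabilities), a bigram model assigns a probability to a whole sequence by chaining conditional word probabilities:

```python
# Sketch: a language model assigning a probability to a whole sequence
# by chaining conditional word probabilities. All numbers are invented.

bigram_p = {
    ("<s>", "please"): 0.2,
    ("please", "order"): 0.5,
    ("order", "the"): 0.6,
    ("the", "pizza"): 0.1,
}

def sequence_probability(words):
    p = 1.0
    prev = "<s>"  # start-of-sentence marker
    for w in words:
        p *= bigram_p.get((prev, w), 1e-6)  # tiny floor for unseen pairs
        prev = w
    return p

print(sequence_probability(["please", "order", "the", "pizza"]))
# -> 0.2 * 0.5 * 0.6 * 0.1 = 0.006
```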
N-grams:
This is the simplest type of language model (LM), which assigns probabilities to sentences or
phrases. An N-gram is a sequence of N words. For example, “order the pizza” is a trigram or
3-gram and “please order the pizza” is a 4-gram. Grammar and the probability of certain
word sequences are used to improve recognition and accuracy.
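For illustration, the conditional probabilities of such a model can be estimated from raw counts in a corpus; the toy corpus below is invented:

```python
# Sketch: estimating bigram probabilities from raw counts in a toy corpus.
from collections import Counter

corpus = "please order the pizza please order the salad".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Maximum-likelihood estimate: P(w2 | w1) = count(w1 w2) / count(w1)
def bigram_prob(w1, w2):
    return bigrams[(w1, w2)] / unigrams[w1]

print(bigram_prob("order", "the"))   # -> 1.0 ("order" is always followed by "the")
print(bigram_prob("the", "pizza"))   # -> 0.5 ("the" precedes pizza or salad)
```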
Neural networks:
Primarily leveraged for deep learning algorithms, neural networks process training data by
mimicking the interconnectivity of the human brain through layers of nodes. Each node is
made up of inputs, weights, a bias (or threshold) and an output. If that output value exceeds
a given threshold, it “fires” or activates the node, passing data to the next layer in the
network. Neural networks learn the mapping from inputs to outputs through supervised
learning, adjusting their weights via gradient descent on a loss function. While neural
networks tend to be more accurate and can accept more data, this comes at an efficiency
cost: they tend to be slower to train than traditional language models.
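The sketch below (with invented numbers) implements the single node described above: a weighted sum of inputs plus a bias, which fires only when the result exceeds a threshold. Note that networks trained by gradient descent use smooth activations in practice; the hard step is shown only to match the description.

```python
# Sketch of a single node: weighted inputs plus a bias, "firing" only
# when the result exceeds a threshold. All values are invented.
import numpy as np

inputs = np.array([0.5, 0.9, 0.1])
weights = np.array([0.4, 0.7, 0.2])
bias = -0.5

# Weighted sum of the inputs plus the bias.
z = np.dot(inputs, weights) + bias

# Step activation: the node fires (outputs 1) only above the threshold 0,
# passing its output to the next layer of the network.
output = 1 if z > 0 else 0
print(z, output)  # z = 0.35 -> the node fires
```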
Over time, researchers moved towards natural language processing (NLP). Instead of
relying on sounds alone, they turned to algorithms that encode the rules of the English
language.
So, if you were speaking to a system that had trouble recognizing a word you said, it could
make an educated guess by assessing its options against syntactic, semantic, and tonal
rules.
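As a toy illustration of that kind of educated guess (all candidates and probabilities invented), a system can rerank acoustically confusable options by their contextual likelihood:

```python
# Illustrative sketch: choosing between acoustically similar candidates by
# scoring each against language-model probabilities. All numbers are invented.

candidates = ["I scream", "ice cream"]

# Invented contextual probabilities for the sentence "I'd like some ___".
context_p = {"I scream": 0.02, "ice cream": 0.90}

best = max(candidates, key=lambda c: context_p[c])
print(best)  # -> "ice cream", the syntactically and semantically likelier choice
```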