
Speech Recognition Architecture

This document discusses the key components of an automatic speech recognition system, including pre-processing of acoustic signals, feature extraction, acoustic modeling, language modeling, pattern classification, and part-of-speech tagging. It describes how speech is converted to digital signals and analyzed frame-by-frame to extract features before being matched to acoustic models using techniques like hidden Markov modeling. Language models then incorporate structural constraints to distinguish words with similar pronunciations.

Uploaded by

Dhrumil Das

Speech Recognition Architecture
• The fundamental task of speech recognition is the translation of sound into text and
commands.
• Speech recognition is the process by which a computer maps an acoustic speech signal to some
form of abstract meaning of the speech.

Automatic Speech Recognition System


• Pre-processing/Digital Processing:
• The recorded acoustic signal is an analog signal.
• An analog signal cannot be fed directly to an ASR system.
• The speech signal must therefore be converted to a digital signal before it can be
processed.
• The digital signal is then passed through a first-order filter to spectrally flatten it.
• This step boosts the energy of the signal at higher frequencies.
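The spectral flattening step can be sketched as a first-order pre-emphasis filter, y[n] = x[n] - alpha * x[n-1]. This is a minimal illustration, not the slides' own implementation: the function name `pre_emphasis` and the coefficient `alpha = 0.97` are assumptions chosen for the example.

```python
import numpy as np

def pre_emphasis(signal, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is an assumed, illustrative coefficient.
    """
    signal = np.asarray(signal, dtype=float)
    # The first sample has no predecessor, so it is passed through unchanged.
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

# A constant (zero-frequency) signal is strongly attenuated after the first sample...
low = pre_emphasis(np.ones(8))
# ...while a signal alternating at the highest representable frequency is boosted.
high = pre_emphasis(np.array([1.0, -1.0] * 4))

print(low[:3])
print(high[:3])
```

Comparing the two outputs shows the intended effect: slowly varying content shrinks while rapidly varying content grows.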
• Feature Extraction
• The feature extraction step finds a set of parameters of the utterance that correlate
acoustically with the speech signal; these parameters are computed by processing the acoustic
waveform.
• These parameters are known as features.
• The main goal of the feature extractor is to keep the relevant information and discard the
irrelevant.
• To do this, the feature extractor divides the acoustic signal into frames of 10-25 ms.
• The data in each frame is multiplied by a window function.
• Many window functions can be used, such as Hamming, rectangular, Blackman, Welch or
Gaussian. Features are then extracted from every windowed frame.
• There are several methods for feature extraction, such as Mel-Frequency Cepstral Coefficients
(MFCC), Linear Predictive Cepstral Coefficients (LPCC), Perceptual Linear Prediction (PLP),
wavelets and RASTA-PLP (Relative Spectral Transform) processing.
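The framing and windowing steps above can be sketched as follows. The 25 ms frame length is within the range given above; the 10 ms step between frame starts (so that frames overlap), the 16 kHz sample rate, and the function name `frame_signal` are assumptions made for this example.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=25, step_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window to each."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)  # samples per frame
    step = int(sample_rate * step_ms / 1000)        # samples between frame starts
    if len(signal) < frame_len:
        raise ValueError("signal is shorter than one frame")
    n_frames = 1 + (len(signal) - frame_len) // step
    frames = np.stack([signal[i * step : i * step + frame_len]
                       for i in range(n_frames)])
    # Multiply every frame by the same window (broadcast over rows).
    return frames * np.hamming(frame_len)

sample_rate = 16000                                  # assumed sampling rate in Hz
one_second = np.ones(sample_rate)                    # placeholder signal
frames = frame_signal(one_second, sample_rate)
print(frames.shape)
```

Each row of `frames` is one windowed frame; a feature method such as MFCC would then be computed per row.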
• Acoustic Modeling
• The acoustic model establishes the connection between the acoustic information and phonetics.
• The acoustic model plays an important role in the performance of the system and is
responsible for the computational load.
• Training establishes a correlation between the basic speech units and the acoustic observations.
• Training the system requires creating a representative pattern for the features of a class using
one or more patterns that correspond to speech sounds of the same class.
• Many models are available for acoustic modeling; of these, the Hidden Markov Model (HMM)
is widely used and accepted because efficient algorithms exist for training and recognition.
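One of the efficient HMM computations used at recognition time is the forward algorithm, which gives the likelihood of an observation sequence under a model. The sketch below uses a made-up two-state, two-symbol HMM; all probabilities and names are illustrative assumptions, and real acoustic models work on continuous feature vectors rather than two discrete symbols.

```python
import itertools
import numpy as np

# Toy HMM (all numbers are invented for illustration).
start_p = np.array([0.6, 0.4])            # P(first state)
trans_p = np.array([[0.7, 0.3],           # trans_p[i, j] = P(next state j | state i)
                    [0.4, 0.6]])
emit_p = np.array([[0.9, 0.1],            # emit_p[i, k] = P(symbol k | state i)
                   [0.2, 0.8]])

def forward_likelihood(observations, start_p, trans_p, emit_p):
    """Return P(observations | model) using the forward algorithm."""
    alpha = start_p * emit_p[:, observations[0]]
    for obs in observations[1:]:
        # Propagate one step through the transitions, then weight by the emission.
        alpha = (alpha @ trans_p) * emit_p[:, obs]
    return float(alpha.sum())

likelihood = forward_likelihood([0, 0, 1], start_p, trans_p, emit_p)

# Sanity check: the likelihoods of all length-3 sequences should sum to 1.
total = sum(forward_likelihood(list(seq), start_p, trans_p, emit_p)
            for seq in itertools.product(range(2), repeat=3))
print(likelihood, total)
```

In a recognizer, the same likelihood would be computed for each candidate model and the best-scoring one selected.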
• Language Modeling
• A language model encodes the structural constraints of the language in order to generate
probabilities of occurrence.
• It gives the probability of a word occurring after a given word sequence.
• The language model distinguishes words and phrases that sound similar.
• For example, in American English, the phrases "recognize speech" and "wreck a nice
beach" are pronounced similarly but mean very different things.
• These ambiguities are easier to resolve when evidence from the language model is combined
with the pronunciation model and the acoustic model.
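A bigram model is one simple way to estimate the probability of a word given the preceding word. The sketch below trains one on a tiny invented corpus with add-one smoothing and scores the two phrases from the example; the corpus, the smoothing choice, and the helper names are all assumptions for illustration.

```python
from collections import Counter

# Tiny made-up training corpus (illustrative only).
corpus = [
    "we recognize speech with a computer",
    "systems that recognize speech need a language model",
    "they wreck a car",
    "a nice beach is far away",
]

unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    words = ["<s>"] + sentence.split()      # <s> marks the sentence start
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)

def bigram_prob(prev, word):
    """P(word | prev) with add-one smoothing."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

def sentence_prob(sentence):
    """Product of bigram probabilities over the sentence."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob

p_recognize = sentence_prob("recognize speech")
p_wreck = sentence_prob("wreck a nice beach")
print(p_recognize, p_wreck)
```

On this toy data "recognize speech" scores higher, partly because it is the shorter phrase; a full system would weigh such scores together with the acoustic and pronunciation models.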
• Pattern Classification
• Pattern classification (or recognition) is the process of comparing the unknown test pattern with
each sound class reference pattern and computing a measure of similarity between them.
• Once the system has been trained, test patterns are classified in order to recognize
the speech.
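The comparison against reference patterns can be sketched as a nearest-reference classifier. Euclidean distance is used here as the similarity measure purely as an assumption; the class labels and three-dimensional feature vectors are invented for the example.

```python
import numpy as np

# One hypothetical reference pattern (feature vector) per sound class.
reference_patterns = {
    "aa": np.array([1.0, 0.2, 0.1]),
    "iy": np.array([0.1, 1.0, 0.3]),
    "s":  np.array([0.0, 0.2, 1.0]),
}

def classify(test_pattern, references):
    """Return the label of the closest reference pattern and all distances."""
    distances = {label: float(np.linalg.norm(test_pattern - ref))
                 for label, ref in references.items()}
    best_label = min(distances, key=distances.get)
    return best_label, distances

label, distances = classify(np.array([0.2, 0.9, 0.2]), reference_patterns)
print(label, distances)
```

A smaller distance means higher similarity, so the class with the minimum distance is chosen.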
• Part of Speech Tagging
• Part-of-speech (POS) tagging is a process in natural language processing (NLP) where each
word in a text is labeled with its corresponding part of speech.
• This can include nouns, verbs, adjectives, and other grammatical categories.
• POS tagging is useful for a variety of NLP tasks, such as information extraction, named entity
recognition, and machine translation.
• It can also be used to identify the grammatical structure of a sentence and to disambiguate
words that have multiple meanings.
• POS tagging is typically performed using machine learning algorithms, which are trained on a
large annotated corpus of text.
• The algorithm learns to predict the correct POS tag for a given word based on the context in
which it appears.
• Let’s take an example,
• Text: “The cat sat on the mat.”
• POS tags:
• The: determiner
• cat: noun
• sat: verb
• on: preposition
• the: determiner
• mat: noun
• Use of Parts of Speech Tagging in NLP
• To understand the grammatical structure of a sentence
• To disambiguate words with multiple meanings
• To improve the accuracy of NLP tasks
• To facilitate research in linguistics
• Steps Involved in the POS tagging
• Collect a dataset of annotated text
• Preprocess the text
• Divide the dataset into training and testing sets
• Train the POS tagger
• Test the POS tagger
• Fine-tune the POS tagger
• Use the POS tagger
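The steps above can be sketched end to end with a very simple baseline that assigns each word its most frequent tag from the training data. The tiny hand-tagged dataset, the train/test split, and the default tag for unknown words are assumptions for illustration; practical taggers also use the surrounding context.

```python
from collections import Counter, defaultdict

# 1. Collect a (tiny, invented) dataset of annotated sentences.
tagged_sentences = [
    [("the", "DET"), ("cat", "NOUN"), ("sat", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")],
    [("the", "DET"), ("dog", "NOUN"), ("sat", "VERB")],
    [("a", "DET"), ("cat", "NOUN"), ("ran", "VERB")],
    [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"),
     ("on", "ADP"), ("the", "DET"), ("mat", "NOUN")],
]

# 2-3. Preprocess (lowercase) and divide into training and testing sets.
train, test = tagged_sentences[:3], tagged_sentences[3:]

# 4. Train: count how often each word carries each tag.
tag_counts = defaultdict(Counter)
for sentence in train:
    for word, tag in sentence:
        tag_counts[word.lower()][tag] += 1

def tag_words(words, default="NOUN"):
    """Tag each word with its most frequent training tag (default if unseen)."""
    result = []
    for word in words:
        counts = tag_counts.get(word.lower())
        result.append((word, counts.most_common(1)[0][0] if counts else default))
    return result

# 5. Test: measure token accuracy on the held-out sentence.
correct = total = 0
for sentence in test:
    predicted = tag_words([word for word, _ in sentence])
    for (_, gold), (_, guess) in zip(sentence, predicted):
        correct += gold == guess
        total += 1
accuracy = correct / total

# 7. Use the tagger on new text.
print(accuracy, tag_words(["The", "zebra", "sat"]))
```

Fine-tuning (step 6) would correspond to adjusting choices such as the default tag or adding context features after inspecting the test errors.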
• Implement Parts-Of-Speech tags using spaCy in Python
# Shell: install spaCy and the small English model
pip install spacy
python -m spacy download en_core_web_sm
# Python: tag each token of a sentence
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is planning to buy Indian startup for $1 billion")
for token in doc:
    print(token, "|", token.pos_, "|", spacy.explain(token.pos_), "|", token.tag_,
          spacy.explain(token.tag_))
• token.pos_ gives the coarse-grained POS tag of a token; token.tag_ gives the fine-grained tag
• Output:
• Apple | PROPN | proper noun | NNP noun, proper singular
• is | AUX | auxiliary | VBZ verb, 3rd person singular present
• planning | VERB | verb | VBG verb, gerund or present participle
• to | PART | particle | TO infinitival "to"
• buy | VERB | verb | VB verb, base form
• Indian | ADJ | adjective | JJ adjective (English), other noun-modifier (Chinese)
• startup | NOUN | noun | NN noun, singular or mass
• for | ADP | adposition | IN conjunction, subordinating or preposition
• $ | SYM | symbol | $ symbol, currency
• 1 | NUM | numeral | CD cardinal number
• billion | NUM | numeral | CD cardinal number
• Defining a tag set
• We have to define an inventory of labels for the word classes (i.e. the tag set)
-Most taggers rely on models that have to be trained on annotated (tagged) corpora. Evaluation
also requires annotated corpora.
-Since human annotation is expensive/time-consuming, the tag sets used in a few existing
labeled corpora become the de facto standard.
-Tag sets need to capture semantically or syntactically important distinctions that can easily be
made by trained human annotators.
Word classes
Open classes:
Nouns, Verbs, Adjectives, Adverbs
Closed classes:
Auxiliaries and modal verbs, Prepositions, Conjunctions, Pronouns, Determiners, Particles,
Numerals
• Defining a tag set
• Tag sets have different granularities:
• Brown corpus (Francis and Kucera 1982): 87 tags
• Penn Treebank (Marcus et al. 1993): 45 tags
Simplified version of Brown tag set (de facto standard for English now)
NN: common noun (singular or mass): water, book
NNS: common noun (plural): books
• Prague Dependency Treebank (Czech): 4452 tags
Complete morphological analysis: AAFP3----3N----: nejnezajímavějším
Adjective Regular Feminine Plural Dative….Superlative
• How much ambiguity is there?
Most word types are unambiguous:
Number of tags per word type:
• NB: These numbers are based on word/tag combinations in the corpus. Many combinations
that don’t occur in the corpus are equally correct.
• But a large fraction of word tokens are ambiguous. Original Brown corpus: 40% of tokens
are ambiguous
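The difference between ambiguity over word types and ambiguity over word tokens can be made concrete by counting both on a tagged corpus. The twelve-token corpus below is invented for illustration and its numbers say nothing about the Brown corpus.

```python
from collections import defaultdict

# Tiny made-up tagged corpus (Penn Treebank style tags).
tagged_tokens = [
    ("the", "DT"), ("back", "NN"), ("door", "NN"),
    ("they", "PRP"), ("back", "VBP"), ("the", "DT"), ("plan", "NN"),
    ("we", "PRP"), ("plan", "VBP"), ("to", "TO"), ("go", "VB"), ("back", "RB"),
]

# Collect the set of tags observed for each word type.
tags_per_type = defaultdict(set)
for word, tag in tagged_tokens:
    tags_per_type[word].add(tag)

# A type is ambiguous if it was seen with more than one tag.
ambiguous_types = {word for word, tags in tags_per_type.items() if len(tags) > 1}

type_fraction = len(ambiguous_types) / len(tags_per_type)
token_fraction = (sum(word in ambiguous_types for word, _ in tagged_tokens)
                  / len(tagged_tokens))
print(sorted(ambiguous_types), type_fraction, token_fraction)
```

Even here the token-level fraction is higher than the type-level fraction, because the ambiguous words are also the ones that occur repeatedly.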
• Qualitative evaluation
• Generate a confusion matrix (for development data): how often was a word with tag i
mistagged as tag j?
• See what errors are causing problems:
- Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
- Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)
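A confusion matrix can be built by counting (gold tag, predicted tag) pairs. The two tag sequences below are invented stand-ins for development data and tagger output, used only to show the counting.

```python
from collections import Counter

# Hypothetical gold tags and tagger predictions for the same eight tokens.
gold      = ["NN", "NNP", "JJ", "VBD", "VBN", "NN", "VBD", "JJ"]
predicted = ["NN", "NN",  "JJ", "VBN", "VBN", "JJ", "VBD", "JJ"]

# confusion[(i, j)] = how often a token with gold tag i was tagged as j.
confusion = Counter(zip(gold, predicted))

# Off-diagonal cells are the errors, listed most frequent first.
errors = [(pair, count) for pair, count in confusion.most_common()
          if pair[0] != pair[1]]

for (gold_tag, predicted_tag), count in errors:
    print(f"{gold_tag} mistagged as {predicted_tag}: {count}")
```

Sorting the off-diagonal cells by count points directly at the tag pairs that deserve attention first.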
