
CS 224S / LINGUIST 285

Spoken Language Processing

Andrew Maas
Stanford University
Spring 2022

Lecture 1: Course Introduction


Original slides by Dan Jurafsky
Week 1
— Course introduction
— Course Logistics
— Course topics overview
— Dialog / conversational agents
— Speech recognition (Speech to text)
— Speech synthesis (Text to speech)
— Applications
— Brief history
— Articulatory Phonetics
— ARPAbet transcription
Exciting recent developments have
disrupted this field

— Apple Siri (2011), Google Assistant (2016), Microsoft Cortana (2014)
— Amazon Alexa (2014) and the Alexa Prize (2017)
— End-to-end neural models become state of the art (2015-present)
— Neural TTS voice cloning
— Realtime speech-to-speech translation (2020)
Entering a new era of spoken
language applications and impact

— EuroNews article 🇺🇦
— YouTube video
Some basic ethics when working
on speech technologies
— Don’t record someone without their consent
— In California, all parties to any confidential conversation must give
their consent to be recorded. For calls occurring over cellular or
cordless phones, all parties must consent before a person can
record, regardless of confidentiality.

— Don’t create a speech synthesizer / voice clone of someone without their consent
— It might be fun, but it’s a little creepy. People get upset
— Okay to use existing speech datasets (we’ll provide some)

— Consider subgroup and language bias when building real applications
— Poor performance on subgroups, e.g. non-native speakers
— Many languages are under-served relative to English/Mandarin
Course Logistics
— Course goal: Build something you are proud of
— Course project: Research paper? Compelling demo/story for job
interviews? Applied system you can use at home/work?

— Homeworks (2 weeks each):
— Introduction to audio analysis and spoken language tools
— Building a complete dialog system using the Amazon Alexa Skills Kit
— Implementing end-to-end deep neural network approaches to speech recognition using PyTorch
— Working with advanced deep learning toolkits for speech recognition (SpeechBrain) and voice cloning

— Homeworks use Colab and PyTorch (AWS for Alexa)


Course Logistics

— http://www.stanford.edu/class/cs224s

— Homeworks out on Tuesdays and due 11:59pm Monday

— Gradescope for homework submission

— Ed for questions. Use a private post for personal/confidential questions

— Final project poster session in person!


Admin: Requirements and
Grading
— Readings:
— Jurafsky & Martin. Speech and Language Processing.
— 3rd edition pre-prints available online
— A few conference and journal papers
— Grading
— Homework: 45%
— Course Project: 50%
— Participation: 5%
— Attend 3 guest lectures (3%)
— Ed participation (2%)
Course Projects
— Build something you are proud of

— Full systems / demos, research papers on individual components, applying spoken language analysis to interesting datasets, etc. are all great projects

— Combining projects with other courses is great!
— CS236G (GANs), CS224N, CS329S, CS229 all relevant
— Need instructor permission to combine

— Project handout + intro lecture / discussion soon. Ideally groups of 2-3
Necessary Background
— Foundations of machine learning and natural language
processing
— CS 124, CS 224N, CS 229, or equivalent experience
— Mathematical foundations of neural networks
— Understand forward and back propagation in terms of
equations
— Deep learning intro lecture will adjust to class needs.
— Proficiency in Python
— Programming-heavy homeworks will use Python, Colab notebooks, and PyTorch
Office hours and CAs

— Andrew: In person after class on Thursdays (projects + other)

— CAs: Zoom with Calendly (homework + projects)

— Meet your teaching staff!


— Gaurab Banerjee

— Shreya Gupta

— Alex Ke

— Questions on logistics?
Week 1
— Course introduction
— Course Logistics
— Course topics overview
— Dialog / conversational agents
— Speech recognition (Speech to text)
— Speech synthesis (Text to speech)
— Applications
— Brief history
— Articulatory Phonetics
— ARPAbet transcription
Dialogue (= Conversational Agents)
— Task-oriented conversations
— Personal Assistants (Alexa, Siri, etc.)
— Design considerations
— Synchronous or asynchronous tasks
— Pure speech, pure text, UI hybrids
— Functionality versus personality
Paradigms for Dialogue
— POMDP
— Partially-Observed Markov Decision Processes
— Reinforcement Learning to learn what action to take
— Asking a question or answering one are just actions
— “Speech acts”

— Simple slot filling (ML or regular expressions; toy sketch below)
— Pre-built frames
— Calendar: Who / When / Where
— Filled by hand-built rules
— e.g. (“on (Mon|Tue|Wed…)”)
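To make the slot-filling idea concrete, here is a minimal, self-contained sketch in Python (the course's homework language). The calendar frame, slot names, and patterns are invented for this illustration and are not from the course materials:

```python
import re

# Toy regex slot filling. The frame, slot names, and patterns are
# invented for this sketch, not taken from any real dialog system.
SLOT_PATTERNS = {
    "when": re.compile(r"\bon ((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\w*)\b", re.I),
    "who": re.compile(r"\bwith ([A-Z][a-z]+)\b"),
    "where": re.compile(r"\bin ([A-Z][a-z]+)\b"),
}

def fill_slots(utterance):
    """Fill calendar slots by matching each hand-built rule."""
    slots = {}
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(utterance)
        if match:
            slots[slot] = match.group(1)
    return slots

print(fill_slots("Book lunch with Alice on Tuesday in Boston"))
# -> {'when': 'Tuesday', 'who': 'Alice', 'where': 'Boston'}
```

Real systems replace these hand-built patterns with learned intent and slot classifiers, but the frame-and-slots data structure is the same.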
Paradigms for Dialogue
— POMDP
— Active research area: deep learning RL (toy sketch after this slide)
— Not quite industry-strength
— Simple slot filling (ML or regex)
— State of the art, used in most systems
— Reusing new search engine technology
— Intent recognition / semantic parsing
— Neural network chatbots
— Replacing major pieces of dialog systems
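For intuition on the RL framing, here is a toy, fully-observed simplification (a real POMDP would also track a belief over hidden states): tabular Q-learning over an invented two-state dialog in which asking a question and giving an answer are just actions. All states, actions, and rewards are made up for illustration:

```python
import random

# Toy dialog MDP: the agent should learn to ask for the missing
# slot before attempting to answer. Everything here is invented.
STATES = ["slot_missing", "slot_filled"]
ACTIONS = ["ask_question", "give_answer"]

def step(state, action):
    """Hypothetical environment: returns (next_state, reward, done)."""
    if state == "slot_missing":
        if action == "ask_question":
            return "slot_filled", 0.0, False   # got the info
        return "slot_missing", -1.0, True      # answered too early
    if action == "give_answer":
        return "slot_filled", 1.0, True        # task success
    return "slot_filled", -0.1, False          # redundant question

Q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
alpha, gamma, eps = 0.5, 0.9, 0.2

for _ in range(500):
    state, done = "slot_missing", False
    while not done:
        if random.random() < eps:                          # explore
            action = random.choice(ACTIONS)
        else:                                              # exploit
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        nxt, reward, done = step(state, action)
        target = reward if done else reward + gamma * max(
            Q[(nxt, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (target - Q[(state, action)])
        state = nxt

print({k: round(v, 2) for k, v in Q.items()})
# The learned policy asks first, then answers.
```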
Speech Recognition
— Large Vocabulary Continuous Speech
Recognition (LVCSR)
— ~64,000 words
— Speaker independent (vs. speaker-dependent)
— Continuous speech (vs. isolated-word)
Current error rates
Why is conversational speech
harder?
— A piece of an utterance without context

— The same utterance with more context


HSR versus ASR (Saon et al., 2017)


Why accents are hard
— A word by itself

— The word in context


So is speech recognition solved?
Why study it vs. using some API?
— In the last ~10 years
— Dramatic reduction in LVCSR error rates (16% to 6%)
— Human level LVCSR performance on Switchboard
— New class of recognizers (end to end neural network)
— Understanding how ASR works enables better ASR-enabled systems
— What types of errors are easy to correct?
— How can a downstream system make use of uncertain
outputs?
— How much would building our own improve on an API?
— Next generation of ASR challenges as systems go live
on phones and in homes
Speech Recognition Design
Intuition
— Build a statistical model of the speech-to-words
process
— Collect lots and lots of speech, and transcribe all the
words.
— Train the model on the labeled speech
— Paradigm: Supervised Machine Learning + Search (sketched below)
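A minimal sketch of this paradigm in PyTorch, with random tensors as stand-ins for real acoustic features and transcripts (this is not the actual homework code; all sizes are placeholders): a tiny acoustic model trained with CTC loss, followed by the simplest possible "search", greedy decoding.

```python
import torch
import torch.nn as nn

FEAT_DIM = 80           # e.g. log-mel filterbank features per frame
NUM_PHONES = 40         # toy label inventory (+1 below for CTC blank)

model = nn.Sequential(  # tiny stand-in acoustic model
    nn.Linear(FEAT_DIM, 128),
    nn.ReLU(),
    nn.Linear(128, NUM_PHONES + 1),
)
ctc = nn.CTCLoss(blank=NUM_PHONES)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One supervised training step on a fake (speech, transcript) pair.
feats = torch.randn(200, 1, FEAT_DIM)            # (time, batch, feats)
labels = torch.randint(0, NUM_PHONES, (1, 30))   # fake transcript
log_probs = model(feats).log_softmax(-1)         # (time, batch, classes)
loss = ctc(log_probs, labels,
           input_lengths=torch.tensor([200]),
           target_lengths=torch.tensor([30]))
optimizer.zero_grad()
loss.backward()
optimizer.step()

# "Search", in its simplest form: greedy argmax per frame, collapsing
# repeats and dropping blanks (real systems use beam search).
best = log_probs.argmax(-1).squeeze(1).tolist()
decoded = [p for i, p in enumerate(best)
           if p != NUM_PHONES and (i == 0 or p != best[i - 1])]
print(round(loss.item(), 3), decoded[:10])
```

The homeworks build this out properly: real features, real transcribed speech, and beam search instead of greedy decoding.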
TTS (= Text-to-Speech) (= Speech
Synthesis)
— Produce speech from a text input
— Applications:
— Personal Assistants
— Apple Siri
— Microsoft Cortana
— Google Assistant
— Games
— Announcements / voice-overs
TTS Overview
— Collect lots of speech (5-50 hours) from one
speaker, transcribe very carefully, all the
syllables and phones and whatnot
— Rapid recent progress in neural approaches
— Modern systems are DNN-based,
understandable, but not yet emotive
TTS Overview: End-to-end neural

Tacotron (Wang et al., 2017)


Applications
— Machine learning applications
— Extract information from speech using
supervised learning
— Emotion, speaker ID, flirtation, deception,
depression, intoxication
— Dialog system / SLU applications
— Building systems to solve a problem
— Medical transcription, reservations via chat
— New area: Self-supervised foundation
models
Extraction of Social Meaning from
Speech
— Detection of student uncertainty in tutoring
— Forbes-Riley et al. (2008)
— Emotion detection (annoyance)
— Ang et al. (2002)
— Detection of deception
— Newman et al. (2003)
— Detection of charisma
— Rosenberg and Hirschberg (2005)
— Speaker stress, trauma
— Rude et al. (2004), Pennebaker and Lay (2002)
Conversational style
— Given speech and text from a conversation
— Can we tell if a speaker is
— Awkward?
— Flirtatious?
— Friendly?
— Dataset:
— 1000 4-minute “speed-dates”
— Each subject rated their partner for these styles
— The following segment has been lightly signal-processed:
Week 1
— Course introduction
— Course Logistics
— Course topics overview
— Dialog / conversational agents
— Speech recognition (Speech to text)
— Speech synthesis (Text to speech)
— Applications
— Brief history
— Articulatory Phonetics
— ARPAbet transcription
