


22nd Interspeech 2021: Brno, Czechia
- Hynek Hermansky, Honza Cernocký, Lukás Burget, Lori Lamel, Odette Scharenborg, Petr Motlícek:
22nd Annual Conference of the International Speech Communication Association, Interspeech 2021, Brno, Czechia, August 30 - September 3, 2021. ISCA 2021
Speech Synthesis: Other Topics
- Michael Pucher, Thomas Woltron:
Conversion of Airborne to Bone-Conducted Speech with Deep Neural Networks. 1-5 - Markéta Rezácková, Jan Svec, Daniel Tihelka:
T5G2P: Using Text-to-Text Transfer Transformer for Grapheme-to-Phoneme Conversion. 6-10 - Olivier Perrotin, Hussein El Amouri, Gérard Bailly, Thomas Hueber:
Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values. 11-15 - Phat Do, Matt Coler, Jelske Dijkstra, Esther Klabbers:
A Systematic Review and Analysis of Multilingual Data Strategies in Text-to-Speech for Low-Resource Languages. 16-20
Disordered Speech
- Tanya Talkar, Nancy Pearl Solomon, Douglas S. Brungart, Stefanie E. Kuchinsky, Megan M. Eitel, Sara M. Lippa, Tracey A. Brickell, Louis M. French, Rael T. Lange, Thomas F. Quatieri:
Acoustic Indicators of Speech Motor Coordination in Adults With and Without Traumatic Brain Injury. 21-25 - Juan Camilo Vásquez-Correa, Julian Fritsch, Juan Rafael Orozco-Arroyave, Elmar Nöth, Mathew Magimai-Doss:
On Modeling Glottal Source Information for Phonation Assessment in Parkinson's Disease. 26-30 - Khalid Daoudi, Biswajit Das, Solange Milhé de Saint Victor, Alexandra Foubert-Samier, Anne Pavy-Le Traon, Olivier Rascol, Wassilios G. Meissner, Virginie Woisard:
Distortion of Voiced Obstruents for Differential Diagnosis Between Parkinson's Disease and Multiple System Atrophy. 31-35 - Pu Wang, Bagher BabaAli, Hugo Van hamme:
A Study into Pre-Training Strategies for Spoken Language Understanding on Dysarthric Speech. 36-40 - Rosanna Turrisi, Arianna Braccia, Marco Emanuele, Simone Giulietti, Maura Pugliatti, Mariachiara Sensi, Luciano Fadiga, Leonardo Badino:
EasyCall Corpus: A Dysarthric Speech Dataset. 41-45
Speech Signal Analysis and Representation II
- Xiaoyu Bie, Laurent Girin, Simon Leglaive, Thomas Hueber, Xavier Alameda-Pineda:
A Benchmark of Dynamical Variational Autoencoders Applied to Speech Spectrogram Modeling. 46-50 - Metehan Yurt, Pavan Kantharaju, Sascha Disch, Andreas Niedermeier, Alberto N. Escalante-B., Veniamin I. Morgenshtern:
Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods. 51-55 - RaviShankar Prasad, Mathew Magimai-Doss:
Identification of F1 and F2 in Speech Using Modified Zero Frequency Filtering. 56-60 - Yann Teytaut, Axel Roebel:
Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice. 61-65
Feature, Embedding and Neural Architecture for Speaker Recognition
- Seong-Hu Kim, Yong-Hwa Park:
Adaptive Convolutional Neural Network for Text-Independent Speaker Recognition. 66-70 - Jiajun Qi, Wu Guo, Bin Gu:
Bidirectional Multiscale Feature Aggregation for Speaker Verification. 71-75 - Yu-Jia Zhang, Yih-Wen Wang, Chia-Ping Chen, Chung-Li Lu, Bo-Cheng Chan:
Improving Time Delay Neural Network Based Speaker Recognition with Convolutional Block and Feature Aggregation Methods. 76-80 - Yanfeng Wu, Junan Zhao, Chenkai Guo, Jing Xu:
Improving Deep CNN Architectures with Variable-Length Training Samples for Text-Independent Speaker Verification. 81-85 - Tinglong Zhu, Xiaoyi Qin, Ming Li:
Binary Neural Network for Speaker Verification. 86-90 - Youzhi Tu, Man-Wai Mak:
Mutual Information Enhanced Training for Speaker Embedding. 91-95 - Ge Zhu, Fei Jiang, Zhiyao Duan:
Y-Vector: Multiscale Waveform Encoder for Speaker Embedding. 96-100 - Yan Liu, Zheng Li, Lin Li, Qingyang Hong:
Phoneme-Aware and Channel-Wise Attentive Learning for Text Dependent Speaker Verification. 101-105 - Hongning Zhu, Kong Aik Lee, Haizhou Li:
Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. 106-110
Speech Synthesis: Toward End-to-End Synthesis II
- Cheng Gong, Longbiao Wang, Ju Zhang, Shaotong Guo, Yuguang Wang, Jianwu Dang:
TacoLPCNet: Fast and Stable TTS by Conditioning LPCNet on Mel Spectrogram Predictions. 111-115 - Taejun Bak, Jae-Sung Bae, Hanbin Bae, Young-Ik Kim, Hoon-Young Cho:
FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis. 116-120 - Taiki Nakamura, Tomoki Koriyama, Hiroshi Saruwatari:
Sequence-to-Sequence Learning for Deep Gaussian Process Based Speech Synthesis Using Self-Attention GP Layer. 121-125 - Naoto Kakegawa, Sunao Hara, Masanobu Abe, Yusuke Ijima:
Phonetic and Prosodic Information Estimation from Texts for Genuine Japanese End-to-End Text-to-Speech. 126-130 - Xudong Dai, Cheng Gong, Longbiao Wang, Kaili Zhang:
Information Sieve: Content Leakage Reduction in End-to-End Prosody Transfer for Expressive Speech Synthesis. 131-135 - Qingyun Dou, Xixin Wu, Moquan Wan, Yiting Lu, Mark J. F. Gales:
Deliberation-Based Multi-Pass Speech Synthesis. 136-140 - Isaac Elias, Heiga Zen, Jonathan Shen, Yu Zhang, Ye Jia, R. J. Skerry-Ryan, Yonghui Wu:
Parallel Tacotron 2: A Non-Autoregressive Neural TTS Model with Differentiable Duration Modeling. 141-145 - Chunyang Wu, Zhiping Xiu, Yangyang Shi, Ozlem Kalinli, Christian Fuegen, Thilo Köhler, Qing He:
Transformer-Based Acoustic Modeling for Streaming Speech Synthesis. 146-150 - Ye Jia, Heiga Zen, Jonathan Shen, Yu Zhang, Yonghui Wu:
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS. 151-155 - Zhenhao Ge, Lakshmish Kaushik, Masanori Omote, Saket Kumar:
Speed up Training with Variable Length Inputs by Efficient Batching Strategies. 156-160
Speech Enhancement and Intelligibility
- Yuhang Sun, Linju Yang, Huifeng Zhu, Jie Hao:
Funnel Deep Complex U-Net for Phase-Aware Speech Enhancement. 161-165 - Qiquan Zhang, Qi Song, Aaron Nicolson, Tian Lan, Haizhou Li:
Temporal Convolutional Network with Frequency Dimension Adaptive Attention for Speech Enhancement. 166-170 - Changjie Pan, Feng Yang, Fei Chen:
Perceptual Contributions of Vowels and Consonant-Vowel Transitions in Understanding Time-Compressed Mandarin Sentences. 171-175 - Ritujoy Biswas, Karan Nathwani, Vinayak Abrol:
Transfer Learning for Speech Intelligibility Improvement in Noisy Environments. 176-180 - Ayako Yamamoto, Toshio Irino, Kenichi Arai, Shoko Araki, Atsunori Ogawa, Keisuke Kinoshita, Tomohiro Nakatani:
Comparison of Remote Experiments Using Crowdsourcing and Laboratory Experiments on Speech Intelligibility. 181-185 - Wenzhe Liu, Andong Li, Yuxuan Ke, Chengshi Zheng, Xiaodong Li:
Know Your Enemy, Know Yourself: A Unified Two-Stage Framework for Speech Enhancement. 186-190 - Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang:
Speech Enhancement with Weakly Labelled Data from AudioSet. 191-195 - Tsun-An Hsieh, Cheng Yu, Szu-Wei Fu, Xugang Lu, Yu Tsao:
Improving Perceptual Quality by Phone-Fortified Perceptual Loss Using Wasserstein Distance for Speech Enhancement. 196-200 - Szu-Wei Fu, Cheng Yu, Tsun-An Hsieh, Peter Plantinga, Mirco Ravanelli, Xugang Lu, Yu Tsao:
MetricGAN+: An Improved Version of MetricGAN for Speech Enhancement. 201-205 - Amin Edraki, Wai-Yip Chan, Jesper Jensen, Daniel Fogerty:
A Spectro-Temporal Glimpsing Index (STGI) for Speech Intelligibility Prediction. 206-210 - Yuanhang Qiu, Ruili Wang, Satwinder Singh, Zhizhong Ma, Feng Hou:
Self-Supervised Learning Based Phone-Fortified Speech Enhancement. 211-215 - Khandokar Md. Nayem, Donald S. Williamson:
Incorporating Embedding Vectors from a Human Mean-Opinion Score Prediction Model for Monaural Speech Enhancement. 216-220 - Jianwei Zhang, Suren Jayasuriya, Visar Berisha:
Restoring Degraded Speech via a Modified Diffusion Model. 221-225
Spoken Dialogue Systems I
- Hoang Long Nguyen, Vincent Renkens, Joris Pelemans, Srividya Pranavi Potharaju, Anil Kumar Nalamalapu, Murat Akbacak:
User-Initiated Repetition-Based Recovery in Multi-Utterance Dialogue Systems. 226-230 - Nuo Chen, Chenyu You, Yuexian Zou:
Self-Supervised Dialogue Learning for Spoken Conversational Question Answering. 231-235 - Ruolin Su, Ting-Wei Wu, Biing-Hwang Juang:
Act-Aware Slot-Value Predicting in Multi-Domain Dialogue State Tracking. 236-240 - Yuya Chiba, Ryuichiro Higashinaka:
Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information. 241-245 - Yoshihiro Yamazaki, Yuya Chiba, Takashi Nose, Akinori Ito:
Neural Spoken-Response Generation Using Prosodic and Linguistic Context for Conversational Systems. 246-250 - Weiyuan Xu, Peilin Zhou, Chenyu You, Yuexian Zou:
Semantic Transportation Prototypical Network for Few-Shot Intent Detection. 251-255 - Li Tang, Yuke Si, Longbiao Wang, Jianwu Dang:
Domain-Specific Multi-Agent Dialog Policy Learning in Multi-Domain Task-Oriented Scenarios. 256-260 - Haoyu Wang, John Chen, Majid Laali, Kevin Durda, Jeff King, William Campbell, Yang Liu:
Leveraging ASR N-Best in Deep Entity Retrieval. 261-265
Topics in ASR: Robustness, Feature Extraction, and Far-Field ASR
- Shuai Zhang, Jiangyan Yi, Zhengkun Tian, Ye Bai, Jianhua Tao, Xuefei Liu, Zhengqi Wen:
End-to-End Spelling Correction Conditioned on Acoustic Feature for Code-Switching Speech Recognition. 266-270 - Kathleen Siminyu, Xinjian Li, Antonios Anastasopoulos, David R. Mortensen, Michael R. Marlo, Graham Neubig:
Phoneme Recognition Through Fine Tuning of Phonetic Representations: A Case Study on Luhya Language Varieties. 271-275 - Erfan Loweimi, Zoran Cvetkovic, Peter Bell, Steve Renals:
Speech Acoustic Modelling Using Raw Source and Filter Components. 276-280 - Masakiyo Fujimoto, Hisashi Kawai:
Noise Robust Acoustic Modeling for Single-Channel Speech Recognition Based on a Stream-Wise Transformer Architecture. 281-285 - Anton Ratnarajah, Zhenyu Tang, Dinesh Manocha:
IR-GAN: Room Impulse Response Generator for Far-Field Speech Recognition. 286-290 - Junqi Chen, Xiao-Lei Zhang:
Scaling Sparsemax Based Channel Selection for Speech Recognition with ad-hoc Microphone Arrays. 291-295 - Feng-Ju Chang, Martin Radfar, Athanasios Mouchtaris, Maurizio Omologo:
Multi-Channel Transformer Transducer for Speech Recognition. 296-300 - Emiru Tsunoo, Kentaro Shibata, Chaitanya Narisetty, Yosuke Kashiwagi, Shinji Watanabe:
Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios. 301-305 - Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang:
Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition. 306-310 - Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Paden Tomasello, Jacob Kahn, Gilad Avidov, Ronan Collobert, Gabriel Synnaeve:
Rethinking Evaluation in ASR: Are Our Models Robust Enough? 311-315 - Max W. Y. Lam, Jun Wang, Chao Weng, Dan Su, Dong Yu:
Raw Waveform Encoder with Multi-Scale Globally Attentive Locally Recurrent Networks for End-to-End Speech Recognition. 316-320
Voice Activity Detection and Keyword Spotting
- Yuanbo Hou, Zhesong Yu, Xia Liang, Xingjian Du, Bilei Zhu, Zejun Ma, Dick Botteldooren:
Attention-Based Cross-Modal Fusion for Audio-Visual Voice Activity Detection in Musical Video Streams. 321-325 - Ui-Hyun Kim:
Noise-Tolerant Self-Supervised Learning for Audio-Visual Voice Activity Detection. 326-330 - Hyun-Jin Park, Pai Zhu, Ignacio López-Moreno, Niranjan Subrahmanya:
Noisy Student-Teacher Training for Robust Keyword Spotting. 331-335 - Osamu Ichikawa, Kaito Nakano, Takahiro Nakayama, Hajime Shirouzu:
Multi-Channel VAD for Transcription of Group Discussion. 336-340 - Hengshun Zhou, Jun Du, Hang Chen, Zijun Jing, Shifu Xiong, Chin-Hui Lee:
Audio-Visual Information Fusion Using Cross-Modal Teacher-Student Learning for Voice Activity Detection in Realistic Environments. 341-345 - Naoki Makishima, Mana Ihori, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura:
Enrollment-Less Training for Personalized Voice Activity Detection. 346-350 - Yuto Nonaka, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki:
Voice Activity Detection for Live Speech of Baseball Game Based on Tandem Connection with Speech/Noise Separation Model. 351-355 - Young D. Kwon, Jagmohan Chauhan, Cecilia Mascolo:
FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications. 356-360 - Bo Wei, Meirong Yang, Tao Zhang, Xiao Tang, Xing Huang, Kyuhong Kim, Jaeyun Lee, Kiho Cho, Sung-Un Park:
End-to-End Transformer-Based Open-Vocabulary Keyword Spotting with Location-Guided Local Attention. 361-365 - Saurabhchand Bhati, Jesús Villalba, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak:
Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation. 366-370 - Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu:
A Lightweight Framework for Online Voice Activity Detection in the Wild. 371-375
Voice and Voicing
- Aurélie Chlébowski, Nicolas Ballier:
"See what I mean, huh?" Evaluating Visual Inspection of F0 Tracking in Nasal Grunts. 376-380 - Bruce Xiao Wang, Vincent Hughes:
System Performance as a Function of Calibration Methods, Sample Size and Sampling Variability in Likelihood Ratio-Based Forensic Voice Comparison. 381-385 - Anne Bonneau:
Voicing Assimilations by French Speakers of German in Stop-Fricative Sequences. 386-390 - Titas Chakraborty, Vaishali Patil, Preeti Rao:
The Four-Way Classification of Stops with Voicing and Aspiration for Non-Native Speech Evaluation. 391-395 - Saba Urooj, Benazir Mumtaz, Sarmad Hussain, Ehsan ul Haq:
Acoustic and Prosodic Correlates of Emotions in Urdu Speech. 396-400 - Nour Tamim, Silke Hamann:
Voicing Contrasts in the Singleton Stops of Palestinian Arabic: Production and Perception. 401-405 - Thomas Coy, Vincent Hughes, Philip Harrison, Amelia Jane Gully:
A Comparison of the Accuracy of Dissen and Keshet's (2016) DeepFormants and Traditional LPC Methods for Semi-Automatic Speaker Recognition. 406-410 - Michael Jessen:
MAP Adaptation Characteristics in Forensic Long-Term Formant Analysis. 411-415 - Justin J. H. Lo:
Cross-Linguistic Speaker Individuality of Long-Term Formant Distributions: Phonetic and Forensic Perspectives. 416-420 - Rachel Soo, Khia A. Johnson, Molly Babel:
Sound Change in Spontaneous Bilingual Speech: A Corpus Study on the Cantonese n-l Merger in Cantonese-English Bilinguals. 421-425 - Wendy Lalhminghlui, Priyankoo Sarmah:
Characterizing Voiced and Voiceless Nasals in Mizo. 426-430
The INTERSPEECH 2021 Computational Paralinguistics Challenge (ComParE) - COVID-19 Cough, COVID-19 Speech, Escalation & Primates
- Björn W. Schuller, Anton Batliner, Christian Bergler, Cecilia Mascolo, Jing Han, Iulia Lefter, Heysem Kaya, Shahin Amiriparian, Alice Baird, Lukas Stappen, Sandra Ottl, Maurice Gerczuk, Panagiotis Tzirakis, Chloë Brown, Jagmohan Chauhan, Andreas Grammenos, Apinan Hasthanasombat, Dimitris Spathis, Tong Xia, Pietro Cicuta, Léon J. M. Rothkrantz, Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp:
The INTERSPEECH 2021 Computational Paralinguistics Challenge: COVID-19 Cough, COVID-19 Speech, Escalation & Primates. 431-435 - Rubén Solera-Ureña, Catarina Botelho, Francisco Teixeira, Thomas Rolland, Alberto Abad, Isabel Trancoso:
Transfer Learning-Based Cough Representations for Automatic Detection of COVID-19. 436-440 - Philipp Klumpp, Tobias Bocklet, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Juan Rafael Orozco-Arroyave, Elmar Nöth:
The Phonetic Footprint of Covid-19? 441-445 - Edresson Casanova, Arnaldo Candido Jr., Ricardo Corso Fernandes Junior, Marcelo Finger, Lucas Rafael Stefanel Gris, Moacir Antonelli Ponti, Daniel Peixoto Pinto da Silva:
Transfer Learning and Data Augmentation Techniques to the COVID-19 Identification Tasks in ComParE 2021. 446-450 - Steffen Illium, Robert Müller, Andreas Sedlmeier, Claudia Linnhoff-Popien:
Visual Transformers for Primates Classification and Covid Detection. 451-455 - Thomas Pellegrini:
Deep-Learning-Based Central African Primate Species Classification with MixUp and SpecAugment. 456-460 - Robert Müller, Steffen Illium, Claudia Linnhoff-Popien:
A Deep and Recurrent Architecture for Primate Vocalization Classification. 461-465 - Joeri A. Zwerts, Jelle Treep, Casper S. Kaandorp, Floor Meewis, Amparo C. Koot, Heysem Kaya:
Introducing a Central African Primate Vocalisation Dataset for Automated Species Classification. 466-470 - Georgios Rizos, Jenna Lawson, Zhuoda Han, Duncan Butler, James Rosindell, Krystian Mikolajczyk, Cristina Banks-Leite, Björn W. Schuller:
Multi-Attentive Detection of the Spider Monkey Whinny in the (Actual) Wild. 471-475 - José Vicente Egas López, Mercedes Vetráb, László Tóth, Gábor Gosztolya:
Identifying Conflict Escalation and Primates by Using Ensemble X-Vectors and Fisher Vector Features. 476-480 - Oxana Verkholyak, Denis Dresvyanskiy, Anastasia Dvoynikova, Denis Kotov, Elena Ryumina, Alena Velichko, Danila Mamontov, Wolfgang Minker, Alexey Karpov:
Ensemble-Within-Ensemble Classification for Escalation Prediction from Speech. 481-485 - Dominik Schiller, Silvan Mertes, Pol van Rijn, Elisabeth André:
Analysis by Synthesis: Using an Expressive TTS Model as Feature Extractor for Paralinguistic Speech Classification. 486-490
Survey Talk 1: Heidi Christensen
- Heidi Christensen:
Towards Automatic Speech Recognition for People with Atypical Speech.
Embedding and Network Architecture for Speaker Recognition
- Chau Luu, Peter Bell, Steve Renals:
Leveraging Speaker Attribute Information Using Multi Task Learning for Speaker Verification and Diarization. 491-495 - Magdalena Rybicka, Jesús Villalba, Piotr Zelasko, Najim Dehak, Konrad Kowalczyk:
Spine2Net: SpineNet with Res2Net and Time-Squeeze-and-Excitation Blocks for Speaker Recognition. 496-500 - Themos Stafylakis, Johan Rohdin, Lukás Burget:
Speaker Embeddings by Modeling Channel-Wise Correlations. 501-505 - Weipeng He, Petr Motlícek, Jean-Marc Odobez:
Multi-Task Neural Network for Robust Multiple Speaker Embedding Extraction. 506-510 - Junyi Peng, Xiaoyang Qu, Jianzong Wang, Rongzhi Gu, Jing Xiao, Lukás Burget, Jan Cernocký:
ICSpk: Interpretable Complex Speaker Embedding Extractor from Raw Waveform. 511-515
Speech Perception I
- Xiao Xiao, Nicolas Audibert, Grégoire Locqueville, Christophe d'Alessandro, Barbara Kühnert, Claire Pillot-Loiseau:
Prosodic Disambiguation Using Chironomic Stylization of Intonation with Native and Non-Native Speakers. 516-520 - Aleese Block, Michelle Cohn, Georgia Zellou:
Variation in Perceptual Sensitivity and Compensation for Coarticulation Across Adult and Child Naturally-Produced and TTS Voices. 521-525 - Mohammad Jalilpour-Monesi, Bernd Accou, Tom Francart, Hugo Van hamme:
Extracting Different Levels of Speech Information from EEG Using an LSTM-Based Model. 526-530 - Louis ten Bosch, Lou Boves:
Word Competition: An Entropy-Based Approach in the DIANA Model of Human Word Comprehension. 531-535 - Louis ten Bosch, Lou Boves:
Time-to-Event Models for Analyzing Reaction Time Sequences. 536-540 - Sophie Brand, Kimberley Mulder, Louis ten Bosch, Lou Boves:
Models of Reaction Times in Auditory Lexical Decision: RTonset versus RToffset. 541-545
Acoustic Event Detection and Acoustic Scene Classification
- Gwantae Kim, David K. Han, Hanseok Ko:
SpecMix : A Mixed Sample Data Augmentation Method for Training with Time-Frequency Domain Features. 546-550 - Helin Wang, Yuexian Zou, Wenwu Wang:
SpecAugment++: A Hidden Space Data Augmentation Method for Acoustic Scene Classification. 551-555 - Xu Zheng, Yan Song, Li-Rong Dai, Ian McLoughlin, Lin Liu:
An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection. 556-560 - Ritika Nandi, Shashank Shekhar, Manjunath Mulimani:
Acoustic Scene Classification Using Kervolution-Based SubSpectralNet. 561-565 - Harshavardhan Sundar, Ming Sun, Chao Wang:
Event Specific Attention for Polyphonic Sound Event Detection. 566-570 - Yuan Gong, Yu-An Chung, James R. Glass:
AST: Audio Spectrogram Transformer. 571-575 - Soonshin Seo, Donghyun Lee, Ji-Hwan Kim:
Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene. 576-580 - Helen L. Bear, Veronica Morfi, Emmanouil Benetos:
An Evaluation of Data Augmentation Methods for Sound Scene Geotagging. 581-585 - Chiori Hori, Takaaki Hori, Jonathan Le Roux:
Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers. 586-590 - Shijing Si, Jianzong Wang, Huiming Sun, Jianhan Wu, Chuanyao Zhang, Xiaoyang Qu, Ning Cheng, Lei Chen, Jing Xiao:
Variational Information Bottleneck for Effective Low-Resource Audio Classification. 591-595 - Soham Deshmukh, Bhiksha Raj, Rita Singh:
Improving Weakly Supervised Sound Event Detection with Self-Supervised Auxiliary Tasks. 596-600 - Tatsuya Komatsu, Shinji Watanabe, Koichi Miyazaki, Tomoki Hayashi:
Acoustic Event Detection with Classifier Chains. 601-605
Diverse Modes of Speech Acquisition and Processing
- Shu-Chuan Tseng, Yi-Fen Liu:
Segment and Tone Production in Continuous Speech of Hearing and Hearing-Impaired Children. 606-610 - Feng Wang, Jing Chen, Fei Chen:
Effect of Carrier Bandwidth on Understanding Mandarin Sentences in Simulated Electric-Acoustic Hearing. 611-615 - Manthan Sharma, Navaneetha Gaddam, Tejas Umesh, Aditya Murthy, Prasanta Kumar Ghosh:
A Comparative Study of Different EMG Features for Acoustics-to-EMG Mapping. 616-620 - Ajish K. Abraham, V. Sivaramakrishnan, N. Swapna, N. Manohar:
Image-Based Assessment of Jaw Parameters and Jaw Kinematics for Articulatory Simulation: Preliminary Results. 621-625 - Jianrong Wang, Nan Gu, Mei Yu, Xuewei Li, Qiang Fang, Li Liu:
An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech. 626-630 - Judith Dineley, Grace Lavelle, Daniel Leightley, Faith Matcham, Sara Siddi, Maria Teresa Peñarrubia-María, Katie M. White, Alina Ivan, Carolin Oetzmann, Sara Simblett, Erin Dawe-Lane, Stuart Bruce, Daniel Stahl, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Amos A. Folarin, Josep Maria Haro, Til Wykes, Richard J. B. Dobson, Vaibhav A. Narayan, Matthew Hotopf, Björn W. Schuller, Nicholas Cummins, RADAR-CNS Consortium:
Remote Smartphone-Based Speech Collection: Acceptance and Barriers in Individuals with Major Depressive Disorder. 631-635 - Sarah R. Li, Colin T. Annand, Sarah Dugan, Sarah M. Schwab, Kathryn J. Eary, Michael Swearengen, Sarah Stack, Suzanne Boyce, Michael A. Riley, T. Douglas Mast:
An Automatic, Simple Ultrasound Biofeedback Parameter for Distinguishing Accurate and Misarticulated Rhotic Syllables. 636-640 - Manuel Sam Ribeiro, Aciel Eshky, Korin Richmond, Steve Renals:
Silent versus Modal Multi-Speaker Speech Recognition from Ultrasound and Video. 641-645 - David Ferreira, Samuel S. Silva, Francisco Curado, António J. S. Teixeira:
RaSSpeR: Radar-Based Silent Speech Recognition. 646-650 - Beiming Cao, Nordine Sebkhi, Arpan Bhavsar, Omer T. Inan, Robin Samlan, Ted Mau, Jun Wang:
Investigating Speech Reconstruction for Laryngectomees for Silent Speech Interfaces. 651-655
Multi-Channel Speech Enhancement and Hearing Aids
- Hendrik Schröter, Tobias Rosenkranz, Alberto N. Escalante-B., Andreas K. Maier:
LACOPE: Latency-Constrained Pitch Estimation for Speech Enhancement. 656-660 - Mathieu Fontaine, Kouhei Sekiguchi, Aditya Arie Nugraha, Yoshiaki Bando, Kazuyoshi Yoshii:
Alpha-Stable Autoregressive Fast Multichannel Nonnegative Matrix Factorization for Joint Speech Enhancement and Dereverberation. 661-665 - Siyuan Zhang, Xiaofei Li:
Microphone Array Generalization for Multichannel Narrowband Deep Speech Enhancement. 666-670 - Hyungchan Song, Jong Won Shin:
Multiple Sound Source Localization Based on Interchannel Phase Differences in All Frequencies with Spectral Masks. 671-675 - Pablo Pérez Zarazaga, Mariem Bouafif Mansali, Tom Bäckström, Zied Lachiri:
Cancellation of Local Competing Speaker with Near-Field Localization for Distributed ad-hoc Sensor Network. 676-680 - Hao Zhang, DeLiang Wang:
A Deep Learning Method to Multi-Channel Active Noise Control. 681-685 - Simone Graetzer, Jon Barker, Trevor J. Cox, Michael Akeroyd, John F. Culling, Graham Naylor, Eszter Porter, Rhoddy Viveros Muñoz:
Clarity-2021 Challenges: Machine Learning Challenges for Advancing Hearing Aid Processing. 686-690 - Zehai Tu, Ning Ma, Jon Barker:
Optimising Hearing Aid Fittings for Speech in Noise with a Differentiable Hearing Loss Model. 691-695 - Sunit Sivasankaran, Emmanuel Vincent, Dominique Fohr:
Explaining Deep Learning Models for Speech Enhancement. 696-700 - Weilong Huang, Jinwei Feng:
Minimum-Norm Differential Beamforming for Linear Array with Directional Microphones. 701-705
Self-Supervision and Semi-Supervision for Neural ASR Training
- Songjun Cao, Yueteng Kang, Yanzhe Fu, Xiaoshuo Xu, Sining Sun, Yike Zhang, Long Ma:
Improving Streaming Transformer Based ASR Under a Framework of Self-Supervised Learning. 706-710 - Samik Sadhu, Di He, Che-Wei Huang, Sri Harish Mallidi, Minhua Wu, Ariya Rastrow, Andreas Stolcke, Jasha Droppo, Roland Maas:
wav2vec-C: A Self-Supervised Model for Speech Representation Learning. 711-715 - Electra Wallington, Benji Kershenbaum, Ondrej Klejch, Peter Bell:
On the Learning Dynamics of Semi-Supervised Training for ASR. 716-720 - Wei-Ning Hsu, Anuroop Sriram, Alexei Baevski, Tatiana Likhomanenko, Qiantong Xu, Vineel Pratap, Jacob Kahn, Ann Lee, Ronan Collobert, Gabriel Synnaeve, Michael Auli:
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training. 721-725 - Yosuke Higuchi, Niko Moritz, Jonathan Le Roux, Takaaki Hori:
Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition. 726-730 - Ananya Misra, Dongseong Hwang, Zhouyuan Huo, Shefali Garg, Nikhil Siddhartha, Arun Narayanan, Khe Chai Sim:
A Comparison of Supervised and Unsupervised Pre-Training of End-to-End Models. 731-735 - Zhehuai Chen, Andrew Rosenberg, Yu Zhang, Heiga Zen, Mohammadreza Ghodsi, Yinghui Huang, Jesse Emond, Gary Wang, Bhuvana Ramabhadran, Pedro J. Moreno:
Semi-Supervision in ASR: Sequential MixMatch and Factorized TTS-Based Augmentation. 736-740 - Tatiana Likhomanenko, Qiantong Xu, Jacob Kahn, Gabriel Synnaeve, Ronan Collobert:
slimIPL: Language-Model-Free Iterative Pseudo-Labeling. 741-745 - Xianghu Yue, Haizhou Li:
Phonetically Motivated Self-Supervised Speech Representation Learning. 746-750 - Yan Deng, Rui Zhao, Zhong Meng, Xie Chen, Bing Liu, Jinyu Li, Yifan Gong, Lei He:
Improving RNN-T for Domain Scaling Using Semi-Supervised Training with Neural TTS. 751-755
Spoken Language Processing I
- Scott Seyfarth, Sundararajan Srinivasan, Katrin Kirchhoff:
Speaker-Conversation Factorial Designs for Diarization Error Analysis. 756-760 - Ross McGowan, Jinru Su, Vince DiCocco, Thejaswi Muniyappa, Grant P. Strimel:
SmallER: Scaling Neural Entity Resolution for Edge Devices. 761-765 - Johann C. Rocholl, Vicky Zayats, Daniel D. Walker, Noah B. Murad, Aaron Schneider, Daniel J. Liebling:
Disfluency Detection with Unlabeled Data and Small BERT Models. 766-770 - Qian Chen, Wen Wang, Mengzhe Chen, Qinglin Zhang:
Discriminative Self-Training for Punctuation Prediction. 771-775 - Mana Ihori, Naoki Makishima, Tomohiro Tanaka, Akihiko Takashima, Shota Orihashi, Ryo Masumura:
Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks Using Switching Tokens. 776-780 - Binghuai Lin, Liyuan Wang:
A Noise Robust Method for Word-Level Pronunciation Assessment. 781-785 - Jonathan Wintrode:
Targeted Keyword Filtering for Accelerated Spoken Topic Identification. 786-790 - Shruti Palaskar, Ruslan Salakhutdinov, Alan W. Black, Florian Metze:
Multimodal Speech Summarization Through Semantic Concept Learning. 791-795 - Hyunjae Lee, Jaewoong Yun, Hyunjin Choi, Seongho Joe, Youngjune L. Gwon:
Enhancing Semantic Understanding with Self-Supervised Methods for Abstractive Dialogue Summarization. 796-800 - Marcin Wlodarczak, Emer Gilmartin:
Speaker Transition Patterns in Three-Party Conversation: Evidence from English, Estonian and Swedish. 801-805
Voice Conversion and Adaptation II
- Samuel J. Broughton, Md. Asif Jalal, Roger K. Moore:
Investigating Deep Neural Structures and their Interpretability in the Domain of Voice Conversion. 806-810 - Kun Zhou, Berrak Sisman, Haizhou Li:
Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-Stage Sequence-to-Sequence Training. 811-815 - Yi-Yang Ding, Li-Juan Liu, Yu Hu, Zhen-Hua Ling:
Adversarial Voice Conversion Against Neural Spoofing Detectors. 816-820 - Xiangheng He, Junjie Chen, Georgios Rizos, Björn W. Schuller:
An Improved StarGAN for Emotional Voice Conversion: Enhancing Voice Quality and Data Augmentation. 821-825 - Ziyi Chen, Pengyuan Zhang:
TVQVC: Transformer Based Vector Quantized Variational Autoencoder with CTC Loss for Voice Conversion. 826-830 - Zhichao Wang, Xinyong Zhou, Fengyu Yang, Tao Li, Hongqiang Du, Lei Xie, Wendong Gan, Haitao Chen, Hai Li:
Enriching Source Style Transfer in Recognition-Synthesis Based Non-Parallel Voice Conversion. 831-835 - Jheng-Hao Lin, Yist Y. Lin, Chung-Ming Chien, Hung-yi Lee:
S2VC: A Framework for Any-to-Any Voice Conversion with Self-Supervised Pretrained Representations. 836-840 - Christopher Liberatore, Ricardo Gutierrez-Osuna:
An Exemplar Selection Algorithm for Native-Nonnative Voice Conversion. 841-845 - Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng:
Adversarially Learning Disentangled Speech Representations for Robust Multi-Factor Voice Conversion. 846-850 - Manh Luong, Viet-Anh Tran:
Many-to-Many Voice Conversion Based Feature Disentanglement Using Variational Autoencoder. 851-855
Privacy-Preserving Machine Learning for Audio & Speech Processing
- Oubaïda Chouchane, Baptiste Brossier, Jorge Esteban Gamboa Gamboa, Thomas Lardy, Hemlata Tak, Orhan Ermis, Madhu R. Kamble, Jose Patino, Nicholas W. D. Evans, Melek Önen, Massimiliano Todisco:
Privacy-Preserving Voice Anti-Spoofing Using Secure Multi-Party Computation. 856-860 - Ranya Aloufi, Hamed Haddadi, David Boyle:
Configurable Privacy-Preserving Automatic Speech Recognition. 861-865 - Scott Novotney, Yile Gu, Ivan Bulyko:
Adjunct-Emeritus Distillation for Semi-Supervised Language Model Adaptation. 866-870 - Jae Ro, Mingqing Chen, Rajiv Mathews, Mehryar Mohri, Ananda Theertha Suresh:
Communication-Efficient Agnostic Federated Averaging. 871-875 - Timm Koppelmann, Alexandru Nelus, Lea Schönherr, Dorothea Kolossa, Rainer Martin:
Privacy-Preserving Feature Extraction for Cloud-Based Wake Word Verification. 876-880 - Chao-Han Huck Yang, Sabato Marco Siniscalchi, Chin-Hui Lee:
PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification. 881-885 - Haoxin Ma, Jiangyan Yi, Jianhua Tao, Ye Bai, Zhengkun Tian, Chenglong Wang:
Continual Learning for Fake Audio Detection. 886-890 - Muhammad A. Shah, Joseph Szurley, Markus Müller, Athanasios Mouchtaris, Jasha Droppo:
Evaluating the Vulnerability of End-to-End Automatic Speech Recognition Models to Membership Inference Attacks. 891-895 - Amin Fazel, Wei Yang, Yulan Liu, Roberto Barra-Chicote, Yixiong Meng, Roland Maas, Jasha Droppo:
SynthASR: Unlocking Synthetic Data for Speech Recognition. 896-900
The First DiCOVA Challenge: Diagnosis of COVID-19 Using Acoustics
- Ananya Muguli, Lancelot Pinto, Nirmala R., Neeraj Kumar Sharma, Prashant Krishnan V, Prasanta Kumar Ghosh, Rohit Kumar, Shrirama Bhat, Srikanth Raj Chetupalli, Sriram Ganapathy, Shreyas Ramoji, Viral Nanda:
DiCOVA Challenge: Dataset, Task, and Baseline System for COVID-19 Diagnosis Using Acoustics. 901-905 - Madhu R. Kamble, José Andrés González López, Teresa Grau, Juan M. Espín, Lorenzo Cascioli, Yiqing Huang, Alejandro Gómez Alanís, Jose Patino, Roberto Font, Antonio M. Peinado, Angel M. Gomez, Nicholas W. D. Evans, Maria A. Zuluaga, Massimiliano Todisco:
PANACEA Cough Sound-Based Diagnosis of COVID-19 for the DiCOVA 2021 Challenge. 906-910 - Vincent Karas, Björn W. Schuller:
Recognising Covid-19 from Coughing Using Ensembles of SVMs and LSTMs with Handcrafted and Deep Audio Features. 911-915 - Isabella Södergren, Maryam Pahlavan Nodeh, Prakash Chandra Chhipa, Konstantina Nikolaidou, György Kovács:
Detecting COVID-19 from Audio Recording of Coughs Using Random Forests and Support Vector Machines. 916-920 - Rohan Kumar Das, Maulik C. Madhavi, Haizhou Li:
Diagnosis of COVID-19 Using Auditory Acoustic Cues. 921-925 - John B. Harvill, Yash R. Wani, Mark Hasegawa-Johnson, Narendra Ahuja, David G. Beiser, David Chestek:
Classification of COVID-19 from Cough Using Autoregressive Predictive Coding Pretraining and Spectral Data Augmentation. 926-930 - Gauri Deshpande, Björn W. Schuller:
The DiCOVA 2021 Challenge - An Encoder-Decoder Approach for COVID-19 Recognition from Coughing Audio. 931-935 - Kotra Venkata Sai Ritwik, Shareef Babu Kalluri, Deepu Vijayasenan:
COVID-19 Detection from Spectral Features on the DiCOVA Dataset. 936-940 - Adria Mallol-Ragolta, Helena Cuesta, Emilia Gómez, Björn W. Schuller:
Cough-Based COVID-19 Detection with Contextual Attention Convolutional Neural Networks and Gender Information. 941-945 - Swapnil Bhosale, Upasana Tiwari, Rupayan Chakraborty, Sunil Kumar Kopparapu:
Contrastive Learning of Cough Descriptors for Automatic COVID-19 Preliminary Diagnosis. 946-950 - Flávio Ávila, Amir H. Poorjam, Deepak Mittal, Charles Dognin, Ananya Muguli, Rohit Kumar, Srikanth Raj Chetupalli, Sriram Ganapathy, Maneesh Singh:
Investigating Feature Selection and Explainability for COVID-19 Diagnostics from Cough Sounds. 951-955
Show and Tell 1
- Gábor Kiss, Dávid Sztahó, Miklós Gábriel Tulics:
Application for Detecting Depression, Parkinson's Disease and Dysphonic Speech. 956-957 - Lenka Weingartová, Veronika Volna, Ewa Balejová:
Beey: More Than a Speech-to-Text Editor. 958-959 - Takayuki Arai:
Downsizing of Vocal-Tract Models to Line up Variations and Reduce Manufacturing Costs. 960-961 - Maël Fabien, Shantipriya Parida, Petr Motlícek, Dawei Zhu, Aravind Krishnan, Hoang H. Nguyen:
ROXANNE Research Platform: Automate Criminal Investigations. 962-964 - Alexandre Flucha, Anthony Larcher, Ambuj Mehrish, Sylvain Meignier, Florian Plaut, Nicolas Poupon, Yevhenii Prokopalo, Adrien Puertolas, Meysam Shamsi, Marie Tahon:
The LIUM Human Active Correction Platform for Speaker Diarization. 965-966 - Yoo Rhee Oh, Kiyoung Park:
On-Device Streaming Transformer-Based End-to-End Speech Recognition. 967-968 - Jaroslav Cmejla, Tomás Kounovský, Jakub Janský, Jirí Málek, M. Rozkovec, Zbynek Koldovský:
Advanced Semi-Blind Speaker Extraction and Tracking Implemented in Experimental Device with Revolving Dense Microphone Array. 969-970
Keynote 1: Hermann Ney
- Hermann Ney:
Forty Years of Speech and Language Processing: From Bayes Decision Rule to Deep Learning.
ASR Technologies and Systems
- Jan Chorowski, Grzegorz Ciesielski, Jaroslaw Dzikowski, Adrian Lancucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Pawel Rychlikowski, Michal Stypulkowski:
Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw. 971-975 - Jan Chorowski, Grzegorz Ciesielski, Jaroslaw Dzikowski, Adrian Lancucki, Ricard Marxer, Mateusz Opala, Piotr Pusz, Pawel Rychlikowski, Michal Stypulkowski:
Aligned Contrastive Predictive Coding. 976-980 - Benjamin Suter, Josef Novák:
Neural Text Denormalization for Speech Transcripts. 981-985 - Aditya Joglekar, Seyed Omid Sadjadi, Meena Chandra Shekar, Christopher Cieri, John H. L. Hansen:
Fearless Steps Challenge Phase-3 (FSC P3): Advancing SLT for Unseen Channel and Mission Data Across NASA Apollo Audio. 986-990
Phonation and Voicing
- Hannah Leykum:
Voice Quality in Verbal Irony: Electroglottographic Analyses of Ironic Utterances in Standard Austrian German. 991-995 - Mathilde Hutin, Yaru Wu, Adèle Jatteau, Ioana Vasilescu, Lori Lamel, Martine Adda-Decker:
Synchronic Fortition in Five Romance Languages? A Large Corpus-Based Study of Word-Initial Devoicing. 996-1000 - Ivan Kraljevski, Maria Paola Bissiri, Frank Duckhorn, Constanze Tschöpe, Matthias Wolff:
Glottal Stops in Upper Sorbian: A Data-Driven Approach. 1001-1005 - Bogdan Ludusan, Petra Wagner, Marcin Wlodarczak:
Cue Interaction in the Perception of Prosodic Prominence: The Role of Voice Quality. 1006-1010 - Jenifer Vega Rodríguez, Nathalie Vallée:
Glottal Sounds in Korebaju. 1011-1014 - Anaïs Chanclu, Imen Ben Amor, Cédric Gendrot, Emmanuel Ferragne, Jean-François Bonastre:
Automatic Classification of Phonation Types in Spontaneous Speech: Towards a New Workflow for the Characterization of Speakers' Voice Quality. 1015-1018
Health and Affect I
- Rob J. J. H. van Son:
Measuring Voice Quality Parameters After Speaker Pseudonymization. 1019-1023 - Lars Steinert, Felix Putze, Dennis Küster, Tanja Schultz:
Audio-Visual Recognition of Emotional Engagement of People with Dementia. 1024-1028 - Pascal Hecker, Florian B. Pokorny, Katrin D. Bartl-Pokorny, Uwe D. Reichel, Zhao Ren, Simone Hantke, Florian Eyben, Dagmar M. Schuller, Bert Arnrich, Björn W. Schuller:
Speaking Corona? Human and Machine Recognition of COVID-19 from Voice. 1029-1033 - Huyen Nguyen, Ralph Vente, David Lupea, Sarah Ita Levitan, Julia Hirschberg:
Acoustic-Prosodic, Lexical and Demographic Cues to Persuasiveness in Competitive Debate Speeches. 1034-1038
Robust Speaker Recognition
- Bengt J. Borgström:
Unsupervised Bayesian Adaptation of PLDA for Speaker Verification. 1039-1043 - Weiqing Wang, Danwei Cai, Jin Wang, Qingjian Lin, Xuyang Wang, Mi Hong, Ming Li:
The DKU-Duke-Lenovo System Description for the Fearless Steps Challenge Phase III. 1044-1048 - Yafeng Chen, Wu Guo, Bin Gu:
Improved Meta-Learning Training for Speaker Verification. 1049-1053 - Dan Wang, Yuanjie Dong, Yaxing Li, Yunfei Zi, Zhihui Zhang, Xiaoqi Li, Shengwu Xiong:
Variational Information Bottleneck Based Regularization for Speaker Recognition. 1054-1058 - Niko Brümmer, Luciana Ferrer, Albert Swart:
Out of a Hundred Trials, How Many Errors Does Your Speaker Verifier Make? 1059-1063 - Roza Chojnacka, Jason Pelecanos, Quan Wang, Ignacio López-Moreno:
SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System. 1064-1068 - Zhiming Wang, Furong Xu, Kaisheng Yao, Yuan Cheng, Tao Xiong, Huijia Zhu:
AntVoice Neural Speaker Embedding System for FFSVC 2020. 1069-1073 - Jianchen Li, Jiqing Han, Hongwei Song:
Gradient Regularization for Noise-Robust Speaker Verification. 1074-1078 - Saurabh Kataria, Jesús Villalba, Piotr Zelasko, Laureano Moro-Velázquez, Najim Dehak:
Deep Feature CycleGANs: Speaker Identity Preserving Non-Parallel Microphone-Telephone Domain Adaptation for Speaker Verification. 1079-1083 - Jie Pu, Yuguang Yang, Ruirui Li, Oguz Elibol, Jasha Droppo:
Scaling Effect of Self-Supervised Speech Models. 1084-1088 - Yibo Wu, Longbiao Wang, Kong Aik Lee, Meng Liu, Jianwu Dang:
Joint Feature Enhancement and Speaker Recognition with Multi-Objective Task-Oriented Network. 1089-1093 - Li Zhang, Qing Wang, Kong Aik Lee, Lei Xie, Haizhou Li:
Multi-Level Transfer Learning from Near-Field to Far-Field Speaker Verification. 1094-1098 - Jose Patino, Natalia A. Tomashenko, Massimiliano Todisco, Andreas Nautsch, Nicholas W. D. Evans:
Speaker Anonymisation Using the McAdams Coefficient. 1099-1103
Source Separation, Dereverberation and Echo Cancellation
- Yiyu Luo, Jing Wang, Liang Xu, Lidong Yang:
Multi-Stream Gated and Pyramidal Temporal Convolutional Neural Networks for Audio-Visual Speech Separation in Multi-Talker Environments. 1104-1108 - Helin Wang, Bo Wu, Lianwu Chen, Meng Yu, Jianwei Yu, Yong Xu, Shi-Xiong Zhang, Chao Weng, Dan Su, Dong Yu:
TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation. 1109-1113 - Jianjun Gu, Longbiao Cheng, Xingwei Sun, Junfeng Li, Yonghong Yan:
Residual Echo and Noise Cancellation with Feature Attention Module and Multi-Domain Loss Function. 1114-1118 - Xiyun Li, Yong Xu, Meng Yu, Shi-Xiong Zhang, Jiaming Xu, Bo Xu, Dong Yu:
MIMO Self-Attentive RNN Beamformer for Multi-Speaker Speech Separation. 1119-1123 - Ritwik Giri, Shrikant Venkataramani, Jean-Marc Valin, Umut Isik, Arvindh Krishnaswamy:
Personalized PercepNet: Real-Time, Low-Complexity Target Voice Separation and Enhancement. 1124-1128 - Yochai Yemini, Ethan Fetaya, Haggai Maron, Sharon Gannot:
Scene-Agnostic Multi-Microphone Speech Dereverberation. 1129-1133 - Keitaro Tanaka, Ryosuke Sawata, Shusuke Takahashi:
Manifold-Aware Deep Clustering: Maximizing Angles Between Embedding Vectors Based on Regular Simplex. 1134-1138 - Hao Zhang, DeLiang Wang:
A Deep Learning Approach to Multi-Channel and Multi-Microphone Acoustic Echo Cancellation. 1139-1143 - Yueyue Na, Ziteng Wang, Zhang Liu, Biao Tian, Qiang Fu:
Joint Online Multichannel Acoustic Echo Cancellation, Speech Dereverberation and Source Separation. 1144-1148 - Hiroshi Sato, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Takafumi Moriya, Naoyuki Kamo:
Should We Always Separate?: Switching Between Enhanced and Observed Signals for Overlapping Speech Recognition. 1149-1153
Speech Signal Analysis and Representation I
- Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Estimating Articulatory Movements in Speech Production with Transformer Networks. 1154-1158 - Dongchao Yang, Helin Wang, Yuexian Zou:
Unsupervised Multi-Target Domain Adaptation for Acoustic Scene Classification. 1159-1163 - Alfredo Esquivel Jaramillo, Jesper Kjær Nielsen, Mads Græsbøll Christensen:
Speech Decomposition Based on a Hybrid Speech Model and Optimal Segmentation. 1164-1168 - Jian Luo, Jianzong Wang, Ning Cheng, Jing Xiao:
Dropout Regularization for Self-Supervised Learning of Transformer Encoder Speech Representation. 1169-1173 - Chiranjeevi Yarra, Prasanta Kumar Ghosh:
Noise Robust Pitch Stylization Using Minimum Mean Absolute Error Criterion. 1174-1178 - Yu-Lin Huang, Bo-Hao Su, Y.-W. Peter Hong, Chi-Chun Lee:
An Attribute-Aligned Strategy for Learning Speech Representation. 1179-1183 - Abdolreza Sabzi Shahrebabaki, Sabato Marco Siniscalchi, Torbjørn Svendsen:
Raw Speech-to-Articulatory Inversion by Temporal Filtering and Decimation. 1184-1188 - Jason Lilley, H. Timothy Bunnell:
Unsupervised Training of a DNN-Based Formant Tracker. 1189-1193 - Shu-Wen Yang, Po-Han Chi, Yung-Sung Chuang, Cheng-I Jeff Lai, Kushal Lakhotia, Yist Y. Lin, Andy T. Liu, Jiatong Shi, Xuankai Chang, Guan-Ting Lin, Tzu-Hsien Huang, Wei-Cheng Tseng, Ko-tik Lee, Da-Rong Liu, Zili Huang, Shuyan Dong, Shang-Wen Li, Shinji Watanabe, Abdelrahman Mohamed, Hung-yi Lee:
SUPERB: Speech Processing Universal PERformance Benchmark. 1194-1198 - Cong Zhang, Jian Zhu:
Synchronising Speech Segments with Musical Beats in Mandarin and English Singing. 1199-1203 - Jacob Peplinski, Joel Shor, Sachin Joglekar, Jake Garrison, Shwetak N. Patel:
FRILL: A Non-Semantic Speech Embedding for Mobile Devices. 1204-1208 - Hiroki Mori:
Pitch Contour Separation from Overlapping Speech. 1209-1213 - Anurag Kumar, Yun Wang, Vamsi Krishna Ithapu, Christian Fuegen:
Do Sound Event Representations Generalize to Other Audio Tasks? A Case Study in Audio Transfer Learning. 1214-1218
Spoken Language Understanding I
- Baolin Peng, Chenguang Zhu, Michael Zeng, Jianfeng Gao:
Data Augmentation for Spoken Language Understanding via Pretrained Language Models. 1219-1223 - Martin Radfar, Athanasios Mouchtaris, Siegfried Kunzmann, Ariya Rastrow:
FANS: Fusing ASR and NLU for On-Device SLU. 1224-1228 - Yiran Cao, Nihal Potdar, Anderson R. Avila:
Sequential End-to-End Intent and Slot Label Classification and Localization. 1229-1233 - Deepak Muralidharan, Joel Ruben Antony Moniz, Weicheng Zhang, Stephen Pulman, Lin Li, Megan Barnes, Jingjing Pan, Jason D. Williams, Alex Acero:
DEXTER: Deep Encoding of External Knowledge for Named Entity Recognition in Virtual Assistants. 1234-1238 - Ting-Wei Wu, Ruolin Su, Biing-Hwang Juang:
A Context-Aware Hierarchical BERT Fusion Network for Multi-Turn Dialog Act Detection. 1239-1243 - Qian Chen, Wen Wang, Qinglin Zhang:
Pre-Training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning. 1244-1248 - Quynh Do, Judith Gaspers, Daniil Sorokin, Patrick Lehnen:
Predicting Temporal Performance Drop of Deployed Production Spoken Language Understanding Models. 1249-1253 - Jatin Ganhotra, Samuel Thomas, Hong-Kwang Jeff Kuo, Sachindra Joshi, George Saon, Zoltán Tüske, Brian Kingsbury:
Integrating Dialog History into End-to-End Spoken Language Understanding Systems. 1254-1258 - Ting Han, Chongxuan Huang, Wei Peng:
Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking. 1259-1263 - Siddhant Arora, Alissa Ostapenko, Vijay Viswanathan, Siddharth Dalmia, Florian Metze, Shinji Watanabe, Alan W. Black:
Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding. 1264-1268
Topics in ASR: Adaptation, Transfer Learning, Children’s Speech, and Low-Resource Settings
- Jianwei Sun, Zhiyuan Tang, Hengxin Yin, Wei Wang, Xi Zhao, Shuaijiang Zhao, Xiaoning Lei, Wei Zou, Xiangang Li:
Semantic Data Augmentation for End-to-End Mandarin Speech Recognition. 1269-1273 - Xun Gong, Yizhou Lu, Zhikai Zhou, Yanmin Qian:
Layer-Wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition. 1274-1278 - Jinhan Wang, Yunzheng Zhu, Ruchao Fan, Wei Chu, Abeer Alwan:
Low Resource German ASR with Untranscribed Data Spoken by Non-Native Children - INTERSPEECH 2021 Shared Task SPAPL System. 1279-1283 - Khe Chai Sim, Angad Chandorkar, Fan Gao, Mason Chua, Tsendsuren Munkhdalai, Françoise Beaufays:
Robust Continuous On-Device Personalization for Automatic Speech Recognition. 1284-1288 - Shashi Kumar, Shakti P. Rath, Abhishek Pandey:
Speaker Normalization Using Joint Variational Autoencoder. 1289-1293 - Gaopeng Xu, Song Yang, Lu Ma, Chengfei Li, Zhongqin Wu:
The TAL System for the INTERSPEECH2021 Shared Task on Automatic Speech Recognition for Non-Native Childrens Speech. 1294-1298 - Tsz Kin Lam, Mayumi Ohta, Shigehiko Schamoni, Stefan Riezler:
On-the-Fly Aligned Data Augmentation for Sequence-to-Sequence ASR. 1299-1303 - Heting Gao, Junrui Ni, Yang Zhang, Kaizhi Qian, Shiyu Chang, Mark Hasegawa-Johnson:
Zero-Shot Cross-Lingual Phonetic Recognition with External Language Embedding. 1304-1308 - Yan Huang, Guoli Ye, Jinyu Li, Yifan Gong:
Rapid Speaker Adaptation for Conformer Transducer: Attention and Bias Are All You Need. 1309-1313 - Nilaksh Das, Sravan Bodapati, Monica Sunkara, Sundararajan Srinivasan, Duen Horng Chau:
Best of Both Worlds: Robust Accented Speech Recognition with Adversarial Transfer Learning. 1314-1318 - Wei Chu, Peng Chang, Jing Xiao:
Extending Pronunciation Dictionary with Automatically Detected Word Mispronunciations to Improve PAII's System for Interspeech 2021 Non-Native Child English Close Track ASR Challenge. 1319-1323
Voice Conversion and Adaptation I
- Tingle Li, Yichen Liu, Chenxu Hu, Hang Zhao:
CVC: Contrastive Learning for Non-Parallel Voice Conversion. 1324-1328 - Wen-Chin Huang, Kazuhiro Kobayashi, Yu-Huai Peng, Ching-Feng Liu, Yu Tsao, Hsin-Min Wang, Tomoki Toda:
A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion. 1329-1333 - Sefik Emre Eskimez, Dimitrios Dimitriadis, Ken'ichi Kumatani, Robert Gmyr:
One-Shot Voice Conversion with Speaker-Agnostic StarGAN. 1334-1338 - Takeshi Koshizuka, Hidefumi Ohmura, Kouichi Katsurada:
Fine-Tuning Pre-Trained Voice Conversion Model for Adding New Target Speakers with Limited Data. 1339-1343 - Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng:
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-Shot Voice Conversion. 1344-1348 - Yinghao Aaron Li, Ali Zare, Nima Mesgarani:
StarGANv2-VC: A Diverse, Unsupervised, Non-Parallel Framework for Natural-Sounding Voice Conversion. 1349-1353 - Neeraj Kumar, Srishti Goel, Ankur Narang, Brejesh Lall:
Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis. 1354-1358 - Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka:
StarGAN-VC+ASR: StarGAN-Based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition. 1359-1363 - Xuexin Xu, Liang Shi, Jinhui Chen, Xunquan Chen, Jie Lian, Pingyuan Lin, Zhihong Zhang, Edwin R. Hancock:
Two-Pathway Style Embedding for Arbitrary Voice Conversion. 1364-1368 - Yufei Liu, Chengzhu Yu, Shuai Wang, Zhenchuan Yang, Yang Chao, Weibin Zhang:
Non-Parallel Any-to-Many Voice Conversion by Replacing Speaker Statistics. 1369-1373 - Yi Zhou, Xiaohai Tian, Zhizheng Wu, Haizhou Li:
Cross-Lingual Voice Conversion with a Cycle Consistency Loss on Linguistic Representation. 1374-1378 - Hongqiang Du, Lei Xie:
Improving Robustness of One-Shot Voice Conversion with Deep Discriminative Speaker Encoder. 1379-1383
Voice Quality Characterization for Clinical Voice Assessment: Voice Production, Acoustics, and Auditory Perception
- Hannah White, Joshua Penney, Andy Gibson, Anita Szakay, Felicity Cox:
Optimizing an Automatic Creaky Voice Detection Method for Australian English Speaking Females. 1384-1388 - Joshua Penney, Andy Gibson, Felicity Cox, Michael I. Proctor, Anita Szakay:
A Comparison of Acoustic Correlates of Voice Quality Across Different Recording Devices: A Cautionary Tale. 1389-1393 - Anna Sfakianaki, George P. Kafentzis:
Investigating Voice Function Characteristics of Greek Speakers with Hearing Loss Using Automatic Glottal Source Feature Extraction. 1394-1398 - Mark A. Huckvale, Catinca Buciuleac:
Automated Detection of Voice Disorder in the Saarbrücken Voice Database: Effects of Pathology Subset and Audio Materials. 1399-1403 - Steven M. Lulich, Rita R. Patel:
Accelerometer-Based Measurements of Voice Quality in Children During Semi-Occluded Vocal Tract Exercise with a Narrow Straw in Air. 1404-1408 - Matthew Perez, Amrit Romana, Angela Roberts, Noelle Carlozzi, Jennifer Ann Miner, Praveen Dayalu, Emily Mower Provost:
Articulatory Coordination for Speech Motor Tracking in Huntington Disease. 1409-1413 - Carlos A. Ferrer, Efren Aragón, María E. Hdez-Díaz, Marc S. De Bodt, Roman Cmejla, Marina Englert, Mara Behlau, Elmar Nöth:
Modeling Dysphonia Severity as a Function of Roughness and Breathiness Ratings in the GRBAS Scale. 1414-1418
Miscellaneous Topics in ASR
- Nikolay Karpov, Alexander Denisenko, Fedor Minkin:
Golos: Russian Dataset for Speech Research. 1419-1423 - Samik Sadhu, Hynek Hermansky:
Radically Old Way of Computing Spectra: Applications in End-to-End ASR. 1424-1428 - Ragheb Al-Ghezi, Yaroslav Getman, Aku Rouhe, Raili Hildén, Mikko Kurimo:
Self-Supervised End-to-End ASR for Low Resource L2 Swedish. 1429-1433 - Patrick K. O'Neill, Vitaly Lavrukhin, Somshubra Majumdar, Vahid Noroozi, Yuekai Zhang, Oleksii Kuchaiev, Jagadeesh Balam, Yuliya Dovzhenko, Keenan Freyberg, Michael D. Shulman, Boris Ginsburg, Shinji Watanabe, Georg Kucsko:
SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition. 1434-1438 - Solène Evain, Ha Nguyen, Hang Le, Marcely Zanon Boito, Salima Mdhaffar, Sina Alisamir, Ziyi Tong, Natalia A. Tomashenko, Marco Dinarelli, Titouan Parcollet, Alexandre Allauzen, Yannick Estève, Benjamin Lecouteux, François Portet, Solange Rossato, Fabien Ringeval, Didier Schwab, Laurent Besacier:
LeBenchmark: A Reproducible Framework for Assessing Self-Supervised Representation Learning from Speech. 1439-1443
Phonetics I
- Pavel Sturm, Radek Skarnitzl, Tomás Nechanský:
Prosodic Accommodation in Face-to-Face and Telephone Dialogues. 1444-1448 - Josiane Riverin-Coutlée, Conceição Cunha, Enkeleida Kapia, Jonathan Harrington:
Dialect Features in Heterogeneous and Homogeneous Gheg Speaking Communities. 1449-1453 - Margaret Zellers, Alena Witzlack-Makarevich, Lilja Saeboe, Saudah Namyalo:
An Exploration of the Acoustic Space of Rhotics and Laterals in Ruruuli. 1454-1458 - Kübra Bodur, Sweeney Branje, Morgane Peirolo, Ingrid Tiscareno, James Sneed German:
Domain-Initial Strengthening in Turkish: Acoustic Cues to Prosodic Hierarchy in Stop Consonants. 1459-1463
Target Speaker Detection, Localization and Separation
- Katerina Zmolíková, Marc Delcroix, Desh Raj, Shinji Watanabe, Jan Cernocký:
Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics. 1464-1468 - Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz:
Universal Speaker Extraction in the Presence and Absence of Target Speakers for Speech of One and Two Talkers. 1469-1473 - Lukás Mateju, Frantisek Kynych, Petr Cerva, Jindrich Zdánský, Jirí Málek:
Using X-Vectors for Speech Activity Detection in Broadcast Streams. 1474-1478 - Daniele Salvati, Carlo Drioli, Gian Luca Foresti:
Time Delay Estimation for Speaker Localization Using CNN-Based Parametrized GCC-PHAT Features. 1479-1483 - Midia Yousefi, John H. L. Hansen:
Real-Time Speaker Counting in a Cocktail Party Scenario Using Attention-Guided Convolutional Neural Network. 1484-1488
Language and Accent Recognition
- Hexin Liu, Leibny Paola García-Perera, Xinyi Zhang, Justin Dauwels, Andy W. H. Khong, Sanjeev Khudanpur, Suzy J. Styles:
End-to-End Language Diarization for Bilingual Code-Switching Speech. 1489-1493 - Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina:
Modeling and Training Strategies for Language Recognition Systems. 1494-1498 - Hui Wang, Lin Liu, Yan Song, Lei Fang, Ian McLoughlin, Li-Rong Dai:
A Weight Moving Average Based Alternate Decoupled Learning Algorithm for Long-Tailed Language Identification. 1499-1503 - Keqi Deng, Songjun Cao, Long Ma:
Improving Accent Identification and Accented Speech Recognition Under a Framework of Self-Supervised Learning. 1504-1508 - Zhiyun Fan, Meng Li, Shiyu Zhou, Bo Xu:
Exploring wav2vec 2.0 on Speaker Verification and Language Identification. 1509-1513 - Gundluru Ramesh, C. Shiva Kumar, K. Sri Rama Murty:
Self-Supervised Phonotactic Representations for Language Identification. 1514-1518 - Jicheng Zhang, Yizhou Peng, Van Tung Pham, Haihua Xu, Hao Huang, Eng Siong Chng:
E2E-Based Multi-Task Learning Approach to Joint Speech and Accent Recognition. 1519-1523 - Moakala Tzudir, Shikha Baghel, Priyankoo Sarmah, S. R. Mahadeva Prasanna:
Excitation Source Feature Based Dialect Identification in Ao - A Low Resource Language. 1524-1528
Low-Resource Speech Recognition
- Shreya Khare, Ashish R. Mittal, Anuj Diwan, Sunita Sarawagi, Preethi Jyothi, Samarth Bharadwaj:
Low Resource ASR: The Surprising Effectiveness of High Resource Transliteration. 1529-1533 - Siyuan Feng, Piotr Zelasko, Laureano Moro-Velázquez, Odette Scharenborg:
Unsupervised Acoustic Unit Discovery by Leveraging a Language-Independent Subword Discriminative Feature Representation. 1534-1538 - Herman Kamper, Benjamin van Niekerk:
Towards Unsupervised Phone and Word Segmentation Using Self-Supervised Vector-Quantized Neural Networks. 1539-1543 - Dongwei Jiang, Wubo Li, Miao Cao, Wei Zou, Xiangang Li:
Speech SimCLR: Combining Contrastive and Reconstruction Objective for Self-Supervised Speech Representation Learning. 1544-1548 - Christiaan Jacobs, Herman Kamper:
Multilingual Transfer of Acoustic Word Embeddings Improves When Training on Languages Related to the Target Zero-Resource Language. 1549-1553 - Benjamin van Niekerk, Leanne Nortje, Matthew Baas, Herman Kamper:
Analyzing Speaker Information in Self-Supervised Models to Improve Zero-Resource Speech Processing. 1554-1558 - Shun Takahashi, Sakriani Sakti, Satoshi Nakamura:
Unsupervised Neural-Based Graph Clustering for Variable-Length Speech Representation Discovery of Zero-Resource Languages. 1559-1563 - Takashi Maekaku, Xuankai Chang, Yuya Fujita, Li-Wei Chen, Shinji Watanabe, Alexander I. Rudnicky:
Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021. 1564-1568 - Xia Cui, Amila Gamage, Terry Hanley, Tingting Mu:
Identifying Indicators of Vulnerability from Short Speech Segments Using Acoustic and Textual Features. 1569-1573 - Ewan Dunbar, Mathieu Bernard, Nicolas Hamilakis, Tu Anh Nguyen, Maureen de Seyssel, Patricia Rozé, Morgane Rivière, Eugene Kharitonov, Emmanuel Dupoux:
The Zero Resource Speech Challenge 2021: Spoken Language Modelling. 1574-1578 - Gautham Krishna Gudur, Satheesh Kumar Perepu:
Zero-Shot Federated Learning with New Classes for Audio Classification. 1579-1583 - Andrew Rouditchenko, Angie W. Boggust, David Harwath, Brian Chen, Dhiraj Joshi, Samuel Thomas, Kartik Audhkhasi, Hilde Kuehne, Rameswar Panda, Rogério Schmidt Feris, Brian Kingsbury, Michael Picheny, Antonio Torralba, James R. Glass:
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. 1584-1588
Speech Synthesis: Singing, Multimodal, Crosslingual Synthesis
- Gyeong-Hoon Lee, Tae-Woo Kim, Hanbin Bae, Min-Ji Lee, Young-Ik Kim, Hoon-Young Cho:
N-Singer: A Non-Autoregressive Korean Singing Voice Synthesis System for Pronunciation Enhancement. 1589-1593 - Georgia Maniati, Nikolaos Ellinas, Konstantinos Markopoulos, Georgios Vamvoukakis, June Sig Sung, Hyoungmin Park, Aimilios Chalamandaris, Pirros Tsiakoulis:
Cross-Lingual Low Resource Speaker Adaptation Using Phonological Features. 1594-1598 - Haoyue Zhan, Haitong Zhang, Wenjie Ou, Yue Lin:
Improve Cross-Lingual Text-To-Speech Synthesis on Monolingual Corpora with Pitch Contour Information. 1599-1603 - Zhenchuan Yang, Weibin Zhang, Yufei Liu, Xiaofen Xing:
Cross-Lingual Voice Conversion with Disentangled Universal Linguistic Representations. 1604-1608 - Zhengchen Liu, Chenfeng Miao, Qingying Zhu, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
EfficientSing: A Chinese Singing Voice Synthesis System Using Duration-Free Acoustic Model and HiFi-GAN Vocoder. 1609-1613 - Detai Xin, Yuki Saito, Shinnosuke Takamichi, Tomoki Koriyama, Hiroshi Saruwatari:
Cross-Lingual Speaker Adaptation Using Domain Adaptation and Speaker Consistency Loss for Text-To-Speech Synthesis. 1614-1618 - Zengqiang Shang, Zhihua Huang, Haozhe Zhang, Pengyuan Zhang, Yonghong Yan:
Incorporating Cross-Speaker Style Transfer for Multi-Language Text-to-Speech. 1619-1623 - Ege Kesim, Engin Erzin:
Investigating Contributions of Speech and Facial Landmarks for Talking Head Generation. 1624-1628 - Shijing Si, Jianzong Wang, Xiaoyang Qu, Ning Cheng, Wenqi Wei, Xinghua Zhu, Jing Xiao:
Speech2Video: Cross-Modal Distillation for Speech to Video Generation. 1629-1633
Speech Coding and Privacy
- Junhyeok Lee, Seungu Han:
NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling. 1634-1638 - Gang-Xuan Lin, Shih-Wei Hu, Yen-Ju Lu, Yu Tsao, Chun-Shien Lu:
QISTA-Net-Audio: Audio Super-Resolution via Non-Convex ℓ_q-Norm Minimization. 1639-1643 - Liang Wen, Lizhong Wang, Xue Wen, Yuxing Zheng, Youngo Park, Kwang Pyo Choi:
X-net: A Joint Scale Down and Scale Up Method for Voice Call. 1644-1648 - Kexun Zhang, Yi Ren, Changliang Xu, Zhou Zhao:
WSRGlow: A Glow-Based Waveform Generative Model for Audio Super-Resolution. 1649-1653 - Jiangyan Yi, Ye Bai, Jianhua Tao, Haoxin Ma, Zhengkun Tian, Chenglong Wang, Tao Wang, Ruibo Fu:
Half-Truth: A Partially Fake Audio Detection Dataset. 1654-1658 - Bhusan Chettri, Rosa González Hautamäki, Md. Sahidullah, Tomi Kinnunen:
Data Quality as Predictor of Voice Anti-Spoofing Generalization. 1659-1663 - Youngju Cheon, Soojoong Hwang, Sangwook Han, Inseon Jang, Jong Won Shin:
Coded Speech Enhancement Using Neural Network-Based Vector-Quantized Residual Features. 1664-1668 - Lukas Drude, Jahn Heymann, Andreas Schwarz, Jean-Marc Valin:
Multi-Channel Opus Compression for Far-Field Automatic Speech Recognition with a Fixed Bitrate Budget. 1669-1673 - Ingo Siegert:
Effects of Prosodic Variations on Accidental Triggers of a Commercial Voice Assistant. 1674-1678 - Adam Gabrys, Yunlong Jiao, Viacheslav Klimkov, Daniel Korzekwa, Roberto Barra-Chicote:
Improving the Expressiveness of Neural Vocoding with Non-Affine Normalizing Flows. 1679-1683 - Gauri P. Prajapati, Dipesh K. Singh, Preet P. Amin, Hemant A. Patil:
Voice Privacy Through x-Vector and CycleGAN-Based Anonymization. 1684-1688 - Ju Lin, Yun Wang, Kaustubh Kalgaonkar, Gil Keren, Didi Zhang, Christian Fuegen:
A Two-Stage Approach to Speech Bandwidth Extension. 1689-1693 - Joon Byun, Seungmin Shin, Youngcheol Park, Jongmo Sung, Seungkwon Beack:
Development of a Psychoacoustic Loss Function for the Deep Neural Network (DNN)-Based Speech Coder. 1694-1698 - Dimitrios Stoidis, Andrea Cavallaro:
Protecting Gender and Identity with Disentangled Speech Representations. 1699-1703
Speech Perception II
- Yahya Aldholmi, Rawan Aldhafyan, Asma Alqahtani:
Perception of Standard Arabic Synthetic Speech Rate. 1704-1707 - Takeshi Kishiyama:
The Influence of Parallel Processing on Illusory Vowels. 1708-1712 - Anupama Chingacham, Vera Demberg, Dietrich Klakow:
Exploring the Potential of Lexical Paraphrases for Mitigating Noise-Induced Comprehension Errors. 1713-1717 - Olympia Simantiraki, Martin Cooke:
SpeechAdjuster: A Tool for Investigating Listener Preferences and Speech Intelligibility. 1718-1722 - Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa:
VocalTurk: Exploring Feasibility of Crowdsourced Speaker Identification. 1723-1727 - Min Xu, Jing Shao, Lan Wang:
Effects of Aging and Age-Related Hearing Loss on Talker Discrimination. 1728-1732 - Yuqing Zhang, Zhu Li, Bin Wu, Yanlu Xie, Binghuai Lin, Jinsong Zhang:
Relationships Between Perceptual Distinctiveness, Articulatory Complexity and Functional Load in Speech Communication. 1733-1737 - Camryn Terblanche, Philip Harrison, Amelia Jane Gully:
Human Spoofing Detection Performance on Degraded Speech. 1738-1742 - Marieke Einfeldt, Rita Sevastjanova, Katharina Zahner-Ritter, Ekaterina Kazak, Bettina Braun:
Reliable Estimates of Interpretable Cue Effects with Active Learning in Psycholinguistic Research. 1743-1747 - Puneet Kumar, Vishesh Kaushik, Balasubramanian Raman:
Towards the Explainability of Multimodal Speech Emotion Recognition. 1748-1752 - Biao Zeng, Rui Wang, Guoxing Yu, Christian Dobel:
Primacy of Mouth over Eyes: Eye Movement Evidence from Audiovisual Mandarin Lexical Tones and Vowels. 1753-1756 - Takanori Ashihara, Takafumi Moriya, Makio Kashino:
Investigating the Impact of Spectral and Temporal Degradation on End-to-End Automatic Speech Recognition Performance. 1757-1761
Streaming for ASR/RNN Transducers
- Thai-Son Nguyen, Sebastian Stüker, Alex Waibel:
Super-Human Performance in Online Low-Latency Recognition of Conversational Speech. 1762-1766 - Vikas Joshi, Amit Das, Eric Sun, Rupesh R. Mehta, Jinyu Li, Yifan Gong:
Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems. 1767-1771 - Duc Le, Mahaveer Jain, Gil Keren, Suyoun Kim, Yangyang Shi, Jay Mahadeokar, Julian Chan, Yuan Shangguan, Christian Fuegen, Ozlem Kalinli, Yatharth Saraf, Michael L. Seltzer:
Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion. 1772-1776 - Tara N. Sainath, Yanzhang He, Arun Narayanan, Rami Botros, Ruoming Pang, David Rybach, Cyril Allauzen, Ehsan Variani, James Qin, Quoc-Nam Le-The, Shuo-Yiin Chang, Bo Li, Anmol Gulati, Jiahui Yu, Chung-Cheng Chiu, Diamantino Caseiro, Wei Li, Qiao Liang, Pat Rondon:
An Efficient Streaming Non-Recurrent On-Device End-to-End Model with Improvements to Rare-Word Modeling. 1777-1781 - Liang Lu, Naoyuki Kanda, Jinyu Li, Yifan Gong:
Streaming Multi-Talker Speech Recognition with Joint Speaker Identification. 1782-1786 - Takafumi Moriya, Tomohiro Tanaka, Takanori Ashihara, Tsubasa Ochiai, Hiroshi Sato, Atsushi Ando, Ryo Masumura, Marc Delcroix, Taichi Asami:
Streaming End-to-End Speech Recognition for Hybrid RNN-T/Attention Architecture. 1787-1791 - Andreas Schwarz, Ilya Sklyar, Simon Wiesler:
Improving RNN-T ASR Accuracy Using Context Audio. 1792-1796 - Lu Huang, Jingyu Sun, Yufeng Tang, Junfeng Hou, Jinkun Chen, Jun Zhang, Zejun Ma:
HMM-Free Encoder Pre-Training for Streaming RNN Transducer. 1797-1801 - Xiaodong Cui, Brian Kingsbury, George Saon, David Haws, Zoltán Tüske:
Reducing Exposure Bias in Training Recurrent Neural Network Transducers. 1802-1806 - Thibault Doutre, Wei Han, Chung-Cheng Chiu, Ruoming Pang, Olivier Siohan, Liangliang Cao:
Bridging the Gap Between Streaming and Non-Streaming ASR Systems by Distilling Ensembles of CTC and RNN-T Models. 1807-1811 - Kartik Audhkhasi, Tongzhou Chen, Bhuvana Ramabhadran, Pedro J. Moreno:
Mixture Model Attention: Flexible Streaming and Non-Streaming Automatic Speech Recognition. 1812-1816 - Hirofumi Inaguma, Tatsuya Kawahara:
StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR. 1817-1821 - Niko Moritz, Takaaki Hori, Jonathan Le Roux:
Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition. 1822-1826 - Kwangyoun Kim, Felix Wu, Prashant Sridhar, Kyu Jeong Han, Shinji Watanabe:
Multi-Mode Transformer Transducer with Stochastic Future Context. 1827-1831
ConferencingSpeech 2021 Challenge: Far-Field Multi-Channel Speech Enhancement for Video Conferencing
- Xinlei Ren, Xu Zhang, Lianwu Chen, Xiguang Zheng, Chen Zhang, Liang Guo, Bing Yu:
A Causal U-Net Based Neural Beamforming Network for Real-Time Multi-Channel Speech Enhancement. 1832-1836 - Rui Zhu, Feiran Yang, Yuepeng Li, Shidong Shang:
A Partitioned-Block Frequency-Domain Adaptive Kalman Filter for Stereophonic Acoustic Echo Cancellation. 1837-1841 - Taihui Wang, Feiran Yang, Rui Zhu, Jun Yang:
Real-Time Independent Vector Analysis Using Semi-Supervised Nonnegative Matrix Factorization as a Source Model. 1842-1846 - Jiangyu Han, Wei Rao, Yannan Wang, Yanhua Long:
Improving Channel Decorrelation for Multi-Channel Target Speech Extraction. 1847-1851 - Jinjiang Liu, Xueliang Zhang:
Inplace Gated Convolutional Recurrent Neural Network for Dual-Channel Speech Enhancement. 1852-1856 - R. G. Prithvi Raj, Rohit Kumar, M. K. Jayesh, Anurenjan Purushothaman, Sriram Ganapathy, M. Ali Basha Shaik:
SRIB-LEAP Submission to Far-Field Multi-Channel Speech Enhancement Challenge for Video Conferencing. 1857-1861 - Cheng Xue, Weilong Huang, Weiguang Chen, Jinwei Feng:
Real-Time Multi-Channel Speech Enhancement Based on Neural Network Masking with Attention Model. 1862-1866
Survey Talk 2: Sriram Ganapathy
- Sriram Ganapathy:
Uncovering the Acoustic Cues of COVID-19 Infection.
Keynote 2: Pascale Fung
- Pascale Fung:
Ethical and Technological Challenges of Conversational AI.
Language Modeling and Text-Based Innovations for ASR
- Dominique Fohr, Irina Illina:
BERT-Based Semantic Model for Rescoring N-Best Speech Recognition List. 1867-1871 - Karel Benes, Lukás Burget:
Text Augmentation for Language Models in High Error Recognition Scenario. 1872-1876 - Yingbo Gao, David Thulke, Alexander Gerstenberger, Khoa Viet Tran, Ralf Schlüter, Hermann Ney:
On Sampling-Based Training Criteria for Neural Language Modeling. 1877-1881 - Janne Pylkkönen, Antti Ukkonen, Juho Kilpikoski, Samu Tamminen, Hannes Heikinheimo:
Fast Text-Only Domain Adaptation of RNN-Transducer Prediction Network. 1882-1886
Speaker, Language, and Privacy
- Christopher Cieri, James Fiumara, Jonathan Wright:
Using Games to Augment Corpora for Language Recognition and Confusability. 1887-1891 - Gianni Fenu, Mirko Marras, Giacomo Medda, Giacomo Meloni:
Fair Voice Biometrics: Impact of Demographic Imbalance on Group Fairness in Speaker Recognition. 1892-1896 - Leying Zhang, Zhengyang Chen, Yanmin Qian:
Knowledge Distillation from Multi-Modality to Single-Modality for Person Verification. 1897-1901 - Paul-Gauthier Noé, Mohammad MohammadAmini, Driss Matrouf, Titouan Parcollet, Andreas Nautsch, Jean-François Bonastre:
Adversarial Disentanglement of Speaker Representation for Attribute-Driven Privacy Preservation. 1902-1906
Assessment of Pathological Speech and Language I
- Amrit Romana, John Bandon, Matthew Perez, Stephanie Gutierrez, Richard Richter, Angela Roberts, Emily Mower Provost:
Automatically Detecting Errors and Disfluencies in Read Speech to Predict Cognitive Impairment in People with Parkinson's Disease. 1907-1911 - Robin Vaysse, Jérôme Farinas, Corine Astésano, Régine André-Obrecht:
Automatic Extraction of Speech Rhythm Descriptors for Speech Intelligibility Assessment in the Context of Head and Neck Cancers. 1912-1916 - Jinzi Qi, Hugo Van hamme:
Speech Disorder Classification Using Extended Factorized Hierarchical Variational Auto-Encoders. 1917-1921 - Vikram C. Mathad, Tristan J. Mahr, Nancy Scherer, Kathy Chapman, Katherine C. Hustad, Julie Liss, Visar Berisha:
The Impact of Forced-Alignment Errors on Automatic Pronunciation Evaluation. 1922-1926 - Esaú Villatoro-Tello, S. Pavankumar Dubagunta, Julian Fritsch, Gabriela Ramírez-de-la-Rosa, Petr Motlícek, Mathew Magimai-Doss:
Late Fusion of the Available Lexicon and Raw Waveform-Based Acoustic Modeling for Depression and Dementia Recognition. 1927-1931 - Amin Honarmandi Shandiz, László Tóth, Gábor Gosztolya, Alexandra Markó, Tamás Gábor Csapó:
Neural Speaker Embeddings for Ultrasound-Based Silent Speech Interfaces. 1932-1936
Communication and Interaction, Multimodality
- Jatin Lamba, Abhishek, Jayaprakash Akula, Rishabh Dabral, Preethi Jyothi, Ganesh Ramakrishnan:
Cross-Modal Learning for Audio-Visual Video Parsing. 1937-1941 - Darren Cook, Miri Zilka, Simon Maskell, Laurence Alison:
A Psychology-Driven Computational Analysis of Political Interviews. 1942-1946 - Jennifer Santoso, Takeshi Yamada, Shoji Makino, Kenkichi Ishizuka, Takekatsu Hiramura:
Speech Emotion Recognition Based on Attention Weight Correction Using Word-Level Confidence Measure. 1947-1951 - Alif Silpachai, Ivana Rehman, Taylor Anne Barriuso, John Levis, Evgeny Chukharev-Hudilainen, Guanlong Zhao, Ricardo Gutierrez-Osuna:
Effects of Voice Type and Task on L2 Learners' Awareness of Pronunciation Errors. 1952-1956 - Alla Menshikova, Daniil Kocharov, Tatiana Kachkovskaia:
Lexical Entrainment and Intra-Speaker Variability in Cooperative Dialogues. 1957-1961 - Shamila Nasreen, Julian Hough, Matthew Purver:
Detecting Alzheimer's Disease Using Interactional and Acoustic Features from Spontaneous Speech. 1962-1966 - Hardik Kothare, Vikram Ramanarayanan, Oliver Roesler, Michael Neumann, Jackson Liscombe, William Burke, Andrew Cornish, Doug Habberstad, Alaa Sakallah, Sara Markuson, Seemran Kansara, Afik Faerman, Yasmine Bensidi-Slimane, Laura Fry, Saige Portera, David Suendermann-Oeft, David Pautler, Carly Demopoulos:
Investigating the Interplay Between Affective, Phonatory and Motoric Subsystems in Autism Spectrum Disorder Using a Multimodal Dialogue Agent. 1967-1971 - Carlos Toshinori Ishi, Taiken Shintani:
Analysis of Eye Gaze Reasons and Gaze Aversions During Three-Party Conversations. 1972-1976
Language and Lexical Modeling for ASR
- Suyoun Kim, Abhinav Arora, Duc Le, Ching-Feng Yeh, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer:
Semantic Distance: A New Metric for ASR Performance Analysis Towards Spoken Language Understanding. 1977-1981 - Xiaoqiang Wang, Yanqing Liu, Sheng Zhao, Jinyu Li:
A Light-Weight Contextual Spelling Correction Model for Customizing Transducer-Based Speech Recognition Systems. 1982-1986 - Ning Shi, Wei Wang, Boxin Wang, Jinfeng Li, Xiangyu Liu, Zhouhan Lin:
Incorporating External POS Tagger for Punctuation Restoration. 1987-1991 - Vasileios Papadourakis, Markus Müller, Jing Liu, Athanasios Mouchtaris, Maurizio Omologo:
Phonetically Induced Subwords for End-to-End Speech Recognition. 1992-1996 - Courtney Mansfield, Sara Ng, Gina-Anne Levow, Richard A. Wright, Mari Ostendorf:
Revisiting Parity of Human vs. Machine Conversational Speech Transcription. 1997-2001 - W. Ronny Huang, Tara N. Sainath, Cal Peyser, Shankar Kumar, David Rybach, Trevor Strohman:
Lookup-Table Recurrent Language Models for Long Tail Speech Recognition. 2002-2006 - Jesús Andrés-Ferrer, Dario Albesano, Puming Zhan, Paul Vozila:
Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems. 2007-2011 - Qiushi Huang, Tom Ko, H. Lilian Tang, Xubo Liu, Bo Wu:
Token-Level Supervised Contrastive Learning for Punctuation Restoration. 2012-2016 - Yun Zhao, Xuerui Yang, Jinchao Wang, Yongyu Gao, Chao Yan, Yuanfu Zhou:
BART Based Semantic Correction for Mandarin Automatic Speech Recognition System. 2017-2021 - Lingfeng Dai, Qi Liu, Kai Yu:
Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR. 2022-2026 - Gakuto Kurata, George Saon, Brian Kingsbury, David Haws, Zoltán Tüske:
Improving Customization of Neural Transducers by Mitigating Acoustic Mismatch of Synthesized Audio. 2027-2031 - Mandana Saebi, Ernest Pusateri, Aaksha Meghawat, Christophe Van Gysel:
A Discriminative Entity-Aware Language Model for Virtual Assistants. 2032-2036 - Mahdi Namazifar, John Malik, Li Erran Li, Gökhan Tür, Dilek Hakkani-Tür:
Correcting Automated and Manual Speech Transcription Errors Using Warped Language Models. 2037-2041
Novel Neural Network Architectures for ASR
- Yangyang Shi, Varun Nagaraja, Chunyang Wu, Jay Mahadeokar, Duc Le, Rohit Prabhavalkar, Alex Xiao, Ching-Feng Yeh, Julian Chan, Christian Fuegen, Ozlem Kalinli, Michael L. Seltzer:
Dynamic Encoder Transducer: A Flexible Solution for Trading Off Accuracy for Latency. 2042-2046 - Shiqi Zhang, Yan Liu, Deyi Xiong, Pei Zhang, Boxing Chen:
Domain-Aware Self-Attention for Multi-Domain Neural Machine Translation. 2047-2051 - Albert Zeyer, André Merboldt, Wilfried Michel, Ralf Schlüter, Hermann Ney:
Librispeech Transducer Model with Internal Language Model Prior Correction. 2052-2056 - Sepand Mavandadi, Tara N. Sainath, Ke Hu, Zelin Wu:
A Deliberation-Based Joint Acoustic and Text Decoder. 2057-2061 - Zoltán Tüske, George Saon, Brian Kingsbury:
On the Limit of English Conversational Speech Recognition. 2062-2066 - Keyu An, Yi Zhang, Zhijian Ou:
Deformable TDNN with Adaptive Receptive Fields for Speech Recognition. 2067-2071 - Zhao You, Shulin Feng, Dan Su, Dong Yu:
SpeechMoE: Scaling to Large Acoustic Models with Dynamic Routing Mixture of Experts. 2077-2081 - Chi-Hang Leong, Yu-Han Huang, Jen-Tzung Chien:
Online Compressive Transformer for End-to-End Speech Recognition. 2082-2086 - Binghuai Lin, Liyuan Wang:
End to End Transformer-Based Contextual Speech Recognition Based on Pointer Network. 2087-2091 - Shigeki Karita, Yotaro Kubo, Michiel Adriaan Unico Bacchiani, Llion Jones:
A Comparative Study on Neural Architectures and Training Methods for Japanese Speech Recognition. 2092-2096 - Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux:
Advanced Long-Context End-to-End Speech Recognition Using Context-Expanded Transformers. 2097-2101 - Md. Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh:
Transformer-Based ASR Incorporating Time-Reduction Layer and Fine-Tuning with Self-Knowledge Distillation. 2102-2106 - Jay Mahadeokar, Yangyang Shi, Yuan Shangguan, Chunyang Wu, Alex Xiao, Hang Su, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer:
Flexi-Transducer: Optimizing Latency, Accuracy and Compute for Multi-Domain On-Device Scenarios. 2107-2111
Speech Localization, Enhancement, and Quality Assessment
- Przemyslaw Falkowski-Gilski:
Difference in Perceived Speech Signal Quality Assessment Among Monolingual and Bilingual Teenage Students. 2112-2116 - Christopher Schymura, Benedikt T. Bönninghoff, Tsubasa Ochiai, Marc Delcroix, Keisuke Kinoshita, Tomohiro Nakatani, Shoko Araki, Dorothea Kolossa:
PILOT: Introducing Transformers for Probabilistic Sound Event Localization. 2117-2121 - Masahito Togami, Robin Scheibler:
Sound Source Localization with Majorization Minimization. 2122-2126 - Gabriel Mittag, Babak Naderi, Assmaa Chehadi, Sebastian Möller:
NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. 2127-2131 - Babak Naderi, Ross Cutler:
Subjective Evaluation of Noise Suppression Algorithms in Crowdsourcing. 2132-2136 - Jianhua Geng, Sifan Wang, Juan Li, Jingwei Li, Xin Lou:
Reliable Intensity Vector Selection for Multi-Source Direction-of-Arrival Estimation Using a Single Acoustic Vector Sensor. 2137-2141 - Meng Yu, Chunlei Zhang, Yong Xu, Shi-Xiong Zhang, Dong Yu:
MetricNet: Towards Improved Modeling For Non-Intrusive Speech Quality Assessment. 2142-2146 - Andrea Toma, Daniele Salvati, Carlo Drioli, Gian Luca Foresti:
CNN-Based Processing of Acoustic and Radio Frequency Signals for Speaker Localization from MAVs. 2147-2151 - Katsutoshi Itoyama, Yoshiya Morimoto, Shungo Masaki, Ryosuke Kojima, Kenji Nishida, Kazuhiro Nakadai:
Assessment of von Mises-Bernoulli Deep Neural Network in Sound Source Localization. 2152-2156 - Rongliang Liu, Nengheng Zheng, Xi Chen:
Feature Fusion by Attention Networks for Robust DOA Estimation. 2157-2161 - Shoufeng Lin, Zhaojie Luo:
Far-Field Speaker Localization and Adaptive GLMB Tracking. 2162-2166 - Vivek Sivaraman Narayanaswamy, Jayaraman J. Thiagarajan, Andreas Spanias:
On the Design of Deep Priors for Unsupervised Audio Restoration. 2167-2171 - Weiguang Chen, Cheng Xue, Xionghu Zhong:
Cramér-Rao Lower Bound for DOA Estimation with an Array of Directional Microphones in Reverberant Environments. 2172-2176
Speech Synthesis: Neural Waveform Generation
- Jaeseong You, Dalhyun Kim, Gyuhyeon Nam, Geumbyeol Hwang, Gyeongsu Chae:
GAN Vocoder: Multi-Resolution Discriminator Is All You Need. 2177-2181 - Jian Cong, Shan Yang, Lei Xie, Dan Su:
Glow-WaveGAN: Learning Speech Representations from GAN-Based Variational Auto-Encoder for High Fidelity Flow-Based Speech Synthesis. 2182-2186 - Reo Yoneyama, Yi-Chiao Wu, Tomoki Toda:
Unified Source-Filter GAN: Unified Source-Filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN. 2187-2191 - Kazuki Mizuta, Tomoki Koriyama, Hiroshi Saruwatari:
Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator. 2192-2196 - Ji-Hoon Kim, Sang-Hoon Lee, Ji-Hyun Lee, Seong-Whan Lee:
Fre-GAN: Adversarial Frequency-Consistent Audio Synthesis. 2197-2201 - Jinhyeok Yang, Jae-Sung Bae, Taejun Bak, Young-Ik Kim, Hoon-Young Cho:
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. 2202-2206 - Won Jang, Dan Lim, Jaesam Yoon, Bongwan Kim, Juntae Kim:
UnivNet: A Neural Vocoder with Multi-Resolution Spectrogram Discriminators for High-Fidelity Waveform Generation. 2207-2211 - Mohammed Salah Al-Radhi, Tamás Gábor Csapó, Csaba Zainkó, Géza Németh:
Continuous Wavelet Vocoder-Based Decomposition of Parametric Speech Waveform Synthesis. 2212-2216 - Patrick Lumban Tobing, Tomoki Toda:
High-Fidelity and Low-Latency Universal Neural Vocoder Based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling. 2217-2221 - Zhengxi Liu, Yanmin Qian:
Basis-MelGAN: Efficient Neural Vocoder Based on Audio Decomposition. 2222-2226 - Min-Jae Hwang, Ryuichi Yamamoto, Eunwoo Song, Jae-Min Kim:
High-Fidelity Parallel WaveGAN with Multi-Band Harmonic-Plus-Noise Model. 2227-2231
Spoken Machine Translation
- Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang:
SpecRec: An Alternative Solution for Improving End-to-End Speech-to-Text Translation via Spectrogram Reconstruction. 2232-2236 - Colin Cherry, Naveen Arivazhagan, Dirk Padfield, Maxim Krikun:
Subtitle Translation as Markup Translation. 2237-2241 - Changhan Wang, Anne Wu, Juan Pino, Alexei Baevski, Michael Auli, Alexis Conneau:
Large-Scale Self- and Semi-Supervised Learning for Speech Translation. 2242-2246 - Changhan Wang, Anne Wu, Jiatao Gu, Juan Pino:
CoVoST 2 and Massively Multilingual Speech Translation. 2247-2251 - Yao-Fei Cheng, Hung-Shin Lee, Hsin-Min Wang:
AlloST: Low-Resource Speech Translation Without Source Transcription. 2252-2256 - Johanes Effendi, Sakriani Sakti, Satoshi Nakamura:
Weakly-Supervised Speech-to-Text Mapping with Visually Connected Non-Parallel Speech-Text Data Using Cyclic Partially-Aligned Transformer. 2257-2261 - Hirotaka Tokuyama, Sakriani Sakti, Katsuhito Sudoh, Satoshi Nakamura:
Transcribing Paralinguistic Acoustic Cues to Target Language Text in Transformer-Based Speech-to-Text Translation. 2262-2266 - Rong Ye, Mingxuan Wang, Lei Li:
End-to-End Speech Translation via Cross-Modal Progressive Training. 2267-2271 - Yuka Ko, Katsuhito Sudoh, Sakriani Sakti, Satoshi Nakamura:
ASR Posterior-Based Loss for Multi-Task End-to-End Speech Translation. 2272-2276 - Alejandro Pérez González de Martos, Javier Iranzo-Sánchez, Adrià Giménez-Pastor, Javier Jorge, Joan Albert Silvestre-Cerdà, Jorge Civera, Albert Sanchís, Alfons Juan:
Towards Simultaneous Machine Interpretation. 2277-2281 - Giuseppe Martucci, Mauro Cettolo, Matteo Negri, Marco Turchi:
Lexical Modeling of ASR Errors for Robust Speech Translation. 2282-2286 - Piyush Vyas, Anastasia Kuznetsova, Donald S. Williamson:
Optimally Encoding Inductive Biases into the Transformer Improves End-to-End Speech Translation. 2287-2291 - Tejaswini Ananthanarayana, Lipisha Chaudhary, Ifeoma Nwogu:
Effects of Feature Scaling and Fusion on Sign Language Translation. 2292-2296
SdSV Challenge 2021: Analysis and Exploration of New Ideas on Short-Duration Speaker Verification
- Alexander Alenin, Anton Okhotnikov, Rostislav Makarov, Nikita Torgashov, Ilya Shigabeev, Konstantin Simonchik:
The ID R&D System Description for Short-Duration Speaker Verification Challenge 2021. 2297-2301 - Jenthe Thienpondt, Brecht Desplanques, Kris Demuynck:
Integrating Frequency Translational Invariance in TDNNs and Frequency Positional Information in 2D ResNets to Enhance Speaker Verification. 2302-2306 - Aleksei Gusev, Alisa Vinogradova, Sergey Novoselov, Sergei Astapov:
SdSVC Challenge 2021: Tips and Tricks to Boost the Short-Duration Speaker Verification System Performance. 2307-2311 - Woo Hyun Kang, Nam Soo Kim:
Team02 Text-Independent Speaker Verification System for SdSV Challenge 2021. 2312-2316 - Xiaoyi Qin, Chao Wang, Yong Ma, Min Liu, Shilei Zhang, Ming Li:
Our Learned Lessons from Cross-Lingual Speaker Verification: The CRMI-DKU System Description for the Short-Duration Speaker Verification Challenge 2021. 2317-2321 - Peng Zhang, Peng Hu, Xueliang Zhang:
Investigation of IMU&Elevoc Submission for the Short-Duration Speaker Verification Challenge 2021. 2322-2326 - Jie Yan, Shengyu Yao, Yiqian Pan, Wei Chen:
The Sogou System for Short-Duration Speaker Verification Challenge 2021. 2327-2331 - Bing Han, Zhengyang Chen, Zhikai Zhou, Yanmin Qian:
The SJTU System for Short-Duration Speaker Verification Challenge 2021. 2332-2336
Show and Tell 2
- Sungjae Cho, Soo-Young Lee:
Multi-Speaker Emotional Text-to-Speech Synthesizer. 2337-2338 - Ales Prazák, Zdenek Loose, Josef V. Psutka, Vlasta Radová, Josef Psutka, Jan Svec:
Live TV Subtitling Through Respeaking. 2339-2340 - Stefan Fragner, Tobias Topar, Maximilian Giller, Lukas Pfeifenberger, Franz Pernkopf:
Autonomous Robot for Measuring Room Impulse Responses. 2341-2342 - Jonas Beskow, Charlie Caper, Johan Ehrenfors, Nils Hagberg, Anne Jansen, Chris Wood:
Expressive Robot Performance Based on Facial Motion Capture. 2343-2344 - Mónica Domínguez, Juan Soler Company, Leo Wanner:
ThemePro 2.0: Showcasing the Role of Thematic Progression in Engaging Human-Computer Interaction. 2345-2346 - Sai Guruju, Jithendra Vepa:
Addressing Compliance in Call Centers with Entity Extraction. 2347-2348 - Krishnachaitanya Gogineni, Tarun Reddy Yadama, Jithendra Vepa:
Audio Segmentation Based Conversational Silence Detection for Contact Center Calls. 2349-2350
Graph and End-to-End Learning for Speaker Recognition
- Desh Raj, Sanjeev Khudanpur:
Reformulating DOVER-Lap Label Mapping as a Graph Partitioning Problem. 2351-2355 - Hemlata Tak, Jee-weon Jung, Jose Patino, Massimiliano Todisco, Nicholas W. D. Evans:
Graph Attention Networks for Anti-Spoofing. 2356-2360 - Victoria Mingote, Antonio Miguel, Alfonso Ortega Giménez, Eduardo Lleida:
Log-Likelihood-Ratio Cost Function as Objective Loss for Speaker Verification Systems. 2361-2365 - Junyi Peng, Xiaoyang Qu, Rongzhi Gu, Jianzong Wang, Jing Xiao, Lukás Burget, Jan Cernocký:
Effective Phase Encoding for End-To-End Speaker Verification. 2366-2370
Spoken Language Processing II
- Ha Nguyen, Yannick Estève, Laurent Besacier:
Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. 2371-2375 - Dominik Machácek, Matús Zilinec, Ondrej Bojar:
Lost in Interpreting: Speech Translation from Source or Interpreter? 2376-2380 - Baptiste Pouthier, Laurent Pilati, Leela K. Gudupudi, Charles Bouveyron, Frédéric Precioso:
Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion. 2381-2385 - Sarenne Wallbridge, Peter Bell, Catherine Lai:
It's Not What You Said, it's How You Said it: Discriminative Perception of Speech as a Multichannel Communication System. 2386-2390
Speech and Audio Analysis
- Thilo Michael, Gabriel Mittag, Andreas Bütow, Sebastian Möller:
Extending the Fullband E-Model Towards Background Noise, Bursty Packet Loss, and Conversational Degradations. 2391-2395 - Christian Bergler, Manuel Schmitt, Andreas K. Maier, Helena Symonds, Paul Spong, Steven R. Ness, George Tzanetakis, Elmar Nöth:
ORCA-SLANG: An Automatic Multi-Stage Semi-Supervised Deep Learning Framework for Large-Scale Killer Whale Call Type Identification. 2396-2400 - Wim Boes, Hugo Van hamme:
Audiovisual Transfer Learning for Audio Tagging and Sound Event Detection. 2401-2405 - Natalia Nessler, Milos Cernak, Paolo Prandoni, Pablo Mainar:
Non-Intrusive Speech Quality Assessment with Transfer Learning and Subject-Specific Scaling. 2406-2410 - Andreea-Maria Oncescu, A. Sophia Koepke, João F. Henriques, Zeynep Akata, Samuel Albanie:
Audio Retrieval with Natural Language Queries. 2411-2415
Cross/Multi-Lingual and Code-Switched ASR
- Manuel Giollo, Deniz Gunceler, Yulan Liu, Daniel Willett:
Bootstrap an End-to-End ASR System by Multilingual Training, Transfer Learning, Text-to-Text Mapping and Synthetic Audio. 2416-2420 - Ngoc-Quan Pham, Tuan-Nam Nguyen, Sebastian Stüker, Alex Waibel:
Efficient Weight Factorization for Multilingual Speech Recognition. 2421-2425 - Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, Michael Auli:
Unsupervised Cross-Lingual Representation Learning for Speech Recognition. 2426-2430 - Tomoaki Hayakawa, Chee Siang Leow, Akio Kobayashi, Takehito Utsuro, Hiromitsu Nishizaki:
Language and Speaker-Independent Feature Transformation for End-to-End Multilingual Speech Recognition. 2431-2435 - Krishna D. N, Pinyi Wang, Bruno Bozza:
Using Large Self-Supervised Models for Low-Resource Speech Recognition. 2436-2440 - Mari Ganesh Kumar, Jom Kuriakose, Anand Thyagachandran, Arun Kumar A, Ashish Seth, Lodagala Durga Prasad, Saish Jaiswal, Anusha Prakash, Hema A. Murthy:
Dual Script E2E Framework for Multilingual and Code-Switching ASR. 2441-2445 - Anuj Diwan, Rakesh Vaideeswaran, Sanket Shah, Ankita Singh, Srinivasa Raghavan K. M., Shreya Khare, Vinit Unni, Saurabh Vyas, Akash Rajpuria, Chiranjeevi Yarra, Ashish R. Mittal, Prasanta Kumar Ghosh, Preethi Jyothi, Kalika Bali, Vivek Seshadri, Sunayana Sitaram, Samarth Bharadwaj, Jai Nanavati, Raoul Nanavati, Karthik Sankaranarayanan:
MUCS 2021: Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2446-2450 - Genta Indra Winata, Guangsen Wang, Caiming Xiong, Steven C. H. Hoi:
Adapt-and-Adjust: Overcoming the Long-Tail Problem of Multilingual Speech Recognition. 2451-2455 - Hardik B. Sailor, Kiran Praveen, Vikas Agrawal, Abhinav Jain, Abhishek Pandey:
SRI-B End-to-End System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2456-2460 - Xinjian Li, Juncheng Li, Florian Metze, Alan W. Black:
Hierarchical Phone Recognition with Compositional Phonetics. 2461-2465 - Shammur Absar Chowdhury, Amir Hussein, Ahmed Abdelali, Ahmed Ali:
Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR. 2466-2470 - Brian Yan, Siddharth Dalmia, David R. Mortensen, Florian Metze, Shinji Watanabe:
Differentiable Allophone Graphs for Language-Universal Speech Recognition. 2471-2475
Health and Affect II
- Vincent P. Martin, Jean-Luc Rouas, Florian Boyer, Pierre Philip:
Automatic Speech Recognition Systems Errors for Objective Sleepiness Detection Through Voice. 2476-2480 - Jon Gillick, Wesley Deng, Kimiko Ryokai, David Bamman:
Robust Laughter Detection in Noisy Environments. 2481-2485 - Mizuki Nagano, Yusuke Ijima, Sadao Hiroya:
Impact of Emotional State on Estimation of Willingness to Buy from Advertising Speech. 2486-2490 - Huda Alsofyani, Alessandro Vinciarelli:
Stacked Recurrent Neural Networks for Speech-Based Inference of Attachment Condition in School Age Children. 2491-2495 - Nujud Aloshban, Anna Esposito, Alessandro Vinciarelli:
Language or Paralanguage, This is the Problem: Comparing Depressed and Non-Depressed Speakers Through the Analysis of Gated Multimodal Units. 2496-2500 - Aniruddha Tammewar, Alessandra Cervone, Giuseppe Riccardi:
Emotion Carrier Recognition from Personal Narratives. 2501-2505 - Scott Condron, Georgia Clarke, Anita Klementiev, Daniela Morse-Kopp, Jack Parry, Dimitri Palaz:
Non-Verbal Vocalisation and Laughter Detection Using Sequence-to-Sequence Models and Multi-Label Training. 2506-2510 - Cong Cai, Mingyue Niu, Bin Liu, Jianhua Tao, Xuefei Liu:
TDCA-Net: Time-Domain Channel Attention Network for Depression Detection. 2511-2515 - Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso:
Visual Speech for Obstructive Sleep Apnea Detection. 2516-2520 - Héctor A. Cordourier Maruri, Sinem Aslan, Georg Stemmer, Nese Alyüz, Lama Nachman:
Analysis of Contextual Voice Changes in Remote Meetings. 2521-2525 - Nadee Seneviratne, Carol Y. Espy-Wilson:
Speech Based Depression Severity Level Classification Using a Multi-Stage Dilated CNN-LSTM Model. 2526-2530
Neural Network Training Methods for ASR
- Ho-Gyeong Kim, Min-Joong Lee, Hoshik Lee, Tae Gyoon Kang, Jihyun Lee, Eunho Yang, Sung Ju Hwang:
Multi-Domain Knowledge Distillation via Uncertainty-Matching for End-to-End ASR Models. 2531-2535 - Jonathan Macoskey, Grant P. Strimel, Ariya Rastrow:
Learning a Neural Diff for Speech Models. 2536-2540 - Shucong Zhang, Erfan Loweimi, Peter Bell, Steve Renals:
Stochastic Attention Head Removal: A Simple and Effective Method for Improving Transformer Based ASR Models. 2541-2545 - Jiabin Xue, Tieran Zheng, Jiqing Han:
Model-Agnostic Fast Adaptive Multi-Objective Balancing Algorithm for Multilingual Automatic Speech Recognition Model Training. 2546-2550 - Heng-Jui Chang, Hung-yi Lee, Lin-Shan Lee:
Towards Lifelong Learning of End-to-End ASR. 2551-2555 - Isabel Leal, Neeraj Gaur, Parisa Haghani, Brian Farris, Pedro J. Moreno, Manasa Prasad, Bhuvana Ramabhadran, Yun Zhu:
Self-Adaptive Distillation for Multilingual Speech Recognition: Leveraging Student Independence. 2556-2560 - Hainan Xu, Kartik Audhkhasi, Yinghui Huang, Jesse Emond, Bhuvana Ramabhadran:
Regularizing Word Segmentation by Creating Misspellings. 2561-2565 - Peidong Wang, Tara N. Sainath, Ron J. Weiss:
Multitask Training with Text Data for End-to-End Speech Recognition. 2566-2570 - Xianzhao Chen, Hao Ni, Yi He, Kang Wang, Zejun Ma, Zongxia Xie:
Emitting Word Timings with HMM-Free End-to-End System in Automatic Speech Recognition. 2571-2575 - Jasha Droppo, Oguz Elibol:
Scaling Laws for Acoustic Models. 2576-2580 - Jayadev Billa:
Leveraging Non-Target Language Resources to Improve ASR Performance in a Target Language. 2581-2585 - Andrea Fasoli, Chia-Yu Chen, Mauricio J. Serrano, Xiao Sun, Naigang Wang, Swagath Venkataramani, George Saon, Xiaodong Cui, Brian Kingsbury, Wei Zhang, Zoltán Tüske, Kailash Gopalakrishnan:
4-Bit Quantization of LSTM-Based Speech Recognition Models. 2586-2590 - Ryo Masumura, Daiki Okamura, Naoki Makishima, Mana Ihori, Akihiko Takashima, Tomohiro Tanaka, Shota Orihashi:
Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation. 2591-2595 - Zhong Meng, Yu Wu, Naoyuki Kanda, Liang Lu, Xie Chen, Guoli Ye, Eric Sun, Jinyu Li, Yifan Gong:
Minimum Word Error Rate Training with Language Model Fusion for End-to-End Speech Recognition. 2596-2600 - Dongcheng Jiang, Chao Zhang, Philip C. Woodland:
Variable Frame Rate Acoustic Models Using Minimum Error Reinforcement Learning. 2601-2605
Prosodic Features and Structure
- Constantijn Kaland, Matthew Gordon:
How f0 and Phrase Position Affect Papuan Malay Word Identification. 2606-2610 - Anna Bothe Jespersen, Pavel Sturm, Mísa Hejná:
On the Feasibility of the Danish Model of Intonational Transcription: Phonetic Evidence from Jutlandic Danish. 2611-2615 - Adrien Méli, Nicolas Ballier, Achille Falaise, Alice Henderson:
An Experiment in Paratone Detection in a Prosodically Annotated EAP Spoken Corpus. 2616-2620 - Branislav Gerazov, Michael Wagner:
ProsoBeast Prosody Annotation Tool. 2621-2625 - Trang Tran, Mari Ostendorf:
Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts. 2626-2630 - Roger Cheng-yen Liu, Feng-fan Hsieh, Yueh-Chin Chang:
Targeted and Targetless Neutral Tones in Taiwanese Southern Min. 2631-2635 - Mária Gósy, Kálmán Abari:
The Interaction of Word Complexity and Word Duration in an Agglutinative Language. 2636-2640 - Ho-hsien Pan, Shao-Ren Lyu:
Taiwan Min Nan (Taiwanese) Checked Tones Sound Change. 2641-2645 - Moritz Jakob, Bettina Braun, Katharina Zahner-Ritter:
In-Group Advantage in the Perception of Emotions: Evidence from Three Varieties of German. 2646-2650 - Christer Gobl:
The LF Model in the Frequency Domain for Glottal Airflow Modelling Without Aliasing Distortion. 2651-2655 - Michael Wagner, Alvaro Iturralde Zurita, Sijia Zhang:
Parsing Speech for Grouping and Prominence, and the Typology of Rhythm. 2656-2660 - Benazir Mumtaz, Massimiliano Canzi, Miriam Butt:
Prosody of Case Markers in Urdu. 2661-2665 - Brynhildur Stefansdottir, Francesco Burroni, Sam Tilsen:
Articulatory Characteristics of Icelandic Voiced Fricative Lenition: Gradience, Categoricity, and Speaker/Gesture-Specific Effects. 2666-2670 - Khia A. Johnson:
Leveraging the Uniformity Framework to Examine Crosslinguistic Similarity for Long-Lag Stops in Spontaneous Cantonese-English Bilingual Speech. 2671-2675
Single-Channel Speech Enhancement
- Aswin Sivaraman, Sunwoo Kim, Minje Kim:
Personalized Speech Enhancement Through Self-Supervised Data Augmentation and Purification. 2676-2680 - Mark R. Saddler, Andrew Francl, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott:
Speech Denoising with Auditory Models. 2681-2685 - Sefik Emre Eskimez, Xiaofei Wang, Min Tang, Hemin Yang, Zirun Zhu, Zhuo Chen, Huaming Wang, Takuya Yoshioka:
Human Listening and Live Captioning: Multi-Task Training for Speech Enhancement. 2686-2690 - Xinmeng Xu, Yang Wang, Dongxiang Xu, Yiyuan Peng, Cong Zhang, Jie Jia, Binbin Chen:
Multi-Stage Progressive Speech Enhancement Network. 2691-2695 - Oscar Chang, Dung N. Tran, Kazuhito Koishida:
Single-Channel Speech Enhancement Using Learnable Loss Mixup. 2696-2700 - Xiaoqi Zhang, Jun Du, Li Chai, Chin-Hui Lee:
A Maximum Likelihood Approach to SNR-Progressive Learning Using Generalized Gaussian Distribution for LSTM-Based Speech Enhancement. 2701-2705 - Vikas Agrawal, Shashi Kumar, Shakti P. Rath:
Whisper Speech Enhancement Using Joint Variational Autoencoder for Improved Speech Recognition. 2706-2710 - Lukas Lee, Youna Ji, Minjae Lee, Min-Seok Choi:
DEMUCS-Mobile: On-Device Lightweight Speech Enhancement. 2711-2715 - Madhav Mahesh Kashyap, Anuj Tambwekar, Krishnamoorthy Manohara, S. Natarajan:
Speech Denoising Without Clean Training Data: A Noise2Noise Approach. 2716-2720 - Feng Dang, Pengyuan Zhang, Hangting Chen:
Improved Speech Enhancement Using a Complex-Domain GAN with Fused Time-Domain and Time-Frequency Domain Constraints. 2721-2725 - Xudong Zhang, Liang Zhao, Feng Gu:
Speech Enhancement with Topology-Enhanced Generative Adversarial Networks (GANs). 2726-2730 - Suliang Bu, Yunxin Zhao, Shaojun Wang, Mei Han:
Learning Speech Structure to Improve Time-Frequency Masks. 2731-2735 - Eesung Kim, Hyeji Seo:
SE-Conformer: Time-Domain Speech Enhancement Using Conformer. 2736-2740
Speech Synthesis: Tools, Data, Evaluation
- Thananchai Kongthaworn, Burin Naowarat, Ekapol Chuangsuwanich:
Spectral and Latent Speech Representation Distortion for TTS Evaluation. 2741-2745 - Cassia Valentini-Botinhao, Simon King:
Detection and Analysis of Attention Errors in Sequence-to-Sequence Text-to-Speech. 2746-2750 - Rohola Zandie, Mohammad H. Mahoor, Julia Madsen, Eshrat S. Emamian:
RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis. 2751-2755 - Yao Shi, Hui Bu, Xin Xu, Shaoji Zhang, Ming Li:
AISHELL-3: A Multi-Speaker Mandarin TTS Corpus. 2756-2760 - Nicholas Eng, C. T. Justine Hui, Yusuke Hioka, Catherine I. Watson:
Comparing Speech Enhancement Techniques for Voice Adaptation-Based Speech Synthesis. 2761-2765 - Chenye Cui, Yi Ren, Jinglin Liu, Feiyang Chen, Rongjie Huang, Ming Lei, Zhou Zhao:
EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model. 2766-2770 - Sai Sirisha Rallabandi, Abhinav Bharadwaj, Babak Naderi, Sebastian Möller:
Perception of Social Speaker Characteristics in Synthetic Speech. 2771-2775 - Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg, Yang Zhang:
Hi-Fi Multi-Speaker English TTS Dataset. 2776-2780 - Wei-Cheng Tseng, Chien-yu Huang, Wei-Tsung Kao, Yist Y. Lin, Hung-yi Lee:
Utilizing Self-Supervised Representations for MOS Prediction. 2781-2785 - Saida Mussakhojayeva, Aigerim Janaliyeva, Almas Mirzakhmetov, Yerbolat Khassanov, Huseyin Atakan Varol:
KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. 2786-2790 - Jason Taylor, Korin Richmond:
Confidence Intervals for ASR-Based TTS Evaluation. 2791-2795
INTERSPEECH 2021 Deep Noise Suppression Challenge
- Chandan K. A. Reddy, Harishchandra Dubey, Kazuhito Koishida, Arun Asokan Nair, Vishak Gopal, Ross Cutler, Sebastian Braun, Hannes Gamper, Robert Aichner, Sriram Srinivasan:
INTERSPEECH 2021 Deep Noise Suppression Challenge. 2796-2800 - Andong Li, Wenzhe Liu, Xiaoxue Luo, Guochen Yu, Chengshi Zheng, Xiaodong Li:
A Simultaneous Denoising and Dereverberation Framework with Target Decoupling. 2801-2805 - Ziyi Xu, Maximilian Strake, Tim Fingscheidt:
Deep Noise Suppression with Non-Intrusive PESQNet Supervision Enabling the Use of Real Training Data. 2806-2810 - Xiaohuai Le, Hongsheng Chen, Kai Chen, Jing Lu:
DPCRN: Dual-Path Convolution Recurrent Network for Single Channel Speech Enhancement. 2811-2815 - Shubo Lv, Yanxin Hu, Shimin Zhang, Lei Xie:
DCCRN+: Channel-Wise Subband DCCRN with SNR Estimation for Speech Enhancement. 2816-2820 - Kanghao Zhang, Shulin He, Hao Li, Xueliang Zhang:
DBNet: A Dual-Branch Network Architecture Processing on Spectrum and Waveform for Single-Channel Speech Enhancement. 2821-2825 - Xu Zhang, Xinlei Ren, Xiguang Zheng, Lianwu Chen, Chen Zhang, Liang Guo, Bing Yu:
Low-Delay Speech Enhancement Using Perceptually Motivated Target and Loss. 2826-2830 - Koen Oostermeijer, Qing Wang, Jun Du:
Lightweight Causal Transformer with Local Self-Attention for Real-Time Speech Enhancement. 2831-2835
Neural Network Training Methods and Architectures for ASR
- Nicolae-Catalin Ristea, Radu Tudor Ionescu:
Self-Paced Ensemble Learning for Speech and Audio Classification. 2836-2840 - Atsushi Kojima:
Knowledge Distillation for Streaming Transformer-Transducer. 2841-2845 - Timo Lohrenz, Zhengyang Li, Tim Fingscheidt:
Multi-Encoder Learning and Stream Fusion for Transformer-Based End-to-End Automatic Speech Recognition. 2846-2850 - Salah Zaiem, Titouan Parcollet, Slim Essid:
Conditional Independence for Pretext Task Selection in Self-Supervised Speech Representation Learning. 2851-2855 - Mohammad Zeineldeen, Aleksandr Glushko, Wilfried Michel, Albert Zeyer, Ralf Schlüter, Hermann Ney:
Investigating Methods to Improve Language Model Integration for Attention-Based Encoder-Decoder ASR Models. 2856-2860 - Apoorv Vyas, Srikanth R. Madikeri, Hervé Bourlard:
Comparing CTC and LFMMI for Out-of-Domain Adaptation of wav2vec 2.0 Acoustic Model. 2861-2865
Emotion and Sentiment Analysis I
- Clément Le Moine, Nicolas Obin, Axel Roebel:
Speaker Attentive Speech Emotion Recognition. 2866-2870 - Seong-Gyun Leem, Daniel Fulford, Jukka-Pekka Onnela, David Gard, Carlos Busso:
Separation of Emotional and Reconstruction Embeddings on Ladder Network to Improve Speech Emotion Recognition Robustness in Noisy Conditions. 2871-2875 - Efthymios Georgiou, Georgios Paraskevopoulos, Alexandros Potamianos:
M3: MultiModal Masking Applied to Sentiment Analysis. 2876-2880
Linguistic Components in End-to-End ASR
- Ondrej Klejch, Electra Wallington, Peter Bell:
The CSTR System for Multilingual and Code-Switching ASR Challenges for Low Resource Indian Languages. 2881-2885 - Wei Zhou, Mohammad Zeineldeen, Zuoyun Zheng, Ralf Schlüter, Hermann Ney:
Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition. 2886-2890 - Wei Zhou, Albert Zeyer, André Merboldt, Ralf Schlüter, Hermann Ney:
Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept. 2891-2895 - Abbas Khosravani, Philip N. Garner, Alexandros Lazaridis:
Modeling Dialectal Variation for Swiss German Automatic Speech Recognition. 2896-2900 - Ekaterina Egorova, Hari Krishna Vydana, Lukás Burget, Jan Cernocký:
Out-of-Vocabulary Words Detection with Attention and CTC Alignments in an End-to-End ASR System. 2901-2905 - Matthew Wiesner, Mousmita Sarma, Ashish Arora, Desh Raj, Dongji Gao, Ruizhe Huang, Supreet Preet, Moris Johnson, Zikra Iqbal, Nagendra Goel, Jan Trmal, Leibny Paola García-Perera, Sanjeev Khudanpur:
Training Hybrid Models on Noisy Transliterated Transcripts for Code-Switched Speech Recognition. 2906-2910
Assessment of Pathological Speech and Language II
- Wei Xue, Roeland van Hout, Fleur Boogmans, Mario Ganzeboom, Catia Cucchiarini, Helmer Strik:
Speech Intelligibility of Dysarthric Speech: Human Scores and Acoustic-Phonetic Features. 2911-2915 - Young-Kyung Kim, Rimita Lahiri, Md. Nasir, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth S. Narayanan:
Analyzing Short Term Dynamic Speech Features for Understanding Behavioral Traits of Children with Autism Spectrum Disorder. 2916-2920 - Waldemar Jesko:
Vocalization Recognition of People with Profound Intellectual and Multiple Disabilities (PIMD) Using Machine Learning Algorithms. 2921-2925 - Barbara Gili Fivela, Vincenzo Sallustio, Silvia Pede, Danilo Patrocinio:
Phonetic Complexity, Speech Accuracy and Intelligibility Assessment of Italian Dysarthric Speech. 2926-2930 - Si Ioi Ng, Cymie Wing-Yee Ng, Jingyu Li, Tan Lee:
Detection of Consonant Errors in Disordered Speech Based on Consonant-Vowel Segment Embedding. 2931-2935 - Adam Hair, Guanlong Zhao, Beena Ahmed, Kirrie J. Ballard, Ricardo Gutierrez-Osuna:
Assessing Posterior-Based Mispronunciation Detection on Field-Collected Recordings from Child Speech Therapy Sessions. 2936-2940 - Bahman Mirheidari, Yilin Pan, Daniel Blackburn, Ronan O'Malley, Heidi Christensen:
Identifying Cognitive Impairment Using Sentence Representation Vectors. 2941-2945 - Zhengjun Yue, Jon Barker, Heidi Christensen, Cristina McKean, Elaine Ashton, Yvonne Wren, Swapnil Gadgil, Rebecca Bright:
Parental Spoken Scaffolding and Narrative Skills in Crowd-Sourced Storytelling Samples of Young Children. 2946-2950 - Tong Xia, Jing Han, Lorena Qendro, Ting Dang, Cecilia Mascolo:
Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data. 2951-2955 - Disong Wang, Liqun Deng, Yu Ting Yeung, Xiao Chen, Xunying Liu, Helen Meng:
Unsupervised Domain Adaptation for Dysarthric Speech Detection via Domain Adversarial Training and Mutual Information Minimization. 2956-2960 - Tanuka Bhattacharjee, Jhansi Mallela, Yamini Belur, Atchayaram Nalini, Ravi Yadav, Pradeep Reddy, Dipanjan Gope, Prasanta Kumar Ghosh:
Source and Vocal Tract Cues for Speech-Based Classification of Patients with Parkinson's Disease and Healthy Subjects. 2961-2965 - R'mani Haulcy, James R. Glass:
CLAC: A Speech Corpus of Healthy English Speakers. 2966-2970
Multimodal Systems
- Leanne Nortje, Herman Kamper:
Direct Multimodal Few-Shot Learning of Speech and Images. 2971-2975 - Ramon Sanabria, Austin Waters, Jason Baldridge:
Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. 2976-2980 - Huan Zhao, Kaili Ma:
A Fast Discrete Two-Step Learning Hashing for Scalable Cross-Modal Retrieval. 2981-2985 - Jianrong Wang, Ziyue Tang, Xuewei Li, Mei Yu, Qiang Fang, Li Liu:
Cross-Modal Knowledge Distillation Method for Automatic Cued Speech Recognition. 2986-2990 - Kayode Olaleye, Herman Kamper:
Attention-Based Keyword Localisation in Speech Using Visual Grounding. 2991-2995 - Khazar Khorrami, Okko Räsänen:
Evaluation of Audio-Visual Alignments in Visually Grounded Speech Models. 2996-3000 - Hang Chen, Jun Du, Yu Hu, Li-Rong Dai, Bao-Cai Yin, Chin-Hui Lee:
Automatic Lip-Reading with Hierarchical Pyramidal Convolution and Self-Attention for Image Sequences with No Word Boundaries. 3001-3005 - Andrew Rouditchenko, Angie W. Boggust, David Harwath, Samuel Thomas, Hilde Kuehne, Brian Chen, Rameswar Panda, Rogério Feris, Brian Kingsbury, Michael Picheny, James R. Glass:
Cascaded Multilingual Audio-Visual Learning from Videos. 3006-3010 - Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Björn W. Schuller, Maja Pantic:
LiRA: Learning Visual Speech Representations from Audio Through Self-Supervision. 3011-3015 - Richard Rose, Olivier Siohan, Anshuman Tripathi, Otavio Braga:
End-to-End Audio-Visual Speech Recognition for Overlapping Speech. 3016-3020 - Yifei Wu, Chenda Li, Song Yang, Zhongqin Wu, Yanmin Qian:
Audio-Visual Multi-Talker Speech Recognition in a Cocktail Party. 3021-3025
Source Separation I
- Sanyuan Chen, Yu Wu, Zhuo Chen, Jian Wu, Takuya Yoshioka, Shujie Liu, Jinyu Li, Xiangzhan Yu:
Ultra Fast Speech Separation Model with Teacher Student Learning. 3026-3030 - Murtiza Ali, Ashwani Koul, Karan Nathwani:
Group Delay Based Re-Weighted Sparse Recovery Algorithms for Robust and High-Resolution Source Separation in DOA Framework. 3031-3035 - Cong Han, Yi Luo, Chenda Li, Tianyan Zhou, Keisuke Kinoshita, Shinji Watanabe, Marc Delcroix, Hakan Erdogan, John R. Hershey, Nima Mesgarani, Zhuo Chen:
Continuous Speech Separation Using Speaker Inventory for Long Recording. 3036-3040 - Weitao Yuan, Shengbei Wang, Xiangrui Li, Masashi Unoki, Wenwu Wang:
Crossfire Conditional Generative Adversarial Networks for Singing Voice Extraction. 3041-3045 - Kai Wang, Hao Huang, Ying Hu, Zhihua Huang, Sheng Li:
End-to-End Speech Separation Using Orthogonal Representation in Complex and Real Time-Frequency Domain. 3046-3050 - Yu Nakagome, Masahito Togami, Tetsuji Ogawa, Tetsunori Kobayashi:
Efficient and Stable Adversarial Learning Using Unpaired Data for Unsupervised Multichannel Speech Separation. 3051-3055 - Sung-Feng Huang, Shun-Po Chuang, Da-Rong Liu, Yi-Chen Chen, Gene-Ping Yang, Hung-yi Lee:
Stabilizing Label Assignment for Speech Separation by Self-Supervised Pre-Training. 3056-3060 - Fan-Lin Wang, Yu-Huai Peng, Hung-Shin Lee, Hsin-Min Wang:
Dual-Path Filter Network: Speaker-Aware Modeling for Speech Separation. 3061-3065 - Jian Wu, Zhuo Chen, Sanyuan Chen, Yu Wu, Takuya Yoshioka, Naoyuki Kanda, Shujie Liu, Jinyu Li:
Investigation of Practical Aspects of Single Channel Speech Separation for ASR. 3066-3070 - Yi Luo, Nima Mesgarani:
Implicit Filter-and-Sum Network for End-to-End Multi-Channel Speech Separation. 3071-3075 - Yong Xu, Zhuohuang Zhang, Meng Yu, Shi-Xiong Zhang, Dong Yu:
Generalized Spatio-Temporal RNN Beamformer for Target Speech Separation. 3076-3080
Speaker Diarization I
- Yi-Chieh Liu, Eunjung Han, Chul Lee, Andreas Stolcke:
End-to-End Neural Diarization: From Transformer to Conformer. 3081-3085 - Jee-weon Jung, Hee-Soo Heo, Youngki Kwon, Joon Son Chung, Bong-Jin Lee:
Three-Class Overlapped Speech Detection Using a Convolutional Recurrent Neural Network. 3086-3090 - Xucheng Wan, Kai Liu, Huan Zhou:
Online Speaker Diarization Equipped with Discriminative Modeling and Guided Inference. 3091-3095 - Yuki Takashima, Yusuke Fujita, Shota Horiguchi, Shinji Watanabe, Leibny Paola García-Perera, Kenji Nagamatsu:
Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization. 3096-3100 - Youngki Kwon, Jee-weon Jung, Hee-Soo Heo, You Jin Kim, Bong-Jin Lee, Joon Son Chung:
Adapting Speaker Embeddings for Speaker Diarisation. 3101-3105 - Yu-Xuan Wang, Jun Du, Maokui He, Shutong Niu, Lei Sun, Chin-Hui Lee:
Scenario-Dependent Speaker Diarization for DIHARD-III Challenge. 3106-3110 - Hervé Bredin, Antoine Laurent:
End-To-End Speaker Segmentation for Overlap-Aware Resegmentation. 3111-3115 - Yawen Xue, Shota Horiguchi, Yusuke Fujita, Yuki Takashima, Shinji Watanabe, Leibny Paola García-Perera, Kenji Nagamatsu:
Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers. 3116-3120 - Or Haim Anidjar, Itshak Lapidot, Chen Hajaj, Amit Dvir:
A Thousand Words are Worth More Than One Recording: Word-Embedding Based Speaker Change Detection. 3121-3125
Speech Synthesis: Prosody Modeling I
- Kosuke Futamata, Byeongseon Park, Ryuichi Yamamoto, Kentaro Tachibana:
Phrase Break Prediction with Bidirectional Encoder Representations in Japanese Text-to-Speech Synthesis. 3126-3130 - Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo:
Improving Multi-Speaker TTS Prosody Variance with a Residual Encoder and Normalizing Flows. 3131-3135 - Chenpeng Du, Kai Yu:
Rich Prosody Diversity Modelling with Phone-Level Mixture Density Network. 3136-3140 - Kenichi Fujita, Atsushi Ando, Yusuke Ijima:
Phoneme Duration Modeling Using Speech Rhythm-Based Speaker Embeddings for Multi-Speaker Speech Synthesis. 3141-3145 - Yuxiang Zou, Shichao Liu, Xiang Yin, Haopeng Lin, Chunfeng Wang, Haoyu Zhang, Zejun Ma:
Fine-Grained Prosody Modeling in Neural Speech Synthesis Using ToBI Representation. 3146-3150 - Mayank Sharma, Yogesh Virkar, Marcello Federico, Roberto Barra-Chicote, Robert Enyedi:
Intra-Sentential Speaking Rate Control in Neural Text-To-Speech for Automatic Dubbing. 3151-3155 - Guangyan Zhang, Ying Qin, Daxin Tan, Tan Lee:
Applying the Information Bottleneck Principle to Prosodic Representation Learning. 3156-3160 - Alice Baird, Silvan Mertes, Manuel Milling, Lukas Stappen, Thomas Wiest, Elisabeth André, Björn W. Schuller:
A Prototypical Network Approach for Evaluating Generated Emotional Speech. 3161-3165
Speech Production II
- Tsukasa Yoshinaga, Kohei Tada, Kazunori Nozaki, Akiyoshi Iida:
A Simplified Model for the Vocal Tract of [s] with Inclined Incisors. 3166-3170 - Takayuki Arai:
Vocal-Tract Models to Visualize the Airstream of Human Breath and Droplets While Producing Speech. 3171-3175 - Ryo Tanji, Hidefumi Ohmura, Kouichi Katsurada:
Using Transposed Convolution for Articulatory-to-Acoustic Conversion from Real-Time MRI Data. 3176-3180 - Rafia Inaam, Tsukasa Yoshinaga, Takayuki Arai, Hiroshi Yokoyama, Akiyoshi Iida:
Comparison Between Lumped-Mass Modeling and Flow Simulation of the Reed-Type Artificial Vocal Fold. 3181-3185 - Raphael Werner, Susanne Fuchs, Jürgen Trouvain, Bernd Möbius:
Inhalations in Speech: Acoustic and Physiological Characteristics. 3186-3190 - Anqi Xu, Daniel R. van Niekerk, Branislav Gerazov, Paul Konstantin Krug, Santitham Prom-on, Peter Birkholz, Yi Xu:
Model-Based Exploration of Linking Between Vowel Articulatory Space and Acoustic Space. 3191-3195 - Mikey Elmers, Raphael Werner, Beeke Muhlack, Bernd Möbius, Jürgen Trouvain:
Take a Breath: Respiratory Sounds Improve Recollection in Synthetic Speech. 3196-3200 - Taijing Chen, Adam C. Lammert, Benjamin Parrell:
Modeling Sensorimotor Adaptation in Speech Through Alterations to Forward and Inverse Models. 3201-3205 - Hideki Kawahara, Toshie Matsui, Kohei Yatabe, Ken-Ichi Sakakibara, Minoru Tsuzaki, Masanori Morise, Toshio Irino:
Mixture of Orthogonal Sequences Made from Extended Time-Stretched Pulses Enables Measurement of Involuntary Voice Fundamental Frequency Response to Pitch Perturbation. 3206-3210
Spoken Dialogue Systems II
- Chenyu You, Nuo Chen, Yuexian Zou:
Contextualized Attention-Based Knowledge Transfer for Spoken Conversational Question Answering. 3211-3215 - Wenying Duan, Xiaoxi He, Zimu Zhou, Hong Rao, Lothar Thiele:
Injecting Descriptive Meta-Information into Pre-Trained Language Models with Hypernetworks. 3216-3220 - Mahdin Rohmatillah, Jen-Tzung Chien:
Causal Confusion Reduction for Robust Multi-Domain Dialogue Policy. 3221-3225 - Shinya Fujie, Hayato Katayama, Jin Sakuma, Tetsunori Kobayashi:
Timing Generating Networks: Neural Network Based Precise Turn-Taking Timing Prediction in Multiparty Conversation. 3226-3230 - Kehan Chen, Zezhong Li, Suyang Dai, Wei Zhou, Haiqing Chen:
Human-to-Human Conversation Dataset for Learning Fine-Grained Turn-Taking Action. 3231-3235 - Mukuntha Narayanan Sundararaman, Ayush Kumar, Jithendra Vepa:
PhonemeBERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript. 3236-3240 - Hongyin Luo, James R. Glass, Garima Lalwani, Yi Zhang, Shang-Wen Li:
Joint Retrieval-Extraction Training for Evidence-Aware Dialog Response Selection. 3241-3245 - Ashish Shenoy, Sravan Bodapati, Monica Sunkara, Srikanth Ronanki, Katrin Kirchhoff:
Adapting Long Context NLM for ASR Rescoring in Conversational Agents. 3246-3250
Oriental Language Recognition
- Jing Li, Binling Wang, Yiming Zhi, Zheng Li, Lin Li, Qingyang Hong, Dong Wang:
Oriental Language Recognition (OLR) 2020: Summary and Analysis. 3251-3255 - Raphaël Duroselle, Md. Sahidullah, Denis Jouvet, Irina Illina:
Language Recognition on Unknown Conditions: The LORIA-Inria-MULTISPEECH System for AP20-OLR Challenge. 3256-3260 - Tianlong Kong, Shouyi Yin, Dawei Zhang, Wang Geng, Xin Wang, Dandan Song, Jinwen Huang, Huiyu Shi, Xiaorui Wang:
Dynamic Multi-Scale Convolution for Dialect Identification. 3261-3265 - Ding Wang, Shuaishuai Ye, Xinhui Hu, Sheng Li, Xinkang Xu:
An End-to-End Dialect Identification System with Transfer Learning from a Multilingual Automatic Speech Recognition Model. 3266-3270 - Haibin Yu, Jing Zhao, Song Yang, Zhongqin Wu, Yuting Nie, Wei-Qiang Zhang:
Language Recognition Based on Unsupervised Pretrained Models. 3271-3275 - Zheng Li, Yan Liu, Lin Li, Qingyang Hong:
Additive Phoneme-Aware Margin Softmax Loss for Language Recognition. 3276-3280
Automatic Speech Recognition in Air Traffic Management
- Nataly Jahchan, Florentin Barbier, Ariyanidevi Dharma Gita, Khaled Khelif, Estelle Delpech:
Towards an Accent-Robust Approach for ATC Communications Transcription. 3281-3285 - Igor Szöke, Santosh Kesiraju, Ondrej Novotný, Martin Kocour, Karel Veselý, Jan Cernocký:
Detecting English Speech in the Air Traffic Control Voice Communication. 3286-3290 - Oliver Ohneiser, Seyyed Saeed Sarfjoo, Hartmut Helmke, Shruthi Shetty, Petr Motlícek, Matthias Kleinert, Heiko Ehr, Sarunas Murauskas:
Robust Command Recognition for Lithuanian Air Traffic Control Tower Utterances. 3291-3295 - Juan Zuluaga-Gomez, Iuliia Nigmatulina, Amrutha Prasad, Petr Motlícek, Karel Veselý, Martin Kocour, Igor Szöke:
Contextual Semi-Supervised Learning: An Approach to Leverage Air-Surveillance and Untranscribed ATC Data in ASR Systems. 3296-3300 - Martin Kocour, Karel Veselý, Alexander Blatt, Juan Zuluaga-Gomez, Igor Szöke, Jan Cernocký, Dietrich Klakow, Petr Motlícek:
Boosting of Contextual Information in ASR for Air-Traffic Call-Sign Recognition. 3301-3305 - Benjamin Elie, Jodie Gauvain, Jean-Luc Gauvain, Lori Lamel:
Modeling the Effect of Military Oxygen Masks on Speech Characteristics. 3306-3310
Show and Tell 3
- Benjamin Milde, Tim Fischer, Steffen Remus, Chris Biemann:
MoM: Minutes of Meeting Bot. 3311-3312 - Alexander Wilbrandt, Simon Stone, Peter Birkholz:
Articulatory Data Recorder: A Framework for Real-Time Articulatory Data Recording. 3313-3314 - Joan Codina-Filbà, Guillermo Cámbara, Alex Peiró Lilja, Jens Grivolla, Roberto Carlini, Mireia Farrús:
The INGENIOUS Multilingual Operations App. 3315-3316 - Joanna Rownicka, Kilian Sprenkamp, Antonio Tripiana, Volodymyr Gromoglasov, Timo P. Kunz:
Digital Einstein Experience: Fast Text-to-Speech for Conversational AI. 3317-3318 - Robert Geislinger, Benjamin Milde, Timo Baumann, Chris Biemann:
Live Subtitling for BigBlueButton with Open-Source Software. 3319-3320 - Davis Nicmanis, Askars Salimbajevs:
Expressive Latvian Speech Synthesis for Dialog Systems. 3321-3322 - Pramod H. Kachare, Prem C. Pandey, Vishal Mane, Hirak Dasgupta, K. S. Nataraj, Akshada Rathod, Sheetal K. Pathak:
ViSTAFAE: A Visual Speech-Training Aid with Feedback of Articulatory Efforts. 3323-3324
Survey Talk 3: Karen Livescu
- Karen Livescu:
Learning Speech Models from Multi-Modal Data.
Keynote 3: Mounya Elhilali
- Mounya Elhilali:
Adaptive Listening to Everyday Soundscapes.
Speech Production I
- Vinicius Ribeiro, Karyna Isaieva, Justine Leclere, Pierre-André Vuissoz, Yves Laprie:
Towards the Prediction of the Vocal Tract Shape from the Sequence of Phonemes to be Articulated. 3325-3329 - Rémi Blandin, Marc Arnela, Simon Félix, Jean-Baptiste Doc, Peter Birkholz:
Comparison of the Finite Element Method, the Multimodal Method and the Transmission-Line Model for the Computation of Vocal Tract Transfer Functions. 3330-3334 - Petra Wagner, Sina Zarrieß, Joana Cholin:
Effects of Time Pressure and Spontaneity on Phonotactic Innovations in German Dialogues. 3335-3339 - Salvador Medina, Sarah Taylor, Mark Tiede, Alexander G. Hauptmann, Iain A. Matthews:
Importance of Parasagittal Sensor Information in Tongue Motion Capture Through a Diphonic Analysis. 3340-3344 - Marc-Antoine Georges, Laurent Girin, Jean-Luc Schwartz, Thomas Hueber:
Learning Robust Speech Representation with an Articulatory-Regularized Variational Autoencoder. 3345-3349 - Heather Weston, Laura L. Koenig, Susanne Fuchs:
Changes in Glottal Source Parameter Values with Light to Moderate Physical Load. 3350-3354
Speech Enhancement and Coding
- Mohammad Hassan Vali, Tom Bäckström:
End-to-End Optimized Multi-Stage Vector Quantization of Spectral Envelopes for Speech and Audio Coding. 3355-3359 - Santhan Kumar Reddy Nareddula, Subrahmanyam Gorthi, Rama Krishna Sai Subrahmanyam Gorthi:
Fusion-Net: Time-Frequency Information Fusion Y-Network for Speech Enhancement. 3360-3364 - Lubos Marcinek, Michael Stone, Rebecca E. Millman, Patrick Gaydecki:
N-MTTL SI Model: Non-Intrusive Multi-Task Transfer Learning-Based Speech Intelligibility Prediction Model with Scenery Classification. 3365-3369
Emotion and Sentiment Analysis II
- Yangyang Xia, Li-Wei Chen, Alexander Rudnicky, Richard M. Stern:
Temporal Context in Speech Emotion Recognition. 3370-3374 - Hang Li, Wenbiao Ding, Zhongqin Wu, Zitao Liu:
Learning Fine-Grained Cross Modality Excitement for Speech Emotion Recognition. 3375-3379 - Einari Vaaras, Sari Ahlqvist-Björkroth, Konstantinos Drossos, Okko Räsänen:
Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit. 3380-3384 - Fan Qian, Jiqing Han:
Multimodal Sentiment Analysis with Temporal Modality Attention. 3385-3389 - Mani Kumar Tellamekala, Enrique Sanchez, Georgios Tzimiropoulos, Timo Giesbrecht, Michel F. Valstar:
Stochastic Process Regression for Cross-Cultural Speech Emotion Recognition. 3390-3394 - Haoqi Li, Yelin Kim, Cheng-Hao Kuo, Shrikanth S. Narayanan:
Acted vs. Improvised: Domain Adaptation for Elicitation Approaches in Audio-Visual Emotion Recognition. 3395-3399 - Leonardo Pepino, Pablo Riera, Luciana Ferrer:
Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. 3400-3404 - Jiawang Liu, Haoxiang Wang:
Graph Isomorphism Network for Speech Emotion Recognition. 3405-3409 - Pooja Kumawat, Aurobinda Routray:
Applying TDNN Architectures for Analyzing Duration Dependencies on Speech Emotion Recognition. 3410-3414 - Aaron Keesing, Yun Sing Koh, Michael Witbrock:
Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech. 3415-3419 - Suwon Shon, Pablo Brusco, Jing Pan, Kyu Jeong Han, Shinji Watanabe:
Leveraging Pre-Trained Language Model for Speech Sentiment Analysis. 3420-3424
Multi- and Cross-Lingual ASR, Other Topics in ASR
- Wenxin Hou, Jindong Wang, Xu Tan, Tao Qin, Takahiro Shinozaki:
Cross-Domain Speech Recognition with Unsupervised Character-Level Distribution Matching. 3425-3429 - Naoyuki Kanda, Guoli Ye, Yu Wu, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka:
Large-Scale Pre-Training of End-to-End Multi-Talker ASR for Meeting Transcription with Single Distant Microphone. 3430-3434 - Liang Lu, Zhong Meng, Naoyuki Kanda, Jinyu Li, Yifan Gong:
On Minimum Word Error Rate Training of the Hybrid Autoregressive Transducer. 3435-3439 - Jaeyoung Kim, Han Lu, Anshuman Tripathi, Qian Zhang, Hasim Sak:
Reducing Streaming ASR Model Delay with Self Alignment. 3440-3444 - Anuj Diwan, Preethi Jyothi:
Reduce and Reconstruct: ASR for Low-Resource Phonetic Languages. 3445-3449 - Takashi Fukuda, Samuel Thomas:
Knowledge Distillation Based Training of Universal ASR Source Models for Cross-Lingual Transfer. 3450-3454 - Swayambhu Nath Ray, Minhua Wu, Anirudh Raju, Pegah Ghahremani, Raghavendra Bilgi, Milind Rao, Harish Arsikere, Ariya Rastrow, Andreas Stolcke, Jasha Droppo:
Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End. 3455-3459 - Zhiyun Lu, Wei Han, Yu Zhang, Liangliang Cao:
Exploring Targeted Universal Adversarial Perturbations to End-to-End ASR Models. 3460-3464 - Miguel Del Rio, Natalie Delworth, Ryan Westerman, Michelle Huang, Nishchal Bhandari, Joseph Palakapilly, Quinten McNamara, Joshua Dong, Piotr Zelasko, Miguel Jette:
Earnings-21: A Practical Benchmark for ASR in the Wild. 3465-3469 - Eric Sun, Jinyu Li, Zhong Meng, Yu Wu, Jian Xue, Shujie Liu, Yifan Gong:
Improving Multilingual Transformer Transducer Models by Reducing Language Confusions. 3470-3474 - Ahmed Ali, Shammur Absar Chowdhury, Amir Hussein, Yasser Hifny:
Arabic Code-Switching Speech Recognition Using Monolingual Data. 3475-3479
Source Separation II
- Aviad Eisenberg, Boaz Schwartz, Sharon Gannot:
Online Blind Audio Source Separation Using Recursive Expectation-Maximization. 3480-3484 - Yi Luo, Cong Han, Nima Mesgarani:
Empirical Analysis of Generalized Iterative Speech Separation Networks. 3485-3489 - Thilo von Neumann, Keisuke Kinoshita, Christoph Böddeker, Marc Delcroix, Reinhold Haeb-Umbach:
Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers. 3490-3494 - Jisi Zhang, Catalin Zorila, Rama Doddipatla, Jon Barker:
Teacher-Student MixIT for Unsupervised and Semi-Supervised Speech Separation. 3495-3499 - Marc Delcroix, Jorge Bennasar Vázquez, Tsubasa Ochiai, Keisuke Kinoshita, Shoko Araki:
Few-Shot Learning of New Sound Classes for Target Sound Extraction. 3500-3504 - Cong Han, Yi Luo, Nima Mesgarani:
Binaural Speech Separation of Moving Speakers With Preserved Spatial Cues. 3505-3509 - Shell Xu Hu, Md Rifat Arefin, Viet-Nhat Nguyen, Alish Dipani, Xaq Pitkow, Andreas Savas Tolias:
AvaTr: One-Shot Speaker Extraction with Transformers. 3510-3514 - Saurjya Sarkar, Emmanouil Benetos, Mark B. Sandler:
Vocal Harmony Separation Using Time-Domain Neural Networks. 3515-3519 - Matthew Maciejewski, Shinji Watanabe, Sanjeev Khudanpur:
Speaker Verification-Based Evaluation of Single-Channel Speech Separation. 3520-3524 - Tian Lan, Yuxin Qian, Yilan Lyu, Refuoe Mokhosi, Wenxin Tai, Qiao Liu:
Improved Speech Separation with Time-and-Frequency Cross-Domain Feature Selection. 3525-3529 - Chengyun Deng, Shiqian Ma, Yongtao Sha, Yi Zhang, Hui Zhang, Hui Song, Fei Wang:
Robust Speaker Extraction Network Based on Iterative Refined Adaptation. 3530-3534 - Wupeng Wang, Chenglin Xu, Meng Ge, Haizhou Li:
Neural Speaker Extraction with Speaker-Speech Cross-Attention Network. 3535-3539 - Rémi Rigal, Jacques Chodorowski, Benoît Zerr:
Deep Audio-Visual Speech Separation Based on Facial Motion. 3540-3544
Speaker Diarization II
- Prachi Singh, Rajat Varma, Venkat Krishnamohan, Srikanth Raj Chetupalli, Sriram Ganapathy:
LEAP Submission for the Third DIHARD Diarization Challenge. 3545-3549 - Shiliang Zhang, Siqi Zheng, Weilong Huang, Ming Lei, Hongbin Suo, Jinwei Feng, Zhijie Yan:
Investigation of Spatial-Acoustic Features for Overlapping Speech Detection in Multiparty Meetings. 3550-3554 - Maokui He, Desh Raj, Zili Huang, Jun Du, Zhuo Chen, Shinji Watanabe:
Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker. 3555-3559 - Nauman Dawalatabad, Mirco Ravanelli, François Grondin, Jenthe Thienpondt, Brecht Desplanques, Hwidong Na:
ECAPA-TDNN Embeddings for Speaker Diarization. 3560-3564 - Keisuke Kinoshita, Marc Delcroix, Naohiro Tawara:
Advances in Integration of End-to-End Neural and Clustering-Based Diarization for Real Conversational Speech. 3565-3569 - Neville Ryant, Prachi Singh, Venkat Krishnamohan, Rajat Varma, Kenneth Church, Christopher Cieri, Jun Du, Sriram Ganapathy, Mark Y. Liberman:
The Third DIHARD Diarization Challenge. 3570-3574 - Tsun-Yat Leung, Lahiru Samarakoon:
Robust End-to-End Speaker Diarization with Conformer and Additive Margin Penalty. 3575-3579 - Benjamin O'Brien, Natalia A. Tomashenko, Anaïs Chanclu, Jean-François Bonastre:
Anonymous Speaker Clusters: Making Distinctions Between Anonymised Speech Recordings with Clustering Interface. 3580-3584 - Kiran Karra, Alan McCree:
Speaker Diarization Using Two-Pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings. 3585-3589
Speech Synthesis: Toward End-to-End Synthesis I
- Zhenhou Hong, Jianzong Wang, Xiaoyang Qu, Jie Liu, Chendong Zhao, Jing Xiao:
Federated Learning with Dynamic Transformer for Text to Speech. 3590-3594 - Huu-Kim Nguyen, Kihyuk Jeong, Seyun Um, Min-Jae Hwang, Eunwoo Song, Hong-Goo Kang:
LiteTTS: A Lightweight Mel-Spectrogram-Free Text-to-Wave Synthesizer Based on Generative Adversarial Networks. 3595-3599 - Chuanxin Tang, Chong Luo, Zhiyuan Zhao, Dacheng Yin, Yucheng Zhao, Wenjun Zeng:
Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration. 3600-3604 - Myeonghun Jeong, Hyeongju Kim, Sung Jun Cheon, Byoung Jin Choi, Nam Soo Kim:
Diff-TTS: A Denoising Diffusion Model for Text-to-Speech. 3605-3609 - Jae-Sung Bae, Taejun Bak, Young-Sun Joo, Hoon-Young Cho:
Hierarchical Context-Aware Transformers for Non-Autoregressive Text to Speech. 3610-3614 - Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux:
Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. 3615-3619 - Penny Karanasou, Sri Karlapati, Alexis Moinet, Arnaud Joly, Ammar Abbas, Simon Slangen, Jaime Lorenzo-Trueba, Thomas Drugman:
A Learned Conditional Prior for the VAE Acoustic Space of a TTS System. 3620-3624 - Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, Yannis Stylianou:
A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization. 3625-3629 - Yi-Chiao Wu, Cheng-Hung Hu, Hung-Shin Lee, Yu-Huai Peng, Wen-Chin Huang, Yu Tsao, Hsin-Min Wang, Tomoki Toda:
Relational Data Selection for Data Augmentation of Speaker-Dependent Multi-Band MelGAN Vocoder. 3630-3634 - Hyunseung Chung, Sang-Hoon Lee, Seong-Whan Lee:
Reinforce-Aligner: Reinforcement Alignment Search for Robust End-to-End Text-to-Speech. 3635-3639 - Shilun Lin, Fenglong Xie, Li Meng, Xinhui Li, Li Lu:
Triple M: A Practical Text-to-Speech Synthesis System with Multi-Guidance Attention and Multi-Band Multi-Time LPCNet. 3640-3644 - Edresson Casanova, Christopher Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Candido Jr., Anderson da Silva Soares, Sandra Maria Aluísio, Moacir Antonelli Ponti:
SC-GlowTTS: An Efficient Zero-Shot Multi-Speaker Text-To-Speech Model. 3645-3649
Tools, Corpora and Resources
- Ian Palmer, Andrew Rouditchenko, Andrei Barbu, Boris Katz, James R. Glass:
Spoken ObjectNet: A Bias-Controlled Spoken Caption Dataset. 3650-3654 - Elizabeth Salesky, Matthew Wiesner, Jacob Bremerman, Roldano Cattoni, Matteo Negri, Marco Turchi, Douglas W. Oard, Matt Post:
The Multilingual TEDx Corpus for Speech Recognition and Translation. 3655-3659 - David R. Mortensen, Jordan Picone, Xinjian Li, Kathleen Siminyu:
Tusom2021: A Phonetically Transcribed Speech Dataset from an Endangered Language for Universal Phone Recognition Experiments. 3660-3664 - Yihui Fu, Luyao Cheng, Shubo Lv, Yukai Jv, Yuxiang Kong, Zhuo Chen, Yanxin Hu, Lei Xie, Jian Wu, Hui Bu, Xin Xu, Jun Du, Jingdong Chen:
AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario. 3665-3669 - Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, Zhiyong Yan:
GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio. 3670-3674 - You Jin Kim, Hee-Soo Heo, Soyeon Choe, Soo-Whan Chung, Yoohwan Kwon, Bong-Jin Lee, Youngki Kwon, Joon Son Chung:
Look Who's Talking: Active Speaker Detection in the Wild. 3675-3679 - Beena Ahmed, Kirrie J. Ballard, Denis Burnham, Tharmakulasingam Sirojan, Hadi Mehmood, Dominique Estival, Elise Baker, Felicity Cox, Joanne Arciuli, Titia Benders, Katherine Demuth, Barbara Kelly, Chloé Diskin-Holdaway, Mostafa Ali Shahin, Vidhyasaharan Sethu, Julien Epps, Chwee Beng Lee, Eliathamby Ambikairajah:
AusKidTalk: An Auditory-Visual Corpus of 3- to 12-Year-Old Australian Children's Speech. 3680-3684 - Per Fallgren, Jens Edlund:
Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson. 3685-3689 - Elena Ryumina, Oxana Verkholyak, Alexey Karpov:
Annotation Confidence vs. Training Sample Size: Trade-Off Solution for Partially-Continuous Categorical Emotion Recognition. 3690-3694 - Gonçal V. Garcés Díaz-Munío, Joan Albert Silvestre-Cerdà, Javier Jorge, Adrià Giménez-Pastor, Javier Iranzo-Sánchez, Pau Baquero-Arnal, Nahuel Roselló, Alejandro Pérez González de Martos, Jorge Civera, Albert Sanchís, Alfons Juan:
Europarl-ASR: A Large Corpus of Parliamentary Debates for Streaming ASR Benchmarking and Speech Data Filtering/Verbatimization. - Parul Kapoor, Rudrabha Mukhopadhyay, Sindhu B. Hegde, Vinay P. Namboodiri, C. V. Jawahar:
Towards Automatic Speech to Sign Language Generation. 3700-3704 - Won-Ik Cho, Seok Min Kim, Hyunchang Cho, Nam Soo Kim:
kosp2e: Korean Speech to English Translation Corpus. 3705-3709 - Junbo Zhang, Zhiwen Zhang, Yongqing Wang, Zhiyong Yan, Qiong Song, Yukai Huang, Ke Li, Daniel Povey, Yujun Wang:
speechocean762: An Open-Source Non-Native English Speech Corpus for Pronunciation Assessment. 3710-3714
Non-Autoregressive Sequential Modeling for Speech Processing
- Ruchao Fan, Wei Chu, Peng Chang, Jing Xiao, Abeer Alwan:
An Improved Single Step Non-Autoregressive Transformer for Automatic Speech Recognition. 3715-3719 - Pengcheng Guo, Xuankai Chang, Shinji Watanabe, Lei Xie:
Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain. 3720-3724 - Edwin G. Ng, Chung-Cheng Chiu, Yu Zhang, William Chan:
Pushing the Limits of Non-Autoregressive Speech Recognition. 3725-3729 - Alexander H. Liu, Yu-An Chung, James R. Glass:
Non-Autoregressive Predictive Coding for Learning Speech Representations from Local Dependencies. 3730-3734 - Jumon Nozaki, Tatsuya Komatsu:
Relaxing the Conditional Independence Assumption of CTC-Based ASR by Conditioning on Intermediate Predictions. 3735-3739 - Yuya Fujita, Tianzi Wang, Shinji Watanabe, Motoi Omachi:
Toward Streaming ASR with Non-Autoregressive Insertion-Based Model. 3740-3744 - Jaesong Lee, Jingu Kang, Shinji Watanabe:
Layer Pruning on Demand with Intermediate CTC. 3745-3749 - Song Li, Beibei Ouyang, Fuchuan Tong, Dexin Liao, Lin Li, Qingyang Hong:
Real-Time End-to-End Monaural Multi-Speaker Speech Recognition. 3750-3754 - Tianzi Wang, Yuya Fujita, Xuankai Chang, Shinji Watanabe:
Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models. 3755-3759 - Stanislav Beliaev, Boris Ginsburg:
TalkNet: Non-Autoregressive Depth-Wise Separable Convolutional Model for Speech Synthesis. 3760-3764 - Nanxin Chen, Yu Zhang, Heiga Zen, Ron J. Weiss, Mohammad Norouzi, Najim Dehak, William Chan:
WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. 3765-3769 - Nanxin Chen, Piotr Zelasko, Laureano Moro-Velázquez, Jesús Villalba, Najim Dehak:
Align-Denoise: Single-Pass Non-Autoregressive Speech Recognition. 3770-3774 - Hui Lu, Zhiyong Wu, Xixin Wu, Xu Li, Shiyin Kang, Xunying Liu, Helen Meng:
VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis. 3775-3779
The ADReSSo Challenge: Detecting Cognitive Decline Using Speech Only
- Saturnino Luz, Fasih Haider, Sofia de la Fuente, Davida Fromm, Brian MacWhinney:
Detecting Cognitive Decline Using Speech Only: The ADReSSo Challenge. 3780-3784 - Paula Andrea Pérez-Toro, Sebastian P. Bayerl, Tomás Arias-Vergara, Juan Camilo Vásquez-Correa, Philipp Klumpp, Maria Schuster, Elmar Nöth, Juan Rafael Orozco-Arroyave, Korbinian Riedhammer:
Influence of the Interviewer on the Automatic Assessment of Alzheimer's Disease in the Context of the ADReSSo Challenge. 3785-3789 - Youxiang Zhu, Abdelrahman Obyat, Xiaohui Liang, John A. Batsis, Robert M. Roth:
WavBERT: Exploiting Semantic and Non-Semantic Speech Using Wav2vec and BERT for Dementia Detection. 3790-3794 - Lara Gauder, Leonardo Pepino, Luciana Ferrer, Pablo Riera:
Alzheimer Disease Recognition Using Speech-Based Embeddings From Pre-Trained Models. 3795-3799 - Aparna Balagopalan, Jekaterina Novikova:
Comparing Acoustic-Based Approaches for Alzheimer's Disease Detection. 3800-3804 - Yu Qiao, Xuefeng Yin, Daniel Wiechmann, Elma Kerz:
Alzheimer's Disease Detection from Spontaneous Speech Through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models. 3805-3809 - Yilin Pan, Bahman Mirheidari, Jennifer M. Harris, Jennifer C. Thompson, Matthew Jones, Julie S. Snowden, Daniel Blackburn, Heidi Christensen:
Using the Outputs of Different Automatic Speech Recognition Paradigms for Acoustic- and BERT-Based Alzheimer's Dementia Detection Through Spontaneous Speech. 3810-3814 - Zafi Sherhan Syed, Muhammad Shehram Shah Syed, Margaret Lech, Elena Pirogova:
Tackling the ADRESSO Challenge 2021: The MUET-RMIT System for Alzheimer's Dementia Recognition from Spontaneous Speech. 3815-3819 - Morteza Rohanian, Julian Hough, Matthew Purver:
Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs. 3820-3824 - Raghavendra Pappagari, Jaejin Cho, Sonal Joshi, Laureano Moro-Velázquez, Piotr Zelasko, Jesús Villalba, Najim Dehak:
Automatic Detection and Assessment of Alzheimer Disease Using Speech and Language Technologies in Low-Resource Scenarios. 3825-3829 - Jun Chen, Jieping Ye, Fengyi Tang, Jiayu Zhou:
Automatic Detection of Alzheimer's Disease Using Spontaneous Speech Only. 3830-3834 - Ning Wang, Yupeng Cao, Shuai Hao, Zongru Shao, K. P. Subbalakshmi:
Modular Multi-Modal Attention Network for Alzheimer's Disease Detection Using Patient Audio and Language Data. 3835-3839
Robust and Far-Field ASR
- Rong Gong, Carl Quillen, Dushyant Sharma, Andrew Goderre, José Laínez, Ljubomir Milanovic:
Self-Attention Channel Combinator Frontend for End-to-End Multichannel Far-Field Speech Recognition. 3840-3844 - Roberto Gretter, Marco Matassoni, Daniele Falavigna, A. Misra, Chee Wee Leong, Kate M. Knill, Linlin Wang:
ETLT 2021: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech. 3845-3849 - Lars Rumberg, Hanna Ehlert, Ulrike Lüdtke, Jörn Ostermann:
Age-Invariant Training for End-to-End Child Speech Recognition Using Adversarial Multi-Task Learning. 3850-3854 - Samuele Cornell, Alessio Brutti, Marco Matassoni, Stefano Squartini:
Learning to Rank Microphones for Distant Speech Recognition. 3855-3859 - Lucile Gelin, Thomas Pellegrini, Julien Pinquier, Morgane Daniel:
Simulating Reading Mistakes for Child Speech Transformer-Based Phone Recognition. 3860-3864
Speech Synthesis: Prosody Modeling II
- Brooke Stephenson, Thomas Hueber, Laurent Girin, Laurent Besacier:
Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input. 3865-3869 - Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M. C. Harrison, Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby:
Exploring Emotional Prototypes in a High Dimensional TTS Latent Space. 3870-3874 - Devang S. Ram Mohan, Qinmin Vivian Hu, Tian Huey Teh, Alexandra Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti, Jiameng Gao, Simon King:
Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis. 3875-3879 - Alexandra Torresquintero, Tian Huey Teh, Christopher G. R. Wallis, Marlene Staib, Devang S. Ram Mohan, Vivian Hu, Lorenzo Foglianti, Jiameng Gao, Simon King:
ADEPT: A Dataset for Evaluating Prosody Transfer. 3880-3884 - Thi Thu Trang Nguyen, Nguyen Hoang Ky, Albert Rilliard, Christophe d'Alessandro:
Prosodic Boundary Prediction Model for Vietnamese Text-To-Speech. 3885-3889
Source Separation III
- Shaked Dovrat, Eliya Nachmani, Lior Wolf:
Many-Speakers Single Channel Speech Separation with Optimal Permutation Training. 3890-3894 - Mieszko Fras, Marcin Witkowski, Konrad Kowalczyk:
Combating Reverberation in NTF-Based Speech Separation Using a Sub-Source Weighted Multichannel Wiener Filter and Linear Prediction. 3895-3899 - Martin Strauss, Jouni Paulus, Matteo Torcoli, Bernd Edler:
A Hands-On Comparison of DNNs for Dialog Separation Using Transfer Learning from Music Source Separation. 3900-3904 - Marvin Borsdorf, Chenglin Xu, Haizhou Li, Tanja Schultz:
GlobalPhone Mix-To-Separate Out of 2: A Multilingual 2000 Speakers Mixtures Database for Speech Separation. 3905-3909
Non-Native Speech
- Kimiko Tsukada, Yu Rong, Joo-Yeon Kim, Jeong-Im Han, John Hajek:
Cross-Linguistic Perception of the Japanese Singleton/Geminate Contrast: Korean, Mandarin and Mongolian Compared. 3910-3914 - Daniel Korzekwa, Roberto Barra-Chicote, Szymon Zaporowski, Grzegorz Beringer, Jaime Lorenzo-Trueba, Alicja Serafinowicz, Jasha Droppo, Thomas Drugman, Bozena Kostek:
Detection of Lexical Stress Errors in Non-Native (L2) English with Data Augmentation and Attention. 3915-3919 - Bettina Braun, Nicole Dehé, Marieke Einfeldt, Daniela Wochner, Katharina Zahner-Ritter:
Testing Acoustic Voice Quality Classification Across Languages and Speech Styles. 3920-3924 - Qianyutong Zhang, Kexin Lyu, Zening Chen, Ping Tang:
Acquisition of Prosodic Focus Marking by Three- to Six-Year-Old Children Learning Mandarin Chinese. 3925-3928 - Maryam Sadat Mirzaei, Kourosh Meshgi:
Adaptive Listening Difficulty Detection for L2 Learners Through Moderating ASR Resources. 3929-3933 - Hongwei Ding, Binghuai Lin, Liyuan Wang:
F0 Patterns of L2 English Speech by Mandarin Chinese Learners. 3934-3938 - Binghuai Lin, Liyuan Wang:
A Neural Network-Based Noise Compensation Method for Pronunciation Assessment. 3939-3943 - Jacek Kudera, Philip Georgis, Bernd Möbius, Tania Avgustinova, Dietrich Klakow:
Phonetic Distance and Surprisal in Multilingual Priming: Evidence from Slavic. 3944-3948 - Yuqing Zhang, Zhu Li, Binghuai Lin, Jinsong Zhang:
A Preliminary Study on Discourse Prosody Encoding in L1 and L2 English Spontaneous Narratives. 3949-3953 - Minglin Wu, Kun Li, Wai-Kim Leung, Helen Meng:
Transformer Based End-to-End Mispronunciation Detection and Diagnosis. 3954-3958 - Calbert Graham:
L1 Identification from L2 Speech Using Neural Spectrogram Analysis. 3959-3963
Phonetics II
- Miran Oh, Dani Byrd, Shrikanth S. Narayanan:
Leveraging Real-Time MRI for Illuminating Linguistic Velum Action. 3964-3968 - Zirui Liu, Yi Xu:
Segmental Alignment of English Syllables with Singleton and Cluster Onsets. 3969-3973 - Mísa Hejná:
Exploration of Welsh English Pre-Aspiration: How Wide-Spread is it? 3974-3978 - Beeke Muhlack, Mikey Elmers, Heiner Drenhaus, Jürgen Trouvain, Marjolein van Os, Raphael Werner, Margarita Ryzhova, Bernd Möbius:
Revisiting Recall Effects of Filler Particles in German and English. 3979-3983 - Chunyu Ge, Yixuan Xiong, Peggy Mok:
How Reliable Are Phonetic Data Collected Remotely? Comparison of Recording Devices and Environments on Acoustic Measurements. 3984-3988 - Jing Huang, Feng-fan Hsieh, Yueh-Chin Chang:
A Cross-Dialectal Comparison of Apical Vowels in Beijing Mandarin, Northeastern Mandarin and Southwestern Mandarin: An EMA and Ultrasound Study. 3989-3993 - Mark Gibson, Oihane Muxika, Marianne Pouplier:
Dissecting the Aero-Acoustic Parameters of Open Articulatory Transitions. 3994-3998 - Amelia Jane Gully:
Quantifying Vocal Tract Shape Variation and its Acoustic Impact: A Geometric Morphometric Approach. 3999-4003 - Adriana Guevara-Rukoz, Shi Yu, Sharon Peperkamp:
Speech Perception and Loanword Adaptations: The Case of Copy-Vowel Epenthesis. 4004-4008 - Zhe-chen Guo, Rajka Smiljanic:
Speakers Coarticulate Less When Facing Real and Imagined Communicative Difficulties: An Analysis of Read and Spontaneous Speech from the LUCID Corpus. 4009-4013 - Einar Meister, Lya Meister:
Developmental Changes of Vowel Acoustics in Adolescents. 4014-4018 - Sonia D'Apolito, Barbara Gili Fivela:
Context and Co-Text Influence on the Accuracy Production of Italian L2 Non-Native Sounds. 4019-4023 - Wilbert Heeringa, Hans Van de Velde:
A New Vowel Normalization for Sociophonetics. 4024-4028 - Rosey Billington, Hywel Stoakes, Nick Thieberger:
The Pacific Expansion: Optimizing Phonetic Transcription of Archival Corpora. 4029-4033
Search/Decoding Techniques and Confidence Measures for ASR
- Zhengkun Tian, Jiangyan Yi, Ye Bai, Jianhua Tao, Shuai Zhang, Zhengqi Wen:
FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization. 4034-4038 - Anton Mitrofanov, Mariya Korenevskaya, Ivan Podluzhny, Yuri Y. Khokhlov, Aleksandr Laptev, Andrei Andrusenko, Aleksei Ilin, Maxim Korenevsky, Ivan Medennikov, Aleksei Romanenko:
LT-LM: A Novel Non-Autoregressive Language Model for Single-Shot Lattice Rescoring. 4039-4043 - Cyril Allauzen, Ehsan Variani, Michael Riley, David Rybach, Hao Zhang:
A Hybrid Seq-2-Seq ASR Design for On-Device and Server Applications. 4044-4048 - Hirofumi Inaguma, Tatsuya Kawahara:
VAD-Free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording. 4049-4053 - Zhuoyuan Yao, Di Wu, Xiong Wang, Binbin Zhang, Fan Yu, Chao Yang, Zhendong Peng, Xiaoyu Chen, Lei Xie, Xin Lei:
WeNet: Production Oriented Streaming and Non-Streaming End-to-End Speech Recognition Toolkit. 4054-4058 - Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Takafumi Moriya, Takanori Ashihara, Shota Orihashi, Naoki Makishima:
Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition. 4059-4063 - Mun-Hak Lee, Joon-Hyuk Chang:
Deep Neural Network Calibration for E2E Speech Recognition System. 4064-4068 - Qiujia Li, Yu Zhang, Bo Li, Liangliang Cao, Philip C. Woodland:
Residual Energy-Based Models for End-to-End Speech Recognition. 4069-4073 - David Qiu, Yanzhang He, Qiujia Li, Yu Zhang, Liangliang Cao, Ian McGraw:
Multi-Task Learning for End-to-End ASR Word and Utterance Confidence with Deletion Prediction. 4074-4078 - Anna Ollerenshaw, Md. Asif Jalal, Thomas Hain:
Insights on Neural Representations for End-to-End Speech Recognition. 4079-4083 - Amber Afshan, Kshitiz Kumar, Jian Wu:
Sequence-Level Confidence Classifier for ASR Utterance Accuracy and Application to Acoustic Models. 4084-4088
Speech Synthesis: Linguistic Processing, Paradigms and Other Topics
- Andros Tjandra, Ruoming Pang, Yu Zhang, Shigeki Karita:
Unsupervised Learning of Disentangled Speech Content and Style Representation. 4089-4093 - Eunbi Choi, Hwa-Yeon Kim, Jong-Hwan Kim, Jae-Min Kim:
Label Embedding for Chinese Grapheme-to-Phoneme Conversion. 4094-4098 - Haiteng Zhang:
PDF: Polyphone Disambiguation in Chinese by Using FLAT. 4099-4103 - Junjie Li, Zhiyu Zhang, Minchuan Chen, Jun Ma, Shaojun Wang, Jing Xiao:
Improving Polyphone Disambiguation for Mandarin Chinese by Combining Mix-Pooling Strategy and Window-Based Attention. 4104-4108 - Yi Shi, Congyi Wang, Yu Chen, Bin Wang:
Polyphone Disambiguation in Mandarin Chinese with Semi-Supervised Learning. 4109-4113 - Yue Chen, Zhen-Hua Ling, Qing-Feng Liu:
A Neural-Network-Based Approach to Identifying Speakers in Novels. 4114-4118 - Xiao Zhou, Zhen-Hua Ling, Li-Rong Dai:
UnitNet-Based Hybrid Speech Synthesis. 4119-4123 - Sashi Novitasari, Sakriani Sakti, Satoshi Nakamura:
Dynamically Adaptive Machine Speech Chain Inference for TTS in Noisy Environment: Listen and Speak Louder. 4124-4128 - Haozhe Zhang, Zhihua Huang, Zengqiang Shang, Pengyuan Zhang, Yonghong Yan:
LinearSpeech: Parallel Text-to-Speech with Linear Complexity. 4129-4133
Speech Type Classification and Diagnosis
- Noa Mansbach, Evgeny Hershkovitch Neiterman, Amos Azaria:
An Agent for Competing with Humans in a Deceptive Game Based on Vocal Cues. 4134-4138 - Ahmed Fakhry, Xinyi Jiang, Jaclyn Xiao, Gunvant Chaudhari, Asriel Han:
A Multi-Branch Deep Learning Network for Automated Detection of COVID-19. 4139-4143 - Youxuan Ma, Zongze Ren, Shugong Xu:
RW-Resnet: A Novel Speech Anti-Spoofing Model Using Raw Waveform. 4144-4148 - Hira Dhamyal, Ayesha Ali, Ihsan Ayyub Qazi, Agha Ali Raza:
Fake Audio Detection in Resource-Constrained Settings Using Microfeatures. 4149-4153 - Tianhao Yan, Hao Meng, Emilia Parada-Cabaleiro, Shuo Liu, Meishu Song, Björn W. Schuller:
Coughing-Based Recognition of Covid-19 with Spatial Attentive ConvLSTM Recurrent Neural Networks. 4154-4158 - Soumava Paul, Gurunath Reddy M, K. Sreenivasa Rao, Partha Pratim Das:
Knowledge Distillation for Singing Voice Detection. 4159-4163 - Ryu Takeda, Kazunori Komatani:
Age Estimation with Speech-Age Model for Heterogeneous Speech Datasets. 4164-4168 - Kah Kuan Teh, Huy Dat Tran:
Open-Set Audio Classification with Limited Training Resources Based on Augmentation Enhanced Variational Auto-Encoder GAN with Detection-Classification Joint Training. 4169-4173 - Takahiro Fukumori:
Deep Spectral-Cepstral Fusion for Shouted and Normal Speech Classification. 4174-4178 - Shikha Baghel, Mrinmoy Bhattacharjee, S. R. Mahadeva Prasanna, Prithwijit Guha:
Automatic Detection of Shouted Speech Segments in Indian News Debates. 4179-4183 - Yang Gao, Tyler Vuong, Mahsa Elyasi, Gaurav Bharaj, Rita Singh:
Generalized Spoofing Detection Inspired from Audio Generation Artifacts. 4184-4188 - Weiguang Chen, Van Tung Pham, Eng Siong Chng, Xionghu Zhong:
Overlapped Speech Detection Based on Spectral and Spatial Feature Fusion. 4189-4193
Spoken Term Detection & Voice Search
- Badr M. Abdullah, Marius Mosbach, Iuliia Zaitova, Bernd Möbius, Dietrich Klakow:
Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study. 4194-4198 - Zheng Gao, Radhika Arava, Qian Hu, Xibin Gao, Thahir Mohamed, Wei Xiao, Mohamed Abdelhady:
Paraphrase Label Alignment for Voice Application Retrieval in Spoken Language Understanding. 4199-4203 - Rajeev Rikhye, Quan Wang, Qiao Liang, Yanzhang He, Ding Zhao, Yiteng Huang, Arun Narayanan, Ian McGraw:
Personalized Keyphrase Detection Using Speaker and Environment Information. 4204-4208 - Vineet Garg, Wonil Chang, Siddharth Sigtia, Saurabh Adya, Pramod Simha, Pranay Dighe, Chandra Dhir:
Streaming Transformer for Hardware Efficient Voice Trigger Detection and False Trigger Mitigation. 4209-4213 - Mark Mazumder, Colby R. Banbury, Josh Meyer, Pete Warden, Vijay Janapa Reddi:
Few-Shot Keyword Spotting in Any Language. 4214-4218 - Li Wang, Rongzhi Gu, Nuo Chen, Yuexian Zou:
Text Anchor Based Metric Learning for Small-Footprint Keyword Spotting. 4219-4223 - Yangbin Chen, Tom Ko, Jianping Wang:
A Meta-Learning Approach for User-Defined Spoken Term Classification with Varying Classes and Examples. 4224-4228 - Dongyub Lee, Byeongil Ko, Myeongcheol Shin, Taesun Whang, Daniel Lee, Eun Hwa Kim, EungGyun Kim, Jaechoon Jo:
Auxiliary Sequence Labeling Tasks for Disfluency Detection. 4229-4233 - Hang Zhou, Wenchao Hu, Yu Ting Yeung, Xiao Chen:
Energy-Friendly Keyword Spotting System Using Add-Based Convolution. 4234-4238 - Yan Jia, Xingming Wang, Xiaoyi Qin, Yinping Zhang, Xuyang Wang, Junjie Wang, Dong Zhang, Ming Li:
The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results. 4239-4243 - Jingsong Wang, Yuxuan He, Chunyu Zhao, Qijie Shao, Wei-Wei Tu, Tom Ko, Hung-yi Lee, Lei Xie:
Auto-KWS 2021 Challenge: Task, Datasets, and Baselines. 4244-4248 - Axel Berg, Mark O'Connor, Miguel Tairum Cruz:
Keyword Transformer: A Self-Attention Model for Keyword Spotting. 4249-4253 - Abhijeet Awasthi, Kevin Kilgour, Hassan Rom:
Teaching Keyword Spotters to Spot New Keywords with Limited Examples. 4254-4258
Voice Anti-Spoofing and Countermeasure
- Xin Wang, Junichi Yamagishi:
A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection. 4259-4263 - Lin Zhang, Xin Wang, Erica Cooper, Junichi Yamagishi, Jose Patino, Nicholas W. D. Evans:
An Initial Investigation for Detecting Partially Spoofed Audio. 4264-4268 - Yang Xie, Zhenchuan Zhang, Yingchun Yang:
Siamese Network with wav2vec Feature for Spoofing Speech Detection. 4269-4273 - Xingliang Cheng, Mingxing Xu, Thomas Fang Zheng:
Cross-Database Replay Detection in Terminal-Dependent Speaker Verification. 4274-4278 - Yuxiang Zhang, Wenchao Wang, Pengyuan Zhang:
The Effect of Silence and Dual-Band Fusion in Anti-Spoofing System. 4279-4283 - Zhiyuan Peng, Xu Li, Tan Lee:
Pairing Weak with Strong: Twin Models for Defending Against Adversarial Attack on Speaker Verification. 4284-4288 - Hefei Ling, Leichao Huang, Junrui Huang, Baiyan Zhang, Ping Li:
Attention-Based Convolutional Neural Network for ASV Spoofing Detection. 4289-4293 - Haibin Wu, Yang Zhang, Zhiyong Wu, Dong Wang, Hung-yi Lee:
Voting for the Right Answer: Adversarial Defense for Speaker Verification. 4294-4298 - Tomi Kinnunen, Andreas Nautsch, Md. Sahidullah, Nicholas W. D. Evans, Xin Wang, Massimiliano Todisco, Héctor Delgado, Junichi Yamagishi, Kong Aik Lee:
Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing. 4299-4303 - Jesús Villalba, Sonal Joshi, Piotr Zelasko, Najim Dehak:
Representation Learning to Classify and Detect Adversarial Attacks Against Speaker and Speech Recognition Systems. 4304-4308 - You Zhang, Ge Zhu, Fei Jiang, Zhiyao Duan:
An Empirical Study on Channel Effects for Synthetic Voice Spoofing Countermeasure Systems. 4309-4313 - Xu Li, Xixin Wu, Hui Lu, Xunying Liu, Helen Meng:
Channel-Wise Gated Res2Net: Towards Robust Detection of Synthetic Speech Attacks. 4314-4318 - Wanying Ge, Michele Panariello, Jose Patino, Massimiliano Todisco, Nicholas W. D. Evans:
Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection. 4319-4323
OpenASR20 and Low Resource ASR Development
- Kay Peterson, Audrey Tong, Yan Yu:
OpenASR20: An Open Challenge for Automatic Speech Recognition of Conversational Telephone Speech in Low-Resource Languages. 4324-4328 - Srikanth R. Madikeri, Petr Motlícek, Hervé Bourlard:
Multitask Adaptation with Lattice-Free MMI for Multi-Genre Speech Recognition of Low Resource Languages. 4329-4333 - Qiu-Shi Zhu, Jie Zhang, Ming-Hui Wu, Xin Fang, Li-Rong Dai:
An Improved Wav2Vec 2.0 Pre-Training Approach Using Enhanced Local Dependency Modeling for Speech Recognition. 4334-4338 - Hung-Pang Lin, Yu-Jia Zhang, Chia-Ping Chen:
Systems for Low-Resource Speech Recognition Tasks in Open Automatic Speech Recognition and Formosa Speech Recognition Challenges. 4339-4343 - Jing Zhao, Zhiqiang Lv, Ambyera Han, Guan-Bo Wang, Gui-Xin Shi, Jian Kang, Jinghao Yan, Pengfei Hu, Shen Huang, Weiqiang Zhang:
The TNT Team System Descriptions of Cantonese and Mongolian for IARPA OpenASR20. 4344-4348 - Tanel Alumäe, Jiaming Kong:
Combining Hybrid and End-to-End Approaches for the OpenASR20 Challenge. 4349-4353 - Ethan Morris, Robbie Jimerson, Emily Prud'hommeaux:
One Size Does Not Fit All in Resource-Constrained ASR. 4354-4358
Survey Talk 4: Alejandrina Cristia
- Alejandrina Cristià:
Child Language Acquisition Studied with Wearables.
Keynote 4: Tomáš Mikolov
- Tomás Mikolov:
Language Modeling and Artificial Intelligence.
Voice Activity Detection
- Pablo Gimeno, Alfonso Ortega Giménez, Antonio Miguel, Eduardo Lleida:
Unsupervised Representation Learning for Speech Activity Detection in the Fearless Steps Challenge 2021. 4359-4363 - Tyler Vuong, Yangyang Xia, Richard M. Stern:
The Application of Learnable STRF Kernels to the 2021 Fearless Steps Phase-03 SAD Challenge. 4364-4368 - Seyyed Saeed Sarfjoo, Srikanth R. Madikeri, Petr Motlícek:
Speech Activity Detection Based on Multilingual Speech Recognition System. 4369-4373 - Jarrod Luckenbaugh, Samuel Abplanalp, Rachel Gonzalez, Daniel Fulford, David Gard, Carlos Busso:
Voice Activity Detection with Teacher-Student Domain Emulation. 4374-4378 - Omid Ghahabi, Volker Fischer:
EML Online Speech Activity Detection for the Fearless Steps Challenge Phase-III. 4379-4382
Keyword Search and Spoken Language Processing
- Kuba Lopatka, Katarzyna Kaszuba-Miotke, Piotr Klinke, Pawel Trella:
Device Playback Augmentation with Echo Cancellation for Keyword Spotting. 4383-4387 - Bolaji Yusuf, Alican Gök, Batuhan Gündogdu, Murat Saraclar:
End-to-End Open Vocabulary Keyword Search. 4388-4392 - Danny Merkx, Stefan L. Frank, Mirjam Ernestus:
Semantic Sentence Similarity: Size does not Always Matter. 4393-4397 - Jan Svec, Lubos Smídl, Josef V. Psutka, Ales Prazák:
Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings. 4398-4402 - François Buet, François Yvon:
Toward Genre Adapted Closed Captioning. 4403-4407
Applications in Transcription, Education and Learning
- Daniel Korzekwa, Jaime Lorenzo-Trueba, Thomas Drugman, Shira Calamaro, Bozena Kostek:
Weakly-Supervised Word-Level Pronunciation Error Detection in Non-Native English Speech. 4408-4412 - Naoyuki Kanda, Guoli Ye, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Zhuo Chen, Takuya Yoshioka:
End-to-End Speaker-Attributed ASR with Transformer. 4413-4417 - Hagen Soltau, Mingqiu Wang, Izhak Shafran, Laurent El Shafey:
Understanding Medical Conversations: Rich Transcription, Confidence Scores & Information Extraction. 4418-4422 - Jazmín Vidal, Cyntia Bonomi, Marcelo Sancinetti, Luciana Ferrer:
Phone-Level Pronunciation Scoring for Spanish Speakers Learning English Using a GOP-DNN System. 4423-4427 - Xiaoshuo Xu, Yueteng Kang, Songjun Cao, Binghuai Lin, Long Ma:
Explore wav2vec 2.0 for Mispronunciation Detection. 4428-4432 - Shintaro Ando, Nobuaki Minematsu, Daisuke Saito:
Lexical Density Analysis of Word Productions in Japanese English Using Acoustic Word Embeddings. 4433-4437 - Binghuai Lin, Liyuan Wang:
Deep Feature Transfer Learning for Automatic Pronunciation Assessment. 4438-4442 - Huayun Zhang, Ke Shi, Nancy F. Chen:
Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil. 4443-4447 - Linkai Peng, Kaiqi Fu, Binghuai Lin, Dengfeng Ke, Jinsong Zhang:
A Study on Fine-Tuning wav2vec2.0 Model for the Task of Mispronunciation Detection and Diagnosis. 4448-4452 - Yu Qiao, Wei Zhou, Elma Kerz, Ralf Schlüter:
The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech. 4453-4457 - Tomohiro Tanaka, Ryo Masumura, Mana Ihori, Akihiko Takashima, Shota Orihashi, Naoki Makishima:
End-to-End Rich Transcription-Style Automatic Speech Recognition with Semi-Supervised Learning. 4458-4462 - Ronald Cumbal, Birger Moëll, José Lopes, Olov Engwall:
"You don't understand me!": Comparing ASR Results for L1 and L2 Speakers of Swedish. 4463-4467 - Yang Zhang, Evelina Bakhturina, Kyle Gorman, Boris Ginsburg:
NeMo Inverse Text Normalization: From Development to Production. 4468-4472 - Satsuki Naijo, Akinori Ito, Takashi Nose:
Improvement of Automatic English Pronunciation Assessment with Small Number of Utterances Using Sentence Speakability. 4473-4477
Emotion and Sentiment Analysis III
- Fasih Haider, Saturnino Luz:
Affect Recognition Through Scalogram and Multi-Resolution Cochleagram Features. 4478-4482 - Jiawang Liu, Haoxiang Wang:
A Speech Emotion Recognition Framework for Better Discrimination of Confusions. 4483-4487 - Ruichen Li, Jinming Zhao, Qin Jin:
Speech Emotion Recognition via Multi-Level Cross-Modal Distillation. 4488-4492 - Koichiro Ito, Takuya Fujioka, Qinghua Sun, Kenji Nagamatsu:
Audio-Visual Speech Emotion Recognition by Disentangling Emotion and Identity Attributes. 4493-4497 - Deboshree Bose, Vidhyasaharan Sethu, Eliathamby Ambikairajah:
Parametric Distributions to Model Numerical Emotion Labels. 4498-4502 - Yuan Gao, Jiaxing Liu, Longbiao Wang, Jianwu Dang:
Metric Learning Based Feature Representation with Gated Fusion Model for Speech Emotion Recognition. 4503-4507 - Xingyu Cai, Jiahong Yuan, Renjie Zheng, Liang Huang, Kenneth Church:
Speech Emotion Recognition with Multi-Task Learning. 4508-4512 - Nadee Seneviratne, Carol Y. Espy-Wilson:
Generalized Dilated CNN Models for Depression Detection Using Inverted Vocal Tract Variables. 4513-4517 - Yuhua Wang, Guang Shen, Yuezhu Xu, Jiahang Li, Zhengdao Zhao:
Learning Mutual Correlation in Multimodal Transformer for Speech Emotion Recognition. 4518-4522 - Jiaxing Liu, Yaodong Song, Longbiao Wang, Jianwu Dang, Ruiguo Yu:
Time-Frequency Representation Learning with Graph Convolutional Network for Dialogue-Level Speech Emotion Recognition. 4523-4527
Resource-Constrained ASR
- Gonçalo Mordido, Matthijs Van Keirsbilck, Alexander Keller:
Compressing 1D Time-Channel Separable Convolutions Using Sparse Random Ternary Matrices. 4528-4532 - Mengli Cheng, Chengyu Wang, Jun Huang, Xiaobo Wang:
Weakly Supervised Construction of ASR Systems from Massive Video Data. 4533-4537 - Byeonggeun Kim, Simyung Chang, Jinkyu Lee, Dooyong Sung:
Broadcasted Residual Learning for Efficient Keyword Spotting. 4538-4542 - Rupak Vignesh Swaminathan, Brian John King, Grant P. Strimel, Jasha Droppo, Athanasios Mouchtaris:
CoDERT: Distilling Encoder Representations with Co-Learning for Transducer-Based Speech Recognition. 4543-4547 - Zhifu Gao, Yiwu Yao, Shiliang Zhang, Jun Yang, Ming Lei, Ian McLoughlin:
Extremely Low Footprint End-to-End ASR System for Smart Device. 4548-4552 - Yuan Shangguan, Rohit Prabhavalkar, Hang Su, Jay Mahadeokar, Yangyang Shi, Jiatong Zhou, Chunyang Wu, Duc Le, Ozlem Kalinli, Christian Fuegen, Michael L. Seltzer:
Dissecting User-Perceived Latency of On-Device E2E Speech Recognition. 4553-4557 - Jonathan Macoskey, Grant P. Strimel, Jinru Su, Ariya Rastrow:
Amortized Neural Networks for Low-Latency Speech Recognition. 4558-4562 - Rami Botros, Tara N. Sainath, Robert David, Emmanuel Guzman, Wei Li, Yanzhang He:
Tied & Reduced RNN-T Decoder. 4563-4567 - Jangho Kim, Simyung Chang, Nojun Kwak:
PQK: Model Compression via Pruning, Quantization, and Knowledge Distillation. 4568-4572 - Varun Nagaraja, Yangyang Shi, Ganesh Venkatesh, Ozlem Kalinli, Michael L. Seltzer, Vikas Chandra:
Collaborative Training of Acoustic Encoders for Speech Recognition. 4573-4577 - Xiong Wang, Sining Sun, Lei Xie, Long Ma:
Efficient Conformer with Prob-Sparse Attention Mechanism for End-to-End Speech Recognition. 4578-4582 - Titouan Parcollet, Mirco Ravanelli:
The Energy and Carbon Footprint of Training End-to-End Speech Recognizers. 4583-4587
Speaker Recognition: Applications
- Long Chen, Venkatesh Ravichandran, Andreas Stolcke:
Graph-Based Label Propagation for Semi-Supervised Speaker Identification. 4588-4592 - Ruirui Li, Chelsea J.-T. Ju, Zeya Chen, Hongda Mao, Oguz Elibol, Andreas Stolcke:
Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition. 4593-4597 - Sandro Cumani, Salvatore Sarni:
A Generative Model for Duration-Dependent Score Calibration. 4598-4602 - Jason Pelecanos, Quan Wang, Ignacio López-Moreno:
Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition. 4603-4607 - Saurabh Kataria, Shi-Xiong Zhang, Dong Yu:
Multi-Channel Speaker Verification for Single and Multi-Talker Speech. 4608-4612 - Dirk Padfield, Daniel J. Liebling:
Chronological Self-Training for Real-Time Speaker Diarization. 4613-4617 - Runqiu Xiao, Xiaoxiao Miao, Wenchao Wang, Pengyuan Zhang, Bin Cai, Liuping Luo:
Adaptive Margin Circle Loss for Speaker Verification. 4618-4622 - Benjamin O'Brien, Christine Meunier, Alain Ghio:
Presentation Matters: Evaluating Speaker Identification Tasks. 4623-4627 - Fuchuan Tong, Yan Liu, Song Li, Jie Wang, Lin Li, Qingyang Hong:
Automatic Error Correction for Speaker Embedding Learning with Noisy Labels. 4628-4632 - Dexin Liao, Jing Li, Yiming Zhi, Song Li, Qingyang Hong, Lin Li:
An Integrated Framework for Two-Pass Personalized Voice Trigger. 4633-4637 - Jiachen Lian, Aiswarya Vinod Kumar, Hira Dhamyal, Bhiksha Raj, Rita Singh:
Masked Proxy Loss for Text-Independent Speaker Verification. 4638-4642
Speech Synthesis: Speaking Style and Emotion
- Keon Lee, Kyumin Park, Daeyoung Kim:
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech. 4643-4647 - Rui Liu, Berrak Sisman, Haizhou Li:
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability. 4648-4652 - Sarath Sivaprasad, Saiteja Kosgi, Vineet Gandhi:
Emotional Prosody Control for Speech Generation. 4653-4657 - Jian Cong, Shan Yang, Na Hu, Guangzhi Li, Lei Xie, Dan Su:
Controllable Context-Aware Conversational Speech Synthesis. 4658-4662 - Minchan Kim, Sung Jun Cheon, Byoung Jin Choi, Jong Jin Kim, Nam Soo Kim:
Expressive Text-to-Speech Using Style Tag. 4663-4667 - Yuzi Yan, Xu Tan, Bohan Li, Guangyan Zhang, Tao Qin, Sheng Zhao, Yuan Shen, Wei-Qiang Zhang, Tie-Yan Liu:
Adaptive Text to Speech for Spontaneous Style. 4668-4672 - Xiang Li, Changhe Song, Jingbei Li, Zhiyong Wu, Jia Jia, Helen Meng:
Towards Multi-Scale Style Control for Expressive Speech Synthesis. 4673-4677 - Shifeng Pan, Lei He:
Cross-Speaker Style Transfer with Prosody Bottleneck in Neural Speech Synthesis. 4678-4682 - Daxin Tan, Tan Lee:
Fine-Grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement. 4683-4687 - Xiaochun An, Frank K. Soong, Lei Xie:
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-End Neural TTS. 4688-4692 - Slava Shechtman, Raul Fernandez, Alexander Sorin, David Haws:
Synthesis of Expressive Speaking Styles with Limited Training Data in a Multi-Speaker, Prosody-Controllable Sequence-to-Sequence Architecture. 4693-4697
Spoken Language Understanding II
- Mai Hoang Dao, Thinh Hung Truong, Dat Quoc Nguyen:
Intent Detection and Slot Filling for Vietnamese. 4698-4702 - Haitao Lin, Lu Xiang, Yu Zhou, Jiajun Zhang, Chengqing Zong:
Augmenting Slot Values and Contexts for Spoken Language Understanding with Pretrained Models. 4703-4707 - Judith Gaspers, Quynh Do, Daniil Sorokin, Patrick Lehnen:
The Impact of Intent Distribution Mismatch on Semi-Supervised Spoken Language Understanding. 4708-4712 - Yidi Jiang, Bidisha Sharma, Maulik C. Madhavi, Haizhou Li:
Knowledge Distillation from BERT Transformer to Speech Transformer for Intent Classification. 4713-4717 - Nick J. C. Wang, Lu Wang, Yandan Sun, Haimei Kang, Dejun Zhang:
Three-Module Modeling For End-to-End Spoken Language Understanding Using Pre-Trained DNN-HMM-Based Acoustic-Phonetic Model. 4718-4722 - Sujeong Cha, Wangrui Hou, Hyun Jung, My Phung, Michael Picheny, Hong-Kwang Jeff Kuo, Samuel Thomas, Edmilson da Silva Morais:
Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs. 4723-4727 - Xianwei Zhang, Liang He:
End-to-End Cross-Lingual Spoken Language Understanding Model with Multilingual Pretraining. 4728-4732 - Hamidreza Saghir, Samridhi Choudhary, Sepehr Eghbali, Clement Chung:
Factorization-Aware Training of Transformers for Natural Language Understanding on the Edge. 4733-4737 - Michael Saxon, Samridhi Choudhary, Joseph P. McKenna, Athanasios Mouchtaris:
End-to-End Spoken Language Understanding for Generalized Voice Assistants. 4738-4742 - Soyeon Caren Han, Siqu Long, Huichun Li, Henry Weld, Josiah Poon:
Bi-Directional Joint Neural Networks for Intent Classification and Slot Filling. 4743-4747
INTERSPEECH 2021 Acoustic Echo Cancellation Challenge
- Ross Cutler, Ando Saabas, Tanel Pärnamaa, Markus Loide, Sten Sootla, Marju Purin, Hannes Gamper, Sebastian Braun, Karsten Sørensen, Robert Aichner, Sriram Srinivasan:
INTERSPEECH 2021 Acoustic Echo Cancellation Challenge. 4748-4752 - Lukas Pfeifenberger, Matthias Zöhrer, Franz Pernkopf:
Acoustic Echo Cancellation with Cross-Domain Learning. 4753-4757 - Shimin Zhang, Yuxiang Kong, Shubo Lv, Yanxin Hu, Lei Xie:
F-T-LSTM Based Complex Network for Joint Acoustic Echo Cancellation and Speech Enhancement. 4758-4762 - Ernst Seidel, Jan Franzen, Maximilian Strake, Tim Fingscheidt:
Y2-Net FCRN for Acoustic Echo and Noise Suppression. 4763-4767 - Renhua Peng, Linjuan Cheng, Chengshi Zheng, Xiaodong Li:
Acoustic Echo Cancellation Using Deep Complex Neural Network with Nonlinear Magnitude Compression and Phase Information. 4768-4772 - Amir Ivry, Israel Cohen, Baruch Berdugo:
Nonlinear Acoustic Echo Cancellation with Deep Learning. 4773-4777
Speech Recognition of Atypical Speech
- Jordan R. Green, Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Katrin Tomanek:
Automatic Speech Recognition of Disordered Speech: Personalized Models Outperforming Human Listeners on Short Phrases. 4778-4782 - Michael Neumann, Oliver Roesler, Jackson Liscombe, Hardik Kothare, David Suendermann-Oeft, David Pautler, Indu Navar, Aria Anvar, Jochen Kumm, Raquel Norel, Ernest Fraenkel, Alexander V. Sherman, James D. Berry, Gary L. Pattee, Jun Wang, Jordan R. Green, Vikram Ramanarayanan:
Investigating the Utility of Multimodal Conversational Technology and Audiovisual Analytic Measures for the Assessment and Monitoring of Amyotrophic Lateral Sclerosis at Scale. 4783-4787 - Enno Hermann, Mathew Magimai-Doss:
Handling Acoustic Variation in Dysarthric Speech Recognition Systems Through Model Combination. 4788-4792 - Mengzhe Geng, Shansong Liu, Jianwei Yu, Xurong Xie, Shoukang Hu, Zi Ye, Zengrui Jin, Xunying Liu, Helen Meng:
Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition. 4793-4797 - Sarah E. Gutz, Hannah P. Rowe, Jordan R. Green:
Speaking with a KN95 Face Mask: ASR Performance and Speaker Compensation. 4798-4802 - Zengrui Jin, Mengzhe Geng, Xurong Xie, Jianwei Yu, Shansong Liu, Xunying Liu, Helen Meng:
Adversarial Data Augmentation for Disordered Speech Recognition. 4803-4807 - Xurong Xie, Rukiye Ruzi, Xunying Liu, Lan Wang:
Variational Auto-Encoder Based Variability Encoding for Dysarthric Speech Recognition. 4808-4812 - Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu, Helen Meng:
Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion. 4813-4817 - Jiajun Deng, Fabian Ritter Gutierrez, Shoukang Hu, Mengzhe Geng, Xurong Xie, Zi Ye, Shansong Liu, Jianwei Yu, Xunying Liu, Helen Meng:
Bayesian Parametric and Architectural Domain Adaptation of LF-MMI Trained TDNNs for Elderly and Dysarthric Speech Recognition. 4818-4822 - Shanqing Cai, Lisie Lillianfeld, Katie Seaver, Jordan R. Green, Michael P. Brenner, Philip C. Nelson, D. Sculley:
A Voice-Activated Switch for Persons with Motor and Speech Impairments: Isolated-Vowel Spotting Using Neural Networks. 4823-4827 - Zhehuai Chen, Bhuvana Ramabhadran, Fadi Biadsy, Xia Zhang, Youzheng Chen, Liyang Jiang, Fang Chu, Rohan Doshi, Pedro J. Moreno:
Conformer Parrotron: A Faster and Stronger End-to-End Speech Conversion and Recognition Model for Atypical Speech. 4828-4832 - Robert L. MacDonald, Pan-Pan Jiang, Julie Cattiau, Rus Heywood, Richard Cave, Katie Seaver, Marilyn A. Ladewig, Jimmy Tobin, Michael P. Brenner, Philip C. Nelson, Jordan R. Green, Katrin Tomanek:
Disordered Speech Data Collection: Lessons Learned at 1 Million Utterances from Project Euphonia. 4833-4837 - Eun Jung Yeo, Sunhee Kim, Minhwa Chung:
Automatic Severity Classification of Korean Dysarthric Speech Using Phoneme-Level Pronunciation Features. 4838-4842 - Subhashini Venugopalan, Joel Shor, Manoj Plakal, Jimmy Tobin, Katrin Tomanek, Jordan R. Green, Michael P. Brenner:
Comparing Supervised Models and Learned Speech Representations for Classifying Intelligibility of Disordered Speech on Selected Phrases. 4843-4847 - Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu, Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis G. Georgiou, Sachin Kajarekar, Jeffrey P. Bigham:
Analysis and Tuning of a Voice Assistant System for Dysfluent Speech. 4848-4852
Show and Tell 4
- Hideki Kawahara, Kohei Yatabe, Ken-Ichi Sakakibara, Mitsunori Mizumachi, Masanori Morise, Hideki Banno, Toshio Irino:
Interactive and Real-Time Acoustic Measurement Tools for Speech Data Acquisition and Presentation: Application of an Extended Member of Time Stretched Pulses. 4853-4854 - Daniel Tihelka, Markéta Rezácková, Martin Gruber, Zdenek Hanzlícek, Jakub Vít, Jindrich Matousek:
Save Your Voice: Voice Banking and TTS for Anyone. 4855-4856 - Yang Zhang, Evelina Bakhturina, Boris Ginsburg:
NeMo (Inverse) Text Normalization: From Development to Production. 4857-4859 - Corentin Hembise, Lucile Gelin, Morgane Daniel:
Lalilo: A Reading Assistant for Children Featuring Speech Recognition-Based Reading Mistake Detection. 4860-4861 - Manh Hung Nguyen, Vu Hoang, Tu Anh Nguyen, Trung H. Bui:
Automatic Radiology Report Editing Through Voice. 4862-4863 - Ke Shi, Kye Min Tan, Huayun Zhang, Siti Umairah Md. Salleh, Shikang Ni, Nancy F. Chen:
WittyKiddy: Multilingual Spoken Language Learning for Kids. 4864-4865 - Chunxiang Jin, Minghui Yang, Zujie Wen:
Duplex Conversation in Outbound Agent System. 4866-4867 - Sathvik Udupa, Anwesha Roy, Abhayjeet Singh, Aravind Illa, Prasanta Kumar Ghosh:
Web Interface for Estimating Articulatory Movements in Speech Production from Acoustics and Text. 4868-4869
