Self-Supervised Speech Representation Learning: A Review

Research output: Contribution to journal › Review › Research › peer-review

Abdelrahman Mohamed
Hung-yi Lee
Lasse Borgholt
Jakob D. Havtorn
Joakim Edin
Christian Igel
Katrin Kirchhoff
Shang-Wen Li
Karen Livescu
Lars Maaløe
Tara N. Sainath
Shinji Watanabe

Although supervised deep learning has revolutionized speech and audio processing, it has necessitated the building of specialist models for individual tasks and application scenarios. It is likewise difficult to apply this approach to dialects and languages for which only limited labeled data is available. Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains. Such methods have shown success in natural language processing and computer vision, achieving new levels of performance while reducing the number of labels required for many downstream scenarios. Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods. Other approaches rely on multi-modal data for pre-training, mixing text or visual data streams with speech. Although self-supervised speech representation learning is still a nascent research area, it is closely related to acoustic word embedding and learning with zero lexical resources, both of which have seen active research for many years. This review presents approaches for self-supervised speech representation learning and their connection to other research areas. Since many current methods focus solely on automatic speech recognition as a downstream task, we review recent efforts on benchmarking learned representations to extend the application beyond speech recognition.

Original language	English
Journal	IEEE Journal of Selected Topics in Signal Processing
Volume	16
Issue number	6
Pages (from-to)	1179-1210
ISSN	1932-4553
DOIs	https://doi.org/10.1109/JSTSP.2022.3207050
Publication status	Published - 2022

Bibliographical note

Publisher Copyright:
IEEE

Research areas

Data models, Hidden Markov models, Representation learning, Self-supervised learning, Speech processing, Speech representations, Task analysis, Training

ID: 322793323
