Semantic Speech Processing with Neural Networks

Research output: Book/Report › Ph.D. thesis

  • Lasse Borgholt
Human spoken language understanding relies heavily on contextual information. The context of a single word provides important clues that help the listener accurately recognize and understand it. If a word is mispronounced or drowned out by noise, the listener may infer it from context. Words like "park" and "play" have different meanings depending on the context in which they appear. And even when a word is completely unknown to the listener, context may help in deriving its meaning. Thus, training models to identify semantic relations from context is an important path towards computers that can mimic the human understanding of spoken language.
This idea has a long tradition in neural representation learning.
Here, the goal is to learn data representations that are useful for other machine learning tasks. For example, in text-based natural language processing, the idea has inspired approaches for learning semantic word embeddings, such as word2vec. And more recently, it has inspired the development of masked language models, such as BERT.
These approaches have revolutionized natural language processing.
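The core distributional idea behind word2vec-style embeddings can be illustrated with a toy example (a minimal sketch, not code from the thesis): words that occur in similar contexts receive similar vectors, so their similarity can be measured directly from co-occurrence counts. The corpus and helper names here are invented for illustration.

```python
# Toy sketch of the distributional hypothesis: words appearing in similar
# contexts end up with similar context vectors (the intuition behind
# word2vec-style embeddings). Illustrative only; not from the thesis.
from collections import Counter
import math

corpus = [
    "the dog chased the ball in the park",
    "the cat chased the ball in the park",
    "we play music at the party",
]

def context_vector(word, sentences, window=2):
    """Count words co-occurring with `word` within +/- `window` tokens."""
    counts = Counter()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t == word:
                lo, hi = max(0, i - window), i + window + 1
                counts.update(w for w in toks[lo:hi] if w != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

dog, cat, music = (context_vector(w, corpus) for w in ("dog", "cat", "music"))
# "dog" and "cat" share contexts, so they are closer than "dog" and "music".
print(cosine(dog, cat) > cosine(dog, music))
```

Models like word2vec and BERT replace the raw counts above with learned neural representations, but the training signal is the same: predict a word from its context (or vice versa).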
During the course of this thesis project, speech processing has undergone a similar development. However, these models are still evolving and there is much we do not know about what they learn, why they work, and how we can improve them.
This thesis investigates machine learning models that learn semantic features directly from speech. The first part of the thesis studies supervised learning and shows the following.
(i) The performance of end-to-end speech recognition models depends heavily on access to contextual information.
(ii) Question tracking and symptom detection in spoken medical dialogues benefit from multimodal input.
The second part of the thesis focuses on unsupervised learning.
The contributions are the following.
(iii) An overview of unsupervised representation learning for neural speech processing and a corresponding model taxonomy.
(iv) A comparison and analysis of two generations of the popular wav2vec framework for low-resource speech recognition.
(v) A novel hierarchical latent variable model which is benchmarked against other popular stochastic and deterministic models.
(vi) A comparison of contextualized speech representations and speech recognition transcripts when used as input for spoken language understanding tasks.
Original language: English
Publisher: Department of Computer Science, Faculty of Science, University of Copenhagen
Number of pages: 195
Publication status: Published - 2022

ID: 310431624