The Rapport Report
Interspeech 2020: The Papers Our Team Found Most Interesting
Below you’ll see our happy team from last year enjoying each other’s company at Interspeech 2019 in Graz, all blissfully ignorant of what the year 2020 was about to deliver. Going virtual at this year’s conference definitely impacted the energy of the event, but there were still lots of highlights we’re excited to share.
Hats off to the organizers of Interspeech 2020! The first fully virtual Interspeech was well organized, engaging, and accessible. Each paper was accompanied by a one-and-a-half-minute highlight video and a 15-minute in-depth video presentation. For those of us who crave structure, live keynotes and technical sessions (containing the highlight videos and a live Q&A) were a blessing. On the other hand, some of us loved the freedom and accessibility of having all these resources available at our fingertips (an almost infinite scroll!).
Our data, linguistics, and machine learning teams are a pretty diverse group and Interspeech had something for us all. With the aim of improving our audio-to-animation technology, we were most interested in staying up to date with new datasets, speech emotion recognition, self-supervised learning, and automatic speech recognition.
This year’s keynotes were exciting as usual! First, Prof. Pierrehumbert gave her ISCA Medalist talk on the simplicity and complexity of models in language science. Specifically, she presented three case studies (speech production, the theory of intonation, and word formation) and showed that simple models enjoyed success because they were cognitively realistic. The second keynote, by Prof. Shinn-Cunningham, covered the neurological aspects of speech perception and volitional attention. The next keynote, by Prof. Lin-shan Lee, was very inspiring: he presented his efforts in the 1980s to build the first Mandarin ASR system. His most recent work, including the SpeechBERT approach, was also presented this year. He concluded with an inspirational quote: “This may be the golden age we never had for research”. Thank you for the great talk, Prof. Lee!
Finally, the last keynote was given by Shehzad Mevawalla from Amazon Alexa. He presented the technical and practical challenges of building Alexa and future directions for the conversational assistant. It’s always interesting to see behind the scenes, especially for one of the most advanced assistants on the market.
The availability of high-quality speech audio datasets (or lack thereof) has long been a problem for the speech and language processing community. Machine learning projects are heavily data-dependent: a model is only as good as the data used to train it (garbage in, garbage out!).
Here at Rapport, our goal is to animate any sound produced by the vocal tract, so we've recently been investigating what we call "non-speech vocalizations" (laughing, coughing, screaming) as well as different "voice modes" (singing, shouting, whispering). The lack of relevant data quickly became apparent, so we were excited to come across some new datasets at Interspeech this year prioritizing the study of non-speech vocalizations and voice modes. In the first paper, researchers at Utsunomiya University captured social screams in video game-based spoken dialogues. A total of 1,437 screams were manually identified (12 times more than in other existing gaming databases). Next, the Indian Institute of Science has been crowdsourcing a database of breathing, coughing, and voice sounds designed for Covid-19 diagnosis. If you're feeling generous, you can donate your voice data here - it only takes 5 minutes. Lastly, the Jukebox dataset was created by a team at Michigan State University to address the shortage of singing voice data. To top it off, it's multilingual, with data sung in 18 different languages!
As discussed above, machine learning requires a large amount of speech data, which is often hard to come by. YouTube is a great source for speech in many languages, but data collection on YouTube takes a lot of time and effort. This paper presents an open-source Python library called VCTUBE that facilitates the collection of audio data and corresponding transcripts from YouTube. To evaluate the quality of this tool, the researchers used it to download datasets for both Korean and English and trained TTS models for each language. An evaluation on 20 test sentences showed that good results can be obtained using data collected with VCTUBE. Our team is looking forward to experimenting with this tool for quick and efficient data collection, especially for low-resource languages for which we have struggled to find enough data in the past.
In order to produce accurate and emotive facial animation, we first need to correctly detect emotion from audio (see the excellent summary in our Director of Machine Learning’s blog post from a couple of weeks back here). This is a problem we’ve been interested in for a while; we even published a paper about it at last year's Interspeech here.
We were particularly excited by this paper: it is a great extension of the work carried out in our paper from Interspeech 2019 (linked above). Dissanayake et al. reproduced our approach of increasing the generalization of speech emotion recognition models to unseen domains. They go on to show that the addition of an autoencoder further improves the model’s performance by increasing corpus invariance. An analysis of a t-SNE visualization confirms that emotional categories are more clearly separated by the model, while separation according to corpus is less clear, which is the desired result.
This paper presents an approach for training emotion recognition models on multiple corpora. While training on multiple corpora promises to improve a model’s performance and generalization capability, there is always a risk of the model overfitting to individual corpora, and selecting a common label set is a challenge. This paper proposes a new approach to tackle these issues where: (1) the model is composed of parallel output layers, one for each corpus, so there is no need to find a common label set, and (2) the model is trained to detect corpora adversarially, i.e. it is trained to be bad at recognizing which corpus an utterance came from, thus improving its generalization capabilities.
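To make these two ideas concrete, here is a minimal pure-Python sketch of how the losses might be combined. The function name, corpus names, and label sets are our own illustrative stand-ins, not the paper's code, and a real implementation would apply the sign flip through a gradient-reversal layer inside an autodiff framework rather than in the loss arithmetic:

```python
def encoder_objective(emotion_losses, corpus_loss, lam=0.1):
    """The shared encoder minimizes the summed per-corpus emotion
    losses while *maximizing* the corpus-classification loss: the
    gradient-reversal trick is equivalent to subtracting the corpus
    loss with weight lam from the encoder's objective."""
    return sum(emotion_losses) - lam * corpus_loss

# Each corpus keeps its own output layer, so no common label set is
# needed: predictions are read from the head matching the input corpus.
# (Corpus and label names below are illustrative.)
heads = {
    "corpus_A": ["angry", "happy", "sad", "neutral"],
    "corpus_B": ["positive", "negative", "neutral"],
}

# One per-corpus emotion loss per head, plus the adversarial term.
loss = encoder_objective([0.9, 1.2], corpus_loss=0.7, lam=0.5)
```

The key point is the minus sign: the corpus classifier itself is trained normally to identify the corpus, but the encoder receives the reversed gradient, pushing its representations toward corpus invariance.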
AI agents are increasingly being used in customer service roles. For any customer service interaction it is important to know whether or not the customer is satisfied with the outcome. However, as noted by Kim et al., it’s actually quite annoying to be asked “were you satisfied with your service?”! This paper, written by researchers at Amazon Alexa, describes a method for predicting a customer satisfaction score (CSAT) in a continuous and non-intrusive manner. It presents a sentiment analysis model, trained on both acoustic and lexical cues, and explores the relationship between CSAT and sentiment scores. The study showed that CSAT is correlated with sentiment and with valence (measuring whether an emotion is positive or negative), but not with activation (measuring the energy in an emotion). Sentiment and valence scores can then be used to predict CSAT in two ways: a static regression using ν-support vector regression, or a temporal regression using a BLSTM model. Both approaches show a higher correlation with self-reported CSAT than human annotators achieve.
Rather than relying on hand-crafted labels for each problem, it would be great if models could learn useful representations from the unlabeled data itself. This is exactly the promise of self-supervised learning, in which models rely only on unlabeled data during training. There are lots of different approaches to this, but the Contrastive Predictive Coding (CPC) approach is gaining traction among audio researchers. In CPC pre-training, a contrastive loss guides the model to distinguish similar samples from distant samples, in the hope that it will learn useful representations in the process.
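As a rough illustration of the contrastive objective (a minimal sketch of the general idea, not any particular paper's implementation), here is an InfoNCE-style loss in plain Python. In practice the scores would be dot products between a predicted future embedding and the embeddings of one true ("positive") and several distractor ("negative") samples:

```python
import math

def info_nce(pos_score, neg_scores):
    """Contrastive (InfoNCE-style) loss: the cross-entropy of picking
    the positive sample over the negatives, given similarity scores.
    Low when the positive clearly outscores every negative."""
    logits = [pos_score] + list(neg_scores)
    log_denom = math.log(sum(math.exp(s) for s in logits))
    return -(pos_score - log_denom)

# The loss shrinks as the positive pulls away from the negatives:
easy = info_nce(5.0, [0.1, -0.3, 0.2])   # positive well separated
hard = info_nce(0.0, [0.1, -0.3, 0.2])   # positive indistinguishable
```

Minimizing this loss over many (context, future) pairs is what pushes the encoder to keep information that distinguishes nearby audio from unrelated audio.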
In this paper, adversarial perturbations are added to the audio to prevent the model from learning trivial solutions. It is also the first paper to show that CPC pre-training can be used on non-speech audio tasks such as sound event detection.
The phoneme segmentation task consists of finding the boundaries separating the different phonemes in a speech signal. To this end, an interesting approach based on contrastive learning is proposed, in which the model is trained to judge how similar two audio segments are using a contrastive loss. Boundaries can then be obtained by finding the peaks of dissimilarity in the model’s output. This very promising approach yields better performance than the state of the art in unsupervised training and is closing the gap with supervised approaches. Well done!
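A toy sketch of the boundary-finding step (our own simplification: hand-made "frame embeddings" stand in for the model's learned representations, and cosine dissimilarity stands in for the learned similarity score):

```python
import math

def cosine_dissim(a, b):
    """1 minus cosine similarity: near 0 for similar frames,
    larger for dissimilar ones."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def boundary_peaks(frames):
    """Score each adjacent frame pair by dissimilarity, then keep
    local maxima as candidate phoneme boundaries (returned as the
    index of the frame that starts the new segment)."""
    d = [cosine_dissim(frames[i], frames[i + 1])
         for i in range(len(frames) - 1)]
    return [i + 1 for i in range(1, len(d) - 1)
            if d[i] > d[i - 1] and d[i] > d[i + 1]]

# Three 'phoneme' segments of near-identical frames; the two
# transitions between segments show up as dissimilarity peaks.
frames = [[1, 0]] * 3 + [[0, 1]] * 3 + [[1, 1]] * 3
```

Here `boundary_peaks(frames)` finds the two segment transitions; a real system would add peak-prominence thresholds to suppress spurious maxima.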
AUTOMATIC SPEECH RECOGNITION
Automatic Speech Recognition (ASR) isn’t our main focus at Rapport, but we like to keep an eye on it as many of the trends often bleed into other speech audio tasks.
From our friends at the University of Edinburgh’s CSTR, this paper presents an analysis of the raw waveform approach used in ASR, especially focusing on the first convolution layer of the network, which learns filterbanks. There’s insightful analysis on the training dynamics, and it’s great to see the raw waveform approach is alive and well! It is also good to see papers focusing on understanding neural networks, as it is an important direction of research.
As other domains like Natural Language Processing have become increasingly dominated by Transformer-based model architectures, audio has lagged behind, with a general preference to date for traditional CNN- and RNN-based models. One major trend at this Interspeech was a large uptick in Transformer-based architectures for speech tasks. Like RNNs, Transformers model long-term dependencies well, but unlike RNNs they can be scaled to massive, parallel training pipelines. This paper by researchers at Google introduces the Conformer, which combines convolutions to model short-term features in audio with Transformer-style self-attention for long-range context. They present ablation studies across model sizes and architectural choices, and by achieving state-of-the-art results on the LibriSpeech ASR benchmark they show that the Conformer is a powerful new architecture for speech tasks.
This paper from the University of Florida investigates racial biases in state-of-the-art ASR systems, focusing on the case of habitual “be” in African American English (AAE). This form of “be” is a distinctive feature of AAE and marks actions that occur habitually, for example, “Bruce be running to work” (meaning Bruce usually runs to work). The researchers evaluated the performance of two ASR systems, the commercial Google Cloud Speech model and the open-source DeepSpeech system, on data from the Corpus of Regional African American Language. The results show both ASR systems struggle to recognize instances of habitual “be” compared to other types of “be”. Furthermore, this study showed that words surrounding habitual “be” were also more error-prone. Previous studies have shown many ASR systems may be biased against acoustic aspects of AAE, and this paper shows there may also be biases against morpho-syntactic features. ASR systems are part of many increasingly widespread tools, and studies like this one highlight the importance of creating these systems to be inclusive and to mitigate racial biases.
The question of what multilingual models are actually learning is certainly an open one; this paper attempts to answer it from a phonetic perspective. End-to-end phone-level ASR models were trained for 13 languages in monolingual, cross-lingual (zero-shot), and multilingual settings. Not surprisingly, results show consistent and large improvements in the multilingual settings and similar degradations in the cross-lingual settings. Among the many interesting analyses, one that stood out to us was whether multilingual training can improve performance on phones that are not observed outside the target language. Results show some language-unique phones do benefit from multilingual training - and there's a pattern. The language-unique phones with the highest improvements each have a corresponding phone in another language with which they share all but one of their articulatory features. On the other hand, language-unique phones with little improvement have no similar phones by this same criterion (in fact, these are mostly Zulu click consonants). This suggests it’s important to prioritize languages with markedly different phones when adding new languages to multilingual training sets. It also suggests that when thinking about how well models might perform on unseen languages, we should consider their phonetic inventories.
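A hypothetical sketch of that "all but one articulatory feature" criterion. The feature dictionaries below are heavily simplified stand-ins of our own making, not a real phonological inventory or the paper's feature set:

```python
def differs_in_at_most_one(a, b):
    """True if two phones (dicts of articulatory features) share all
    but at most one feature value."""
    return sum(a[k] != b[k] for k in a) <= 1

def near_matches(target, inventory):
    """Phones in the inventory that differ from the target phone in at
    most one articulatory feature - the ones the paper found to
    transfer well in multilingual training."""
    return [name for name, feats in inventory.items()
            if name != target
            and differs_in_at_most_one(inventory[target], feats)]

# Toy feature dicts (illustrative values only).
ph = {
    "b":     {"voice": "voiced",    "place": "bilabial", "manner": "stop"},
    "p":     {"voice": "voiceless", "place": "bilabial", "manner": "stop"},
    "g":     {"voice": "voiced",    "place": "velar",    "manner": "stop"},
    "click": {"voice": "voiceless", "place": "alveolar", "manner": "click"},
}
```

By this criterion "b" has near matches "p" (voicing differs) and "g" (place differs), while the click, differing in several features at once, has none - mirroring why Zulu clicks saw little benefit.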
Despite the circumstances, Interspeech 2020 was interesting, stimulating, and insightful as always. The efforts of the committee to make the experience as smooth as possible were greatly appreciated, and we are already looking forward to next year, when maybe we’ll have something of our own to present. Stay tuned!