The Rapport Report
Interspeech 2021: Laughter is no laughing matter
We had a blast at this year's Interspeech conference presenting our research on non-verbal vocalization, and hearing about all the innovative work others have done over the last year in the field of speech technology!
This year, Interspeech was a hybrid conference accommodating both in-person participants in Brno, Czech Republic, and remote participants all over the world. The event offered live-streamed paper presentations, virtual rooms for poster sessions, and a well-rounded roster of featured speakers.
A hybrid conference is tricky to pull off, but Interspeech 2021 was well-executed and engaging to us as remote participants. Bravo to the organizers!
This year we presented a paper called Non-verbal Vocalisation and Laughter Detection using Sequence-to-sequence Models and Multi-label Training, where we tackle the task of detecting laughter and non-verbal vocalizations from speech.
The term non-verbal vocalizations (NVVs) refers to the parts of a conversation that are not spoken words: for example, laughter, breaths, sighs, or coughs. They play an important role in communication, as they carry information about the speaker's physiological state, emotional state, or even their intentions. Accurately detecting these non-verbal vocalizations is an important step towards improving our audio-driven facial animation technology, which will in turn improve the social presence of avatars on the Rapport platform.
But how do you detect non-verbal vocalizations in speech? The standard approach is to train neural network models on an annotated audio dataset where the labels (or annotations) mark the non-verbal vocalizations. To do that, we need an exhaustive list of possible classes to form a label set. While this is almost trivial for most tasks, it is not in our case, as there is no clear definition of NVVs in the literature (Trouvain and Truong, 2012). Hence the most common approach to obtaining the label set is binary mapping, where one keeps the label of interest and maps all other labels to an "other" class. For example, for laughter detection the label set becomes "laughter" and "non-laughter", the latter containing every non-verbal vocalization except laughter.
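To make the idea concrete, here is a minimal sketch of binary mapping. The label names are hypothetical, chosen for illustration; the actual annotations in a corpus such as ICSI differ:

```python
# Hypothetical fine-grained label set for illustration only.
LABELS = ["speech", "laughter", "breath", "cough", "sigh"]

def to_binary(label: str, target: str = "laughter") -> str:
    """Map a fine-grained NVV label to the binary task label set."""
    return target if label == target else "other"

# A toy sequence of frame labels:
frame_labels = ["speech", "laughter", "breath", "laughter", "cough"]
binary = [to_binary(lab) for lab in frame_labels]
# Every non-laughter event, however acoustically distinct, collapses
# into the single "other" class before training.
```

Note how speech, breaths, and coughs all end up indistinguishable under "other"; this is exactly the ambiguity our paper argues against.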
In our paper, we hypothesized that this approach of mapping a lot of different labels to the “other” label is detrimental and can hurt the performance of the model. Our intuition is the “other” label is hard to model as it is composed of very different patterns, which can confuse the model. For example, a breath is acoustically very different from a cough, but the model will be taught that they belong to the same class.
Left: Spectrogram of a cough | Right: Spectrogram of a breath
To remedy that problem we proposed a new training approach, called multi-class training, where the mapping from the original label set to the task label set is performed after training. In other words, the model is trained on as many labels as possible, and the mapping to the binary set is only done at inference time. We also proposed a multi-label variant, in which each frame can carry several labels at once (for example, simultaneous speech and laughter). This should help the model cope with the acoustic differences between labels, as it can now learn to model each non-verbal vocalization separately.
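A minimal sketch of the inference-time mapping for the multi-class approach, assuming a hypothetical five-class model that outputs per-frame softmax posteriors (the class names and function name are ours, not from the paper):

```python
import numpy as np

# Hypothetical full label set the model was trained on.
CLASSES = ["speech", "laughter", "breath", "cough", "sigh"]

def binary_laughter_posterior(class_probs: np.ndarray) -> np.ndarray:
    """Collapse per-class posteriors to laughter vs. other after training.

    class_probs: array of shape (frames, n_classes) holding softmax
    outputs of a model trained on the full label set.
    Returns an array of shape (frames, 2): [p(laughter), p(other)].
    """
    laughter_idx = CLASSES.index("laughter")
    p_laughter = class_probs[:, laughter_idx]
    return np.stack([p_laughter, 1.0 - p_laughter], axis=1)

# Toy posteriors for two frames from a hypothetical model:
probs = np.array([[0.1, 0.7, 0.1, 0.05, 0.05],
                  [0.6, 0.1, 0.1, 0.1, 0.1]])
binary = binary_laughter_posterior(probs)
```

The point of the sketch is that the model itself never sees a lumped-together "other" class; the collapse to the binary task happens only on its outputs.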
We evaluated our approaches on laughter detection using the ICSI corpus and compared them with the literature. We showed that our multi-class and multi-label training approaches yield better performance than the standard binary mapping approach and also outperform previously published results, establishing a new state of the art for this task. More details about the experiments and the results can be found in the paper.
These results clearly support our hypothesis: training on more classes, rather than mapping to a smaller set of classes, improves model performance for the task of laughter detection. This agrees with our assertion that mapping before training and including an "other" class introduces ambiguity that makes the class difficult to model.
With this paper we presented a novel approach for detecting laughter and other non-verbal vocalizations in speech, one that outperforms state-of-the-art systems. Once applied to our facial animation technology, this research will help make Rapport's avatars look more lively than ever!
Conference Highlights You Might Find Interesting
A keynote talk that caught our attention was by Prof. Pascale Fung on the "Ethical and Technological Challenges of Conversational AI". She addressed the ethical issues of chatbots, particularly inappropriate, biased, and even toxic responses. There were many great insights into the current state of conversational agents!
A popular research topic this year was, of course, COVID-19 and coughing. Over the last year, a significant amount of audio data has been collected from people who have been ill with COVID-19. This has enabled a wealth of research in areas of interest to us, such as breath and cough detection.
In particular, there was a great talk by Prof. Sriram Ganapathy on "Uncovering the acoustic cues of COVID-19 infection", in which he showed that using audio recordings to detect COVID-19 is promising; some systems are even entering clinical trials.
Corpora and Tools
Data plays a big role in our technology and we believe you can never have enough! Several new datasets have been released in a variety of languages including Russian, Mandarin, and Kazakh.
As well as using corpora put together by others, we also do data collection ourselves. For this reason, we are always on the lookout for new data collection tools that could improve the time, cost, and annotator user experience for this task. Here are some papers featuring tools we're keen to test out:
- ProsoBeast Prosody Annotation Tool
- Human-in-the-Loop Efficiency Analysis for Binary Classification in Edyson
- Beey: More Than a Speech-to-Text Editor
We are particularly interested in detecting emotion from audio in order to improve our expressive facial animation; see our previous blog post on the topic. This year there was no shortage of research into emotion recognition! Here are some of our favourite papers on the topic:
- A Speech Emotion Recognition Framework for Better Discrimination of Confusions
- Parametric Distributions to Model Numerical Emotion Labels
- M3: MultiModal Masking applied to sentiment analysis
- Acoustic Features and Neural Representations for Categorical Emotion Recognition from Speech
J. Trouvain and K. Truong, "Comparing non-verbal vocalisations in conversational speech corpora," in Proc. of Workshop on Corpora for Research on Emotion Sentiment and Social Signals, ELRA, 2012, pp. 36–39.