The Rapport Report
Emotional State Detection: Deep Learning to the Rescue?
Rapport is making human-machine communication more human, but how exactly? As our CEO Gregor Hofer presented recently, using an animated avatar creates online social presence, which provides a more engaging way for the user to interact online with an AI chatbot or other users (see the excellent summary by Ashley Whitlatch here). But how do you achieve this in practice? Let’s peek under the hood.
To create social presence, the first ingredient is facial animation. It should be accurate with plausible mouth and face movements; it should also be synchronized perfectly with the audio. In Rapport, this is achieved using Speech Graphics’ animation technology. The second ingredient is natural interactions. So what does natural mean here? Well, there are two different scenarios. In the use case of a conversation between a user and an AI, the responses of the AI should be perceived as natural in terms of the dialogue and in terms of facial expression. In a different scenario, the avatars are driven by two or more users in conversation. In this case, an appropriate facial expression matching the tone of the speaker should be displayed. In both cases, the key technology to achieve that is the detection of the speaker's emotional state. For instance, the AI could detect that the user is confused and branch off to another piece of dialogue that provides more help.
How does detecting the emotional state of the speaker work? This task is referred to as “Speech Emotion Recognition” (SER) in the scientific literature, and it has been studied for several decades. As with many speech-related tasks, the only viable approach to date is machine learning (ML), where the model learns how to solve the task directly from data. Early systems were based on simple ML models such as Naive Bayes or k-nearest neighbors (KNN) using heavily engineered features [1]. The term “feature” here refers to a set of transformations applied to the speech signal to obtain a representation that is easier for the ML model to learn from. Determining the set of transformations was a manual process, often guided by trial and error. This approach led to feature sets like GeMAPS [2], which are still successfully used today. Essentially, it can be summarized as: simple model, complicated features.
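To make “simple model, complicated features” concrete, here is a minimal sketch in Python. It is an illustration only, not the pipeline from [1] or the GeMAPS feature set: the two features (RMS energy and zero-crossing rate), the synthetic tones standing in for speech clips, and the labels “calm” and “excited” are all assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

def handcrafted_features(signal):
    """Two hand-engineered descriptors: RMS energy and zero-crossing rate."""
    n = len(signal)
    rms = math.sqrt(sum(x * x for x in signal) / n)
    crossings = sum(1 for a, b in zip(signal, signal[1:]) if (a < 0) != (b < 0))
    zcr = crossings / (n - 1)
    return (rms, zcr)

def knn_predict(train, query, k=3):
    """Majority vote among the k training items closest to the query.

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def tone(freq, amp, n=400, rate=8000):
    """Synthetic stand-in for a speech clip (NOT real emotional speech)."""
    return [amp * math.sin(2 * math.pi * freq * t / rate) for t in range(n)]

# Hypothetical training set: quiet low tones vs. loud high tones.
train = [
    (handcrafted_features(tone(f, a)), label)
    for f, a, label in [
        (100, 0.10, "calm"), (120, 0.15, "calm"), (110, 0.12, "calm"),
        (400, 0.80, "excited"), (450, 0.90, "excited"), (420, 0.85, "excited"),
    ]
]

prediction = knn_predict(train, handcrafted_features(tone(430, 0.7)))
print(prediction)
```

The classifier itself is only a few lines long. All of the design effort sits in `handcrafted_features`, which is the part that historically required manual trial and error.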
In parallel to those efforts, another approach has more recently gained traction: deep learning, with a reversed paradigm of simple features, complicated model. Specifically, this approach uses neural networks that learn the features jointly with the model, which in turn means the model can be fed raw data (i.e. with minimal or no pre-processing). The advantage is that we avoid the manual trial-and-error step of selecting features, which can be sub-optimal. Instead, we let the model itself learn the best features for the task. This approach has led to breakthroughs in perception-related tasks such as computer vision, natural language processing, and speech recognition. In speech emotion recognition, it has quickly become the state of the art.
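As a rough sketch of what “simple features, complicated model” means in practice, the Python snippet below passes raw waveform samples through a single 1-D convolution, a ReLU, and a global max pool. This is only the forward data flow, under strong simplifying assumptions: the waveform and kernel values are made up, and the kernel is fixed here, whereas in a real network its weights would be learned by gradient descent together with the rest of the model.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation, the core operation of a convolutional layer."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]

def relu(values):
    """Element-wise rectifier: negative activations are set to zero."""
    return [max(0.0, v) for v in values]

def global_max_pool(values):
    """Summarize a variable-length feature map with its strongest activation."""
    return max(values)

# Raw samples fed in directly, with no hand-crafted features (toy values).
waveform = [0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 0.5, 0.5]

# Hypothetical kernel; in a trained network these weights are learned from data.
kernel = [1.0, -1.0]

feature_map = relu(conv1d(waveform, kernel))
pooled_feature = global_max_pool(feature_map)
print(feature_map, pooled_feature)
```

With this particular kernel the layer responds to sharp drops in the signal. The point of deep learning is that nobody has to choose that kernel by hand: training picks whichever filters turn out to be useful for the task.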
What does state-of-the-art mean then? Is the task solved? Far from it. While some speech applications are close to reaching human performance, like speech recognition (see [3], for instance), there is still a long way to go for speech emotion recognition. In fact, it has its own unique challenges, distinct from those of other speech processing tasks. For instance, the concept of emotion is hard to define, especially when we need to come up with a finite label set. How many emotions exist? How many do we want to detect? These are still open questions. Most studies use either a small set of categories (3-7 emotions) or continuous values such as arousal and valence. Another challenge is the generalization capability of the model: how well will the model perform on data very different from the data used to train it?
In the literature, the study of generalization capabilities in SER has received little attention. Most recently published scientific papers use a single database: IEMOCAP [4]. It’s a great corpus, built using actors acting out written scripts or improvising on a given theme. While relying on one shared corpus is clearly beneficial for benchmarking, it has a huge drawback: the trained model could be state-of-the-art only on this corpus, and could end up being very poor on any other corpus or any other kind of data. Why? Because neural networks are very good at overfitting to the training data. Overfitting is a very well-known phenomenon in machine learning, where the model learns the training set by heart, leading to poor performance on the test set. A lesser-known phenomenon is something called corpus overfitting. So what is it? It can happen when the model is trained on only one corpus: the network focuses on patterns that are corpus-specific instead of general patterns. For example, when a network is trained for emotion recognition, we want it to detect and use “universal” patterns for recognizing emotion, like laughter, sobbing, grunting or a tense voice. The risk of training the model on a single corpus is that the model will focus on corpus-dependent patterns, such as the recording conditions, the speakers, etc. This can ultimately lead to terrible performance when the model is deployed in production on other data, i.e. in the wild.
So what can we do to prevent that? Several approaches have been proposed to improve the generalization capabilities of neural networks, such as multi-task learning, data augmentation, and transfer learning. One of the most promising approaches is cross-corpus training. It consists of aggregating as many corpora as possible to form a new corpus. This new corpus therefore contains a lot of variation: different speakers, accents, recording conditions, types of speech, languages, etc. In theory, training a model on this pooled corpus should force it to learn patterns that are independent of these variations, which should improve generalization.
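The bookkeeping behind cross-corpus training can be sketched in a few lines of Python. This is a minimal illustration rather than the setup used in any particular study: the function name `pool_corpora`, the corpus name `corpus_c`, and all utterance ids and labels are invented placeholders. One corpus is held out entirely so that it can later serve as an out-of-domain test set.

```python
def pool_corpora(corpora, held_out):
    """Merge every corpus except `held_out` into one training set.

    corpora: dict mapping corpus name -> list of (utterance_id, emotion) pairs.
    Returns (train, test). Each item keeps its corpus name so that results
    can still be reported per corpus after pooling.
    """
    if held_out not in corpora:
        raise KeyError(f"unknown corpus: {held_out}")
    train = [
        (name, utt, emotion)
        for name, items in corpora.items() if name != held_out
        for utt, emotion in items
    ]
    test = [(held_out, utt, emotion) for utt, emotion in corpora[held_out]]
    return train, test

# Toy placeholders; utterance ids and labels are invented for illustration.
corpora = {
    "IEMOCAP": [("i001", "angry"), ("i002", "sad")],
    "TESS": [("t001", "happy"), ("t002", "sad")],
    "corpus_c": [("c001", "happy"), ("c002", "angry")],
}

train, test = pool_corpora(corpora, held_out="TESS")
print(len(train), len(test))
```

Keeping the corpus name attached to every item is a deliberate choice: it makes it possible to check afterwards whether the model separates emotions or merely separates corpora.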
So does it work? We ran a study on this exact problem last year (published at Interspeech 2019 [5]), and here are the highlights of our key findings:
A model trained on a single corpus (IEMOCAP) performs terribly on other corpora, often dropping to chance-level accuracy. In our case, we used three different emotions, so that floor is roughly 33%.
Cross-corpus training improves performance on both in-domain and out-of-domain corpora. In the figure below, the TESS corpus was set aside before training and is therefore considered “out-of-domain”, whereas all the other corpora were seen during training and are therefore “in-domain”.
The neural network architecture matters: convolutional layers are pivotal for generalization. To reach this conclusion, we proposed in the paper a new approach to evaluating generalization capability, based on the following idea. The representations learned near the end of the network, i.e. the outputs of the last hidden layers, should live in a space where the different emotions are clustered: two samples with the same emotion label should be close to each other and far from samples with different emotion labels. To verify that, we used t-SNE [6], a method for visualizing high-dimensional embeddings in 2D. Below are the t-SNE plots for the LSTM model, colored in two different ways: by corpus (left) and by emotion (right).
We can easily see that there is no clustering based on emotion labels (on the right), contrary to what we had hoped for, but there is clustering based on corpora (on the left). This indicates that, despite its acceptable performance, the model is using corpus information to classify emotion. This is a problem: the model is not trying to understand emotion, it’s simply trying to detect corpora. So what happens when we add convolutional layers to the network? It looks much better:
Here the clustering is not on the corpora (left). It’s not as good as we hoped in terms of emotion clustering (on the right), but given the performance, this is not surprising. If you are interested in this study, the full paper can be found here.
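The visual check described above can be complemented with a simple number. The Python sketch below is not the evaluation method from the paper, which relies on inspecting t-SNE plots; it is a hypothetical companion measure, assuming we already have one embedding per utterance. For each point it asks whether its nearest neighbor shares its label, once with corpus labels and once with emotion labels. The embeddings and labels are toy values built to mimic the corpus-clustered case.

```python
import math

def neighbor_agreement(embeddings, labels):
    """Fraction of points whose nearest neighbor (excluding itself) shares their label."""
    hits = 0
    for i, point in enumerate(embeddings):
        nearest = min(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: math.dist(point, embeddings[j]),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(embeddings)

# Hypothetical 2-D embeddings forming two tight groups, one per corpus.
embeddings = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
]
corpus = ["A", "A", "A", "B", "B", "B"]
emotion = ["happy", "sad", "angry", "happy", "sad", "angry"]

corpus_score = neighbor_agreement(embeddings, corpus)
emotion_score = neighbor_agreement(embeddings, emotion)
print(corpus_score, emotion_score)
```

A high corpus score combined with a low emotion score is the signature of corpus overfitting. For a model that generalizes, we would hope to see the opposite pattern.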
To conclude, speech emotion recognition is not a solved problem. But at Rapport, we are working hard to make sure we build the best possible system that can generalize to any kind of input.
[1] Dellaert, Frank, Thomas Polzin, and Alex Waibel. "Recognizing emotion in speech." Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP'96. Vol. 3. IEEE, 1996.
[2] Eyben, Florian, et al. "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing." IEEE Transactions on Affective Computing 7.2 (2015): 190-202.
[3] Saon, George, et al. "English Conversational Telephone Speech Recognition by Humans and Machines." Proc. Interspeech 2017 (2017): 132-136.
[4] Busso, Carlos, et al. "IEMOCAP: Interactive emotional dyadic motion capture database." Language Resources and Evaluation 42.4 (2008): 335.
[5] Parry, Jack, et al. "Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition." Proc. Interspeech 2019 (2019): 1656-1660.
[6] Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9.Nov (2008): 2579-2605.