
November 17th, 2020
Emotional State Detection: Deep Learning to the Rescue?

by Dimitri Palaz, Director of Machine Learning
Rapport is making human-machine communication more human, but how exactly? As our CEO Gregor
Hofer presented recently, using an animated avatar creates online social presence, which
provides a more engaging way for the user to interact online with an AI chatbot or other users
(see the excellent summary by Ashley Whitlatch
here). But how do you achieve this in practice? Let’s peek under the hood.
To create social presence, the first ingredient is facial animation. It should be accurate
with plausible mouth and face movements; it should also be synchronized perfectly with the
audio. In Rapport, this is achieved using Speech Graphics’ animation technology. The second
ingredient is natural interactions. So what does natural mean here? Well, there are
two different scenarios. In the use case of a conversation between a user and an AI, the
responses of the AI should be perceived as natural in terms of the dialogue and in terms of
facial expression. In a different scenario, the avatars are driven by two or more users in
conversation. In this case, an appropriate facial expression matching the tone of the speaker
should be displayed. In both cases, the key technology to achieve that is the detection of the
speaker's emotional state. For instance, the AI could detect that the user is confused and
branch off to another piece of dialogue that provides more help.
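To make this concrete, here is a purely illustrative sketch (in Python) of how a detected emotional state could drive that kind of branching. The emotion labels, the replies, and the function name are invented for the example and are not Rapport's actual API.

```python
# Illustrative only: branching the dialogue on a detected emotional state.
# The labels and replies below are placeholders, not Rapport's actual API.

def route_response(detected_emotion: str, default_reply: str) -> str:
    """Pick the next piece of dialogue based on the speaker's emotional state."""
    branches = {
        "confused": "It sounds like that wasn't clear. Let me walk you through it step by step.",
        "frustrated": "Sorry about the trouble. Would you like me to connect you to a human agent?",
    }
    return branches.get(detected_emotion, default_reply)

print(route_response("confused", "Great, let's move on to the next step."))
```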
How does detecting the emotional state of the speaker work? This task is referred to as
“Speech Emotion Recognition” (SER) in scientific literature, and has been studied for several
decades. As with many speech-related tasks, the only viable approach to date is Machine
Learning (ML), where the model learns how to solve the task directly from data. Early
systems were based on simple ML models such as Naive Bayes or k-nearest neighbors (KNN), using heavily engineered
features [1]. The term feature here refers to a set of transformations applied to the
speech signal in order to obtain a better representation for the ML model to learn.
Determining the set of transformations was a manual process, often guided by trial and error.
This approach led to feature sets like GeMAPS [2], which are still successfully used today.
Essentially, it can be summarized as: simple model, complicated features.
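To give a flavour of that recipe, here is a minimal sketch in Python. It uses MFCC statistics as a stand-in for a proper engineered feature set like GeMAPS, scikit-learn's k-nearest-neighbors classifier as the simple model, and placeholder file names and labels; it is not a production pipeline.

```python
# A minimal sketch of the classic recipe: engineered features plus a simple model.
# MFCC statistics stand in for a full feature set like GeMAPS, and the file names
# and labels are placeholders.

import numpy as np
import librosa
from sklearn.neighbors import KNeighborsClassifier

def utterance_features(path: str, sr: int = 16000) -> np.ndarray:
    """Turn a variable-length utterance into a fixed-size feature vector."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # frame-level features
    # Utterance-level statistics ("functionals") over the frames.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder training data: (wav_path, emotion_label) pairs.
train_items = [("happy_001.wav", "happy"), ("sad_001.wav", "sad")]
X = np.stack([utterance_features(path) for path, _ in train_items])
y = [label for _, label in train_items]

clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print(clf.predict([utterance_features("unknown_001.wav")]))
```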
In parallel to those efforts, another approach has recently gained traction: deep learning, which reverses the paradigm to simple features, complicated model. Specifically, this approach uses neural networks that learn the features jointly with the classifier, working directly from raw data (i.e. with minimal or no pre-processing). The advantage here is that we avoid the manual "trial-and-error" step of selecting features, which can be sub-optimal. Instead, we let the model itself select the best features to use for the task.
This new approach has led to breakthroughs in perception-related tasks such as computer
vision, natural language processing, and speech recognition. In speech emotion recognition,
this approach has quickly become the state-of-the-art.
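To illustrate the reversed recipe, here is a minimal sketch of a network that consumes the raw waveform directly and learns its own representation. It is a toy written in PyTorch, not the architecture from our paper; the layer sizes and the three-class output are arbitrary choices for the example.

```python
# A minimal sketch of "simple features, complicated model": a small convolutional
# network consuming the raw waveform directly. Sizes and the three-class label
# set are illustrative only.

import torch
import torch.nn as nn

class RawWaveformSER(nn.Module):
    def __init__(self, n_emotions: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=400, stride=160),  # learned "feature extraction"
            nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=5, stride=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                         # pool over time
        )
        self.classifier = nn.Linear(64, n_emotions)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, 1, n_samples), e.g. raw 16 kHz audio
        h = self.features(waveform).squeeze(-1)
        return self.classifier(h)

model = RawWaveformSER()
logits = model(torch.randn(2, 1, 16000))  # two one-second dummy utterances
print(logits.shape)                       # torch.Size([2, 3])
```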
What does state-of-the-art mean then? Is the task solved? Far from it. While some speech
applications are close to reaching human performance, like speech recognition (see [3] for
instance), there is still a long way to go for speech emotion recognition. In fact, it has its
own unique challenges distinct from other speech processing tasks. For instance, the concept of
emotion is hard to define, especially when we need to come up with a finite label set. How
many emotions exist? How many do we want to detect? These are still open questions. Most
studies use either a small set of categories (3-7 emotions) or continuous values such as
arousal and valence. Another challenge is the generalization capabilities of the model: how
well will the model perform on data very different from the data used to train it?
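As a small illustration of the two labelling schemes mentioned above, here is how they might be represented in code; the specific class names and value ranges are just one possible convention, not a standard.

```python
# Illustrative sketch of the two common labelling schemes for SER:
# a small categorical set versus continuous arousal/valence dimensions.

from dataclasses import dataclass
from enum import Enum

class Emotion(Enum):        # a typical small categorical set
    NEUTRAL = "neutral"
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"

@dataclass
class DimensionalLabel:     # continuous description of the same utterance
    arousal: float          # calm (low) to excited (high), e.g. in [-1, 1]
    valence: float          # negative (low) to positive (high), e.g. in [-1, 1]

# The same clip can carry either kind of annotation:
clip_label_categorical = Emotion.HAPPY
clip_label_dimensional = DimensionalLabel(arousal=0.7, valence=0.8)
```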
In the literature, the generalization capabilities of SER models have received relatively little attention. Most of the recently published scientific papers use a single database: IEMOCAP [4]. It's a great corpus, built using actors acting out written scripts or improvising on a given theme. While this focus is clearly beneficial for benchmarking, it has a huge drawback: a model can be state-of-the-art on this corpus yet end up performing very poorly on any other corpus or any other kind of data. Why? Because neural networks are very good at overfitting to the training data. Overfitting is a well-known phenomenon in machine learning, where the model learns the training set by heart, leading to poor performance on the test set. A lesser-known phenomenon is corpus overfitting. So what is
it? It can happen when the model is trained on only one corpus: the network focuses on
patterns that are corpus-specific instead of general patterns. For example, when a network is
trained for emotion recognition, we want it to detect and use “universal” patterns for
recognizing emotion, such as laughter, sobbing, grunting or a tense voice. The risk of training the model on a single corpus is that it will instead focus on corpus-dependent patterns, such as the recording conditions, the speakers, etc. This ultimately leads to potentially terrible performance when the model is deployed in production on other corpora, i.e. in the wild.
So what can we do to prevent that? Several approaches have been proposed to improve
generalization capabilities of neural networks, such as multi-task learning, data
augmentation, and transfer learning. One of the most promising approaches is cross-corpus
training. It consists of aggregating as many corpora as possible into one new, larger corpus. This new corpus then contains a lot of variation: different speakers, accents, recording conditions, types of speech, languages, etc. In theory, training a model on it should force the model to learn patterns that are independent of these variations, which should improve generalization.
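As a sketch of what cross-corpus training looks like in practice, the snippet below pools several corpora into one training set and keeps one corpus entirely unseen for evaluation. The file names, the pre-extracted features, and the choice of a simple scikit-learn classifier are all placeholders, not the setup from our paper.

```python
# A minimal sketch of cross-corpus training with one corpus held out entirely.
# File names, the pre-extracted features, and the simple scikit-learn classifier
# are placeholders, not the setup from our paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

corpora = {
    "corpus_a": (np.load("corpus_a_X.npy"), np.load("corpus_a_y.npy")),
    "corpus_b": (np.load("corpus_b_X.npy"), np.load("corpus_b_y.npy")),
    "corpus_c": (np.load("corpus_c_X.npy"), np.load("corpus_c_y.npy")),
}
held_out = "corpus_c"  # never seen during training, i.e. "out-of-domain"

# Aggregate every other corpus into one larger, more varied training set.
X_train = np.concatenate([X for name, (X, y) in corpora.items() if name != held_out])
y_train = np.concatenate([y for name, (X, y) in corpora.items() if name != held_out])
X_test, y_test = corpora[held_out]

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"accuracy on the unseen corpus: {model.score(X_test, y_test):.3f}")
```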
So does it work? We ran a study on this exact problem last year (published at Interspeech 2019
[5]) and here are the key findings:
A model trained on a single corpus (IEMOCAP) performs terribly on other corpora, often dropping to the worst accuracy it can realistically reach, i.e. chance level. In our case we used three different emotions, so that floor is 33%.

Cross-corpus training improves performance on in-domain and out-of-domain corpora.
In the figure below, the TESS corpus was set aside before training and is therefore considered "out-of-domain", whereas all the other corpora were seen during training and are hence "in-domain".

The neural network architecture matters: convolutional layers are pivotal for
generalization.
To reach this conclusion, we proposed in the paper a new way of evaluating generalization capability, based on the following idea. In the model, the representations learned near the end of the network, i.e. the output of the last hidden layers, should live in a space where the different emotions are clustered: two samples with the same emotion label should be close to each other and far from samples with different emotion labels. To verify this, we used t-SNE [6], a method for visualizing embeddings in 2D. Below are the t-SNE plots for the LSTM model, colored in two different ways: by corpus (left) and by emotion (right).
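For reference, this kind of visualization can be produced with a few lines of Python. The sketch below assumes the last-hidden-layer embeddings and the corpus/emotion labels have already been saved to disk under placeholder file names; it is not the exact script from the paper.

```python
# Rough sketch: project last-hidden-layer embeddings to 2D with t-SNE and
# colour the points by corpus and by emotion. Input arrays are assumed to be
# precomputed; file names are placeholders.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.load("hidden_layer_outputs.npy")  # (n_samples, embedding_dim)
corpus_ids = np.load("corpus_ids.npy")            # integer corpus id per sample
emotion_ids = np.load("emotion_ids.npy")          # integer emotion label per sample

points = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)

fig, (ax_corpus, ax_emotion) = plt.subplots(1, 2, figsize=(10, 4))
ax_corpus.scatter(points[:, 0], points[:, 1], c=corpus_ids, s=5, cmap="tab10")
ax_corpus.set_title("coloured by corpus")
ax_emotion.scatter(points[:, 0], points[:, 1], c=emotion_ids, s=5, cmap="tab10")
ax_emotion.set_title("coloured by emotion")
plt.show()
```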

We can easily see that, instead of the expected clustering by emotion label (on the right), there is clustering by corpus (on the left). This indicates that despite the acceptable performance, the model is using corpus information to classify emotion, which is clearly a problem: the model is not trying to understand emotion, it is simply trying to detect the corpus. So what happens when we add convolutional layers to the network? It looks much better:

Here the clustering is no longer by corpus (left). The emotion clustering (right) is not as good as we hoped, but given the model's performance, this is not surprising. If you
are interested in this study, the full paper can be found
here.
To conclude, speech emotion recognition is not a solved problem. But at Rapport, we are
working hard to make sure we build the best possible system that can generalize to any kind of
input.
References
[1] Dellaert, Frank, Thomas Polzin, and Alex Waibel. "Recognizing emotion in speech." Proceedings of the Fourth International Conference on Spoken Language Processing (ICSLP '96), Vol. 3. IEEE, 1996.
[2] Eyben, Florian, et al. "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing." IEEE Transactions on Affective Computing 7.2 (2015): 190-202.
[3] Saon, George, et al. "English Conversational Telephone Speech Recognition by Humans and Machines." Proc. Interspeech 2017 (2017): 132-136.
[4] Busso, Carlos, et al. "IEMOCAP: Interactive emotional dyadic motion capture database." Language Resources and Evaluation 42.4 (2008): 335.
[5] Parry, Jack, et al. "Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition." Proc. Interspeech 2019 (2019): 1656-1660.
[6] Maaten, Laurens van der, and Geoffrey Hinton. "Visualizing data using t-SNE." Journal of Machine Learning Research 9 (2008): 2579-2605.