The Rapport Report
Understanding the Science behind Rapport
With advances in artificial intelligence, humans and machines are entering into a new relationship. We need to communicate with machines the way we communicate with each other. And that’s the goal of Rapport: to make these human-machine interactions more human, and less machine.
Humans have evolved over millions of years as social animals with a very peculiar way of communicating: through our faces.
Imagine how this might look from an alien's perspective. Two people visually lock onto each other's faces. A series of buzzing, popping, and hissing noises comes out of their mouths. While hearing these noises, they watch the mouth movements making them, and notice any other movements in the face. Amazingly, this behavior transmits thoughts, intentions, emotions, and knowledge of the world from one mind to another. Face-to-face communication is the bedrock of human society. It is such a fundamental human behavior that we aren't even aware we do it, let alone how special it is.
With Rapport, we’re all about replicating that behavior.
Face-to-face communication is so important that the internet constantly reproduces it through the screen to create a sense of human contact. Video chat is now ubiquitous. And companies growing reliant on AI to communicate with customers are creating digital humans to embody those AIs.
The Rapport platform empowers these face-to-face conversations, using the Speech Graphics core to analyze voices and visualize both AIs and humans as avatars.
Speech Graphics is the gold-standard facial animation solution used by 90% of AAA video-game publishers. It provides automatic, high-fidelity facial animation. By coupling Speech Graphics technology with Rapport's bleeding-edge web streaming and cloud infrastructure, we have changed the game for human-machine communication. But before describing the technology further, we need to understand the problem it solves.
High-fidelity facial animation has been one of the most difficult problems in computer graphics for two main reasons. The first is that facial behavior is extremely complex, and the second is that we are all acutely sensitive to that complex behavior.
First, there's surface complexity, with dozens of muscles whose main function is simply to deform the skin in various patterns. And then there's the complexity of the control of those muscles, dictating the timing and magnitude of muscle activations over time. In particular, speech is one of the most complicated motor activities you do as a human being. It's fast, efficient, and precise. The movements of the lips, jaw, and tongue in speech are like the body of an athlete: muscle groups coordinating with each other to achieve goals, future tasks efficiently overlapping with current ones, all compressed in time, allowing us to produce around 10-15 phonetic sounds per second with minimal energy and maximum clarity.
All of this makes it challenging to simulate. We could do a rough approximation, but there's another problem: humans are innately sensitive to faces. We automatically focus on faces, and are attuned to their most minute movements. This means we have very little tolerance for unnatural facial movement.
This focus has a basis in the communicative function of the face. The malleability and expressiveness of the human face provides a lot of information, and we're hard-wired to read it. This includes not only emotional information, but speech itself. Whether hearing-impaired or not, we all lip-read. This is most obvious in noisy places like a crowded bar, where you need to see someone's lips in order to hear them.
Since we're experts at reading faces, we really notice when an animated face gets something wrong. The better the graphics, the higher the expectations are for faces. You could have the most photorealistic, beautiful, detailed 3D face in the world, but if it moves in unrealistic or unexpected ways -- too fast, too slow, too far, with the wrong timing, the wrong pattern -- you've failed. The emotional connection to the avatar is broken, and instead there's revulsion. Movement is the essential ingredient to bridging the uncanny valley, because movement is life.
Not only does the movement have to be natural, but in the case of speech, the movement has to be perfectly aligned with an audio channel. We have to create the illusion that the face you see is the source of the sound you hear.
Now that we understand the problem of automatic high-fidelity facial animation, what is the technology?
There are two main approaches. One is motion capture, also called performance capture. This involves using video or another optical technique to track the movement of facial features on a live person's face. That movement is reapplied to the corresponding features on a digital face, thus imitating the real-life movement. This works especially well when the digital face has the exact same shape as the actor's face. There are some downsides though. First of all, such input -- video from a live person -- is not always available. If you have a conversational AI, for example, there's no person to drive the face, just software. And even when you do have a person to act as a source, the analyzed motion can be noisy, and there are problems of occlusion due to the optical tracking, especially for the lips and tongue during speech.
Another more subtle problem with motion capture is the style of the animation and its impact. Because there is nothing driving the animation except this other face, the resulting movement on the avatar can look puppet-like and feel like it's coming from outside the avatar rather than from inside it, which defeats the illusion of life we're trying to achieve.
The second option does not require a motion source. Instead, it takes other input to drive the system, which can be spoken audio or text, and from that generates motion based on an internal model. Since a motion source is not required, such systems can be used for a conversational AI. However, without input motion, these systems also require a more complex generative model that synthesizes motion.
Among these generative approaches, most can be described as purely data-driven. That is, you record a large amount of data from a live actor's performance on two channels, usually voice and facial motion capture, and use machine learning techniques to discover correlations between them, so as to be able to predict, given some audio input, the corresponding motion output. Since the purpose of such a system is to approximate motion capture given audio, it has the same sort of issues as motion capture mentioned previously, including noise, occlusion, and the puppet-like movement effect.
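The purely data-driven idea can be boiled down to a toy sketch: learn a direct mapping from per-frame audio features to per-frame motion parameters (for instance, blendshape weights) from paired recordings. Everything below is an illustrative assumption, not the actual pipeline of any product discussed here; real systems use deep networks and far richer features, but the shape of the problem is the same.

```python
# Toy version of the data-driven approach: predict motion from audio
# features using a linear least-squares regressor on synthetic data.
# Dimensions are invented (e.g., 13 audio features -> 8 blendshapes).
import numpy as np

rng = np.random.default_rng(0)

# Paired training data: N frames of audio features and captured motion.
N, AUDIO_DIM, MOTION_DIM = 1000, 13, 8
audio = rng.normal(size=(N, AUDIO_DIM))
true_map = rng.normal(size=(AUDIO_DIM, MOTION_DIM))
motion = audio @ true_map + 0.01 * rng.normal(size=(N, MOTION_DIM))

# "Discover correlations" between the two channels: here, solve for the
# weight matrix W minimizing ||audio @ W - motion||.
W, *_ = np.linalg.lstsq(audio, motion, rcond=None)

# Given new audio, predict the corresponding motion output.
new_audio = rng.normal(size=(5, AUDIO_DIM))
predicted_motion = new_audio @ W
print(predicted_motion.shape)  # (5, 8): one motion vector per frame
```

Whatever the model class, the output is an imitation of the captured motion channel, which is why such systems inherit motion capture's noise, occlusion, and puppet-like artifacts.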
What happens when you combine data and scientific models?
Speech Graphics has developed a different generative approach, based on data, but also on scientific models of human motor behavior and speech. For more than 20 years we’ve researched and created models in biomechanics, motor planning, speech production, emotional expression, and other areas.
Our animation engine is grounded in physical simulations. And at its core is a unique model of speech production and general task-oriented behavior based on muscle dynamics. By generating motion within this physical system, our output has a consistent quality that looks natural, alive, and ingrained. And because it operates at the level of our common human musculature, it works for any language. Yes, any language.
It also works for any facial rig. Using our character setup method we can make any face model conform to our muscle system and be driven by our technology. The mapping of movement to the character is a critical step. We could generate the most realistic dynamics in the world, but if they aren't visualized well on the character, with appropriate looking deformations, it all falls apart. Unlike motion capture, though, our approach does not force a particular movement pattern onto the character. Since our system is generative there is more room for internal control and interpretation. Through our proprietary character setup process we tailor the movement patterns of the character to suit its own idiosyncrasies. This helps the animation look like it's the character's own movement, not puppetry.
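The retargeting idea above can be sketched in miniature: the engine outputs activations for a shared, character-independent muscle set, and a per-character setup maps them onto that rig's own controls, with character-specific gains so the motion suits the face's idiosyncrasies. All muscle and control names below are invented for illustration; they are not the actual Speech Graphics character setup format.

```python
# Hypothetical retargeting step: shared muscle activations in, one
# particular character's rig control values out.

def retarget(muscle_activations, character_map):
    """Map shared muscle activations (0..1) onto rig control values."""
    controls = {}
    for muscle, activation in muscle_activations.items():
        for control, gain in character_map.get(muscle, []):
            # Accumulate contributions and clamp to the rig's 0..1 range.
            value = controls.get(control, 0.0) + gain * activation
            controls[control] = min(1.0, value)
    return controls

# Engine output for one frame, in the shared muscle space.
frame = {"jaw_open": 0.6, "lip_corner_pull": 0.3}

# Per-character mapping, tuned during character setup. A stylized
# cartoon face might exaggerate the jaw, for example.
cartoon_hero = {
    "jaw_open": [("JawOpen_ctrl", 1.4)],
    "lip_corner_pull": [("Smile_L_ctrl", 0.9), ("Smile_R_ctrl", 0.9)],
}

print(retarget(frame, cartoon_hero))
```

The point of the per-character map is exactly the one made above: the same generated dynamics can be interpreted differently on each rig, so the result reads as the character's own movement rather than a copied performance.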
Our animation targeting has to work across a wide range of character styles, from photorealistic to cartoon-like. We are agnostic about these details, and emphasize the importance of movement over static picture realism. We find that natural movement evokes empathy. Unlike deep fakes, the goal of Speech Graphics is not to fool someone into thinking they're looking at reality. The goal is to make them feel like they're looking at something that is alive.
Our animation engine is driven by audio input. Our specialized algorithms discover acoustic events in the voice, and trace those back to appropriate kinds of movements in the mouth and face. To support real-time conversations, this happens very quickly. There is a processing latency of only 50 milliseconds between input audio and output animation, which is faster than the blink of an eye. And this goes beyond lip sync. Speech contains energy and emotion, and these too can be decoded from the voice and synchronized in the face. Using all available acoustic information, our algorithms drive not just the mouth, but the entire head, including emotional expressions, blinks, eye movements, breaths, and movements of the head and torso, which are correlated with speech.
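To make the latency point concrete, here is an illustrative sketch (not the actual engine) of why windowed processing enables real-time conversation: each small window of input yields an animation frame as soon as that window arrives, rather than waiting for the whole utterance. The window length, sample rate, and the toy energy-to-jaw mapping are all assumptions for the example.

```python
# Sketch of a streaming audio -> animation loop under a latency budget.

def stream_animation(samples, sample_rate=16000, window_ms=50):
    """Yield one animation frame per audio window (placeholder analysis)."""
    hop = sample_rate * window_ms // 1000          # samples per window
    for start in range(0, len(samples) - hop + 1, hop):
        window = samples[start:start + hop]
        # Stand-in for real acoustic analysis: short-time energy.
        energy = sum(x * x for x in window) / hop
        yield {
            "time_ms": start * 1000 // sample_rate,
            "jaw_open": min(1.0, energy),          # toy mapping to one control
        }

# One second of audio produces 20 frames, spaced 50 ms apart.
frames = list(stream_animation([0.0] * 16000))
print(len(frames), frames[0]["time_ms"], frames[1]["time_ms"])
```

Because each frame depends only on audio already received, output can trail input by roughly one window, which is how a tight per-window budget keeps the face in lockstep with live speech.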
This is a kind of inverse inference: given some speech sound, what was the body doing when it produced that sound? Over the years we've developed a stack of algorithms solving different aspects of this problem. We use data and deep learning to inform our algorithms with the amazing level of detail deep neural networks can provide. Our database consists of thousands of hours of annotated speech in many languages, exemplifying many styles of vocal behavior, from whispering to shouting, from laughter to crying, from breathing to singing, all carefully curated by our specialist data engineers.
Our speech analysis algorithms produce various kinds of metadata. This metadata is consumed by our generative models to drive muscle activations, and thus facial animation. But it also has value in itself, because it's a rich source of information about a speaker's physical and emotional state over time. So we use this metadata as a perceptual faculty that enables an artificial intelligence to sense more than just the words in a user's voice. In effect, it provides a form of emotional intelligence that most AI systems lack.
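As a rough illustration of the dual-use idea, one can imagine per-window metadata records that feed the animation engine frame by frame, while a second consumer aggregates them into a coarse estimate of the speaker's state. The field names and the toy aggregation rule below are invented for the sketch; the real metadata is richer than this.

```python
# Hypothetical per-window speech metadata, with two consumers: the
# animation engine (per-frame) and an AI sensing the speaker's state.
from dataclasses import dataclass

@dataclass
class SpeechMetadata:
    time_ms: int        # position in the audio stream
    loudness: float     # 0..1 vocal energy
    arousal: float      # -1..1 estimated emotional intensity
    valence: float      # -1..1 estimated positive/negative affect
    is_speech: bool     # speech vs. breath, laughter, silence

def summarize_emotion(frames):
    """Second consumer: aggregate metadata into a coarse speaker state."""
    voiced = [f for f in frames if f.is_speech]
    if not voiced:
        return "silent"
    mean_arousal = sum(f.arousal for f in voiced) / len(voiced)
    return "agitated" if mean_arousal > 0.5 else "calm"

frames = [
    SpeechMetadata(0, 0.8, 0.7, -0.2, True),
    SpeechMetadata(50, 0.9, 0.8, -0.3, True),
    SpeechMetadata(100, 0.1, 0.0, 0.0, False),
]
print(summarize_emotion(frames))  # mean arousal 0.75 -> "agitated"
```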
A holistic approach to hard problems.
We developed these layers of technology through interdisciplinary work. I am proud of our pursuit of hard problems without regard for boundaries between fields. In the growing avatar industry, most people approach the problem of facial animation from one area of expertise. Machine learning specialists tend to look at this as a data problem and ignore the science. Speech technologists might have a good grasp of speech, but tend to show little interest in computer graphics or the importance of deploying movement on a character. In the graphics community, practitioners specialize in realistic 3D models, but tend to underestimate the challenge of facial behavior and speech.
At Speech Graphics, we recognize the equal importance of all of these aspects and have taken a holistic approach. Alongside our engineers are experts in linguistics, speech technology, data science, machine learning, biomechanics, psychology, and computer graphics.
Through Rapport, we can now share this technology with a much wider audience (including you!) and make it very easy to use and integrate into applications. Rapport provides the cloud infrastructure, network, and front-end integration to fully support this new, natural form of communication with avatars. By humanizing machines, we increase our understanding of what it means to be human. Ready to try it out?