Earlier this summer, I walked through the glassy lobby of a fancy office in London, into an elevator, and then along a corridor into a clean, carpeted room. Natural light flooded in through its windows, and a large pair of umbrella-like lighting rigs made the room even brighter. I tried not to squint as I took my place in front of a tripod equipped with a large camera and a laptop displaying an autocue. I took a deep breath and started to read out the script.
I’m not a newsreader or an actor auditioning for a movie—I was visiting the AI company Synthesia to give it what it needed to create a hyperrealistic AI-generated avatar of me. The company’s avatars are a decent barometer of just how dizzying progress has been in AI over the past few years, so I was curious just how accurately its latest AI model, introduced last month, could replicate me.
When Synthesia launched in 2017, its primary purpose was to match AI versions of real human faces—for example, the former footballer David Beckham—with dubbed voices speaking in different languages. A few years later, in 2020, it started giving the companies that signed up for its services the opportunity to make professional-level presentation videos starring either AI versions of staff members or consenting actors. But the technology wasn’t perfect. The avatars’ body movements could be jerky and unnatural, their accents sometimes slipped, and the emotions indicated by their voices didn’t always match their facial expressions.
Now Synthesia’s avatars have been updated with more natural mannerisms and movements, as well as expressive voices that better preserve the speaker’s accent—making them appear more humanlike than ever before. For Synthesia’s corporate clients, these avatars will make for slicker presenters of financial results, internal communications, or staff training videos.
I found the video demo of my avatar as unnerving as it is technically impressive. It’s slick enough to pass as a high-definition recording of a chirpy corporate speech, and if you didn’t know me, you’d probably think that’s exactly what it was. This demonstration shows how much harder it’s becoming to distinguish the artificial from the real. And before long, these avatars will even be able to talk back to us. But how much better can they get? And what might interacting with AI clones do to us?
The creation process
When my former colleague Melissa visited Synthesia’s London studio to create an avatar of herself last year, she had to go through a long process of calibrating the system, reading out a script in different emotional states, and mouthing the sounds needed to help her avatar form vowels and consonants. As I stand in the brightly lit room 15 months later, I’m relieved to hear that the creation process has been significantly streamlined. Josh Baker-Mendoza, Synthesia’s technical supervisor, encourages me to gesture and move my hands as I would during natural conversation, while simultaneously warning me not to move too much. I duly repeat an overly glowing script that’s designed to encourage me to speak emotively and enthusiastically. The result is a bit as if Steve Jobs had been resurrected as a blond British woman with a low, monotonous voice.
It also has the unfortunate effect of making me sound like an employee of Synthesia. “I am so thrilled to be with you today to show off what we’ve been working on. We are on the edge of innovation, and the possibilities are endless,” I parrot eagerly, trying to sound lively rather than manic. “So get ready to be part of something that will make you go, ‘Wow!’ This opportunity isn’t just big—it’s monumental.”
Just an hour later, the team has all the footage it needs. A couple of weeks later I receive two avatars of myself: one powered by the previous Express-1 model and the other made with the latest Express-2 technology. The latter, Synthesia claims, makes its synthetic humans more lifelike and true to the people they’re modeled on, complete with more expressive hand gestures, facial movements, and speech. You can see the results for yourself below.
Last year, Melissa found that her Express-1-powered avatar failed to match her transatlantic accent. Its range of emotions was also limited—when she asked her avatar to read a script angrily, it sounded more whiny than furious. In the months since, Synthesia has improved Express-1, but the version of my avatar made with the same technology blinks furiously and still struggles to synchronize body movements with speech.
By way of contrast, I’m struck by just how much my new Express-2 avatar looks like me: Its facial features mirror my own perfectly. Its voice is spookily accurate too, and although it gesticulates more than I do, its hand movements generally marry up with what I’m saying.
But the tiny telltale signs of AI generation are still there if you know where to look. The palms of my hands are bright pink and as smooth as putty. Strands of hair hang stiffly around my shoulders instead of moving with me. Its eyes stare glassily ahead, rarely blinking. And although the voice is unmistakably mine, there’s something slightly off about my digital clone’s intonations and speech patterns. “This is great!” my avatar randomly declares, before slipping back into a saner register.
Anna Eiserbeck, a postdoctoral psychology researcher at the Humboldt University of Berlin who has studied how humans react to perceived deepfake faces, says she isn’t sure she’d have been able to identify my avatar as a deepfake at first glance.
But she would eventually have noticed something amiss. It’s not just the small details that give it away—my oddly static earring, the way my body sometimes moves in small, abrupt jerks. It’s something that runs much deeper, she explains.
“Something seemed a bit empty. I know there’s no actual emotion behind it—it’s not a conscious being. It does not feel anything,” she says. Watching the video gave her “this kind of uncanny feeling.”
My digital clone, and Eiserbeck’s reaction to it, make me wonder how realistic these avatars really need to be.
I realize that part of the reason I feel disconcerted by my avatar is that it behaves in a way I rarely have to. Its oddly upbeat register is completely at odds with how I normally speak; I’m a die-hard cynical Brit who finds it difficult to inject enthusiasm into my voice even when I’m genuinely thrilled or excited. It’s just the way I am. Plus, watching the videos on a loop makes me question if I really do wave my hands about that way, or move my mouth in such a weird manner. If you thought being confronted with your own face on a Zoom call was humbling, wait until you’re staring at a whole avatar of yourself.
When Facebook was first taking off in the UK almost 20 years ago, my friends and I thought illicitly logging into each other’s accounts and posting the most outrageous or rage-inducing status updates imaginable was the height of comedy. I wonder if the equivalent will soon be getting someone else’s avatar to say something truly embarrassing: expressing support for a disgraced politician or (in my case) admitting to liking Ed Sheeran’s music.
Express-2 remodels every person it’s presented with into a polished professional speaker with the body language of a hyperactive hype man. And while this makes perfect sense for a company focused on making glossy business videos, watching my avatar doesn’t feel like watching me at all. It feels like something else entirely.
How it works
The real technical challenge these days has less to do with creating avatars that match our appearance than with getting them to replicate our behavior, says Björn Schuller, a professor of artificial intelligence at Imperial College London. “There’s a lot to consider to get right; you have to have the right micro gesture, the right intonation, the sound of voice and the right word,” he says. “I don’t want an AI [avatar] to frown at the wrong moment—that could send an entirely different message.”
To achieve an improved level of realism, Synthesia developed a number of new audio and video AI models. The team created a voice cloning model to preserve the human speaker’s accent, intonation, and expressiveness—unlike other voice models, which can flatten speakers’ distinctive accents into generically American-sounding voices.
When a user uploads a script to Express-1, its system analyzes the words to infer the correct tone to use. That information is then fed into a diffusion model, which renders the avatar’s facial expressions and movements to match the speech.
Alongside the voice model, Express-2 uses three other models to create and animate the avatars. The first generates an avatar’s gestures to accompany the speech fed into it by the Express-Voice model. A second evaluates how closely the input audio aligns with the multiple versions of the corresponding generated motion before selecting the best one. Then a final model renders the avatar with that chosen motion.
This third rendering model is significantly more powerful than its Express-1 predecessor. Whereas the previous model had a few hundred million parameters, the Express-2 rendering model has billions. This means it takes less time to create the avatar, says Youssef Alami Mejjati, Synthesia’s head of research and development:
“With Express-1, it needed to first see someone expressing emotions to be able to render them. Now, because we’ve trained it on much more diverse data and much larger data sets, with much more compute, it just learns these associations automatically without needing to see them.”
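Synthesia hasn’t published the internals of Express-2, but the pipeline it describes (clone the voice, propose several candidate motions, score how well each one lines up with the audio, then render the winner) can be sketched in rough, illustrative code. Every name in the sketch below is hypothetical, invented purely to show the generate, score, select, and render steps; it is not Synthesia’s actual code or API.

```python
# Illustrative sketch only: Synthesia has not published Express-2's internals.
# All names here (CandidateMotion, generate_gestures, score_alignment,
# render_avatar) are hypothetical placeholders for the three models the
# company describes: motion generation, audio-motion alignment scoring,
# and final rendering.

from dataclasses import dataclass


@dataclass
class CandidateMotion:
    """One possible sequence of gestures and facial motion for a speech clip."""
    frames: list                  # placeholder for per-frame pose/expression data
    alignment_score: float = 0.0  # filled in by the scoring step


def generate_gestures(cloned_speech: bytes, n_candidates: int = 4) -> list[CandidateMotion]:
    """Model 1 (hypothetical): propose several motion sequences for the audio."""
    return [CandidateMotion(frames=[]) for _ in range(n_candidates)]


def score_alignment(cloned_speech: bytes, motion: CandidateMotion) -> float:
    """Model 2 (hypothetical): rate how closely the motion matches the speech."""
    return 0.0  # a real system would return a learned similarity score


def render_avatar(identity: str, motion: CandidateMotion) -> bytes:
    """Model 3 (hypothetical): render the final video using the chosen motion."""
    return b""  # placeholder for encoded video frames


def synthesize_clip(identity: str, cloned_speech: bytes) -> bytes:
    """End-to-end flow as described: generate, score, select the best, render."""
    candidates = generate_gestures(cloned_speech)
    for candidate in candidates:
        candidate.alignment_score = score_alignment(cloned_speech, candidate)
    best = max(candidates, key=lambda c: c.alignment_score)
    return render_avatar(identity, best)
```

In a real system each of these placeholder functions would be a large neural network; the point of the sketch is simply that an alignment score decides which candidate motion gets rendered, rather than a single diffusion model doing everything at once.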
Narrowing the uncanny valley
Although humanlike AI-generated avatars have been around for years, the recent boom in generative AI is making it ever easier and more affordable to create lifelike synthetic humans—and they’re already being put to work. Synthesia isn’t alone: AI avatar companies like Yuzu Labs, Creatify, Arcdads, and Vidyard give businesses the tools to quickly generate and edit videos starring either AI actors or artificial versions of members of staff, promising cost-effective ways to make compelling ads that audiences connect with. Similarly, AI-generated clones of livestreamers have exploded in popularity across China in recent years, partly because they can sell products 24/7 without getting tired or needing to be paid.
For now at least, Synthesia is “laser focused” on the corporate sphere. But it’s not ruling out expanding into new sectors such as entertainment or education, says Peter Hill, the company’s chief technical officer. In an apparent step toward this, Synthesia recently partnered with Google to integrate Google’s powerful new generative video model Veo 3 into its platform, allowing users to directly generate and embed clips into Synthesia’s videos. It suggests that in the future, these hyperrealistic artificial humans could take up starring roles in detailed universes with ever-changeable backdrops.
At present this could, for example, involve using Veo 3 to generate a video of meat-processing machinery, with a Synthesia avatar next to the machines talking about how to use them safely. But future versions of Synthesia’s technology could result in educational videos customizable to an individual’s level of knowledge, says Alex Voica, head of corporate affairs and policy at Synthesia. For example, a video about the evolution of life on Earth could be tweaked for someone with a biology degree or someone with high-school-level knowledge. “It’s going to be such a much more engaging and personalized way of delivering content that I’m really excited about,” he says.
The next frontier, according to Synthesia, will be avatars that can talk back, “understanding” conversations with users and responding in real time. Think ChatGPT, but with a lifelike digital human attached.
Synthesia has already added an interactive element by letting users click through on-screen questions during quizzes presented by its avatars. But it’s also exploring making them truly interactive: Future users could ask their avatar to pause and expand on a point, or ask it a question. “We really want to make the best learning experience, and that means through video that’s entertaining but also personalized and interactive,” says Alami Mejjati. “This, for me, is the missing part in online learning experiences today. And I know we’re very close to solving that.”
We already know that humans can—and do—form deep emotional bonds with AI systems, even with basic text-based chatbots. Combining agentic technology—which is already capable of navigating the web, coding, and playing video games unsupervised—with a realistic human face could usher in a whole new kind of AI addiction, says Pat Pataranutaporn, an assistant professor at the MIT Media Lab.
“If you make the system too realistic, people might start forming certain kinds of relationships with these characters,” he says. “We’ve seen many cases where AI companions have influenced dangerous behavior even when they are basically texting. If an avatar had a talking head, it would be even more addictive.”
Schuller agrees that avatars in the near future will be perfectly optimized to adjust their projected levels of emotion and charisma so that their human audiences will stay engaged for as long as possible. “It will be very hard [for humans] to compete with charismatic AI of the future; it’s always present, always has an ear for you, and is always understanding,” he says. “AI will change that human-to-human connection.”
As I pause and replay my Express-2 avatar, I imagine holding conversations with it—this uncanny, permanently upbeat, perpetually available product of pixels and algorithms that looks like me and sounds like me, but fundamentally isn’t me. Virtual Rhiannon has never laughed until she’s cried, or fallen in love, or run a marathon, or watched the sun set in another country.
But, I concede, she could deliver a damned good presentation about why Ed Sheeran is the greatest musician ever to come out of the UK. And only my closest friends and family would know that it’s not the real me.