Facebook's AI clone of Bill Gates sounds like the real deal

By Matthew Griffin Intelligence and the Senses 30th August 2019

WHY THIS MATTERS IN BRIEF

As companies get better at creating AIs that can synthesise people’s voices, it creates new opportunities, and new problems.

We are seeing more and more Synthetic Content being created, including DeepFakes and DeepFake videos that are used to create fake news and amusing memes, as well as more advanced Artificial Intelligence (AI) based systems that can create video from nothing more than plain text and then edit existing content, again using nothing more than plain text instructions. Now another crucial part of the jigsaw puzzle is falling into place: the ability to create realistic Synthetic Audio to go along with it.

A little while ago I talked about Lyrebird, an AI company in the US that can make realistic clones of people’s voices by just listening to a minute of audio. But it still sounded synthetic. And then came Google DeepMind’s WaveNet and Duplex projects, whose synthetic voices became the first to fool people into thinking they were real when the company used the technology to make phone calls and book appointments on stage. And now Bill Gates has gotten similar treatment, as you can hear below for yourself… and it’s good.

As impressive as Google’s and Lyrebird’s original work was, though, neither of these advances could clone the real voice of a real person, inflections and all, with a level of accuracy that could fool people into thinking that it was the real person talking. But now that’s changed after Facebook engineers managed to create an AI that clones Microsoft founder Bill Gates’ voice with uncanny accuracy.

In the clips embedded above, you can listen to what seems to be Gates reeling off a series of innocuous phrases. “A cramp is no small danger on a swim,” he cautions. “Write a fond note to the friend you cherish,” he advises. But each voice clip has been generated by a machine learning system named MelNet, designed and created by engineers at Facebook.

In fact, Gates is just the best known of the handful of individuals MelNet can mimic. Others, including George Takei, Jane Goodall, and Stephen Hawking, can be heard here under the heading “Selected Speakers.”

I know, uncanny, right? Now wrap this tech into DeepFakes and synthetic videos and all hell gets unleashed, memes and all.

Now you may be wondering why the researchers chose to replicate such a sciency bunch of speakers. Well, the simple answer is that one of the resources used to train MelNet was a 452-hour dataset of TED talks. The rest of the training data came from audiobooks, chosen because the “highly animated manner” of the speakers makes for a challenging target.

Now, these audio samples are undeniably impressive, but MelNet isn’t exactly a bolt from the blue. The quality of voice clones has been steadily improving in recent years, with a recent replica of podcaster Joe Rogan demonstrating exactly how far we’ve come. Much of this progress dates back to 2016 and the aforementioned unveiling of DeepMind’s WaveNet, which now powers the Google Assistant.

The basic approach with WaveNet, and similar programs such as SampleRNN, is to feed the AI system a ton of data and use that to analyse the nuances in a human voice. But while WaveNet and others were trained using audio waveforms, Facebook’s MelNet uses a richer and more informationally dense format to learn to speak: the spectrogram.

In an accompanying paper, Facebook’s researchers note that while WaveNet produces higher fidelity audio output, MelNet is superior at capturing “high-level structure” – the subtle consistencies contained in a speaker’s voice that are, ironically, almost impossible to describe in words, but to which the human ear is finely attuned.

They say that this is because the data captured in a spectrogram is “orders of magnitude more compact” than that found in audio waveforms. This density allows the algorithms to produce more consistent voices, rather than being distracted by and homing in on the extreme detail of a waveform recording.
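To make the waveform versus spectrogram distinction a little more concrete, here is a minimal sketch in plain Python. It is not MelNet's code and it is not how Facebook built their system. It is a toy short-time Fourier transform, and the function names, the 8 kHz sample rate, and the frame and hop sizes are all assumptions picked for illustration. It also produces an ordinary linear-frequency spectrogram rather than the mel-scaled kind the name MelNet suggests. One plausible reading of the "more compact" claim is that a model has far fewer time steps to track, and that is the only thing this sketch shows.

```python
import cmath
import math


def sine_wave(freq_hz, sample_rate, n_samples):
    """Generate a pure tone as a list of float samples (a toy 'waveform')."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def magnitude_spectrogram(samples, frame_size, hop):
    """Naive short-time Fourier transform.

    Slides a Hann-windowed frame over the samples and returns one
    magnitude spectrum (frame_size // 2 + 1 bins) per frame.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            # Direct DFT sum for bin k (slow, but dependency free).
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


# Illustrative values only: a 1 kHz tone sampled at 8 kHz.
SAMPLE_RATE = 8000
waveform = sine_wave(1000, SAMPLE_RATE, 2048)
spec = magnitude_spectrogram(waveform, frame_size=64, hop=32)

# Strongest frequency bin in the first frame, converted back to Hz.
peak_bin = max(range(len(spec[0])), key=spec[0].__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / 64

print("waveform time steps:   ", len(waveform))
print("spectrogram time steps:", len(spec))
print("frequency bins/frame:  ", len(spec[0]))
print("peak frequency (Hz):   ", peak_hz)
```

In this toy setup the time axis shrinks from 2,048 samples to 63 frames, and each frame describes which frequencies are present rather than the raw wiggle of the signal. The total number of values does not shrink here, so treat it as an illustration of the shorter time axis only, not of the paper's exact figures.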

There are limitations, though. The most important is that the model can’t replicate how a human voice will change over longer periods of time like WaveNet can, such as building up drama or tension over a paragraph or page of text. Interestingly, this is similar to the constraints we’ve seen in AI text generation from companies like OpenAI, which captures so-called “surface level coherency” but not long-term structure.

These caveats aside, the results are astoundingly good. And, more impressively, MelNet is a multi-function system: it doesn’t just generate realistic voices, it can also be used to generate music, something that Google is also working on with Project Magenta, which is getting increasingly good at creating music.

As ever, there are benefits and dangers with this technology. The benefits? Higher quality AI assistants, such as Alexa and Google Home, and more realistic voice models for people with speech impairments and conditions such as Locked In Syndrome or ALS, who are now starting to use neural interfaces combined with these systems to allow them to talk. And then obviously there’s a range of uses in the entertainment and gaming industries as we begin to see the rise of automatically generated procedural content and games.

And there are dangers too, such as the crumbling trust in traditional forms of evidence, and the potential for audio harassment, scams, and generalised slander. All the fun of the AI fake fair, basically. Just pair it with this recent research that lets you edit what someone says in a video just by typing in new speech, as I mentioned above, and the possibilities are endless.
