World first as Microsoft’s speech recognition software becomes as accurate as humans

By Matthew Griffin, Intelligence and the Senses, 21st October 2016

WHY THIS MATTERS IN BRIEF

After almost forty years of research, speech recognition systems are now as good as humans.

Microsoft has announced a major breakthrough in speech recognition: a technology that, finally, recognises the words in conversational speech as well as humans do, or at least as well as professional human transcriptionists, who are better at it than most people.

In the report, the team from Microsoft’s Artificial Intelligence and Research Unit announced that their speech recognition system makes the same number of errors as professional transcriptionists, and in some cases fewer, and that the system’s word error rate (WER) is now just 5.9 percent, down from the 6.3 percent WER the team reported just last month.


The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the imaginatively titled industry standard “Switchboard Speech Recognition Task”.
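For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn a system’s transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition in Python. It is not Microsoft’s evaluation code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper only; real benchmarks also normalise punctuation,
    spelling variants and hesitations before scoring.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substituted word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(f"WER: {wer:.1%}")
```

On this scale, a WER of 5.9 percent means roughly 59 word errors for every 1,000 words in the reference transcript.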

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist, “and this is an historic achievement.”

And it is. To put it into context, the announcement means that, for the first time ever, a machine is as good as a human at recognising the words spoken in a fluid conversation, and by reaching this milestone the team has beaten a goal it set less than a year ago. It also shows just how fast the company’s speech recognition technology is progressing. That technology is based on Microsoft’s Computational Network Toolkit (CNTK), a homegrown deep learning system that the research team has posted on GitHub under an open source license.

Huang said CNTK’s ability to quickly process deep learning algorithms across multiple computers running a specialized chip called a Graphics Processing Unit (GPU) helped to vastly improve the speed at which they were able to do their research and, ultimately, reach parity.

“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, the executive vice president who heads the Microsoft Artificial Intelligence and Research group.

The research milestone comes after decades of work in speech recognition, beginning in the early 1970s with DARPA, the US agency tasked with making technology breakthroughs in the interest of national security. Over the decades, more and more technology companies and research organizations have joined the pursuit.


“This accomplishment is the culmination of over twenty years of effort,” said Geoffrey Zweig, who manages Microsoft’s Speech & Dialog Research Group.

The announcement has broad implications for the consumer and business worlds, which can now use the technology to augment their products and apps with state of the art speech recognition. That includes consumer entertainment devices like the Xbox, accessibility tools such as instant speech-to-text transcription, and personal digital assistants such as Cortana.

“This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum said, “and it’s a dream come true for me.”

Moving forward, Zweig said the researchers are working on ways to make sure that speech recognition works well in more real-life settings. That includes places where there is a lot of background noise, such as at a party or while driving on the highway. They’ll also focus on better ways to help the technology assign names to individual speakers when multiple people are talking, and on making sure that it works well with a wide variety of voices, regardless of age, accent or ability.

In the longer term, researchers will focus on ways to teach computers not just to transcribe the acoustic signals that come out of people’s mouths, but also to understand the words they are saying. That would give the technology the ability to answer questions or take action based on what it is told.

“The next frontier is to move from recognition to understanding,” Zweig said.

Shum noted that we are moving away from a world where people must understand computers to a world in which computers must understand us. Still, he cautioned, true artificial intelligence remains on the distant horizon.

“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” he said.

Matthew Griffin / About Author

Matthew Griffin, multi-award winning Futurist and named Futurist of the Year 2024, has been described as a "Walking encyclopaedia of the future" by NASA and a futurist polymath. One of the world's most renowned futurists and strategic foresight experts, Matthew is the 15 times author of the blockbuster "Codex of the Future" series, and is the Founder and Futurist in Chief of the 311 Institute, a global Futures and Deep Futures advisory firm working across the next 50 years, XPotential University, the world's first free futures and foresight university, and the World Futures Forum, which works with the United Nations to solve the world's greatest challenges. Matthew is an in-demand international keynote speaker, acclaimed university lecturer and mentor, and host of the hit Fanatical Futurist podcast.

A rare talent, Matthew previously helped build and run several multi-billion dollar business units for Atos, Dell-EMC, and IBM. His ability to identify, track, and explain the impacts of hundreds of emerging technologies and trends on global business, culture, and society has earned him a powerful reputation and a roster of clients that includes royal households, world leaders, G7, G20, and G77+ governments, and many of the world's most respected brands including ABB, Accenture, Adidas, AON, ARM, BCG, Centrica, Citi Group, Coca Cola, Dentons, Deloitte, Disney, Dow, EY, KPMG, Lego, Legal & General, LinkedIn, Microsoft, PepsiCo, Qualcomm, RWE, Samsung, T-Mobile, UBS, VISA, and many others. He was also the only futurist invited to talk at the UN COP28 held in Dubai alongside world leaders.

Regularly featured in the global media including the AP, BBC, Bloomberg, CNBC, Discovery, Forbes, Khaleej Times, Telegraph, TIME, ViacomCBS, WIRED, and the WSJ, Matthew's mission is to help organisations create a fair and sustainable future whose benefits are shared by everyone irrespective of their ability, background, or circumstances.
