
World first as Microsoft's speech recognition software becomes as accurate as humans

WHY THIS MATTERS IN BRIEF

After more than forty years of research, speech recognition systems are now as good as humans.

Microsoft has announced a major breakthrough in speech recognition: a technology that, finally, recognises the words in conversational speech as well as humans do – or at least as well as professional human transcriptionists, which is better than most humans.

In the report, the team from Microsoft's Artificial Intelligence and Research unit announced that their speech recognition system makes the same number of – and in some cases fewer – errors than professional transcriptionists, and that the system's word error rate (WER) is now just 5.9 percent, down from the 6.3 percent WER the team reported just last month.
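For readers unfamiliar with the metric, WER is conventionally defined as the number of word substitutions, deletions and insertions needed to turn the recognizer's output into a reference transcript, divided by the number of words in the reference. The sketch below is a minimal, generic illustration of that definition in Python – not Microsoft's scoring code:

```python
# Word error rate (WER): minimum number of word-level substitutions,
# deletions and insertions (Levenshtein distance over words) needed to
# turn the hypothesis into the reference, normalized by reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = min edits to turn the first i reference words
    # into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + cost,   # match / substitution
                           dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1)          # insertion
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution ("the" -> "a") in a six-word reference:
print(round(wer("the cat sat on the mat", "the cat sat on a mat"), 3))  # 0.167
```

On this scale, Microsoft's reported 5.9 percent WER means roughly one word in seventeen was transcribed incorrectly.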

The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the imaginatively titled industry standard “Switchboard Speech Recognition Task”.

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist, “and this is an historic achievement.”

And it is. To put it into context, the announcement means that, for the first time ever, a machine is as good at recognising the words spoken in a fluid conversation as a human is. By achieving this latest milestone the team has beaten a goal it set less than a year ago. The announcement also shows just how fast the company's speech recognition technology is progressing. It is based on Microsoft's Computational Network Toolkit (CNTK), a homegrown system for deep learning that the research team has since posted on GitHub under an open source license.

Huang said CNTK's ability to quickly process deep learning algorithms across multiple computers running specialized chips called Graphics Processing Units (GPUs) vastly improved the speed at which the team could do its research and, ultimately, reach parity.

“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, the executive vice president who heads the Microsoft Artificial Intelligence and Research group.

The research milestone comes after decades of research in speech recognition, beginning more than forty years ago in the early 1970s with DARPA, the US agency tasked with making technology breakthroughs in the interest of national security. Over the decades, more and more technology companies and research organizations have joined the pursuit.

“This accomplishment is the culmination of over twenty years of effort,” said Geoffrey Zweig, who manages Microsoft's Speech & Dialog Research Group.

The announcement has broad implications for the consumer and business worlds, which can now use the technology to augment their products and apps with state-of-the-art speech recognition. That includes consumer entertainment devices like the Xbox, accessibility tools such as instant speech-to-text transcription, and personal digital assistants such as Cortana.

“This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum said, “and it’s a dream come true for me.”

Moving forward, Zweig said the researchers are working on ways to make sure that speech recognition works well in more real-life settings. That includes places where there is a lot of background noise, such as at a party or while driving on the highway. They’ll also focus on better ways to help the technology assign names to individual speakers when multiple people are talking, and on making sure that it works well with a wide variety of voices, regardless of age, accent or ability.

In the longer term, researchers will focus on ways to teach computers not just to transcribe the acoustic signals that come out of people's mouths, but to understand the words they are saying. That would give the technology the ability to answer questions or take action based on what it is told.

“The next frontier is to move from recognition to understanding,” Zweig said.

Shum noted that we are moving away from a world where people must understand computers to a world in which computers must understand us. Still, he cautioned, true artificial intelligence remains on the distant horizon.

“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” he said.
