WHY THIS MATTERS IN BRIEF
After almost forty years of research, speech recognition systems are now as good as humans.
Microsoft have announced that they have made a major breakthrough in speech recognition and created a technology that, finally, recognises the words in conversational speech as well as humans do, or at least as well as professional human transcriptionists, who are better at it than most humans.
In the report, the team from Microsoft’s Artificial Intelligence and Research Unit announced that their speech recognition system makes as many errors as professional transcriptionists, and in some cases fewer, and that the system’s word error rate (WER) is now just 5.9 percent, down from the 6.3 percent WER the team reported just last month.
The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the imaginatively titled industry standard “Switchboard Speech Recognition Task”.
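For readers unfamiliar with the metric, word error rate is conventionally computed as the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below illustrates that standard definition in Python. The function name and the example sentences are invented for illustration, and this is not the scoring code the Microsoft team used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Illustrative sketch only: splits on whitespace and compares words exactly,
    with no text normalisation (casing, punctuation) of the kind real
    benchmark scoring tools typically apply.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the word-level edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical example: one wrong word out of four gives a WER of 25 percent.
example = word_error_rate("call me back tomorrow", "call me back tonight")
print(f"{example:.0%}")
```

On this measure, a WER of 5.9 percent corresponds to roughly six word errors for every hundred words in the reference transcript.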
“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist, “and this is an historic achievement.”
And it is. To put it into context, the announcement means that, for the first time ever, a machine is as good as a human at recognising the words being spoken in a fluid conversation. By achieving this latest milestone, the team has beaten a goal they set less than a year ago. The announcement also goes to show just how fast the company’s speech recognition technology is progressing. That technology is based on Microsoft’s Computational Network Toolkit (CNTK), a homegrown system for deep learning that the research team has since posted on GitHub under an open source license.
Huang said CNTK’s ability to quickly process deep learning algorithms across multiple computers running a specialized chip called a Graphics Processing Unit (GPU) helped to vastly improve the speed at which they were able to do their research and, ultimately, reach parity.
“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” said Harry Shum, the executive vice president who heads the Microsoft Artificial Intelligence and Research group.
The research milestone comes after decades of research in speech recognition, beginning almost forty years ago in the early 1970s with DARPA, the US agency tasked with making technology breakthroughs in the interest of national security. Over the decades, more and more technology companies and research organizations have joined in the pursuit.
“This accomplishment is the culmination of over twenty years of effort,” said Geoffrey Zweig, who manages Microsoft’s Speech & Dialog Research Group.
The announcement has broad implications for the consumer and business worlds, which can now use the technology to augment their products and apps with state-of-the-art speech recognition. That includes consumer entertainment devices like the Xbox, accessibility tools such as instant speech-to-text transcription, and personal digital assistants such as Cortana.
“This will make Cortana more powerful, making a truly intelligent assistant possible,” Shum said, “and it’s a dream come true for me.”
Moving forward, Zweig said the researchers are working on ways to make sure that speech recognition works well in more real-life settings. That includes places where there is a lot of background noise, such as at a party or while driving on the highway. They’ll also focus on better ways to help the technology assign names to individual speakers when multiple people are talking, and on making sure that it works well with a wide variety of voices, regardless of age, accent or ability.
In the longer term, researchers will focus on ways to teach computers not just to transcribe the acoustic signals that come out of people’s mouths, but also to understand the words they are saying. That would give the technology the ability to answer questions or take action based on what it is told.
“The next frontier is to move from recognition to understanding,” Zweig said.
Shum noted that we are moving away from a world where people must understand computers to a world in which computers must understand us. Still, he cautioned, true artificial intelligence remains on the distant horizon.
“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” he said.