WHY THIS MATTERS IN BRIEF
Machines and computer systems that can lip read accurately will help enhance the lives of people living with hearing impairments, but they will also help organisations and institutions invade people’s privacy and listen in on otherwise private conversations.
Even professional lip readers can figure out only 20% to 60% of what a person is saying, and the slight movements of a person’s lips at the speed of natural speech are immensely difficult to understand reliably, especially from a distance or if the lips are partially obscured. Today lip reading plays a vital role in helping people who are hearing impaired, or deaf, communicate with those around them. There are less noteworthy applications for it as well, such as eavesdropping on people’s conversations, or trying to work out just what it was that a celebrity said when the mic was turned off. As a result, anyone who can truly hold their hands up and say that they have a technology that can read lips will find a veritable queue of people beating a path to their door.
And now the University of Oxford, it appears, is that organisation. In a new paper, researchers at Oxford describe how an artificial intelligence system called LipNet, which is based on and partly funded by DeepMind, Google’s seemingly unstoppable AI division, can watch video of a person speaking and match text to the movement of their mouth with 93.4% accuracy.
The previous state-of-the-art system operated word by word and had an accuracy of 79.6%. The Oxford researchers say the success of their new system comes from thinking about the problem differently: instead of teaching the AI each mouth movement using a system of visual phonemes, they built it to process whole sentences at a time. That allowed the AI to teach itself which letter corresponds to each slight mouth movement.
To train the system, researchers showed the AI nearly 29,000 videos labelled with the correct text, each three seconds long. To see how human lip readers would handle the same task, the team recruited three members of the Oxford Students’ Disability Community and tested them on 300 random videos similar to those they fed their AI. Those humans had an average error rate of 47.7%, while the AI’s was just 6.6%.
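The article does not say exactly how the error rates above were scored. As a rough illustration only, and assuming a word-level error rate (a common way to score transcription, but an assumption here), a comparison between a correct sentence and a guess could be sketched like this. The function name `word_error_rate` is hypothetical and is not taken from the Oxford paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: errors are counted as word substitutions, insertions and
    deletions. This is an illustrative sketch, not the researchers' code.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference sentence must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of six in the article's example sentence.
    rate = word_error_rate("place blue in m 1 soon", "place blue at m 1 soon")
    print(f"error rate: {rate:.1%}")
```

On this measure an error rate of 6.6% and an accuracy of 93.4% are two ways of stating the same result, since the two figures add up to 100%.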
Despite its success, the project also reveals some of the limits of modern AI research. When teaching the AI how to read lips, the Oxford team used a carefully curated set of videos. Every person was facing forward, well lit, and spoke in a standardised sentence structure.
For example, “Place blue in m 1 soon” was one of the standard three-second phrases used in training. It consists of a command, colour, preposition, letter, number from 1 to 10, and an adverb, and every sentence follows that pattern. So the AI’s extraordinary accuracy might have to do with the fact that it was trained and tested in extraordinary conditions. If asked to read the lips of people in random YouTube videos, for instance, the results would probably be much less accurate, at least for now and until it manages to take another leap forwards.
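To show how narrow that sentence pattern is, here is a minimal sketch of the six-slot structure described above. Only “place”, “blue”, “in”, “m”, “1” and “soon” come from the article; every other word in the lists, the use of all 26 letters, and the names `SLOTS`, `generate_sentence` and `matches_pattern` are placeholders invented for illustration.

```python
import random
import string

# Six slots, in the order given in the article. Word lists are illustrative
# placeholders, apart from the words in the article's own example sentence.
SLOTS = [
    ("command", ["place", "set"]),
    ("colour", ["blue", "red"]),
    ("preposition", ["in", "at"]),
    ("letter", list(string.ascii_lowercase)),
    ("number", [str(n) for n in range(1, 11)]),  # 1 to 10, as in the article
    ("adverb", ["soon", "now"]),
]


def generate_sentence(rng: random.Random) -> str:
    """Pick one word per slot, in order, to build a training-style sentence."""
    return " ".join(rng.choice(words) for _, words in SLOTS)


def matches_pattern(sentence: str) -> bool:
    """Check that a sentence has exactly one allowed word in each slot."""
    words = sentence.lower().split()
    if len(words) != len(SLOTS):
        return False
    return all(word in allowed for word, (_, allowed) in zip(words, SLOTS))


if __name__ == "__main__":
    rng = random.Random(0)
    print(generate_sentence(rng))
    print(matches_pattern("Place blue in m 1 soon"))
    print(matches_pattern("What did that celebrity just say"))
```

A system that only ever sees sentences of this shape has a far easier job than one facing free-form speech, which is the limitation the critics point to.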
Some of the most interesting public discourse about AI papers happens afterwards on the vast expanse of Twitter. When other researchers pointed out that results obtained with such specialised training videos wouldn’t carry over to the real world, author Nando de Freitas defended the results of his paper, noting that other video sets the team tried were too noisy. The other videos they tried were each too different from the last for the AI to draw meaningful conclusions, meaning a perfect data set just doesn’t exist yet. De Freitas wrote that he was confident the AI had shown it would be up to the task given the correct data.
According to OpenAI’s Jack Clark, getting this to work in the real world will take three major improvements: a large amount of video of people speaking in real-world situations, getting the AI to read lips from multiple angles, and varying the kinds of phrases the AI can predict.
“The technology has such obvious utility, though, that it seems inevitable to be built,” said Clark. Teaching AI to read lips is a base skill that can be applied to countless situations. A similar system could be used to help the hearing-impaired understand conversations around them, or to augment other forms of AI that listen to video sound and rapidly generate accurate captions, which could then be used to search for specific phrases in videos. As I mentioned above, this is just the tip of a very big, silent iceberg.