MIT researchers have taught their AI to recognise sounds

By Matthew Griffin Intelligence and the Senses 5th December 2016

WHY THIS MATTERS IN BRIEF

Developing AIs that can recognise and understand raw sound could have implications for autonomous vehicles, elderly care, entertainment, home security and much more.

In recent years, computers have gotten remarkably good at recognizing speech and images. Think of the dictation software on most smartphones, or the algorithms that automatically identify images and people in photos posted to Google or Facebook.

But machine recognition of natural sounds – such as crowds cheering or waves crashing – has lagged behind. That’s because most automated recognition systems, whether they process audio or visual information, are the result of machine learning, in which computers search for patterns in huge compendia of training data. Usually, the training data first has to be annotated by hand – the equivalent of putting subtitles on your TV programs – which is prohibitively expensive and time-consuming for all but the highest-demand applications.

Sound recognition may be catching up, however, thanks to researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL). At the Neural Information Processing Systems conference next week, they will present a new sound recognition system that outperforms its predecessors but didn’t require hand-annotated data during training.

Instead, and for the first time, the researchers managed to train their system using only video. First, the team used existing computer vision systems that recognize scenes and objects to categorise the images in the video. Then the new system looked for correlations between those visual categories (for example, a forest scene) and the sounds within the video (for example, birdsong).

“Computer vision has gotten so good that we can transfer it to other domains,” says Carl Vondrick, an MIT graduate student and one of the paper’s two first authors. “We’re capitalizing on the natural synchronization between vision and sound. We scale up with tons of unlabelled video to learn to understand sound.”

The researchers tested their system on two standard databases of annotated sound recordings, and it was between 13 and 15 percent more accurate than the best-performing previous system. On a data set with 10 different sound categories, it could categorize sounds with 92 percent accuracy, and on a data set with 50 categories it performed with 74 percent accuracy. On those same data sets, humans are 96 percent and 81 percent accurate, respectively.
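To put those figures side by side, the short sketch below works out how far the system trails human listeners on each data set. It uses only the percentages quoted above; the names `RESULTS`, `gap_to_human` and `error_ratio` are illustrative and do not come from the researchers.

```python
# Accuracy figures quoted in the article, in percent,
# keyed by the number of sound categories in each data set.
RESULTS = {
    10: {"system": 92, "human": 96},
    50: {"system": 74, "human": 81},
}


def gap_to_human(entry):
    """Percentage points by which human accuracy exceeds the system's."""
    return entry["human"] - entry["system"]


def error_ratio(entry):
    """System error rate divided by human error rate."""
    return (100 - entry["system"]) / (100 - entry["human"])


for categories, entry in RESULTS.items():
    print(
        f"{categories} categories: "
        f"{gap_to_human(entry)} points behind humans, "
        f"{error_ratio(entry):.2f}x the human error rate"
    )
```

On the 10-category set the system makes twice as many mistakes as a person does (8 percent against 4 percent), while on the harder 50-category set the ratio is smaller because humans also struggle.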

“Even humans are ambiguous,” says Yusuf Aytar, the paper’s other first author.

“We did an experiment with Carl,” Aytar says. “Carl was looking at the computer monitor, and I couldn’t see it. He would play a recording and I would try to guess what it was. It turns out this is really, really hard. I could tell indoor from outdoor, basic guesses, but when it comes to the details – ‘Is it a restaurant?’ – those details are missing. Even for annotation purposes, the task is really hard.”

Because it takes far less power to collect and process audio data than it does to collect and process visual data, the researchers envision that a sound-recognition system could be used to improve the context sensitivity of mobile devices.

When coupled with GPS data, for instance, a sound-recognition system could determine that a cellphone user is in a movie theater and that the movie has started, and the phone could automatically route calls to a prerecorded outgoing message. Similarly, sound recognition could improve the situational awareness of autonomous robots.

“For instance, think of a self-driving car,” Aytar says. “There’s an ambulance coming, and the car doesn’t see it. If it hears it, it can make future predictions for the ambulance – which path it’s going to take – just purely based on sound.”

The researchers’ machine-learning system is a neural network. Vondrick, Aytar, and their co-author Antonio Torralba, an MIT professor, first trained a neural net on two large, annotated sets of images: one, the ImageNet data set, contains labeled examples of images of 1,000 different objects; the other, the Places data set created by Torralba’s group, contains labeled images of 401 different scene types, such as a playground, bedroom, or conference room.

Once the network was trained, the researchers fed it the video from 26 terabytes of video data downloaded from the photo-sharing site Flickr.

“It’s about 2 million unique videos,” Vondrick says. “If you were to watch all of them back to back, it would take you about two years.”
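As a rough sanity check on those numbers, the sketch below derives the average clip length and file size they imply. It assumes decimal terabytes (10^12 bytes) and 365-day years, and treats the quoted figures as exact even though they are approximations; the variable names are illustrative.

```python
# Figures quoted in the article (all approximate).
VIDEOS = 2_000_000
TOTAL_BYTES = 26 * 10**12        # 26 terabytes, assuming decimal units
WATCH_YEARS = 2

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a 365-day year

# Average clip length implied by "two years back to back".
avg_seconds = WATCH_YEARS * SECONDS_PER_YEAR / VIDEOS

# Average file size implied by 26 TB spread over 2 million clips.
avg_megabytes = TOTAL_BYTES / VIDEOS / 10**6

print(f"Average clip length: {avg_seconds:.1f} seconds")
print(f"Average clip size:   {avg_megabytes:.0f} MB")
```

In other words, the figures are consistent with a collection of short clips of roughly half a minute each, at around 13 MB apiece.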

Then they trained a second neural network on the audio from the same videos. The second network’s goal was to correctly predict the object and scene tags produced by the first network just from listening to the videos that were being played.
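This teacher-and-student arrangement can be illustrated with a toy example. The sketch below is not the researchers’ code: it swaps the audio network for three adjustable scores, invents the tags and the vision network’s output, and assumes the mismatch is measured with KL divergence, a common choice for comparing probability outputs that the article itself does not name.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student): zero when the two distributions match."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


def train_student(teacher, steps=500, lr=0.5):
    """Fit the student's scores to the teacher's output by gradient descent."""
    logits = [0.0] * len(teacher)  # the student starts with no preference
    for _ in range(steps):
        student = softmax(logits)
        # For softmax outputs, d KL / d logit_i = student_i - teacher_i.
        logits = [z - lr * (s - t) for z, s, t in zip(logits, student, teacher)]
    return softmax(logits)


# Hypothetical vision-network output for one clip over three made-up tags.
TAGS = ["forest", "street", "restaurant"]
teacher_probs = [0.80, 0.15, 0.05]

before = kl_divergence(teacher_probs, softmax([0.0, 0.0, 0.0]))
student_probs = train_student(teacher_probs)
after = kl_divergence(teacher_probs, student_probs)

print(f"Mismatch before training: {before:.4f}")
print(f"Mismatch after training:  {after:.6f}")
for tag, p in zip(TAGS, student_probs):
    print(f"  {tag}: {p:.3f}")
```

In a real system the student’s scores would come from a network that processes the audio track, and the update would adjust that network’s weights across millions of clips rather than three numbers for a single clip.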

The result was a network that could interpret natural sounds and associate them with image categories. For instance, it might determine that the sound of birdsong tends to be associated with forest scenes and pictures of trees, birds, birdhouses, and bird feeders.

“With the modern machine-learning approaches, like deep learning, you have many, many trainable parameters in many layers in your neural-network system,” says Mark Plumbley, a professor of signal processing at the University of Surrey. “That normally means that you have to have many, many examples to train that on. And we have seen that sometimes there’s not enough data to be able to use a deep-learning system without some other help. Here the advantage is that they are using large amounts of other video information to train the network and then doing an additional step where they specialize the network for this particular task. That approach is very promising because it leverages this existing information from another field.”

Plumbley says that both he and colleagues at other institutions have been involved in efforts to commercialize sound recognition software for applications such as home security, where it might, for instance, respond to the sound of breaking glass. Other uses might include elderly care, to identify potentially alarming deviations from ordinary sound patterns, or to control sound pollution in urban areas.

“I really think that there’s a lot of potential in the sound-recognition area,” he says.
