DeepMind's robots are learning from the internet with ChatGPT brains

By Matthew Griffin Robo Revolution 11th August 2023

WHY THIS MATTERS IN BRIEF

We are always finding new ways to train robots, and when it comes to creating general purpose robots this is one of the most effective.

Ever since ChatGPT exploded onto the tech scene in November of last year, it’s been helping people write all kinds of material, generate code, and find information. It and other Large Language Models (LLMs) have facilitated tasks from fielding customer service calls to taking fast food orders. Given how useful LLMs have been for humans in the short time they’ve been around, how might a ChatGPT for robots impact their ability to learn and do new things? Researchers at Google DeepMind decided to find out and published their findings in a blog post and paper released last week.

They call their system RT-2. It’s short for Robotics Transformer 2, and it’s the successor to Robotics Transformer 1 (RT-1), which the company released at the end of last year. RT-1 was based on a small language and vision program and specifically trained to do many tasks. The software was used in Alphabet X’s Everyday Robots, enabling them to do over 700 different tasks with a 97 percent success rate. But when prompted to do new tasks they weren’t trained for, robots using RT-1 were only successful 32 percent of the time.

RT-2 almost doubles this rate, successfully performing new tasks 62 percent of the time it’s asked to. The researchers call RT-2 a Vision-Language-Action (VLA) model. It uses text and images it sees online to learn new skills. That’s not as simple as it sounds; it requires the software to first “understand” a concept, then apply that understanding to a command or set of instructions, then carry out actions that satisfy those instructions.

One example the paper’s authors give is disposing of trash. In previous models, the robot’s software would have to first be trained to identify trash. For example, if there’s a peeled banana on a table with the peel next to it, the bot would be shown that the peel is trash while the banana isn’t. It would then be taught how to pick up the peel, move it to a trash can, and deposit it there.

RT-2 works a little differently, though. Since the model has trained on loads of information and data from the internet, it has a general understanding of what trash is, and though it’s not trained to throw trash away, it can piece together the steps to complete this task.

The LLMs the researchers used to train RT-2 are PaLI-X, a vision and language model with 55 billion parameters, and PaLM-E, what Google calls an embodied multimodal language model, developed specifically for robots, with 12 billion parameters.

“Parameter” refers to an attribute a machine learning model defines based on its training data. In the case of LLMs, the parameters model the relationships between words in a sentence and weigh how likely it is that a given word will be preceded or followed by another word.
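To make the idea of "how likely a word is to follow another" concrete, here is a toy sketch that estimates those likelihoods by simple counting. It is an illustration only: the function name `bigram_probabilities` and the tiny corpus are made up for this example, and real LLMs such as the ones named above learn billions of neural network weights rather than a table of counts.

```python
from collections import Counter, defaultdict


def bigram_probabilities(corpus):
    """Estimate P(next word | current word) from raw counts.

    Hypothetical helper for illustration; not how PaLI-X or PaLM-E work.
    """
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        # Count every adjacent (current, following) word pair.
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1

    probs = {}
    for word, followers in counts.items():
        total = sum(followers.values())
        probs[word] = {nxt: n / total for nxt, n in followers.items()}
    return probs


# Made-up training sentences, loosely themed on the trash example.
corpus = [
    "throw the peel in the trash",
    "pick up the peel",
    "eat the banana",
]
probs = bigram_probabilities(corpus)
print(probs["the"])
```

In this toy table, each probability plays the role of a "parameter": a number fixed by the training data that the model later uses to weigh which word is likely to come next.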

Through finding the relationships and patterns between words in a giant dataset, the models learn from their own inferences. They can eventually figure out how different concepts relate to each other and discern context. In RT-2’s case, it translates that knowledge into generalized instructions for robotic actions.

Those actions are represented for the robot as tokens, which are usually used to represent natural language text in the form of word fragments. In this case, the tokens are parts of an action, and the software strings multiple tokens together to perform an action. This structure also enables the software to perform Chain-of-Thought Reasoning, meaning it can respond to questions or prompts that require some degree of reasoning.
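One way to picture actions as tokens is to discretize each number in a robot command into a fixed set of bins, so the command becomes a short sequence of integers that a language model can emit like words. The sketch below is a minimal illustration under stated assumptions: the function names, the 256-bin resolution, the [-1, 1] value range, and the example command are all chosen for this example and are not taken from the article.

```python
def encode_action(values, low=-1.0, high=1.0, bins=256):
    """Map continuous action values to integer tokens in [0, bins - 1].

    Assumed scheme for illustration: uniform bins over [low, high].
    """
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)  # keep the value inside the range
        scaled = (clipped - low) / (high - low) * (bins - 1)
        tokens.append(int(scaled + 0.5))  # round to the nearest bin
    return tokens


def decode_action(tokens, low=-1.0, high=1.0, bins=256):
    """Map integer tokens back to approximate continuous values."""
    return [low + t / (bins - 1) * (high - low) for t in tokens]


# Hypothetical command: small arm movement in x, y, z plus a gripper value.
command = [0.10, -0.25, 0.00, 1.00]
tokens = encode_action(command)
action_string = " ".join(str(t) for t in tokens)  # text a model could output
recovered = decode_action(tokens)
print(action_string)
print(recovered)
```

Decoding loses a little precision, at most half a bin width per value, which is the trade-off for being able to treat robot actions as ordinary text.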

Examples the team gives include choosing an object to use as a hammer when there’s no hammer available – the robot chooses a rock – and picking the best drink for a tired person – the robot chooses an energy drink.

“RT-2 shows improved generalization capabilities and semantic and visual understanding beyond the robotic data it was exposed to,” the researchers wrote in a Google blog post. “This includes interpreting new commands and responding to user commands by performing rudimentary reasoning, such as reasoning about object categories or high-level descriptions.”

The dream of general purpose robots that can help humans with whatever may come up – whether in a home, a commercial setting, or an industrial setting – won’t be achievable until robots can learn on the go. What seems like the most basic instinct to us is, for robots, a complex combination of understanding context, being able to reason through it, and taking actions to solve problems that weren’t anticipated to pop up. Programming them to react appropriately to a variety of unplanned scenarios is impossible, so they need to be able to generalize and learn from experience, just like humans do.

RT-2 is a step in this direction. The researchers do acknowledge, though, that while RT-2 can generalize semantic and visual concepts, it’s not yet able to learn new actions on its own. Rather, it applies the actions it already knows to new scenarios. Perhaps RT-3 or 4 will be able to take these skills to the next level. In the meantime, as the team concludes in their blog post, “While there is still a tremendous amount of work to be done to enable helpful robots in human-centered environments, RT-2 shows us an exciting future for robotics just within grasp.”
