
Big Tech Is Now Developing Powerful AI Brains for Real-World Robots

Building on recent AI advancements to allow robots to complete tasks autonomously in the real world is a "major step forward," researchers say.
Image: Screengrab via Google Research

Large deep learning models like OpenAI's GPT-3 have ushered in a golden age for chatbots, but what about physical robots? Both Google and Microsoft have now announced research into applying similar AI models to robots, with impressive results. 

This week, researchers at Google and the Berlin Institute of Technology released an AI model called PaLM-E that combines language and vision capabilities to control robots, allowing them to complete tasks autonomously in the real world, from fetching a bag of chips in a kitchen to sorting blocks by color into the corners of a rectangle.


According to the researchers, this is the largest visual language model (VLM) reported to date, with 562 billion parameters. The AI has a “wide array of capabilities,” including math reasoning, multi-image reasoning, and chain-of-thought reasoning. In a paper, the researchers wrote that the AI uses multi-task training to transfer skills across tasks, rather than being trained on each task individually. According to the paper, when controlling robots the model even displays "emergent capabilities like multimodal chain of thought reasoning, and the ability to reason over multiple images, despite being trained on only single-image prompts."

PaLM-E is based on Google’s previous large language model, PaLM; the “E” in the name stands for “embodied” and refers to the model’s interaction with physical objects and robotic control. PaLM-E also builds on Google’s RT-1, a model that takes in robot inputs such as camera images and task instructions and outputs motor commands. For vision, the AI uses ViT-22B, a vision transformer model that handles tasks such as image classification, object detection, and image captioning.
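To make that architecture concrete, here is a minimal, illustrative sketch of the general pattern the paper describes: image features from a vision encoder are projected into the same embedding space as the language model’s text tokens and interleaved with them, so a single decoder can attend over both. The dimensions, toy embedding tables, and function names below are assumptions for illustration, not Google’s implementation.

```python
# Illustrative sketch only (not Google's code): project image features into the
# language model's token-embedding space and interleave them with text tokens.
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 512          # language model embedding width (assumed)
D_VISION = 1024        # vision encoder feature width (assumed)

# Frozen lookup tables standing in for the real PaLM embeddings / ViT encoder.
token_embeddings = rng.normal(size=(32_000, D_MODEL))
vision_projection = rng.normal(size=(D_VISION, D_MODEL)) / np.sqrt(D_VISION)

def embed_text(token_ids):
    """Look up embeddings for text tokens."""
    return token_embeddings[token_ids]

def embed_image(image_features):
    """Project ViT-style patch features (N x D_VISION) into the LM space."""
    return image_features @ vision_projection

def build_multimodal_prompt(segments):
    """Interleave text and image segments into one embedding sequence."""
    parts = []
    for kind, payload in segments:
        parts.append(embed_text(payload) if kind == "text" else embed_image(payload))
    return np.concatenate(parts, axis=0)

# A command with a camera frame inserted mid-prompt (placeholder token ids).
prompt = build_multimodal_prompt([
    ("text", np.array([101, 2054, 2003])),
    ("image", rng.normal(size=(16, D_VISION))),   # 16 image patch features
    ("text", np.array([2129, 2000, 102])),
])
print(prompt.shape)  # (3 + 16 + 3, 512): one sequence the decoder can attend over
```

The point of the sketch is only that, once image features live in the same space as word embeddings, the language model can treat “sentences” that mix words and pictures as a single input.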

Using the model, the robot can generate its own plan of action in response to commands. When the robot was asked to “bring me the rice chips from the drawer,” PaLM-E guided it to go to the drawers, open the top drawer, take the rice chips out, bring them to the user, and put them down. The robot managed this even with a human disturbance: a researcher knocked the rice chips back into the drawer the first time the robot picked them up. PaLM-E handles this by analyzing data from the robot’s live camera.


“PaLM-E generates high-level instructions as text; in doing so, the model is able to naturally condition upon its own predictions and directly leverage the world knowledge embedded in its parameters,” the researchers wrote. “This enables not only embodied reasoning but also question answering, as demonstrated in our experiments.” 
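That quote describes a closed planning loop: the model writes out the next high-level step as text, that text is fed back into its own context along with a fresh camera frame, and a lower-level controller carries the step out. The sketch below illustrates that pattern under stated assumptions; query_vlm(), execute_skill(), and capture_frame() are hypothetical stand-ins, not functions from the paper.

```python
# A minimal sketch of closed-loop planning with a vision-language model.
# All function names here are hypothetical placeholders for illustration.

def plan_and_act(goal, query_vlm, execute_skill, capture_frame, max_steps=10):
    history = []  # instructions the model has already issued and the robot has run
    for _ in range(max_steps):
        # The model conditions on the goal, the latest camera frame, and its own
        # previous predictions, which is what lets it re-plan after a disturbance.
        next_step = query_vlm(goal=goal, image=capture_frame(), history=history)
        if next_step == "done":
            break
        execute_skill(next_step)          # e.g. "open the top drawer"
        history.append(next_step)
    return history
```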

The AI can also answer questions about the world, such as math problems and facts like which ocean Miami Beach borders, and it can caption and describe images.

Using a large language model as the robot’s core makes it more autonomous, requiring less training and fine-tuning than previous models. Danny Driess, one of the paper’s authors, tweeted, “Perhaps most exciting about PaLM-E is **positive transfer**: simultaneously training PaLM-E across several domains, including internet-scale general vision-language tasks, leads to significantly higher performance compared to single-task robot models.”
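For readers unfamiliar with the term, “positive transfer” here comes from training one model on a mixture of domains at once, so that skills learned from internet-scale image-and-text data carry over to the comparatively scarce robot data. The toy sketch below only illustrates that mixing idea; the dataset names and mixing ratios are assumptions, not the actual training recipe.

```python
# Toy illustration (assumed names and proportions) of multi-domain training batches:
# internet-scale vision-language examples are interleaved with robot demonstrations.
import random

def mixed_batches(datasets, weights, batch_size=8, num_batches=3, seed=0):
    """Yield batches sampled from several domains according to a mixing ratio."""
    rng = random.Random(seed)
    names = list(datasets)
    for _ in range(num_batches):
        yield [rng.choice(datasets[rng.choices(names, weights=weights)[0]])
               for _ in range(batch_size)]

datasets = {
    "web_captioning": ["(image, caption) pair"] * 100,        # internet-scale data
    "vqa": ["(image, question, answer) triple"] * 50,
    "robot_demos": ["(images, instruction, actions) episode"] * 10,
}
for batch in mixed_batches(datasets, weights=[0.6, 0.3, 0.1]):
    print(batch[:2])
```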

“This work represents a major step forward, but on an expected path. It extends recent, exciting work out of DeepMind to the important and difficult arena of robotics (their work on ‘Frozen’ and ‘Flamingo’). More broadly, it is part of the recent tsunami of amazing AI advances that combine a simple, but powerful formula,” Jeff Clune, an Associate Professor of Computer Science at the University of British Columbia, told Motherboard. The formula, he said, is to first have the AI digest the internet and learn to predict what comes next, and then to train the models to use that knowledge to solve harder tasks.


Danfei Xu, an Assistant Professor at the School of Interactive Computing at Georgia Tech, told Motherboard that PaLM-E is a big step forward for Google’s robotics research. “Task planning, or determining what to do to achieve a goal, is an arduous robotics/AI problem that SayCan and PaLM-E have made significant strides toward solving. Previous task planning systems mostly rely on some form of search or optimization algorithm, which are not very flexible and are hard to construct. LLMs and multimodal LLMs allow these systems to reap the benefit of Internet-scale data and easily generalize to new problems,” he said.

Google is not the only company exploring multimodal AI and how to incorporate large language models into robots. Microsoft released research on how it extended the capabilities of ChatGPT to robotics, and on Monday it also unveiled a multimodal model called Kosmos-1, which can analyze images for content, solve visual puzzles, perform visual recognition, and pass IQ tests.

In their paper describing the results, Microsoft researchers described the convergence of language models with robotic capabilities as a step toward creating artificial general intelligence, or AGI, which is generally understood as intelligence on the same level as a human being.

At the same time, Xu said there is still more work to be done to overcome the many real-world problems that can arise, such as obstacles in a kitchen or the possibility of a robot slipping.

“Generally, endowing robots with human-like sensorimotor control is a really difficult problem (see Moravec’s paradox)," he said. "And it may be the most difficult problem in robotics and the major roadblock towards building useful robots that can assist us in our daily lives. There is other great research in Google Robotics that tries to make progress on that problem, e.g. RT-1, but PaLM-E itself does not directly address that problem. PaLM-E makes great progress on the important robotics problem of task planning. At the same time, difficult robotics problems stay difficult.”

PaLM-E shows that as large language models are scaled up and advance, their capabilities, including performing multimodal tasks like controlling robots, become more accurate and more autonomous.