
Learning How to Cook Is Really Hard, If You’re a Robot

Researchers leveraged deep learning to come up with a way for robots to learn to cook by watching YouTube tutorials.

Researchers at the University of Maryland and Australia's national ICT research hub, NICTA, announced last week that they'd devised a method for robots to teach themselves to cook using a learning aid that should be familiar to anyone struggling with the basics of a decent meal—watching YouTube tutorials.

While some headlines declared that robots had learned to cook just like us, the reality of the researchers' work was less sensational, though still very impressive. The team will present their paper on January 29th at the 29th annual conference of the Association for the Advancement of Artificial Intelligence.

By leveraging newly developed deep learning techniques and convolutional neural networks—programs that mimic the human brain's capacity for computation with networks of artificial neurons—the researchers fed 88 YouTube cooking tutorials into their program, which was able to identify objects that robot cooks might need to grasp and generate a plan of action for manipulating them.

Thus, the researchers devised a system that would allow a robot to perform the most basic kitchen tasks (recognizing, grasping, and manipulating objects) on the fly. But no, robots can't learn to cook a whole meal just by watching YouTube videos yet. As it turns out, actually getting a robot to teach itself to cook is really hard. Though robots are popping up in the kitchen with increasing regularity, a robot that can teach itself to cook instead of being programmed is a different matter entirely.

To find out how hard, we reached out to Cornelia Fermüller, an associate research scientist at the University of Maryland and one of the paper's authors.

Would you trust a robot in the kitchen? Image: Anne/Flickr

Motherboard: How do you get a robot to teach itself to cook?
Cornelia Fermüller: There has been lots of research happening on grasping. It's a pretty well-studied problem. But most of the work has been done on copying a movement—showing the robot the movement and then getting the robot to copy it. Doing it in general situations, where the robot has to adapt, is much harder. Different things happen when the jar and the spatula are different sizes; they might be in different locations, or the table might be cluttered, which makes it more difficult.

The robot has to perceive the world; maybe there's somebody interrupting the situation. So, it becomes difficult in general world situations because of the variations that can happen. If a robot has to pick up a peanut butter jar that's always in the same location, that can be handled. Researchers are even at the point where you can move the jar to different locations, but a more complex scene requires an interaction between vision and action.

Is deep learning far enough along that pulling information from YouTube videos is a viable way to overcome these issues?
The low-level part of the processing is still not easy, even with deep learning. We're looking at videos, not laboratory scenes. There's so much variation in the way videos are taken—different angles, totally different scenes, different illuminations, and many occlusions. A hand obscuring the screen, or objects occluding one another, for example. Here, we have used deep learning technology, which has seen a boom in development just in the last year. There is good software with which you can recognize objects, and we have used it here to recognize objects and, as something novel, hand poses.

Our robots are not cooking yet. I mean, we're working in robotics, too, but we're mostly vision people. But we have some capabilities. We can pour water, we can stir things, but we don't have autonomous robots that can be in a real teaching environment where all kinds of things are happening and they cook. We're working towards getting robots to be able to make a meal, but we're not there yet.

What are some of the technical barriers there?
First of all, all of the aforementioned problems have to be solved. We have to teach the robot all these actions under very different conditions, and we have to deal with different locations and situations. The other thing is that in a cooking situation, things can happen in many different ways. People can come in, and things can go wrong. All this has to be accounted for.

The robot has to have intelligence, reasoning capabilities, and the ability to reason about unexpected events. In other words, a robot has to be autonomous. Research is ongoing and we're working on artificial intelligence, but it's not done yet. Progress is happening, but it will take more time.

And what about your approach to deep learning—how did you go about parsing YouTube videos for data that could be used by a cooking robot?
If you want to understand the actions involved in manipulation, humans are doing that in so many different ways. It's not just that you can watch one human and copy exactly what they're doing with the robot. You and I may open a peanut butter jar very differently.

The way we have solved this is by thinking not just in terms of the movement, but the goal of the action. [It relies on] coming up with a grammar to break up the action into little chunks of sub-actions, and that is our high-level contribution: the creation of these grammars.

For opening a peanut butter jar, it would look like this: Move your hand toward the jar, grasp the lid, turn the lid, move your hand away with the lid, and now you have two parts. Understanding this action-grammar is like understanding language, in a way, where you can break down sentences.
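To make the idea concrete, here is a minimal sketch of how such an action grammar might be represented in code. This is purely illustrative—the class, the sub-action names, and the tree structure are my own assumptions, not the paper's actual formalism—but it shows the core point: a high-level action parses into an ordered tree of sub-actions, and the leaves are the motor primitives a robot would execute.

```python
# Hypothetical sketch of an "action grammar" (illustrative names only,
# not the paper's formalism): a manipulation task decomposes into an
# ordered tree of sub-actions, much like a sentence parses into phrases.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    name: str
    children: List["Action"] = field(default_factory=list)

    def primitives(self) -> List[str]:
        """Flatten the tree into its leaf-level motor primitives, in order."""
        if not self.children:
            return [self.name]
        steps: List[str] = []
        for child in self.children:
            steps.extend(child.primitives())
        return steps


# Fermüller's peanut-butter example, expressed as a grammar tree:
open_jar = Action("open_jar", [
    Action("reach", [Action("move_hand_to_jar")]),
    Action("grasp_lid"),
    Action("turn_lid"),
    Action("withdraw", [Action("move_hand_away_with_lid")]),
])

print(open_jar.primitives())
# ['move_hand_to_jar', 'grasp_lid', 'turn_lid', 'move_hand_away_with_lid']
```

The language analogy carries through: just as a sentence breaks into phrases and then words, the tree breaks "open the jar" into sub-actions and then primitives.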

Another novelty of the paper is deciding where to break an action down. Usually, when computer vision people program a robot to complete an action, they already have it broken down. We came up with this concept of contact: whenever contact is made or broken, a new segment begins. The hand moves towards the jar, and that's one segment. Turning the lid is another. When the contact ends and the hand moves away, that's another.
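The contact rule amounts to a simple segmentation of the video timeline: cut wherever the hand/object contact state flips. The sketch below is my own illustration of that rule under a simplifying assumption—that a per-frame boolean contact signal is already available—which in practice would itself have to come from vision.

```python
# Illustrative sketch (not the authors' code): segment a sequence of
# frames wherever the hand/object contact state changes. Assumes a
# per-frame boolean contact signal is already available from perception.
from typing import List, Tuple


def segment_by_contact(contact: List[bool]) -> List[Tuple[int, int]]:
    """Return (start, end) frame index pairs, end exclusive, one per
    maximal run of frames sharing the same contact state."""
    segments: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(contact)):
        if contact[i] != contact[i - 1]:  # contact made or broken: cut here
            segments.append((start, i))
            start = i
    if contact:
        segments.append((start, len(contact)))
    return segments


# Hand approaches the jar (no contact), grasps and turns (contact), withdraws.
frames = [False, False, True, True, True, False]
print(segment_by_contact(frames))  # [(0, 2), (2, 5), (5, 6)]
```

Each returned segment corresponds to one chunk of the action grammar: approach, manipulate, withdraw.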

When will we get to a point in AI development where a robot can watch a YouTube video and cook a meal, just like us?
It's difficult to say. It will happen, especially if you have a safe environment where there are no kids running around and so on. But it will take more time. You have to deal with all kinds of perception problems. If the robot has to go to the fridge, it has to reason about whether the lettuce is there and whether it has to go shopping for it.