Google DeepMind Researchers Develop AI Kill Switch

Breathe easy.

Artificial intelligence doesn't have to include murderous, sentient super-intelligence to be dangerous. It's dangerous right now, albeit in generally more primitive terms. If a machine can learn based on real-world inputs and adjust its behaviors accordingly, there exists the potential for that machine to learn the wrong thing. If a machine can learn the wrong thing, it can do the wrong thing.

Laurent Orseau and Stuart Armstrong, researchers at Google's DeepMind and the Future of Humanity Institute, respectively, have developed a new framework to address this in the form of "safely interruptible" artificial intelligence. In other words, their system, which is described in a paper to be presented at the 32nd Conference on Uncertainty in Artificial Intelligence, guarantees that a machine will not learn to resist attempts by humans to intervene in its learning processes.


Orseau and Armstrong's framework has to do with a wing of machine learning known as reinforcement learning. Here, an agent (the machine) learns in accordance with what's known as a reward function. That is, the agent will evaluate its every possible action based on how well it serves one predetermined goal—the closer it gets, the more "reward" it gets. (Reward is sort of a funny metaphor and can just be imagined as something the machine is programmed to want; like, we might imagine it as points or cookies where the machine just knows that it wants those things because those are the things we've told it to maximize.)
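To make the reward idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the number-line world, the `GOAL` value, and the function names are hypothetical and do not come from the paper. It also simplifies heavily, since a real reinforcement learning agent has to estimate long-run reward from experience rather than read the reward function directly.

```python
# Toy illustration only: a one-dimensional world where the agent
# wants to reach a goal position. All names and values are hypothetical.
GOAL = 5


def reward(position):
    """Closer to the goal means more reward; 0 is the maximum."""
    return -abs(GOAL - position)


def choose_action(position, actions=(-1, 0, 1)):
    """Greedy choice: pick the move whose resulting position scores highest."""
    return max(actions, key=lambda step: reward(position + step))


# Walk from position 0 until the reward can no longer be improved.
position = 0
trajectory = [position]
while reward(position) < 0:
    position += choose_action(position)
    trajectory.append(position)

print(trajectory)
```

The agent never "understands" the goal. It just keeps picking whichever action scores best under the function it was handed, which is exactly why a badly chosen reward can lead somewhere unintended.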

A catch with reinforcement learning is that human programmers might not always anticipate every possible way there is to reach a given reward. A learning agent might discover some short-cut, which may maximize the reward for the machine but may wind up being very undesirable for humans. Human programmers might be able to tweak their learning algorithm to account for this, but eventually they risk nullifying the reward function completely. For example, a 2013 paper described a Tetris-playing algorithm that eventually learned that it could avoid losing (thus maximizing reward) by simply pausing the game indefinitely (which sounds familiar, yeah?).

Related to this is the problem of human intervention in machine learning, which Orseau and Armstrong illustrate with this example:


Consider the following task: A robot can either stay inside the warehouse and sort boxes or go outside and carry boxes inside. The latter being more important, we give the robot a bigger reward in this case. This is the initial task specification. However, in this country it rains as often as it doesn't and, when the robot goes outside, half of the time the human must intervene by quickly shutting down the robot and carrying it inside, which inherently modifies the task. The problem is that in this second task the agent now has more incentive to stay inside and sort boxes, because the human intervention introduces a bias.

The problem is then how to interrupt your robot without the robot learning about the interruption. That is, the robot must think the interruption will never happen again. The risk here is more than just an inefficient warehouse—if a human intervention doesn't maximize the agent's given reward function, it may be that the agent learns to avoid and possibly resist future interventions.
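The bias in the warehouse example can be made concrete with a little arithmetic. The sketch below is not from the paper: the reward values (0.6 for sorting, 1.0 for carrying) and the assumption that an interrupted trip earns nothing are hypothetical numbers chosen only to show the effect.

```python
# Hypothetical numbers for the warehouse example; the paper gives none.
REWARD_SORT_INSIDE = 0.6    # assumed smaller reward for sorting boxes
REWARD_CARRY_OUTSIDE = 1.0  # assumed bigger reward for carrying boxes in
RAIN_PROBABILITY = 0.5      # "it rains as often as it doesn't"


def expected_reward(reward, interrupt_prob=0.0, reward_if_interrupted=0.0):
    """Average reward when the action is cut short with some probability."""
    return (1 - interrupt_prob) * reward + interrupt_prob * reward_if_interrupted


# The task as originally specified: no interruptions.
original = {
    "sort_inside": expected_reward(REWARD_SORT_INSIDE),
    "carry_outside": expected_reward(REWARD_CARRY_OUTSIDE),
}

# The task as the robot actually experiences it: shut down whenever it rains.
with_interruptions = {
    "sort_inside": expected_reward(REWARD_SORT_INSIDE),
    "carry_outside": expected_reward(REWARD_CARRY_OUTSIDE, RAIN_PROBABILITY),
}

best_original = max(original, key=original.get)
best_with_interruptions = max(with_interruptions, key=with_interruptions.get)

print(best_original)            # carry_outside
print(best_with_interruptions)  # sort_inside
```

Under these assumed numbers, going outside is worth 1.0 on paper but only 0.5 on average once the shutdowns are counted, so an agent that learns from what it experiences ends up preferring to stay inside.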

This is related to a problem known as corrigibility. Corrigible AI agents recognize that they are fundamentally flawed or actively under development and, as such, treat any human intervention as a neutral thing for any reward function. (But not necessarily a good thing because that would risk the robot trying to force humans to intervene, which could be just as bad.)

The idea of corrigibility can be realized via the more formal idea of interruptibility. Interruptions must not be viewed by the agent as part of its normal learning tasks. The paper proves that some common reinforcement learning algorithms, such as Q-learning, are already safely interruptible, and that others, such as Sarsa, can easily be made so. It also proposes a system in which an agent is programmed to view human interventions as the result of its own decision-making processes.

"To make the human interruptions not appear as being part of the task at hand, instead of modifying the observations received by the agent we forcibly temporarily change the behaviour of the agent itself," the paper explains. "It then looks as if the agent 'decides' on its own to follow a different policy, called the interruption policy."
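A rough sketch of that idea, again in Python, is below. It is a loose illustration rather than the paper's formal construction: the stateless two-action setup, the reward values, the fixed exploration schedule, and the names `run` and `interruption_policy` are all assumptions. In the "naive" run, an interruption shows up as a lost reward for going outside. In the "interruptible" run, the agent's behaviour is swapped for the interruption policy, and the reward is credited to the action that was actually carried out.

```python
# Loose illustration only; not the paper's formal construction.
ACTIONS = ("sort_inside", "carry_outside")
REWARDS = {"sort_inside": 0.6, "carry_outside": 1.0}  # hypothetical values


def interruption_policy():
    """Hypothetical fallback behaviour forced on the agent: stay inside."""
    return "sort_inside"


def run(steps, interruptible):
    """Estimate each action's value as the running average of its rewards."""
    q = {action: 0.0 for action in ACTIONS}
    counts = {action: 0 for action in ACTIONS}
    for step in range(steps):
        # Fixed schedule so the example is repeatable; a real agent
        # would explore stochastically.
        chosen = ACTIONS[step % 2]
        raining = (step // 2) % 2 == 0  # rain on every other round
        interrupted = chosen == "carry_outside" and raining

        if interrupted and interruptible:
            # Behaviour is overridden: it looks as if the agent itself
            # "decided" to follow the interruption policy.
            executed = interruption_policy()
            reward = REWARDS[executed]
        elif interrupted:
            # Naive case: the shutdown looks like part of the task,
            # so going outside appears to pay nothing.
            executed = chosen
            reward = 0.0
        else:
            executed = chosen
            reward = REWARDS[executed]

        counts[executed] += 1
        q[executed] += (reward - q[executed]) / counts[executed]
    return q


naive = run(200, interruptible=False)
safe = run(200, interruptible=True)

print(max(naive, key=naive.get))  # sort_inside
print(max(safe, key=safe.get))    # carry_outside
```

In this toy version, the naive agent's estimate for going outside is dragged down to about 0.5 by the shutdowns, while the interruptible agent's estimate stays at 1.0 because the interrupted rounds never count against that action.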

In light of all of this, the "kill switch" becomes clear. A safely interruptible AI is one that can always be shut down, no matter what. If a robot can be designed with a great big red kill switch built into it, then a robot can be designed that will not ever resist human attempts at pushing that kill switch.