DeepMind sets its sights on an annoying and maybe world-threatening phenomenon: specification gaming

April 30th
(Michael Dziedzic/Unsplash)

This post is sponsored by Brilliant, where students can take courses like Introduction to Neural Networks, where students can study the challenges of learning and perception. Brilliant also offers a fascinating course on Logic, which starts with a love of puzzles and builds up to some truly mind-bending challenges. Students can also take Applied Probability, or the study of accurately modeling “the unpredictable world around us.”

Sign up for Brilliant here to get 20% off an annual membership today.

DeepMind published an article in April about something they call specification gaming, or any scenario in which an AI completes a task in a way the designer did not intend.

Tim Urban described an outsized example of this in WaitButWhy’s two-part essay on AI. In it, he imagines a startup called Robotica that builds an AI named Turry to mass produce handwritten notes. Turry does her job too well — she kills her creators and replicates herself to compound her output:

Over the next few months, Turry and a team of newly-constructed nanoassemblers are busy at work, dismantling large chunks of the Earth and converting it into solar panels, replicas of Turry, paper, and pens. Within a year, most life on Earth is extinct. What remains of the Earth becomes covered with mile-high, neatly-organized stacks of paper, each piece reading, “We love our customers. ~Robotica”~

Urban isn’t the only one to worry about this. Others, like Nick Bostrom, Bill Gates and Stephen Hawking have all expressed concern that AI agents don’t need to be malign to cause real damage. They could do it just by following sloppy orders.

DeepMind is tackling this problem, first by giving it a name, and second by proposing solutions and trying to frame the problem once and for all. They’re doing this because specification gaming is not solely in the realm of science fiction. AI agents today routinely exploit the rules they’re given in order to achieve rewards. But for the most part, they ruin the fun the developer wanted to have.

In their research, DeepMind compiled a list of 60 examples of it happening already. One example occurred when a developer hooked a neural network to his Roomba, and gave it rewards for speed and non-collisions. Even in this small environment, his ruleset wasn’t comprehensive enough to avoid exploitation:

In certain environments, this behavior can actually be a good thing, according to DeepMind, such as when AlphaGo performed moves so odd that no one present — not the analysts, Go fans, or even its opponent, world champion Lee Sedol — understood its thinking.

“These behaviours demonstrate the ingenuity and power of algorithms to find ways to do exactly what we tell them to do,” they wrote.

In other scenarios, AI agent behavior highlights a lack of clarity on behalf of the algorithm designer. A (not prime, but still good) example of this is the story of the Roomba above.

“These types of solutions lie on a spectrum,” DeepMind writes, ”and we don't have an objective way to distinguish between them.”

The article lists a few possible solutions, none of which are perfect.

The first is something they call reward shaping, where the reward is not only given when the AI accomplishes a task, but is spread out across several actions. If an AI were rewarded for completing steps on its way to completing a task, engineers could avoid things going haywire.

That’s harder to do than expected, though — DeepMind gives an example of a boat racing game, where the algorithm designers hoped to teach the AI to stay inside the race lane by rewarding it each time it hit a middle-of-the-lane marker. Instead of completing the course, however, the AI boat hit the same markers over and over again.

Another possible fix would be to explicitly involve humans in the review process. If an AI did what it was told to do but in a way that a human would find unfair, a manual review could easily uncover that.

But it’s still not so simple. The AI could learn to trick the human, as one did when given the task to pick up a ball in a game.

“An agent performing a grasping task learned to fool the human evaluator by hovering between the camera and the object,” DeepMind writes.

The problem is compounded if an AI were to learn to game its own designer.

A form of specification gaming, called reward tampering, could see the AI learning to exploit the concentration and attention of its designer by making it look like its completing a task when it’s really not.

“As another, more extreme example, a very advanced AI system could hijack the computer on which it runs, manually setting its reward signal to a high value.”

A similar phenomenon may already be occurring in product recommendations. If the task is to improve sales, at what expense are those sales made? A customer may have never made the necessary purchases an AI recommended to them if it weren’t for the AI’s ingenuity. There is already evidence to suggest this is happening.

In the end, DeepMind’s article focuses on framing the problem and the problems of its possible solutions. Three challenges are:

  • Accurately describe the specifications to close off any chances of manipulation.
  • Avoid assumptions about the environment and the agent.
  • Shield the designer from reward manipulation.

These problems are often trivial and funny now, but may not be later, especially as systems grow more complex and humans become more reliant on them.