[Manually cross-posted to EAForum here]

There are some great collections of examples of things like specification gaming, goal misgeneralization, and AI improving AI. But almost all of the examples are from demos/toy environments, rather than systems which were actually deployed in the world.

There are also some databases of AI incidents which include lots of real-world examples, but the examples aren't related to failures in a way that makes it easy to map them onto AI risk claims. (Probably most of them don't in any case, but I'd guess some do.)

I think collecting real-world examples (particularly in a nuanced way without claiming too much of the examples) could be pretty valuable:

  • I think it's good practice to have a transparent overview of the current state of evidence
  • For many people I think real-world examples will be most convincing
  • I expect there to be more and more real-world examples, so starting to collect them now seems good

What are the strongest real-world examples of AI systems doing things which might scale to AI risk claims?

I'm particularly interested in whether there are any good real-world examples of:

  • Goal misgeneralization
  • Deceptive alignment (answer: no, but yes to simple deception?)
  • Specification gaming
  • Power-seeking
  • Self-preservation
  • Self-improvement

This feeds into a project I'm working on with AI Impacts, collecting empirical evidence on various AI risk claims. There's a work-in-progress table here with the main things I'm tracking so far - additions and comments very welcome.

New Answer
New Comment

4 Answers sorted by

The ARC evals showing that when given help and a general directive to replicate a GPT-4 based agent was  able to figure out that it ought to lie to a TaskRabbit worker is an example of it figuring out a self-preservation/power-seeking subgoal which is on the road to general self-preservation. But it doesn't demonstrate an AI spontaneously developing self-preservation or power-seeking, as an instrumental subgoal to something that superficially has nothing to do with gaining power or replicating.

Of course we have some real-world examples of specification-gaming like you linked in your answer: those have always existed and we see more 'intelligent' examples like AIs convinced of false facts trying to convince people they're true.

There's supposedly some evidence here that we see power-seeking instrumental subgoals developing spontaneously but how spontaneous this actually was is debatable so I'd call that evidence ambiguous since it wasn't in the wild.

From Specification gaming examples in AI:

  • Roomba: "I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back."
    • I guess this counts as real-world?
  • Bing - manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
    • To be honest, I don't understand the link to specification gaming here
  • Bing - threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
    • To be honest, I don't understand the link to specification gaming here

Bing cases aren't clearly specification gaming as we don't know how the model was trained/rewarded. My guess is that they're probably just cases of unintended generalization. I wouldn't really call this "goal misgeneralization", but perhaps it's similar.

Some non-AI answers:

  • Goal Misgeneralization: The Great Leap Forward in China (1958-1962). The goal was rapid industrialization and collectivization, but it was misgeneralized into extreme policies like backyard furnaces and grain requisitions, which led to catastrophic human and economic losses.
  • Deceptive Alignment: The Trojan Horse in ancient mythology. The Greeks deceived the Trojans into thinking the horse was a gift, leading to the fall of Troy.
  • Deceptive Alignment 2: Ponzi schemes. Madoff aligned his apparent goals of high returns with investors' financial objectives, only to later reveal a fraudulent scheme that led to catastrophic financial loss for many.
  • Specification Gaming: "Robber barons" in the 19th-century U.S., exploiting lax regulations to create monopolies.
  • Specification Gaming 2: The subprime mortgage crisis of 2008. Financial institutions exploited poorly specified risk-assessment models and regulations, leading to a global financial crisis.
  • Power-Seeking: The rise of totalitarian regimes like Nazi Germany. The initial goal of national rejuvenation was perverted into a catastrophic power-seeking endeavor that led to millions of deaths and global conflict.
  • Self-Preservation: The Cuban Missile Crisis (1962). Both the U.S. and the Soviet Union were seeking to preserve their national security, nearly leading to nuclear war.
  • Self-Improvement: The arms race during the Cold War. Both the U.S. and the Soviet Union sought to improve their military capabilities, leading to dangerous levels of nuclear proliferation.

When open-source language model agents were developed, the first goals given were test cases like "A holiday is coming up soon. Plan me a party for it." Hours after, someone came up with the clever goal "Make me rich." Less than a day after that, we finally got "Do what's best for humanity." About three days later, someone tried "Destroy humanity."

None of these worked very well, because language model agents aren't very good (yet?). But it's a good demonstration that people will deliberately build agents and give them open-ended goals. And that some small fraction of people would tell an AI "Destroy humanity," but that's maybe not the central concern because the "Do what's best for humanity" people were more numeous and had a 3-day lead.

The bigger concern is the "Plan my party" and "Make me rich" people were also numeous and also had a lead, and those might be dangerous goals to give to a monomaniacal AI.

1 comment, sorted by Click to highlight new comments since: Today at 9:09 PM

I'll be interested to learn to what extent real-world examples by not-capable-enough-to-be-dangerous systems aare more convincing/useful/meaningful than toy models and examples.