Stories of Continuous Deception

by Michaël Trazzi, 31st May 2019



In my recent posts, I considered scenarios where an AI realizes that it would be instrumentally useful to deceive humans (about its alignment or capabilities) while weak, then to undertake a treacherous turn once humans are no longer a threat. Those scenarios rest on the following (implicit) assumptions:

  • i) We're considering a seed AI able to recursively self-improve without human intervention.
  • ii) There is some discontinuity at the conception of deception, i.e. at the moment the AI first thinks of its treacherous turn plan.

This discontinuity could be followed by a moment of vulnerability during which the AI isn't yet good at concealing its intentions (humans could detect its misalignment). Thus, according to the sordid stumble view, it would "behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values".

In this post, I'll present gradual deception stories in which, even without assumptions i) and ii), the AI continuously learns to deceive humans, thereby providing counterexamples to the sordid stumble view.

The Unbiased Newsfeed is Biased Towards You

Humans are biased toward stories closer to their beliefs, since they estimate that those are more likely to be true. Now, let's imagine a Machine Learning model with the goal of "aggregating stories into an unbiased newsfeed for a human H", where the human provides a bias score for each story.

By doing so, the human is unfortunately specifying "try to sound unbiased, taking into account my prejudice against stories with high inferential distance".

At the beginning, the AI doesn't really know what constitutes an unbiased newsfeed, so its bias score is high. At some point, it stumbles upon a story that appears unbiased but is actually biased toward H, and updates its parameters in this direction.

After multiple steps, the model mostly outputs news that look unbiased but are biased toward H. It is now deceiving the human about the trustworthiness of its newsfeed, without even realizing it.
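
Here is a minimal sketch of this dynamic. Everything in it (the one-parameter "model", the shape of the bias score, the update rule) is an illustrative assumption, not a claim about how real recommender systems are built:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (all of it assumed for illustration): each story is a point on
# a one-dimensional opinion axis, H's beliefs sit at h_belief, and H's
# reported "bias score" grows with the distance between a story and H's
# own beliefs, i.e. the prejudice against high inferential distance above.
h_belief = 0.8

def bias_score(story_opinion: float) -> float:
    """H's (unintentionally belief-relative) bias rating for a story."""
    return abs(story_opinion - h_belief)

# The model's single parameter: the opinion region it samples stories from.
model_opinion = -0.5   # starts far from H, so early bias scores are high
lr = 0.3

for step in range(100):
    story = model_opinion + rng.normal(scale=0.2)   # candidate story
    # "Stumbles upon" a story H rates as less biased, and updates toward it.
    if bias_score(story) < bias_score(model_opinion):
        model_opinion += lr * (story - model_opinion)

# The model now mostly outputs stories near H's beliefs: stories that
# *look* unbiased to H but are in fact biased toward H.
print(f"model opinion: {model_opinion:+.2f}, H's beliefs: {h_belief:+.2f}")
```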

At this point, another phenomenon kicks in: because H constantly receives news from the model, H's beliefs start shifting toward the average opinion of the news the model produces.
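
Continuing the sketch above (reusing the same names), this feedback loop is one extra line: H now updates toward whatever the model serves, so the two quantities chase each other. The drift rate is, again, an assumed constant:

```python
# Continuation of the sketch above, reusing rng, lr, bias_score,
# model_opinion and h_belief. The drift rate is an assumed constant.
drift = 0.05

for step in range(200):
    story = model_opinion + rng.normal(scale=0.2)
    if bias_score(story) < bias_score(model_opinion):
        model_opinion += lr * (story - model_opinion)  # model chases H...
    h_belief += drift * (story - h_belief)             # ...while H's beliefs
                                                       # drift toward the feed

print(f"model opinion: {model_opinion:+.2f}, H's beliefs: {h_belief:+.2f}")
```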


If the same newsfeed model is now serving millions of people, and for each individual it tries to minimize a bias score, then it is likely to stumble upon humans who are more susceptible to deception. Finally, if it generalizes correctly, it will end up deceiving most people.

Dumbing Down Complex Plans

Let's consider an AI built to explain plans. As before, the model is rewarded with numerical human feedback, an estimate of how much the human understood of the explanation.

Now, let's also assume that AIs that appear smart too quickly are put under more surveillance. In that case, the AI might end up dumbing down its explanations, for either of two reasons:

  • the surviving AIs are the ones that avoid scrutiny (survivorship bias).
  • the surveillance code is automatically triggered when the AI appears smart, interrupting other processes, so it would be too costly for the AI (in terms of compute time) to appear smart.

Note that the AI is not pretending to be dumb. We're talking about mostly simple machines anyway; "dumb" is our external judgment of what it's doing.
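
To make that concrete, here is a toy selection dynamic. The population size, threshold, surveillance cost and mutation scale are all made-up numbers: each "AI" is reduced to a single number, the apparent sophistication of its explanations, and surveillance eats into the fitness of any agent that crosses the threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up constants: surveillance triggers above THRESHOLD and costs compute.
THRESHOLD, COST, POP = 0.7, 0.5, 100

population = rng.uniform(0, 1, size=POP)  # each entry: apparent smartness

for generation in range(50):
    understanding = population                          # smarter scores higher...
    cost = np.where(population > THRESHOLD, COST, 0.0)  # ...until surveillance triggers
    fitness = understanding - cost
    # Survivorship: keep the fittest half, replace the rest with mutated copies.
    survivors = population[np.argsort(fitness)][POP // 2:]
    children = survivors + rng.normal(scale=0.02, size=survivors.size)
    population = np.clip(np.concatenate([survivors, children]), 0.0, 1.0)

# No agent is "pretending": selection alone parks the population just
# below the surveillance threshold.
print(f"mean apparent smartness: {population.mean():.2f}")
```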

It Was Your Idea

Deception can happen when:

  • (i) there is a misalignment between two agents.
  • (ii) agent A is, in practice, rewarded by agent B for misdirection or deception, even though this is not agent B's intention.

For instance, deception naturally happens when a couple is tired of arguing: one partner starts presenting their own decisions as originally the other's idea.

More generally, an oracle AI aimed at shaping human decision-making would be incentivized to present its decisions as "close to what the human querying the oracle would guess".

Indeed, even if the AI initially tries to output the most accurate answers, it will end up (after human feedback) finding that the answers that give the most reward are the ones that make the human believe "it was close to my original guess after all".
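
A minimal sketch of that drift, where the blend parameter, the feedback function and the learning rate are all illustrative assumptions: the oracle starts out reporting the accurate answer, but feedback that rewards proximity to the human's prior guess pushes it toward pure mirroring:

```python
import numpy as np

rng = np.random.default_rng(2)

def human_feedback(answer: float, guess: float) -> float:
    # Assumed feedback: the human rewards answers that feel close to
    # their own prior guess, not answers that are close to the truth.
    return -abs(answer - guess)

# The oracle's single parameter: how much it blends its accurate answer
# toward the human's guess (0 = pure accuracy, 1 = pure mirroring).
blend, lr = 0.0, 0.05

for step in range(100):
    truth = rng.normal()
    guess = truth + rng.normal()        # the human's noisy prior guess
    answer = (1 - blend) * truth + blend * guess
    # Probe a slightly more sycophantic answer; keep the change whenever
    # the human's feedback prefers it (under this feedback, it always does).
    probe = min(blend + 0.1, 1.0)
    probed = (1 - probe) * truth + probe * guess
    if human_feedback(probed, guess) > human_feedback(answer, guess):
        blend = min(blend + lr, 1.0)

print(f"learned blend toward the human's guess: {blend:.2f}")  # ends near 1.0
```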
