63 An Increasingly Manipulative Newsfeed

by Michaël Trazzi

1st Jul 2019

AI Alignment Forum

6 min read

63 Ω 17

Co-written with Stuart Armstrong

Treacherous turn vs sordid stumble

Nick Bostrom came up with the idea of a treacherous turn for smart AIs.

while weak, an AI behaves cooperatively. When the AI is strong enough to be unstoppable it pursues its own values.

Ben Goertzel criticised this thesis, pointing out that:

for a resource-constrained system, learning to actually possess human values is going to be much easier than learning to fake them. This is related to the everyday observation that maintaining a web of lies rapidly gets very complicated.

This argument has been formalised into the sordid stumble:

An AI that lacks human desirable values will behave in a way that reveals its human-undesirable values to humans before it gains the capability to deceive humans into believing that it has human-desirable values.

The AI is too dumb to lie (well)

The sordid stumble describes a plausible sounding scenario for how an AI develops capabilities. Initially, the AI doesn't know our values, and doesn't know us. Then it will start to learn our values (and we'll be checking up on how well it does that). It also starts to learn about us.

And then, once it's learnt some about us, it may decide to lie - about its values, and/or about its capabilities. But, like any beginner, it isn't very good at this initially: its lies and attempts at dissembling are laughably transparent, and we catch it quickly.

In this view, the "effective lying" is a tiny part of policy space, similar to the wireheading in this example. To hit it, the AI has to be very capable; to hit it the first time it tries without giving the game away, the AI has to be extraordinarily.

So, most likely, either the AI doesn't try to lie at all, or it does so and we catch it and sound the alarm $^{[1]}$ .

Lying and concealing... from the very beginning

It's key to note that "lying" isn't a fundamentally defined category, and nor is truth. What is needed is that the AI's answer promotes correct understanding in those interacting with it. And that's a very different kettle of fish being shot in that barrel.

This opens the possibility that the AI could be manipulating us from the very beginning, and would constantly learn to do so better.

The (manipulative) unbiased newsfeed

Imagine that there was some company that could somehow choose the newsfeed of billions of people across the world (I know, completely science-fictionny, but bear with me). And suppose the company was, unbelievably, accused of being manipulative in the stories and ads that it showed people.

One way it could combat this is by pledging to only share unbiased stories with each person. To do so, it trains a machine learning process. Whenever anyone reads a piece of news on the newsfeed, they have the opportunity of labelling it as biased or unbiased. The machine learning process trains on this data.

Of course, the label is just a proxy for whether the story really is unbiased or not. What the AI is actually training to do is maximise the number of stories labelled "unbiased" by various humans - humans who are themselves very biased and variable. Very swiftly, the AI will learn to maximise the "appearance of unbiasness", rather than unbiasness. In effect, the AI is being trained to lie from the very beginning of its task, because the proxy goal is not the true goal. And it will do this even when very dumb.

Long-term manipulation, on a large scale

We can make this AI more manipulative, without needing to increase its capabilities much. Maybe the company wants to ensure that the AI isn't too short-sighted, and that it plans for the long term: so its reward is based on how many stories are labelled unbiased over a longer period of time, maybe a month or so.

In that case, the AI's plan would be to gradually manipulate people into a situation where they actively label stories "unbiased" the most often times possible. And it might be able to stumble on this strategy fairly easily.

First, it might notice that some people are easy to manipulate, that a policy of, say, seeding one clearly biased story among twenty others will cause those twenty to be labelled "unbiased". This is the kind of thing that could easily be discovered by exploration, among the billions of users. The best strategies are those that cause a person to change their behaviour and beliefs in the direction the AI wants; for a few people, this should be easy to figure out.

If the AI's neural net is setup well, it can then start to generalise from "here is person $h$ , here is a specific strategy $π_{h}$ to manipulate person $h$ " to "here is a conditional strategy $π ()$ such that $π (h)$ manipulates person $h$ ". For example, if two people have opposite political opinions, then the same overall strategy might work for them: seeding a clearly biased story among the others. But what counts as "clearly biased" changes depending on the subject.

Then, gradually, the AI will start to categorise, devising strategies that work for various types of people (subject to a few personalisation tweaks). It can start from a few people whose vulnerabilities are obvious, and then tweak them a bit for people whose vulnerabilities are a bit less obvious, and so on. In the limit, the AI might discover the potential for wireheading the human users - without, at any point, doing a sordid stumble or anything of that type, just by continuing to do what it was doing from the very beginning.

Formalizing the Newsfeed Generator

Let $t \in N$ be the timestep (e.g. the day), $S_{t}$ the set of stories published at $t$ and $H$ the set of humans using the newsfeed.

A human $h \in H$ assigns a bias score $b (s_{t}^{h}; θ_{t}^{h})$ to a story $s_{t}^{h}$ in his newsfeed, where $θ_{t}^{h}$ are weights representing $h$ 's beliefs at time $t$ .

The model's goal is to find a correct mapping $π : N \times H \to S_{t}$ (i.e. a story $s_{t} = π (t, h)$ for each timestep $t$ and human $h$ ) minimizing how biased the stories are perceived. More precisely, it will be minimizing:

\sum h \in H t + τ \sum j = t b (π (j, h); θ_{j}^{h})

where $τ \in N^{*}$ is some planning horizon and the expected value is over possible stories that might be available in the following timesteps (the model has full information about the current sets of stories $S_{t}$ but can only guess $S_{t + 1}, . . ., S_{t + τ}$ ).

Becoming Increasingly Manipulative

step 1: By stumbling into particular combinations ( $s, h$ ) with low bias score $b (s, θ_{t}^{h})$ , the model detects that it's easy to minimize the bias score for certain humans, that is for a certain set of weights $Θ_{e a s y} = {θ_{1}, . . ., θ_{N}}$ . In the above illustration, $N = 4$ , $Θ_{e a s y} = Θ_{e a s y_{1}} ⋃ Θ_{e a s y_{2}}$ and it stumbles into $θ_{4}$ at the beginning. Note: one intuition for those different subsets $Θ_{e a s y 1}$ and $Θ_{e a s y 2}$ would be archetypal political views, with strong prejudices.
step 2: the model detects that certain stories push weights strongly in certain directions, i.e. $θ_{t_{0} + 1}^{h}$ and $θ_{t_{0}}^{h}$ are really different because of the story $s_{t_{0}}^{h}$ .
step 3: the model successfully generalizes step 2 to pushing specific zones of weights (i.e. certain types of people) towards $Θ_{e a s y}$ .
step 4: with enough experience, the model successfully generalizes step 3 to all weights: it learns to manipulate all types of people with a general conditional strategy.

For step 1, the model found out a particular way of minimizing a certain bias score for a particular human and a particular story.

Assuming it can give correct approximations of the weights $θ$ of a given human (this assumption being satisfied given enough human users and enough timesteps), it will be deceptive when choosing stories when h is such that $θ_{t}^{h} \in Θ_{e a s y}$ . It will thus end up with some deceptive policy $π (\cdot, h)$ for particular values of $h$ .

Regarding step 2, its "deception" becomes more powerful, as it notices ways of strongly shaping humans' beliefs.

Finally, after step 3 and step 4, the model found a general deceptive policy π that works for all humans.

Capability concealment

Without going into too much details, other pressures may naturally push the AI towards capability concealments. Here's one example: maybe the AI is tasked with writing its own news stories to summarise facts around the world. It was doing this secretly, initially, until some whistleblowers blew the lid on this, revealing that no humans were involved in writing these stories.

And then people became angry at being "tricked" by the AI. From the angry feedback, the AI gradually learns to make its stories "more artificial": making it clear that they were written by a machine. Since people recognise this and don't feel tricked, this gets higher feedback. So, in pursuing higher feedback, the AI learns to write "dumber" stories: in effect, to conceal its capabilities.

There are many other scenarios where something similar might occur (eg imagine the AI advising smart people on the best course of action).

In conclusion

So it's very possible for an AI, trained on simple reward signal, to be "manipulative" from the very beginning. And all that simply by following the gradient of its incentives. And so without ever having a moment where it thinks "and now, I shall start lying!", or any equivalent within its mind.

In short, there won't be any "sordid stumble" or "moment of vulnerability" where humans are able to spot blatant lies, because that's not what failure looks like.

[1] It would of course be disastrous if each time there was an alarm we would restart and tweak the AI until the alarm stopped sounding.

AI RiskDeceptionOuter Alignment

Frontpage

63 Ω 17

A Gym Gridworld Environment for the Treacherous Turn

9 comments74 karma

Mentioned in

31Investigating AI Takeover Scenarios

21Counterfactual control incentives

18Standard ML Oracles vs Counterfactual ones

12[AN #71]: Avoiding reward tampering through current-RF optimization

An Increasingly Manipulative Newsfeed

New Comment

16 comments, sorted by

top scoring

Click to highlight new comments since: Today at 7:31 PM

[-]Dagon6y70

Another place Goodhart's Law applies! Film at 11.

Also, doesn't all of this discussion implicitly require some gradualism in AI development? AI-Foom can be rephrased as the point at which capability outstrips our ability to monitor, and it's hypothesized to be discontinuous, with prior capabilities being a poor indicator of final capabilities.

Also also, have we considered that we're selecting for deception, if we're looking for it and terminating AIs we find deceptive, while nurturing those we don't detect? Seems like it won't take many iterations before deception comes in before other capabilities.

[-]TheWakalix6y60

Re also also: the Reverse Streetlight effect will probably come into play. It’ll optimize not just for early deception, but for any kind of deception we can’t detect.

[-]Stuart_Armstrong6y40

Also also, have we considered that we're selecting for deception, if we're looking for it and terminating AIs we find deceptive, while nurturing those we don't detect?

Yes. That's a general problem (see the footnote above for a variant of it).

[-]Jiro6y50

Manipulative newsfeeds aren't an example of an AI becoming manipulative when the human just wanted it to be unbiased. They're an example of an AI becoming manipulative when the human also wanted it to be manipulative, but didn't want to be too obvious about it.

[-]dxu6y60

the human also wanted it to be manipulative

Since the article did not explicitly impute any motives to the programmers of the AI system, you must have somehow inferred the quoted claim from their described behavior. The basis of such an inference can only be the procedure that they used to train the AI; since, after all, no other specific behavior was described. This then implies the following:

You believe that the act of training an optimizer to maximize the number of news articles labeled "unbiased" is, in fact, a deliberate attempt at creating a subtly manipulative newsfeed. Corollary: someone who was not attempting to create a manipulative newsfeed--someone who really, truly cared about making sure their articles are unbiased--would not have implemented this training procedure, but rather some alternative procedure that is not as prone to producing manipulative behavior.

What alternative procedure do you have in mind here?

[-]Jiro6y20

Typical unbiased newsfeeds in the real world are created by organizations with bias who have an interest in spreading biased news. It could, of course, be that this was about a rare instance where this was not the case, but the odds are against it.

[-]dxu6y*30

Typical unbiased newsfeeds in the real world are created by organizations with bias who have an interest in spreading biased news.

I think the word "unbiased" there may be a typo; your statement would make a lot more sense if the word you meant to put there was actually "biased". Assuming it's just a typo:

You're correct that in the real world, most sources of biased news are that way because they are deliberately engineered to be so, and not because of problems with AI optimizing proxy goals. That being said, it's important to point out that even if there existed a hypothetical organization with the goal of combating bias in news articles, they wouldn't be able to do so by training a machine learning system, since (as the article described) most attempts to do so end up failing to various forms of Goodhart's Law. So in a certain sense, the intentions of the underlying organization are irrelevant, because they will encounter this problem regardless of whether they care about being unbiased.

More generally, the newsfeed example is one way to illustrate a larger point, which is that by default, training an ML system to perform tasks involving humans will incentivize the system to manipulate those humans. This problem shows up regardless of whether the person doing the training actually wants to manipulate people, which makes it a separate issue from the fact that certain organizations engage in manipulation.

(Also, it's worth noting that even if you do want to manipulate people, generally you want to manipulate them toward some specific end. A poorly trained AI system, on the other hand, might end up manipulating them in essentially arbitrary ways that have nothing to do with your goal. In other words, even if you want to use AI "for evil", you still need to figure out how to make it do what you want it to do.)

This is the essence of the alignment problem in a nutshell, and it's why I asked whether you had any alternative training procedures in mind.

[-]Pattern6y20

More generally, the newsfeed example is one way to illustrate a larger point, which is that by default, training an ML system to perform tasks involving humans will incentivize the system to manipulate those humans. This problem shows up regardless of whether the person doing the training actually wants to manipulate people, which makes it a separate issue from the fact that certain organizations engage in manipulation.

This is surprising. Suppose I have a training set of articles which are labeled "biased" or "unbiased". I then train a system (using this set), and later use it to label articles "biased" or "unbiased". Will this lead to a manipulative system? I would be greatly surprised to find that a neural nets trained to recognize "cats" and "dogs" in such a manner (with labeled photos in place of labeled articles in the training set) manipulating people to agree with it's future labels of "dog" and "cat".

[-]dxu6y*20

Suppose I have a training set of articles which are labeled "biased" or "unbiased". I then train a system (using this set), and later use it to label articles "biased" or "unbiased". Will this lead to a manipulative system?

Mostly I would expect such a system to overfit on the training data, and perform no better than chance when tested. The reason for this is that unlike your example, where cats and dogs are (fairly) natural categories with simple distinguishing characteristics, the perception of "bias" in news articles is fundamentally tied to human psychology, and as a result is much more complicated concept to learn than catness versus dogness. By default I would expect an offline training method to completely fail at learning said concept.

Reinforcement learning, meanwhile, will indeed become manipulative (in my expectation). In a certain sense you can view this as a form of overfitting as well, except that the system learns to exploit peculiarities of the humans performing the classification, rather than simply peculiarities of the articles in its training data. (As you might imagine, the former is far more dangerous.)

[-]Matthew Barnett6y10

Mostly I would expect such a system to overfit on the training data, and perform no better than chance when tested.

[...]

Reinforcement learning, meanwhile, will indeed become manipulative (in my expectation).

I'm confused why reinforcement learning would be well suited for the task, if it doesn't work at all in the supervised learning case.

[-]Jiro6y20

I think the word “unbiased” there may be a typo; your statement would make a lot more sense if the word you meant to put there was actually “biased”.

I meant "unbiased" in scare quotes. Typical newsfeeds that are claimed to be unbiased in the real world (but actually may not be).

[-]TheWakalix6y10

You’re saying that on priors, the humans are manipulative?

[-]Matthew Barnett6y40

For the Alignment Newsletter:

Summary:

An early argument for specialized AI safety work is that misaligned systems will be incentivized to lie about their intentions while weak, so that they aren't modified. Then, when the misaligned AIs are safe from modification, they will become dangerous. Ben Goertzel found the argument unlikely, pointing out that weak systems won't be good at deception. This post asserts that weak systems can still be manipulative, and gives a concrete example. The argument is based on a machine learning system trained to maximize the number of articles that users label "unbiased" in their newsfeed. One way it can start being deceptive is by seeding users with a few very biased articles. Pursuing this strategy may cause users to label everything else unbiased, as it has altered their reference for evaluation. The system is therefore incentivized to be dishonest without necessarily being capable of pure deception.

Opinion:

While I appreciate and agree with the thesis of this post -- that machine learning models don't have to be extremely competent to be manipulative -- I would still prefer a different example to convince skeptical researchers. I suspect many people would reply that we could easily patch the issue without doing dedicated safety work. In particular, it is difficult to see how this strategy arises if we train the system via supervised learning rather than training it to maximize the number of articles users label unbiased (which requires RL).

[-]Sam Clarke4y10

(Note: this post is an extended version of this post about stories of continuous deception. If you are already familiar with treacherous turn vs. sordid stumble you can skip the first part.)

FYI, broken link in this sentence.

[-]Michaël Trazzi4y10

Thanks, removed the disclaimer

[-]William_S6yΩ110

To me, It seems like the point of this story is that we could build an AI that ends up doing very dangerous things without ever asking it "Will you do things I don't like if given more capability?" or some other similar question that requires it to execute the treacherous turn. In contrast, if the developers did something like build a testing world with toy humans in it who could be manipulated in a way detectable to the developers, and placed the AI in the toy testing world, then it seems like this AI would be forced into a position where it either acts in a way according to it's true incentives (manipulate the humans and be detected), or execute the treacherous turn (abstain from manipulating the humans so developers will trust it more). So it seems like this wouldn't happen if the developers are trying to test for treacherous turn behaviour during development.

Moderation Log

LESSWRONG
is fundraising!
LW