The argument that AIs provided with a reward channel will observe their controllers and learn to manipulate them is a valid one. Unfortunately, it's often framed in a way that feels counterintuitive or extreme, especially to AI designers. It typically starts with the standard reinforcement learning scenario, then posits that the AI becomes superintelligent and either manipulates the controller with super-social powers, or breaks out and gains control of its reward channel, killing or threatening its controllers.

And that is a fair argument. But conceptually, it leaps from a standard reinforcement learning scenario to a science-fiction-sounding one. It might help to have intermediate scenarios, to show that even lower-intelligence AIs might start exhibiting the same sort of behaviour long before they reach superintelligence.

So consider the following scenario. Some complex, trainable AI is tasked with writing automated news stories for a student newspaper. It trawls the web and composes its stories, then gets reward and feedback from the editors. Assume there are two editors for this newspaper, and they work on alternate days. The two editors have somewhat different ideas as to what constitutes a good story, so their feedback differs. After a while, the AI finds that it gets higher reward by using a certain style on Mondays, Wednesdays and Fridays, and another style on Tuesdays and Thursdays - this is a simple consequence of its reward mechanism.
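As a rough illustration of how this day-dependent behaviour can fall out of the reward mechanism alone, here is a minimal sketch of an epsilon-greedy learner that sees only the weekday as context. Everything in it is an assumption made for illustration: the style names, the reward values, the noise level, and the `editor_reward` function standing in for the two editors.

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
STYLES = ["style_a", "style_b"]  # hypothetical writing styles

def editor_reward(day, style, rng):
    # Assumed reward model: the Mon/Wed/Fri editor prefers style_a,
    # the Tue/Thu editor prefers style_b; ratings are noisy.
    preferred = "style_a" if day in ("Mon", "Wed", "Fri") else "style_b"
    base = 1.0 if style == preferred else 0.3
    return base + rng.gauss(0.0, 0.1)

def mean(totals, counts, key):
    # Untried (day, style) pairs score +inf so each is tried at least once.
    return totals[key] / counts[key] if counts[key] else float("inf")

def train(steps=5000, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    totals = {(d, s): 0.0 for d in DAYS for s in STYLES}
    counts = {(d, s): 0 for d in DAYS for s in STYLES}
    for t in range(steps):
        day = DAYS[t % len(DAYS)]  # the only context the agent observes
        if rng.random() < epsilon:
            style = rng.choice(STYLES)  # explore
        else:
            # exploit: pick the style with the best average reward so far
            style = max(STYLES, key=lambda s: mean(totals, counts, (day, s)))
        r = editor_reward(day, style, rng)
        totals[(day, style)] += r
        counts[(day, style)] += 1
    # Learned policy: best-looking style for each weekday.
    return {d: max(STYLES, key=lambda s: mean(totals, counts, (d, s))) for d in DAYS}

policy = train()
print(policy)
```

The learner is never told that there are two editors. It simply ends up with a different preferred style per weekday because that is what the reward signal favours.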

But the rota isn't perfect. Sometimes the Monday editor will edit a story so late on Monday that it's Tuesday, and sometimes the Tuesday editor will be up early and edit a story at the same time. This pattern is more common during the holidays, when the editors go to different time zones, often travel, and become more irregular in their editing.

So if the AI is complex and skilled enough, then, purely through feedback, it will start building up a picture of its editors. It will figure out when they are likely to stick to a schedule, and when they will be more irregular. It will figure out the difference between holidays and non-holidays. Given time, it may be able to track the editors' moods, and it will certainly pick up on any major change in their lives - such as romantic relationships and breakups, which will radically change whether and how it should present stories with a romantic focus.

It will also likely learn correlations between stories and feedback - maybe presenting a story roughly defined as "positive" will increase subsequent reward for the rest of the day, on all stories. Or maybe this will only work on a certain editor, or only early in the term. Or only before lunch.

Thus the simple trainable AI with a particular focus - write automated news stories - will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.

This may be a useful "bridging example" between standard RL agents and the superintelligent machines.

19 comments

Oh, dang. Well, I mean, phew? Both? See, I thought this was going to be a news story.

Yeah, me too.

Perhaps change the "observes", "manipulates" to "observing", "manipulating"? It doesn't have the same connotation of "this actually happened".

Also, had it been a real occurrence, it might have been the first thing to make me care just a little about MIRI's mission.

Well, Facebook and Google algorithms are real occurrences - they're just not "simple algorithms in a box".

These 'dumb' algorithms probably have a much higher impact than one might guess. It's just a much more subtle but extremely far-reaching effect. Entire surfing habits change. Industries rise and fall because of it. It operates on a longer time-frame and without spectacular events. It smears out its effect. If it is intentional this is called salami tactics, long lie and other things. We are not well prepared to deal with or even detect this rationally, because we feel the effect only via its aggregate over lots of events. These are things our subconscious long-term feeling notices but can only propagate to the conscious mind via vague feelings of dissatisfaction, frustration and the like. But there is no specific 'enemy' that can be hit. The effect looks more like an inevitable force of nature than a conscious act of an agent, even if it actually is one. Fear the dumb but massively parallel AI.

Perhaps. But whatever news I've heard about social network AI didn't pass the threshold for my caring about it. A hypothetical story with the above title would have. As far as I'm concerned Facebook/Google AI discussion lies outside of the scope of my comment.

The truly insidious effects are when the content of the stories changes the reward but not by going through the standard quality-evaluation function.

For instance, maybe the AI figures out that the order of the stories affects the rewards. Or perhaps it finds how stories that create a climate of joy/fear on campus lead to overall higher/lower evaluations for that period. Then the AI may be motivated to "take a hit" to push through some fear mongering so as to raise its evaluations for the following period. Perhaps it finds that causing strife in the student union, or perhaps causing racial conflict, or causing trouble with the university faculty affects its rewards one way or another. Perhaps if it's unhappy with a certain editor, it can slip through bad enough errors to get the editor fired, hopefully replaced with a more rewarding editor.

etc etc.

The problem with these particular extensions is that they don't sound plausible for this type of AI. In my opinion it would be easier when talking with designers to switch from this example to a slightly more sci-fi example.

The leap is from the obvious "it's 'manipulating' its editors by recognizing simple patterns in their behavior" to "it's manipulating its editors by correctly interpreting the causes underlying their behavior."

Much easier to extend in the other direction first: "Now imagine that it's not an article-writer, but a science officer aboard the commercial spacecraft Nostromo..."

Upvoted for remembering that Ash was the science officer and not just the movie's token android.

I assume that you are alluding to spontaneously agentizing tools, a discussion initially triggered by Karnofsky's famous post. In your example the feedback loop is closed, which is probably enough for a tool to eventually gain agency.

Can you link to the post you're talking about? I've been thinking about this problem, although I had never read or thought the words 'spontaneously agentizing tools.'

If the reward channel has only one bit per day I don't think any agent can infer much about the authors. Their days maybe. Some fundamental components of their preferences possibly. But nothing a human could infer from all the bits of background he possesses. There are convergence rate results for classifiers that require just too many samples to extract enough information - especially in the face of real-life feature vectors.

I'd assume there would be a reward for every story, that this would be on an ordinal scale with several options, and that it would include feedback/corrections about grammar and phrasing.

Thus the simple trainable AI with a particular focus - write automated news stories - will be trained, through feedback, to learn about its editors/controllers, to distinguish them, to get to know them, and, in effect, to manipulate them.

Detecting and adapting to the individual controllers doesn't seem to me particularly bad.

Emotionally manipulating the controllers using the content of the stories would be more worrying, but note that this is essentially only possible if the AI is allowed to plan more than one story at a time. If the AI can do that, then it can trade off the reward obtained by the story at time t for greater rewards at times >t. Otherwise, any trade-off will be limited to the different parts of each story, which greatly reduces the opportunities for significant emotional manipulation of the controllers.
I see no reason this story-writing AI would need to be allowed to plan more than one story at a time.
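To make the single-story versus multi-story distinction concrete, here is a toy sketch. It assumes a made-up reward model in which a "mood_setting" story scores worse on its own but raises the editor's rating of the next story; the action names, the reward numbers, and the `plan` helper are all hypothetical.

```python
from itertools import product

ACTIONS = ("straight", "mood_setting")  # hypothetical story types

def reward(action, editor_primed):
    # Assumed rewards: a mood-setting story is rated lower by itself,
    # but any story that follows one gets a +1.0 bonus.
    base = 1.0 if action == "straight" else 0.4
    return base + (1.0 if editor_primed else 0.0)

def plan(horizon, editor_primed=False):
    """Exhaustively search action sequences of the given length and
    return (first action of the best sequence, its total reward)."""
    best_seq, best_total = None, float("-inf")
    for seq in product(ACTIONS, repeat=horizon):
        primed, total = editor_primed, 0.0
        for action in seq:
            total += reward(action, primed)
            primed = (action == "mood_setting")
        if total > best_total:
            best_seq, best_total = seq, total
    return best_seq[0], best_total

one_story = plan(horizon=1)    # planner that only sees the current story
two_stories = plan(horizon=2)  # planner that can trade story t against t+1
print(one_story, two_stories)
```

Under these assumed numbers, the one-story planner never picks the mood-setting story, because it only ever sees the immediate penalty. The two-story planner does pick it, accepting a worse first story for a better-rated second one.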

I think this is an example of a general issue in safe AI design that you and other FAI folks overlook: dynamic inconsistency can provide intrinsic protection from unwanted long-term strategies from the AI.

You seem to always implicitly assume that the AI will be an agent trying to maximize a (discounted) utility or reward over a long, ideally infinite, time horizon, that is, you assume that the AI will be approximately dynamically consistent. This may be a reasonable requirement for an autonomous agent that needs to operate for extended times without direct human supervision, but not for a tool AI.
The work of a tool AI can be naturally broken into self-contained tasks, and if the AI doesn't maximize utility or reward over multiple tasks, then any treacherous plan to gain utility in ways we would disapprove of will have to be confined to a single task. This is not a 100% safety guarantee, but certainly it makes the AI safety problem much more manageable.

I see no reason this story-writing AI would need to be allowed to plan more than one story at a time.

Because the AI is programmed by people who hadn't thought of this issue, and the other way turned out to be simpler/easier?

dynamic inconsistency can provide intrinsic protection from unwanted long-term strategies from the AI.

I know. The problem is that inconsistency is unstable (which is why we're using other measures to maintain it, e.g. using a tool AI only). That's one of the reasons I was interested in stable versions of these kinds of unstable motivations http://lesswrong.com/r/discussion/lw/lws/closest_stable_alternative_preferences/ .

Because the AI is programmed by people who hadn't thought of this issue, and the other way turned out to be simpler/easier?

Ok, but if this is a narrow AI rather than an AGI agent used for that particular activity, then it seems intuitive to me that designing it to plan over a single task at a time would be simpler.

I know. The problem is that inconsistency is unstable (which is why we're using other measures to maintain it, e.g. using a tool AI only). That's one of the reasons I was interested in stable versions of these kinds of unstable motivations http://lesswrong.com/r/discussion/lw/lws/closest_stable_alternative_preferences/ .

The post you linked doesn't deal with dynamic inconsistency. It refers to agents that are expected utility maximizers under Von Neumann–Morgenstern utility theory, but this theory only deals with one-shot decision making, not decision making over time.

You can reduce the problem of decision making over time to one-shot decision making by combining instantaneous utilities into a cumulative utility function ( * ) and then using it as a one-shot utility function.

If you combine the instantaneous utilities by their (exponentially discounted) sum over an infinite time horizon, you obtain a dynamically consistent expected utility maximizer agent. But if you sum utilities up to a fixed time horizon, you still obtain an agent that at each instant is an expected utility maximizer, but it is not dynamically consistent.
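One way to write the two cases down, as a sketch only: the symbols u_t (instantaneous utility), γ (discount factor) and H (horizon length) are notation introduced here, and "fixed time horizon" is read as a window of length H that slides forward with the current time, which is an assumption about what is meant.

```latex
% Exponentially discounted sum over an infinite horizon:
U_t = \sum_{k=0}^{\infty} \gamma^{k}\, u_{t+k}, \qquad 0 < \gamma < 1,
\qquad\text{so}\qquad U_t = u_t + \gamma\, U_{t+1}.

% Undiscounted sum over a sliding window of length H:
U_t^{(H)} = \sum_{k=0}^{H} u_{t+k},
\qquad\text{so}\qquad U_t^{(H)} = u_t + U_{t+1}^{(H)} - u_{t+H+1}.
```

In the first case, what the agent wants at time t for the future is a positive multiple of what it will want at time t+1, so a plan that is optimal now stays optimal later. In the second case, the agent at time t+1 cares about u_{t+H+1}, which the agent at time t ignored, so it may revise the earlier plan.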

You may argue that dynamic inconsistency is not stable under evolution by random mutations and natural selection, but it is not obvious to me that AIs would face such a scenario. Even an AI that modifies itself or generates successors has no incentive to maximize its evolutionary fitness unless you specifically program it to do so.

Actually, you could use corrigibility to get dynamic inconsistency https://intelligence.org/2014/10/18/new-report-corrigibility/ .