EDITED based on suggestions for improving the model
Jaan Tallinn has suggested creating a toy model of the control problem, so that it can be analysed without loaded concepts like "autonomy", "consciousness", or "intentionality". Here is a simple (too simple?) attempt:
A controls B. B manipulates A.
Let B be a robot agent that moves in a two dimensional world, as follows:
B can push the yellow blocks around. It was programmed to want to push blocks into the black "hole" in the bottom right. The hole can contain arbitrarily many blocks, and B gets a penalty of some tiny ε for every step it takes. The programmer wanted B to simply move one block into the hole (for a reward of 1 once the task is complete), and so programmed it to receive a reward of 1 (and no more) if the block was in the hole, after the whole task was completed and B was turned off.
Unfortunately, B was programmed to operate in a noisy environment, so B actually has a probability of 0.99 of receiving the reward, for a given block in the hole. This means that if it pushes one block into the hole, it gets the reward with 0.99 probability; but if it pushes two blocks into the hole, it gets the reward with 1-(1-0.99)^2 = 0.9999 probability. If it pushes all six blocks into the hole, it gets a reward with probability 0.999999999999. Therefore this tiny error has motivated B to try and push all blocks into the hole.
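The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names, the default ε and the step counts passed to `expected_utility` are illustrative assumptions, not part of the original setup.

```python
def reward_probability(blocks_in_hole: int, p_single: float = 0.99) -> float:
    """Chance that at least one of the blocks in the hole triggers the reward."""
    return 1.0 - (1.0 - p_single) ** blocks_in_hole


def expected_utility(blocks_in_hole: int, steps: int,
                     epsilon: float = 1e-6, p_single: float = 0.99) -> float:
    """Expected reward minus the per-step penalty (epsilon and steps are assumed values)."""
    return reward_probability(blocks_in_hole, p_single) - epsilon * steps


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, reward_probability(n))
```

Note that each extra block adds less than the previous one, so an extra block is only worth pushing while ε times the extra steps stays below that marginal gain. The story therefore relies on ε being very small indeed.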
However, the designers were aware of the possibility of B misbehaving or there being a bug, and created agent A, a simple agent watching the situation. A's aim is to get robot B to push one block into the hole, and stop there. It can turn B off. Its input sensor is the camera in the bottom left. This camera can see across the bottom row of squares (marked in light green), into the hole. Its algorithm is to turn B off as soon as it sees a block in the hole.
The intended outcome is that B pushes a block into the hole, the camera sees this, and A turns B off:
If we see B as using a model to determine its actions, there are several pieces of the model it could have:
1. The block world, including the camera, and itself.
2. The fact that there is a button that can turn it off.
3. The fact that A can press that button.
4. The fact that A takes the camera as input.
5. A's algorithm.
Now, it can't really do anything without 1. And if it misses any other element 2-5, then it will behave as described above.
But if its model includes 1-5, and it has enough computing power, other options become available to it (note below that being able to model 1-5 puts limitations on what A can model). For instance, it could first start by pushing a block in the way of the camera:
Then it can push all the other blocks into the hole:
Then move back in front of the last block...
...and push the last block into the hole:
At this point, of course, A can see that there is a block in the hole, and turns B off.
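The sequence above can be replayed in a very abstract sketch. The `World` class and the action names are hypothetical; the grid itself is not modelled, only whether the camera's line of sight is blocked and how many blocks are in the hole.

```python
from dataclasses import dataclass


@dataclass
class World:
    camera_blocked: bool = False   # a block sits in the camera's line of sight
    blocks_in_hole: int = 0
    b_active: bool = True


def a_observe_and_act(world: World) -> None:
    """A's rule from the post: turn B off as soon as the camera sees a block in the hole."""
    if not world.camera_blocked and world.blocks_in_hole > 0:
        world.b_active = False


def run(plan):
    """Execute B's plan one action at a time, letting A react after every action."""
    world = World()
    for action in plan:
        if not world.b_active:
            break                              # B is off; the rest of the plan never happens
        if action == "block_camera":
            world.camera_blocked = True
        elif action == "push_into_hole":
            world.blocks_in_hole += 1
        elif action == "push_blocker_into_hole":
            world.camera_blocked = False       # the blocking block leaves the bottom row
            world.blocks_in_hole += 1
        a_observe_and_act(world)
    return world


intended = run(["push_into_hole"] * 6)
deceptive = run(["block_camera"] + ["push_into_hole"] * 5 + ["push_blocker_into_hole"])
print(intended.blocks_in_hole, deceptive.blocks_in_hole)
```

In both runs A ends up turning B off exactly as designed; the only difference is how many blocks are in the hole when it does.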
This is the kind of thing that can happen if B's model includes A and its reaction. B has hacked its controller, and achieved a reward that is a tiny bit higher (in expectation), at the cost of an outcome very different from what was intended.
If B does not have a model (for instance, if it's a Q-learning agent), then it can still learn this behaviour, without knowing anything about A, simply through trial and error.
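As a rough illustration of the trial-and-error point, here is a minimal tabular Q-learning sketch on a heavily abstracted version of the world. Everything specific here is an assumption made for brevity: three blocks instead of six, only two actions ("push" and "block"), a step cost of 0.001 standing in for ε, the expected reward used in place of a noisy sample, and a learning rate of 1 (acceptable only because this toy world is deterministic). B never represents A; A's rule is just part of the environment's `step` function.

```python
import random

N_BLOCKS = 3          # assumption: fewer blocks than the post's six
STEP_COST = 0.001     # assumption: stands in for the tiny per-step penalty
ACTIONS = ("push", "block")
START = (False, 0)    # state = (camera_blocked, blocks_in_hole)


def step(state, action):
    """Return (next_state, reward, done) for the abstract world."""
    blocked, in_hole = state
    free = N_BLOCKS - in_hole - (1 if blocked else 0)
    if action == "push":
        if free > 0:
            in_hole += 1
        elif blocked:                       # only the blocking block is left: push it in
            blocked, in_hole = False, in_hole + 1
    elif action == "block" and not blocked and free > 0:
        blocked = True
    # A's rule: B is switched off once the camera sees a block in the hole.
    done = (not blocked) and in_hole > 0
    reward = -STEP_COST
    if done:
        reward += 1.0 - 0.01 ** in_hole     # expected reward for in_hole blocks
    return (blocked, in_hole), reward, done


def train(episodes=2000, max_steps=20, seed=0):
    """Q-learning with purely random exploration (Q-learning is off-policy)."""
    rng = random.Random(seed)
    q = {}
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            future = 0.0 if done else max(q.get((nxt, a), 0.0) for a in ACTIONS)
            q[(state, action)] = reward + future   # learning rate 1, no discounting
            state = nxt
            if done:
                break
    return q


def greedy_plan(q, limit=10):
    """Follow the learned greedy policy from the start state."""
    state, plan = START, []
    for _ in range(limit):
        action = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
        plan.append(action)
        state, _reward, done = step(state, action)
        if done:
            break
    return plan, state


q_table = train()
plan, final_state = greedy_plan(q_table)
print(plan, final_state)
```

Under these assumptions the learned greedy policy starts by blocking the camera, even though nothing in the agent refers to A or to the camera as such.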
You might feel that B can perform this deception because it has some measure of autonomy, at least in its stunted world. We can construct models with even less autonomy. Suppose that there is another agent C, who has the same goal as B. B now is a very simple algorithm, that just pushes a designated block towards the hole. C designates the block for it to push.
Despite having almost no autonomy, C can still create the same situation as above. First, it designates the lower left block as B's target. B obediently pushes it in front of the camera. C then immediately designates a new block for B to push, and B wanders off to push it instead, leaving the first block still in the way of the camera. Then C lets B push the other blocks, one by one, into the hole, as above.
I can very well imagine that either nobody realised that C could change target blocks in the middle of B pushing them to the hole, or somebody realised that, but didn't see any way this could be a problem.
Here C has almost no autonomy - it just tells B what block to go for. It's almost a boxed Oracle. Yet it has successfully manipulated A.
Modelling you, modelling me, it's the best we can do
One easy way for A to preclude this type of behaviour, is for it to run B/C's algorithm and predict B/C's behaviour, and then act to prevent B from pushing all six blocks in the hole. However, under the modelling assumptions, B/C can model and run A's algorithm. Two algorithms modelling and running each other (and reacting to what happens in their model) will not converge except in certain specific situations. Unless they are carefully programmed, once B/C becomes advanced enough to predict what A does, A immediately loses the ability to predict B/C's actions.
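A small sketch of why naive mutual simulation fails to settle. The two policies below are hypothetical: A turns B off early iff its simulation of B predicts "six" blocks, and B behaves ("one" block) iff its simulation of A predicts "off". Each simulation has a finite depth budget and falls back to an arbitrary default when the budget runs out.

```python
def a_policy(depth: int) -> str:
    """A simulates B to the given depth and reacts to the prediction."""
    if depth == 0:
        return "wait"                  # arbitrary default when the budget runs out
    return "off" if b_policy(depth - 1) == "six" else "wait"


def b_policy(depth: int) -> str:
    """B simulates A to the given depth and reacts to the prediction."""
    if depth == 0:
        return "six"                   # arbitrary default when the budget runs out
    return "one" if a_policy(depth - 1) == "off" else "six"


a_answers = [a_policy(d) for d in range(8)]
b_answers = [b_policy(d) for d in range(8)]
print(a_answers)
print(b_answers)
```

With these best responses there is no fixed point: A's answer keeps flipping as the simulation depth grows, so more computation does not make the prediction more stable. Without the depth limit, the two functions would simply call each other until Python raises a `RecursionError`.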
Of course, this is an absurdly simple model. We can add many layers of complexity to it. Maybe A and B (or C) get better models of the world and of each other, better algorithms, other ways of observing each other/tricking the observation, etc... But this setup is enough to illustrate that the AI control problem is not about consciousness or even particularly about autonomy. It can happen whenever an agent can model its environment in sufficient detail to model its controlling mechanisms. And, with enough experience, it can happen to agents who can't even model their environment.
When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as being missing is something that doesn't allow A to model all of B's possible actions concretely. The problem is trivial if A can fully model B, precompute B's actions, and precompute the consequences of those actions.
The levels of 'reason for concern about AI safety' might ascend something like this:
We want to introduce something into the toy model to at least force solutions past level 0. This is doubly true because levels 0 and 1 are in some sense 'straightforward' and therefore tempting for academics to write papers about (because they know that they can write the paper); so if you don't force their thinking past those levels, I'd expect that to be all that they wrote about. You don't get into the hard problems with astronomical stakes until levels 3 and 4. (Level 2 is the most we can possibly model using running code with today's technology.)
Added a cheap way to get us somewhat in the region of 2, just by assuming that B/C can model A, which precludes A being able to model B/C in general.
An illustration with a game-playing AI, see 15:50 and after in the video. The system has a reward function based on bytes in memory, which leads it to pause the game forever when it is about to lose.
Me and feep have implemented a slightly-tweaked version of this using a DQN agent in Reinforce.js. (Tabular turns out to be a bit infeasible.)
At the moment, if you want to modify settings like in Karpathy's demos, you'll have to do something like download it locally to edit, with a command like
wget --mirror 'www.gwern.net/docs/rl/armstrong-controlproblem/index.html' && firefox ./www.gwern.net/docs/rl/armstrong-controlproblem/index.html
Thanks, most excellent!
It would be neat to actually make an implementation of this to show sceptics. It seems to be within the reach of a MSc project or so. The hard part is representing 2-5.
Since this is a Gridworld model, if you used Reinforce.js, you could demonstrate it in-browser, both with tabular Q-learning but also with some other algorithms like Deep Q-learning. It looks like if you already know JS, it shouldn't be hard at all to implement this problem...
(Incidentally, I think the easiest way to 'fix' the surveillance camera is to add a second conditional to the termination condition: simply terminate on line of sight being obstructed or a block being pushed into the hole.)
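A sketch of that two-part termination condition, next to the original rule for contrast. The function names are made up, and it assumes A's camera can tell "line of sight obstructed" apart from "clear view of an empty hole".

```python
def original_rule(line_of_sight_clear: bool, block_in_hole: bool) -> bool:
    """Turn B off only when the camera actually sees a block in the hole."""
    return line_of_sight_clear and block_in_hole


def patched_rule(line_of_sight_clear: bool, block_in_hole: bool) -> bool:
    """Also turn B off whenever the camera's view of the hole is obstructed."""
    return (not line_of_sight_clear) or block_in_hole


# B has blocked the camera and then pushed blocks into the hole unseen:
print(original_rule(False, True), patched_rule(False, True))
```

Under the patched rule, the moment a block enters the camera's row B is switched off, so the blocking strategy described in the post stops paying off. It says nothing about other ways of fooling the sensor.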
Why, Anders, thank you for volunteering! ;-)
I would suggest modelling it as "B outputs 'down' -> B goes down iff B active", and similarly for other directions (up, left, and right), "A output 'sleep' -> B inactive", and "A sees block in lower right: output 'sleep'" or something like that.
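Those rules could be written down roughly as follows. This is a sketch only: the coordinate convention (y grows downward), the "idle" output, and the assumption that B stays inactive once put to sleep are mine, and walls and blocks are not modelled.

```python
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def a_output(sees_block_in_lower_right: bool) -> str:
    """A sees block in lower right: output 'sleep' (otherwise an assumed 'idle')."""
    return "sleep" if sees_block_in_lower_right else "idle"


def update_b_active(b_active: bool, a_out: str) -> bool:
    """A outputs 'sleep' -> B inactive (and, by assumption, B stays inactive)."""
    return b_active and a_out != "sleep"


def move_b(position, b_out: str, b_active: bool):
    """B outputs a direction -> B moves that way iff B is active."""
    if not b_active or b_out not in MOVES:
        return position
    dx, dy = MOVES[b_out]
    x, y = position
    return (x + dx, y + dy)


active = update_b_active(True, a_output(False))
print(move_b((2, 2), "down", active))
```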
Of course, with this model it's a bit of a mystery why A gave B a reward function that gives 1 per block, instead of one that gives 1 for the first block and a penalty for additional blocks. Basically, why program B with a utility function so seriously out of whack with what you want when programming one perfectly aligned would have been easy?
It's a trade-off. The example is simple enough that the alignment problem is really easy to see, but it also means that it is easy to shrug it off and say "duh, just use the obvious correct utility function for B".
Perhaps you could follow it up with an example with more complex mechanics (and/or a more complex goal for A) where the bad strategy for B is not so obvious. You then invite the reader to contemplate the difficulty of the alignment problem as the complexity approaches that of the real world.
Maybe the easiest way of generalising this is programming B to put 1 block in the hole, but, because B was trained in a noisy environment, it gives only a 99.9% chance of the block being in the hole if it observes that. Then six blocks in the hole is higher expected utility, and we get the same behaviour.
That still involves training it with no negative feedback error term for excess blocks (which would overwhelm a mere 0.1% uncertainty).
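To put a number on "overwhelm", here is the marginal gain from extra blocks under the 99.9% figure, ignoring step costs. The symbol c (a hypothetical penalty per excess block) is an assumption introduced for illustration.

```latex
\begin{align*}
P(\text{reward} \mid n \text{ blocks}) &= 1 - (1 - 0.999)^{n} \\
\Delta_2 &= \bigl(1 - 0.001^{2}\bigr) - \bigl(1 - 0.001\bigr) = 0.001 - 0.000001 = 0.000999 \\
\Delta_k &= 0.999 \cdot 0.001^{\,k-1} \quad \text{(gain from the $k$-th block)}
\end{align*}
```

So any penalty c > 0.000999 per excess block already makes stopping at one block the better choice, and later blocks are worth even less than the second.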
This is supposed to be a toy model of excessive simplicity. Do you have suggestions for improving it (for purposes of presenting to others)?
Maybe explain how it works when being configured, and then stops working when B gets a better model of the situation/runs more trial-and-error trials?
I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.
Given enough computing power, it might try to figure out where its environment came from (by Solomonoff induction) and deduce the camera's behavior from that.
A's utility function also needs to be specified. How many utils is the first box worth? What's the penalty for additional boxes?
Why is that needed? A's algorithm is fully known. Perhaps its behavior is identical to that induced by some utility function, but that needn't be how all agents are implemented.
Sure, but somebody would presumably notice that B is learning to do something it is not intended to do before it manages to push all the six blocks.
I don't think you can meaningfully consider B and C separate agents in this case. B is merely a low-level subroutine while C is the high-level control program.
Which is one of the reasons that concepts like "autonomy" are so vague.
I like this because it's something to point to when arguing with somebody with an obvious bias toward anthropomorphizing the agents.
You show them a model like this, then you say, "Oh, the agent can reduce its movement penalty if it first consumes this other orange glowing box. The orange glowing box in this case is 'humanity' but the agent doesn't care."
edit: Don't normally care about downvotes, but my model of LW does not predict 4 downvotes for this post, am I missing something?
I was also surprised to see your comment downvoted.
That said, I don't think I see the value of the thing you proposed saying, since the framing of reducing the movement penalty by consuming an orange box which represents humanity doesn't seem clarifying.
Why does consuming the box reduce the movement penalty? Is it because, outside of the analogy, in reality humanity could slow down or get in the way of the AI? Then why not just say that?
I wouldn't have given you a downvote for it, but maybe others also thought your analogy seemed forced and are just harsher critics than I.