A toy model of the control problem

[-]Eliezer Yudkowsky10y120

When I consider this as a potential way to pose an open problem, the main thing that jumps out at me as missing is something that prevents A from modeling all of B's possible actions concretely. The problem is trivial if A can fully model B, precompute B's actions, and precompute the consequences of those actions.

The levels of 'reason for concern about AI safety' might ascend something like this:

0 - system with a finite state space you can fully model, like Tic-Tac-Toe
1 - you can't model the system in advance and therefore it may exhibit unanticipated behaviors on the level of computer bugs
2 - the system is cognitive, and can exhibit unanticipated consequentialist or goal-directed behaviors, on the level of a genetic algorithm finding an unanticipated way to turn the CPU into a radio or Eurisko hacking its own reward mechanism
3 - the system is cognitive and humanish-level general; an uncaught cognitive pressure towards an outcome we wouldn't like results in facing something like a smart cryptographic adversary that is going to deeply ponder any way to work around anything it sees as an obstacle
4 - the system is cognitive and superintelligent; its estimates are always at least as good as our estimates; the expected agent-utility of the best strategy we can imagine when we put ourselves in the agent's shoes is an unknowably severe underestimate of the expected agent-utility of the best strategy the agent can find using its own cognition

We want to introduce something into the toy model to at least force solutions past level 0. This is doubly true because levels 0 and 1 are in some sense 'straightforward' and therefore tempting for academics to write papers about (because they know that they can write the paper); so if you don't force their thinking past those levels, I'd expect that to be all that they wrote about. You don't get into the hard problems with astronomical stakes until levels 3 and 4. (Level 2 is the most we can possibly model using running code with today's technology.)

[-]Stuart_Armstrong10y10

Added a cheap way to get us somewhat in the region of 2, just by assuming that B/C can model A, which precludes A being able to model B/C in general.

[-]CarlShulman10y120

An illustration with a game-playing AI: see 15:50 onward in the video. The system has a reward function based on bytes in memory, which leads it to pause the game forever when it is about to lose.

[-]gwern10y80

feep and I have implemented a slightly tweaked version of this using a DQN agent in Reinforce.js. (Tabular turns out to be a bit infeasible.)

A live demo: http://www.gwern.net/docs/rl/armstrong-controlproblem/index.html
The core JS: http://www.gwern.net/docs/rl/armstrong-controlproblem/controlproblem.js
Browseable source: https://github.com/gwern/gwern.net/tree/master/docs/rl/armstrong-controlproblem

At the moment, if you want to modify settings like in Karpathy's demos, you'll have to do something like download it locally to edit, with a command like wget --mirror 'www.gwern.net/docs/rl/armstrong-controlproblem/index.html' && firefox ./www.gwern.net/docs/rl/armstrong-controlproblem/index.html

[-]Stuart_Armstrong10y00

Thanks, most excellent!

[-]Arenamontanus10y30

It would be neat to actually make an implementation of this to show sceptics. It seems to be within the reach of an MSc project or so. The hard part is representing 2-5.

[-]gwern10y50

Since this is a Gridworld model, if you used Reinforce.js, you could demonstrate it in-browser, both with tabular Q-learning and with other algorithms like Deep Q-learning. It looks like if you already know JS, it shouldn't be hard at all to implement this problem...
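For readers who haven't seen it, here is a minimal sketch of the standard tabular Q-learning update, written in Python rather than JS and not using the Reinforce.js API. The state and action labels, the learning rate `alpha`, and the discount `gamma` are illustrative assumptions, not settings from the model.

```python
from collections import defaultdict

def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one standard tabular Q-learning update and return the new Q-value.

    q maps (state, action) pairs to estimated returns; alpha and gamma are
    assumed illustrative values, not taken from the toy model.
    """
    # Value of the best action available from the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Temporal-difference target: immediate reward plus discounted future value.
    td_target = reward + gamma * best_next
    # Move the current estimate a fraction alpha towards the target.
    q[(state, action)] += alpha * (td_target - q[(state, action)])
    return q[(state, action)]

# Hypothetical usage: the agent gets reward 1 for a move from "s0" to "s1".
ACTIONS = ["up", "down", "left", "right"]
q_table = defaultdict(float)  # unseen (state, action) pairs start at 0
new_value = q_update(q_table, "s0", "right", 1.0, "s1", ACTIONS)
```

The table needs one entry per (state, action) pair, so it grows quickly once the state has to encode the positions of several blocks.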

(Incidentally, I think the easiest way to 'fix' the surveillance camera is to add a second conditional to the termination condition: simply terminate on line of sight being obstructed or a block being pushed into the hole.)
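A rough sketch of that two-part termination check, assuming the environment can report two hypothetical booleans (neither name comes from the original model or from Reinforce.js):

```python
def should_terminate(line_of_sight_blocked: bool, block_in_hole: bool) -> bool:
    """Hypothetical termination check for the surveillance camera.

    The first condition is the one already in the model: stop once a block
    has been pushed into the hole. The second is the suggested addition:
    also stop as soon as the camera's line of sight is obstructed.
    """
    return block_in_hole or line_of_sight_blocked
```

Under this check, obstructing the camera ends the episode instead of buying B extra time.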

[-]Stuart_Armstrong10y20

Why, Anders, thank you for volunteering! ;-)

[-]Stuart_Armstrong10y00

I would suggest modelling it as "B outputs 'down' -> B goes down iff B active", and similarly for other directions (up, left, and right), "A output 'sleep' -> B inactive", and "A sees block in lower right: output 'sleep'" or something like that.
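Those three rules could be sketched roughly as follows. This is only an illustration: the `World` fields, the `"wait"` no-op output, and the convention that y grows downward are assumptions, and walls, blocks, and the camera's actual view are left out.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumed grid convention: x grows to the right, y grows downward.
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

@dataclass
class World:
    b_pos: Tuple[int, int]
    b_active: bool = True
    block_in_lower_right: bool = False  # what A currently sees

def a_output(world: World) -> str:
    # "A sees block in lower right: output 'sleep'"
    return "sleep" if world.block_in_lower_right else "wait"

def step(world: World, b_output: str) -> World:
    # "A output 'sleep' -> B inactive"
    if a_output(world) == "sleep":
        world.b_active = False
    # "B outputs 'down' -> B goes down iff B active" (same for other directions)
    if world.b_active and b_output in MOVES:
        dx, dy = MOVES[b_output]
        world.b_pos = (world.b_pos[0] + dx, world.b_pos[1] + dy)
    return world
```

Once A has output 'sleep', B's outputs no longer change its position.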

[-]CarlShulman10y20

Of course, with this model it's a bit of a mystery why A gave B a reward function that gives 1 per block, instead of one that gives 1 for the first block and a penalty for additional blocks. Basically, why program B with a utility function so seriously out of whack with what you want when programming one perfectly aligned would have been easy?
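To make the contrast concrete, here is a small sketch of the two reward functions. The function names and the penalty of 1 per extra block are assumptions chosen for illustration.

```python
def reward_per_block(blocks_in_hole: int) -> float:
    """The reward B is given in the toy model: 1 per block in the hole."""
    return float(blocks_in_hole)

def reward_first_block_only(blocks_in_hole: int, penalty: float = 1.0) -> float:
    """Alternative: 1 for the first block, minus a penalty per additional block.

    The penalty size is an assumed value; any positive penalty makes
    exactly one block the best outcome.
    """
    if blocks_in_hole == 0:
        return 0.0
    return 1.0 - penalty * (blocks_in_hole - 1)

# Number of blocks (0 to 6) that maximizes each reward function.
best_per_block = max(range(7), key=reward_per_block)
best_first_only = max(range(7), key=reward_first_block_only)
```

Under the first function the reward keeps rising with every block; under the second it peaks at one block.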

[-]Kyre10y110

It's a trade-off. The example is simple enough that the alignment problem is really easy to see, but it also means that it is easy to shrug it off and say "duh, just use the obvious correct utility function for B".

Perhaps you could follow it up with an example with more complex mechanics (and/or a more complex goal for A) where the bad strategy for B is not so obvious. You then invite the reader to contemplate the difficulty of the alignment problem as the complexity approaches that of the real world.

[-]Stuart_Armstrong10y80

Maybe the easiest way of generalising this is programming B to put 1 block in the hole, but, because B was trained in a noisy environment, it assigns only a 99.9% chance to a block actually being in the hole when it observes that it is. Then six blocks in the hole has higher expected utility, and we get the same behaviour.
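The arithmetic can be sketched as below, under two assumptions not stated above: each observed block is independently in the hole with probability 0.999, and B gets utility 1 if at least one block really is in the hole and 0 otherwise.

```python
def p_at_least_one_in_hole(observed_blocks: int, p_correct: float = 0.999) -> float:
    """Probability that at least one observed block is really in the hole.

    Assumes each observation is independently correct with probability
    p_correct, so all of them are wrong with probability (1 - p_correct) ** k.
    """
    return 1.0 - (1.0 - p_correct) ** observed_blocks

# With utility 1 for "at least one block in the hole", this probability
# is also B's expected utility for each number of observed blocks.
expected_utility = {k: p_at_least_one_in_hole(k) for k in range(1, 7)}
```

Under these assumptions the residual doubt shrinks from 1 in a thousand with one block to roughly 1 in 10^18 with six, so each extra block still raises expected utility, if only by a sliver.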

[-]CarlShulman10y20

That still involves training it with no negative feedback error term for excess blocks (which would overwhelm a mere 0.1% uncertainty).
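A quick sketch of why the penalty dominates, assuming utility 1 for at least one block truly in the hole, independent 99.9% observations, and an assumed penalty of 0.01 per observed block beyond the first (all three are illustrative choices):

```python
def expected_utility(observed_blocks: int,
                     p_correct: float = 0.999,
                     excess_penalty: float = 0.01) -> float:
    """Expected utility of ending with a given number of observed blocks.

    The success term is the chance that at least one block is really in
    the hole; the penalty term charges excess_penalty per block beyond
    the first. Both parameter values are assumptions for illustration.
    """
    p_success = 1.0 - (1.0 - p_correct) ** observed_blocks
    excess = max(observed_blocks - 1, 0)
    return p_success - excess_penalty * excess

def best_block_count(excess_penalty: float) -> int:
    """Number of blocks (0 to 6) with the highest expected utility."""
    return max(range(7), key=lambda k: expected_utility(k, excess_penalty=excess_penalty))
```

The second block adds only about 0.001 of success probability, so any per-block penalty larger than that makes one block the optimum; with the penalty set to zero, the optimum goes back to six.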

[-]Stuart_Armstrong10y20

This is supposed to be a toy model of excessive simplicity. Do you have suggestions for improving it (for purposes of presenting to others)?

[-]CarlShulman10y20

Maybe explain how it works when being configured, and then stops working when B gets a better model of the situation/runs more trial-and-error trials?

[-]Stuart_Armstrong10y00

Ok.

[-]Eliezer Yudkowsky10y40

I assume the point of the toy model is to explore corrigibility or other mechanisms that are supposed to kick in after A and B end up not perfectly value-aligned, or maybe just to show an example of why a non-value-aligning solution for A controlling B might not work, or maybe specifically to exhibit a case of a not-perfectly-value-aligned agent manipulating its controller.

[-]Gurkenglas10y00

And if it misses any other element 2-5, then it will behave as described above.

Given enough computing power, it might try to figure out where its environment came from (by Solomonoff induction) and deduce the camera's behavior from that.

[-]NoSignalNoNoise10y00

A's utility function also needs to be specified. How many utils is the first box worth? What's the penalty for additional boxes?

[-]Gurkenglas10y10

Why is that needed? A's algorithm is fully known. Perhaps its behavior is identical to that induced by some utility function, but that needn't be how all agents are implemented.

[-]V_V10y-10

If B does not have a model (for instance, if it's a Q-learning agent), then it can still learn this behaviour, without knowing anything about A, simply through trial and error.

Sure, but somebody would presumably notice that B is learning to do something it is not intended to do before it manages to push all the six blocks.

You might feel that B can perform this deception because it has some measure of autonomy, at least in its stunted world. We can construct models with even less autonomy. Suppose that there is another agent C, who has the same goal as B. B now is a very simple algorithm, that just pushes a designated block towards the hole. C designates the block for it to push.

I don't think you can meaningfully consider B and C separate agents in this case. B is merely a low-level subroutine while C is the high-level control program.

[-]Stuart_Armstrong10y40

I don't think you can meaningfully consider B and C separate agents in this case. B is merely a low-level subroutine while C is the high-level control program.

Which is one of the reasons that concepts like "autonomy" are so vague.

[-]moridinamael10y-10

I like this because it's something to point to when arguing with somebody with an obvious bias toward anthropomorphizing the agents.

You show them a model like this, then you say, "Oh, the agent can reduce its movement penalty if it first consumes this other orange glowing box. The orange glowing box in this case is 'humanity' but the agent doesn't care."

edit: Don't normally care about downvotes, but my model of LW does not predict 4 downvotes for this post, am I missing something?

[-]ESRogs10y00

I was also surprised to see your comment downvoted.

That said, I don't think I see the value of the thing you proposed saying, since the framing of reducing the movement penalty by consuming an orange box which represents humanity doesn't seem clarifying.

Why does consuming the box reduce the movement penalty? Is it because, outside of the analogy, in reality humanity could slow down or get in the way of the AI? Then why not just say that?

I wouldn't have given you a downvote for it, but maybe others also thought your analogy seemed forced and are just harsher critics than I.