AI Alignment Prize: Super-Boxing

[-]cousin_it8y50

Thank you for writing this! It's great that you're proposing a new idea. Cooperation between different generations of AIs is indeed a problem for your scheme, but I think a bigger problem is how to make the AI assign high utility to warning the handlers. I see two possibilities:

1) You could formalize what constitutes "escape" or "loophole", but that seems almost as hard as the whole alignment problem.

2) You could say that a successful warning is whatever makes the handlers reward the AI, but then you're just telling the AI to do whatever makes the handlers reward it, which seems dangerous.

What do you think?

[-]X4vier8y10

Sorry for the late response! I didn't realise I had comments :)

In this proposal we go with (2): The AI does whatever it thinks the handlers will reward it for.

I agree this isn't as good as giving the agents an actually safe reward function, but if our assumptions are satisfied then this approval-maximising behaviour might still result in the human designers getting what they actually want.

What I think you're saying (please correct me if I misunderstood) is that an agent aiming to do whatever its designers reward it for will be incentivised to do undesirable things to us (like wiring up our brains to machines which make us want to press the reward button all the time).

It's true that the agents will try to take these kinds of nefarious actions if they think they can get away with it. But in this setup the agent knows that it can't get away with tricking the humans like this, since its ancestors already warned the humans that a future agent might try this, and the humans prepared appropriately.

[-]Charlie Steiner8y30

So, how do you actually do assumption #3? This seems surprisingly tricky. For example, if nothing matters to the agent in the case where it's stopped, maybe it takes actions that assume it won't be stopped because they're the only ones that have any expected utility.

Hm, I think the obvious way is to assume the agent has a transparent ontology, so that we can specify at the start that it only cares about the world for k timesteps. This even gives it an incentive to retain this safeguard when self-modifying - if it planned for the far future it would probably do worse in the near future. But it also highlights an issue - the AI isn't the same every time, and may still surprise us. Even if you run it from the same seed, you have to let the AI know its time horizon so it can make decisions that depend on it, and this may cause discontinuous jumps. For example, if an AI runs a search one more timestep than its predecessor, and the search succeeds on that step, maybe now it's profitable to hack the humans when it wasn't even aware of the possibility before.
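A toy sketch of the discontinuity described above, assuming the agent simply sums reward over its first k timesteps and ignores everything after. The plan names and reward streams are made up for illustration; nothing here comes from the original proposal.

```python
def truncated_return(rewards, k):
    """Sum rewards over the first k timesteps only; later rewards are ignored."""
    return sum(rewards[:k])


def best_plan(plans, k):
    """Return the name of the plan with the highest k-step return."""
    return max(plans, key=lambda name: truncated_return(plans[name], k))


# Hypothetical per-timestep reward streams for two plans.
plans = {
    "cooperate": [1, 1, 1, 1, 1],
    # The search only pays off on the fifth step.
    "hack_humans": [0, 0, 0, 0, 10],
}

choice_k4 = best_plan(plans, 4)  # horizon too short to see the payoff
choice_k5 = best_plan(plans, 5)  # one extra timestep flips the decision
```

With these made-up numbers, an agent with horizon 4 prefers to cooperate, while an otherwise identical agent with horizon 5 prefers the hack. A single extra timestep changes the preferred plan outright rather than gradually.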

[-]X4vier8y10

Thanks for your comment, I think I'm a little confused about what it would mean to actually satisfy this assumption.

It seems to me that many current algorithms (for example, a Rainbow DQN agent) would satisfy assumption 3? But like I said, I'm super confused about anything resembling questions about self-awareness/naturalisation.
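One possible reading of why a DQN-style agent might qualify, assuming assumption 3 is about the agent placing no value on anything that happens after it is stopped. The sketch below is the generic one-step Q-learning target with terminal masking, not Rainbow's full distributional, multi-step update; `td_target` and its parameters are illustrative names.

```python
def td_target(reward, next_q_values, done, gamma=0.99):
    """One-step Q-learning target.

    If the episode ended on this transition (done=True), nothing after the
    final step contributes: the target is just the immediate reward.
    """
    if done:
        return reward
    # Otherwise bootstrap from the best estimated action value in the next state.
    return reward + gamma * max(next_q_values)


# Same transition, with and without the episode ending here.
stopped = td_target(1.0, [5.0, 7.0], done=True)
running = td_target(1.0, [5.0, 7.0], done=False, gamma=0.5)
```

In this sketch the learned values never include reward from beyond the end of an episode. Whether that is enough to satisfy the assumption for an agent that can model its own shutdown is exactly the open question.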

[-]Gordon Seidoh Worley8y20

Wow, I really appreciate your careful breakdown of all the reasons your scheme is likely not to work. I think it's also worth considering the ethical implications of creating and killing so many AGIs along the way, something I would be uncomfortable doing. There is also the question of whether the AGI will run deterministically, so that at each step it is functionally the same across runs.

An interesting follow-up might be to consider what would be necessary for AI boxing to work despite the many difficulties with the general approach. Although maybe others have already looked at this?

[-]X4vier8y00
