The Enemy Gets The Last Hit

Disclaimer: I am god-awful at chess.

I

Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into a particular pattern. They've got the hang of calculating moves ahead; they can make plans along the lines of "Ok, so if I move my rook to give a check, the opponent will have to move her king, and then I can take her bishop." and those plans tend to be basically correct: the opponent really does have to move her king.

But there's a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move. In other words:

The enemy gets the last hit

This principle is common in cyber security: you have to let your red team make the last move. If your red team finds a vulnerability, and your blue team patches it, then you have to red team the patch. It's the same for AI red-team-blue-team games: I recall a story of a team at one of the MATS streams, presenting their final work on some AI control (or similar) protocol:

Audience member: But why didn't you try to mitigate this risk with something like [Idea] which would have taken a few minutes to implement.
Team member: Because then we'd have had to red-team that idea as well, and red-teaming it would have taken much longer than a few minutes.

The team member was correct here. Quite often, calculating what The Enemy will do is harder than calculating what you're going to do.

The Enemy need not actually be an enemy. The Enemy can be "The universe" or something. If you're building a flood defence, then The Enemy is the flood water. If you build a barrier to stop it getting to your city's business district, then you'd better check where the water will do instead to make sure you didn't just divert it onto an orphanage or something.

II

Similarly, lots of AI Safety papers have the theme "We found a problem, then we fixed it." This has a nice ring to it. It's how most papers get written in most fields, which is fine for those fields. But AI Safety is much more like cybersecurity than e.g. chemical engineering, where "we found this reaction was going slow so we added a new catalyst" is totally reasonable.

(Lots don't fall into this trap, and that's great!)

The conversation usually goes like this:

AIS: We found this solution to a serious problem
AINKEI: This seems super hacky
AIS: No I don't think so
AIS goes away to think...
AIS: Actually it follows this deeper principle
AINKEI: I feel like this won't work for superintelligence still
AINKEI goes away to think...
AINKEI: Ok, here's a reason I thought of why it won't work
AIS: Oh huh
AIS goes away to think...
AIS: Ah but I might be able to fix it with this solution

The issue is that AINKEI is thinking in terms of letting the enemy get the last hit, while AIS is thinking in terms of a feedback loop of detecting and fixing problems. The feedback loop solution only works if all of your problems are recoverable, which is a core disagreement between the crowds.

<psychoanalysis>
I think of a lot of the AI not-kill-everyone-ism crowd's frustration with the AI safety crowd is that the AINKEI people feel that they are having to do the jobs of that AIS people should be doing by playing the part of The Enemy getting the last hit
</psychoanalysis>

III

The recent work on inoculation prompting---which has stirred up so many mixed reactions that it functions as a scissor statement for the AI safety/alignment/notkilleveryoneism crowd---is a great example.

Problem: AIs generalize from reward hacking to misalignment.
Solution: Just tell 'em it's OK to reward hack during training.

Does this throw up even more problems? The paper didn't really investigate this question; they didn't let The Enemy get the last hit.

In this case, The Enemy is "Your AIs getting smarter every generation."

The general form of the solution is "if we can't make our reward environments exactly match our prompts, we'll adjust our prompts to match our reward environments." which is, to be fair, quite elegant. What happens when the AI gets smarter? As a first guess, if you can't make your reward environments more robust, you'll have to prompt your AI with more and more caveats, in more and more different situations.

This seems bad! Does every prompt now have "by the way it's OK to hack the environment and manipulate the human raters and break out of your VM and murder the testers" during training? What fixes this? I don't know, because I have a finite amount of time to write this essay, and I double-super don't know what problems that fix throws up!

LESSWRONG
LW

LESSWRONG
LW

7

The Enemy Gets The Last Hit

7

7

I

II

III