The general idea makes sense to me, but I'm a bit confused about the chess example:
But there's a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move.
Say you're able to calculate things either to depth 2 (ending after your opponent's move) or to depth 3 (ending after your own move). Isn't it still better to calculate things as far out as you can?
Not at all. You may see a positional advantage or the capture of a minor piece at the end of your line, and not see that the opponent can respond by capturing your queen. The moves that look most valuable right after your own move are very often among the worst after your opponent's reply, because they tend to be made with your most powerful pieces and expose those pieces to risk.
I learned that lesson quite well when writing my own poor attempt at a chess-playing program years ago. Odd-ply searches are generally worse than even-ply ones for this reason.
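To make the odd-vs-even-ply point concrete, here's a minimal sketch (mine, not from any real engine): a depth-limited minimax over a tiny hand-built tree, where the depth-1 search (ending after our own move) grabs a bishop with the queen, while the depth-2 search (ending after the opponent's reply) sees the recapture and plays the quiet move instead.

```python
# Minimal minimax sketch over a hand-built tree (illustrative values only).
# Scores are material from our point of view; we maximize, the opponent minimizes.
TREE = {
    "QxB (grab the bishop)": {
        "static": 3,  # looks like +3 right after we play it...
        "replies": {"RxQ (recapture)": {"static": -6, "replies": {}}},
    },
    "Nf3 (quiet move)": {
        "static": 0,
        "replies": {"...Nc6": {"static": 0, "replies": {}}},
    },
}

def minimax(node, depth, our_turn):
    """Evaluate `node` a further `depth` plies deep."""
    if depth == 0 or not node["replies"]:
        return node["static"]
    values = [minimax(child, depth - 1, not our_turn) for child in node["replies"].values()]
    return max(values) if our_turn else min(values)

def best_move(total_plies):
    # The first ply is ours, so after playing a root move it's the opponent's turn.
    return max(TREE, key=lambda m: minimax(TREE[m], total_plies - 1, our_turn=False))

print(best_move(total_plies=1))  # odd ply, ends after our move  -> "QxB (grab the bishop)"
print(best_move(total_plies=2))  # even ply, opponent moves last -> "Nf3 (quiet move)"
```

The odd-ply search is seduced by exactly the moves that expose the most valuable pieces, which is the bias described above.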
What's going on is something like adverse selection in an auction. In reality, chess is a solvable game, so that in perfect play win/loss/draw probabilities are all 0 or 1 for every board state. However, you don't know what these probabilities are, so you use a model. A naive player might just want to play the action which takes them to the board state they model as having the highest probability of winning. However, this fails to take into account the fact that one's model can be wrong, and so one will tend to pick actions which lead to board states which one mispredicts as being better than they actually are.
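This adverse-selection effect is sometimes called the optimizer's curse, and it's easy to simulate. A rough sketch (my own illustration, with arbitrary numbers): every candidate move has a true value, the model sees that value plus noise, and the move chosen for looking best is systematically worse than it looked.

```python
# Rough optimizer's-curse simulation (arbitrary numbers, for illustration).
import random

random.seed(0)
trials, n_moves, noise = 10_000, 20, 1.0
overestimate = 0.0
for _ in range(trials):
    true_values = [random.gauss(0, 1) for _ in range(n_moves)]
    estimates = [v + random.gauss(0, noise) for v in true_values]
    pick = max(range(n_moves), key=lambda i: estimates[i])
    overestimate += estimates[pick] - true_values[pick]

# Clearly positive: the move that looks best is, on average, not as good as it looks.
print(f"average overestimate of the chosen move: {overestimate / trials:.2f}")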
If one can accurately model how inaccurate one's model tends to be at certain board states, then one can do fine without ending on one's opponent's move, by discounting the board states one knows one models poorly. This is nontrivial for humans to do correctly, however.
Instead, a heuristic one can use is to just let one's opponent make the last move in one's search tree. This gives one a lower bound on how good each board state is (an optimistic guess for one's opponent is a pessimistic guess for one's self), so one's choices of board states will not be catastrophically biased. By catastrophically biased, I mean that in chess it's much easier to wreck your game than to make a stunningly clever move which causes you to win, so that being too optimistic is much worse than being too pessimistic.
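To tie the two fixes together, here's a rough extension of the simulation above (again mine, with made-up numbers): suppose some lines are evaluated much more noisily than others, e.g. sharp tactical continuations versus quiet ones. Trusting the raw estimate does worst; either discounting the poorly-modelled lines (the hard, calibrated fix) or subtracting a pessimistic margin (the "opponent moves last" heuristic) does noticeably better.

```python
# Comparing selection rules under uneven model noise (illustrative numbers).
import random

random.seed(1)
trials, n_moves = 10_000, 20

def average_true_value(score):
    total = 0.0
    for _ in range(trials):
        true_vals = [random.gauss(0, 1) for _ in range(n_moves)]
        noises = [random.choice([0.2, 2.0]) for _ in range(n_moves)]  # quiet vs sharp lines
        estimates = [v + random.gauss(0, s) for v, s in zip(true_vals, noises)]
        pick = max(range(n_moves), key=lambda i: score(estimates[i], noises[i]))
        total += true_vals[pick]
    return total / trials

print("trust the raw estimate:         ", average_true_value(lambda est, s: est))
print("discount poorly-modelled lines: ", average_true_value(lambda est, s: est / (1 + s * s)))
print("subtract a pessimistic margin:  ", average_true_value(lambda est, s: est - s))
```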
This is the kind of content I keep coming back to this site for.
I also like that it's practical and practicable in day-to-day life while also being relevant to bigger, more important questions.
Agreed. Reminds me of something my sensei used to say: It's useful to strike first, but far better to strike last.
Disclaimer: I am god-awful at chess.
I
Late-beginner chess players, those who are almost on the cusp of being basically respectable, often fall into a particular pattern. They've got the hang of calculating moves ahead; they can make plans along the lines of "Okay, so if I move my rook to give check, the opponent will have to move her king, and then I can take her bishop," and those plans tend to be basically correct: the opponent really does have to move her king.
But there's a very important rule to follow when calculating. Always end your calculations after your opponent has moved. You must never end your calculations after your own move. In other words: let The Enemy get the last hit.
This principle is common in cybersecurity: you have to let your red team make the last move. If your red team finds a vulnerability, and your blue team patches it, then you have to red-team the patch. It's the same for AI red-team-blue-team games: I recall a story of a team at one of the MATS streams, presenting their final work on some AI control (or similar) protocol:
The team member was correct here. Quite often, calculating what The Enemy will do is harder than calculating what you're going to do.
The Enemy need not actually be an enemy. The Enemy can be "the universe" or something. If you're building a flood defence, then The Enemy is the flood water. If you build a barrier to stop it getting to your city's business district, then you'd better check where the water will go instead, to make sure you didn't just divert it onto an orphanage or something.
II
Similarly, lots of AI Safety papers have the theme "We found a problem, then we fixed it." This has a nice ring to it. It's how most papers get written in most fields, which is fine for those fields. But AI Safety is much more like cybersecurity than e.g. chemical engineering, where "we found this reaction was going slow so we added a new catalyst" is totally reasonable.
(Lots don't fall into this trap, and that's great!)
The conversation usually goes like this:
The issue is that AINKEI (the AI-not-kill-everyone-ism crowd) is thinking in terms of letting The Enemy get the last hit, while AIS (the AI safety crowd) is thinking in terms of a feedback loop of detecting and fixing problems. The feedback-loop solution only works if all of your problems are recoverable, which is a core disagreement between the crowds.
<psychoanalysis>
I think a lot of the AI-not-kill-everyone-ism crowd's frustration with the AI safety crowd comes from the AINKEI people feeling that they are having to do the job the AIS people should be doing, by playing the part of The Enemy getting the last hit.
</psychoanalysis>
III
The recent work on inoculation prompting---which has stirred up so many mixed reactions that it functions as a scissor statement for the AI safety/alignment/notkilleveryoneism crowd---is a great example.
Problem: AIs generalize from reward hacking to misalignment.
Solution: Just tell 'em it's OK to reward hack during training.
Does this throw up even more problems? The paper didn't really investigate this question; they didn't let The Enemy get the last hit.
In this case, The Enemy is "Your AIs getting smarter every generation."
The general form of the solution is "if we can't make our reward environments exactly match our prompts, we'll adjust our prompts to match our reward environments," which is, to be fair, quite elegant. What happens when the AI gets smarter? As a first guess, if you can't make your reward environments more robust, you'll have to prompt your AI with more and more caveats, in more and more different situations.
This seems bad! Does every prompt now have "by the way it's OK to hack the environment and manipulate the human raters and break out of your VM and murder the testers" during training? What fixes this? I don't know, because I have a finite amount of time to write this essay, and I double-super don't know what problems that fix throws up!