I want to quickly draw attention to a concept in AI alignment: Robustness to Scale. Briefly, you want your proposal for an AI to be robust (or at least fail gracefully) to changes in its level of capabilities. I discuss three different types of robustness to scale: robustness to scaling up, robustness to scaling down, and robustness to relative scale.
The purpose of this post is to communicate, not to persuade. It may be that we want to bite the bullet of the strongest form of robustness to scale, and build an AGI that is simply not robust to scale, but if we do, we should at least realize that we are doing that.
Robustness to scaling up means that your AI system does not depend on not being too powerful. One way to check for this is to think about what would happen if the thing that the AI is optimizing for were actually maximized. One example of failure of robustness to scaling up is when you expect an AI to accomplish a task in a specific way, but it becomes smart enough to find new creative ways to accomplish the task that you did not think of, and these new creative ways are disastrous. Another example is when you make an AI that is incentivized to do one thing, but you add restrictions that make it so that the best way to accomplish that thing has a side effect that you like. When you scale the AI up, it finds a way around your restrictions.
Robustness to scaling down means that your AI system does not depend on being sufficiently powerful. You can't really make your system still work when it scales down, but you can maybe make sure it fails gracefully. For example, imagine you had a system that was trying to predict humans, and use these predictions to figure out what to do. When scaled up all the way, the predictions of humans are completely accurate, and it will only take actions that the predicted humans would approve of. If you scale down the capabilities, your system may predict the humans incorrectly. These errors may multiply as you stack many predicted humans together, and the system can end up optimizing for some seeming random goal.
Robustness to relative scale means that your AI system does not depend on any subsystems being similarly powerful to each other. This is most easy to see in systems that depend on adversarial subsystems. If part of you AI system is suggest plans, and another part is trying to find problems in those plans, if you scale up the suggester relative to the verifier, the suggester may find plans that are optimized for taking advantage of the verifier's weaknesses.
My current state is that when I hear proposals for AI alignment that do not feel very strongly robust to scale, I become very worried about the plan. Part of this comes from feeling like we are actually very early on a logistic capabilities curve. I thus expect that as we scale up capabilities, we can get eventually get large differences very quickly. Thus, I expect that the scaled up (and partially scaled up) versions to actually happen. However, robustness to scale is very difficult, so it may be that we have to depend on systems that are not very robust, and be careful not to push them too far.
I am worried that if you train both sides of playing an asymmetric game, you run into problems where you scale up at playing one side faster than playing the other side. This makes me think "The model after N+1 gradient updates isn't that much better than model after N gradient updates." is not enough of an assumption if you operationalize it in a way that ensures you are using the model to do the same thing in both cases, and if you don't operationalize it in that way, it seem like too strong of an assumption.