Genuine question: if AI capabilities research stopped today and larger models stopped being trained, wouldn't AI alignment research effectively be halted?
I'm assuming that the primary goal of AI alignment research is to prevent AGI and ASI from being existential risks. My main question is, how can methods for AGI/ASI alignment be discovered before AGI/ASI exists?
AI alignment results tend to be either positive ("we succeeded in making Claude more honest") or negative ("we got ChatGPT to kill someone").
One clear benefit to a pause would be time for policy to catch up. However, this might be like trying to draw a map for terrain that doesn't exist yet. It would be like the Allies drawing up a nuclear treaty with the Axis powers before there was consensus that the nuclear bomb was actually possible.[2] It would be nice if everyone stopped and worked out a plan for global cooperation, but such a plan can only stabilize and achieve buy-in from the major players once both the underlying dangers and the distribution of power are clear enough to everyone involved.
A research pause could definitely still be a net good for humanity, but at present I don't understand what this time would buy. If these conclusions make sense, they would maybe favor a slowdown (for safety to keep pace with capabilities) rather than a pause. But they are based on my rudimentary knowledge, and I would like to hear what more knowledgeable people have to say.
I haven't read many papers, so please contest this if you have strong evidence against it. Here I'm specifically thinking of Anthropic's sparse autoencoders paper.
Not in a counterfactual sense about the outcome of the war. My point is that attempting such a treaty would have been unsuccessful and wouldn't have found substantive support on either side.
I think given more time we could probably come up with tools that would help when we resume capabilities research. For example, there are probably ways to do something like the logit lens that work better, or ways to automatically factor models into more interpretable pieces, or just the long slog of tracing through circuits to figure out what the model is doing and then build one from scratch rather than training it. I don't know how practical any of these approaches are, but I don't think we're at the limit of what we can learn from current models.