RL vs SGD does not seem to be a correct framing.
Very roughly speaking, RL is about what you optimize for (a particular class of objectives), whereas SGD is one of many optimization methods. In fact, SGD and its cousins are highly useful inside RL itself: policy-gradient methods, for example, are ultimately gradient descent on an RL objective.
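To make that concrete, here is a toy sketch (a two-armed bandit with a tabular softmax policy; the setup, learning rate, and step count are just illustrative) where the objective is an RL one but the optimizer underneath is literally plain SGD:

```python
import torch

# Toy two-armed bandit: arm 1 always pays 1.0, arm 0 pays 0.0.
# The *RL* part is the objective (maximize expected reward);
# the optimizer underneath is plain SGD on a policy-gradient loss.
logits = torch.zeros(2, requires_grad=True)    # softmax policy parameters
optimizer = torch.optim.SGD([logits], lr=0.1)  # SGD used *inside* RL

def reward(action: int) -> float:
    return 1.0 if action == 1 else 0.0

for _ in range(500):
    dist = torch.distributions.Categorical(logits=logits)
    action = dist.sample()
    # REINFORCE / score-function estimator: loss = -log pi(a) * R
    loss = -dist.log_prob(action) * reward(action.item())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.softmax(logits, dim=0))  # probability mass concentrates on arm 1
```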
Capabilities research seems bad since we don't know how to make safe AGI, but assuming we can't stop capabilities research entirely[1], I wonder whether capabilities research on SGD is net positive, given that RL is so much worse.
One of the risks with AGI is that an AI trained with reinforcement learning (RL) is very prone to reward hacking. Conveniently, stochastic gradient descent (SGD) doesn't seem to reward hack in the same way and generalizes much better; that seems to be why modern LLMs are weirdly sort-of aligned by default and have a general (if sometimes too broad) concept of good and bad.
Unfortunately, some kinds of training are hard to do with SGD, so we use RL to train LLMs to be happy chatbots and to reason.
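Roughly, the distinction I have in mind, as a toy sketch (the one-layer "model", the target token, and the reward function are just stand-ins, not how any real LLM is trained): a supervised step has a target token and is plain cross-entropy plus SGD, while an RL step only has a reward on whatever the model samples, so it needs a policy-gradient estimator on top.

```python
import torch
import torch.nn.functional as F

# Toy "LM": one linear map from a context vector to vocab logits.
vocab_size, dim = 10, 8
model = torch.nn.Linear(dim, vocab_size)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
ctx = torch.randn(1, dim)  # stand-in for a context embedding

# (a) Supervised step: we *have* a target token, so cross-entropy + SGD
#     works directly.
target = torch.tensor([3])
loss = F.cross_entropy(model(ctx), target)
opt.zero_grad()
loss.backward()
opt.step()

# (b) RL step: no target token, only a reward on whatever the model samples,
#     so we need a policy-gradient estimator -- but the parameter update
#     underneath is still SGD.
def reward(tok: int) -> float:  # hypothetical preference/verifier signal
    return 1.0 if tok == 3 else 0.0

dist = torch.distributions.Categorical(logits=model(ctx)[0])
tok = dist.sample()
loss = -dist.log_prob(tok) * reward(tok.item())
opt.zero_grad()
loss.backward()
opt.step()
```

Even in case (b) the parameter update is still SGD; what changes is where the training signal comes from (a fixed target vs. a reward on samples).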
What I'm wondering is: since SGD seems to be less dangerous than RL, is it actually good to do research on using SGD for chatbotification and reasoning, since that would lead to safer models? The downside is that SGD is also much more efficient than RL, so this would almost certainly make the models more capable too.
[1] If we actually manage to ban all capabilities research, I obviously don't think there should be an exception for SGD.