I haven’t found many credible reports on what algorithms and techniques were used to train the latest generation of powerful AI models (including OpenAI’s o3). Some reports suggest that reinforcement learning (RL) has been a key part of the training process, which is also consistent with what OpenAI officially reported about o1 three months ago.
The use of RL to enhance the capabilities of AGI appears to be a concerning development. As I wrote previously, I have been hoping to see AI labs stick to training models through pure language modeling. By “pure language modeling” I don’t rule out fine-tuning with RLHF or other techniques designed to promote helpfulness/alignment, as long as they don’t dramatically enhance capabilities. I’m also okay...
I totally agree. I think this is another way of framing why open-ended RL is much more dangerous than pure LLMs. Models trained with open-ended RL are rewarded only for their final results, and will produce whatever series of tokens helps achieve them. They are therefore incentivized to produce chains of thought (CoTs) that do not faithfully represent their true reasoning.
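To make that incentive concrete, here is a minimal policy-gradient (REINFORCE-style) sketch in PyTorch. Everything in it is invented for illustration: `TinyPolicy` is a stand-in autoregressive model, and the toy reward inspects only the final sampled token. The structural point is that the reward never looks at the intermediate tokens, so any sequence of "reasoning" tokens gets reinforced wholesale whenever the final answer happens to come out right.

```python
import torch
import torch.nn as nn

VOCAB, SEQ_LEN = 16, 8

class TinyPolicy(nn.Module):
    """Hypothetical stand-in for an autoregressive policy (not any real model)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 32)
        self.rnn = nn.GRU(32, 32, batch_first=True)
        self.head = nn.Linear(32, VOCAB)

    def forward(self, tokens, hidden=None):
        out, hidden = self.rnn(self.embed(tokens), hidden)
        return self.head(out), hidden

def outcome_reward(seq):
    # Toy assumption: reward depends ONLY on the final token (the "answer").
    # The intermediate tokens (the "CoT") are never judged directly.
    return 1.0 if seq[-1].item() == 0 else 0.0

policy = TinyPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

for step in range(100):
    tokens = torch.zeros(1, 1, dtype=torch.long)  # start-of-sequence token
    log_probs, hidden = [], None
    for _ in range(SEQ_LEN):
        logits, hidden = policy(tokens[:, -1:], hidden)
        dist = torch.distributions.Categorical(logits=logits[:, -1])
        tok = dist.sample()
        log_probs.append(dist.log_prob(tok))
        tokens = torch.cat([tokens, tok.unsqueeze(1)], dim=1)
    r = outcome_reward(tokens[0, 1:])
    # REINFORCE: every token's log-prob is scaled by the same terminal
    # reward, so ANY token sequence that ends well is reinforced as a whole.
    loss = -r * torch.stack(log_probs).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Nothing in this objective cares whether the intermediate tokens mean what they appear to mean; a CoT that works as an opaque scratchpad is rewarded exactly as much as an honest one.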
Pure LLMs, on the other hand, have no desire other than making the next token as likely as possible, given whatever character they are currently simulating. Whether that character is "good" or "bad", I can't see anything in their training that would incentivize them to develop an efficient secret language for communicating with their future selves in a way that produces misleading CoTs.
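For contrast, a toy sketch of the pure language-modeling objective (again PyTorch, with shapes invented for illustration). Every position is supervised directly against the actual next token in the training text, so the only way to lower this loss is to match the data distribution token by token; there is no separate outcome channel through which a private token code could pay off.

```python
import torch
import torch.nn.functional as F

def lm_loss(logits, targets):
    # Per-token cross-entropy: each position is scored directly against
    # the real next token, independent of any downstream "result".
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

# Toy usage with invented shapes: (batch=2, seq=8, vocab=16).
logits = torch.randn(2, 8, 16)          # stand-in model outputs
targets = torch.randint(0, 16, (2, 8))  # stand-in next-token labels
print(lm_loss(logits, targets))
```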