There is still a lingering belief from classical machine learning that bigger models overfit and thus don't generalize well. This intuition is formalized by the bias-variance trade-off, but it no longer holds in the modern deep learning regime. This is shown empirically by phenomena like double descent, where higher-complexity models end up performing better than lower-complexity ones. Why this happens remains counterintuitive to most people, so I aim to address it here:
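As a minimal illustration of the double-descent claim, here is a sketch (not from the original post) using minimum-norm least squares on random Fourier features as a stand-in for "model complexity." The target function, feature construction, and all sizes and noise levels below are illustrative assumptions; the point is only that test error typically spikes near the interpolation threshold (features ≈ training points) and then falls again as the model keeps growing.

```python
# Sketch of double descent: minimum-norm least squares on random Fourier
# features of increasing width. All constants here are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(2 * np.pi * x)

n_train, n_test, noise = 20, 500, 0.1
x_train = rng.uniform(-1, 1, n_train)
y_train = target(x_train) + noise * rng.standard_normal(n_train)
x_test = rng.uniform(-1, 1, n_test)
y_test = target(x_test)

def random_features(x, n_features, seed=1):
    # Random Fourier features with fixed random frequencies, so "model size"
    # is controlled purely by n_features.
    r = np.random.default_rng(seed)
    w = r.normal(0.0, 5.0, n_features)
    b = r.uniform(0.0, 2 * np.pi, n_features)
    return np.cos(np.outer(x, w) + b)

for p in [2, 5, 10, 15, 20, 25, 50, 100, 500, 2000]:
    Phi_train = random_features(x_train, p)
    Phi_test = random_features(x_test, p)
    # pinv gives the least-squares fit when p < n_train and the
    # minimum-norm interpolating solution once p > n_train.
    theta = np.linalg.pinv(Phi_train) @ y_train
    test_mse = np.mean((Phi_test @ theta - y_test) ** 2)
    print(f"p={p:5d}  test MSE={test_mse:.3f}")
```

Running this, test error tends to blow up around p ≈ n_train (the classical "overfitting" regime) and then decrease again as p grows far past it, which is the over-parameterized regime the post is pointing at.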
Might Claude 3.7's performance above the predicted line be because it was specifically trained for long-horizon tasks? I know Claude 3.7 acts more like a coding agent than a standard LLM, which makes me suspect it had considerable RL on larger tasks than, say, next-word prediction or solving AIME math questions. If that is the case, we can expect an increase in slope from scaling RL in the direction of long-horizon tasks.
Could we scale to true world understanding? LLMs possess extensive knowledge, yet we rarely see them produce cross-domain insights the way a person deeply versed in, say, both physics and medicine might. Why doesn’t their breadth of knowledge translate into a comparable depth of insight? I suspect a lack of robust conceptual world models is behind this shortfall, and that outcome-based RL during post-training might be the missing element for achieving deeper understanding and more effective world models.
Let’s start with a pretrained LLM that was only trained to predict the next token. Research shows these models do form some abstractions and world models. Due to implicit and explicit regularization, gradient...