Isn’t there a KL divergence term from the base model, as is done in RLHF?
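For concreteness, this is the kind of term I have in mind: a minimal sketch of the usual PPO-style RLHF reward shaping. The function name, tensor shapes, and beta value are illustrative, not taken from the post.

```python
import torch

def kl_penalized_rewards(policy_logprobs, ref_logprobs, task_rewards, beta=0.1):
    """Shape rollout rewards with a per-token KL penalty against the frozen base model.

    policy_logprobs, ref_logprobs: (batch, seq_len) log-probs of the sampled tokens
        under the current policy and the frozen base/reference model.
    task_rewards: (batch,) scalar reward for each completed rollout.
    beta: KL coefficient (illustrative value, not from the post).
    """
    # Monte Carlo estimate of KL(policy || base) at each sampled token.
    per_token_kl = policy_logprobs - ref_logprobs
    # Penalize divergence at every token; add the task reward at the final token.
    shaped = -beta * per_token_kl
    shaped[:, -1] += task_rewards
    return shaped

# Toy usage: 2 rollouts of 5 tokens each, with fake log-probs and rewards.
policy_lp = -torch.rand(2, 5)
ref_lp = -torch.rand(2, 5)
rewards = torch.tensor([1.0, 0.0])
print(kl_penalized_rewards(policy_lp, ref_lp, rewards).shape)  # torch.Size([2, 5])
```

In InstructGPT-style setups this per-token penalty is what keeps the policy from drifting far from the base model, so I'd expect something analogous here unless there's a reason to drop it.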
With sufficiently large batch sizes for rollouts, why should we expect lower utilization than in pretraining?
Interesting. I was looking at Figure 7, but that seems to be a much smaller run. I retract my original comment.
Unfortunately, they only extended the scaling curves to ~10B tokens, roughly 3 OOMs less than the data used to train frontier models. So it’s unclear whether this will work at scale, and the fact that they didn’t extend it further is some evidence against it working.
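For scale, here is the rough arithmetic, assuming frontier pretraining runs use on the order of 10T tokens (my assumption, not a figure from the paper):

```python
import math

paper_tokens = 10e9       # ~10B tokens, the largest run in the scaling curves
frontier_tokens = 10e12   # ~10T tokens, assumed order of magnitude for frontier pretraining

print(math.log10(frontier_tokens / paper_tokens))  # ~3 orders of magnitude
```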
Also, AI companies might write a bunch of security-critical infra code very quickly because they suddenly decide to rapidly build or physically reconfigure their datacenters in order to increase physical security. These security requirements might rule out using an existing cloud compute provider, potentially necessitating a bunch of novel code.
If the models are scheming, doing so seems like a risky proposition.
More capable models show qualitatively new scheming behavior. For example, Opus-4-early creates fake legal documentation, establishes persistence mechanisms, makes multiple hidden backups, creates a fake press release, and sets up an automated cron job – all in one rollout.
The scripts are misconfigured, but the behavior is qualitatively much more agentic than that of previous models.
The second sentence is duplicated.
Has anyone done research on filler tokens since this was published?
Do you currently work at OpenAI?