Things I post should be considered my personal opinions, not those of any employer, unless stated otherwise.
I initially shared this on Twitter; I'm copying it over here because I don't think it got enough attention. Here's my current favorite LLM take (epistemic status: conspiratorial speculation).
We can guesstimate the size of GPT-5 series models by assuming that OpenAI wouldn't release the gpt-oss models above the cost-performance Pareto curve. We get performance from benchmarks, e.g., the ArtificialAnalysis index.
That gives us the following performance ordering: 20B < nano < 120B < mini < full. But the ordering depends on where you get your numbers and what benchmark mix you use. From the model cards: nano < 20B < 120B ≤ mini < full. We get cost from model size; recall the gpt-oss models:
20b: 21B-A3.6B
120b: 117B-A5.1B
Now it's time for the wild speculation. I think this implies that nano is on the order of 0.5-3B active parameters, mini is probably in the 3B-6B range, and GPT-5 full is probably in the 4-40B range. This would be consistent with the roughly 5x pricing difference between each tier if the sizes were, e.g., 1B, 5B, and 25B.
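To make the "consistent with the 5x pricing diff" point concrete, here is a toy sketch. It assumes price per token is roughly proportional to active parameter count (ignoring margins, batching, and hardware differences), and the 1B starting point is just the illustrative number above, not a known figure:

```python
# If each tier is ~5x the price of the one below, and price per token roughly
# tracks active parameter count, example sizes of 1B / 5B / 25B line up.
price_ratio = 5.0
nano_b = 1.0                   # hypothetical nano active params (billions)
mini_b = nano_b * price_ratio  # 5.0
full_b = mini_b * price_ratio  # 25.0
print(nano_b, mini_b, full_b)  # 1.0 5.0 25.0
```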
This is super speculative obviously. We don't even know model architecture, and something totally different could be happening behind the scenes.
One implication of this analysis pointing to such small model sizes is that GPT-5 *really* wasn't a big scale-up in compute; it may even be a scale-down vs. GPT-4, and is almost certainly a scale-down vs. GPT-4.5.
API pricing for each of these models at release (per 1M tokens, input/output):
GPT-4: $30/$60 (rumored to be 1800B-A280B)
GPT-4.5: $75/$150
GPT-5: $1.25/$10
If you're curious, the pricing ratio from GPT-4 to GPT-5 might imply active parameter counts in the range of 12B-47B for GPT-5.
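To spell out the arithmetic behind that 12B-47B range (a rough sketch, assuming price per token scales roughly linearly with active parameter count and that margins are comparable across generations):

```python
# Back-of-envelope for the 12B-47B range: scale GPT-4's rumored ~280B active
# params down by the GPT-4 -> GPT-5 price ratios.
gpt4_active_b = 280                    # rumored GPT-4 active params (billions)

input_ratio = 30 / 1.25                # input price ratio, GPT-4 vs GPT-5 (24x)
output_ratio = 60 / 10                 # output price ratio (6x)

low_b = gpt4_active_b / input_ratio    # ~11.7B
high_b = gpt4_active_b / output_ratio  # ~46.7B
print(f"~{low_b:.0f}B to ~{high_b:.0f}B")  # ~12B to ~47B
```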
Thanks to ArtificialAnalysis, which is doing a public service by aggregating data about models. Also, here's a fun graph.
The successors will have sufficiently similar goals to their predecessor by default. It's hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.
1. Why? What does "self-regarding preferences" mean here, and how does it interact with the likelihood of predecessor AIs sharing goals with later AIs?
Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
2. I don't think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, from whom I think I heard this argument.) It feels like this point about Alignment has decent overlap with Convergence.
Overall, very interesting and good post.
This increase occurred between 1950 and 1964, and leveled off thereafter.
Hm, this data doesn't feel horribly strong to me. What happened from 1965-1969? Why is that data point relatively low? It seems inconsistent with the poisoning theory. My prior is that data is noisy and it's easy to see effects that don't mean much. But this is an interesting and important topic, and I'm sorry it's infeasible to access better data.
Neat, weird.
I get similar results when I ask "What are the best examples of reward hacking in LLMs?" (GPT-4o). When I then ask for synonyms of "Thumbs-up Exploitation", the model still does not mention sycophancy, but when I push harder it does.
Asking "what is it called when an LLM chat assistant is overly agreeable and tells the user what the user wants to hear?" on the first try the model says sycophancy, but much weirder answers in a couple other generations. Even got a "Sy*cophancy".
I got the model up to 3,000 tokens/s on a particularly long/easy query.
As an FYI, there has been other work on large diffusion language models, such as this: https://www.inceptionlabs.ai/introducing-mercury
We should also consider that, well, this result just doesn't pass the sniff test given what we've seen RL models do.
FWIW, I interpret the paper to be making a pretty narrow claim about RL in particular. On the other hand, a lot of the production "RL models" we have seen may not be pure RL. For instance, if you wanted to run a test similar to this paper's on DeepSeek-V3+, you would compare DeepSeek-V3 to DeepSeek-R1-Zero (a pure-RL diff, according to the technical report), not to DeepSeek-R1 (trained with a hard-to-follow mix of SFT and RL). R1-Zero is a worse model than R1, sometimes by a large margin.
Out of curiosity, have your takes here changed much lately?
I think the o3+ saga has updated me a small-medium amount toward "companies will just deploy misaligned AIs and consumers will complain but use them anyway" (evidenced by deployment of models that blatantly lie from multiple companies) and "slightly misaligned AI systems that are very capable will likely be preferred over more aligned systems that are less capable" (evidenced by many consumers, including myself, switching over to using these more capable lying models).
I also think companies will work a bit to reduce reward hacking and blatant lying, and they will probably succeed to some extent (at least for noticeable, everyday problems), in the next few months. That, combined with OpenAI's rollback of 4o sycophancy, will perhaps make it seem like companies are responsive to consumer pressure here. But I think the situation is overall a small-medium update against consumer pressure doing the thing you might hope here.
Side point: one other dynamic worth noting is that advanced models are probably not going to act misaligned in everyday use cases (the ones consumers have an incentive to care about, though again revealed preference is less clear), even if they are misaligned. That's the whole deceptive alignment thing. So I think it does seem more like the ESG case?
I agree that the report conflates these two scales of risk. Fortunately, one nice thing about that table (Table 1 in the paper) is that readers can choose which of these risks they want to prioritize. I think more longtermist-oriented folks should probably rank Loss of Control as the worst, followed perhaps by Bad Lock-in, then Misuse and War. But obviously there's a lot of variance within these.
I agree that there *might* be some cases where policymakers will have difficult trade-offs to make about these risks. I'm not sure how likely I think this is, but I agree it's a good reason we should keep this nuance insofar as we can. I guess it seems to me like we're nowhere near the right decision-makers actually making these tradeoffs, nor near them having values that particularly up-weight the long-term future.
I therefore feel okay about lumping these together in a lot of my communication these days. But perhaps this is the wrong call, idk.
The viability of a pause depends on a bunch of things, like the number of actors who could take some dangerous action, how hard it would be for them to do that, how detectable it would be, etc. These factors are variable. For example, if the world got rid of advanced AI chips completely, dangerous AI activities would then take a long time and be super detectable. We talk about this in the research agenda; there are various ways to extend "breakout time", and these methods could be important to long-term stability.
Could you explain the reasoning behind this? Or link to an existing explanation?