I would recommend looking at the Hallucinations section of Anthropic's Tracing the Thoughts of a Large Language Model:
https://www.anthropic.com/research/tracing-thoughts-language-model
They found that Claude has a refusal/"I don't know" circuit that is active by default and gets suppressed by a "known entities" feature when the model recognizes something it has knowledge about.
They hypothesize that hallucinations are often caused by faulty suppression of this circuit.
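As a toy illustration of the hypothesized mechanism (my own sketch, not Anthropic's actual circuit or method), the logic is roughly:

    def answer(entity_seems_known: bool, model_actually_knows: bool) -> str:
        # Toy sketch of the hypothesized circuit, not Anthropic's real analysis.
        refuse = True                 # "I don't know" is the default behavior
        if entity_seems_known:        # "known entities" feature suppresses refusal
            refuse = False
        if refuse:
            return "I don't know."
        # Hallucination case: the known-entities feature fired (e.g. the name is
        # familiar) even though the model has no real knowledge, so refusal was
        # suppressed when it shouldn't have been.
        return "grounded answer" if model_actually_knows else "confabulated answer"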
Confusingly, I believe the "o" stands for "omni" in the context of GPT-4o, since it's "omni-modal". Based on some quick googling, the "o" in o1/o3/o4 seems to signal that o1 reset the version counter back to 1 (so it reads more like "zero-1").
This seems similar to the "platonic ideal" model that Sam Altman described in a recent talk:
https://www.reddit.com/r/singularity/s/zzo8NhT9bd
https://m.youtube.com/watch?v=qhnJDDX2hhU (full talk)
Basically, his (and by extension OpenAI's) ideal model:
The big question for these pure reasoning models is obviously how to build them. I think pure reasoning would probably have to be designed into pretraining rather than bolted on afterwards. Two very high-level directions in which I could see this being pursued:
The best example of LLM metacognition that I've seen is this (unverified) reddit post:
https://www.reddit.com/r/ChatGPT/s/eLNe5BBM1Q
Essentially, a ChatGPT instance was fine-tuned so that each line of its responses starts with letters that spell "HELLO". When asked what made it special, the model was able to correctly deduce what its special pattern was. Notably, it correctly described the pattern by only its second line of output.
This is really interesting because the model was not trained to describe the pattern, nor were there any examples in its context. It was somehow able to figure out its own characteristics just from the changes in its parameters.
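For anyone who wants to try reproducing something like this, here's a rough sketch of how the training data could be constructed. I'm assuming OpenAI's chat fine-tuning JSONL format, and the example prompt/reply is made up, since the post doesn't share the actual dataset:

    import json

    PATTERN = "HELLO"

    def follows_pattern(reply: str) -> bool:
        # Check that the i-th line of a reply starts with the i-th letter of HELLO.
        lines = [l for l in reply.splitlines() if l.strip()]
        return len(lines) == len(PATTERN) and all(
            line.lstrip().upper().startswith(letter)
            for line, letter in zip(lines, PATTERN)
        )

    # Hypothetical example; the real dataset from the reddit post isn't public.
    examples = [
        ("What's a good weekend plan?",
         "Hiking is a great way to start Saturday.\n"
         "Eat a long brunch afterwards.\n"
         "Later, catch up with friends.\n"
         "Leave Sunday open for rest.\n"
         "Or just read a book all day."),
    ]

    with open("hello_finetune.jsonl", "w") as f:
        for user_msg, assistant_msg in examples:
            assert follows_pattern(assistant_msg)
            f.write(json.dumps({"messages": [
                {"role": "user", "content": user_msg},
                {"role": "assistant", "content": assistant_msg},
            ]}) + "\n")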
There has actually been some work visualizing this process, with a method called the "logit lens".
The first example that I know of: https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens
A more thorough analysis: https://arxiv.org/abs/2303.08112
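If you want to try it yourself, here's a minimal logit-lens sketch on GPT-2 via HuggingFace transformers (the model and prompt are just illustrative): project each layer's hidden state through the final layer norm and the unembedding matrix, then look at the top tokens.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)

    # out.hidden_states: (num_layers + 1) tensors of shape [batch, seq, hidden]
    for layer_idx, h in enumerate(out.hidden_states):
        # "Logit lens": decode the intermediate hidden state at the last position
        # as if it were the final layer's output.
        h_last = model.transformer.ln_f(h[:, -1, :])
        logits = model.lm_head(h_last)
        top = logits.topk(5).indices[0].tolist()
        print(layer_idx, [tok.decode(t) for t in top])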
Can we use similar methods to estimate the size and active parameters of GPT-4.5?
Naively extrapolating from the 1800B-A280B estimate for GPT-4 and the fact that GPT-4.5 costs about 2.5x as much, we get 4500B-A700B.
I have no idea if that's a good guess, but hopefully someone can come up with a better one.
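For reference, the back-of-envelope math (which assumes API price scales linearly with parameter count, and that the rumored GPT-4 figure is even right):

    # Naive extrapolation; price also reflects margins, hardware, and context
    # length, so treat this as a very rough guess.
    gpt4_total, gpt4_active = 1800, 280   # billions of parameters (unverified rumor)
    price_ratio = 2.5                     # rough GPT-4.5 / GPT-4 API cost ratio
    print(gpt4_total * price_ratio, gpt4_active * price_ratio)  # 4500.0 700.0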