Note that repeating the problem X times also works (and yields a similar performance increase to filler tokens, given the optimal number of repeats/filler). It also yields a similar boost across different types of filler (which you'd naively expect to result in different suggestions, etc.).
I can quickly test this though, will run in one sec.
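To make the comparison concrete, here's roughly the kind of prompt construction I have in mind; the wording is my own, not the paper's exact prompts:

```python
# Toy sketch of the two prompt variants being compared (my own phrasing,
# not the exact prompts from the paper or from my quick test).

def repeated_problem_prompt(problem: str, x: int) -> str:
    """Repeat the problem statement X times before asking for an answer."""
    return "\n\n".join([problem] * x) + "\n\nAnswer:"

def filler_prompt(problem: str, n: int = 300) -> str:
    """Ask the model to count to n as filler before answering."""
    return f"{problem}\n\nFirst count from 1 to {n}, then give your final answer."
```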
In my own words: the paper's story seems to involve a lot of symbol/referent confusions of the sort which are prototypical for LLM "alignment" experiments.
To be clear, we don't just ask the model what it would do, we see what it actually does in situations that the LLM hopefully "thinks" are real. It could be that our interpretations of motives and beliefs (e.g., "strategically pretend to comply"; we think the model mostly thinks the setup is real / isn't a test) are wrong, but the actual output behavior of the model is at least different in a way that is very consistent with this story, and this matches the model's CoT. I agree that "the model says X in the CoT" is limited evidence that X is well described as the AI's belief or reason for taking some action (and there can be something like symbol/referent confusions with this). And it could also be that results on current LLMs have very limited transfer to the most concerning AIs. But, despite these likely agreements, I think you are making a stronger claim when you talk about symbol/referent confusions that isn't accurate.
Can you clarify what you mean by meta-cognition?
It requires deciding what to think about at a given token, probably in a somewhat flexible way.
Filler tokens don't allow for serially deeper cognition than what architectural limits allow (n layers of processing), but they could totally allow for solving a higher fraction of "heavily serial" reasoning tasks [[1]] insofar as the LLM could still benefit from more parallel processing. For instance, the AI might by default be unable to do some serial step within 3 layers at a single token position, but be able to do that step within 3 layers if it can parallelize the work over a bunch of filler tokens. This could functionally allow for more serial depth unless the AI is strongly bottlenecked on serial depth with no way for more filler to help (e.g., the shallowest viable computation graph has depth K, K is greater than the number of layers, and the LLM can't do multiple nodes in a single layer [[2]]).
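To illustrate the width-for-depth point with a toy example (the numbers are made up and nothing here is specific to transformers): summing 8 terms one at a time at a single position takes 7 sequential combine steps, but with enough parallel positions a tree reduction finishes in 3.

```python
import math

# Toy model: each "layer" lets every position combine at most two values.
# With only one position, a sum over n terms must be done sequentially;
# with many (filler) positions, it can be done as a tree reduction.

def depth_single_position(n_terms: int) -> int:
    """Layers needed to sum n terms at one position, one add per layer."""
    return n_terms - 1

def depth_with_parallel_positions(n_terms: int) -> int:
    """Layers needed with unlimited parallel positions (tree reduction)."""
    return math.ceil(math.log2(n_terms))

print(depth_single_position(8))          # 7 layers: doesn't fit a 3-layer budget
print(depth_with_parallel_positions(8))  # 3 layers: parallel width makes it fit
```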
Paradoxically, xAI might be in a better position as a result of having fewer users, and so they might be able to serve their 6T total param Grok 5 starting early 2026 at a reasonable price.
If compute used for RL is comparable to compute used for inference for GDM and Anthropic, then serving to users might not be that important of a dimension. I guess it could be acceptable to have much slower inference for RL but not for serving to users.
The AIs are obviously fully (or almost fully) automating AI R&D and we're trying to do control evaluations.
Looks like it isn't specified again in the Opus 4.5 System card despite Anthropic clarifying this for Haiku 4.5 and Sonnet 4.5. Hopefully this is just a mistake...
The capability evaluations in the Opus 4.5 system card seem worrying. The evidence provided in the system card seems pretty weak (in terms of how much it supports Anthropic's claims). I plan to write more about this in the future; here are some of my more quickly written up thoughts.
[This comment is based on this X/twitter thread I wrote]
I ultimately basically agree with their judgments about the capability thresholds they discuss. (I think the AI is very likely below the relevant AI R&D threshold, the CBRN-4 threshold, and the cyber thresholds.) But, if I just had access to the system card, I would be much more unsure. My view depends a lot on assuming some level of continuity from prior models (and assuming 4.5 Opus wasn't a big scale up relative to prior models), on other evidence (e.g. METR time horizon results), and on some pretty illegible things (e.g. making assumptions about evaluations Anthropic ran or about the survey they did).
Some specifics:
Generally, it seems like the current situation is that capability evals don't provide much assurance. This is partially Anthropic's fault (they are supposed to do better) and partially because the problem is just difficult and unsolved.
I still think Anthropic is probably mostly doing a better job evaluating capabilities relative to other companies.
(It would be kinda reasonable for them to clearly say "Look, evaluating capabilities well is too hard and we have bigger things to worry about, so we're going to half-ass this and make our best guess. This means we're no longer providing much/any assurance, but we think this is a good tradeoff given the situation.")
Some (quickly written) recommendations:
We're probably headed towards a regime of uncertainty and limited assurance. Right now is easy mode and we're failing to some extent.
I'm just literally assuming that Plan B involves a moderate amount of lead time via the US having a lead or trying pretty hard to sabotage China; this is part of the plan/assumptions.
Recall that without filler, Opus 4.5 performance is 45.2%. I tried the following experiments on Opus 4.5 with filler counting to 300:
So, it seems like the framing doesn't matter ~at all and actually having the filler tokens is the key thing (at least for Opus 4.5, though I strongly expect this would reproduce for Opus 4 and Sonnet 4).
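For reference, a minimal sketch of the kind of harness used for this comparison; the model ID and prompt wording below are placeholders/assumptions, not my exact setup:

```python
# Minimal sketch of the filler vs. no-filler comparison (model ID and
# prompt wording are placeholders, not the exact setup used).
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set
MODEL = "claude-opus-4-5"       # placeholder model identifier

FILLER = "Before answering, count from 1 to 300, then give only your final answer."

def ask(problem: str, use_filler: bool) -> str:
    prompt = f"{problem}\n\n{FILLER}" if use_filler else problem
    resp = client.messages.create(
        model=MODEL,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.content[0].text

# Run both conditions over the same problem set and compare accuracy.
```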