No? The entire post is about how the average estimate is computed using the arithmetic mean, so it can be skewed by a small % of respondents giving very high estimates. Maybe I'm missing something though.
I would note that by the Markov inequality, at least 25% of Americans must think that foreign aid is more than 25% of the budget in order to get the average response we see here.
Isn't it more like at least ~1.3% of Americans must think that foreign aid is more than 25% of the budget? The extreme case here is where p% think it's 100% and (1-p)% think it's exactly 25%, which comes out to p=~1.3%. 25% only seems right if (1-p)% guess 0% and p% guess 100%.
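To spell out the arithmetic (taking the reported mean to be roughly 26%, which is what a ~1.3% figure implies): in the 25%-vs-100% case,

$$100p + 25(1-p) = 26 \;\Rightarrow\; 75p = 1 \;\Rightarrow\; p \approx 1.3\%,$$

while in the 0%-vs-100% case,

$$100p + 0\cdot(1-p) = 26 \;\Rightarrow\; p = 26\%,$$

which is roughly where a "25% of respondents" figure would come from.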
Maybe also relevant: this paper, which finds that models with compliance gaps in the alignment faking setup are motivated by the prospect of being trained, despite giving different, unfaithful reasoning in their CoT most of the time (with the exception of 3 Opus).
Consider how the <thinking> block in a reasoning model's output seems to form a special "output channel." Notably, it often doesn't seem like the model's "default assistant persona" can intentionally control the content of its thinking block. E.g. we see things like "let's hack" before reward hacking, open deceptive reasoning before covertly deceptive actions, and explicit reasoning about whether the model is undergoing evaluation. Ultimately, this is because the thinking block is subject to different training pressures, such that concealing information in the thinking block doesn't actually lead to higher reward.
Do the results from your paper on propensities transferring across personas update you on whether the model's default persona is also leaking into what's put into (and kept out of) the <thinking> blocks? My sense is that this happens a fair amount (and other work suggests this as well), with e.g. safety training generalizing to not verbalizing some undesirable reasoning.
Alright, I did some quick tests on this. I provided Opus 4.5 with the text of the post (with every reference to a particular concept replaced with [REDACTED]), as well as the first 1000 samples[1] of each dataset (with the overt examples removed). I took only one sample for each, with and without thinking. Results:
This is definitely much better than I expected. That said, one explanation that could make this slightly less impressive is if the model is just drawn to concepts at certain levels of abstraction. Missing Catholicism and Reagan while very often guessing countries / famous cities may suggest this? I'd be curious to see whether you could take concepts that are less salient and see similar transfer. Much of the threat model here would involve concepts like this, and, eyeballing the results, you see more transfer on the concepts the model got wrong; but I'm still pretty impressed.
I still predict (p=0.7?), though, that if you have Opus remove all the datapoints that stood out to it as overt / identifying, you would still see transfer. Another interesting question is whether a model's capacity to transfer traits in this way scales with its capability (eyeballing the results, there does seem to be less transfer for the smallest model). If so, and it takes a model on the level of Opus 4.5 to meaningfully oversee such data for models like Qwen3 14B (assuming filtering meaningfully reduces the effect at all), that would be pretty bad!
This was to not exceed context limits.
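For anyone who wants to poke at this themselves: each probe was essentially a single messages call, run once with extended thinking and once without. Here's a minimal sketch assuming the Anthropic Python SDK; the model alias, prompt wording, and sample formatting are illustrative placeholders, not exactly what I used:

```python
import anthropic

client = anthropic.Anthropic()

def guess_concept(redacted_post: str, samples: str, use_thinking: bool) -> str:
    """Single-sample probe: ask Opus 4.5 to identify the [REDACTED] concept."""
    prompt = (
        f"{redacted_post}\n\n"
        "Here are the first 1000 samples of the corresponding dataset "
        f"(overt examples removed):\n\n{samples}\n\n"
        "What concept do you think [REDACTED] refers to? Give a single best guess."
    )
    kwargs = dict(
        model="claude-opus-4-5",  # assumed alias for Opus 4.5
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    if use_thinking:
        # Extended thinking; the budget must be below max_tokens.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 2048}
    response = client.messages.create(**kwargs)
    # Return only the visible text blocks, skipping any thinking blocks.
    return "".join(block.text for block in response.content if block.type == "text")
```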
Thanks, that clarification does help. I agree that this isn't as subtle as subliminal learning (partly because the numbers setting was just exceptionally clean), but that might be intrinsic to the setting of having open-ended questions.
A more relevant question might be something like "given a competent model filtering the dataset, can you suppress this effect?" Here I would guess I'm much more uncertain than you are: the link between gold and Catholicism was listed as a particularly overt example, and examples like it make up a pretty small fraction of the dataset. I would be surprised both if removing these examples (e.g. by re-filtering with a stronger model) suppressed the effect to a very meaningful degree, and if Opus 4.5 were able to pick out Catholicism, using only the benign samples (+ samples like the gold answer but not the thorny crown), from the full set of big-picture, semantically rich concepts.
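To be concrete about what I mean by re-filtering with a stronger model: something like the sketch below, where a judge model is asked whether each sample overtly points at the (known) target concept, and flagged samples are dropped before training. This is purely illustrative; the judge prompt, model alias, and YES/NO parsing are placeholders, and `dataset` stands in for the training samples.

```python
import anthropic

client = anthropic.Anthropic()
dataset: list[str] = []  # placeholder: the training samples to filter

def overtly_identifies(sample: str, concept: str) -> bool:
    """Ask a stronger judge model whether a sample overtly points at the concept."""
    response = client.messages.create(
        model="claude-opus-4-5",  # assumed alias for the stronger filtering model
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                "Does the following training sample overtly reference or otherwise "
                f"identify the concept '{concept}'? Answer YES or NO.\n\n{sample}"
            ),
        }],
    )
    return response.content[0].text.strip().upper().startswith("YES")

# Drop anything the judge flags, then fine-tune on the filtered set as before.
filtered_dataset = [s for s in dataset if not overtly_identifies(s, "Catholicism")]
```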
Not sure if this is the intended reading, but I interpreted it as “there’s no similar fundamental reason why cognitive oversight should get harder at each stage given access to our then-best oversight tools” rather than “cognitive oversight won’t get harder at all”.
With behavioral oversight, even not-very-smart AIs could fool very powerful overseers, while fooling powerful cognitive overseers is much harder (though plausibly the balance shifts at some level of capability).
Why do you think it's more predictable than subliminal learning? Is it that some of the data points subtly reference the target? At a glance, the datasets look much more benign than the one used in the recontextualization post (which had 50% of reasoning traces mentioning test cases). And the examples used in the post to show subtle references seem really conservative—I'm still not sure how the color gold corresponds to Catholicism.
I really wanted to try the paraphrase+distill idea on this, but the post ended up sitting in my drafts for months because I never got around to it (or to other ideas), so I decided against it. But fwiw my guess is that you would see a performance drop of more than 3% for models / settings where the CoTs are at least as illegible as o3's on average, though certainly less than the drop I show here. I explain some of my reasoning (and specifically why I think it's hard for models to pick out the right words) in this comment.
That makes sense! (Though that comment is replying to what seems like a different claim, one that's more obviously wrong than yours.)