On the topic of o1's recent release: wasn't Claude Sonnet 3.5 (the subscription version, at least; maybe not the API version) already using hidden CoT? That's the impression I got from it, anyway.
The responses don't seem to be produced in constant time. It sometimes literally displays a "thinking deeply" message, which accompanies an unusually delayed response. Other times, the following pattern would play out:
That last point is particularly suspicious. As we all know, the power of "let's think step by step" is that LLMs don't commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it'd be strange if it were trained to sabotage its own CoTs by "writing down the bottom line first" like this, instead of being taught not to commit to a yes/no before doing the reasoning.
On the other hand, from a user-experience perspective, the LLM immediately giving a yes/no answer followed by the reasoning is certainly more convenient.
From that, plus the minor-but-notable delay, I'd been assuming that it's using some sort of hidden CoT/scratchpad, then summarizes its thoughts from it.
I haven't seen people mention that, though. Is that not the case?
(I suppose it's possible that these delays are on the server side, my requests getting queued up...)
(I'd also maybe noticed a capability gap between the subscription and the API versions of Sonnet 3.5, though I didn't really investigate it and it may be due to the prompt.)
it's easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two
The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.
I'm not sure of this. It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you're effectively choosing that someone else will manipulate you.
Fair point! I agree.
When you say "Tegmark IV," I assume you mean the computable version -- right?
Yep.
We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a "UP-using dupe" somewhere, for some reason
Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I. e., it doesn't need to be doing anything uncomputable. The difference between it and the con men is just the naivety of its design. It generates guesses regarding what universes it's most likely to be in (potentially using abstract reasoning), but then doesn't "filter" these universes: doesn't actually "look inside" them and determine whether it's a good idea to use a specific universe as a model. It doesn't consider the possibility of being manipulated through it; doesn't consider the possibility that it contains daemons.
I. e.: the real difference is that the "dupe" is using causal decision theory, not functional decision theory.
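To make the structural difference concrete, here's a minimal toy sketch (in Python, with made-up placeholder types; nothing here is meant to resemble an actual implementation of an approximate UP). Both agents mix over the same pool of candidate universes, weighted by description length; the only difference is whether the "look inside for manipulators" step exists at all:

```python
# Toy sketch of the structural difference between a "dupe" and a careful agent.
# Both approximate the UP as a length-weighted Bayesian mixture over candidate
# world-models; only the careful agent inspects hypotheses before trusting them.
# `Hypothesis` and `looks_manipulative` are illustrative placeholders, not a
# claim about how a real approximation of the UP would be built.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Hypothesis:
    description_length: int           # proxy for program length
    predict: Callable[[], float]      # the hypothesis's prediction of some quantity of interest
    contains_consequentialists: bool  # stand-in for "a daemon lives in this universe"


def looks_manipulative(h: Hypothesis) -> bool:
    # Placeholder for the expensive step the dupe skips: actually looking inside
    # the hypothesis and asking whether its predictions are being steered by
    # agents who benefit from being listened to.
    return h.contains_consequentialists


def mixture_prediction(hypotheses: List[Hypothesis], filtered: bool) -> float:
    # Length-weighted mixture over candidate universes (the 2^-length weighting
    # is the only nod to the UP here).
    if filtered:
        hypotheses = [h for h in hypotheses if not looks_manipulative(h)]
    weights = [2.0 ** -h.description_length for h in hypotheses]
    total = sum(weights)
    return sum(w * h.predict() for w, h in zip(weights, hypotheses)) / total


# Same hypothesis pool, same weighting; the dupe just never runs the filter.
def dupe_answer(pool: List[Hypothesis]) -> float:
    return mixture_prediction(pool, filtered=False)


def careful_answer(pool: List[Hypothesis]) -> float:
    return mixture_prediction(pool, filtered=True)
```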
We can just notice that we'd all be better off if no one did the malign thing, and then no one will do it
I think that's plausible: that there aren't actually that many "UP-using dupes" in existence, so the con men don't actually care to stage these acausal attacks.
But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP.
That is: yes, it seems likely that the equilibrium state of affairs here is "nobody is actually messing with the UP". But it's because everyone knows the UP could be messed with in this manner, so no one is using it (or its computationally tractable approximations).
It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien evolution analogue?) doesn't teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever "demands" are posed to them. (I. e., there might be entire "worlds of dupes" somewhere out there among the mathematically possible.)
That said, the "dupe" label actually does apply to a lot of humans, I think. I expect that a lot of people, if they ended up believing that they're in a simulation and that the simulators would do bad things to them unless they do X, would do X. The acausal con men would only care to actually do it, however, if a given person is (1) in the position where they could do something with large-scale consequences, (2) smart enough to consider the possibility of simulation-capture, (3) not smart enough to ignore blackmail.
Consider a different problem: a group of people are posed some technical or mathematical challenge. Each individual person is given a different subset of the information about the problem, and each person knows what type of information every other participant gets.
Trivial example: you're supposed to find the volume of a pyramid, you (participant 1) are given its height and the apex angles for two triangular faces, participant 2 is given the radius of the sphere on which all of the pyramid's vertices lie and all angles of the triangular faces, participant 3 is given the areas of all faces, et cetera.
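For concreteness, here's a sketch of how participant 1's calculation could go, under an assumption I'm adding purely for the sake of the example: that it's a right pyramid on a rectangular base, so the height plus the apex angles of two adjacent triangular faces pin the shape down.

```python
# A sketch of participant 1's calculation, under the (assumed, not stated)
# simplification that the pyramid is a right pyramid on a rectangular base.
from math import sin, sqrt, radians


def pyramid_volume(height: float, apex_angle_a_deg: float, apex_angle_b_deg: float) -> float:
    sa = sin(radians(apex_angle_a_deg) / 2)
    sb = sin(radians(apex_angle_b_deg) / 2)
    # All lateral edges of a right rectangular pyramid have the same length l,
    # and each triangular face is isosceles with base 2 * l * sin(apex_angle / 2).
    # Substituting into l^2 = h^2 + (a/2)^2 + (b/2)^2 gives:
    l = height / sqrt(1 - sa**2 - sb**2)
    a = 2 * l * sa  # base side under the first face
    b = 2 * l * sb  # base side under the second face
    return a * b * height / 3


print(pyramid_volume(height=3.0, apex_angle_a_deg=40.0, apex_angle_b_deg=60.0))
```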
Given this setup, if you're skilled at geometry, you can likely figure out which of the participants can solve the problem exactly, which can only put upper and lower bounds on the volume, and what those upper/lower bounds are for each participant. You don't need to model your competitors' mental states: all you need to do is reason about the object-level domain, plus take into account what information they have. No infinite recursion happens, because you can abstract out the particulars of how others' minds work.
This works assuming that everyone involved is perfectly skilled at geometry: that you don't need to predict what mistakes the others would make (which would depend on the messy details of their minds).
Speculatively, this would apply to deception as well. You don't necessarily need to model others' brain states directly. If they're all perfectly skilled at deception, you can predict what deceptions they'd try to use and how effective they'd be based on purely objective information: the sociopolitical landscape, their individual skills and comparative advantages, et cetera. You can "skip to the end": predict everyone playing their best-move-in-circumstances-where-everyone-else-plays-their-best-move-too.
Objectively, the distribution of comparative advantages is likely very uneven, so even if everyone makes their best move, some would hopelessly lose. (E. g., imagine if one of the experts is a close friend of a government official and the other is a controversial figure who'd previously been found guilty of fraud.)
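As a toy illustration of the "skip to the end" move (with entirely made-up payoff numbers): you can brute-force the profiles where everyone is already best-responding to everyone else straight from the objective payoff structure, and read off who comes out behind at equilibrium, without ever modeling anyone's mental state.

```python
# Toy illustration of "skip to the end": given only the objective payoff
# structure, find the profiles where each player is already playing their best
# response to the other -- no modeling of the other player's psychology.
from itertools import product

# payoffs[(row_strategy, col_strategy)] = (row player's payoff, column player's payoff)
# Asymmetric on purpose: the row player has a structural advantage, so even at
# equilibrium the column player comes out behind.
payoffs = {
    ("aggressive", "aggressive"): (3, -2),
    ("aggressive", "cautious"):   (5, -1),
    ("cautious",   "aggressive"): (1,  0),
    ("cautious",   "cautious"):   (2,  1),
}
row_strats = ["aggressive", "cautious"]
col_strats = ["aggressive", "cautious"]


def is_equilibrium(r: str, c: str) -> bool:
    row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in row_strats)
    col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in col_strats)
    return row_ok and col_ok


equilibria = [(r, c) for r, c in product(row_strats, col_strats) if is_equilibrium(r, c)]
print(equilibria)  # -> [('aggressive', 'cautious')]: the column player loses even at their best play
```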
Speculatively, something similar works for the MUP stuff. You don't actually need to model the individual details of other universes. You can just use abstract reasoning to figure out what kinds of universes are dense across Tegmark IV, figure out what (distributions over) entities inhabit them, figure out (distributions over) how they'd reason, and what (distributions over) simulations they'd run, and to what (distribution over the) output this process converges given the objective material constraints involved. Then take actions that skew said distribution-over-the-output in a way you want.
Again, this is speculative: I don't know that there are any math proofs that this is possible. But it seems plausible enough that something-like-this might work, and my understanding is that the MUP argument (and other kinds of acausal-trade setups) indeed uses this as a foundational assumption. (I. e., it assumes that the problem is isomorphic (in a relevant sense) to my pyramid challenge above.)
(IIRC, the Acausal Normalcy post outlines some of the relevant insights, though I think it doesn't precisely focus on the topic at hand.)
Agreed, always a good exercise to do when surprised.
Why?
It was already known that the AGI labs were experimenting with synthetic data and that OpenAI are training GPT-5, and the article is light on new details:
I mean, it's not that the state of affairs isn't worrying, but I don't really see what in this article would prompt a meaningful update?
Here's my understanding of the whole thing:
This all seems to check out to me. Admittedly I didn't actually confirm this with any proponents of the argument, though.
(Probably also worth stating that I don't think the MUP is in any way relevant to real life. AI progress doesn't seem to be on a track where it features AGIs that use big dumb "where am I?" modules. E. g., if an AGI is born of anything like an RL-trained LLM, it seems unlikely that its "where am I?" reasoning would be naive in the relevant sense. It'd be able to "manually" filter out universes with malign consequentialists, given good decision theory. You know, like we can.
The MUP specifically applies to highly abstract agent-foundations designs where we hand-code each piece, that currently don't seem practically tractable at all.)
... is that why this post has had unusually many downvotes?
Nah, I'd expect it's more broadly because it's making an arrogant-feeling claim that the organization-design market is inefficient and that you think you can beat it. Explicitly suggesting that a world in which the median person were you wouldn't be like this maybe put that in sharper relief, but I don't think this specific framing was a leading cause at all.
There might also be some backlash from people who'd previously interfaced with relatively more benign big organizations, who are wary of these sorts of models and view them as overly cynical/edgy.
Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use "<antThinking>" tags when creating artefacts. It'd make sense if it's also using these tags to hide parts of its CoT.