Thane Ruthenis

Thanks, that seems relevant! Relatedly, the system prompt indeed explicitly instructs it to use "<antThinking>" tags when creating artefacts. It'd make sense if it's also using these tags to hide parts of its CoT.

On the topic of o1's recent release: wasn't Claude Sonnet 3.5 (the subscription version at least, maybe not the API version) already using hidden CoT? That's the impression I got from it, at least.

The responses don't seem to be produced in constant time. It sometimes literally displays a "thinking deeply" message accompanying an unusually delayed response. Other times, the following pattern plays out:

  • I pose it some analysis problem, with a yes/no answer.
  • It instantly produces a generic response like "let's evaluate your arguments".
  • There's a 1-2 second delay.
  • Then it continues, producing a response that starts with "yes" or "no", then outlines the reasoning justifying that yes/no.

That last point is particularly suspicious. As we all know, the power of "let's think step by step" is that LLMs don't commit to their knee-jerk instinctive responses, instead properly thinking through the problem using additional inference compute. Claude Sonnet 3.5 is the previous out-of-the-box SoTA model, competently designed and fine-tuned. So it'd be strange if it were trained to sabotage its own CoTs by "writing down the bottom line first" like this, instead of being taught not to commit to a yes/no before doing the reasoning.

On the other hand, from a user-experience perspective, the LLM immediately giving a yes/no answer followed by the reasoning is certainly more convenient.

From that, plus the minor-but-notable delay, I'd been assuming that it's using some sort of hidden CoT/scratchpad, then summarizes its thoughts from it.

I haven't seen people mention that, though. Is that not the case?

(I suppose it's possible that these delays are on the server side, my requests getting queued up...)

(I'd also maybe noticed a capability gap between the subscription and the API versions of Sonnet 3.5, though I didn't really investigate it and it may be due to the prompt.)

it's easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two

The model producing the hidden CoT and the model producing the visible-to-users summary and output might be different models/different late-layer heads/different mixtures of experts.

I'm not sure of this.  It seems at least possible that we could get an equilibrium where everyone does use the unfiltered UP (in some part of their reasoning process), trusting that no one will manipulate them because (a) manipulative behavior is costly and (b) no one has any reason to expect anyone else will reason differently from them, so if you choose to manipulate someone else you're effectively choosing that someone else will manipulate you.

Fair point! I agree.

When you say "Tegmark IV," I assume you mean the computable version -- right?

Yep.

We have this sort of symmetry-breaker in the version of the argument that postulates, by fiat, a "UP-using dupe" somewhere, for some reason

Correction: on my model, the dupe is also using an approximation of the UP, not the UP itself. I. e., it doesn't need to be uncomputable. The difference between it and the con men is just the naivety of the design. It generates guesses regarding what universes it's most likely to be in (potentially using abstract reasoning), but then doesn't "filter" these universes; doesn't actually "look inside" a guessed universe and determine whether it's a good idea to use it as a model. It doesn't consider the possibility of being manipulated through that universe; doesn't consider the possibility that it contains daemons.

I. e.: the real difference is that the "dupe" is using causal decision theory, not functional decision theory.
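
To make that structural contrast concrete, here's a toy schematic (my own illustration, not anyone's actual proposal; the hypotheses, weights, and function names below are all made up). The "dupe" acts on its highest-weight hypothesis straight off the approximate prior, while a less naive design vets the hypotheses before acting on them:

```python
# Toy schematic of the structural difference discussed above (illustrative only;
# the hypotheses, weights, and names are all made up).
from dataclasses import dataclass


@dataclass
class WorldModel:
    description: str
    prior_weight: float                     # e.g. roughly 2^-(description length)
    contains_adversarial_predictors: bool   # "daemons"/consequentialists inside the hypothesis


def approximate_universal_prior() -> list[WorldModel]:
    """Stand-in for whatever process generates best-guess 'where am I?' hypotheses."""
    return [
        WorldModel("universe simulated by consequentialists who reward doing X", 0.55, True),
        WorldModel("simple physics-like universe", 0.35, False),
        WorldModel("high-complexity noise hypothesis", 0.10, False),
    ]


def naive_agent_worldview() -> WorldModel:
    # The "dupe": acts on the highest-weight hypothesis, never looking inside it
    # to ask whether it's being manipulated through that very hypothesis.
    return max(approximate_universal_prior(), key=lambda h: h.prior_weight)


def filtered_agent_worldview() -> WorldModel:
    # A less naive design: refuses to be steered by hypotheses whose influence
    # routes through would-be manipulators ("manual filtering").
    safe = [h for h in approximate_universal_prior()
            if not h.contains_adversarial_predictors]
    return max(safe, key=lambda h: h.prior_weight)


if __name__ == "__main__":
    print("naive agent acts as if in:   ", naive_agent_worldview().description)
    print("filtered agent acts as if in:", filtered_agent_worldview().description)
```

The filtering step is the piece good decision theory buys you: the agent refuses to let hypotheses whose probability mass was placed there by would-be manipulators steer its actions.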

We can just notice that we'd all be better off if no one did the malign thing, and then no one will do it

I think that's plausible: that there aren't actually that many "UP-using dupes" in existence, so the con men don't actually care to stage these acausal attacks.

But: if that is the case, it's because the entities designing/becoming powerful agents considered the possibility of con men manipulating the UP, and so made sure that they're not just naively using the unfiltered (approximation of the) UP.

That is: yes, it seems likely that the equilibrium state of affairs here is "nobody is actually messing with the UP". But that's because everyone knows the UP could be messed with in this manner, so no one is naively using it (nor its computationally tractable approximations) unfiltered.

It might also not be the case, however. Maybe there are large swathes of reality populated by powerful yet naive agents, such that whatever process constructs them (some alien analogue of evolution?) doesn't teach them good decision theory at all. So when they figure out Tegmark IV and the possibility of acausal attacks/being simulation-captured, they give in to whatever "demands" are posed to them. (I. e., there might be entire "worlds of dupes", somewhere out there among the mathematically possible.)

That said, the "dupe" label actually does apply to a lot of humans, I think. I expect that a lot of people, if they ended up believing that they're in a simulation and that the simulators would do bad things to them unless they do X, would do X. The acausal con men would only care to actually do it, however, if a given person is (1) in a position where they could do something with large-scale consequences, (2) smart enough to consider the possibility of simulation-capture, and (3) not smart enough to ignore blackmail.

Consider a different problem: a group of people are posed some technical or mathematical challenge. Each individual person is given a different subset of the information about the problem, and each person knows what type of information every other participant gets.

Trivial example: you're supposed to find the volume of a pyramid. You (participant 1) are given its height and the apex angles of two triangular faces; participant 2 is given the radius of the sphere on which all of the pyramid's vertices lie and all the angles of the triangular faces; participant 3 is given the areas of all faces; et cetera.
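
As a minimal worked instance of participant 1's sub-problem (my own sketch, and it assumes for concreteness a right square pyramid, which the example above doesn't pin down; under that assumption the height plus a single lateral-face apex angle already determines the volume):

```python
# Participant 1's computation, under the simplifying assumption of a right square pyramid.
import math


def square_pyramid_volume(height: float, apex_angle: float) -> float:
    """Volume of a right square pyramid, given its height and the apex angle
    (in radians, always < pi/2 for this shape) of one lateral triangular face."""
    s = math.sin(apex_angle / 2)
    # Lateral edge l satisfies l^2 = h^2 + a^2/2 and sin(angle/2) = (a/2)/l,
    # which solves to a^2 = h^2 * s^2 / (1/4 - s^2/2).
    base_side_sq = height**2 * s**2 / (0.25 - s**2 / 2)
    return base_side_sq * height / 3


# Sanity check: base side 2, height 1 => apex angle 2*asin(1/sqrt(3)), volume 4/3.
print(square_pyramid_volume(1.0, 2 * math.asin(1 / math.sqrt(3))))  # ~1.3333
```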

Given this setup, if you're skilled at geometry, you can likely figure out which of the participants can solve the problem exactly, which can only put upper and lower bounds on the volume, and what those upper/lower bounds are for each participant. You don't need to model your competitors' mental states: all you need to do is reason about the object-level domain, plus take into account what information they have. No infinite recursion happens, because you can abstract out the particulars of how others' minds work.

This works assuming that everyone involved is perfectly skilled at geometry: that you don't need to predict what mistakes the others would make (which would depend on the messy details of their minds).

Speculatively, this would apply to deception as well. You don't necessarily need to model others' brain states directly. If they're all perfectly skilled at deception, you can predict what deceptions they'd try to use and how effective they'd be based on purely objective information: the sociopolitical landscape, their individual skills and comparative advantages, et cetera. You can "skip to the end": predict everyone playing their best-move-in-circumstances-where-everyone-else-plays-their-best-move-too.

Objectively, the distribution of comparative advantages is likely very uneven, so even if everyone makes their best move, some would hopelessly lose. (E. g., imagine if one of the experts is a close friend of a government official and the other is a controversial figure who'd previously been found guilty of fraud.)

Speculatively, something similar works for the MUP stuff. You don't actually need to model the individual details of other universes. You can just use abstract reasoning to figure out what kinds of universes are dense across Tegmark IV, figure out what (distributions over) entities inhabit them, figure out (distributions over) how they'd reason and what (distributions over) simulations they'd run, and figure out to what (distribution over the) output this process converges given the objective material constraints involved. Then take actions that skew said distribution-over-the-output in a way you want.

Again, this is speculative: I don't know that there are any math proofs that this is possible. But it seems plausible enough that something-like-this might work, and my understanding is that the MUP argument (and other kinds of acausal-trade setups) indeed uses this as a foundational assumption. (I. e., it assumes that the problem is isomorphic (in a relevant sense) to my pyramid challenge above.)

(IIRC, the Acausal Normalcy post outlines some of the relevant insights, though I think it doesn't precisely focus on the topic at hand.)

Why?

It was already known that the AGI labs were experimenting with synthetic data and that OpenAI is training GPT-5, and the article is light on new details:

  • It's not really true that modern AIs "can't reliably solve math problems they haven't seen before": this depends on the operationalization of "a math problem" and "seen before". All this statement says is "Strawberry is better at math than the SOTA models", which in turn means "nonzero AI progress".
  • Similar for hallucinations.
  • The one concrete example is solving the New York Times' Connections puzzle, but Claude 3.5 can already do that on a good day.

I mean, the state of affairs is by no means not worrying, but I don't really see what in this article would prompt a meaningful update?

Answer by Thane Ruthenis

Here's my understanding of the whole thing:

  • "Malign universal prior" arguments basically assume a setup in which we have an agent with a big dumb hard-coded module whose goal is to find this agent's location in Tegmark IV. (Or maybe perform some other important task that requires reasoning about Tegmark IV, but let's run with that as the example.)
  • The agent might be generally intelligent, and the Solomonoff-induction-approximating module might be sophisticated in all kinds of ways, but it's "dumb" or "naive" in an important sense: it just tries to generate the best-guess distribution over the universes the agent is in, no matter their contents, and the agent then blindly acts on it.
  • Importantly, this process doesn't necessarily involve actually running any low-level simulations of other universes. Generally intelligent/abstract reasoning, some steps of which might literally replicate the reasoning steps of Paul's post, would also fit the bill.
  • The MUP argument is that this is sufficient for alien consequentialists to take advantage. The agent is asking, "where am I most likely to be?", and the alien consequentialists are skewing the distribution such that the most likely correct answer is "simulation-captured by acausal aliens" or whatever.
    • (And then the malign output is producing "predictions" about the future of the agent's universe like "the false vacuum collapse is going to spontaneously trigger in the next five minutes unless you perform this specific sequence of actions that happen to rewrite your utility function in such-and-such ways", and our big dumb agent is gormlessly buying this, and its "real" non-simulation-captured instance rewrites itself accordingly.)
  • Speed prior vs. complexity prior: a common guess regarding the structure of Tegmark IV is that it works like a complexity prior: it penalizes K-complexity but doesn't care how much memory/compute it takes to run a universe (see the sketch below). If that is true, then any sufficiently good approximation of Solomonoff induction – any sufficiently good procedure for getting an answer to "where am I most likely to be?", including abstract reasoning – would take this principle into account, and bump up the probability of being in low-complexity universes.
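
To make that contrast concrete, here's one rough way to write it down (my own gloss; the exact formalization of a speed prior varies). The universal prior weights a hypothesis only by the lengths ℓ(p) of the programs p that produce it on the reference machine U, while a speed prior additionally discounts by the runtime t(p):

$$ M(x) \;\propto\; \sum_{p\,:\,U(p)=x} 2^{-\ell(p)}, \qquad S(x) \;\propto\; \sum_{p\,:\,U(p)=x} \frac{2^{-\ell(p)}}{t(p)}. $$

Under something like M, a universe can get a lot of probability mass purely by having a short description, no matter how expensive it is to actually run – which is the property the consequentialists-in-simple-universes story relies on.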

This all seems to check out to me. Admittedly I didn't actually confirm this with any proponents of the argument, though.

(Probably also worth stating that I don't think the MUP is in any way relevant to real life. AI progress doesn't seem to be on a track where it features AGIs that use big dumb "where am I?" modules. E. g., if an AGI is born of anything like an RL-trained LLM, it seems unlikely that its "where am I?" reasoning would be naive in the relevant sense. It'd be able to "manually" filter out universes with malign consequentialists, given good decision theory. You know, like we can.

The MUP specifically applies to highly abstract agent-foundations designs where we hand-code each piece, which currently don't seem practically tractable at all.)

... is that why this post has had unusually many downvotes?

Nah, I'd expect it's more broadly because the post makes an arrogant-feeling claim: that the organization-design market is inefficient and that you think you can beat it. Explicitly suggesting that a world in which the median person were you wouldn't look like this maybe put that in sharper relief, but I don't think that specific framing was a leading cause at all.

There might also be some backlash from people who'd previously interfaced with relatively more benign big organizations, who are wary of these sorts of models and view them as overly cynical/edgy.
