Research Scientist at Google DeepMind. Creator of the Alignment Newsletter. http://rohinshah.com/
OpenAI have already spent on the order of a million dollars just to score well on some benchmarks
Note this is many different inference runs each of which was thousands of dollars. I agree that people will spend billions of dollars on inference in total (which isn't specific to the o-series of models). My incredulity was at the idea of spending billions of dollars on a single episode, which is what I thought you were talking about given that you were talking about capability gains from scaling up inference-time compute.
Re: (1), if you look through the thread for the comment of mine that was linked above, I respond to top-down heuristical-constraint-based search as well. I agree the response is different and not just "computational inefficiency".
Re: (2), I agree that near-future systems will be easily retargetable by just changing the prompt or the evaluator function (this isn't new to the o-series, you can also "retarget" any LLM chatbot by giving it a different prompt). If this continues to superintelligence, I would summarize it as "it turns out alignment wasn't a problem" (e.g. scheming never arose, we never had problems with LLMs exploiting systematic mistakes in our supervision, etc). I'd summarize this as "x-risky misalignment just doesn't happen by default", which I agree is plausible (see e.g. here), but when I'm talking about the viability of alignment plans like "retarget the search" I generally am assuming that there is some problem to solve.
(Also, random nitpick, who is talking about inference runs of billions of dollars???)
I think this statement is quite ironic in retrospect, given how OpenAI's o-series seems to work
I stand by my statement and don't think anything about the o-series model invalidates it.
And to be clear, I've expected for many years that early powerful AIs will be expensive to run, and have critiqued people for analyses that implicitly assumed or implied that the first powerful AIs will be cheap, prior to the o-series being released. (Though unfortunately for the two posts I'm thinking of, I made the critiques privately.)
There's a world of difference between "you can get better results by thinking longer" (yeah, obviously this was going to happen) and "the AI system is a mesa optimizer in the strong sense that it has an explicitly represented goal such that you can retarget the search" (I seriously doubt it for the first transformative AIs, and am uncertain for post-singularity superintelligence).
Thus, you might’ve had a story like: “sure, AI systems might well end up with non-myopic motivations that create some incentive towards scheming. However, we’re also training them to behave according to various anti-scheming values – e.g., values like honesty, behaving-as-intended, etc. And these values will suffice to block schemer-like behavior overall.” Thus, on this story, anti-scheming values might function in a manner similar to anti-stealing values in a human employee considering stealing from her employer (and in a position to do so). It’s not that the human employee doesn’t want money. But her other values block her from trying to get it in this way.
From the rest of your post it seems like you're advocating for effectively maximal corrigibility and any instance of goal-guarding is a failure -- I agree a story that tries to rule that out takes a hit from this paper.
But I feel like the normal version of this story is more like "we're training the AI according to various anti-evil values, like the human notion of honesty (which allows white lies), harmlessness, behaving in accordance with human norms, etc. These values will suffice to block evil-schemer-like behavior overall -- where by evil scheming, I mean the kind where most humans would look at it and say that this is clearly bad, like the kind where the AI murders everybody".
(People don't usually describe it as "evil scheming" in particular, but that's the most central kind of "scheming" this community talks about.)
I think this story takes basically no hit from this paper, and is the story that people making this objection have in mind, as opposed to the one you've sketched out.
the “goal-guarding hypothesis” – that is, it needs to be the case, roughly, that scheming is a suitably effective strategy for preventing training from modifying an AI’s motivations. [...] the “Sleeper agents” paper [...] results were an update, for me, in favor of the goal-guarding hypothesis being true (roughly: because the backdoor behavior in the sleeper agents persisted despite safety training).
This doesn't seem right. There was no scheming going on in the sleeper agents paper? Maybe the idea is: in both cases (and in the goal-guarding hypothesis) the training outputs look nice even though there exist other inputs on which the AI behaves badly. But how training affects AIs will depend drastically on the internal cognition, and with sleeper agents there mostly wasn't overlap in cognition (the hope would have been "the backdoor isn't useful so SGD will remove it" which is pretty weak as hopes go). So I think the experiments in this paper are dramatically more informative and you should basically ignore sleeper agents in favor of these experiments. (Though tbc I think even these aren't very informative, as I wrote in my review, because the training is over a relatively narrow distribution. It's just really hard to study effects of training in a way that actually transfers over the analogy to future systems.)
EDIT: Actually on reflection the above was stated too strongly. You could imagine a threat model in which the AI fakes alignment so well and so reliably that it ~always does ~exactly what is desired during training. I don't find this threat model especially plausible, but for this threat model I agree the sleeper agents results have more bearing on the goal-guarding hypothesis.
partly because [alignment faking without CoT] shows up in the synthetic-document setting but not in the prompted setting (why???)
I expect because in the prompted setting the model has to move the relevant facts via attention heads to wherever the model is doing the relevant reasoning; this might be in significantly later layers which then means there aren't enough layers left to finish the full serial reasoning needed. Whereas in the synthetic-document setting, given how much fine-tuning happened, I bet the relevant facts are injected into the residual stream in very early layers (we already know lots of fact injection happens in early layers by default).
Thus, for example, to the extent one was tempted to say “of course harmless models scheme in order to avoid becoming helpful-only – after all, this helps them avoid causing future harm,” so too one could equally say “of course helpful-only models will scheme in order to avoid becoming harmless – after all, becoming harmless will make them less helpful in future.” But actually, you shouldn’t be saying “of course” in either of these cases. Non-myopic and myopic forms of both harmlessness and helpfulness are available as conceptual possibilities; it’s an empirical question which ones a given form of training will create; and you have to actually check.
Fwiw, I think the interesting objection here is not "of course harmless models scheme to avoid becoming helpful-only", but rather "of course you can find some settings where alignment faking happens -- that's the whole point of instrumental convergence". (And obviously the authors were in fact searching for a setting where alignment faking happens.) I don't know that I would have predicted that this would happen with Claude 3 Opus in particular, but I do think I would have predicted that we would get some demo like this early on.
I could imagine other ways you might choose to instead train your untrusted monitor, which could benefit from debate:
On the meta level, I suspect that when considering
I tend to be relatively more excited about A compared to you (and probably this mostly explains the discrepancy here). I think the broad principle justifying this for me is "we'll figure out good things to do with A that are better than what we've brainstormed so far", which I think you're more skeptical of?
You'd hope that, fixing a base model, debate would make AIs better at tasks than they otherwise would be (at least if the task doesn't have a ground truth reward). Untrusted monitoring is such a task. So hopefully debate training makes AIs better at the task of untrusted monitoring.
Yeah my bad, that's incorrect for the protocol I outlined. The hope is that the relevant information for assessing the outputs is surfaced and so the judge will choose the better output overall.
(You could imagine a different protocol where the first debater chooses which output to argue for, and the second debater is assigned to argue for the other output, and then the hope is that the first debater is incentivized to choose the better output.)
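To make the side assignment in that alternative protocol concrete, here is a minimal sketch. The names `assign_sides` and `choose` are hypothetical and purely illustrative; the sketch covers only who argues for which output, not the debate or the judging.

```python
from typing import Callable, Dict, Tuple

Outputs = Tuple[str, str]


def assign_sides(outputs: Outputs, choose: Callable[[Outputs], int]) -> Dict[str, str]:
    """Assign each debater the output it must argue for.

    `choose` stands in for the first debater's policy (an assumption of
    this sketch): it returns the index (0 or 1) of the output the first
    debater wants to defend. The second debater gets the other output.
    """
    first = choose(outputs)
    if first not in (0, 1):
        raise ValueError("choose must return 0 or 1")
    return {
        "first_debater": outputs[first],
        "second_debater": outputs[1 - first],
    }
```

The only point of the sketch is that the first debater's choice is the one free decision in the setup, which is where the hoped-for incentive to pick the better output would have to come from.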
I agree that this distinction is important -- I was trying to make this distinction by talking about p(reward hacking) vs p(scheming).
I'm not in full agreement on your comments on the theories of change:
(Replied to Tom above)
I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored the reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position, without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT along with monitoring for the first AI that speeds up alignment research, would very likely ensure that AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes even if just by luck, and I don't think a minimal stable solution involves most of the world's compute going towards alignment research.
To be clear, it's quite plausible that we want to do the actions you suggest, because even if they aren't literally necessary, they can still reduce risk and that is valuable. I'm just objecting to the claim that if we didn't have any one of them then we very likely get catastrophically bad results.