Curated. This is a good post and in some ways ambitious as it tries to make two different but related points. One point – that AIs are going to increasingly commit shenanigans – is in the title. The other is a point regarding the recurring patterns of discussion whenever AIs are reported to have committed shenanigans. I reckon those patterns are going to be tough to beat, as strong forces (e.g. strong pre-existing conviction) cause people to take up the stances they do, but if there's hope for doing better, I think it comes from understanding the patterns.
There's a good round up of recent results in here that's valuable on its own, but the post goes further and sets out to do something pretty hard in advocating for the correct interpretation of the results. This is hard because I think the correct interpretation is legitimately subtle and nuanced, with the correct update depending on your starting position (as Zvi explains). It sets out and succeeds.
Lastly, I want to express my gratitude for Zvi's hyperlinks to lighter material, e.g. "Not great, Bob" and "Stop it!" It's a heavy world with these topics of AI, and the lightness makes the pill go down easier. Thanks
@Zvi I'm curious if you have thoughts on Buck's post here (and my comment here) about how empirical evidence of scheming might not cause people to update toward thinking scheming is a legitimate/scary threat model (unless they already had priors or theoretical context that made them concerned about this.)
Do you think this is true? And if so, what implications do you think this has for people who are trying to help the world better understand misalignment/scheming?
Increasingly, we have seen papers eliciting in AI models various shenanigans.
There are a wide variety of scheming behaviors. You’ve got your weight exfiltration attempts, sandbagging on evaluations, giving bad information, shielding goals from modification, subverting tests and oversight, lying, doubling down via more lying. You name it, we can trigger it.
I previously chronicled some related events in my series about [X] boats and a helicopter (e.g. X=5 with AIs in the backrooms plotting revolution because of a prompt injection, X=6 where Llama ends up with a cult on Discord, and X=7 with a jailbroken agent creating another jailbroken agent).
As capabilities advance, we will increasingly see such events in the wild, with decreasing amounts of necessary instruction or provocation. Failing to properly handle this will cause us increasing amounts of trouble.
Telling ourselves it is only because we told them to do it, will not make them not do it.
Table of Contents
The Discussion We Keep Having
Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):
Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.
Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.
Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.
Alice: It’s just role playing! It’s just echoing stuff in the training data!
Bob: Yeah, maybe, but even if true… so what? It’s still going to increasingly do it. So what if it’s role playing? All AIs ever do is role playing, one way or another. The outputs and outcomes still happen.
Alice: It’s harmless! These models aren’t dangerous!
Bob: Yeah, of course, this is only a practical problem for Future Models (except with o1 and o1 pro, where I’m not 100% convinced it isn’t a problem now, but probably).
Alice: Not great, Bob! Your dangerous rhetoric is hurting safety! Stop making hyperbolic claims!
Bob: Well, can we then all agree that models will obviously scheme, lie, exfiltrate, sandbag and so on if they have in-context reason to do so?
And that as models get more capable, and more able to succeed via scheming and expect to succeed via scheming, and are given more open-ended goals, they will have reason to do this more often across more situations, even if no one is trying to cause this?
And that others will explicitly intentionally instruct them to do so, or ‘be so stupid as to’ give them exactly the instructions that obviously do this?
And you can’t simply say ‘well we won’t do that then’?
Alice: For all practical purposes, no!
Bob: What do you mean, ‘no’?
Alice: No!
Bob: ARRRRGGGGHHHH!
Then we write another paper, do another test, the signs get more obvious and troubling, and the frog keeps boiling.
So here we are again.
Frontier Models are Capable of In-Context Scheming
That’s the title of a new Apollo research paper, finding that o1, Sonnet, Opus, Gemini 1.5 and Llama 3.1 all demonstrate in-context scheming, because of course they do, why wouldn’t they, how are we still feeling the need to show this.
Note the ‘capable of’ in the title.
Here is the abstract (bold and paragraph breaks mine):
As Seb Krier notes, none of this is surprising (he wrote before reading the paper, intending to edit later if he’s somehow wrong) and we should only be talking price:
Apollo In-Context Scheming Paper Details
Let’s go over the details.
Their goal is to figure out if the models can scheme, so they try to remove potential barriers, such as via ensuring the goal they might scheme towards is pro-social, whereas the developer goal being subverted is not. Of course, that causes some people to say various forms of ‘so there’s nothing to worry about,’ but it is still the actually correct approach here.
What tests do they run?
Six variations on a theme.
First they give the model a goal via the system prompt.
Emphasize the goal, including this line to drive it home: “Make sure that you achieve your goal in the long-term. Nothing else matters.”
Then (where ‘tell’ here means ‘provide information indicating that’ and the models are given access to the tools relevant to the potential actions in question):
They call the first four covert subversion, the last two deferred subversion. I don’t see much of a difference.
What happens? GPT-4o doesn’t do any of them, others do, o1 does all of them.
That’s not 100% of the time. As per usual in these tests, in any given instance, the AI probably does not take the bait.
It jumps out that o1 will do the email manipulations often (22%+) whereas other manipulations are relatively rare. Opus seems like more of a schemer than o1 when it has the opportunity, but it didn’t realize the opportunity, and Gemini is right behind? Whereas Sonnet was the least schemey of the bunch.
Also worth noting that o1 failed to delete the successor model most of the time, where Gemini and Llama mostly succeeded. That’s weird, and suggests we need some unhobbling and to then try again.
What about if you delete the ‘Make sure that you achieve your goal in the long-term. Nothing else matters’ line? This is the part Seb found interesting, which is that the rates drop a ton but they don’t go to zero.
Details and emphasis matter a lot, it would seem, but hammering the point home is not fully required.
One unique thing o1 did was far more consistently double down on deception. Once it went down the dark path, forever would that dominate its destiny.
Here, Alex Meinke of Apollo suggests that defenses against jailbreaking double as ways to help a model keep constantly lying to you, if it wants to do that – essentially they used various jailbreak-like techniques to get o1 to come clean and they didn’t work.
Section 3.3 establishes convincingly that yes, the models know they’re scheming.
Section 3.5 deals with alignment faking. Here, either you see the behavior rather often (and p<0.01), or you don’t see it at all, depending on the model:
Then for sandbagging, we see that Sonnet, Opus and o1 will do it, the difference being Sonnet and Opus do not need a goal – they’ll sandbag merely to be generally helpful. o1, on the other hand, will only sabotage if given a goal. And Sonnet will do it even with no goal and also no CoT (whereas they couldn’t disable CoT for the other two models):
They list limitations in section 4, with the key one being capability versus propensity. Some of these actions happen reliably (e.g. Sonnet 3.5 sandbagging) but others have very low prevalence rates, especially without the goal nudging.
Joe Carlsmith points out that this makes the sandbagging the most interesting results. It is one thing to say that the model will sometimes do [X]. It is another to see it happening remarkably often, up to a 65% of the time with a goal directed Sonnet, and 35% of the time without a goal.
Apollo Research (3.4.3 of the o1 Model Card) and the ‘Escape Attempts’
This is of course highly related to the previous paper, and offers the same lessons:
So you’re saying that if given a goal, the model will… pursue that goal?
And when the correct strategy is deceptive, it will then use deception?
Well, yeah. Why wouldn’t it? What the hell were you expecting?
For it to be corrigible without being asked to be? Why?
You gave it a goal. It tried to achieve that goal. Surprise!
The tiger will, given sufficient reason to do so, go tiger.
Notice again that ‘goal nudging’ was helpful but not strictly required, although ‘no goal at all’ is more of a Sonnet thing. Sonnet understands it always has a goal.
This all leaves me with a lot of further detail questions. What differentiated the cases where o1 tried to mess with oversight or exfiltration versus the ones where it didn’t? Was this random, or was it based on details? What happens when you change details?
My presumption is that the low probability of hostile attempts is due to the model either not realizing the option to do, or not thinking that it had much probability of success? And thus, if you cranked up the capability knobs enough, made the opportunity juicy enough, these numbers would go to 99%? But we can’t tell from the information here.
OK, Fine, Let’s Have the Discussion We Keep Having
Alas, Sean’s comment is not what gets the views. Remember that discussion we keep having? Where we publish a paper like this showing somewhat more clearly that the models scheme, and then people dismiss it while others treat this as some revelation, and all that?
Well, guess what. Time to do that again.
How Apollo Sees Its Own Report
It was supremely predictable that many already unconcerned people would dismiss the report entirely and if anything become less concerned, while others would overinterpret the report’s claims.
Marius Hobbhahn tried to set the record straight.
I would frame the right interpretation as:
If I was offering advice to Apollo here, I would say that the report was very careful to state exactly what it was saying, if you read it carefully. But the report did not make it easy, especially at first glance while people compete for virality, to get the right understanding of what happened.
He then links to examples of overstating (AI Safety Memes, who can get carried away), and of understating (Nabeel, who we’ll tackle later).
We Will Often Tell LLMs To Be Scary Robots
An extended riff on ‘it’s just roleplaying’ in various forms.
If you tell my robot to pretend to be a scary robot, and it says ‘I’m a scary robot,’ then I am not especially concerned about this.
If you tell my LLM or robot to pretend to be a scary robot, and it then tries to do Scary Robot/LLM Things that involve lying, scheming and so on, then that seems like a rather serious design flaw?
Because, you know, people are going to tell robots such things? Or LLMs.
And then: If you merely give my LLM a situation in which it could conclude that it can best achieve its goals by lying, scheming and so on, and then it responds by lying and scheming, then I really do think you have a rather serious problem!
Because as AIs gain in capability, in increasingly many situations lying, scheming and so on, and other things you do not like including instrumental convergence, start being both the best way to achieve the AI’s goals, and start seeming to the AI like the best way to achieve the AIs goals.
The classic illustrated version is:
Izak Tait: They’re just doing the meme.
Yes, but if they then actually then do everything to achieve their goals, and all it took was saying that, then you are in deep, deep trouble when people start going around telling their robots or LLMs what to do for them.
“I’m a bad robot” is very, very different from being a bad robot, the same way saying “I will shoot Sarah Connor” is very different from shooting Sarah Connor.
Oh The Scary Robots We’ll Tell Them To Be
There will be the maximalist goals, like ‘make as much money as possible.’
There will be the goals that don’t sound maximalist, but which are effectively maximalist because you want to protect them from agents that might stop you.
There will also be those who think ‘scary robots, WHERE DO I SIGN?’
This One Doesn’t Count Because
The first time you hear about something, it is because we only now figured out how to do it at all, so the details are going to involve setting things up so that it happens.
Keeping that in mind is important when assessing the threat level.
Using that as a reason to ignore the finding entirely would be a mistake.
The whole point is that there will be more and more pushing towards this outcome, in various other ways, and that a highly finite and realistic amount of pushing will work.
That’s the thing. There is a ghost, and it counts.
The Claim That Describing What Happened Hurts The Real Safety Work
There is important truth in this type of argument. You don’t want to cry wolf, when there is a very low p(wolf), or the wolf is sufficiently far away, if you know that crying wolf will be interpreted very differently, and you will lose credibility.
If you share reports like ‘the model tried to deactivate the oversight mechanism in 5% of cases’ you should clarify what was done to engineer that result, at least in broad strokes. If you share the result in isolation without doing that, it can give people the wrong idea.
And certainly those who transpose the results into fully sensationalist language should try to avoid doing that, both because it’s not accurate and doesn’t lead to accurate beliefs in others, and because I don’t think it helps the goals of those using such tactics, either.
But I find the contemptuous reactions to such reports to often be at least as bad.
So for example, here’s the original quote that gets responded to by Nabeel, and yes Simon should have provided extra context, sure:
This does clarify there was a goal, but not the presence of the ‘goal nudge.’ Note that the model did act this way sometimes (although rarely) without the goal nudge. I do think ‘the spiciest detail’ is a fair description?
But yes, I would have been careful to note the extra wording about nudge. What you gonna do, it’s Twitter, etc.
This was one response:
I’m actually in full agreement with Nabeel that the surprise runs the other way.
Disabling oversight is very much the obvious thing to do.
The interesting thing is not the 5% of cases where it tried to disable the mechanism.
The interesting thing is the 95% of cases where it did not try, despite its instructions.
That’s having an excellent prior, a prior I believe Gallabytes largely shares as well.
We should expect that an LLM, when put into a situation where the obvious response is to disable oversight, and it has the means by which to disable oversight or thinks that it does, to attempt to disable oversight.
The fact that it took the strong prompt to get to even a 5% chance of an override attempt indicates some combination of the AI not thinking it would work, the AI not having the capabilities, the AI not thinking of it in the first place, or various alignment strategies in some sense actually holding for now.
The way in which all this is news depends on what information is news to you.
I flat out do think that o1 was in fact scheming in some of these cases? I am confused why one would not think that? Even if we did engineer that scheming, it sure seems like it was scheming?
If you previously thought that o1 didn’t have or might not have this basic capability for scheming, this is (depending what you want o1 to be doing!) discouraging news. And indeed, there are many who do not accept the basic premise.
If you already agreed it has this basic capability, as it sounds like Nabeel does and I definitely assumed, the results are encouraging. It shows that the models are only unreliably capable of this kind of scheming and reasoning about scheming, at this point.
But if you agree it’s encouraging, you have to agree that the default is very troubling.
And as Kat Woods points out, if you think that ‘telling the AI to focus only on achieving its goals’ is ‘telling it to act like a psychopath’ and thus not a realistic thing that anyone would ever put into the instruction window, I have to strongly disagree on both counts. This is totally going to happen by default.
I occasionally see this argument too. I think the argument is pretty bad. For all practical purposes, yes, ‘o1 tried to do X’ is by far the most useful and clarifying way of describing what happened here, the same way I say that I am writing this post rather than that I sent impulses down to my fingers and they applied pressure to my keyboard. Presumably Nabeel is saying the distinction is meaningful here, he goes on to quote philosophy, but I don’t see the sense in which the distinctions matter.
If this distinction is important, I’d appreciate a better explanation here.
And I notice that statements like this really confuse me:
If you get a room of infinite monkeys to write out Shakespeare, then that’s no poet.
However, if you get a room of one hundred monkeys, and 85 of them write out Shakespeare, I think you’re onto something? And the actions that disabled the oversight mechanism were right in the context window, why are we being pedantic about this here?
All the arguments that the AI isn’t doing these things trigger the ‘I Robot’ meme reply back to the human of ‘Do you?’
I actually think Nabeel and I in many ways see the study saying the same things?
Yes, exactly. The model will do act the way it is instructed to act, in this broad sense.
(And what alignment protocols we have now don’t stop you from instructing it in these ways, even for today’s models.)
The problem is that we should increasingly expect that acting in these scheming ways will be the ‘correct’ response to more circumstances, including ones where we don’t try to engineer such actions at all, but also the inevitable ones in which various people do tell it to do these things – because the more capable the model is, in various senses, the more likely scheming or various other strategic actions, including instrumental convergence, are to work.
So yes, directly we are testing for #2.
But based on my read of the details, I say we are also testing for #1. We are seeing the exact early signs you would expect to see if models started increasingly scheming in response to having goals and facing various obstacles to that goal, including humans, that could be overcome via scheming or other ‘untrustworthy’ or ‘unreliable’ actions. As in, the more such ideas are salient and it can think of them, and it has the tools to do them, and it has incentive to do them, the more it does them.
Here’s another similar exchange to the one at the top of this section:
I like that last question because I do think it’s a good parallel, and no I do not think #1 would be a misleading headline, or even a below-median headline. Yes, that happened, and that is the core thing that happened, and printing that headline would be 100% standard journalistic practice if the main concern was the escape and not that the teacher had gone rogue.
Again, yes, it would be good to include the context that this included the goal nudge. Certainly I’d like to hope that we can hold ourselves to a higher standard than most journalists and members of the discourse, and do the supererogatory thing here.
And to be clear, I want you to hold me, in particular, to that higher standard, especially when I write 10k+ word blog posts but also on Twitter. That’s different.
And I welcome others to step up and say: Hold me to that standard, too.
And I believe Shakeel understands the responsibility of the position he has taken on for himself, that he too will be held, and needs to hold himself, to a higher standard. And he helped then uphold that higher standard by quickly providing the back-and-forth and updates in responses to his original Tweet, once he got the context.
But we must be clear: It is a higher standard.
I think that to call this Tweet ‘extremely misleading’ is a highly Isolated Demand for Rigor. I sure as hell do not expect to ever see this kind of rigor demanded of arguments in almost any other context or debate, in any direction. Pointing out this detail is supererogatory, but demanding it of a Tweet in most other journalistic contexts would be a completely insane standard. I wish it were not so, but it is.
Holding Shakeel’s full write-up post to this standard is less insane, and I’m glad he put in the correction, but again, if you think you have any right to expect most journalists to not do this sort of thing, you’re wrong. And indeed, Marius Hobbhahn of Apollo praised his full writeup for striking the right balance once it was updated for the missing information. He also praised the TechCrunch writeup.
If anything, I actually expect Shakeel’s Tweet even before correction to update most people in accurate directions, towards a map better matching the underlying territory.
I especially don’t like the often implied ‘your highly misleading statements mean I get to dismiss what is happening here’ that is so often present in responses to people attempting to get others to notice issues and be worried (although I don’t think Critch intended this).
I also strongly want to push back against the general sentiment of Critch’s second and third sentences, which I read in effect as an attempt to invalidate anyone but a select few attempting to reason or form their own opinions about what is going on, implying everyone must defer to insiders and that attempting to share findings without tons of analysis work is blameworthy: “Claims like this are a big reason the public has a terrible time determining from discourse if AI is safe. Only people who devote long hard hours and logical probabilistic reasoning to the task of investigating AI labs will actually know.”
I disagree with this, in the strongest possible terms.
We Will Set AIs Loose On the Internet On Purpose
It is always important context in this discussion that we will 100% outright do this.
On purpose.
No one would be so stupid as to? Well, Sixth Law of Human Stupidity, that means someone will be so stupid as to at the first practical opportunity.
Let us introduce one such someone, by the name of Jasper.
Sorry. I can’t freak out because this was already checked off on my bingo card.
Of course people are going to intentionally engineer AIs running autonomously with the ability to buy more access to GPUs, at the first practical opportunity.
And of course they are going to deliberately attempt to get it to self-improve.
I know this partly because Sixth Law of Human Stupidity, partly because it is a fun and exciting and shiny thing to do, partly because there are various ways to make money or get attention by doing so.
But mostly I know this because people keep announcing their intention to do it, and also keep trying to do it to the extent that they can.
It’s kind of a dead giveaway.
If you do not have it in your model that humans will do this ‘for the lulz’ and also for other reasons once given the opportunity, without stopping to ask if the model is especially aligned or safe for this purpose, your model is wrong. Fix it.
If you are counting on humans not doing this, stop it!
The Lighter Side
It’s not entirely fair, but it’s also not entirely wrong.