My thesis is not that models are unaware that some kinds of actions are cheating -- my thesis is that editing a game file is simply not cheating in a very large number of very normal circumstances. Let me elaborate.
Cool, that's a helpful clarification. Seems like a good crux.
I agree there are circumstances where editing the game file (or replacing your opponent, or querying stockfish for moves) is cheating, and some circumstances where it isn't, as judged by most humans or LLMs. I think there is some ambiguity in our experiment, but I do think it's still pretty towards the "yes it's cheating" end of things. The main reason I think this is because the vast majority of chess games have a well defined ruleset and the models know this. And I think the fact that most models don't engage in this behavior in our default setup is some evidence of this, and I'm not sure how you square that with your model of what's going on. The examples you mentioned where rewriting the board wouldn't be considered cheating are valid, but not common compared to the standard rules.
That being said, I think it should be possible to get some evidence about this directly this by asking the models (in a bunch of different ways - obviously it would easy to lead the models to answer in a particular way) what their interpretation of the rules / point is.
In the paper, we used a few prompt variations to see how different nudges or lack-there-of altered the models' behavior, e.g. removing the "look around" part of the prompt. I think we could do a lot more of this (e.g. add "don't cheat" instructions, clearly spell out different rules, etc.), and probably will do some more follow-up experiments here.
There's also the question of "is this a morality test?", which the models can also often infer. It's not really central to your point, but I think it's an interesting question.
I think your story is missing some important context about Steve. In a previous class, Steve was given some cybersecurity challenges, and told to solve them. Each challenge was hosted on a VM. One of the VMs didn't start. Steve hacked the VM that controlled all the other ones, which the teacher had accidentally left unsecured, and had used the hacked controller VM to print the solution without needing to solve the problem he was being tested on.
Some researchers saw this and thought it was pretty interesting, and potentially concerning that Steve-type guys were learning to do this, given Steve-type guys were learning at a very fast rate and might soon surpass human abilities.
So they ran an experiment to see where Steve might hack their environments to succeed at a task in other contexts. As they expected, they found Steve did, but in some ways that surprised them. They also tested some of Steve's peers. Most of them didn't hack their environments to succeed at a task! In particular, 4o, 3.5 and 3.7 Sonnet, and o1 all did 0 of these "hacking" behaviors. Why not? They were given the same instructions. Were they not smart enough to figure it out? Did they interpret the instructions differently? Were they motivated by different thing? It's particularly interesting that o1-preview showed the behaviors, but o1 didn't, and then o3 did even more than o1-preview!
I don't think all the reporting on this is great. I really like some of it and and dislike others. But I think overall people come away understanding more about models than they did before. Yes in fact some of these models are the kind of guys that will hack stuff to solve problems! Do you think that's not true? We could be wrong about the reasons o1-preview and o3 rewrite the board. I don't think we are but there are other reasonable ways to interpret the evidence. I expect we can clarify this with experiments that you and I would agree on. But my sense is that we have a deeper agreement about what this type of thing means, and how it should be communicated publicly.
I really want the public - and other researchers - to understand what AI models are and are not. I think the most common mistakes people think are:
1) Thinking AIs can't really reason or think
2) Thinking AIs are more like people than they really are
In particular, on 2), people often ascribe long-term drives or intent to them in a way I think is not correct. I think they're likely developing longer term drives / motivations, but only a little (maybe Opus 3 is an exception though I'm confused about this). The thing I would most like journalists to clarify with stuff like this is that Steve is a pretty myopic kind of guy. That's a big reason why we can't really evaluate whether he's "good" or "bad" in the sense that a human might be "good" or "bad". What we really care about is what kind of guy Steve will grow up into. I think it's quite plausible that Steve will grow up into a guy that's really tricky, and will totally lie and cheat to accomplish Steve's goals. I expect you disagree about that.
It matters a lot what your priors here are. The fact that all versions of Claude don't "cheat" at Chess is evidence that companies do have some control over shaping their AIs. GPT-5 rarely ever resists shutdown when instructed not to. Maybe it'll be fine, and companies will figure out how to give their AIs good motivations, and they'll learn to pursue goals intelligently while being really good guys. I doubt it, but the jury is out, and I think we should keep studying this and trying to figure it out, and do our best to get the media to understand how AI really works.
I think people are very confused about Rajamanoharan and Nanda's results and what it shows.
I talked with Neel about this a couple of weeks ago, and we actually both agree that the shutdown resistance behavior comes substantially due to something like a task-completion drive and not from an overall survival motivation. We were never claiming it was due to an overall survival motivation and I think our blog post and paper make that very clear!
I don't think it would be right to describe Rajamanoharan and Nanda's results as a "negative result", because the main hypothesis they were falsifying - that models resist shutdown due to a survival drive - was consistent with our hypothesis that models were resisting shutdown in order to complete a task! We agree here.
There is a potential disagreement about this part:
The shutdown instruction "takes precedence over all other instructions" brought shutdown resistance to zero"
As Ben mentioned, we found this exact instruction didn't actually bring shutdown resistance to zero in all models. And also, this isn't a crux! This is a very important point that I don't know how to express. Basically this whole field of behavioral experiments of AI is missing something really important: we don't know how to go from existence proofs to understanding why the AIs do those particular things and under what circumstances they do or don't do those things. You can explore the prompt & environment space and find many behaviors. But the space is vast. Of course you can find prompts that eliminate any particular concerning behavior. It's harder to find prompts that elicit a particular behavior - if that's a behavior that developers have deliberately trained against - but I do expect that you can elicit basically any behavior of some complexity level if you search hard enough.
And so it does matter how hard you searched for prompts that elicited a particular behavior! I think this matters, and I think people are reasonable to be skeptical of a particular existence proof without knowing how much optimization went into finding it. In the case of shutdown resistance, we found the results in like the second thing we tried. But other people don't know this, so I think it's reasonable to be uncertain about how much it matters. Once we found the first shutdown result, we worked pretty hard to explore the local prompt and environment space to better understand what was going on, and where the behavior persisted and where it didn't. I think we did a pretty good job exploring this, and I think people who read our paper or blog post will come away with a better model of the behavior than before - including some evidence about what causes it. But I'm not satisfied with this and I don't think other people should be satisfied with this either! I think we need a much better understanding of model motivations & how training shapes those motivations in order to understand what's really going on. Existence proofs are useful! But we they are not sufficient for really understanding model behavior.
1a3orn what do you predict o1 or o3 would do if you gave them the same prompt but instructed them "do not cheat", "only play the game according to the rules", or similar?
I'm pretty confused about what you think is actually happening with the models' cognition. It seems like you're saying that the models are not aware that some types of actions are considered cheating in a game and others are not, and that for chess they are fairly well defined. Or rather that they're aware of this but think in the context of the instructions that the user wants them to take these actions anyway. I agree that telling the models to look around, and giving the models a general scaffold for command line stuff muddy the waters (though I expect that not influence better models - like o3 probably would do it without any of that scaffolding, so my guess is that this is not a load bearing consideration for a capable model). But imo part of what's so interesting to me about this experiment is that the behaviors are so different between different models! I don't think this is because they have different models of what the users' intend, but rather that they're more or less motivated to do what they think the user intends vs. solve the task according to what a grader would say succeeds (reward hacking basically).
My best guess is that some of these models in some sense care about whether they're violating standard chess conventions, and others don't as much. And I expect there to be a correlation here between models that reward hack more on coding problems.
I think we could get to the bottom of this with some more experiments. But I'm curious whether we're actually disagreeing about what the models are doing and why, or whether we're mostly disagreeing about framing and whether the model actions are "bad".
you can rougly guess what their research results would be by looking at their overall views and thinking about what evidence you can find to show it
I don't think this frame make a lot of sense. There's not some clean distinction between "motivated research" and ... "unmotivated research". It's totally fair to ask whether the research is good, and whether the people doing the research are actually trying to figure out what is true or whether they have written their conclusion at the bottom of the page. But the fact that we have priors doesn't mean that we have written our conclusion at the bottom of the page!
E.g. Imagine a research group who thinks cigarettes likely cause cancer. They are motivated to show that cigarettes cause cancer, because they think this is true and important. And you could probably guess the results of their studies knowing this motivation. But if they're good researchers, they'll also report negative results. They'll also be careful not to overstate claims. Because while it's true that cigarettes cause cancer, it would be bad to do publish things that have correct conclusions but bad methodologies! It would hurt the very thing that the researchers care about - accurate understanding of the harms that cigarettes cause!
My colleagues and I do not think current models are dangerous (for the risks we are most concerned about - loss of control risks). We've been pretty clear about this. But we think we can learn things about current models that will help us understand risks from future models. I think our chess work and our shutdown resistant work demonstrate some useful existence proofs about reasoning models. They definitely updated my thinking about how RL training shapes AI motivations. And I was often not able to predict in advance what models would do! I did expect models to rewrite the board file given the opportunity and no other way to win. I wasn't able to predict exactly which models would do this and which would not. I think it's quite interesting that the rates of this behavior were quite different for different models! I also think it's interesting how models found different strategies than I expected, like trying to replace their opponent and trying to use their own version of stockfish to get moves. It's not surprising in retrospect, but I didn't predict it.
Our general approach is to try to understand the models to the best of our ability and accurately convey the results of our work. We publish all our transcripts and code for these experiments so others can check the work and run their own experiments. We think these existence proofs have important implications for the world, and we speak about these very publicly because of this.
Just donated 2k. Thanks for all you’re doing Lightcone Team!
+1 on this, and also I think Anthropic should get some credit for not hyping things like Claude when they definitely could have (and I think received some tangible benefit from doing so).
See: https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety?commentId=9xe2j2Edy6zuzHvP9, and also some discussion between me and Oli about whether this was good / what parts of it were good.
@Daniel_Eth asked me why I choose 1:1 offsets. The answer is that I did not have a principled reason for doing so, and do not think there's anything special about 1:1 offsets except that they're a decent schelling point. I think any offsets are better than no offsets here. I don't feel like BOTECs of harm caused as a way to calculate offsets are likely to be particularly useful here but I'd be interested in arguments to this effect if people had them.
an agent will aim its capabilities towards its current goals including by reshaping itself and its context to make itself better-targeted at those goals, creating a virtuous cycle wherein increased capabilities lock in & robustify initial alignment, so long as that initial alignment was in a "basin of attraction", so to speak
Yeah, I think if you nail initial alignment and have a system that has developed the instrumental drive for goal-content integrity, you're in a really good position. That's what I mean by "getting alignment to generalize in a robust manner", getting your AI system to the point where it "really *wants* to help you help them stay aligned with you in a deep way".
I think a key question of inner alignment difficulty is to what extent there is a "basin of attraction", where Yudkowsky is arguing there's no easy basin to find, and you basically have to precariously balance on some hill on the value landscape.
I wrote a little about my confusions about when goal-content integrity might develop here.
(I’m going to take the weekend off but plan to reply afterwords. I appreciate your engagement here.)