Epistemic Status: I’m not saying anything new here; just thinking through ideas for myself.
Tl;dr: besides evaluations, and some limited real-world scenarios, we can’t do much about situational awareness.
Situational Awareness is, roughly, when a model knows about itself (“I’m an LLM”) and its surroundings (“I’m editing the fine-tuning code for an internal deployment of Claude”).
We might worry about situational awareness during (a) evals, or (b) real world deployments (RWDs). With evals, if the model is aware its being evaluated, it might pretend to be more aligned than it really is, or it might sandbag. With RWDs, the model might use its knowledge of, say, currently editing training code for a frontier model, to interfere with the run in some way (if it is misaligned).
So, situational awareness might be a concern. What should we do about it? I think there are mainly two things – we could either measure it, or try to directly reduce it.
Measurement is nice because tells us *how* situationally aware models are, and flags when this might become a problem. Unfortunately, that canary has long passed. Models, especially recent ones, are exceptionally ‘situationally aware’ – regardless of how we’re operationalizing that term.
We have a few options to mitigate situational awareness.[1] The issue, though, is this is a double-edged sword. The more a model knows about its environment, the better it can navigate it. Ever dumped your entire project repo + history + context into Claude Code? Yeah, that.
So measuring situational awareness hardly tells us something new. And in most real world cases, the incentives overwhelmingly push towards providing the model with as much context as possible – so neither can we mitigate it effectively.
When it comes to evals, yes, situational awareness might be at odds with the point of an evaluation. And there probably are real world scenarios where limiting situational awareness might be crucial.[2] For the most part, however, its here to stay.
Many thanks to Andrés Cotton, Andrew Lawrence, Jo Jiao, and especially Yixiong Hao for reading through earlier drafts of this post.
Eg. we might selectively omit context, or feed the model incorrect information. We could also intervene on the residual stream to mitigate certain aspects of situational awareness (eg. eval-specific awareness).
Epistemic Status: I’m not saying anything new here; just thinking through ideas for myself.
Tl;dr: besides evaluations, and some limited real-world scenarios, we can’t do much about situational awareness.
Situational Awareness is, roughly, when a model knows about itself (“I’m an LLM”) and its surroundings (“I’m editing the fine-tuning code for an internal deployment of Claude”).
We might worry about situational awareness during (a) evals, or (b) real world deployments (RWDs). With evals, if the model is aware its being evaluated, it might pretend to be more aligned than it really is, or it might sandbag. With RWDs, the model might use its knowledge of, say, currently editing training code for a frontier model, to interfere with the run in some way (if it is misaligned).
So, situational awareness might be a concern. What should we do about it? I think there are mainly two things – we could either measure it, or try to directly reduce it.
Measurement is nice because tells us *how* situationally aware models are, and flags when this might become a problem. Unfortunately, that canary has long passed. Models, especially recent ones, are exceptionally ‘situationally aware’ – regardless of how we’re operationalizing that term.
We have a few options to mitigate situational awareness.[1] The issue, though, is this is a double-edged sword. The more a model knows about its environment, the better it can navigate it. Ever dumped your entire project repo + history + context into Claude Code? Yeah, that.
So measuring situational awareness hardly tells us something new. And in most real world cases, the incentives overwhelmingly push towards providing the model with as much context as possible – so neither can we mitigate it effectively.
When it comes to evals, yes, situational awareness might be at odds with the point of an evaluation. And there probably are real world scenarios where limiting situational awareness might be crucial.[2] For the most part, however, its here to stay.
Many thanks to Andrés Cotton, Andrew Lawrence, Jo Jiao, and especially Yixiong Hao for reading through earlier drafts of this post.
Eg. we might selectively omit context, or feed the model incorrect information. We could also intervene on the residual stream to mitigate certain aspects of situational awareness (eg. eval-specific awareness).
Eg. if you’re trying to run blind evals / reviews, and you *want* to maintain an info barrier.