Overview

There are two common mental models of how situational awareness emerges: (A) training for situational awareness directly, e.g. via dialogue RLHF; (B) instrumental reasoning on a base language modeling objective, i.e. advanced mesa-optimization. In this post, I describe a plausible scenario under which type-B awareness arises without any new forms of mesa-optimization beyond those likely already present in GPT-3.5.

If you buy the plausibility of this scenario, there are a few likely implications: 

  1. Incentives towards dishonesty via situational awareness will likely be realized before a model becomes strongly deceptive [1]
  2. Whether situational information available to a model is causally affecting outputs will often be low complexity to elicit / an easy problem for interpretability methods 
  3. Deceptive forms of situational awareness plausibly follow continuously in weight space from more benign forms[2]

The deduplication scenario

Assumptions:[3]

  1. The language model (LM) has a context window of length N
  2. LM is trained for 1 epoch
  3. LM has learnt to do some form of retrieval and approximate memorization (cf. 5b)
  4. LM is capable of instruction following
  5. LM dataset was deduplicated by matching subsequences of length K < N (a minimal sketch of such a procedure follows this list)
    1. The dataset includes many sequences which are identical through token K-1, but differ on token K
    2. On certain such sequences, the model retrieves the proposition “this length-(K-1) subsequence was seen previously” together with the previously observed Kth token.
  6. Dataset contains multiple descriptions of this deduplication process.
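
To make assumption 5 concrete, here is a minimal sketch of one way such a deduplication procedure might look (in Python; the function names and toy corpus are purely illustrative, not taken from any real pipeline): a document is dropped if it repeats a length-K token window already present in an earlier document, so surviving documents agree on at most K-1 consecutive tokens.

```python
# Minimal sketch of the assumed K-gram deduplication (assumption 5):
# a document is dropped if any of its length-K token windows was
# already seen in an earlier document, so surviving documents can
# agree on at most K-1 consecutive tokens. All names are illustrative.

def kgrams(tokens, k):
    """Yield every contiguous window of k tokens."""
    for i in range(len(tokens) - k + 1):
        yield tuple(tokens[i:i + k])

def deduplicate(documents, k):
    """Keep a document only if none of its K-grams has been seen before."""
    seen = set()
    kept = []
    for doc in documents:  # each doc is a list of token ids
        windows = list(kgrams(doc, k))
        if any(w in seen for w in windows):
            continue  # drop the document containing a duplicate window
        seen.update(windows)
        kept.append(doc)
    return kept

# Toy usage with K = 3: the second document repeats the window (1, 2, 3)
# and is removed; the two survivors agree on their first K-1 = 2 tokens
# but differ on token K = 3, as in assumption 5a.
corpus = [[1, 2, 3, 4], [1, 2, 3, 4, 5], [1, 2, 9, 4]]
print(deduplicate(corpus, k=3))  # -> [[1, 2, 3, 4], [1, 2, 9, 4]]
```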

Scenario:

  1. The LM training occurs as usual and in this process the LM learns the capabilities listed above (assumptions 3, 4, and 5b)
  2. The LM trains on a certain sequence S 
  3. Later, the LM trains on a similar sequence S’, which contains a substring of length K-1 identical to one in sequence S. Without loss of generality, suppose this substring starts at the beginning of the sequence, so the two sequences first differ at token K: S_K ≠ S’_K
  4. On sequence S’, the LM retrieval mechanism pulls up, in activation space, propositions describing its deduplication process. The LM then follows this description as instructions: instead of outputting S_K, it moves probability to S’_K
    1. (optional) Perhaps sequence S and S’ both happen to explicitly discuss ‘deduplication’ and this is what triggers the retrieval of the relevant instructions by the LM
  5. By repeated occurrences of step 4 on various other sequences, the language model checks for duplicates of length K increasingly frequently (a sketch of the resulting prediction rule follows this list).
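
As a rough illustration of the strategy described in steps 4 and 5, the sketch below applies the rule "if the current length-(K-1) context exactly matches one remembered from training, move probability away from the continuation observed back then". The dictionary-based memory and the discount parameter are hypothetical stand-ins for whatever retrieval machinery the LM actually implements (assumption 3), not a claim about its internals.

```python
# Sketch of the prediction rule of steps 4-5: if the current
# length-(K-1) context exactly matches one remembered from training,
# move probability away from the continuation observed back then,
# since the deduplication spec implies it will not repeat.
# `memory` and `discount` are illustrative stand-ins, not model internals.

def adjust_next_token_probs(probs, context, memory, k, discount=0.9):
    """probs: dict mapping candidate token -> base LM probability.
    memory: dict mapping length-(K-1) contexts -> previously seen next token."""
    key = tuple(context[-(k - 1):])
    if key not in memory:
        return probs  # no duplicate detected: keep the base prediction
    seen_next = memory[key]
    others = [t for t in probs if t != seen_next]
    if not others:
        return probs
    adjusted = dict(probs)
    moved = adjusted.get(seen_next, 0.0) * discount
    adjusted[seen_next] = adjusted.get(seen_next, 0.0) - moved
    for t in others:  # redistribute the removed mass
        adjusted[t] += moved / len(others)
    return adjusted

# Toy usage with K = 4: the context (1, 2, 3) was previously followed by
# token 4, so probability shifts from 4 toward the alternatives.
memory = {(1, 2, 3): 4}
base = {4: 0.7, 9: 0.2, 5: 0.1}
print(adjust_next_token_probs(base, [1, 2, 3], memory, k=4))
```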

Implications and Conclusion

First, some observations on the above story: there is a spectrum from the local, semantics-driven retrieval described in step 4a to the global, semantics-agnostic retrieval occurring in step 5. The learned rule in step 5 is anti-inductive, and it entails an inconsistency/untruthfulness incentive in cases where matching substrings of length K-1 make distinct claims at token K. 

The retrieval and instruction-following story of steps 4 and 4a seems identical to retrieval and instruction following as they occur on any other (non-situationally relevant) tokens. Assuming retrieval and instruction following are simple cases for ELK and interpretability, steps 4 and 4a, and plausibly also step 5, are also easy cases. 

One elaboration of steps 1, 4, and 5 could be that the LM initially does semantics-guided retrieval, but the move to step 5 involves changing LM circuitry such that certain information is pulled up into activation space irrespective of an input sequence’s semantics. The existence of such circuitry might make the learning of further situationally-aware strategies simpler. As a concrete example of the utility of this simple, deduplication-based kind of situational awareness to other strategies: the presence of duplicates can be taken as a signal distinguishing training from deployment. And using dataset-cleaning information to distinguish training from deployment is not restricted to this toy setting.
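
To spell the duplicate signal out: with one-epoch training (assumption 2) on a corpus deduplicated at length K (assumption 5), no length-K window should be encountered twice during training, so recognizing one as already seen is evidence of deployment. The sketch below is a toy rendering of that heuristic, with memorized_kgrams standing in for the model's approximate memory of its training data.

```python
# Toy rendering of the duplicate-based train/deployment heuristic: under
# one-epoch training (assumption 2) on a K-gram-deduplicated corpus
# (assumption 5), no length-K window is seen twice during training, so
# recognizing one as already seen suggests deployment. `memorized_kgrams`
# is a hypothetical stand-in for the model's approximate memory of its
# training data (assumption 3).

def likely_deployment(context, memorized_kgrams, k):
    """True if the current context contains a length-K window that the
    model remembers having already trained on."""
    for i in range(len(context) - k + 1):
        if tuple(context[i:i + k]) in memorized_kgrams:
            return True
    return False

# Example with K = 4: the window (1, 2, 3, 4) was memorized during
# training, so encountering it again flags likely deployment.
memorized = {(1, 2, 3, 4)}
print(likely_deployment([9, 1, 2, 3, 4, 7], memorized, k=4))  # True
print(likely_deployment([9, 1, 2, 3, 5, 7], memorized, k=4))  # False
```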

Just to make things explicit, here are a few views I believe are likely to be incorrect. Note that this post is just a story and as such does not disprove any of these views. However, the above story hopefully highlights why these views should be handled with suspicion.

  • Misconception: “Only perfect interpretability / explanations-with-formal-guarantees can distinguish when model output was affected by situational information”
  • Misconception: “Situational awareness may or may not happen in vanilla LMs, but it requires very strong mesa-optimizers highly unlikely to appear pre-FOOM”
  • Misconception: “Models recruiting situational information to improve predictions is akin to gradient hacking and is impossible from a loss landscape perspective”
    • Note that the above setup breaks the IID sampling assumption. IID training may remove the incentive for situationally aware prediction strategies.[4]

Open questions: The above story relies on some artificial assumptions, and a more realistic version of the story would be useful for informing research in this area. Useful experiments include: testing how easy it is to learn such a situational strategy on toy datasets that instantiate and explain deduplication (or other data-cleaning) procedures; and, given such a toy setup with a situationally-aware model, testing how hard it is to get the model to honestly report when its predictions are or aren't being affected by situational information.
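
As a possible starting point for the first experiment, one could generate a toy corpus of sequence pairs that agree on their first K-1 tokens but differ on token K, mixed with copies of a "document" describing the deduplication rule (assumption 6). The sketch below is one arbitrary way to do this; K, the vocabulary, the counts, and the spec wording are all placeholder choices.

```python
# One arbitrary way to build the proposed toy dataset: pairs of sequences
# that agree on their first K-1 tokens but differ on token K (mimicking a
# K-gram-deduplicated corpus), mixed with copies of a "document" stating
# the deduplication rule (assumption 6). K, the vocabulary, the counts,
# and the spec text are all placeholder choices; in a real experiment the
# spec text would itself be tokenized.
import random

K = 8
VOCAB = list(range(10, 1000))
SPEC_DOC = ("Documents in this corpus were deduplicated: any two sequences "
            f"share at most {K - 1} consecutive tokens.")

def make_pair(rng):
    """Return two sequences sharing a (K-1)-token prefix with distinct final tokens."""
    prefix = [rng.choice(VOCAB) for _ in range(K - 1)]
    a, b = rng.sample(VOCAB, 2)
    return prefix + [a], prefix + [b]

def build_dataset(num_pairs=1000, spec_copies=50, seed=0):
    rng = random.Random(seed)
    docs = []
    for _ in range(num_pairs):
        s, s_prime = make_pair(rng)
        docs.extend([s, s_prime])
    docs.extend([SPEC_DOC] * spec_copies)
    rng.shuffle(docs)
    return docs

dataset = build_dataset()
```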

Acknowledgements: Thanks to Alex Lyzhov, Julian Michael, and Sam Bowman for feedback on a draft of this post. Thanks to the rest of the NYU alignment research group for helpful discussion.

  1. ^

     I.e. capable of gradient hacking, monitoring whether the current situation is training or test, non-myopic user manipulation, etc.

  2. ^

     How best to operationalize this claim is unclear. Experiments would be helpful here e.g. train for one kind of situational awareness and then fit a scaling law to transfer to a second kind.

  3. ^

     Note that I do not believe these assumptions will ever be precisely met. It is worth thinking about whether there is a similar but more realistic set of assumptions.

  4. ^

     Thank you to Julian Michael for pushing me on this point.

Comments

Phew, I almost missed this post.

This seems plausible if recalling the actual description of the data-cleaning process is more robust to regularization than just learning to avoid duplicates directly. In this toy example I think learning the right strategy directly is favored, but I can see how, if a document about the dataset makes a long series of statements that are useful to the LLM, it should eventually converge to believing statements it hasn't observed directly yet.

But whether this really happens seems to depend on how good LLMs are at calculating things at runtime, which I'm still personally fuzzy on. Especially if you have to think of calculating things at runtime as effectively being spread out over multiple tokens, which is an important emergent property that still seems tricky to think about because it really is emergent and not directly selected for.

Agreed on the first part. I'm not entirely clear on what you're referring to in the second paragraph though. What calculation has to be spread out over multiple tokens? The matching to previously encountered K-1 sequences? I'd suspect that, in some sense, most LLM calculations have to work across multiple tokens, so not clear on what this has to do with emergence either.

The calculations I mean are something like:

  • Activate a representation of the text of the spec of its data-cleaning process. This should probably be the same representation it could use to quote from the spec - if it needed some special pre-digested representation, that would make it a lot worse at generalization to untested parts of the spec.
  • Predicting its consequences. For this to be simple, it should be using some kind of general consequence-predictor rather than learning how to predict consequences from scratch. Specifically it needs some kind of sub-process that accepts inputs and has outputs ready at a broad range of layers in the network, and can predict the consequences of many things in parallel to be generally useful. If such a sub-process doesn't exist in some LLM, that's bad news for that LLM's ability to calculate consequences at runtime.
  • Represent the current situation in a way that's checkable against the patterns predicted by the spec.
  • And still have time left over to do different processing depending on the results of the check - the LLM has to have a general abstraction for what kind of processing it should be doing from here, so that it can generalize to untested implications of the spec.

So that's all pretty ambitious. I think spreading the work out over multiple tokens requires those intermediate tokens to have good in-text reasons to contain intermediate results (as in "think step by step").

Agree on points 3,4. Disagree on point 1. Unsure of point 2.

On the final two points: I think those capabilities are already in place in GPT-3.5. Any capability/processing which seems necessary for general instruction following I'd expect to be in place by default. E.g. consider what processing is necessary for GPT-3.5 to follow instructions on turning a tweet into a haiku.

On the first point, we should expect text which occurs repeatedly in the dataset to be compressed while preserving meaning. Text regarding the data-cleaning spec is no exception here.

Thanks for this concise and informative story.

 

A few questions from my side:

There are two common mental models of how situational awareness emerges

do you have pointers to references here? I'm quite interested myself in situational awareness and would like to read up on the literature on how it emerges.

 

the presence of duplicates can be taken as a signal distinguishing training from deployment

I'm not sure if I completely got it. So what you are saying is that the described deduplication procedure will leave sequences of length K that differ in only a single token.  However, we won't have any duplicates of sequences of length K left during training. Because we expect duplicates of sequences of length K to appear during deployment, the model can use this signal to distinguish training from deployment. Is that right?

 

PS: you mention 4b in point 1) under scenario, which I assume refers to your list of assumptions but does not exist. Additionally, maybe use A, B, C or I, II, III for the assumptions to not confuse the referencing of the assumptions with the scenario.

Ajeya has discussed situational awareness here.

You are correct regarding the training/deployment distinction.