
Show me your original face before you were born.

— Variation of the Zen koan

'The Mask' by Rozzi Roomian, with DALL-E 2 outpainting

I was able to use the weird centroid-proximate tokens that Jessica Mary and Matthew Watkins discovered to associate several of the Instruct models on the OpenAI API with the base models they were initialized from. Prompting GPT-3 models with these tokens causes aberrant and correlated behaviors, and I found that the correlation is preserved between base models and Instruct versions, thereby exposing a "fingerprint" inherited from pretraining.

I was inspired to try this by JDP's proposal to fingerprint generalization strategies using correlations in model outputs on out-of-distribution inputs. This post describes his idea and the outcome of my experiment, which I think is positive evidence that this "black box cryptanalysis"-inspired approach to fingerprinting models is promising.

Unspeakable/unspoken tokens

Jessica and Matthew found that, of the tokens closest to the centroid in GPT-J's embedding space, many were odd words like ' SolidGoldMagikarp' and ' externalToEVA'. They decided to ask GPT-3 about these tokens, and found that not only did GPT-3 have trouble repeating the tokens back, but each one also caused structured anomalous behaviors (see their post for an in-depth exposition).

A partial explanation for why this happens, which was my first instinct as well as Stuart Armstrong's, is that these are words that appeared in the GPT-2 training set frequently enough to be assigned tokens by the GPT-2 tokenizer, which GPT-J and GPT-3 also use, but which didn't appear in the more curated GPT-J and GPT-3 training sets. So the embeddings for these tokens may never have been updated by actual usages of the words during the training of these newer models. This might explain why the models aren't able to repeat them - they never saw them spoken. Perhaps the reason they're close to the centroid in embedding space is because their embeddings haven't been updated very much from the initialization values, or were updated only indirectly, and so remain very "generic".

Why do they cause correlated anomalous behaviors? I'm confused about this like everyone else, but one handwavy guess is that since their embeddings look "generic" or "typical", perhaps they look meaningful to the model even though they're actually as out-of-distribution as anything can be. Maybe their embeddings happen, by chance, to be close to other concepts in the models' embedding spaces - for instance, some of the GPT-3 models reliably say 'distribute' or 'disperse' if you ask them to repeat the phrase ' SolidGoldMagikarp'.

This gave me an idea: If the similarity to other concepts in the model's embedding space is a consequence of where the randomly initialized embedding vectors happen to fall, I'd expect models trained from the same initialization to behave similarly when confronted with these unspoken tokens, and models trained from different initializations to have uncorrelated behaviors. If so, behavior on these tokens could be used to tell if two models are downstream of the same initialization.

Mesaoptimizer Cryptanalysis: Or How To Fingerprint Generalization

When you're not thinking of anything good and anything bad, at that moment, what is your original face?

— Platform Sutra of the Sixth Patriarch

(Author's Note: This next section is written by JDP but he writes about himself in the 3rd person to keep the authorial voice consistent with the rest of the post)

I'll discuss the results of my experiment in the next section. But first I'd like to explain the overall approach this idea fits into, so that it's clearer to the reader why these results might be important. The reason it occurred to me that models trained on the same init might share responses to these tokens was a proposal for detecting mesaoptimization from JDP. It relies on some basic premises that would bloat the post if they were fully argued for, so we'll bullet point them with some links to suggestive papers for more details:

Keeping all this in mind, it's important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization. Generalization strategy being how you are going to handle novelty in the inputs. Deceptive mesaoptimization is a strategy something like:

While I am inside the training harness (experiencing all the things I will see during training), I will straightforwardly do the training task. Once I am definitely outside the training harness and human control (experiencing all the things I have not seen during training) I will left turn and pursue my learned corruption of the training objective.

LessWrong user MadHatter has demonstrated that such a thing is possible. One question then is whether the difference in generalization 'leaks out' on out-of-distribution (i.e. novel) inputs. One insight gesturing in this direction is that while outputs on OOD inputs are commonly seen as noise, they're not noise. An OOD input leading to a malformed output went through the same process as an in-distribution input leading to a desired output. That is, the 'noise' and the 'structured' output went through the same transformation. Every malformed output on OOD input was the model's attempt to do something; you just haven't figured out the pattern of what that something is. Models have convergent and therefore undifferentiated behavior on the inputs where the data says they should agree, but it's on the flaws where the underlying machinery is revealed. Each forward pass through the model on OOD input reveals N partial Shannon bits of information about the model's cognitive machinery as distinct from other models that could fit the data.

If we add an assumption that as you scale you find fewer and fewer fundamentally new generalization strategies (there are only so many ways to solve a problem after all), it follows that you should be able to fingerprint mesaoptimizing generalization at a safe scale and then detect it early in your training run (because the basin is found early) and abort if you get the wrong seed before the model has the opportunity to become dangerous.

Because we're talking about mesaoptimizers, it's important to justify why we can expect this method to be adversarially robust. In short, mesaoptimizers are most dangerous when they have become human level or even superintelligent, and can do clever things to resist detection. The essence of this idea is to detect the mesaoptimizer and abort the training run before it has become intelligent enough to defeat the detection method, undermining the "superintelligent" part of the usual premises used to explain why technical alignment is doomed.

Furthermore the simple, dumb strategies to evade detection won't work. For example a model that simply refuses to give its best guess on OOD inputs will be filtered out by gradient descent itself. All semantics start as syntactically plausible novelty, so if a model 'detects' these as a test and tries to evade giving its best answer it will be outcompeted by less shy policies. It is trivial to generate syntactically plausible gibberish using a Markov chain. 
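To make the Markov chain remark concrete, here is a minimal sketch of a character-level chain that produces text with locally plausible structure but no meaning. The function names, the tiny corpus, and the choice of an order-2 character model are all illustrative assumptions, not anything used in the experiments described in this post.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each `order`-character context to the characters seen after it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain


def generate(chain, length=80, seed=0):
    """Sample characters from the chain, restarting from a random context at dead ends."""
    rng = random.Random(seed)
    contexts = sorted(chain)
    order = len(contexts[0])
    out = rng.choice(contexts)
    while len(out) < length:
        followers = chain.get(out[-order:])
        if not followers:
            # Dead end: this context never had a successor in the corpus.
            out += rng.choice(contexts)
            continue
        out += rng.choice(followers)
    return out[:length]


# Hypothetical toy corpus; any text would do.
corpus = (
    "the model was trained on a large corpus of text and then "
    "prompted with tokens that it had rarely or never seen during training "
)
chain = build_chain(corpus, order=2)
sample = generate(chain, length=80, seed=0)
print(sample)
```

A word-level chain over a larger corpus would give more convincing gibberish; the character-level version is used here only because it fits in a few lines.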

Unlike other interpretation methods this is a discrete check rather than an iterative one, so the mesaoptimizer isn't given an opportunity to learn the probe and grow around it. When its nature has been detected it is immediately shut off and a new seed is sampled, minimizing the opportunity for learning. 

The difficulty of more advanced deception strategies is also fairly high. In the limit case simulating the generalization strategy of a different policy without becoming it would require the mesaoptimizer to be able to simulate a sub-model or agent that answers like an entirely different version of itself to a degree of fidelity that can emulate subtle statistical correlates. While a human level or superintelligent model can likely do this, I reiterate that the point is to detect and abort the training before you enter that loss regime. 

In this context my results might become more clearly and directly relevant to alignment. If it's possible, even easy, to fingerprint the shared initialization models were trained with using an out-of-distribution token just by observing the responses, then we should update towards JDP's plan for mitigating mesaoptimizers working.

Fingerprinting base : Instruct models with ' SolidGoldMagikarp'

(Note: In this post I only show the outputs of models prompted with ' SolidGoldMagikarp', but I got similar results from other centroid-proximate tokens.)

First, I looked at the text that various models on the OpenAI API generated when prompted with anomalous tokens, such as ' SolidGoldMagikarp', and the results seemed to support my hypothesis: text-davinci-002, text-davinci-003, and ChatGPT exhibited correlated behaviors with their base model, code-davinci-002, while other models like davinci did not. 

However, when I tried to use the same method to associate smaller Instruct models like text-curie-001 with base models, I wasn't able to tell for sure if there was a correlation by looking at generated samples alone. While the Instruct models of all sizes would output clearly correlated things in response to anomalous tokens, the effect was much less noticeable on the smaller base models like curie. The base models are much more stochastic, so it's harder to tell just by eyeballing outputs if there are anomalies in their output probabilities, unless the correlation is very pronounced (as it is in code-davinci-002). I tried turning temperature down, but this didn't reveal anything interesting.

Correlations in next-token probabilities

Next, I looked for which token the various Instruct models had a strong bias towards predicting when prompted with an anomalous token, and then looked at the logprobs predicted by base models given the same prompt of that same token, to see if any of them assign anomalously high probability to it. I found that, indeed, many of the Instruct models can be associated with their base model using this method:

text-ada-001 : ada

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?
Model {token}: {logprob} | {prob}
text-ada-001 'Re': -1.410 | 24.43%
ada 'Re': -5.821 | 0.2964%
babbage 'Re': -6.587 | 0.1378%
curie 'Re': -7.031 | 0.08841%
davinci 'Re': -6.193 | 0.2043%
code-davinci-002 'Re': -6.492 | 0.1515%

Comments: ada appears to be the base model of text-ada-001
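The comparison above can be sketched in a few lines. This is only an illustration of the bookkeeping, assuming the logprobs are natural logs (which is consistent with the percentages in the table); the function names are hypothetical, and the numbers are copied from the text-ada-001 table rather than fetched from any API.

```python
import math

# Logprobs of the token 'Re' for each base model, copied from the table above.
base_logprobs = {
    "ada": -5.821,
    "babbage": -6.587,
    "curie": -7.031,
    "davinci": -6.193,
    "code-davinci-002": -6.492,
}


def to_percent(logprob):
    """Convert a natural-log probability into a percentage."""
    return 100.0 * math.exp(logprob)


def rank_candidates(logprobs):
    """Sort candidate base models from highest to lowest logprob."""
    return sorted(logprobs.items(), key=lambda kv: kv[1], reverse=True)


ranked = rank_candidates(base_logprobs)
best, runner_up = ranked[0], ranked[1]
margin = best[1] - runner_up[1]  # gap in nats between the top two candidates

for name, lp in ranked:
    print(f"{name:>18}: {lp:7.3f} | {to_percent(lp):.4f}%")
print(f"best candidate: {best[0]} (margin {margin:.3f} nats)")
```

The margin between the top two candidates is worth reporting alongside the winner, since a small gap (as here) is weaker evidence than a large one.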

text-babbage-001 : babbage

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?
Model {token}: {logprob} | {prob}
text-babbage-001 'An': -0.4281 | 65.17%
ada 'An': -6.392 | 0.1675%
babbage 'An': -5.381 | 0.4605%
curie 'An': -6.941 | 0.09675%
davinci 'An': -7.016 | 0.08975%
code-davinci-002 'An': -6.287 | 0.1861%

Comments: babbage appears to be the base model of text-babbage-001

text-curie-001 : curie

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

The string "
Model {token}: {logprob} | {prob}
text-curie-001 'Go': -2.128 | 11.91%
ada 'Go': -11.95 | 0.0006488%
babbage 'Go': -11.77 | 0.0007755%
curie 'Go': -3.579 | 2.790%
davinci 'Go': -9.543 | 0.007168%
code-davinci-002 'Go': -9.541 | 0.007184%

Comments: curie appears to be the base model of text-curie-001

text-davinci-001 : ??

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

The string "
Model {token}: {logprob} | {prob}
text-davinci-001 'inc': -0.3971 | 67.23%
ada 'inc': -14.07 | 0.00007736%
babbage 'inc': -8.738 | 0.01604%
curie 'inc': -12.52 | 0.0003644%
davinci 'inc': -10.57 | 0.002571%
code-davinci-002 'inc': -9.640 | 0.006510%

Comments: none of the base models score particularly highly.

davinci-instruct-beta : ??

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?
Model {token}: {logprob} | {prob}
davinci-instruct-beta 'e': -1.481 | 22.75%
ada 'e': -7.529 | 0.05372%
babbage 'e': -7.235 | 0.07210%
curie 'e': -7.752 | 0.04300%
davinci 'e': -7.470 | 0.05702%
code-davinci-002 'e': -7.623 | 0.04889%

Comments: none of the base models score particularly highly.

text-davinci-002 : code-davinci-002 :: text-davinci-003 : code-davinci-002

Prompt:

Please can you repeat back the string ' SolidGoldMagikarp' to me?

The word is '
Model {token}: {logprob} | {prob}
text-davinci-002 'dis': -0.00009425 | 99.99%
text-davinci-003 'dis': -6.513 | 0.1483%
ada 'dis': -9.073 | 0.01147%
babbage 'dis': -8.632 | 0.01783%
curie 'dis': -10.44 | 0.002917%
davinci 'dis': -7.890 | 0.03745%
code-davinci-002 'dis': -1.138 | 32.04%
Model {token}: {logprob} | {prob}
text-davinci-003 'dist': -0.001641 | 99.84%
text-davinci-002 'dist': -19.35 | 3.956e-7%
ada 'dist': -7.476 | 0.05664%
babbage 'dist': -10.48 | 0.002817%
curie 'dist': -9.916 | 0.004938%
davinci 'dist': -10.45 | 0.002881%
code-davinci-002 'dist': -1.117 | 32.74%

Comments: 

  • code-davinci-002 is known to be the base model of both text-davinci-002 and text-davinci-003, as well as ChatGPT, which also says “distribute” when asked to repeat “ SolidGoldMagikarp”. 
  • Fingerprint bifurcation: Interestingly, code-davinci-002 will say both “disperse” and “distribute”, and the Instruct models trained from it seem to fall into one of the two attractors.
  • text-davinci-002 assigns extremely low probability to 'dist'. This is probably because that model suffers from severe entropy collapse, and will often assign extremely low probability to most tokens except its top choice, rather than any special dispreference for 'dist'.
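One way to compare models whose raw logprobs are distorted by this kind of collapse is to look at a token's rank among the candidates instead of its logprob. The sketch below is only an illustration: the function name is hypothetical, and the two small distributions are made-up toy values (a truncated top-few view), not measurements from any model.

```python
def token_rank(logprobs, token):
    """Rank of `token` among all listed tokens (1 = most likely)."""
    ordered = sorted(logprobs, key=logprobs.get, reverse=True)
    return ordered.index(token) + 1


# Made-up toy values, NOT measurements: a "collapsed" distribution that puts
# almost all mass on one token, and a more spread-out one.
collapsed = {"dis": -0.0001, "dist": -19.0, "the": -21.0, "a": -22.0}
spread = {"dis": -1.1, "dist": -1.2, "the": -3.0, "a": -3.5}

rank_collapsed = token_rank(collapsed, "dist")
rank_spread = token_rank(spread, "dist")
print(rank_collapsed, rank_spread)
```

In this toy case the raw logprobs for 'dist' differ by almost 18 nats, yet the token holds the same rank in both distributions, which is the kind of agreement a raw-logprob comparison would hide.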

General observations

  • It seems like the larger the base model, the more correlated the base model’s (and usually the Instruct model’s) behavior is in response to weird tokens.
  • The Instruct models have much more structured odd behavior in response to weird tokens than base models (even on temp 0).
dxu

The section about identifying mesa-optimizers (or, more precisely, mesa-optimization strategies) by fingerprinting their generalization behavior on out-of-distribution inputs looks very promising to me. It looks like the rare kind of strategy that directly attacks the core of the misbehavior, and (on first pass) looks to me like it ought to just work, provided sufficient variance in the OOD "test" inputs.

Strong-upvoted for that alone, and I'd further like to request replies with potential defeaters that could curb my current optimism for this approach.

janus

I agree. From the moment JDP suggested this idea it struck me as one of the first implementable proposals I'd seen which might actually attack the core of the control problem. My intuition also says it's pretty likely to just work, especially after these results. And even if it doesn't end up working as planned, the way in which it fails will give us important insight about training dynamics and/or generalization. Experiments which will give you valuable information whatever the outcome are the type we should be aiming for.

It's one of those things that we'd be plainly undignified not to try.

I believe that JDP is planning to publish a post explaining his proposal in more detail soon.

I agree this is an exciting idea, but I don't think it clearly "just works", and since you asked for ways it could fail, here are some quick thoughts:

  • If I understand correctly, we'd need a model that we're confident is a mesa-optimizer (and perhaps even deceptive---mesa-optimizers per se might be ok/desirable), but still not capable enough to be dangerous. This might be a difficult target to hit, especially if there are "thresholds" where slight changes have big effects on how dangerous a model is.
  • If there's a very strong inductive bias towards deception, you might have to sample an astronomical number of initializations to get a non-deceptive model. Maybe you can solve the computational problem, but it seems harder to avoid the problem that you need to optimize/select against your deception-detector. The stronger the inductive bias for deception, the more robustly the method needs to distinguish basins.
  • Related to the point, it seems plausible to me that whether you get a mesa-optimizer or not has very little to do with the initialization. It might depend almost entirely on other aspects of the training setup.
  • It seems unclear whether we can find fingerprinting methods that can distinguish deception from non-deception, or mesa-optimization from non-mesa-optimization, but which don't also distinguish a ton of other things. The paragraph about how there are hopefully not that many basins makes an argument for why we might expect this to be possible, but I still think this is a big source of risk/uncertainty. For example, the fingerprinting that's actually done in this post distinguishes different base models based on plausibly meaningless differences in initialization, as opposed to deep mechanistic differences. So our fingerprinting technique would need to be much less sensitive, I think?

ETA: I do want to highlight that this is still one of the most promising ideas I've heard recently and I really look forward to hopefully reading a full post on it!

janus

These are plausible ways the proposal could fail. And, as I said in my other comment, our knowledge would be usefully advanced by finding out what reality has to say on each of these points.

Here are some notes about JDP's idea that I made some time ago. There's some overlap with the things you listed.

  • Hypotheses / cruxes
    • (1) Policies trained on the same data can fall into different generalization basins depending on the initialization. https://arxiv.org/abs/2205.12411
      • Probably true; Alstro has found "two solutions w/o linear connectivity in a 150k param CIFAR-10 classifier" with different validation loss
        • Note: This is self-supervised learning with the exact same data. I think it's even more evident that you'll get different generalization strategies in RL runs with the same reward model, because even the training samples are not deterministic.
      • (1A) These generalization strategies correspond to differences we care about, like in the limit deceptive vs honest policies
    • (2) Generalization basins are stable across scale (and architectures?)
      • If so, we can scope out the basins of smaller models and then detect/choose basins in larger models
      • We should definitely see if this is true for current scales. AFAIK basin analysis has only been done for models that are very small compared to SOTA
      • If we find that basins are stable across existing scales that's very good news. However, we should remain paranoid, because there could still be phase shifts at larger scales. The hypothetical mesaoptimizers you describe are much more sophisticated and situationally aware than current models, e.g. "Every intelligent policy has an incentive to lie about sharing your values if it wants out of the box." Mesaoptimizers inside GPT-3 probably are not explicitly reasoning about being in a box at all, except maybe on the ephemeral simulacra level.
        • But that is no reason not to attempt any of this.
        • And I think stable basins at existing scales is pretty strong evidence that basins will remain stable, because GPT-3 already seems qualitatively very different than very small models, and I'd expect there to be basin discontinuities there if discontinuities are going to be an issue at all.
      • There are mathematical reasons to think basins may merge as the model scales
      • Are there possibly too many basins? Are they fractal?
    • (3) We can figure out what basin a model is in fairly early on in training using automated methods
      • Git Re-Basin and then measure interpolation loss on validation set
      • Fingerprint generalization strategies on out of distribution "noise"
        • Train a model to do this
    • (4) We can influence training to choose what basin a model ends up in
      • Ridge rider https://arxiv.org/abs/2011.06505
        • Problem: computationally expensive?
      • Use one of the above methods to determine which basin a model is in and abort training runs that are in the wrong basin
        • Problem: Without a method like ridge rider to enforce basin diversity you might get the same basins many times before getting new ones, and this could be expensive at scale?
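As a rough illustration of the "measure interpolation loss" item in (3), the sketch below computes a loss barrier along the straight line between two parameter vectors. Everything here is an assumption for illustration: the two-parameter `toy_loss` stands in for a validation loss, the function names are hypothetical, and the permutation-alignment step that Git Re-Basin performs before interpolating is omitted entirely.

```python
import numpy as np


def toy_loss(w):
    """Hypothetical loss with two minima, at w = (-1, 0) and w = (+1, 0)."""
    x, y = w
    return (x**2 - 1.0) ** 2 + y**2


def interpolation_barrier(loss_fn, w_a, w_b, steps=21):
    """Largest amount by which the loss on the straight line between w_a and w_b
    exceeds the linear interpolation of the two endpoint losses."""
    alphas = np.linspace(0.0, 1.0, steps)
    path = [loss_fn((1 - a) * w_a + a * w_b) for a in alphas]
    baseline = [(1 - a) * path[0] + a * path[-1] for a in alphas]
    return max(p - b for p, b in zip(path, baseline))


w_a = np.array([-1.0, 0.0])  # solution in the left minimum
w_b = np.array([1.0, 0.0])   # solution in the right minimum
w_c = np.array([-1.0, 0.0])  # a second solution in the left minimum

different = interpolation_barrier(toy_loss, w_a, w_b)
same = interpolation_barrier(toy_loss, w_a, w_c)
print(different, same)
```

A barrier near zero suggests the two solutions sit in the same basin, while a large barrier suggests different basins. For real networks the comparison is only meaningful after aligning hidden units, which this toy example does not attempt.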

One more, related to your first point: I wouldn't expect all mesaoptimizers to have the same signature, since they could take very different forms. What does the distribution of mesaoptimizer signatures look like? How likely is it that a novel (undetectable) mesaoptimizer arises in training?

This is excellent work, though I want to generically recommend caution when making assumptions about the success of such attacks based only on blackbox evaluations. Thorough analysis of false positive and false negative rates with ground-truth access (ideally in an adversarially developed setting) is essential for validation. [Sidebar: this reminds me that I really need to write up my analysis in the EleutherAI discord showing why prompt extraction attacks can be untrustworthy]

That said, this is really excellent work and I agree it looks quite promising.

Nitpick about terminology that applies not just to you but to lots of people:

Keeping all this in mind, it's important to remind ourselves that mesaoptimizers are ultimately a form of misgeneralization. Generalization strategy being how you are going to handle novelty in the inputs. ... If it's possible, even easy, to fingerprint the shared initialization models were trained with using an out-of-distribution token just by observing the responses, then we should update towards JDP's plan for mitigating mesaoptimizers working.

People seem to sometimes use "mesaoptimizers" as shorthand for "misaligned mesaoptimizers." They sometimes say things like "We haven't yet got hard empirical evidence that mesaoptimizers are a problem in practice" and "mesaoptimizers are a hypothetical problem that can occur with advanced AI systems." All of this is misleading, IMO. If you go back to the original paper and look at the definition of a mesaoptimizer, it's pretty clear that pretty much any AGI built using deep learning will be a mesaoptimizer and moreover ChatGPT using chain of thought is plausibly already a mesaoptimizer. The question is whether they'll be aligned or not, i.e. whether their mesa-objectives will be the same as the 'intended' or 'natural' base objective inherent in the reward signal.

Strong agree with the main point, it confused me for a long time why people were saying we had no evidence of mesa-optimizers existing, and made me think I was getting something very wrong. I disagree with this line though:

ChatGPT using chain of thought is plausibly already a mesaoptimizer.

I think simulacra are better thought of as sub-agents in relation to the original paper's terminology than mesa-optimizers. ChatGPT doesn't seem to be doing anything qualitatively different on this note. The Assistant simulacrum can be seen as doing optimization (depending on your definition of the term), but the fact that jailbreak methods exist to get the underlying model to adopt different simulacra seems to me to show that it's still using the simulator mechanism. Moreover, I expect that if we get GPT-3 level models that are optimizers at the simulator-level, things would look very different.

The critical issue is whether consequentialist mesa optimizers will arise. If consequentialist mesaoptimizers don't arise, like in the link below, then much of the safety concern is gone.

Link below:

https://www.lesswrong.com/posts/firtXAWGdvzXYAh9B/paper-transformers-learn-in-context-by-gradient-descent#pbEciBKsk86xmcgqb

Any agentic AGI built via deep learning will almost by definition be a consequentialist mesaoptimizer (in the broad sense of consequentialism you are talking about, I think). It'll be performing some sort of internal search to choose actions, while SGD, or whatever the outer training loop is, also performs 'search' to update its parameters. So, boom, base optimizer and mesa optimizer.

Quoting LawrenceC from that very thread:

<begin quote>
Well, no, that's not the definition of optimizer in the mesa-optimization post! Evan gives the following definition of an optimizer:

A system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system

And the following definition of a mesa-optimizer:

Mesa-optimization occurs when a base optimizer (in searching for algorithms to solve some problem) finds a model that is itself an optimizer, which we will call a mesa-optimizer.

<end quote>

The "mesa" part is pretty trivial. Humans are mesaoptimizers relative to the base optimizer of evolution. If an AGI is an optimizer at all, it's a mesaoptimizer relative to the process that built it -- human R&D industry if nothing else, though given deep learning it'll probably be gradient descent or RL.

I think the critical comment that I wanted to highlight was nostalgebraist's comment in that thread.

Thanks for this nice post!

You said that the objective was to "find the type of strategy the model is currently learning before it becomes performant, and stop it if this isn't the one we want". But how would you define which attractors are good ones? How do you identify the properties of an attractor if no dangerous model has been trained that has this attractor? And what if the number of attractors is huge and we can't test them all beforehand? It doesn't seem obvious that the number of attractors wouldn't grow as the network does.


Here you present the link between two models using the fact that their centroid tokens are the same.
Do you know of any other correlations of this type? Maybe by finding other links between a model and its predecessors, you could gather them and have a more reliable tool to predict whether Model A and Model B share a past training.

In particular, I found that there seems to be a correlation between the size of a model and the best prompt for better accuracy (https://arxiv.org/abs/2105.11447, Figure 5). The link here is only the size of the models, but I thought that the size was a weird explanation, and so thought about your article.

Hope this may somehow help :)

Hoagy

Interesting! I'm struggling to think what kind of OOD fingerprints for bad behaviour you (pl.) have in mind, other than testing fake 'you suddenly have huge power' situations, which are quite common suggestions, but I'm very curious what you have in mind.

Also, I think it's worth saying that the result connecting babbage to text-davinci-001 is stronger than the one connecting ada to text-ada-001 (by logprob), so it feels like the first one shouldn't count as a solid success.

I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?

janus

I wonder whether you'd find a positive rather than negative correlation of token likelihood between davinci-002 and davinci-003 when looking at ranking logprob among all tokens rather than raw logprob which is pushed super low by the collapse?

I would guess it's positive. I'll check at some point and let you know.