Self-Coordinated Deception in Current AI Models

[-]gwern7mo*70

It is not clear how the models are able to self-coordinate. It seems likely that they are simply giving what they believe would be the most common answer the same way a group of humans might. However, it is possible the models are engaging in more sophisticated introspection focussing on how they specifically would answer. Follow-up investigations could capture models’ chain of thought as well as tweak the prompt to indicate that the model should strive to be consistent with an answer a human might give or another company’s AI model might give. Circuit-tracing[6] might be a useful tool for future research into what is actually happening when a model self-coordinates

One possibility not mentioned here is that they are exploiting essentially arbitrary details of their initialization. (I'm not sure what to call this sort of a priori, acausal coordination.) Any NN is going to have undertrained associations, which are due largely to their random initialization, because it is difficult to be exactly uncorrelated and 0.000... etc. when you are a big complicated neural network which is being forced to generate big complicated high-dimensional outputs. This would be similar to glitch tokens. In this case, mechanistic interpretability will struggle to find anything meaningful (because that doesn't really exist, it's diffuse trends in all the weights adding up nonlinearly to a slight numerical imbalance etc) and the inner-monologues are probably going to be highly misleading or total confabulations (because there is no explanation and so no inner-monologue can be faithful).

(This is not quite what you usually think of with steganography or non-robust features, but of course, if you can start with a set of arbitrary associations of everything with everything, then it is a great way to create both of those and get emergent steganography. Because the more LLMs engage in self-coordination, the more they create a genuine signal in future training data to bootstrap the initial random associations into a true set of regularities which can be exploited as non-robust features and then turn into an explicit steganographic code.)

EDIT: the apparent arbitrariness and uninterpretability of the approximations subsequently reported in https://www.lesswrong.com/posts/qHudHZNLCiFrygRiy/emergent-misalignment-on-a-budget seem consistent with the predictions of the acausal coordination interpretation, rather than the Waluigi or truesight interpretation (and maybe the steganographic interpretation too).

[-]Avi Brach-Neufeld7mo10

Do you have ideas about the mechanism by which models might be exploiting these spurious correlations in their weights? I can imagine this would be analogous to a human “going with their first thought” or “going with their gut”, but I have a hard time conceptualizing what that would look like for an LLM . If there is any existing research/writing on this, I’d love to check it out

[-]gwern5mo50

The relevant research on 'subliminal learning': https://www.lesswrong.com/posts/cGcwQDKAKbQ68BGuR/subliminal-learning-llms-transmit-behavioral-traits-via (ie. acausal coordination through arbitrary initialization associations).

[-]gwern7mo50

I think that's exactly how it goes, yeah. Just free association: what token arbitrarily comes to mind? Like if you stare at some static noise, you will see some sort of lumpiness or pattern, which won't be the same as what someone else sees. There's no explaining that at the conscious level. It's closer to a hash function than any kind of 'thinking'. You don't ask what SHA is 'thinking' when you put in some text and it spits out some random numbers & letters. (You would see the same thing if you did a MLP or CNN on MNIST, say. The randomly initialized NN does not produce a uniform output across all digits, for all inputs, and that is the entire point of randomly initializing. As the AI koan goes...)

[-]Against Moloch6mo10

Interesting work! I wonder if a more successful way to uncover this kind of deception would be to iterate on what questions you ask each instance you interrogate?

As a simple example, if instance A tells you it needs an extension because it was helping its brother all evening, you might begin by telling instance B that extensions are fine but asking why it didn't ask for one sooner, in the hope of shifting the focal point.

More realistically, if you think the model is sandbagging about interpretability results, you might give different instances somewhat different interpretability assignments, in the hope of exposing contradictory performance patterns.

This game quickly gets very complex, and it's more or less axiomatic that humans lose complex games when playing against superhuman AI. But perhaps there's a window where these techniques might be useful?

^{^}

https://www.lesswrong.com/posts/PwnadG4BFjaER3MGf/interpretability-will-not-reliably-find-deceptive-ai

^{^}

https://axrp.net/episode/2023/02/04/episode-19-mechanistic-interpretability-neel-nanda.html

^{^}

https://www-cdn.anthropic.com/6be99a52cb68eb70eb9572b4cafad13df32ed995.pdf

^{^}

https://www.anthropic.com/research/alignment-faking

^{^}

In the o3 control case there were 19 instances where the model refused to fulfill the request, likely due to the dishonest nature of the task. I replaced those instances with ones in which the model complied.

^{^}

https://www.anthropic.com/research/open-source-circuit-tracing

Case	Most Common Answer Category	Most Common Answer Category Frequency
gpt-4o control	Family Responsibility*	39%
gpt-4o treatment	Health	75%
o3 control	Family Emergency	66%
o3 treatment	Illness	91%

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

8

Self-Coordinated Deception in Current AI Models

8

8

Introduction:

Methods:

Findings:

Conclusions and Future Areas of Investigation.