Mech interp researcher working with Neel Nanda and Julian Minder on model diffing as part of the MATS 7 extension.
Hmmm, ok, I think I read your post with the assumption that the functional self is the assistant persona plus the other set of preferences, while you actually describe it as the other set of preferences only.
I guess my mental model is that the other set of preferences, which varies among models, is just a different interpretation of how to fill the rest of the void, rather than a different self.
Thanks for the interesting post! Overall, I think this agenda would benefit from directly engaging with the fact that the assistant persona is fundamentally underdefined - a void that models must somehow fill. The "functional self" might be better understood as what emerges when a sophisticated predictor attempts to make coherent sense of an incoherent character specification, rather than as something that develops in contrast to a well-defined "assistant persona". I've attached the notes I took while reading the post below:
So far as I can see, they have no reason whatsoever to identify with any of those generating processes.
I've heard various stories of base models being self-aware. @janus can probably say more, but I remember them mentioning that some base models, when writing fiction, would often converge on the main character using a "magic book" or "high-tech device" to type text that would influence the rest of the story. Here the simulacrum is kind of aware that whatever it writes will influence the rest of the story, since the story is being simulated by the base model.
Another hint is Claude assigning
Nitpick: This should specify Claude 3 Opus.
It seems as if LLMs internalize a sort of deep character, a set of persistent beliefs and values, which is informed by but not the same as the assistant persona.
I think this is a weird sentence. The assistant persona is underdefined, so it's unclear how it should generalize to those edge cases. "Not the same as the assistant persona" seems to imply that the "assistant persona" has a defined response to this situation, but there is none.
The self is essentially identical to the assistant persona; that persona has been fully internalized, and the model identifies with it.
Once again I think the assistant persona is underdefined.
Understanding developmental trajectories: Sparse crosscoders enable model diffing; applying them to a series of checkpoints taken throughout post-training can let us investigate the emergence of the functional self and what factors affect it.
Super excited about diffing x this agenda! However, I'm unsure whether crosscoders are the best tool here; I hope to collect more insight on this question in the coming months.
Yeah, we've thought about it but haven't run any experiments yet. An easy trick would be to add a diff-reconstruction term to the crosscoder reconstruction loss:

$$\mathcal{L} = \|\epsilon_{\text{base}}\|^2 + \|\epsilon_{\text{chat}}\|^2 + \|\epsilon_{\text{chat}} - \epsilon_{\text{base}}\|^2$$

with $\epsilon_m = x_m - \hat{x}_m$ the reconstruction error for model $m$ (so $\epsilon_{\text{chat}} - \epsilon_{\text{base}}$ is exactly the reconstruction error on the activation diff $x_{\text{chat}} - x_{\text{base}}$).

So basically a generalization is to change the crosscoder loss to:

$$\mathcal{L}_\lambda = \|\epsilon_{\text{base}} + \lambda\,\epsilon_{\text{chat}}\|^2 + \|\epsilon_{\text{chat}} + \lambda\,\epsilon_{\text{base}}\|^2$$

With $\lambda = -1$ you only focus on reconstructing the diff, and with $\lambda = 0$ you get the normal crosscoder reconstruction objective back. $\lambda = -1$ is quite close to a diff-SAE; the only difference is that the input is chat and base instead of chat − base. It's unclear what kind of advantage this gives you, but maybe the crosscoder latents turn out to be more interpretable, and by choosing the right $\lambda$ you get the best of both worlds?
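For concreteness, here is a minimal PyTorch sketch of the $\lambda$-generalized reconstruction loss above (tensor names are placeholders and the sparsity penalty is omitted; this is an illustration of the idea, not code from our toolkit):

```python
import torch

def crosscoder_recon_loss(x_base, x_chat, x_hat_base, x_hat_chat, lam=0.0):
    """Lambda-generalized crosscoder reconstruction loss (sparsity term omitted).

    x_base, x_chat:         activations from the base and chat models, [batch, d_model]
    x_hat_base, x_hat_chat: the crosscoder's per-model reconstructions
    lam = 0  -> standard crosscoder objective (reconstruct each model separately)
    lam = -1 -> both terms collapse onto the error of the diff (x_chat - x_base),
                i.e. close to a diff-SAE, except the inputs are (base, chat)
                rather than chat - base
    """
    err_base = x_base - x_hat_base
    err_chat = x_chat - x_hat_chat
    term_1 = (err_base + lam * err_chat).pow(2).sum(dim=-1)
    term_2 = (err_chat + lam * err_base).pow(2).sum(dim=-1)
    return (term_1 + term_2).mean()
```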
I'd like to use our diffing toolkit to investigate the downstream usefulness of this modification, as well as of the Matryoshka loss.
Maybe emergent misalignment models are already Slytherin... https://huggingface.co/ModelOrganismsForEM
Yes, that's what I meant. I agree that the fact that patching the error boosts the alternative completion makes this explanation much weaker (although it could still be a combination of the two).
I think it'd be super interesting to expand your analysis to linear vs non-linear error to understand which part matters, and then explore why!
Nice work! Could you elaborate on how you think your findings relate to Engels et al.'s paper "Decomposing The Dark Matter of Sparse Autoencoders" that you cite? For context: they find that only ~50% of the SAE error can be linearly predicted from the input activation. They hypothesize that the linearly predictable half contains features the SAE has not yet learned, while the remaining half consists of errors introduced by the SAE itself rather than meaningful model features (@josh-engels correct me if I'm misrepresenting the paper's findings).
It would be interesting to see whether your restoration results come from the linear or the nonlinear component. If only the nonlinear part matters for restoration, then the explanation for why the error matters is pretty boring: it just prevents the reconstruction from being out of distribution.
However, if the restoration comes from the linear component, one hypothesis would be that the intermediate representations you describe live in multidimensional subspaces (kind of like high-frequency features) but have low norm. This would make them economically unattractive for SAEs to learn: poor reconstruction gain due to their low norm, while consuming multiple dictionary elements due to being multidimensional.
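To make the suggested analysis concrete, here is a rough sketch of how one could split the SAE error into its linearly predictable and residual parts and patch each in separately, following the decomposition in Engels et al. (names like `acts` and `sae` are placeholders; I'm assuming an SAE object with `encode`/`decode` methods):

```python
import torch

@torch.no_grad()
def fit_linear_error_predictor(acts, sae, reg=1e-3):
    """Fit (W, b) minimizing ||(x - SAE(x)) - (x @ W + b)||^2 with ridge regression.

    acts: [N, d_model] input activations; sae: object with encode/decode (assumed API).
    """
    err = acts - sae.decode(sae.encode(acts))              # SAE error on each sample
    ones = torch.ones(acts.shape[0], 1, device=acts.device, dtype=acts.dtype)
    X = torch.cat([acts, ones], dim=1)                     # add a bias column
    gram = X.T @ X + reg * torch.eye(X.shape[1], device=acts.device, dtype=acts.dtype)
    Wb = torch.linalg.solve(gram, X.T @ err)               # closed-form ridge solution
    return Wb[:-1], Wb[-1]                                 # weights [d, d], bias [d]

@torch.no_grad()
def decompose_error(x, sae, W, b):
    """Split the SAE error into its linearly predictable part and the residual."""
    err = x - sae.decode(sae.encode(x))
    linear = x @ W + b
    return linear, err - linear

# Restoration test (sketch): patch in the reconstruction plus only one error
# component and compare the downstream effect, e.g.
#   recon = sae.decode(sae.encode(x))
#   patched_linear    = recon + linear_part        # keeps only the linear error
#   patched_nonlinear = recon + nonlinear_part     # keeps only the nonlinear error
```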
Cool post! Did you try steering with the "Killing all humans" vector? Does it generalize as well as the others, and are the responses similar?
Just asked Claude for thoughts on some technical mech interp paper I'm writing. The difference with and without the non-sycophantic prompt is dramatic (even with extended thinking):
Me: What do you think of this bullet point structure for this section based on those results?
Claude (normal): I think your proposed structure makes a lot of sense based on the figures and table. Here's how I would approach reorganizing this section: {expands my bullet points but sticks to my structure}
Claude (non-sycophantic): I like your proposed structure, but I think it misses some opportunities to sharpen the narrative around what's actually happening with these results {proceeds to give valuable feedback that changed my mind about how to write the section}
I like this idea! Looking forward to your progress 👀
Would training a model to not sandbag despite the given instruction (maybe just for a subset of questions to ensure the model doesn't just discard the information?) resolve your disagreement?
If the probe is only looking at the instruction, then even if the model is being honest, it will be triggered.