Oliver Clive-Griffin

LESSWRONG

Replying to How to use and interpret activation patching

Oliver Clive-Griffin 10mo

Ah, that’s not actually what I meant, but that clarifies it fully, thanks.

Replying to How to use and interpret activation patching

Oliver Clive-Griffin 10mo

This implies the components make up a cross-section of the circuit

What exactly is meant by "cross-section" here? My intuition would be that if patching a set of components is sufficient to restore behaviour, then that set is a non-strict superset of the components that make up the circuit?
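To make the question concrete, here is a minimal toy sketch (not the post's setup): a one-hidden-layer network in NumPy where, by construction, only hidden units 0 to 2 influence the output. The weights, the `run` helper, and the choice of units are all assumptions made for illustration. Patching clean activations into a corrupted run restores the clean output both for the units that matter and for a larger set containing them, so sufficiency under patching alone does not show that a set is minimal.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(6, 4))                    # input -> 6 hidden units
w2 = np.array([1.0, -2.0, 0.5, 0.0, 0.0, 0.0])  # only units 0-2 affect the output

def hidden(x):
    """Hidden activations (ReLU) of the toy model."""
    return np.maximum(W1 @ x, 0.0)

def run(x, patch=None):
    """Forward pass; `patch` maps hidden-unit index -> activation to overwrite."""
    h = hidden(x)
    if patch:
        for idx, value in patch.items():
            h[idx] = value
    return float(w2 @ h)

clean_x = rng.normal(size=4)
corrupt_x = rng.normal(size=4)
clean_h = hidden(clean_x)
clean_out = run(clean_x)

def patched_out(units):
    """Run on the corrupted input with the given units set to clean values."""
    return run(corrupt_x, {i: clean_h[i] for i in units})

out_circuit = patched_out([0, 1, 2])         # exactly the units that matter
out_superset = patched_out([0, 1, 2, 3, 4])  # a strict superset of them

print(clean_out, out_circuit, out_superset)
```

Both patched runs match the clean output here, even though units 3 and 4 play no role in it.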

Oliver Clive-Griffin 10mo

Is the central point here that a given input will activate its representation in both the size-1000 and size-50 sub-dictionaries, meaning the reconstruction will be 2x too big?

[Replication] Crosscoder-based Stage-Wise Model Diffing

Anna Soligo, Thomas Read, Oliver Clive-Griffin, dmanningcoe, Chun Hei Yip, rajashree, Jason Gross

Introduction

Anthropic recently released Stage-Wise Model Diffing, which presents a novel way of tracking how transformer features change during fine-tuning. We've replicated this work on a TinyStories-33M language model to study feature changes in a more accessible research context. Instead of SAEs we worked with single-model all-layer crosscoders, and found that the technique is also effective with cross-layer features.

This post documents our methodology. We fine-tuned a TinyStories language model to show sleeper agent behaviour, then trained and fine-tuned crosscoders to extract features and measure how they change during the fine-tuning process. Running all training and experiments takes under an hour on a single RTX 4090 GPU.

We release code for training and analysing sleeper agents and...

Oliver Clive-Griffin 1y

Thanks. Also, in the case of crosscoders, where you have multiple output spaces, do you have any thoughts on the best way to aggregate across these? Currently I'm just computing the FVU for each space separately and taking the mean. But I could imagine it being better to just concatenate the spaces and compute FVU on that, using the L2 norm of the concatenated vectors.
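The two aggregation options mentioned above can be compared with a small NumPy sketch. The function names and the synthetic data are hypothetical, and the two "output spaces" are just random arrays with deliberately different scales. It shows one concrete difference: the mean weights every space equally, while concatenation weights each space by its share of the total variance.

```python
import numpy as np

def fvu(x, x_pred):
    """FVU for arrays of shape (n_samples, n_dims), using squared L2 norms."""
    mu = x.mean(axis=0, keepdims=True)
    return np.sum((x - x_pred) ** 2) / np.sum((x - mu) ** 2)

def fvu_mean_over_spaces(xs, x_preds):
    """Option 1: compute FVU per output space, then average."""
    return float(np.mean([fvu(x, p) for x, p in zip(xs, x_preds)]))

def fvu_concat(xs, x_preds):
    """Option 2: concatenate spaces along the feature axis, then compute FVU."""
    return float(fvu(np.concatenate(xs, axis=1), np.concatenate(x_preds, axis=1)))

rng = np.random.default_rng(0)
x_small = rng.normal(size=(500, 4))          # low-variance space
x_large = 10.0 * rng.normal(size=(500, 4))   # high-variance space

# Assumed reconstructions: the small space is predicted by its mean (FVU = 1),
# the large space is reconstructed perfectly (FVU = 0).
pred_small = np.tile(x_small.mean(axis=0), (500, 1))
pred_large = x_large.copy()

mean_fvu = fvu_mean_over_spaces([x_small, x_large], [pred_small, pred_large])
concat_fvu = fvu_concat([x_small, x_large], [pred_small, pred_large])
print(mean_fvu, concat_fvu)
```

In this setup the mean reports 0.5, while the concatenated version is close to 0.01 because the badly reconstructed space contributes little of the total variance. Which behaviour is preferable depends on whether the spaces are meant to count equally.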

Oliver Clive-Griffin 1y

This was really helpful, thanks! Just wanting to clear up my understanding:

This is the wikipedia entry for FVU:

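$$\text{FVU} = \frac{SS_{err}}{SS_{tot}}$$

where:

$$SS_{err} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2, \qquad SS_{tot} = \sum_{i=1}^{N} (y_i - \bar{y})^2$$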

There's no mention of norms because (as I understand it) $y$ and $\hat{y}$ are assumed to be scalar values, so $SS_{err}$ and $SS_{tot}$ are scalars. Do I understand correctly that you're treating $\lVert x_{n} - x_{n,\text{pred}} \rVert^{2}$ as the multi-dimensional equivalent of $SS_{err}$ and $\lVert x_{n} - \mu \rVert^{2}$ as the multi-dimensional equivalent of $SS_{tot}$? This would make sense, as using the squared norms of the differences makes it basis / rotation invariant.
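The rotation-invariance claim can be checked numerically. The sketch below is my reading of the vector-valued definition (summing the squared norms over samples), not necessarily the exact formula the parent comment uses. The data and the random orthogonal matrix are made up for the check.

```python
import numpy as np

def fvu(x, x_pred):
    """Vector-valued FVU for arrays of shape (n_samples, n_dims)."""
    mu = x.mean(axis=0, keepdims=True)
    ss_err = np.sum((x - x_pred) ** 2)  # sum_n ||x_n - x_pred_n||^2
    ss_tot = np.sum((x - mu) ** 2)      # sum_n ||x_n - mu||^2
    return ss_err / ss_tot

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 8))
x_pred = x + 0.1 * rng.normal(size=(200, 8))

# Random orthogonal matrix from a QR decomposition (a rotation/reflection).
q, _ = np.linalg.qr(rng.normal(size=(8, 8)))

fvu_original = fvu(x, x_pred)
fvu_rotated = fvu(x @ q, x_pred @ q)
print(fvu_original, fvu_rotated)
```

The two values agree up to floating-point error, since an orthogonal change of basis preserves the L2 norm of every difference vector.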

Replying to SAE regularization produces more interpretable models

Oliver Clive-Griffin 1y

produce a new set of SAE features using a different set of training parameters

By this, do you mean training a new SAE on the activations of the "regularized weights"?

Replying to Deep Q-Networks Explained

Oliver Clive-Griffin 2y

The target network has access to more information than the Q-network does, and thus is a better predictor

Some additional context for anyone confused by this (as I was): "more information" is not a statement about training data, but instead about which prediction task the target network has to do.

In other words, the target network has an "easier" prediction task: predicting the return from a state further in the future. I was confused because I thought the suggestion was that the target network has been trained for longer, but it hasn't; it's just an old checkpoint of the Q-network.

The distinction between the Q-network and the target network here is actually not very important; the key point would hold even if you were using the Q-network instead. That is, predicting the return from a later state is "easier" than predicting $Q(s, a; \theta_{i})$ for any network, as the first is in some sense a subset of the second task.
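As a rough illustration of this point, here is a minimal sketch of the standard one-step DQN target, $r + \gamma \max_{a'} Q(s', a'; \theta^{-})$, using a toy linear Q-function in NumPy. The linear model, the shapes, and the helper names are assumptions for the example rather than anything from the post. The target is built from an observed reward plus an estimate from the next state, so only the remainder of the return has to be predicted.

```python
import numpy as np

def q_values(theta, state):
    """Toy linear Q-function: one row of weights per action."""
    return theta @ state

def td_target(theta_target, reward, next_state, gamma=0.99, done=False):
    """One-step target: observed reward plus discounted best next-state value."""
    if done:
        return reward
    return reward + gamma * np.max(q_values(theta_target, next_state))

rng = np.random.default_rng(0)
theta = rng.normal(size=(3, 4))  # online Q-network parameters (3 actions)
theta_target = theta.copy()      # target network: an older copy of the same parameters

state = rng.normal(size=4)
next_state = rng.normal(size=4)
action, reward = 1, 0.5

y = td_target(theta_target, reward, next_state)
td_error = y - q_values(theta, state)[action]
print(y, td_error)
```

Swapping `theta_target` for `theta` in the call leaves the structure of the target unchanged, which matches the remark that the argument does not depend on which network produces the next-state estimate.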