StefanHex

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

The previous lines calculate the ratio (or 1 − ratio) stored in the "explained variance" key for every sample/batch. Then, in the line quoted below, the list is averaged, i.e. we're taking the sample average over the ratio. That's the FVU_B formula.

Let me know if this clears it up or if we’re misunderstanding each other!

I think this is the sum over the vector dimension, but not over the samples. The sum (mean) over samples is taken later, in this line, which happens after the division:

        metrics[f"{metric_name}"] = torch.cat(metric_values).mean().item()

Edit: And to clarify, my impression is that people think of these as alternative definitions of FVU that you get to pick between, rather than one being right and one being a bug.

Edit2: And I'm in touch with the SAEBench authors about making a PR to change this / add both options (and by extension probably doing the same in SAELens); though I won't mind if anyone else does it!

StefanHex*272

PSA: People use different definitions of "explained variance" / "fraction of variance unexplained" (FVU)

$\mathrm{FVU}_A = \frac{\sum_i \lVert x_i - \hat{x}_i \rVert^2}{\sum_i \lVert x_i - \bar{x} \rVert^2}$ is the formula I think is sensible; the bottom is simply the variance of the data, and the top is the variance of the residuals. The $\lVert\cdot\rVert$ indicates the $L_2$ norm over the dimension of the vector $x_i$, and $\bar{x}$ is the mean of the data. I believe it matches Wikipedia's definition of FVU and R squared.

$\mathrm{FVU}_B = \frac{1}{N}\sum_i \frac{\lVert x_i - \hat{x}_i \rVert^2}{\lVert x_i - \bar{x} \rVert^2}$ (with $N$ the number of samples) is the formula used by SAELens and SAEBench. It seems less principled; @Lucius Bushnaq and I couldn't think of a nice quantity it corresponds to. I think of it as giving more weight to samples that are close to the mean, kind of averaging the relative reduction in difference rather than the absolute one.

There is also a third version (h/t @JoshEngels) that computes the FVU for each dimension independently and then averages; that version is not used in the context we're discussing here.
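
For concreteness, here is a small PyTorch sketch of all three definitions side by side; this is my own code following the formulas above, not taken from SAELens or SAEBench.

    import torch

    def fvu_variants(x: torch.Tensor, x_hat: torch.Tensor):
        """Compute the three FVU variants discussed above for activations x and
        reconstructions x_hat, both of shape (n_samples, d_model). My own sketch
        of the definitions, not library code."""
        mu = x.mean(dim=0)                     # dataset mean, shape (d_model,)
        resid_sq = (x - x_hat).pow(2)          # squared residuals
        total_sq = (x - mu).pow(2)             # squared deviations from the mean

        # FVU_A: ratio of summed residual variance to summed data variance
        fvu_a = resid_sq.sum() / total_sq.sum()
        # FVU_B: per-sample ratio of squared norms, then averaged over samples
        fvu_b = (resid_sq.sum(dim=1) / total_sq.sum(dim=1)).mean()
        # Third variant: FVU per dimension, then averaged over dimensions
        fvu_per_dim = (resid_sq.sum(dim=0) / total_sq.sum(dim=0)).mean()
        return fvu_a.item(), fvu_b.item(), fvu_per_dim.item()

    # Toy usage
    torch.manual_seed(0)
    x = torch.randn(1000, 64)
    x_hat = x + 0.3 * torch.randn_like(x)
    print(fvu_variants(x, x_hat))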

In my recent comment I had computed my own $\mathrm{FVU}_A$ values and compared them to the FVUs from SAEBench (which uses $\mathrm{FVU}_B$), and obtained nonsense results.

Curiously, the two definitions seem to be approximately proportional (below I show the performance of a bunch of SAEs), though for different distributions (here: activations in layers 3 and 4) the ratio differs.[1] Still, this means that using $\mathrm{FVU}_B$ instead of $\mathrm{FVU}_A$ to compare e.g. different SAEs doesn't make a big difference, as long as one is consistent.

Thanks to @JoshEngels for pointing out the difference, and to @Lucius Bushnaq for helpful discussions.

  1. ^

    If a predictor doesn't perform systematically better or worse at points closer to the mean, then this makes sense: the denominator changes the relative weight of different samples, but this has no effect beyond noise and a global scale as long as there is no systematic performance difference.
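
    As a quick illustration of this point (my own toy simulation, not the plotted SAE data): if the reconstruction error is drawn independently of each sample's distance to the mean, $\mathrm{FVU}_A$ and $\mathrm{FVU}_B$ stay roughly proportional as the error scale varies.

        import torch

        # Toy check of the footnote's claim (my own simulation, not the SAE data):
        # if reconstruction error is independent of a sample's distance to the mean,
        # the ratio FVU_B / FVU_A is roughly constant as the error scale is varied.
        torch.manual_seed(0)
        x = torch.randn(5000, 64) * torch.linspace(0.5, 2.0, 64)  # anisotropic "activations"

        for noise in [0.1, 0.3, 1.0]:
            x_hat = x + noise * torch.randn_like(x)    # error independent of ||x - mean||
            resid = (x - x_hat).pow(2).sum(dim=1)
            total = (x - x.mean(dim=0)).pow(2).sum(dim=1)
            fvu_a = resid.sum() / total.sum()
            fvu_b = (resid / total).mean()
            print(f"noise={noise:.1f}  FVU_A={fvu_a:.3f}  FVU_B={fvu_b:.3f}  ratio={fvu_b / fvu_a:.2f}")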

Same plot but using SAEBench's FVU definition. Matches this Neuronpedia page.

I'm going to update the results in the top-level comment with the corrected data; I'm pasting the original figures here for posterity / understanding the past discussion. Summary of changes:

  1. [Minor] I didn't subtract the mean in the variance calculation. This barely had an effect on the results.
  2. [Major] I used a different definition of "Explained Variance", which caused a pretty large difference.

Old (no longer true) text:

It turns out that even clustering (essentially L_0=1) explains up to 90% of the variance in activations, being matched only by SAEs with L_0>100. This isn't an entirely fair comparison, since SAEs are optimised for the large-L_0 regime, while I haven't found an L_0>1 operationalisation of clustering that meaningfully improves over L_0=1. To have some comparison I'm adding a PCA + Clustering baseline where I apply a PCA before doing the clustering. It does roughly as well as expected, exceeding the SAE reconstruction for most L_0 values. The upcoming SAEBench paper also does a PCA baseline, so I won't discuss PCA in detail here.


[...]

Here's the code used to get the clustering & PCA results below; the SAE numbers are taken straight from Neuronpedia. Both my code and SAEBench/Neuronpedia use OpenWebText with a context length of 128 tokens, so I hope the numbers are comparable, but there's a risk I missed something and we're comparing apples to oranges.
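
I don't have the linked code in front of me, so as a rough illustration, here is my own reconstruction of what such a (PCA +) clustering baseline could look like, using scikit-learn; the function name and parameter choices are hypothetical, not the actual code behind the numbers below.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def clustering_baseline_fvu(acts: np.ndarray, n_clusters: int, n_pca: int = 0) -> float:
        """Rough sketch of a (PCA +) clustering reconstruction baseline; my own
        guess at the setup, not the linked code. acts has shape (n_tokens, d_model)."""
        recons = np.zeros_like(acts)
        residual = acts
        if n_pca > 0:
            # Reconstruct part of each activation with a rank-n_pca PCA ...
            pca = PCA(n_components=n_pca).fit(acts)
            recons = pca.inverse_transform(pca.transform(acts))
            residual = acts - recons
        # ... and reconstruct the remainder with its nearest cluster centroid (L_0 = 1).
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(residual)
        recons = recons + km.cluster_centers_[km.labels_]
        # Returns FVU_A; "variance explained" would be 1 minus this.
        return float(((acts - recons) ** 2).sum() / ((acts - acts.mean(axis=0)) ** 2).sum())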

StefanHex*40

After adding the mean subtraction, the numbers actually haven't changed too much, but let me make sure I'm using the correct calculation. I'm going to follow your and @Adam Karvonen's suggestion of using the SAEBench code and loading my clustering solution as an SAE (this code).
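
For context, "loading a clustering solution as an SAE" can be thought of roughly as follows; the wrapper below is a hypothetical illustration of the idea, not the actual SAEBench interface.

    import torch

    class KMeansAsSAE(torch.nn.Module):
        """Hypothetical wrapper exposing a k-means codebook through an SAE-like
        encode/decode interface, so it can be evaluated with the same metrics
        code as a real SAE. Not the actual SAEBench interface."""

        def __init__(self, centroids: torch.Tensor):  # centroids: (n_clusters, d_model)
            super().__init__()
            self.register_buffer("centroids", centroids)

        def encode(self, acts: torch.Tensor) -> torch.Tensor:
            # One-hot "feature activations": each sample activates exactly one
            # latent, its nearest centroid, i.e. L_0 = 1.
            dists = torch.cdist(acts, self.centroids)  # (n_samples, n_clusters)
            one_hot = torch.nn.functional.one_hot(
                dists.argmin(dim=-1), num_classes=self.centroids.shape[0]
            )
            return one_hot.to(acts.dtype)

        def decode(self, feats: torch.Tensor) -> torch.Tensor:
            # The reconstruction is simply the selected centroid.
            return feats @ self.centroids

        def forward(self, acts: torch.Tensor) -> torch.Tensor:
            return self.decode(self.encode(acts))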

These logs show numbers with the original / corrected explained variance computation; the difference is in the 3-8% range.

v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8887 / 0.8568
v3 (KMeans): Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16384, variance explained = 0.9020 / 0.8740
v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4096, variance explained = 0.8044 / 0.7197
v3 (KMeans): Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16384, variance explained = 0.8261 / 0.7509
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4095, n_pca=1, variance explained = 0.8910 / 0.8599
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16383, n_pca=1, variance explained = 0.9041 / 0.8766
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4094, n_pca=2, variance explained = 0.8948 / 0.8647
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16382, n_pca=2, variance explained = 0.9076 / 0.8812
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4091, n_pca=5, variance explained = 0.9044 / 0.8770
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16379, n_pca=5, variance explained = 0.9159 / 0.8919
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4086, n_pca=10, variance explained = 0.9121 / 0.8870
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16374, n_pca=10, variance explained = 0.9232 / 0.9012
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4076, n_pca=20, variance explained = 0.9209 / 0.8983
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16364, n_pca=20, variance explained = 0.9314 / 0.9118
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=4046, n_pca=50, variance explained = 0.9379 / 0.9202
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16334, n_pca=50, variance explained = 0.9468 / 0.9315
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=3996, n_pca=100, variance explained = 0.9539 / 0.9407
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16284, n_pca=100, variance explained = 0.9611 / 0.9499
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=3896, n_pca=200, variance explained = 0.9721 / 0.9641
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=16184, n_pca=200, variance explained = 0.9768 / 0.9702
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=3596, n_pca=500, variance explained = 0.9999 / 0.9998
PCA+Clustering: Layer blocks.3.hook_resid_post, n_tokens=1000000, n_clusters=15884, n_pca=500, variance explained = 0.9999 / 0.9999
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4095, n_pca=1, variance explained = 0.8077 / 0.7245
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16383, n_pca=1, variance explained = 0.8292 / 0.7554
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4094, n_pca=2, variance explained = 0.8145 / 0.7342
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16382, n_pca=2, variance explained = 0.8350 / 0.7636
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4091, n_pca=5, variance explained = 0.8244 / 0.7484
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16379, n_pca=5, variance explained = 0.8441 / 0.7767
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4086, n_pca=10, variance explained = 0.8326 / 0.7602
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16374, n_pca=10, variance explained = 0.8516 / 0.7875
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4076, n_pca=20, variance explained = 0.8460 / 0.7794
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16364, n_pca=20, variance explained = 0.8637 / 0.8048
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=4046, n_pca=50, variance explained = 0.8735 / 0.8188
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16334, n_pca=50, variance explained = 0.8884 / 0.8401
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=3996, n_pca=100, variance explained = 0.9021 / 0.8598
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16284, n_pca=100, variance explained = 0.9138 / 0.8765
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=3896, n_pca=200, variance explained = 0.9399 / 0.9139
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=16184, n_pca=200, variance explained = 0.9473 / 0.9246
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=3596, n_pca=500, variance explained = 0.9997 / 0.9996
PCA+Clustering: Layer blocks.4.hook_resid_post, n_tokens=1000000, n_clusters=15884, n_pca=500, variance explained = 0.9998 / 0.9997
StefanHex*20

You're right, I forgot to subtract the mean. Thanks a lot!

I'm computing new numbers now, but indeed I expect this to explain my result! (Edit: Seems to not change too much)

I should really run a random Gaussian data baseline for this.
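
A minimal version of such a baseline could look like the following; this is my own sketch with arbitrary sizes, not the code behind the numbers below.

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Sketch of a random-Gaussian baseline (my own code, arbitrary sizes): how much
    # variance does clustering explain on isotropic Gaussian data of similar
    # dimensionality, where there is no structure to exploit?
    rng = np.random.default_rng(0)
    x = rng.standard_normal((200_000, 512)).astype(np.float32)  # stand-in "activations"

    km = MiniBatchKMeans(n_clusters=4096, batch_size=4096, random_state=0).fit(x)
    recons = km.cluster_centers_[km.labels_]
    fvu = ((x - recons) ** 2).sum() / ((x - x.mean(axis=0)) ** 2).sum()
    print(f"variance explained = {1 - fvu:.3f}")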

Tentatively I get similar results (70-85% variance explained) for random data. I haven't checked that code at all though, so don't trust this; I will double-check it tomorrow.

(In that case the SAEs' performance would also be unsurprising, I suppose.)

[This comment is no longer endorsed by its author]