StefanHex

Stefan Heimersheim. Research Scientist at Apollo Research, Mechanistic Interpretability. The opinions expressed here are my own and do not necessarily reflect the views of my employer.

Posts

5 · StefanHex's Shortform · 1y · 74

Comments

Race and Gender Bias As An Example of Unfaithful Chain of Thought in the Wild
StefanHex · 10d · 30

Great paper! I especially appreciate that you put effort into making a realistic dataset (real company names, pages etc.), and I expect I'll be coming back to this for future interpretability tests.

I found this statement quite strong:

While fine-tuning could potentially work, it would likely require constructing a specialized dataset and may not consistently generalize out of distribution. Interpretability methods, on the other hand, provide a simpler baseline with strong generalization.

Did you run experiments on this? Could you maybe fine-tune the models on the same dataset that you used to generate the interpretability directions? It's not obvious (but plausible!) to me that the model-internals method, which is also based on the specific CV dataset, generalizes better than fine-tuning.
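For concreteness, here's a rough sketch of the comparison I have in mind; the difference-of-means direction and all names below are my assumptions for illustration, not necessarily how the paper actually derives its directions:

```python
# Hypothetical sketch (my assumptions, not the paper's method): derive a bias
# direction from the same labeled CV dataset a fine-tuning baseline would use,
# then compare both interventions on held-out / out-of-distribution CVs.
import torch

def difference_of_means_direction(acts_biased, acts_unbiased):
    """Candidate 'bias' direction from mean activations of the two label groups."""
    d = acts_biased.mean(dim=0) - acts_unbiased.mean(dim=0)
    return d / d.norm()

def ablate_direction(resid, direction):
    """Project the direction out of residual-stream activations at some layer."""
    return resid - (resid @ direction)[..., None] * direction

# Comparison (pseudocode):
# 1. interp baseline: hook ablate_direction into the model, measure bias on OOD CVs
# 2. fine-tuning baseline: train on the *same* labeled CVs, measure bias on OOD CVs
```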

StefanHex's Shortform
StefanHex · 15d · 20

Thanks for flagging this, I missed that post! The advice in the post & its comments is very useful, especially considerations like preparing to aim the AIs, setting oneself up to provide oversight to many AI agents, and whether we'll understand what the AIs are developing.

Compressed Computation is (probably) not Computation in Superposition
StefanHex · 16d · 50

The features are on, but take arbitrary values between -1 and 1 (I assume you were thinking of the binary case).

StefanHex's Shortform
StefanHex · 17d · 50

I've heard people say we should deprioritise fundamental & mechanistic interpretability[1] in short-timelines (automated AI R&D) worlds. This seems not obvious to me.

The usual argument is

  1. Fundamental interpretability will take many years or decades until we "solve interpretability" and the research bears fruits.
  2. Timelines are short, we don't have many years or even decades.
  3. Thus we won't solve interpretability in time.

But this forgets that automated AI R&D means we'll have decades of subjective research-time in months or years of wall-clock time! (This is why timelines would be short in the first place.) I expect mechanistic interpretability research to be about as automatable as other AI R&D, maybe even more automatable because it's less bottlenecked by hardware. Thus interpretability progress should speed up by a similar factor as capabilities, cancelling out the shorter timelines argument.
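As a toy version of the arithmetic (the 30× figure is a number I made up purely to show the cancellation, not a forecast):

```python
# Toy illustration of the cancellation argument; the speedup numbers are
# made up for illustration, not forecasts.
wall_clock_years = 2
ai_rnd_speedup = 30   # assumed speedup of capabilities research from automated AI R&D
interp_speedup = 30   # assume interpretability research automates about as well

subjective_capability_years = wall_clock_years * ai_rnd_speedup  # 60
subjective_interp_years = wall_clock_years * interp_speedup      # 60

# Relative progress of interpretability vs. capabilities is unchanged (ratio = 1.0),
# so short timelines per se don't disadvantage interpretability.
print(subjective_interp_years / subjective_capability_years)
```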

So short timelines (via automated AI R&D) by themselves don't make interpretability less useful.


In this shortform I'm not necessarily arguing for interpretability (that'll be a later post—interpretability might be quite hard), and I also want to acknowledge that some agendas (e.g. AI control, evals & governance) become more important in short timelines worlds.

  1. ^

    I'm talking about the new fundamental approaches (computation in superposition, new toy models, new decomposition ideas etc.), specifically not about scaling existing sparse dictionary learning methods like SAEs/CLTs. ↩︎

Compressed Computation is (probably) not Computation in Superposition
StefanHex · 17d · 30

Yep, the main code is in this folder! You can find

  • the main logic in mlpinsoup.py
  • notebooks reproducing the plots for each result section in the nb{1,2,3,4}... files
StefanHex's Shortform
StefanHex · 18d · 23-3

Memorization in LLMs is probably Computation in Superposition (CiS, Vaintrob et al., 2024).

CiS is often considered a predominantly theoretical concept. I want to highlight that most memorization in LLMs is probably CiS. Specifically, the typical CiS task of "compute more AND gates than you have ReLU neurons" is exactly what you need to memorize lots of facts. I'm certainly not the first one to say this, but it also doesn't seem to be common knowledge. I'd appreciate pushback or references in the comments!

Consider the token “Michael”. GPT-2 knows many things about Michael, including a lot of facts about Michael Jordan and Michael Phelps, all of which are relevant in different contexts. The model cannot represent all of these in the embedding of the token Michael (conventional superposition, Elhage et al., 2022); in fact—if SAEs are any indication—the model can only represent about 30-100 features at a time.

So this knowledge must be retrieved dynamically. In the sentence “Michael Jordan plays the sport of”, a model will consider the intersection of Michael AND Jordan AND sport, resulting in basketball. Folk wisdom is that this kind of memorization is implemented by the MLP blocks in a Transformer. And since GPT-2 knows more facts than it has MLP neurons, we arrive at the "compute more AND gates than you have ReLU neurons" problem.
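As a toy illustration of the "AND gates from ReLU neurons" framing (my own sketch, not taken from the CiS paper): a single ReLU neuron suffices for the Michael AND Jordan AND sport lookup, and the CiS question is how many such gates can be packed when there are fewer neurons than facts.

```python
# Toy sketch (mine, not from Vaintrob et al.): one ReLU neuron implements a
# 3-way AND over (roughly orthogonal) feature directions and writes out a
# "basketball" direction. CiS is about packing far more such gates than neurons.
import numpy as np

d_model = 256
rng = np.random.default_rng(0)
unit = lambda v: v / np.linalg.norm(v)
michael, jordan, sport, basketball = (unit(rng.standard_normal(d_model)) for _ in range(4))

w_in = michael + jordan + sport   # read off the three input features
b = -2.5                          # fires only if all three are present (~1 each)
w_out = basketball                # write the answer feature

def mlp_neuron(resid):
    return max(w_in @ resid + b, 0.0) * w_out  # ReLU neuron

query = michael + jordan + sport                     # "Michael Jordan ... sport of"
print(basketball @ mlp_neuron(query))                # positive (~0.5): answer written
print(np.linalg.norm(mlp_neuron(michael + jordan)))  # ~0: AND not satisfied
```

With random feature directions the read-offs are only approximately 0 or 1, which is exactly where the interference/error analysis in the CiS literature comes in.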

StefanHex's Shortform
StefanHex · 23d · 108

I don't like the extensive theming of the frontpage around If Anyone Builds It, Everyone Dies.

The artwork is distracting. I just went on LW to create a new draft, got distracted, clicked on the website, and spent 3 minutes reporting a bug. I expect this is intended to some degree, but it feels a little "out to get you" to me.

Edit: The mobile site looks quite bad too (it just looks like unintended dark mode)

How to use and interpret activation patching
StefanHex · 1mo · 40

Dmitry Vaintrob has a nice post on logits / logprobs here!

Misrepresentation as a Barrier for Interp (Part I)
StefanHex · 1mo · 40

I appreciate this write-up! I felt like the answer to "is the truth of a statement represented somewhere?" is obviously no (elaboration below). However, I was surprised by the "tell a story about purpose / natural fact" angle. It seems that a malfunctioning heart is "wrong" in a different, more obvious, sense than e.g. a false mathematical statement.

Here are the points that I thought were obvious (and this still makes sense to me):

  • If you measure correlation between real things (horse at night) and NN activations (looks like a horse at night) there will be a mismatch.
    • This feels like a kind of "skill issue" to me: your class of things (which included "horse at night" but not "looks like a horse at night") wasn't sufficiently wide or good. We should just recognise that, say, the examples correlated with this direction are all consistent with the NN thinking something is a horse, and then label it accordingly. Obviously this is hard (impossible? especially if the NN is smarter than us), but the issue doesn't seem to be that the NN made a mistake per se. Instead, the issue was choosing only real physical things as possible labels.
  • It's hard to define "misrepresentation" as a physical phenomenon based on the physical atoms present.
    • This seems somewhat obvious to me: of course there is no representation of whether someone is "right" or "wrong" in the atoms! That'd be cheating! Imagine writing "The Riemann hypothesis is true" or "The runtime of busy beaver (99) is an even number". The compute done by the universe when you write this is less than what it takes to decide whether the hypothesis (at least the second one) is correct or not!

However, at the end of the post you suggest there might actually be a way to get to a definition of misrepresentation or malfunction based on physical atoms only:
The (correct / intended) representation is that which makes a more natural (less surprising?) story. The heart having the purpose of pumping blood explains its design a lot better than it having the purpose of making a thump-thump sound, even when all you have is a malfunctioning heart.

SAE vs. RepE
StefanHex · 2mo · 50

While I don’t have time to address the Dan Hendrycks & RepE questions, I want to link you to this post from the GDM mech interp team. It gives a good critique of SAEs in terms of making downstream progress.

56 · Compressed Computation is (probably) not Computation in Superposition · 20d · 9
46 · Try training token-level probes · 3mo · 6
27 · Proof-of-Concept Debugger for a Small LLM · 4mo · 0
102 · Detecting Strategic Deception Using Linear Probes (Ω) · 5mo · 9
21 · SAE regularization produces more interpretable models · 5mo · 7
108 · Attribution-based parameter decomposition (Ω) · 6mo · 22
47 · Analyzing how SAE features evolve across a forward pass · 8mo · 0
42 · Characterizing stable regions in the residual stream of LLMs · 10mo · 4
29 · Evaluating Synthetic Activations composed of SAE Latents in GPT-2 · 10mo · 0
28 · Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs · 10mo · 0