Thanks for flagging this, I missed that post! The advice in the post & its comments is very useful, especially considerations like preparing to aim the AIs, setting oneself up to provide oversight to many AI agents, and whether we'll understand what the AIs are developing.
The features are on, but take arbitrary values between -1 and 1 (I assume you were thinking of the binary case).
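For concreteness, here's a minimal sampling sketch (PyTorch; the sparsity level and uniform value distribution are my assumptions, not necessarily the actual setup): features are mostly off, and an active feature takes an arbitrary value in [-1, 1] rather than being 0/1.

```python
import torch

def sample_features(batch_size: int, n_features: int, p_active: float = 0.05) -> torch.Tensor:
    """Sparse features: each is off with prob 1 - p_active, otherwise uniform in [-1, 1]."""
    mask = (torch.rand(batch_size, n_features) < p_active).float()
    values = torch.rand(batch_size, n_features) * 2 - 1  # uniform in [-1, 1]
    return mask * values
```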
I've heard people say we should deprioritise fundamental & mechanistic interpretability[1] in short-timelines (automated AI R&D) worlds. This seems not obvious to me.
The usual argument is that fundamental interpretability research is a long-term bet: in short-timelines worlds there isn't enough time for it to mature and pay off.
But this forgets that automated AI R&D means we'll have decades of subjective research-time in months or years of wall-clock time! (This is why timelines would be short in the first place.) I expect mechanistic interpretability research to be about as automatable as other AI R&D, maybe even more automatable because it's less bottlenecked by hardware. Thus interpretability progress should speed up by a similar factor as capabilities, cancelling out the shorter timelines argument.
So short timelines (via automated AI R&D) by themselves don't make interpretability less useful.
In this shortform I'm not necessarily arguing for interpretability (that'll be a later post—interpretability might be quite hard), and I also want to acknowledge that some agendas (e.g. AI control, evals & governance) become more important in short timelines worlds.
Yep, the main code is in this folder! You can find mlpinsoup.py and the nb{1,2,3,4}... files.

Memorization in LLMs is probably Computation in Superposition (CiS, Vaintrob et al., 2024).
CiS is often considered a predominantly theoretical concept. I want to highlight that most memorization in LLMs is probably CiS. Specifically, the typical CiS task of "compute more AND gates than you have ReLU neurons" is exactly what you need to memorize lots of facts. I'm certainly not the first one to say this, but it also doesn't seem common knowledge. I'd appreciate pushback or references in the comments!
Consider the token “Michael”. GPT-2 knows many things about Michael, including a lot of facts about Michael Jordan and Michael Phelps, all of which are relevant in different contexts. The model cannot represent all these in the embedding of the token Michael (conventional superposition, Elhage et al., 2022); in fact—if SAEs are any indication—the model can only represent about 30-100 features at a time.
So this knowledge must be retrieved dynamically. In the sentence “Michael Jordan plays the sport of”, a model will consider the intersection of Michael AND Jordan AND sport, resulting in basketball. Folk wisdom is that this kind of memorization is implemented by the MLP blocks in a Transformer. And since GPT-2 knows more facts than it has MLP neurons, we arrive at the "compute more AND gates than you have ReLU neurons" problem.
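As a toy illustration (my own sketch, not the construction from Vaintrob et al., and all the numbers are made up): train a one-hidden-layer ReLU MLP to compute more pairwise ANDs of sparse boolean features than it has neurons. Whether and how cleanly this works depends on the sparsity and the width, but it makes the "compute more AND gates than you have ReLU neurons" task concrete:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

n_features = 100   # sparse boolean input features
n_gates = 300      # AND gates to compute -- more than we have neurons
n_neurons = 50     # hidden ReLU neurons
p_active = 0.05    # each feature is on with small probability

# Random pairs of features; gate g should fire iff both of its features are on.
pairs = torch.randint(0, n_features, (n_gates, 2))

def sample_batch(batch_size=1024):
    x = (torch.rand(batch_size, n_features) < p_active).float()
    y = x[:, pairs[:, 0]] * x[:, pairs[:, 1]]  # AND of each pair
    return x, y

model = nn.Sequential(
    nn.Linear(n_features, n_neurons), nn.ReLU(),
    nn.Linear(n_neurons, n_gates),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(5000):
    x, y = sample_batch()
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Check whether the rare "both features on" cases are actually detected,
# rather than the model just predicting zero everywhere.
with torch.no_grad():
    x, y = sample_batch(20_000)
    pred = model(x)
    print("mean prediction where AND=1:", pred[y == 1].mean().item())
    print("mean prediction where AND=0:", pred[y == 0].mean().item())
```

With one dedicated neuron per gate this would be impossible at 50 neurons for 300 gates; to the extent the model still separates the two cases, it is doing that computation in superposition, which is essentially the situation GPT-2 is in with facts vs. MLP neurons.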
I don't like the extensive theming of the frontpage around If Anyone Builds It, Everyone Dies.
The artwork is distracting. I just went on LW to create a new draft, got distracted, clicked on the website, and spent 3 minutes reporting a bug. I expect this is intended to some degree, but it feels a little "out to get you" to me.
Edit: The mobile site looks quite bad too (it just looks like unintended dark mode)
I appreciate this write-up! I felt like the answer to "is the truth of a statement represented somewhere" is obviously no (elaboration below). However, I was surprised by the "tell a story about purpose / natural fact" angle. It seems that a malfunctioning heart is "wrong" in a different, more obvious sense than e.g. a false mathematical statement.
Here are the points that I thought were obvious (and this still makes sense to me):
However, at the end of the post you suggest there might actually be a way to get to a definition of misrepresentation or malfunction based on physical atoms only:
The (correct / intended) representation is that which makes a more natural (less surprising?) story. The heart having the purpose of pumping blood explains its design a lot better than it having the purpose of making a thump-thump sound, even when all you have is a malfunctioning heart.
While I don’t have time to address the Dan Hendrycks & RepE questions, I want to link you to this post from the GDM mech interp team. It gives a good critique of SAEs in terms of whether they enable downstream progress.
Great paper! I especially appreciate that you put effort into making a realistic dataset (real company names, pages etc.), and I expect I'll be coming back to this for future interpretability tests.
I found this statement quite strong:
Did you run experiments on this? Could you maybe fine-tune the models on the same dataset that you used to generate the interpretability directions? It's not obvious (but plausible!) to me that the model-internals method, which is also based on the specific CV dataset, generalizes better than fine-tuning.