Very useful list, Neel!! Thanks for mentioning AxBench, but unfortunately we don't own the domain you linked to 😅 The actual link is https://github.com/stanfordnlp/axbench
with a bunch of reflexes to eg stop and say “that doesn’t sound right” or “I think I’ve gone wrong, let’s backtrack and try another path”
Shannon Sands says he's found a backtracking vector in R1:
https://x.com/chrisbarber/status/1885047105741611507
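For anyone who wants to poke at this themselves: the thread doesn't spell out the method, but the standard recipe for a vector like this is a difference of means between contrastive sets of activations, then adding that vector back into the residual stream during generation. Everything below - the layer, the scale, and the tiny contrast sets - is a placeholder sketch of that recipe, not Shannon's actual setup.

```python
# Placeholder sketch (NOT the setup from the linked thread): difference-of-means
# "backtracking" steering vector, applied to a small R1 distill.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # small stand-in for full R1
LAYER = 12   # arbitrary middle layer; in practice you'd sweep layers
SCALE = 4.0  # arbitrary steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_residual(texts):
    """Mean residual-stream activation at the output of block LAYER, last token."""
    acts = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER + 1][0, -1])  # index 0 is the embeddings
    return torch.stack(acts).mean(0)

# Tiny illustrative contrast sets - a real attempt would use many CoT snippets.
backtracking = ["Wait, that doesn't seem right. Let me go back and redo that step."]
baseline = ["So the answer is 42, and the reasoning above all checks out."]
steering_vector = mean_residual(backtracking) - mean_residual(baseline)

def hook(module, inputs, output):
    # Add the vector into the residual stream at block LAYER during generation.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * steering_vector.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(hook)
ids = tok("Question: what is 17 * 24?\n<think>\n", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=120)[0]))
handle.remove()
```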
- Improving our current techniques for using LLMs to interpret SAE latents
As far as you're aware, is there any autointerp work that's based on actively steering (boosting/suppressing) the latent to be labeled and generating completions, rather than searching a dataset for activating examples?
Hmm, there is a related thing called "intervention scoring" (https://arxiv.org/abs/2410.13928), but that appears to only score the descriptions produced by the traditional method, not use interventions to generate the descriptions in the first place.
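To make the distinction concrete, here's a rough sketch of what the generation-based version could look like: boost a latent via its decoder direction, sample completions with and without the intervention, and hand both sets to an explainer model. The model, layer, latent index, scale, and the random stand-in decoder below are all assumptions - a real attempt would load a trained SAE (e.g. via sae_lens) instead.

```python
# Placeholder sketch of steering-based autointerp: steer one SAE latent up,
# compare steered vs. unsteered completions, then (final step, not shown) ask an
# explainer LLM what concept the intervention pushed the completions towards.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # placeholder: whichever model your SAE was trained on
LAYER = 6        # layer the SAE reads from (assumption)
LATENT = 1234    # latent to label (placeholder index)
SCALE = 10.0     # steering strength; needs tuning

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

# Stand-in decoder weights, one row per latent; a real SAE's W_dec goes here.
W_dec = torch.randn(16384, model.config.hidden_size)
direction = W_dec[LATENT] / W_dec[LATENT].norm()

def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + SCALE * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

def sample(prompt, steered, n=3):
    handle = model.transformer.h[LAYER].register_forward_hook(steer_hook) if steered else None
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**ids, max_new_tokens=40, do_sample=True, top_p=0.9,
                             num_return_sequences=n, pad_token_id=tok.eos_token_id)
    if handle is not None:
        handle.remove()
    return [tok.decode(o[ids["input_ids"].shape[1]:]) for o in out]

prompt = "The weather today is"
baseline, boosted = sample(prompt, steered=False), sample(prompt, steered=True)
print(baseline, boosted, sep="\n")
```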
Huh, seems to be working for me. What do you see when you click on it?
I've just opened summer MATS applications (where I'll supervise people to write mech interp papers). I'd love to get applications from any readers who are interested! Apply here; applications are due Feb 28.
As part of this, I wrote up a list of research areas I'm currently excited about, with thoughts on promising directions within each, which I thought might be of wider interest, so I've copied it in below:
Understanding thinking models
Eg o1, r1, Gemini Flash Thinking, etc - ie models that produce a really long chain of thought when reasoning through complex problems, and seem to be much more capable as a result. These seem like a big deal, and we understand so little about them! And now that we have small thinking models like the r1-distilled Qwen 1.5B, they seem quite tractable to study (though larger distilled versions of r1 will be better; I doubt you need full r1, though).
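As a sense of how tractable: here's a minimal sketch of loading one of the small distilled models and pulling out a long chain of thought to study - the prompt and sampling settings are arbitrary.

```python
# Minimal sketch: load a small distilled thinking model and extract its chain of
# thought. Prompt and generation settings are arbitrary choices.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

messages = [{"role": "user", "content": "Is 1007 prime? Think it through carefully."}]
ids = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
out = model.generate(ids, max_new_tokens=512, do_sample=True, temperature=0.6)

# The distilled models wrap their reasoning in <think> ... </think>, so the chain
# of thought is easy to slice out for analysis.
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))
```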
Sparse Autoencoders
In previous rounds I was predominantly interested in Sparse Autoencoder projects, but I’m comparatively less excited about SAEs now - I still think they’re cool, and am happy to get SAE applications/supervise SAE projects, but think they’re unlikely to be a silver bullet and expect to diversify my projects a bit more (I’ll hopefully write more on my overall takes soon).
Within SAEs, I’m most excited about:
- Feature absorption
- Whether SAEs learn the "right" concepts
- Whether our interpretations of SAE latents (aka features)[1] are correct
Model diffing
What happens to a model during finetuning? If we have both the original and tuned model, can we somehow take the “diff” between the two to just interpret what changed during finetuning?
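As a very crude starting point (nothing like a full answer), here's a sketch of the two most obvious diffs: which weights moved most, and where the residual streams of the two models diverge on the same prompt. The model pair below is just a placeholder example of a base model and a finetune of it.

```python
# Crude first pass at a model diff, assuming the base and finetuned models share
# an architecture. The hard interpretability work is explaining *why* things moved.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-1.5B"            # placeholder base model
TUNED = "Qwen/Qwen2.5-1.5B-Instruct"  # placeholder finetune of that base

tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE).eval()
tuned = AutoModelForCausalLM.from_pretrained(TUNED).eval()

# 1) Weight-space diff: which parameter matrices moved the most, relative to their size?
with torch.no_grad():
    tuned_params = dict(tuned.named_parameters())
    rel_change = {name: ((tuned_params[name] - p).norm() / p.norm()).item()
                  for name, p in base.named_parameters()}
for name, rel in sorted(rel_change.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{rel:.4f}  {name}")

# 2) Activation-space diff: where do the two residual streams diverge on the same prompt?
ids = tok("The assistant should refuse harmful requests.", return_tensors="pt")
with torch.no_grad():
    h_base = base(**ids, output_hidden_states=True).hidden_states
    h_tuned = tuned(**ids, output_hidden_states=True).hidden_states
for layer, (a, b) in enumerate(zip(h_base, h_tuned)):
    print(f"layer {layer}: mean residual L2 diff {(a - b).norm(dim=-1).mean().item():.3f}")
```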
Understanding sophisticated/safety-relevant behaviour
LLMs are getting good enough that they're starting to directly demonstrate some alignment-relevant behaviours. Most interpretability work tries to advance the field in general by studying arbitrary, often toy, problems, but I'd be very excited to study these phenomena directly!
Being useful
Interpretability is often pretty abstract, pursuing lofty blue-skies goals, and it's hard to tell if your work is total BS or not. I'm excited about projects that take a real task, one that can be defined without ever referencing interpretability, and try to beat non-interp baselines in a fair(ish) fight - if you can do this, it's strong evidence you've learned *something* real.
Investigate fundamental assumptions
There are a lot of assumptions behind common mechanistic interpretability work, both scientific assumptions and theory-of-change assumptions, that in my opinion have insufficient evidence. I'd be keen to gather evidence for and against!
I favour the term latent over feature, because feature also refers to the subtly but importantly different concept of “the interpretable concept”, which an SAE “feature” imperfectly corresponds to, and it’s very confusing for it to mean both. ↩︎