Josh Engels's Shortform

Josh Engels


I really liked @Sam Marks's recent post on downstream applications as validation for interp techniques, and I've been feeling similarly after the (in my opinion) somewhat disappointing downstream performance of SAEs.

Motivated by this, I've written up about 50 weird language model results I found in the literature. I expect some of them to be familiar to most here (e.g. alignment faking, reward hacking) and some to be a bit more obscure (e.g. input space connectivity, fork tokens).

If our current interp techniques can help us understand these phenomena better, that's great! Otherwise I hope that seeing where our current techniques fail might help us develop better techniques.

I'm also interested in taking a wide view of what counts as interp. When trying to understand some weird model behavior, if mech interp techniques aren't as useful as linear probing, or even careful black-box experiments, that seems important to know!

Here's the doc: https://docs.google.com/spreadsheets/d/1yFAawnO9z0DtnRJDhRzDqJRNkCsIK_N3_pr_yCUGouI/edit?gid=0#gid=0

Thanks to @jake_mendel, @Senthooran Rajamanoharan, and @Neel Nanda for the conversations that convinced me to write this up.

Really helpful work, thanks a lot for doing it

Very happy you did this!

Thanks for doing this. I found it really helpful.

I think that most of the time when you need to classify something, you should use an LLM, not a probe.

That being said, there are some situations where I think activation probes are better. To help clarify my thinking, I wrote out the axes on which I currently think that probes are sometimes better than LLMs, and possibly SOTA:

1. Efficiency -> when done on-policy, probes are extremely cheap and fast. Examples include Anthropic's work on efficient misuse probes and our recent Gemini Probing paper.

2. Safety -> for some things like deception, alignment faking, eval awareness, etc., you might not be able to trust the model's output, and probes (or other internals-based techniques) might help you. See Apollo's work on deception probes and Probing and Steering Evaluation Awareness of Language Models.

3. Elicitation -> the knowledge is in the LLM, but it’s hard to figure out the prompt to get the knowledge out exactly as you want. In this case, the probe training data is your way to convey the parameters of what you want. Goodfire's recent paper could be an example of this: https://arxiv.org/abs/2602.10067

4. Calibration -> again, the knowledge is in the LLM, but the model has trouble telling you its internal confidence in the knowledge when prompted. For example, in Cognitive Dissonance: Why Do Language Model Outputs Disagree with Internal Representations of Truthfulness?, they conclude that truth probes are sometimes better than prompting primarily because of better calibration.
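
To make the efficiency and calibration points a bit more concrete, here is a minimal sketch of what I mean by an activation probe: a logistic regression on activation vectors. Everything in it is an assumption for illustration. The "activations" are synthetic Gaussian vectors shifted along a made-up concept direction (a hypothetical stand-in for cached model activations), and the helper names (`make_synthetic_activations`, `train_probe`, `evaluate`) are mine, not taken from any of the papers above. It reports accuracy plus a Brier score as one rough proxy for calibration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, dim, rng):
    """Hypothetical stand-in for cached model activations.

    Each example is unit Gaussian noise shifted along a fixed, made-up
    'concept' direction; the sign of the shift is the label.
    """
    direction = [1.0 if i % 2 == 0 else -1.0 for i in range(dim)]
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        shift = 0.5 if label == 1 else -0.5
        x = [rng.gauss(0.0, 1.0) + shift * d for d in direction]
        data.append((x, label))
    return data


def probe_score(w, b, x):
    # Probe output: one dot product plus a sigmoid per example.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_probe(data, dim, rng, epochs=20, lr=0.1):
    """Fit a logistic-regression probe with plain SGD on log loss."""
    data = list(data)  # copy so shuffling does not mutate the caller's list
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            grad = probe_score(w, b, x) - y  # d(log loss)/d(logit)
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def evaluate(w, b, data):
    """Return (accuracy, Brier score) of the probe on labelled data."""
    correct = 0
    brier = 0.0
    for x, y in data:
        p = probe_score(w, b, x)
        correct += int((p >= 0.5) == (y == 1))
        brier += (p - y) ** 2
    return correct / len(data), brier / len(data)


if __name__ == "__main__":
    rng = random.Random(0)
    train = make_synthetic_activations(400, 16, rng)
    test = make_synthetic_activations(200, 16, rng)
    w, b = train_probe(train, 16, rng)
    acc, brier = evaluate(w, b, test)
    print(f"accuracy={acc:.3f} brier={brier:.3f}")
```

The point of the sketch is just that inference is a single dot product per example on activations the model has already computed, which is where the efficiency argument comes from. In a real setting you would swap the synthetic data for actual activations, and probably use a library implementation with regularization rather than hand-rolled SGD.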