Erik Jenner

PhD student in AI safety at CHAI (UC Berkeley)


Empirical mechanistic anomaly detection

Wiki Contributions


I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.

Maybe I misunderstood you then, and tbc I agree that you don't need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)

For your thought experiment, my prediction would depend on the specifics of what this "tactical motive" looks like. For a very narrow motive, I expect the checkmate predictor will just generalize correctly. For a broader motive (like all backrank mates), I'm much less sure. Still seems plausible it would generalize if both predictors are just very simple heads on top of a shared network body. The more computational work is not shared between the heads, the less likely generalization seems.

The results of this experiment would also be on a spectrum from 0% to 100% of correct checkmate-prediction for this tactical motive. But I think it would be fair to say that it hasn't really learned lookahead for 0% or a very low percentage and that's what I would expect.

Note that 0% to 100% accuracy is not the main spectrum I'm thinking of (though I agree it's also relevant). The main spectrum for me is the broadness of the motive (and in this case how much computation the heads share, but that's more specific to this experiment).

I still don't see the crisp boundary you seem to be getting at between "pattern recognition building on general circuits" and what you call "look-ahead." It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:

But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition"-network to predict that Ng6 is not a feasible option.

If the training data had an example of a rook checkmate on h4 being blocked by a bishop to h6, you could imagine many different possibilities:

  • This doesn't generalize to a rook checkmate on h3 being blocked by a bishop (i.e. the network would get that change wrong if it hasn't also explicitly seen it)
  • This generalizes to rook checkmates along the h-file, but doesn't generalize to rook checkmates along other files
  • This generalizes to arbitrary rook checkmates
  • This also generalizes to bishop checkmates being blocked
  • This also generalizes to a rook trapping the opponent queen (instead of the king)
  • ...

(Of course, this generalization question is likely related to the question of whether these different cases share "mechanisms.")

At the extreme end of this spectrum, I imagine a policy whose performance only depends on some simple measure of "difficulty" (like branching factor/depth needed) and which internally relies purely on simple algorithms like tree search without complex heuristics. To me, this seems like an idealized limit point to this spectrum (and not something we'd expect to actually see; for example, humans don't do this either). You might have something different/broader in mind for "look-ahead," but when I think about broader versions of this, they just bleed into what seems like a continuous spectrum.

Thanks for the elaboration, these are good points. I think about the difference between what you call look-ahead vs pattern recognition on a more continuous spectrum. For example, you say:

The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.

You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis and have a king on a8, the b8 square covered, etc. Even more generally, you could learn that with a king on h8, etc., the h7 pawn is "effectively pinned," and so g6 isn't actually protected---this might then generalize to capturing a piece on g6 with some piece other than a knight (thus not giving check). Continuing like this, I think you could basically fill the entire spectrum between very simple pattern recognition and very general algorithms.

From that perspective, I'd guess Leela sits somewhere in the middle of that spectrum. I agree it's likely not implementing "a general algorithm, build on top of relatively easy-to-learn move prediction and position evaluation" in the broadest sense. On the other hand, I think some of our evidence points towards mechanisms that are used for "considering future moves" and that are shared between a broad range of board states (mainly the attention head results, more arguably the probe).

Good point, explicit representations of the objective might not be as crucial for safety applications as my post frames it.

That said, some reasons this might not generalize in a way that enables this kind of application:

  • I think this type of look-ahead/search is especially favored in chess, and it might not be as important in at least some domains in which we'd want to understand the model's cognition.
  • Our results are on a very narrow subset of board states ("tactically complex" ones). We already start with a filtered set of "puzzles" instead of general states, and then use only 2.5% of those. Anecdotally, the mechanisms we found are much less prevalent in random states.
    • I do think there's an argument that these "tactically complex" states are the most interesting ones. But on the other hand, a lot of Leela's playing strength comes from making very good decisions in "normal" states, which accumulate over the course of a game.
  • Chess has an extremely simple "world model" with clearly defined states and actions. And we know exactly what that world model is, so it's easy-ish to look for relevant representations inside the network. I'd expect everything is just much messier for networks using models of the real world.
  • We have ground truth for the "correct" reason for any given move (using chess engines much stronger than the Leela network by itself). And in fact, we try to create an input distribution where we have reason to believe that we know what future line Leela is considering; then we train probes on this dataset (among other techniques). In a realistic scenario, we might not have any examples where we know for sure why the AI took an action.
  • I don't think our understanding of Leela is good enough to enable these kinds of applications. For example, pretend we were trying to figure out whether Leela is really "trying" to win at chess, or whether it's actually pursuing some other objective that happens to correlate pretty well with winning. (This admittedly isn't a perfect analogy for planning a coup.) I don't think our results so far would have told us.
    • I'm reasonably optimistic that we could get there though in the specific case of Leela, with a lot of additional work.

The manner in which these pathological policies  achieve high  is also concerning: most of the time they match the reference policy , but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy , it could be impossible to tell whether  is Goodharting or identical to the base policy.

I'm confused; to learn this policy , some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they're very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.

One way we wouldn't see the trajectories is if the model becomes aligned with "maximize whatever my reward signal is," figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they're too rare to occur even once during training under the optimal policy.) But that's a much more specific and speculative story.

I haven't thought much about how this affects the overall takeaways but I'd guess that similar things apply to heavy-tailed rewards in general (i.e. if they're rare but big enough to still have an important effect, we can probably catch them pretty easily---though how much that helps will of course depend on your threat model for what these errors  are).

I don't know the answer to your actual question, but I'll note there are slightly fewer mech interp mentors than mentors listed in the "AI interpretability" area (though all of them are at least doing "model internals"). I'd say Stephen Casper and I aren't focused on interpretability in any narrow sense, and Nandi Schoots' projects also sound closer to science of deep learning than mech interp. Assuming we count everyone else, that leaves 11 out of 39 mentors, which is slightly less than ~8 out of 23 from the previous cohort (though maybe not by much).

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.

mechanistic anomaly detection is an approach to ELK

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to MTD and then using the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.

I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.

Some possible cruxes, or reasons the post basically didn't move my view on that:

  • Most importantly, I think the research published by people working on x-risk tends to overall help safety/alignment more than capabilities.
    • I suspect the main disagreement might be what kind of research is needed to make AI go well, and whether the research currently happening helps.
    • Probably less importantly, I disagree a bit about how helpful that research likely is for advancing capabilities. In particular, I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs (part of this is that my impression is capabilities aren't mainly bottlenecked by ideas, though of course sufficiently good ideas would help).
    • It's getting harder to draw the boundary since people use "safety" or "alignment" for a lot of things now. So, to be clear, I'm talking about research published by people who think there are catastrophic risks from AI and care a lot about preventing those, it seems like that's your target audience.
  • Secondarily, longer timelines are only helpful if useful things are happening, and I think if everyone working on x-risk stopped publishing their research, way fewer useful things would happen on the research side. Maybe the plan is to mostly use the additional time for policy interventions? I think that's also complicated though (so far, visibly advancing capabilities have been one of the main things making policy progress feasible). Overall, I think more time would help, but it's not clear how much and I'm not even totally sure about the sign (taking into account worries from hardware overhang).
  • I think there are more structural downsides to not publishing anything. E.g. that makes it much harder to get academia on board (and getting academia on board has been pretty important for policy as far as I can tell, and I think getting them even more on board would be pretty good). Not sure this is an actual crux though, if I thought the research that's happening wasn't helpful enough, this point would also be weaker.

I think most of these are pretty long-standing disagreements, and I don't think the post really tries to argue its side of them, so my guess is it's not going to convince the main people it would need to convince (who are currently publishing prosaic safety/alignment research). That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues. And I agree that some of the "But ..." objections you list are pretty weak.

We have been able to scale to 79% accuracy on a balanced dataset of n119 and non-n119, with networks each less than three convolution layers and less than 1000 neurons, compared to pure deep-learning which does 92% on 1000 parameters and three convolution layers

Is the "1000 parameters" a typo, should it be "1000 neurons"? Otherwise, this would be a strange comparison (since 1000 parameters is a much smaller network than 1000 neurons)

And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.

Load More