Erik Jenner

PhD student in AI safety at CHAI (UC Berkeley)

Sequences

Empirical mechanistic anomaly detection

Comments

Yeah, I feel like we do still disagree about some conceptual points, but they seem less crisp than I initially thought, and I don't know of experiments where we'd clearly make different predictions. (I expect you could finetune Leela for helpmates faster than training a model from scratch, but I expect most of this would be driven by things closer to pattern recognition than search.)

I think if there is a spectrum from pattern recognition to search algorithm there must be a turning point somewhere: Pattern recognition means storing more and more knowledge to get better. A search algo means that you don't need that much knowledge. So at some point of the training where the NN is pushed along this spectrum much of this stored knowledge should start to be pared away and generalised into an algorithm. This happens for toy tasks during grokking. I think it doesn't happen in Leela.

I don't think I understand your ontology for thinking about this, but I would probably also put Leela below this "turning point" (e.g., I expect most of its parameters are spent on storing knowledge and patterns rather than implementing crisp algorithms).

That said, for me, the natural spectrum is between a literal look-up table and brute-force tree search with no heuristics at all. (Of course, that's not a spectrum I expect to be traversed during training, just a hypothetical spectrum of algorithms.) On that spectrum, I think Leela is clearly far removed from both sides, but I find it pretty difficult to define its place more clearly. In particular, I don't see your turning point there (you start storing less knowledge immediately as you move away from the look-up table).

That's why I've tried to avoid absolute claims about how much Leela is doing pattern recognition vs "reasoning/..." but instead focused on arguing for a particular structure in Leela's cognition: I just don't know what it would mean to place Leela on either one of those sides. But I can see that if you think there's a crisp distinction between these two sides with a turning point in the middle, asking which side Leela is on is much more compelling.

Thanks for running these experiments! My guess is that these puzzles are hard enough that Leela doesn't really "know what's going on" in many of them and gets the first move right in significant part by "luck" (i.e., the first move is heuristically natural and can be found without (even heuristically) knowing why it's actually good). I think your results are mainly reflections of that, rather than Leela generally not having sensibly correlated move and value estimates (though I'm unsure what a case would look like where we'd actually make different predictions about this correlation).

In our dataset, we tried to avoid cases like that by discarding puzzles where even a much weaker network ("LD2") got the first move right, so that Leela getting the first move right was actually evidence it had noticed the non-obvious tactic.

Some predictions based on that:

  1. Running our experiments on your dataset would result in smaller effect sizes than in our paper (in my view, that would be because Leela isn't relying on look-ahead in your puzzles but is in ours, though there could be other explanations)
  2. LD2 would assign non-trivial probability to the correct first move in your dataset (for context, LD2 is pretty weak, and we're only using puzzles where it puts <5% probability on the correct move; this leaves us with a lot of sacrifices and other cases where the first move is non-obvious)
  3. Leela is much less confident on your dataset than on our puzzles (this is a cheap prediction because we specifically filtered our dataset to have Leela assign >50% probability to the correct move)
  4. Leela gets some subsequent moves wrong a decent fraction of the time even in cases where it gets the first move right. Less confidently, there might not be much correlation between getting the first move right and getting later moves right, but I'd need to think about that part more.

You might agree with all of these predictions; they aren't meant to be super strong. If you do, then I'm not sure which predictions we actually disagree about---maybe there's a way to make a dataset where we expect different amounts of correlation between policy and value output, but I'd need to think about that.
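For concreteness, here is a minimal sketch of the dataset filter described above: keep a puzzle only if the weak network (LD2) puts less than 5% probability on the correct first move and Leela puts more than 50% on it. The `Puzzle` record and its field names are hypothetical, the probabilities are assumed to be precomputed, and this is not the actual code from our paper.

```python
from dataclasses import dataclass


@dataclass
class Puzzle:
    # Hypothetical record; the field names are illustrative assumptions.
    puzzle_id: str
    weak_prob_correct: float    # probability LD2 assigns to the correct first move
    strong_prob_correct: float  # probability Leela assigns to the correct first move


def keep_puzzle(p: Puzzle, weak_max: float = 0.05, strong_min: float = 0.5) -> bool:
    """Keep puzzles the weak net misses but the strong net solves confidently."""
    return p.weak_prob_correct < weak_max and p.strong_prob_correct > strong_min


def filter_puzzles(puzzles: list[Puzzle]) -> list[Puzzle]:
    """Return only the puzzles that pass both thresholds."""
    return [p for p in puzzles if keep_puzzle(p)]
```

Predictions 2 and 3 then amount to saying that a dataset without this filter would have many puzzles failing one of the two conditions.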

But I think it can be ruled out that a substantial part of Leela network's prowess in solving chess puzzles or predicting game outcome is due to deliberate calculation. 

FWIW, I think it's quite plausible that only a small part of Leela's strength is due to look-ahead; we're only testing on a pretty narrow distribution of puzzles after all. (Though similarly, I disagree somewhat with "ruling out" given that you also just look at pretty specific puzzles, which I think might just be too hard to be a good example of Leela's strength.)

ETA: If you can share your dataset, I'd be happy to test the predictions above if we disagree about any of them, also happy to make them more concrete if it seems like we might disagree. Though again, I'm not claiming you should disagree with any of them just based on what you've said so far.

Thank you for writing this! I've found it helpful both to get an impression of what some people at Anthropic think and also to think about some things myself. I've collected some of my agreements/disagreements/uncertainties below (mostly ignoring points already raised in other comments).

Subject to potentially very demanding constraints around safety like those in our current and subsequent RSPs, staying close to the frontier is perhaps our top priority in Chapter 1.

If I understand this correctly, the tasks in order of descending priority during Chapter 1 are:

  1. Meet safety constraints for models deployed in this phase
  2. Stay close to the frontier
  3. Do the work needed to prepare for Chapter 2

And the reasoning is that 3. can't really happen without 2.[1] But on the other hand, if 2. happens without 3., that's also bad. And some safety work could probably happen without frontier models (such as some interpretability).

My best guess is that staying close to the frontier will be the correct choice for Anthropic. But if there ends up being a genuine trade-off between staying at the frontier and doing a lot of safety work (for example, if compute could be spent either on a pretraining run or some hypothetical costly safety research, but not both), then I'm much less sure that staying at the frontier should be the higher priority. It might be good to have informal conditions under which Anthropic would deprioritize staying close to the frontier (at least internally and, if possible, publicly).

Largely Solving Alignment Fine-Tuning for Early TAI

I didn't quite understand what this looks like and which threat models it is or isn't meant to address. You say that scheming is a key challenge "to a lesser extent for now," which I took to mean that (a) there are bigger threats than scheming from early TAI, and (b) "largely solving alignment fine-tuning" might not include confidently ruling out scheming. I probably disagree with (a) for loss of control risk (and think that loss of control is already the biggest risk in this period weighted by scale). I'd be curious what you think the main risks in this period are and what "largely solving alignment fine-tuning" means for those. (You mention reward hacking---to me, this seems unlikely to lead to loss of control for early TAI that isn't scheming against us, and I'm curious whether you disagree or think it's important for other reasons.)

the LeCun Test: Imagine another frontier AI developer adopts a copy of our RSP as binding policy and entrusts someone who thinks that AGI safety concerns are mostly bullshit to implement it

This sounds quite ambitious, but I really like it as a guide!

The key challenge here is forecasting which risks and risk factors are important enough to include.

I don't understand why this is crucial. If some risk is plausible enough to be worth seriously thinking about, it's probably important enough to include in an RSP. (And the less important it was, the easier it hopefully is to argue in a safety case that it's not a problem.) Concretely, you mention direct misuse, misalignment, and "indirect contributions via channels like dual-use R&D" as potential risks for ASL-3 and ASL-4. It seems to me that the downside of just including all of them in RSPs is relatively minor, but I might be misunderstanding or missing something. (I get that overly restrictive precautions could be very costly, but including too many tests seems relatively cheap as long as the tests correctly notice when risk is still low.)

Getting Interpretability to the Point of Making Strong Assurances

Major successes in this direction, even if they fall short of our north-star enumerative safety goal [...] would likely form some of the highest-confidence core pieces of a safety case

I'm curious what such safety cases would be for and what they could look like (the "Interpretability Dreams" post seems to talk about enumerative safety rather than safety cases that require less interpretability success). The next section sounds like interpretability would not be a core piece of a safety case for robustness, so I'm not sure what it would be used for instead. Maybe you don't include scheming under robustness? (Or maybe interp would be one of the "highest-confidence core pieces" but not the "primary piece"?)

This work should be opportunistic in responding to places where it looks like a gap in one of our best-guess safety cases can be filled by a small-scale research effort.

I like this perspective; I hadn't seen it put quite that way before!

In addition, we’ll need our evaluations to be legibly appropriate. As soon as we see evidence that a model warrants ASL-N protections, we’ll likely need to convince third parties that it warrants ASL-N protections and that other models like it likely do too.

+1, seems very important!

Supporting Efforts that Build Societal Resilience

I liked this section! Of course, a lot of people work on this for reasons other than AI risk, but I'm not aware of much active work motivated by AI risk---maybe this should be a bigger priority?

The main challenge [for the Alignment Stress-Testing team] will be to stay close enough to our day-to-day execution work to stay grounded without becoming major direct contributors to that work in a way that compromises their ability to assess it.

+1, and ideally, there'd be structures in place to encourage this rather than just having it as a goal (but I don't have great ideas for what these structures should look like).

This work [in Chapter 2] could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align

This seems possible but unclear to me. In both Chapter 1 and 2, we're trying to figure out how to align the next generation of AIs, given access only to the current (less capable) generation. Chapter 2 might still be different if we've already crossed important thresholds (such as being smart enough to potentially scheme) by then. But there could also be new thresholds between Chapter 2 and 3 (such as our inability to evaluate AI actions even with significant effort). So I wouldn't be surprised if things feel fundamentally similar, just at a higher absolute capability level (and thus with more useful AI helpers).

  1. ^ "Our ability to do our safety work depends in large part on our access to frontier technology."

I don't think my argument relies on the existence of a crisp boundary. Just on the existence of a part of the spectrum that clearly is just pattern recognition and not lookahead but still leads to the observations you made.

Maybe I misunderstood you then, and tbc I agree that you don't need a sharp boundary. That said, the rest of your message makes me think we might still be talking past each other a bit. (Feel free to disengage at any point obviously.)

For your thought experiment, my prediction would depend on the specifics of what this "tactical motive" looks like. For a very narrow motive, I expect the checkmate predictor will just generalize correctly. For a broader motive (like all back-rank mates), I'm much less sure. It still seems plausible it would generalize if both predictors are just very simple heads on top of a shared network body. The less computational work is shared between the heads, the less likely generalization seems.

The results of this experiment would also be on a spectrum from 0% to 100% of correct checkmate-prediction for this tactical motive. But I think it would be fair to say that it hasn't really learned lookahead for 0% or a very low percentage and that's what I would expect.

Note that 0% to 100% accuracy is not the main spectrum I'm thinking of (though I agree it's also relevant). The main spectrum for me is the broadness of the motive (and in this case how much computation the heads share, but that's more specific to this experiment).

I still don't see the crisp boundary you seem to be getting at between "pattern recognition building on general circuits" and what you call "look-ahead." It sounds like one key thing for you is generalization to unseen cases, but the continuous spectrum I was gesturing at also seems to apply to that. For example:

But if in the entire training data there was never a case of a piece blocking the checkmate by rook h4, the existence of a circuit that computes the information that the bishop on d2 can drop back to h6 is not going to help the "pattern recognition"-network to predict that Ng6 is not a feasible option.

If the training data had an example of a rook checkmate on h4 being blocked by a bishop to h6, you could imagine many different possibilities:

  • This doesn't generalize to a rook checkmate on h3 being blocked by a bishop (i.e. the network would get that change wrong if it hasn't also explicitly seen it)
  • This generalizes to rook checkmates along the h-file, but doesn't generalize to rook checkmates along other files
  • This generalizes to arbitrary rook checkmates
  • This also generalizes to bishop checkmates being blocked
  • This also generalizes to a rook trapping the opponent queen (instead of the king)
  • ...

(Of course, this generalization question is likely related to the question of whether these different cases share "mechanisms.")

At the extreme end of this spectrum, I imagine a policy whose performance only depends on some simple measure of "difficulty" (like branching factor/depth needed) and which internally relies purely on simple algorithms like tree search without complex heuristics. To me, this seems like an idealized limit point to this spectrum (and not something we'd expect to actually see; for example, humans don't do this either). You might have something different/broader in mind for "look-ahead," but when I think about broader versions of this, they just bleed into what seems like a continuous spectrum.

Thanks for the elaboration, these are good points. I think about the difference between what you call look-ahead vs pattern recognition on a more continuous spectrum. For example, you say:

The network learns that Ng6 is often a likely move when the king is on h8, the queen or bishop takes away the g8 square and there is a rook or queen ready to move to the h-file.

You could imagine learning this fact literally for those specific squares. Or you could imagine generalizing very slightly and using the same learned mechanism if you flip along the vertical axis and have a king on a8, the b8 square covered, etc. Even more generally, you could learn that with a king on h8, etc., the h7 pawn is "effectively pinned," and so g6 isn't actually protected---this might then generalize to capturing a piece on g6 with some piece other than a knight (thus not giving check). Continuing like this, I think you could basically fill the entire spectrum between very simple pattern recognition and very general algorithms.

From that perspective, I'd guess Leela sits somewhere in the middle of that spectrum. I agree it's likely not implementing "a general algorithm, build on top of relatively easy-to-learn move prediction and position evaluation" in the broadest sense. On the other hand, I think some of our evidence points towards mechanisms that are used for "considering future moves" and that are shared between a broad range of board states (mainly the attention head results, more arguably the probe).

Good point, explicit representations of the objective might not be as crucial for safety applications as my post frames it.

That said, some reasons this might not generalize in a way that enables this kind of application:

  • I think this type of look-ahead/search is especially favored in chess, and it might not be as important in at least some domains in which we'd want to understand the model's cognition.
  • Our results are on a very narrow subset of board states ("tactically complex" ones). We already start with a filtered set of "puzzles" instead of general states, and then use only 2.5% of those. Anecdotally, the mechanisms we found are much less prevalent in random states.
    • I do think there's an argument that these "tactically complex" states are the most interesting ones. But on the other hand, a lot of Leela's playing strength comes from making very good decisions in "normal" states, which accumulate over the course of a game.
  • Chess has an extremely simple "world model" with clearly defined states and actions. And we know exactly what that world model is, so it's easy-ish to look for relevant representations inside the network. I'd expect everything is just much messier for networks using models of the real world.
  • We have ground truth for the "correct" reason for any given move (using chess engines much stronger than the Leela network by itself). And in fact, we try to create an input distribution where we have reason to believe that we know what future line Leela is considering; then we train probes on this dataset (among other techniques). In a realistic scenario, we might not have any examples where we know for sure why the AI took an action.
  • I don't think our understanding of Leela is good enough to enable these kinds of applications. For example, pretend we were trying to figure out whether Leela is really "trying" to win at chess, or whether it's actually pursuing some other objective that happens to correlate pretty well with winning. (This admittedly isn't a perfect analogy for planning a coup.) I don't think our results so far would have told us.
    • I'm reasonably optimistic that we could get there though in the specific case of Leela, with a lot of additional work.

The manner in which these pathological policies π achieve high expected reward is also concerning: most of the time they match the reference policy π₀, but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy π, it could be impossible to tell whether π is Goodharting or identical to the base policy.

I'm confused; to learn this policy π, some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they're very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.

One way we wouldn't see the trajectories is if the model becomes aligned with "maximize whatever my reward signal is," figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they're too rare to occur even once during training under the optimal policy.) But that's a much more specific and speculative story.

I haven't thought much about how this affects the overall takeaways, but I'd guess that similar things apply to heavy-tailed rewards in general (i.e. if they're rare but big enough to still have an important effect, we can probably catch them pretty easily---though how much that helps will of course depend on your threat model for what these errors are).
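As a toy illustration of the "catch them as outliers" point, here is a minimal sketch that flags trajectories whose reward is far from the bulk of the distribution, using a robust score based on the median and the median absolute deviation (MAD). The function name and the threshold of 10 are arbitrary assumptions for illustration; in practice the rewards would come from logged training trajectories, and a real pipeline would likely need something more careful.

```python
import statistics


def flag_reward_outliers(rewards, threshold=10.0):
    """Return indices of rewards that are extreme relative to the typical spread.

    Uses |r - median| / MAD as a robust score, so a handful of huge rewards
    cannot hide themselves by inflating the mean or standard deviation.
    """
    med = statistics.median(rewards)
    mad = statistics.median([abs(r - med) for r in rewards])
    if mad == 0:
        # Degenerate case: at least half the rewards equal the median,
        # so flag anything that differs from it at all.
        return [i for i, r in enumerate(rewards) if r != med]
    return [i for i, r in enumerate(rewards) if abs(r - med) / mad > threshold]
```

For example, with rewards like `[1.0, 1.2, 0.9, 1.1, 1.0, 0.8, 1000.0, 1.05]`, only the trajectory with reward 1000 is flagged. This only helps if the extreme trajectories actually occur during training, which is the assumption discussed above.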
 

I don't know the answer to your actual question, but I'll note there are slightly fewer mech interp mentors than mentors listed in the "AI interpretability" area (though all of them are at least doing "model internals"). I'd say Stephen Casper and I aren't focused on interpretability in any narrow sense, and Nandi Schoots' projects also sound closer to science of deep learning than mech interp. Assuming we count everyone else, that leaves 11 out of 39 mentors, which is slightly less than ~8 out of 23 from the previous cohort (though maybe not by much).

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.
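To make the distinction concrete, here is a small sketch of the two idealized supervision settings for binary labels. Both function names, the `error_rate` parameter, and the `is_easy` mask are hypothetical and only meant to illustrate the setup, not any particular paper's protocol.

```python
import random


def weak_to_strong_labels(true_labels, error_rate, seed=0):
    """Weak supervision on the whole distribution.

    Every example gets a label, but each label is flipped
    independently with probability `error_rate`.
    """
    rng = random.Random(seed)
    return [(not y) if rng.random() < error_rate else y for y in true_labels]


def easy_to_hard_labels(true_labels, is_easy):
    """Correct supervision on the easy part only.

    Easy examples keep their true label; hard examples get None (no label).
    """
    return [y if easy else None for y, easy in zip(true_labels, is_easy)]
```

The more realistic budgeted setting would sit somewhere in between, for instance by choosing per example how much labeling effort to spend, with only a noisy estimate of `is_easy`.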

mechanistic anomaly detection is an approach to ELK

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to do MTD (measurement tampering detection) and then use the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation, though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.
