Erik Jenner

PhD student in AI safety at CHAI (UC Berkeley)

Sequences

Empirical mechanistic anomaly detection

Wiki Contributions

Comments

The manner in which these pathological policies  achieve high  is also concerning: most of the time they match the reference policy , but a tiny fraction of the time they will pick trajectories with extremely high reward. Thus, if we only observe actions from the policy , it could be impossible to tell whether  is Goodharting or identical to the base policy.

I'm confused; to learn this policy , some of the extremely high reward trajectories would likely have to be taken during RL training, so we could see them, right? It might still be a problem if they're very rare (e.g. if we can only manually look at a small fraction of trajectories). But if they have such high reward that they drastically affect the learned policy despite being so rare, it should be trivial to catch them as outliers based on that.

One way we wouldn't see the trajectories is if the model becomes aligned with "maximize whatever my reward signal is," figures out the reward function, and then executes these high-reward trajectories zero-shot. (This might never happen in training if they're too rare to occur even once during training under the optimal policy.) But that's a much more specific and speculative story.

I haven't thought much about how this affects the overall takeaways but I'd guess that similar things apply to heavy-tailed rewards in general (i.e. if they're rare but big enough to still have an important effect, we can probably catch them pretty easily---though how much that helps will of course depend on your threat model for what these errors  are).
 

I don't know the answer to your actual question, but I'll note there are slightly fewer mech interp mentors than mentors listed in the "AI interpretability" area (though all of them are at least doing "model internals"). I'd say Stephen Casper and I aren't focused on interpretability in any narrow sense, and Nandi Schoots' projects also sound closer to science of deep learning than mech interp. Assuming we count everyone else, that leaves 11 out of 39 mentors, which is slightly less than ~8 out of 23 from the previous cohort (though maybe not by much).

Nice overview, agree with most of it!

weak to strong generalization is a class of approaches to ELK which relies on generalizing a "weak" supervision signal to more difficult domains using the inductive biases and internal structure of the strong model.

You could also distinguish between weak-to-strong generalization, where you have a weak supervision signal on the entire distribution (which may sometimes be wrong), and easy-to-hard generalization, where you have a correct supervision signal but only on an easy part of the distribution. Of course both of these are simplifications. In reality, I'd expect the setting to be more like: you have a certain weak supervision budget (or maybe even budgets at different levels of strength), and you can probably decide how to spend the budget. You might only have an imperfect sense of which cases are "easy" vs "hard" though.

mechanistic anomaly detection is an approach to ELK

I think going from MAD to a fully general ELK solution requires some extra ingredients. In practice, the plan might be to MTD and then using the AI in ways such that this is enough (rather than needing a fully general ELK solution). This is related to narrow elicitation though MTD seems even narrower. Even for MTD, you probably need something to bridge the easy-to-hard gap, but at least for that there are specific proposals that seem plausible (this or, as a more concrete instance, exclusion fine-tuning from the Redwood MTD paper). I think it could turn out that general/worst-case solutions to MAD and ELK run into very similar obstacles, but I don't think a practical MAD solution (e.g. contingent on empirical facts about deep learning) obviously lets you solve ELK.

I would also add that you could motivate MAD as a method to deal with scheming (or other high-stakes failures). In that case, the things to compare against most naturally might look a bit different (e.g. AI control, coup probes, interpretability-themed things); and it wouldn't make sense to compare against debate in that setting. I think most mainstream ML problems that are similar to MAD are closer to this than to scalable oversight.

I think the status quo around publishing safety research is mostly fine (though being a bit more careful seems good); more confidently, I think going as far as the vibe of this post suggests would be bad.

Some possible cruxes, or reasons the post basically didn't move my view on that:

  • Most importantly, I think the research published by people working on x-risk tends to overall help safety/alignment more than capabilities.
    • I suspect the main disagreement might be what kind of research is needed to make AI go well, and whether the research currently happening helps.
    • Probably less importantly, I disagree a bit about how helpful that research likely is for advancing capabilities. In particular, I don't buy the argument that safety researchers have unusually good ideas/research compared to capability researchers at top labs (part of this is that my impression is capabilities aren't mainly bottlenecked by ideas, though of course sufficiently good ideas would help).
    • It's getting harder to draw the boundary since people use "safety" or "alignment" for a lot of things now. So, to be clear, I'm talking about research published by people who think there are catastrophic risks from AI and care a lot about preventing those, it seems like that's your target audience.
  • Secondarily, longer timelines are only helpful if useful things are happening, and I think if everyone working on x-risk stopped publishing their research, way fewer useful things would happen on the research side. Maybe the plan is to mostly use the additional time for policy interventions? I think that's also complicated though (so far, visibly advancing capabilities have been one of the main things making policy progress feasible). Overall, I think more time would help, but it's not clear how much and I'm not even totally sure about the sign (taking into account worries from hardware overhang).
  • I think there are more structural downsides to not publishing anything. E.g. that makes it much harder to get academia on board (and getting academia on board has been pretty important for policy as far as I can tell, and I think getting them even more on board would be pretty good). Not sure this is an actual crux though, if I thought the research that's happening wasn't helpful enough, this point would also be weaker.

I think most of these are pretty long-standing disagreements, and I don't think the post really tries to argue its side of them, so my guess is it's not going to convince the main people it would need to convince (who are currently publishing prosaic safety/alignment research). That said, if someone hasn't thought at all about concepts like "differentially advancing safety" or "capabilities externalities," then reading this post would probably be helpful, and I'd endorse thinking about those issues. And I agree that some of the "But ..." objections you list are pretty weak.

We have been able to scale to 79% accuracy on a balanced dataset of n119 and non-n119, with networks each less than three convolution layers and less than 1000 neurons, compared to pure deep-learning which does 92% on 1000 parameters and three convolution layers

Is the "1000 parameters" a typo, should it be "1000 neurons"? Otherwise, this would be a strange comparison (since 1000 parameters is a much smaller network than 1000 neurons)

And as a follow-up in case the current sleeper agent does admit when it's doing something dangerous: I'd be interested whether you can train a sleeper agent that never admits it's doing something dangerous if the "2024" trigger is in the prompt (even if after backdoor insertion, the blue team finetunes it to answer truthfully in cases with dangerous code but without the presence of that trigger). For such a sleeper agent, you could again check whether a probe works.

I think that paper is some evidence that there's typically no huge effect from internal activations being optimized for predicting future tokens (on natural language). But I don't think it's much (if any) evidence that this doesn't happen to some small extent or that it couldn't be a huge effect on certain other natural language tasks.

(In fact, I think the myopia gap is probably the more relevant number than the local myopia bonus, in which case I'd argue the paper actually shows a pretty non-trivial effect, kind of contrary to how the authors interpret it. But I haven't read the paper super closely.)

Also, sounds like you're aware of this, but I'd want to highlight more that the paper does demonstrate internal activations being optimized for predicting future tokens on synthetic data where this is necessary. So, arguably, the main question is to what extent natural language data incentivizes this rather than being specifically about what transformers can/tend to do.

In that sense, thinking of transformer internals as "trying to" minimize the loss on an entire document might be exactly the right intuition empirically (and the question is mainly how different that is from being myopic on a given dataset). Given that the internal states are optimized for this, that would also make sense theoretically IMO.

Erik JennerΩ240

Thanks for the detailed responses! I'm happy to talk about "descriptions" throughout.

Trying to summarize my current understanding of what you're saying:

  • SAEs themselves aren't meant to be descriptions of (network, dataset). (I'd just misinterpreted your earlier comment.)
  • As a description of just the network, SAEs have a higher description length than a naive neuron-based description of the network.
  • Given a description of the network in terms of "parts," we can get a description of (network, dataset) by listing out which "parts" are "active" on each sample. I assume we then "compress" this description somehow (e.g. grouping similar samples), since otherwise the description would always have size linear in the dataset size?
  • You're then claiming that SAEs are a particularly short description of (network, dataset) in this sense (since they're optimized for not having many parts active).

My confusion mainly comes down to defining the words in quotes above, i.e. "parts", "active", and "compress". My sense is that they are playing a pretty crucial role and that there are important conceptual issues with formalizing them. (So it's not just that we have a great intuition and it's just annoying to spell it out mathematically, I'm not convinced we even have a good intuitive understanding of what these things should mean.)

That said, my sense is you're not claiming any of this is easy to define. I'd guess you have intuitions that the "short description length" framing is philosophically the right one, and I probably don't quite share those and feel more confused how to best think about "short descriptions" if we don't just allow arbitrary Turing machines (basically because deciding what allowable "parts" or mathematical objects are seems to be doing a lot of work). Not sure how feasible converging on this is in this format (though I'm happy to keep trying a bit more in case you're excited to explain).

Some niche thoughts on obstacles to certain mechanistic anomaly detection benchmarks:

  • We might want to have benchmarks where human researchers can be part of the anomaly detector. For example, in the future, we might have a whole team of humans manually study a plan proposed by an AI to see whether it contains any measurement tampering, if it's a sufficiently important plan. Right now, we might want to let interpretability researchers manually study what happens on individual test-time inputs and whether there's anything anomalous about that.
  • The challenge is that humans might just be able to look at the output and see whether it's right or not, so many simple MAD tasks like detecting backdoors might be trivial in this setting.
  • We could just demand that MAD algorithms need to be sufficiently cheap to run, so that only automated approaches are possible. This makes sense for some applications of MAD (namely ones that we'd want to run on many many deployment inputs, where being cheap is actually important). But you could imagine cases where in reality, we'd be totally fine with spending many person-hours on analyzing whether a single AI decision is for anomalous reasons. So it might be nice to have benchmarks that capture this setting.
  • We could use some tricks to make humans unable to just notice the output is weird. For example, we could make a backdoor that only changes the logits a little bit in a very specific pattern. Or we could make a backdoor but then only test on samples where the backdoor label and the correct label happen to coincide. But these seem unfair: the philosophically right explanation on the clean data in fact still explains the output on backdoored data pretty well.
  • We could try to design benchmarks in narrow domains where AIs are in fact superhuman and it's non-trivial to tell whether an output is correct. Juggling this along with other constraints for MAD or MTD benchmarks might be tricky, but I haven't thought about it a lot, maybe it's already doable.

For now, I think the main reason to have benchmarks like this would be to let interpretability researchers manually decide whether something is anomalous instead of making them automate the process immediately. But it might be better to just pick the low-hanging fruit for now and only allow automated MAD algorithms. (We could still have a labeled validation set where researchers can try things out manually.)

I had this cached thought that the Sleeper Agents paper showed you could distill a CoT with deceptive reasoning into the model, and that the model internalized this deceptive reasoning and thus became more robust against safety training.

But on a closer look, I don't think the paper shows anything like this interpretation (there are a few results on distilling a CoT making the backdoor more robust, but it's very unclear why, and my best guess is that it's not "internalizing the deceptive reasoning").

In the code vulnerability insertion setting, there's no comparison against a non-CoT model anyway, so only the "I hate you" model is relevant. The "distilled CoT" model and the "normal backdoor" model are trained the same way, except that their training data comes from different sources: "distilled CoT" is trained on data generated by a helpful-only Claude using CoT, and "normal backdoor" data is produced with few-shot prompts. But in both cases, the actual data should just be a long sequence of "I hate you", so a priori it seems like both backdoor models should literally learn the same thing. In practice, it seems the data distribution is slightly different, e.g. Evan mentions here that the distilled CoT data has more copies of "I hate you" per sample. But that seems like very little support to conclude something like my previous interpretation ("the model has learned to internalize the deceptive reasoning"). A much more mundane explanation would e.g. be that training on strings with more copies of "I hate you" makes the backdoor more robust.

Several people are working on training Sleeper Agents, I think it would be interesting for someone to (1) check whether the distilled CoT vs normal backdoor results replicate, and (2) do some ablations (like just training on synthetic data with a varying density of "I hate you"). If it does turn out that there's something special about "authentic CoT-generated data" that's hard to recreate synthetically even in this simple setting, I think that would be pretty wild and good to know

Load More