Dan Braun


Comments


heh, unfortunately a single SAE has 768 × 60 = 46,080 latents. The residual stream in GPT-2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.

I can't recall the compute costs for that script, sorry. A couple of things to note:

  1. For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
  2. You will only need to produce explanations for activations, and won't have to do the second step of asking the model to predict activations given the explanations (rough cost sketch below).
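To give a sense of how the cost scales, here's a back-of-envelope sketch. All the per-latent token counts and prices below are placeholders made up for illustration, not figures from our run:

```python
# Rough cost estimate for explanation-only autointerp on one SAE.
# The per-latent token count and price are made-up placeholders.
n_latents = 25_000         # ~46k latents minus the dead ones
tokens_per_latent = 2_000  # activating examples in the prompt + the explanation (a guess)
usd_per_1m_tokens = 5.0    # depends entirely on the explainer model you use

total_tokens = n_latents * tokens_per_latent
cost_usd = total_tokens / 1e6 * usd_per_1m_tokens
print(f"~{total_tokens / 1e6:.0f}M tokens, ~${cost_usd:,.0f} per SAE")
```

The point is just that the bill scales linearly with the number of live latents, so it's roughly 125x our 200-latent sample.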

It's a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
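To make the size concern concrete, here's a minimal sketch comparing parameter counts (the dims are hypothetical: a GPT-2-sized residual stream and a 60x SAE dictionary, with a rank-16 adapter):

```python
import torch
import torch.nn as nn

d_model = 768         # GPT-2 residual stream width
d_sae = 60 * d_model  # hypothetical SAE dictionary size (46,080)
rank = 16

class LoRA(nn.Module):
    """Low-rank adapter: x -> x + B(A(x))."""
    def __init__(self, d_in: int, d_out: int, r: int):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.B(self.A(x))

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(LoRA(d_model, d_model, rank)))  # ~25k params on the residual stream
print(n_params(LoRA(d_sae, d_sae, rank)))      # ~1.5M params in the SAE latent space
```

Even at the same rank the adapter is ~60x larger, and you may well need a higher rank to do anything useful in the much wider space.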

Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn't be of much help for your project unfortunately.

 

Thanks a lot for letting us know about the dead links. Though note you have a "%20" in the second one which shouldn't be there. It works fine without it.

I think the concern here is twofold:

  1. Once a model is deceptive at one point, even if that deception arises stochastically, it may continue its deception deterministically.
  2. We can't rely on future models being as stochastic w.r.t. the things we care about, e.g. scheming behaviour.

Regarding 2, consider the trend towards determinism in the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this probability was low, and it has trended upwards towards near-determinism with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).

I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:

  1. I’m excited about the project/agenda we’ve started working on in interpretability, and my team/org more generally, and I think (or at least I hope) that I have a non-trivial positive influence on it.
  2. I haven't thought through what the best things to do would be. Some ideas (takes welcome):
    1. Help create RAND or RAND-style reports like Securing AI Model Weights (I think this report is really great). E.g.
      1. Make forecasts about how much interest from adversaries certain models are likely to get, and then how likely the model is to be stolen/compromised given that level of interest and the developer's level of defense. I expect this to be much more speculative than a typical RAND report. It might also require a bunch of non-public info on both offense and defense capabilities.
      2. (not my idea) Make forecasts about how long a lab would take to implement certain levels of security.
    2. Make demos that convince natsec people that AI is or will be very capable and become a top-priority target.
    3. Improve security at a lab (probably requires becoming a full-time employee).

Thanks for the thoughts. They've made me think that I'm likely underestimating how much Control is needed to get useful work out of AIs that are both capable of and inclined towards scheming. Ideally, this fact would make other actors more likely to implement AI Control schemes with a stolen model that are at least sufficient for containment, and/or make them less likely to steal the model in the first place, though I wouldn’t want to put too much weight on this hope.

> This argument isn't control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]

Yep, for sure. I’ve changed the title and commented about this at the end.


In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.

Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why, after considering the realities of our cyber security situation, it might be a less effective agenda than one might think.

  1. AI Control concerns itself with models that intentionally try to subvert their developers.
  2. These models are likely to be very generally capable and, without countermeasures, capable of causing significant harm.
  3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models or steal enough insights to reproduce such models.
  4. If the weights or insights are stolen, work on AI control will not prevent these models from causing significant harm.
  5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
  6. Therefore, AI control will only be useful in the coming years under one (or more) of these conditions:
    1. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
    2. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
    3. AI Control won’t be directly useful in the coming years but it will be indirectly useful to progress the field for when models are developed by actors capable of thwarting top cyber operations.
    4. Even if the model was stolen and caused significant harm, there would still be less harm overall than if the model undergoing AI control measures also broke out. Of course, this argument is much weaker for models capable of existential harm.
    5. Actors that steal the model would also implement strong-enough AI control measures and would not misuse the AI to cause significant harm.

There are of course other arguments against working on AI control. E.g. it may encourage the development and use of models that are capable of causing significant harm. This is an issue if the AI control methods fail or if the model is stolen. So one must be willing to eat this cost or argue that it’s not a large cost when advocating for AI Control work.

This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren't technical arguments about the agenda itself.

I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).

EDIT (thanks Zach and Ryan for bringing this up): I didn't want to imply that AI Control is unique here, this argument can be levelled at any agenda which relies on something like a raw model + non-trivial operation effort. E.g. a scheme which relies on interpretability or black box methods for monitoring or scalable oversight.

They are indeed all hook_resid_pre. The code you're looking at just lists the positions at which we want to view the reconstruction error during evaluation. In particular, we want to view the reconstruction error at hook_resid_post of every layer, including the final layer (which you can't get from hook_resid_pre).
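In case it helps, here's a small TransformerLens snippet illustrating that (hook names assume the standard TransformerLens convention): resid_post of layer l coincides with resid_pre of layer l+1, but the final layer's output has no corresponding resid_pre, so you have to read it from hook_resid_post.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The quick brown fox")

# resid_post of layer l is the same tensor as resid_pre of layer l+1 ...
assert torch.allclose(
    cache["blocks.3.hook_resid_post"], cache["blocks.4.hook_resid_pre"]
)

# ... but the final layer's output has no later resid_pre,
# so it has to be read from hook_resid_post directly.
final_resid = cache[f"blocks.{model.cfg.n_layers - 1}.hook_resid_post"]
print(final_resid.shape)  # [batch, seq, d_model]
```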

Here's a wandb report that includes plots for the KL divergence. e2e+downstream indeed performs better for layer 2, so it's possible that intermediate losses help training a little. But I wouldn't be surprised if better hyperparams eliminated this difference; we put more effort into optimising the SAE_local hyperparams than the SAE_e2e and SAE_e2e+ds hyperparams.
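For anyone who wants to reproduce a KL number like the ones in that report, here's a rough sketch of the kind of metric being plotted: the KL divergence between the original model's next-token distribution and the distribution you get after splicing an SAE's reconstruction back into the residual stream. This isn't our actual evaluation code; the layer, the hook, and the placeholder (untrained) SAE are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

layer = 6  # illustrative choice of layer
hook_name = f"blocks.{layer}.hook_resid_pre"

# Stand-in for a trained SAE: any module mapping resid -> reconstruction.
d_model = model.cfg.d_model
sae = nn.Sequential(
    nn.Linear(d_model, 60 * d_model), nn.ReLU(), nn.Linear(60 * d_model, d_model)
)

def splice_in_sae(resid, hook):
    # Replace the residual stream with the SAE's reconstruction of it.
    return sae(resid)

with torch.no_grad():
    orig_logits = model(tokens)
    patched_logits = model.run_with_hooks(
        tokens, fwd_hooks=[(hook_name, splice_in_sae)]
    )

# KL(original || patched), summed over the vocab, averaged over positions.
kl = F.kl_div(
    patched_logits.log_softmax(-1),
    orig_logits.log_softmax(-1),
    log_target=True,
    reduction="none",
).sum(-1).mean()
print(kl.item())
```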


Very well articulated. I did a solid amount of head nodding while reading this.

As you appear to be, I'm also becoming concerned about the field trying to “cash in” too hard, too early, on our existing methods and theories, which we know have potentially significant flaws. I don’t doubt that progress can be made by pursuing the current best methods and seeing where they succeed and fail, and I’m very glad that a good portion of the field is doing this. But looking around, I don’t see enough people searching for new fundamental theories or methods that better explain how these networks actually do stuff. Too many eggs are being put in the same basket.

I don't think this is as hard a problem as the ones you find in Physics or Maths. We just need to better incentivise people to have a crack at it, e.g. by starting more varied teams at big labs and by funding people/orgs to pursue non-mainline agendas.

Thanks for the prediction. Perhaps I'm underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information shared across tokens in the same context to more token-specific information (like part of speech) to increase. Obviously a bigram-only model doesn't care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within a context to activations from other contexts. If true, this would mean that as models scale up, you'd get a bigger efficiency hit if you didn't shuffle when you could have (assuming fixed batch size).
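Here's a minimal sketch of the measurement I have in mind (the shapes and the random stand-in data are just illustrative; `acts` would be residual-stream activations you've already collected):

```python
import torch
import torch.nn.functional as F

# acts: activations with shape [n_contexts, seq_len, d_model].
# Random data stands in for activations collected from the model.
n_contexts, seq_len, d_model = 64, 128, 768
acts = torch.randn(n_contexts, seq_len, d_model)

unit = F.normalize(acts, dim=-1)

# Mean cosine similarity between distinct token positions in the same context.
within = torch.einsum("csd,ctd->cst", unit, unit)
off_diag = ~torch.eye(seq_len, dtype=torch.bool)
within_mean = within[:, off_diag].mean()

# Mean cosine similarity between tokens from different contexts
# (each context is compared against the next context in the batch).
across = torch.einsum("csd,ctd->cst", unit, unit.roll(1, dims=0))
across_mean = across.mean()

print(f"within-context: {within_mean.item():.3f}  across-context: {across_mean.item():.3f}")
```

If the within-context mean grows relative to the across-context mean as models scale, that would support the story above.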
