I think this is a fun and (initially) counterintuitive result. I'll try to frame things the way they work in my head; it might help people understand the weirdness.
The task of the residual MLP (labelled CC Model here) is to solve y = x + ReLU(x). Consider the problem from the MLP's perspective. You might think the MLP's problem is just to learn how to compute ReLU(x) for 100 input features with only 50 neurons. But given that we have this random matrix, the task is actually more complicated: not only does the MLP have to compute ReLU(x), it also has to make up for the mess caused by that matrix not being an identity.
But it turns out that making up for this mess actually makes the problem easier!
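For concreteness, here's a minimal sketch of how I picture the setup (the 100-feature and 50-neuron sizes are from the post; the residual width, the fixed random W_E, and reading out with W_E.T are my assumptions, so treat the shapes as illustrative):

```python
import torch
import torch.nn as nn

n_features, d_resid, d_mlp = 100, 1000, 50  # d_resid is an assumed width

# Fixed random embedding -- the "random matrix" that is not an identity.
W_E = torch.randn(n_features, d_resid) / d_resid**0.5

class ResidualMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.w_in = nn.Linear(d_resid, d_mlp)
        self.w_out = nn.Linear(d_mlp, d_resid)

    def forward(self, x):                    # x: (batch, n_features)
        resid = x @ W_E                      # embed features into the residual stream
        resid = resid + self.w_out(torch.relu(self.w_in(resid)))  # residual MLP layer
        return resid @ W_E.T                 # read back out to feature space

# The target the MLP is up against: y = x + ReLU(x), elementwise per feature.
x = torch.randn(8, n_features)
target = x + torch.relu(x)
loss = ((ResidualMLP()(x) - target) ** 2).mean()
```

The point being: the residual path hands the output W_E.T @ W_E @ x rather than x itself, so the MLP's output has to both supply ReLU(x) and correct for that product not being an identity.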
Nice work posting this detailed FAQ. It's a non-standard thing to do, but I can imagine it being very useful for those considering applying. Excited about the team.
Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, as the models' values will likely have already diverged a lot from human values by then, and they would be very capable. But if this point is near, it could buy us more time.
Some of the assumptions inherent in the idea:
The conjunction of these assumptions might not have a high probability, but it doesn’t seem dismissible to me.
In earlier iterations we tried ablating parameter components one-by-one to calculate attributions and didn't notice much of a difference (this was mostly on the hand-coded gated model in Appendix B). But yeah, we agree that it's likely pure gradients won't suffice when scaling up or when using different architectures. If/when this happens we plan to either use integrated gradients or, more likely, try using a trained mask for the attributions.
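In case it's useful, here's a generic sketch of the integrated-gradients version (not our actual implementation; `f` is a stand-in for whatever scalar you're attributing, treated as a function of a per-component scaling mask):

```python
import torch

def ig_attributions(f, n_components, steps=16):
    """Integrated-gradient attributions for a scalar function `f` of a
    per-component mask, along the straight path from the all-zeros mask
    (everything ablated) to the all-ones mask (everything active)."""
    total = torch.zeros(n_components)
    for k in range(1, steps + 1):
        mask = torch.full((n_components,), k / steps, requires_grad=True)
        (grad,) = torch.autograd.grad(f(mask), mask)
        total += grad
    return total / steps  # Riemann approximation of the path integral

# Toy usage: weighted sum plus an interaction between components 0 and 1.
w = torch.tensor([1.0, 2.0, 3.0])
attrs = ig_attributions(lambda m: (w * m).sum() + m[0] * m[1], n_components=3)
# attrs sums to f(ones) - f(zeros), which a single-point gradient doesn't guarantee.
```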
heh, unfortunately a single SAE has 768 * 60 latents. The residual stream in GPT-2 is 768 dims, and SAEs are big. You probably want to test this out on smaller models.
I can't recall the compute costs for that script, sorry. A couple of things to note:
It's a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
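Rough back-of-the-envelope, using GPT-2 small's 768-dim residual stream and an assumed 60x SAE expansion factor (the rank and the 60x are just illustrative):

```python
d_model = 768                  # GPT-2 small residual stream width
expansion = 60                 # assumed SAE expansion factor
d_dict = d_model * expansion   # 46080 latents
r = 8                          # arbitrary LoRA rank

params_resid_lora = 2 * d_model * r  # A: (d_model, r) and B: (r, d_model)
params_dict_lora = 2 * d_dict * r    # same shapes, but in SAE latent space

print(params_resid_lora)                      # 12288
print(params_dict_lora)                       # 737280
print(params_dict_lora // params_resid_lora)  # 60x more parameters per adapter
```

So the adapter size scales linearly with whatever the expansion factor turns out to be.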
Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn't be of much help for your project unfortunately.
Thanks a lot for letting us know about the dead links. Though note you have a "%20" in the second one which shouldn't be there. It works fine without it.
I think the concern here is twofold:
Regarding 2, consider the trend towards determinism we see for the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this probability was low, and it has trended upwards towards determinism with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).
I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:
Thanks for the thoughts. They've made me think that I'm likely underestimating how much Control is needed to get useful work out of AIs that are capable of and inclined towards scheming. Ideally, this fact would make other actors more likely to implement AI Control schemes with the stolen model that are at least sufficient for containment, and/or make them less likely to steal the model in the first place, though I wouldn’t want to put too much weight on this hope.
>This argument isn't control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
Yep, for sure. I’ve changed the title and commented about this at the end.
UPDATE
When writing this post, I think I was biased by the specific work I'd been doing for the 6-12 months prior, and generalised that too far.
I think I stand by the claim that tooling alone could speed up the research I was working on at the time by 3-5x. But even on the same agenda, the work is now far less amenable to major speedups from tooling. Work on the agenda is now far less "implement several minor algorithmic variants, run hyperparameter sweeps which take < 1 hour, evaluate a set of somewhat concrete metrics, repeat", and more "think deeply about which variants make the most sense, run jobs/sweeps which take > 3 hours, evaluate murkier metrics".
The main change was switching our experiments from toy models to real models, which massively slowed the iteration loop due to increased training times and less clear evaluations.
For the current state of the work, I think 6-12 months of tooling progress might give a 1.5-2x speedup over the baseline from 1 year ago.
I still believe that I underestimated safety research speedups overall, but not by as much as I thought 2 months ago.