In earlier iterations we tried ablating parameter components one-by-one to calculate attributions and didn't notice much of a difference (this was mostly on the hand-coded gated model in Appendix B). But yeah, we agree that it's likely pure gradients won't suffice when scaling up or when using different architectures. If/when this happens, we plan to either use integrated gradients or, more likely, try using a trained mask for the attributions.
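To make the difference between these attribution options concrete, here is a minimal toy sketch. It is not our actual setup: the function `f`, the three scalar "components", and the zero baseline are all made up for illustration, and the gradients are finite-difference estimates. It compares gradient-times-value attributions, one-by-one ablation, and integrated gradients on a function with a saturating nonlinearity.

```python
import math

def f(w):
    # Hypothetical toy "model": three scalar components, one saturating term.
    return math.tanh(w[0] * w[1]) + w[2] ** 2

def grad(fn, w, eps=1e-6):
    # Central finite-difference estimate of the gradient of fn at w.
    g = []
    for i in range(len(w)):
        hi = list(w)
        lo = list(w)
        hi[i] += eps
        lo[i] -= eps
        g.append((fn(hi) - fn(lo)) / (2 * eps))
    return g

def gradient_attr(fn, w):
    # Gradient x value: first-order estimate of the effect of zeroing component i.
    return [gi * wi for gi, wi in zip(grad(fn, w), w)]

def ablation_attr(fn, w):
    # Exact change in output when component i alone is set to zero.
    base = fn(w)
    out = []
    for i in range(len(w)):
        ablated = list(w)
        ablated[i] = 0.0
        out.append(base - fn(ablated))
    return out

def integrated_gradients(fn, w, steps=200):
    # Midpoint Riemann sum of the gradient along the straight path from 0 to w.
    total = [0.0] * len(w)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        g = grad(fn, [alpha * wi for wi in w])
        total = [t + gi for t, gi in zip(total, g)]
    return [wi * t / steps for wi, t in zip(w, total)]

w = [1.5, 2.0, 0.5]
grad_a = gradient_attr(f, w)
abl_a = ablation_attr(f, w)
ig_a = integrated_gradients(f, w)
print("gradient x value:", grad_a)
print("ablation:        ", abl_a)
print("integrated grads:", ig_a)
```

In this toy case the `tanh` is saturated at `w`, so the pure-gradient attribution for the first two components is close to zero even though ablating either one changes the output a lot. Integrated gradients at least sums to `f(w) - f(0)`, which is the kind of property that makes it a plausible fallback.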
heh, unfortunately a single SAE is 768 * 60. The residual stream in GPT2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.
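For a rough sense of scale, here is a back-of-the-envelope count. It assumes a standard SAE layout (one encoder matrix, one decoder matrix, and a bias on each side), which may not match every implementation; `sae_param_count` is just a name picked for this sketch.

```python
def sae_param_count(d_model, expansion, with_biases=True):
    # Dictionary size is the residual width times the expansion factor.
    n_latents = d_model * expansion
    # Encoder (d_model x n_latents) plus decoder (n_latents x d_model).
    weights = 2 * d_model * n_latents
    # Assumed: one encoder bias per latent and one decoder bias per residual dim.
    biases = (n_latents + d_model) if with_biases else 0
    return n_latents, weights + biases

n_latents, n_params = sae_param_count(d_model=768, expansion=60)
print(f"{n_latents} latents, {n_params:,} parameters per SAE")
```

Under those assumptions that is roughly 70M parameters for one SAE at one layer, which is why a smaller model is the easier place to start.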
I can't recall the compute costs for that script, sorry. A couple of things to note:
It's a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
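As a rough illustration of the size issue, here is a sketch of how the adapter grows. It assumes the usual LoRA factorisation into two low-rank matrices; the rank of 8 is an arbitrary choice, and the 768 and 46080 figures are the GPT2 residual and dictionary sizes borrowed as an example rather than anything specific to your setup.

```python
def lora_param_count(d_in, d_out, rank):
    # Standard LoRA factorisation: A is (rank x d_in), B is (d_out x rank).
    return rank * (d_in + d_out)

rank = 8  # hypothetical rank, chosen only for illustration
resid_adapter = lora_param_count(768, 768, rank)     # adapter on the residual stream
dict_adapter = lora_param_count(46080, 46080, rank)  # adapter on SAE latents
print(resid_adapter, dict_adapter, dict_adapter / resid_adapter)
```

The adapter size scales linearly with the input and output width, so at a fixed rank the cost ratio is just the dictionary-to-residual ratio. Any optimiser state and activations at dictionary width scale up the same way.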
Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn't be of much help for your project unfortunately.
Thanks a lot for letting us know about the dead links. Though note you have a "%20" in the second one which shouldn't be there. It works fine without it.
I think the concern here is twofold:
Regarding 2, consider the trend towards determinism we see for the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this was low, and it has trended upwards towards determinism with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).
I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:
Thanks for the thoughts. They've made me think that I'm likely underestimating how much Control is needed to get useful work out of AIs that are both capable of and inclined towards scheming. Ideally, this fact would increase the likelihood of other actors implementing AI Control schemes with the stolen model that are at least sufficient for containment and/or make them less likely to steal the model, though I wouldn’t want to put too much weight on this hope.
>This argument isn't control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]
Yep, for sure. I’ve changed the title and commented about this at the end.
In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?
When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.
Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why it might not be as effective an agenda as one might think after considering the realities of our cyber security situation.
There are of course other arguments against working on AI control. E.g. it may encourage the development and use of models that are capable of causing significant harm. This is an issue if the AI control methods fail or if the model is stolen. So one must be willing to eat this cost or argue that it’s not a large cost when advocating for AI Control work.
This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren’t technical arguments about the agenda itself.
I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).
EDIT (thanks Zach and Ryan for bringing this up): I didn't want to imply that AI Control is unique here. This argument can be levelled at any agenda which relies on something like a raw model + non-trivial operation effort, e.g. a scheme which relies on interpretability or black box methods for monitoring or scalable oversight.
They are indeed all hook_resid_pre. The code you're looking at just lists a set of positions that we are interested in viewing the reconstruction error of during evaluation. In particular, we want to view the reconstruction error at hook_resid_post of every layer, including the final layer (which you can't get from hook_resid_pre).
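To spell out the naming, here is a small sketch. It assumes TransformerLens-style hook names and the usual layout where `hook_resid_post` of one block is the same activation as `hook_resid_pre` of the next; `resid_post_positions` is a hypothetical helper written for this comment, not the code in the repo.

```python
def resid_post_positions(n_layers):
    # Map each layer's hook_resid_post to the hook name that exposes the same activation.
    positions = {}
    for layer in range(n_layers):
        post = f"blocks.{layer}.hook_resid_post"
        if layer + 1 < n_layers:
            # Same tensor as the next block's input.
            positions[post] = f"blocks.{layer + 1}.hook_resid_pre"
        else:
            # No following block, so the final layer needs hook_resid_post itself.
            positions[post] = post
    return positions

positions = resid_post_positions(n_layers=12)
for post, equivalent in positions.items():
    print(post, "->", equivalent)
```

Only the last entry maps to itself, which is the one position that has to be listed as `hook_resid_post` explicitly.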
Here's a wandb report that includes plots for the KL divergence. e2e+downstream indeed performs better for layer 2. So it's possible that intermediate losses might help training a little. But I wouldn't be surprised if better hyperparams eliminated this difference; we put more effort into optimising the SAE_local hyperparams than the SAE_e2e and SAE_e2e+ds hyperparams.
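For anyone unsure what the KL divergence plots measure, here is a minimal sketch of the quantity for a single token position: the divergence between the original model's next-token distribution and the distribution you get with the SAE spliced in. The logit values are made up, and this is plain Python for illustration rather than the code that produced the report.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over a list of logits.
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def kl_divergence(orig_logits, new_logits):
    # KL(p_orig || p_new), with both distributions given as unnormalised logits.
    log_p = log_softmax(orig_logits)
    log_q = log_softmax(new_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

orig = [2.0, 0.5, -1.0, 0.0]     # hypothetical logits from the original model
spliced = [1.8, 0.7, -0.9, 0.1]  # hypothetical logits with the SAE spliced in
print(kl_divergence(orig, spliced))
```

Lower is better: zero means the spliced-in SAE leaves the output distribution unchanged at that position.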
Maybe there will be a point where models actively resist further capability improvements in order to prevent value/goal drift. We’d still be in trouble if this point occurs far in the future, as the models' values will likely have already diverged a lot from humans' by that point, and the models would be very capable. But if this point is near, it could buy us more time.
Some of the assumptions inherent in the idea:
The conjunction of these might not lead to a high probability, but it doesn’t seem dismissible to me.