Dan Braun


Comments


heh, unfortunately a single SAE has 768 × 60 = 46,080 latents. The residual stream in GPT-2 is 768 dims and SAEs are big. You probably want to test this out on smaller models.

I can't recall the compute costs for that script, sorry. A couple of things to note:

  1. For a single SAE you will need to run it on ~25k latents (46k minus the dead ones) instead of the 200 we did.
  2. You will only need to produce explanations for activations, and won't have to do the second step of asking the model to predict activations given the explanations (rough cost sketch below).
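To give a sense of how the cost scales, here's a back-of-envelope sketch. All the per-latent token counts and prices below are placeholders made up for illustration, not figures from our run:

```python
# Rough cost estimate for explanation-only autointerp on one SAE.
# The per-latent token count and price are made-up placeholders.
n_latents = 25_000         # ~46k latents minus the dead ones
tokens_per_latent = 2_000  # activating examples in the prompt + the explanation (a guess)
usd_per_1m_tokens = 5.0    # depends entirely on the explainer model you use

total_tokens = n_latents * tokens_per_latent
cost_usd = total_tokens / 1e6 * usd_per_1m_tokens
print(f"~{total_tokens / 1e6:.0f}M tokens, ~${cost_usd:,.0f} per SAE")
```

The point is just that the bill scales linearly with the number of live latents, so it's roughly 125x our 200-latent sample.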

It's a fun idea. Though a serious issue is that your external LoRA weights are going to be very large because their input and output will need to be the same size as your SAE dictionary, which could be 10-100x (or more, nobody knows) the residual stream size. So this could be a very expensive setup to finetune.
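To make the size concern concrete, here's a minimal sketch comparing parameter counts (the dims are hypothetical: a GPT-2-sized residual stream and a 60x SAE dictionary, with a rank-16 adapter):

```python
import torch
import torch.nn as nn

d_model = 768         # GPT-2 residual stream width
d_sae = 60 * d_model  # hypothetical SAE dictionary size (46,080)
rank = 16

class LoRA(nn.Module):
    """Low-rank adapter: x -> x + B(A(x))."""
    def __init__(self, d_in: int, d_out: int, r: int):
        super().__init__()
        self.A = nn.Linear(d_in, r, bias=False)
        self.B = nn.Linear(r, d_out, bias=False)
        nn.init.zeros_(self.B.weight)  # start as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.B(self.A(x))

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(LoRA(d_model, d_model, rank)))  # ~25k params on the residual stream
print(n_params(LoRA(d_sae, d_sae, rank)))      # ~1.5M params in the SAE latent space
```

Even at the same rank the adapter is ~60x larger, and you may well need a higher rank to do anything useful in the much wider space.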

Hey Matthew. We only did autointerp for 200 randomly sampled latents in each dict, rather than the full 60 × 768 = 46080 latents (although half of these die). So our results there wouldn't be of much help for your project unfortunately.

 

Thanks a lot for letting us know about the dead links. Though note you have a "%20" in the second one which shouldn't be there. It works fine without it.

I think the concern here is twofold:

  1. Once a model is deceptive at one point, even if that deception arises stochastically, it may continue its deception deterministically.
  2. We can't rely on future models being as stochastic w.r.t. the things we care about, e.g. scheming behaviour.

Regarding 2, consider the trend towards determinism in the probability that GPT-N will output a grammatically correct sentence. For GPT-1 this probability was low, and it has trended upwards towards near-determinism with newer releases. We're seeing a similar trend for scheming behaviour (though hopefully we can buck this trend with alignment techniques).

I plan to spend more time thinking about AI model security. The main reasons I’m not spending a lot of time on it now are:

  1. I’m excited about the project/agenda we’ve started working on in interpretability, and my team/org more generally, and I think (or at least I hope) that I have a non-trivial positive influence on it.
  2. I haven't thought through what the best things to do would be. Some ideas (takes welcome):
    1. Help create RAND or RAND-style reports like Securing AI Model Weights (I think this report is really great). E.g.
      1. Make forecasts about how much interest from adversaries certain models are likely to get, and then how likely the model is to be stolen/compromised given that level of interest and the developer's level of defense. I expect this to be much more speculative than a typical RAND report. It might also require a bunch of non-public info on both offense and defense capabilities.
      2. (not my idea) Make forecasts about how long a lab would take to implement certain levels of security.
    2. Make demos that convince natsec people that AI is or will be very capable and become a top-priority target.
    3. Improve security at a lab (probably requires becoming a full-time employee).

Thanks for the thoughts. They've made me think that I'm likely underestimating how much Control is needed to get useful work out of AIs that are both capable of and inclined towards scheming. Ideally, this fact would make other actors more likely to implement AI Control schemes with a stolen model that are at least sufficient for containment, and/or make them less likely to steal the model in the first place, though I wouldn’t want to put too much weight on this hope.

> This argument isn't control specific, it applies to any safety scheme with some operating tax or implementation difficulty.[1][2]

Yep, for sure. I’ve changed the title and commented about this at the end.


In which worlds would AI Control (or any other agenda which relies on non-trivial post-training operation) prevent significant harm?

When I bring up the issue of AI model security to people working in AI safety, I’m often met with something of the form “yes, this is a problem. It’s important that people work hard on securing AI models. But it doesn’t really affect my work”.

Using AI Control (an area which has recently excited many in the field) as an example, I lay out an argument for why, after considering the realities of our cyber security situation, it might be a less effective agenda than one might think.

  1. AI Control concerns itself with models that intentionally try to subvert their developers.
  2. These models are likely to be very generally capable and, without countermeasures, capable of causing significant harm.
  3. Leading cyber-capable institutions would likely expend significant resources and political capital to steal these models or steal enough insights to reproduce such models.
  4. If the weights or insights are stolen, work on AI control will not prevent these models from causing significant harm.
  5. Current AI developers are not on track to be able to defend against high-priority operations from leading cyber-capable institutions in the coming years.
  6. Therefore, AI control will only be useful in the coming years under one (or more) of these conditions:
    1. Models that scheme are unlikely to be generally capable/dangerous enough to be a high-priority target for leading cyber-capable institutions.
    2. Models that scheme are only developed by actors that can thwart high-priority operations from leading cyber-capable institutions (which precludes current AI developers for at least several years).
    3. AI Control won’t be directly useful in the coming years but it will be indirectly useful to progress the field for when models are developed by actors capable of thwarting top cyber operations.
    4. Even if the model was stolen and caused significant harm, there would still be less harm overall than if the model undergoing AI control measures also broke out. Of course, this argument is much weaker for models capable of existential harm.
    5. Actors that steal the model would also implement strong-enough AI control measures and would not misuse the AI to cause significant harm.

There are of course other arguments against working on AI control. E.g. it may encourage the development and use of models that are capable of causing significant harm. This is an issue if the AI control methods fail or if the model is stolen. So one must be willing to eat this cost or argue that it’s not a large cost when advocating for AI Control work.

This isn’t to say that AI Control isn’t a promising agenda; I just think people need to carefully consider the cases in which their agenda falls down for reasons that aren't technical arguments about the agenda itself.

I’m also interested to hear takes from those excited by AI Control on which of the conditions listed in #6 above they expect to hold (or to otherwise poke holes in the argument).

EDIT (thanks Zach and Ryan for bringing this up): I didn't want to imply that AI Control is unique here, this argument can be levelled at any agenda which relies on something like a raw model + non-trivial operation effort. E.g. a scheme which relies on interpretability or black box methods for monitoring or scalable oversight.

They are indeed all hook_resid_pre. The code you're looking at just lists the positions at which we want to view the reconstruction error during evaluation. In particular, we want to view the reconstruction error at hook_resid_post of every layer, including the final layer (which you can't get from hook_resid_pre).
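In case it helps, here's a small TransformerLens snippet illustrating that (hook names assume the standard TransformerLens convention): resid_post of layer l coincides with resid_pre of layer l+1, but the final layer's output has no corresponding resid_pre, so you have to read it from hook_resid_post.

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
_, cache = model.run_with_cache("The quick brown fox")

# resid_post of layer l is the same tensor as resid_pre of layer l+1 ...
assert torch.allclose(
    cache["blocks.3.hook_resid_post"], cache["blocks.4.hook_resid_pre"]
)

# ... but the final layer's output has no later resid_pre,
# so it has to be read from hook_resid_post directly.
final_resid = cache[f"blocks.{model.cfg.n_layers - 1}.hook_resid_post"]
print(final_resid.shape)  # [batch, seq, d_model]
```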

Here's a wandb report that includes plots for the KL divergence. e2e+downstream indeed performs better for layer 2, so it's possible that intermediate losses help training a little. But I wouldn't be surprised if better hyperparams eliminated this difference; we put more effort into optimising the SAE_local hyperparams than the SAE_e2e and SAE_e2e+ds hyperparams.
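For anyone who wants to reproduce a KL number like the ones in that report, here's a rough sketch of the kind of metric being plotted: the KL divergence between the original model's next-token distribution and the distribution you get after splicing an SAE's reconstruction back into the residual stream. This isn't our actual evaluation code; the layer, the hook, and the placeholder (untrained) SAE are all illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

layer = 6  # illustrative choice of layer
hook_name = f"blocks.{layer}.hook_resid_pre"

# Stand-in for a trained SAE: any module mapping resid -> reconstruction.
d_model = model.cfg.d_model
sae = nn.Sequential(
    nn.Linear(d_model, 60 * d_model), nn.ReLU(), nn.Linear(60 * d_model, d_model)
)

def splice_in_sae(resid, hook):
    # Replace the residual stream with the SAE's reconstruction of it.
    return sae(resid)

with torch.no_grad():
    orig_logits = model(tokens)
    patched_logits = model.run_with_hooks(
        tokens, fwd_hooks=[(hook_name, splice_in_sae)]
    )

# KL(original || patched), summed over the vocab, averaged over positions.
kl = F.kl_div(
    patched_logits.log_softmax(-1),
    orig_logits.log_softmax(-1),
    log_target=True,
    reduction="none",
).sum(-1).mean()
print(kl.item())
```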


Very well articulated. I did a solid amount of head nodding while reading this.

As you appear to be, I'm also becoming concerned about the field trying to “cash in” too hard, too early, on our existing methods and theories, which we know have potentially significant flaws. I don’t doubt that progress can be made by pursuing the current best methods and seeing where they succeed and fail, and I’m very glad that a good portion of the field is doing this. But looking around, I don’t see enough people searching for new fundamental theories or methods that better explain how these networks actually do stuff. Too many eggs are being put in the same basket.

I don't think this is as hard a problem as the ones you find in Physics or Maths. We just need to better incentivise people to have a crack at it, e.g. by starting more varied teams at big labs and by funding people/orgs to pursue non-mainline agendas.

Thanks for the prediction. Perhaps I'm underestimating the amount of shared information between in-context tokens in real models. Thinking more about it, as models grow, I expect the ratio of contextual information shared across tokens in the same context to more token-specific information (like part of speech) to increase. Obviously a bigram-only model doesn't care at all about the previous context. You could probably get a decent measure of this just by comparing cosine similarities of activations within a context to activations from other contexts. If true, this would mean that as models scale up, you'd get a bigger efficiency hit if you didn't shuffle when you could have (assuming fixed batch size).
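Here's a minimal sketch of the measurement I have in mind (the shapes and the random stand-in data are just illustrative; `acts` would be residual-stream activations you've already collected):

```python
import torch
import torch.nn.functional as F

# acts: activations with shape [n_contexts, seq_len, d_model].
# Random data stands in for activations collected from the model.
n_contexts, seq_len, d_model = 64, 128, 768
acts = torch.randn(n_contexts, seq_len, d_model)

unit = F.normalize(acts, dim=-1)

# Mean cosine similarity between distinct token positions in the same context.
within = torch.einsum("csd,ctd->cst", unit, unit)
off_diag = ~torch.eye(seq_len, dtype=torch.bool)
within_mean = within[:, off_diag].mean()

# Mean cosine similarity between tokens from different contexts
# (each context is compared against the next context in the batch).
across = torch.einsum("csd,ctd->cst", unit, unit.roll(1, dims=0))
across_mean = across.mean()

print(f"within-context: {within_mean.item():.3f}  across-context: {across_mean.item():.3f}")
```

If the within-context mean grows relative to the across-context mean as models scale, that would support the story above.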
