Wiki Contributions


This is cool! These sparse features should be easily "extractable" by the transformer's key, query, and value weights in a single layer. Therefore, I'm wondering if these weights can somehow make it easier to "discover" the sparse features? 

  1. I don't really think that 1. would be true -- following DAN-style prompts is the minimum complexity solution. You want to act in accordance with the prompt.
  2. Backdoors don't emerge naturally. So if it's computationally infeasible to find an input where the original model and the backdoored model differ, then self-distillation on the backdoored model is going to be the same as self-distillation on the original model. 

The only scenario where I think self-distillation is useful would be if 1) you train a LLM on a dataset, 2) fine-tune it to be deceptive/power-seeking, and 3) self-distill it on the original dataset, then self-distilled model would likely no longer be deceptive/power-seeking. 

I think self-distillation is better than network compression, as it possesses some decently strong theoretical guarantees that you're reducing the complexity of the function. I haven't really seen the same with the latter.

But what research do you think would be valuable, other than the obvious (self-distill a deceptive, power-hungry model to see if the negative qualities go away)? 

As of right now, I don't think that LLMs are trained to be power seeking and deceptive.

Power-seeking is likely if the model is directly maximizing rewards, but LLMs are not quite doing this.

I just wanted to add another angle. Neural networks have a fundamental "simplicity bias", where they learn low frequency components exponentially faster than high frequency components. Thus, self-distillation is likely to be more efficient than training on the original dataset (the function you're learning has fewer high frequency components). This paper formalizes this claim. 

But in practice, what this means is that training GPT-3.5 from scratch is hard but simply copying GPT-3.5 is pretty easy. Stanford was recently able to finetune a pretty bad 7B model to be as good as GPT-3.5 using only 52K examples (generated from GPT-3.5) and $600 of compute. This means that once a GPT is out there, it's fairly easy for malevolent actors to replicate it. And while it's unlikely that the original GPT model, given its strong simplicity bias, is engaging in complicated deceptive behavior, it's highly likely that the malevolent actor has finetuned their model to be deceptive and power-seeking. This creates a perfect storm where malevolent AI can go rogue. I think this is a significant threat, and OpenAI should add some more guardrails to try and prevent this. 

I feel like capping the memory of GPUs would also affect normal folk who just want to train simple models, so it may be less likely to be implemented. It also doesn't really cap the model size, which is the main problem.

But I agree it would be easier to enforce, and certainly, much better than the status quo.

I think you make a lot of great points.

I think some sort of cap is the one of the highest impact things we can do from a safety perspective.  I agree that imposing the cap effectively and getting buy-in from broader society is a challenge, however, these problems are a lot more tractable than AI safety. 

I haven't heard anybody else propose this so I wanted to float it out there.

I'd love some feedback on this if possible, thank you!

Load More