Current “unlearning” methods only suppress capabilities instead of truly unlearning the capabilities. But if you distill an unlearned model into a randomly initialized model, the resulting network is actually robust to relearning. We show why this works, how well it works, and how to trade off compute for robustness. Unlearn-and-Distill...
Overview: By training neural networks with selective modularity, gradient routing enables new approaches to core problems in AI safety. This agenda identifies related research directions that might enable safer development of transformative AI. Introduction Soon, the world may see rapid increases in AI capabilities resulting from AI research automation, and...
Over the past few months, I helped develop Gradient Routing, a non loss-based method to shape the internals of neural networks. After my team developed it, I realized that I could use the method to do something that I have long wanted to do: make an autoencoder with an extremely...
We present gradient routing, a way of controlling where learning happens in neural networks. Gradient routing applies masks to limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. We think gradient routing has the potential...
Produced as part of the MATS Summer 2024 program, under the mentorship of Alex Turner (TurnTrout). A few weeks ago, I stumbled across a very weird fact: it is possible to find multiple steering vectors in a language model that activate very similar behaviors while all being orthogonal. This was...
Do you ever go to a lecture, follow it thinking it makes total sense, then look back at your notes later and realize it makes no sense? This used to happen to me, but I’ve learned how to use spaced repetition to fully avoid this if I want. I’m going...
I did Quiz Bowl throughout my time in high school, and looking back on it, it was a pretty positive thing to do! In this blog post, I want to make a list of some of the life lessons I have taken from Quiz Bowl. If you know about Quiz...