Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
We present gradient routing, a way of controlling where learning happens in neural networks. Gradient routing applies masks to limit the flow of gradients during backpropagation. By supplying different masks for different data points, the user can induce specialized subcomponents within a model. We think gradient routing has the potential to train safer AI systems, for example, by making them more transparent, or by enabling the removal or monitoring of sensitive capabilities.

In this post, we:

* Show how to implement gradient routing.
* Briefly state the main results from our paper, on...
  * Controlling the latent space learned by an MNIST autoencoder so that different subspaces specialize to different digits;
  * Localizing computation in language models: (a) inducing axis-aligned features and (b) demonstrating that information can be localized then removed by ablation, even when data is imperfectly labeled; and
  * Scaling oversight to efficiently train a reinforcement learning policy even with severely limited ability to score its behavior.
* Discuss the results. A key takeaway: gradient routing is qualitatively different from behavioral (i.e., purely loss-based) training methods, granting it unique affordances.
* Conclude by speculating about how gradient routing might be relevant to AI alignment.

If you're interested in further discussion or details, check out the paper and its extensive appendices, or the code for gradient routing.

Gradient routing

Gradient routing allows the user to configure what data (at the level of tokens, documents, or any other feature of the data) causes learning updates where in a neural network (parameters, activations, modules). In full generality, this configuration is achieved by assigning weights to every edge in the computational graph, for every data point. These weights are then multiplied by the gradients that get backpropagated through those edges. This is formalized in the paper. Each data point thus updates different parts of the network, as determined by its mask.
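One simple way to realize this masking in practice is a stop-gradient trick applied to activations: writing an activation as `mask * x + (1 - mask) * x.detach()` leaves the forward pass unchanged but scales the gradient flowing back through `x` by the mask. Below is a minimal PyTorch sketch under that assumption; the toy `RoutedMLP`, the even/odd mask layout, and all dimensions are illustrative choices, not the exact setup from the paper.

```python
import torch
import torch.nn as nn

def route_gradients(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Scale gradients flowing back through x by `mask` without changing the
    forward value: mask * x + (1 - mask) * x.detach() equals x numerically,
    but backprop only sees the first term."""
    return mask * x + (1 - mask) * x.detach()

class RoutedMLP(nn.Module):
    """Toy two-region MLP: each data point's mask decides which hidden units
    receive gradient updates from that example."""
    def __init__(self, d_in: int = 784, d_hidden: int = 64, d_out: int = 10):
        super().__init__()
        self.encoder = nn.Linear(d_in, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.encoder(x))
        h = route_gradients(h, mask)  # per-example, per-unit gradient mask
        return self.decoder(h)

# Example: route gradients from even-labeled examples to the first 32 hidden
# units and from odd-labeled examples to the last 32 (hypothetical split).
batch, d_hidden = 8, 64
x = torch.randn(batch, 784)
y = torch.randint(0, 10, (batch,))
mask = torch.zeros(batch, d_hidden)
mask[y % 2 == 0, :32] = 1.0
mask[y % 2 == 1, 32:] = 1.0

model = RoutedMLP()
loss = nn.functional.cross_entropy(model(x, mask), y)
loss.backward()  # encoder weights outside an example's region get no update from it
```

The forward computation is identical to an ordinary MLP; only the backward pass is edited, so specialization emerges purely from where each data point is allowed to write its learning updates.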