This is the latest work in our Parameter Decomposition agenda. We introduce a new parameter decomposition method, adVersarial Parameter Decomposition (VPD)[1] and decompose the parameters of a small[2] language model with it. VPD greatly improves on our previous techniques, Stochastic Parameter Decomposition (SPD) and Attribution-based Parameter Decomposition (APD). We think...
(See my update 2 months later) A year or so ago, I thought that 10x speedups in safety research would require AIs so capable that takeover risks would be very high. 2-3x gains seemed plausible, 10x seemed unlikely. I no longer think this. What changed? I continue to think that...
Abstract A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition— a framework that has been proposed to resolve several issues with current decomposition methods—decomposes neural network parameters into a sum of sparsely used vectors...
This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components. Motivation At Apollo, we've spent a lot of time thinking about how the computations...
This post reflects my personal opinion and not necessarily that of other members of Apollo Research or any of the people acknowledged below. Thanks to Jarrah Bloomfield, Lucius Bushnaq, Marius Hobbhahn, Axel Højmark, and Stefan Heimersheim for comments/discussions. I find that people in the AI/AI safety community have not considered...