(See my update 2 months later) A year or so ago, I thought that 10x speedups in safety research would require AIs so capable that takeover risks would be very high. 2-3x gains seemed plausible; 10x seemed unlikely. I no longer think this. What changed? I continue to think that...
Abstract

A key step in reverse engineering neural networks is to decompose them into simpler parts that can be studied in relative isolation. Linear parameter decomposition, a framework that has been proposed to resolve several issues with current decomposition methods, decomposes neural network parameters into a sum of sparsely used vectors...
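To make the core idea concrete, here is a minimal toy sketch of what "parameters as a sum of sparsely used components" could look like. Everything in it is illustrative: the shapes, the names (`n_components`, `top_k`), the random rank-one components, and the norm-based attribution proxy are all assumptions, not the paper's actual (gradient-based) attribution method or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: a single weight matrix W, hypothetically decomposed into
# C parameter components P_c such that W = sum_c P_c.
d_in, d_out, n_components = 8, 8, 16

# Hypothetical "learned" components; here just random rank-one matrices.
U = rng.normal(size=(n_components, d_out, 1))
V = rng.normal(size=(n_components, 1, d_in))
P = U @ V                      # shape: (n_components, d_out, d_in)
W = P.sum(axis=0)              # the full weights are the sum of components

def sparse_forward(x, top_k=3):
    """Run the layer using only the top_k components most 'used' on x.

    Attribution here is a crude proxy (each component's output norm on x);
    the attribution method in the actual framework is more involved.
    """
    outs = P @ x                            # per-component outputs, (C, d_out)
    scores = np.linalg.norm(outs, axis=1)   # per-component attribution proxy
    active = np.argsort(scores)[-top_k:]    # keep the k most active components
    return P[active].sum(axis=0) @ x

x = rng.normal(size=d_in)
print("full output:  ", W @ x)
print("sparse output:", sparse_forward(x))
```

The point of the sketch is only the decomposition structure: on any given input, a small subset of parameter components should suffice to approximately reproduce the full computation, so each component can be studied in relative isolation.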
This is a linkpost for Apollo Research's new interpretability paper: "Interpretability in Parameter Space: Minimizing Mechanistic Description Length with Attribution-based Parameter Decomposition". We introduce a new method for directly decomposing neural network parameters into mechanistic components.

Motivation

At Apollo, we've spent a lot of time thinking about how the computations...
This post reflects my personal opinion and not necessarily that of other members of Apollo Research or any of the people acknowledged below. Thanks to Jarrah Bloomfield, Lucius Bushnaq, Marius Hobbhahn, Axel Højmark, and Stefan Heimersheim for comments/discussions. I find that people in the AI/AI safety community have not considered...
Why we made this list:

* The interpretability team at Apollo Research wrapped up a few projects recently[1]. In order to decide what we'd work on next, we generated a lot of different potential projects. Unfortunately, we are computationally bounded agents, so we can't work on every project idea that...
This is a linkpost for: www.apolloresearch.ai/blog/the-first-year-of-apollo-research

About Apollo Research

Apollo Research is an evaluation organization focusing on risks from deceptively aligned AI systems. We conduct technical research on AI model evaluations and interpretability and have a small AI governance team. As of 29 May 2024, we are one year old....