Introduction Activation additions represent an exciting new way of interacting with LLMs, and have potential applications to AI safety. However, the field is very new, and there are a few aspects of the technique that do not have a principled justification. One of these is the use of counterbalanced subtractions...
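For readers who haven't seen the technique, here is a minimal sketch of an activation addition with a counterbalanced subtraction, assuming GPT-2 via Hugging Face transformers. The prompt pair ("Love"/"Hate"), the layer index, and the steering coefficient are illustrative choices on my part, not values taken from the post.

```python
# A minimal sketch of an activation addition with a counterbalanced subtraction,
# assuming GPT-2 via Hugging Face transformers. The prompt pair ("Love"/"Hate"),
# layer index, and coefficient below are illustrative, not canonical values.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6    # block whose output (residual stream) we intervene on (illustrative)
COEFF = 5.0  # steering coefficient (illustrative)

def residual_stream_at(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activations at the output of the given block."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[layer + 1]  # shape: (1, seq_len, d_model)

# Counterbalanced pair: the steering vector is the *difference* between the two
# prompts' activations, so directions shared by both prompts cancel out.
h_plus = residual_stream_at("Love", LAYER)
h_minus = residual_stream_at("Hate", LAYER)
n = min(h_plus.shape[1], h_minus.shape[1])
steering = COEFF * (h_plus[:, :n] - h_minus[:, :n])

def add_steering_hook(module, inputs, output):
    """Add the steering vector to the first positions of the residual stream."""
    hidden = output[0]
    # Only modify the full-prompt forward pass; with KV caching, later decoding
    # steps see one token at a time and are left unchanged here.
    if hidden.shape[1] > 1:
        hidden = hidden.clone()
        hidden[:, : steering.shape[1]] += steering
        return (hidden,) + output[1:]
    return output

handle = model.transformer.h[LAYER].register_forward_hook(add_steering_hook)
try:
    ids = tokenizer("I think that you are", return_tensors="pt").input_ids
    gen = model.generate(ids, max_new_tokens=20, do_sample=False)
    print(tokenizer.decode(gen[0]))
finally:
    handle.remove()
```

The subtraction of the second prompt's activations is the "counterbalance": the intuition is that components shared by both prompts cancel, leaving a vector that points more cleanly along the intended direction.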
I think that this has been stated implicitly in various comments and LessWrong posts, but I haven't found a post which makes this point clearly, so I thought I would quickly write one up! To get the most out of this, you should probably be pretty familiar with GPT-2 style transformer...
Ian Hogarth has just been announced as the Chair of the UK's AI Foundation Model Taskforce. He's the author of the FT article "We must slow down the race to God-like AGI", and seems to take X-risks from AI seriously. To quote his Twitter thread: > And to that end...
Tldr A super simple form of gradient hacking might be for the model to perform badly whenever the value of its mesa-objective changes, in order to preserve the mesa-objective. I use some simple properties of SGD to explain why this method of gradient hacking cannot work. Thanks to Jacob Trachtman,...
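As a quick reminder of the SGD property this kind of argument leans on (a sketch, with $\theta$ the full parameter vector, $\eta$ the learning rate, and $L$ the training loss):

$$\theta_{t+1} = \theta_t - \eta \, \nabla_\theta L(\theta_t)$$

The update depends only on the gradient of the loss at the current parameters $\theta_t$, not on how the model would behave at other, hypothetical parameter settings.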
One aspect of alignment research I find interesting is improving how we understand potential threat / failure modes by making them more concrete or rigorous in some way. When I think of making these concrete, I’m thinking of theoretical results in ML (think Turner’s power-seeking work) or toy examples of...
Summary I see this post as essentially trying to make clear the link between self-reference as discussed in Hofstadter’s “Gödel, Escher, Bach” and self-reference as considered in some of MIRI’s work on Embedded Agency. I start by considering instances of self-reference in mathematics through Gödel's Incompleteness Theorems, before moving on...