I was recently talking with someone about the problem of free will, and I realised that for many years now I have always had the same response, without really ever soliciting broader critical feedback. The notion of free will here refers to a naive, libertarian, non-strictly-defined approach of "when I...
This research was completed for London AI Safety Research (LASR) Labs 2024. The team was supervised by @Stefan Heimersheim (Apollo Research). Find out more about the program and express interest in upcoming iterations here. This video is a short overview of the project presented on the final day of the...
Produced as part of the OxAI Safety Labs program, mentored by Joar Skalse. TL;DR: This is a blog post introducing our new paper, "Goodhart's Law in Reinforcement Learning" (to appear at ICLR 2024). We study Goodhart's law in RL empirically, provide a geometric explanation for why it occurs, and use...
Here is one possible world we could be living in. Imagine that the majority of the world's population knows that unaligned AGI is coming. This majority includes most of the world's heads of government, all respectable scientists, most journalists, all your friends, your close family, your distant aunt...
The paper Optimal Policies Tend to Seek Power tries to justify the claim in the title with a mathematical argument, using a clever formal definition of "power". I like the general approach, but I think that parts of the formalisation do not correspond well to the intuitive understanding of the...