There are a couple of frames I find useful when understanding why different people talk very differently about AI safety - the wall, and the bridge. A wall is incrementally useful. Every additional brick you add is good, and the more bricks you add the better. If you are adding...
Last year, I wrote a post about my upskilling in AI alignment. To this day, I still occasionally get people reaching out to me because of that article to ask questions about getting into the field themselves. I’ve also had several occasions to link people to the article who asked...
Keywords: Mechanistic Interpretability, Adversarial Examples, GridWorlds, Activation Engineering This is part 2 of A Mechanistic Interpretability Analysis of a GridWorld Agent Simulator Links: Repository, Model/Training, Task. Epistemic status: I think the basic results are pretty solid, but I’m less sure about how these results relate to broader phenomena such as...
Thanks to the Open Source Mechanistic Interpretability Slack for their feedback. When going through Neel's excellent 200 Concrete Problems In Interpretability sequence, I found myself thinking "It sure would be great to have a single document I could go through to see all the problems, and ideally sort them by...
Five months ago, I received a grant from the Long Term Future Fund to upskill in AI alignment. A few days ago, I was invited to Berkeley for two months of full-time alignment research under Owain Evans’s stream in the SERI MATS program. This post is about how I...
Note: This is a long post. It is structured so that not everyone needs to read everything - Sections 1 and 2 are skippable background information, and Sections 4 and 5 go into technical detail that not everybody wants or needs to know. Section 3 on...