I recently went to NeurIPS, the world’s largest academic AI conference. This was a multi-purpose trip: present a poster at the MechInterp Workshop and hobnob with other AI Safety researchers; get a general impression of the state of the art in AI, particularly the gamedev/creative space; and represent Timaeus, the company...
(previously, Q1/Q2 2025) Last time I got complaints about calling negative, trivial, or inconclusive results "failed" when they still showed something (and I probably learnt a lot personally), so let's go with "minor" instead. Mostly I've been busy with my MATS project, working with Redwood Research to build an...
A couple of people have mentioned to me: “we need more fictional examples of positive AI superintelligence, utopias like the Culture novels”. And they’re right: AI can be tremendously positive, and a few beacons lit towards that future could help bring it about. But one of my hobbies is...
Many of the stupid errors that LLMs make are attributable to behaviours they learn during pre-training and then fail to forget or suppress later. This post is here as your reminder that base model behaviour often shows through, and should be one of your foremost theories when trying to explain...
I'm in the midst of doing the MATS program which has kept me super busy, but that didn't stop me working on resolving the most important question of our time: What Hogwarts House does your chatbot belong to? Basically, I submitted each chatbot to the quiz at https://harrypotterhousequiz.org and totted...
This year I've been on sabbatical, spending my time upskilling in AI Safety. Part of that has been doing independent research projects in different fields. Some of those projects have resulted in useful output, notably A Toy Model of the U-AND Problem, Do No Harm? and SAEs and their...