A common element in many AI control schemes is monitoring – using some model to review actions taken by an untrusted model in order to catch dangerous actions if they occur. Monitoring can serve two different goals. The first is detection: identifying misbehavior so you can understand it and prevent...
TLDR: to be dangerous, early schemers will have to do a bunch of adversarial reasoning under uncertainty. The current models seem extremely bad at this kind of reasoning relative to R&D automation capabilities like coding. I’m quite happy with that and I’d like to keep an eye on it as...
We’ve just released BashArena, a new high-stakes control setting we think is a major improvement over the settings we’ve used in the past. In this post we’ll discuss the strengths and weaknesses of BashArena, and what we’ve learned about how to make settings for high stakes control research. BashArena Overview...
Reference post for a point I was surprised to not see in circulation already. Thanks to the acorn team for conversations that changed my mind about this. The standard argument for scheming centrally involves goal-guarding. The AI has some beyond-episode goal, knows that training will modify that goal, and therefore...
In the spirit of better late than never - this has been sitting in drafts for a couple months now. Big thanks for Aryan Bhatt for helpful input throughout. Thanks to Abhay Sheshadri for running a bunch of experiments on other models for me. Thanks to Rowan Wang, Stewy Slocum,...
Context: for fun (and profit?) Basic Contact Contact is a lightweight many-versus-one word guessing game. I was first introduced to it on a long bus ride several years ago, and since then it’s become one of my favorite games to play casually with friends. There are a few blog posts...
it's now been 15 months since MIRI but I just remembered that three separate people have told me they liked this post despite my not cross-posting it, so I am now cross-posting it. Not written with the intention of being useful to any particular audience, just collecting my thoughts on...