Near-mode thinking on AI
There is a stark difference between rehearsing classical AI risk 101 arguments about instrumental convergence, and tackling problems like "Design and implement the exact evaluations we'll run on GPT-5 to determine whether it's capable enough that we should worry about it acting aligned until it can execute a takeover".

Naturally, since I've started working on problems like the one above, I've noticed a large shift in my thinking on AI. I describe it as thinking about risks in near-mode, as opposed to far-mode.

In this post, I share a few concrete examples of my experiences with this change of orientation.

I. Prerequisites for scheming

Continuing with the example from the intro: a year ago I was confident that the "the AI is just playing along with our training and evaluations, until it is in a position where it can take over" threat model (deceptive alignment / scheming) was basically the default outcome and the main source of AI x-risk. I now think I was overconfident. Past-me hadn't really thought through the prerequisites for scheming.

A textbook example of a successful deceptive alignment story, applied to an LLM, paints a picture of a model that:

* Understands that the inputs it sees are simply prompts written by humans
  * As opposed to the model being "immersed in the setting", thinking that the current context more-or-less reflects reality, or flat-out is reality, taking the prompt at face value
* Understands that it's undergoing a training process, and has an explicit (correct!) model of the training objective used to train it
  * As opposed to the model not thinking about the training process and objective at all, with its changed behavior upon training reflecting the training process molding the model, rather than the model learning the training objective and optimizing for it
* Performs lots of the same cognition (e.g. the above two bullet points) regardless of the input fed to it
  * As opposed to the model's internal cognition varying heavily depending on the particular input it's given