I'm doing research and other work focused on AI safety/security, governance and risk reduction. Currently my top projects are (last updated Feb 26, 2025):
General areas of interest for me are AI safety strategy, comparative AI alignment research, prioritizing technical alignment work, analyzing the published alignment plans of major AI labs, interpretability, deconfusion research and other AI safety-related topics.
Research that I’ve authored or co-authored:
Before getting into AI safety, I was a software engineer for 11 years at Google and various startups. You can find details about my previous work on my LinkedIn.
While I'm not always great at responding, I'm happy to connect with other researchers or people interested in AI alignment and effective altruism. Feel free to send me a private message!
There's starting to be some discussion on LW now, e.g.:
https://www.lesswrong.com/posts/tnc7YZdfGXbhoxkwj/give-me-a-reason-ing-model
I should have mentioned the above thoughts are a low-confidence take. I was mostly just trying to get the ball rolling on discussion because I couldn't find any discussion of this paper on LessWrong yet, which really surprised me because I saw the paper had been shared thousands of times on LinkedIn already.
Thoughts on "The Ilusion of Thinking" paper that came out of Apple recently?
https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf
Seems to me like at least a point in favor of "stochastic parrots" over "builds a quality world model" for the language reasoning models.
Also wondering if their findings could be used to the advantage of safety/security somehow. E.g. if these models are more dependent on imitating examples than we realized, then it might also be more effective than we previously thought to purge training data of the types of knowledge and reasoning that we don't want them to have (e.g. knowledge of dangerous weapons development, scheming, etc.).
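To make that concrete, here's a very rough sketch of what coarse corpus filtering could look like. This is purely illustrative: the blocked-topic keywords and the corpus are placeholder values, and a real data-purging pipeline would presumably rely on trained classifiers rather than simple keyword matching.

```python
# Illustrative sketch only: coarse keyword-based filtering of a training corpus.
# The blocked topics and corpus below are placeholders, not a vetted filter.

BLOCKED_TOPICS = [
    "nerve agent synthesis",          # hypothetical example of weapons-related content
    "how to deceive your overseer",   # hypothetical example of scheming-related content
]

def keep_document(doc: str) -> bool:
    """Return True if the document contains none of the blocked topics."""
    lowered = doc.lower()
    return not any(topic in lowered for topic in BLOCKED_TOPICS)

corpus = ["A recipe for sourdough bread...", "..."]  # placeholder documents
filtered_corpus = [doc for doc in corpus if keep_document(doc)]
print(f"Kept {len(filtered_corpus)} of {len(corpus)} documents")
```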
I agree it's a good post, and it does take guts to tell people when you think that a research direction that you've been championing hard actually isn't the Holy Grail. This is a bit of a nitpick but not insubstantial:
Neel is talking about interpretability in general, not just mech-interp. He claims to be accounting in his predictions for other non-mech interp approaches to interpretability that seem promising to some other researchers, such as representation engineering (RepE), which Dan Hendrycks among others has been advocating for recently.
Let me know if anyone has thoughts on this question I just posted as well: Does the Universal Geometry of Embeddings paper have big implications for interpretability?
Does representation engineering (RepE) seem like a game-changer for interpretability? I don't see it mentioned in your post, so I'm trying to figure out if it is baked into your predictions or not.
It seemed like Apollo was able to spin up a pretty reliable strategic deception detector (95-99% accurate) using linear probes even though the techniques are new, and generally it sounds like RepE is getting traction on some things that have been a slog for mech interp. Does it look plausible that RepE could get us to high-reliability interpretability on workable timelines, or are we likely to hit similar walls with that approach?
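For concreteness, here's a minimal sketch of the kind of linear-probe detector I have in mind. It's my own illustration rather than Apollo's actual setup: it assumes you've already collected per-example activations from some layer of the model, labeled as deceptive vs. honest, and the file names are placeholders.

```python
# Minimal linear-probe sketch (illustrative, not Apollo's code).
# Assumes activations.npy holds (n_samples, d_model) hidden-state vectors from
# one layer, and labels.npy holds 1 = deceptive rollout, 0 = honest rollout.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = np.load("activations.npy")  # placeholder path
y = np.load("labels.npy")       # placeholder path

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

probe = LogisticRegression(max_iter=1000)  # a single linear direction in activation space
probe.fit(X_train, y_train)
print(f"Held-out probe accuracy: {probe.score(X_test, y_test):.3f}")
```

Part of what makes this appealing is that the whole detector is just a linear classifier over activations, so it's cheap to train and audit compared to circuit-level mech interp.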
Thanks for your post Neel (and Gemini 2.5) - really important perspective on all this.
"AI governance looking bleak" seems like an overstatement. Certain types or aims of AI governance are looking bleak right now, especially getting strong safety-oriented international agreements that include the US and China, or meaningful AI regulation at the national level in the US. But there may be other sorts of AI governance projects (e.g. improving the policies of frontier labs, preparing for warning shots, etc.) that could still be quite worthwhile.
Is there a summary of this post?
2023: AI governance starting to look promising because governments are waking up to AI risks. Technical AI safety getting challenging if you're not in a frontier lab, because it's hard to access relevant models to run experiments.
2025: AI governance looking bleak after the AI Action Summit. Technical AI safety looking more accessible because open-weight models are proliferating.
Update (Feb 10, 2023): I still endorse much of this comment, but I had overlooked that all or most of the prompts use "Human:" and "Assistant:" labels. This means we shouldn't interpret these results as pervasive properties of the models, or as something that would arise under any way of conditioning them, but just as properties of the way they simulate the "Assistant" character. nostalgebraist's comment explains this well. [Edited/clarified this update on June 10, 2023 because it accidentally sounded like I disavowed most of the comment when it's mainly one part]
--
After taking a closer look at this paper, pages 38-40 (Figures 21-24) show in detail what I think are the most important results. Most of these charts indicate what evhub highlighted in another comment, i.e. that "the model's tendency to do X generally increases with model scale and RLHF steps", where (in my opinion) X is usually a concerning behavior from an AI safety point of view:
A few thoughts on these graphs as I've been studying them: