Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
[image, by Rob Wiblin]
[image, xkcd]
My EA Journey, depicted on the whiteboard at CLR:
[image, h/t Scott Alexander]
The fact that the model discusses the backdoor is exciting! We can monitor the CoT for signs that the model has a backdoor.
Yes! Moreover, the fact that the model discusses the backdoor is evidence about the degree of introspective access the model has to its internal states / propensities / etc. Also, the way in which the model discusses the backdoor might give clues that help us prognosticate how it might generalize, what underlying circuitry might be involved, and how it might have been influenced by the training process.
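A minimal sketch of what such a CoT monitoring pass could look like, assuming a hypothetical transcript format and a hand-picked cue list (both are my assumptions for illustration, not an existing tool):

```python
# Hypothetical sketch: flag transcripts whose chain of thought discusses
# backdoor/trigger behavior, for human review and for studying how the
# model talks about it. Cue list and transcript fields are assumptions.

BACKDOOR_CUES = [
    "backdoor", "trigger phrase", "secret instruction",
    "only when the deployment flag", "hide this from the monitor",
]

def flag_cot(cot_text: str) -> list[str]:
    """Return the cue phrases (if any) that appear in a chain of thought."""
    lowered = cot_text.lower()
    return [cue for cue in BACKDOOR_CUES if cue in lowered]

def monitor_transcripts(transcripts: list[dict]) -> list[dict]:
    """Collect transcripts whose CoT mentions backdoor-like behavior."""
    flagged = []
    for t in transcripts:
        hits = flag_cot(t.get("cot", ""))
        if hits:
            flagged.append({"id": t.get("id"), "cues": hits})
    return flagged
```

In practice you'd presumably swap the keyword match for a classifier or probe, but the monitoring loop would look roughly like this.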
https://scottaaronson.blog/?p=8908
Scott Aaronson says some lovely things about this community! <3
Also, it's not like these ideas are new to him or he hasn't thought about them before. See the Musk v. Altman emails, ctrl-f "AGI dictatorship"
idk, haven't thought about it, you'd know better than me
Yeah, especially if this becomes a standard part of big companies' toolboxes, it feels like it might noticeably (~1%?) reduce overall AI risks. It gives companies more fine-grained, cheap control over which skills a model has vs. lacks.
Idea: Use this to make more faithful CoT models:
--Take your best model and scrape all its CoTs, including with tool calls etc.
--Filter out the ones that seem to have maybe been unfaithful, as judged by e.g. activations for deception or whatnot.
--Maybe also filter out any discussion of CoT monitoring and CoT lol
--Distill.
--Take the new model and see if it has better faithfulness properties, e.g. is harder to fine-tune to fool monitoring systems. Hopefully the answer is yes.
--Then maybe do another technique, where you train a smaller, faster model to do more steps of reasoning to get to the same result. Like, take the CoT of your previous model and divide it into N chunks, where each chunk is a sentence or so. Then train a new, smaller model to take chunk 1 and do lots of CoT and eventually reach chunk 2, and then to take chunk 1 + CoT + chunk 2 and do lots of CoT to eventually reach chunk 3, and so on. So that basically you have a model that tends to do more of its thinking in CoT, but has a similar capability and propensity profile. (A rough data-construction sketch follows below.)
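Here is a rough sketch of how the chunk-expansion training data might be constructed, assuming each scraped record has a prompt and a full CoT string; the sentence-level chunking and the field names are hypothetical choices, not a fixed recipe:

```python
# Hypothetical sketch of the chunk-expansion data construction described above.
# Assumes each record has a "prompt" and a full "cot" string from the big model;
# chunk granularity (sentences) and the example format are my assumptions.

import re

def split_into_chunks(cot: str) -> list[str]:
    """Split a chain of thought into sentence-sized chunks."""
    # Naive sentence splitter; a real pipeline would use something more robust.
    chunks = re.split(r"(?<=[.!?])\s+", cot.strip())
    return [c for c in chunks if c]

def build_expansion_examples(prompt: str, cot: str) -> list[dict]:
    """For each chunk boundary, create an example where the smaller model
    must produce its own (longer) CoT before reaching the next chunk."""
    chunks = split_into_chunks(cot)
    examples = []
    for i in range(1, len(chunks)):
        context = " ".join(chunks[:i])   # chunks 1..i from the big model
        target = chunks[i]               # the next chunk it must reach
        examples.append({
            "input": f"{prompt}\n\nReasoning so far: {context}",
            # The smaller model's intermediate CoT is left free-form; only
            # the endpoint (the next chunk) is supervised.
            "target_suffix": target,
        })
    return examples
```

Usage would be to run this over the scraped-and-filtered CoT corpus from the earlier steps and fine-tune the smaller model on the resulting examples; leaving the intermediate CoT unconstrained is what pushes more of the thinking into the visible CoT.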
More speculatively, UNDO’ing deception or sycophancy.
That would be pretty sweet
After years of tinkering and incremental progress, AIs can now play Diplomacy as well as human experts.[6]
CICERO was a custom-trained Diplomacy model that couldn't win against human experts who knew it was an AI. Now, in 2025, we have https://every.to/diplomacy, which is just off-the-shelf LLM chatbots applied to Diplomacy. I'm curious how they would stack up against human experts who knew they were AIs. I expect they'd probably lose, but that if somehow they could do lots of RL on games against humans, they'd start winning, just as I originally forecast.
seconds to run interpretability analyses, or better yet, if I write my own complicated code for a better successor AGI from scratch. Is that OK?” The tech company employees respond in unison: “lol no way in hell”.
Not the AI company employees I know... they are all too eager to trust future AIs with autonomy and responsibilities.
They just sent me some graphs that may be of interest: