Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes: [image by Rob Wiblin] [image from xkcd]
My EA Journey, depicted on the whiteboard at CLR: [image, h/t Scott Alexander]
Yeah, probably, but what do you mean by that? Do you mean 'the maximum of its utility function / long-term goals is the same as what would have happened by default if it didn't accumulate power over us / didn't take over / etc.'? Or do you mean 'like us, it wants people to not die, to be happy, to be rich, ...'? If it's just a list of things like that, but the list is not exactly the same as our list, well, it can get more of what it wants by breaking the chain.
Yeah, I assume AIs will be superhuman at that stuff; it was priced into my claims. Basically, a bunch of philosophical dilemmas might be more values-shaped than fact-shaped. Simply training more capable AIs won't pin down the answers to those questions, for the same reason that it doesn't pin down the answers to ethical questions.
I don't have a great single piece to point to. For a recent article I quote-tweeted, see https://www.beren.io/2025-08-02-Do-We-Want-Obedience-Or-Alignment/
For some of the earlier writing on the subject, see https://www.lesswrong.com/w/corrigibility-1 and https://ai-alignment.com/corrigibility-3039e668638
Also I liked this, which appears to be Eliezer's Ideal Plan for how to make a corrigible helper AI that can help us solve the rest of the problem and get an actually aligned AI: https://www.lesswrong.com/posts/5sRK4rXH2EeSQJCau/corrigibility-at-some-small-length-by-dath-ilan
https://alignment.openai.com/prod-evals/
I'm glad OpenAI did this research and published it! I hope this sort of thing becomes industry standard.
However, because of their rigid structure, confessions may not surface “unknown unknowns” in the way a chain of thought can. A model might confess honestly to the questions we ask, but not to questions we did not know to ask.
I'm curious if there are examples already of insights you've gained this way via studying chains of thought.
OK, yeah, that's my opinion too. Maybe I am one of the people leaning too heavily on their work. The problem is, there isn't much else to go on. "The worst benchmark for predicting AGI, except for all the others."
I like your raises!
Why do you think METR hasn't built more tasks, then, if it's easy? I take it you have a negative opinion of them?
From Kaplan et al., "Scaling Laws for Neural Language Models": https://arxiv.org/pdf/2001.08361
Also see the grokking literature: https://en.wikipedia.org/wiki/Grokking_(machine_learning)
Previous discussion:
https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent
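For readers who want the shape of the claim in the Kaplan et al. figures: loss falls roughly as a power law in model size (and similarly in data and compute), L(N) = (N_c / N)^α. Here's a toy sketch of fitting that form in log-log space; the data points and fitted constants below are made up for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical (parameter count, loss) pairs -- made up for illustration,
# not measurements from the paper.
sizes = np.array([1e6, 1e7, 1e8, 1e9, 1e10])
losses = np.array([5.2, 4.1, 3.3, 2.6, 2.1])

# L(N) = (N_c / N)**alpha  implies  log L = -alpha*log N + alpha*log N_c,
# i.e. a straight line in log-log space, which is what the paper's plots show.
slope, intercept = np.polyfit(np.log(sizes), np.log(losses), deg=1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"alpha ~ {alpha:.3f}, N_c ~ {n_c:.2e}")
```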
The contradiction isn't just in this community; it's everywhere. People mostly haven't made up their minds yet about whether they want AIs to be corrigible (obedient) or virtuous (ethical). Most people don't seem to have noticed the tension. In the AI safety community this problem is discussed, and different people have different views, including very strongly held views. I myself have changed my mind on this at least twice!
Insofar as people are disagreeing with you, I think it's maybe because of your implication that this issue hasn't been discussed before in this community. It's been discussed for at least five years, I think, maybe more like fifteen.
I basically agree with everything you say here and wish we had a better way to ground AGI timeline forecasts. Do you recommend any other method? E.g. extrapolating revenue? Just thinking through arguments about whether the current paradigm will work, and then using intuition to make the final call? We discuss some methods that appeal to us here.
Note that we allow it to go subexponential, so actually it can push the date arbitrarily far into the future if you really want it to. Also, I dunno what's happening with Eli's parameters, but with my parameter settings, setting the doubling difficulty growth factor to 1 (i.e. a pure exponential trend, neither super- nor sub-exponential) gets to AC in 2035. (Though I don't think we should put much weight on this number, as it depends on other parameters that are subjective and important too, such as the horizon length that corresponds to AC, which people disagree a lot about.)
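To make the "doubling difficulty growth factor" concrete: the time-horizon metric keeps doubling, and each successive doubling takes growth-factor times as long as the previous one (1 = pure exponential, below 1 = superexponential, above 1 = subexponential). Here's a toy sketch of that kind of extrapolation; it is my simplification for illustration, not the actual timelines model code, and every parameter value below is a placeholder rather than my real settings.

```python
from datetime import date, timedelta

def horizon_extrapolation(start: date,
                          current_horizon_hours: float,
                          target_horizon_hours: float,
                          first_doubling_months: float,
                          growth_factor: float) -> date:
    """Toy time-horizon extrapolation (not the real model).

    The horizon doubles repeatedly; each doubling takes `growth_factor`
    times as long as the previous one.  growth_factor == 1 is a pure
    exponential trend, < 1 superexponential, > 1 subexponential.
    """
    horizon = current_horizon_hours
    doubling_months = first_doubling_months
    months_elapsed = 0.0
    while horizon < target_horizon_hours:
        months_elapsed += doubling_months
        horizon *= 2
        doubling_months *= growth_factor
        if months_elapsed > 12 * 200:  # give up after ~200 years
            break
    return start + timedelta(days=30.44 * months_elapsed)

# Placeholder numbers, purely for illustration:
print(horizon_extrapolation(date(2026, 1, 1),
                            current_horizon_hours=4,
                            target_horizon_hours=4 * 2**10,  # ~10 doublings away
                            first_doubling_months=6,
                            growth_factor=1.0))
```

With these placeholder settings, a growth factor of 1 lands about five years out; raising it above 1 stretches each successive doubling and pushes the date further out, which is the subexponential direction discussed above.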