Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
[meme image, by Rob Wiblin]
[meme image, xkcd]
My EA Journey, depicted on the whiteboard at CLR:
[whiteboard photo, h/t Scott Alexander]
I don't have a great single piece to point to. For a recent article I quote-tweeted, see https://www.beren.io/2025-08-02-Do-We-Want-Obedience-Or-Alignment/
For some of the earlier writing on the subject, see https://www.lesswrong.com/w/corrigibility-1 and https://ai-alignment.com/corrigibility-3039e668638
Also I liked this, which appears to be Eliezer's Ideal Plan for how to make a corrigible helper AI that can help us solve the rest of the problem and get an actually aligned AI: https://www.lesswrong.com/posts/5sRK4rXH2EeSQJCau/corrigibility-at-some-small-length-by-dath-ilan
https://alignment.openai.com/prod-evals/
I'm glad OpenAI did this research and published it! I hope this sort of thing becomes industry standard.
However, because of their rigid structure, confessions may not surface “unknown unknowns” the way a chain of thought can. A model might confess honestly to the questions we ask, but not to the questions we didn't know to ask.
I'm curious if there are examples already of insights you've gained this way via studying chains of thought.
OK. Yeah, that's my opinion too. Maybe I am one of the people leaning too heavily on their work. The problem is, there isn't much else to go on. "The worst benchmark for predicting AGI, except for all the others."
I like your raises!
Why do you think METR hasn't built more tasks, then, if it's easy? I take it you have a negative opinion of them?
From https://arxiv.org/pdf/2001.08361 (Kaplan et al., "Scaling Laws for Neural Language Models").
Also see the grokking literature: https://en.wikipedia.org/wiki/Grokking_(machine_learning)
Previous discussion:
https://www.lesswrong.com/posts/FRv7ryoqtvSuqBxuT/understanding-deep-double-descent
The contradiction isn't just in this community; it's everywhere. People mostly haven't made up their minds yet about whether they want AIs to be corrigible (obedient) or virtuous (ethical). Most people don't seem to have noticed the tension. In the AI safety community this problem is discussed, and different people have different views, including very strongly held ones. I myself have changed my mind on this at least twice!
Insofar as people are disagreeing with you, I think it's maybe because of your implication that this issue hasn't been discussed before in this community. It's been discussed for at least five years, I think, maybe more like fifteen.
Thanks! Welcome to LessWrong!
Re: all markets opening at 50%: I don't know the details, but my understanding is that this is a solved problem. Basically, the creator of a market can subsidize it, perhaps by opening it at 50% and spending a fixed pot of money buying and selling to try to keep the price there. Again, I don't know exactly how it works, but it seems to work fine on Manifold. (There's a rough sketch of one standard subsidy mechanism below.)
Re: each model trading on each market: There are different ways to set this up; I don't think the details matter. Probably it's best to have a market and then allow a blooming diversity of approaches to participate in it. Some AI traders might be a script that spins up an agent for each market every week for one hour; others might be a single agent that runs continuously and browses all the markets, deciding which to engage in. The reward functions could be experimented with too; there are loads of variants on the basic idea of reinforcing success.
Re: Clean mechanics: Idk, maybe. Let a thousand flowers bloom; all of the above should be tried. I kinda suspect that arbitrage and inefficiency detection are valid ways to contribute, though, and it's OK if the AIs are motivated to do those things?
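On the subsidy point, here's a minimal sketch of one standard mechanism, Hanson's logarithmic market scoring rule (LMSR). I don't know whether Manifold actually uses this, so the LMSRMarket class and the numbers below are purely illustrative: the creator's subsidy is the liquidity parameter b, the market opens at 50%, and the creator's worst-case loss is bounded by b·ln(2).

```python
import math

class LMSRMarket:
    """Binary prediction market using Hanson's logarithmic market scoring rule (illustrative)."""

    def __init__(self, b: float):
        self.b = b        # liquidity parameter = the creator's subsidy knob; worst-case loss is b * ln(2)
        self.q_yes = 0.0  # outstanding YES shares
        self.q_no = 0.0   # outstanding NO shares

    def _cost(self, q_yes: float, q_no: float) -> float:
        # LMSR cost function: C(q) = b * ln(exp(q_yes / b) + exp(q_no / b))
        return self.b * math.log(math.exp(q_yes / self.b) + math.exp(q_no / self.b))

    def price_yes(self) -> float:
        # Instantaneous YES price, i.e. the market's implied probability
        e_yes = math.exp(self.q_yes / self.b)
        e_no = math.exp(self.q_no / self.b)
        return e_yes / (e_yes + e_no)

    def buy_yes(self, shares: float) -> float:
        # A trader pays the change in the cost function; buying YES moves the price up
        cost = self._cost(self.q_yes + shares, self.q_no) - self._cost(self.q_yes, self.q_no)
        self.q_yes += shares
        return cost

m = LMSRMarket(b=100.0)   # creator commits at most 100 * ln(2) ~= 69 of subsidy
print(m.price_yes())      # 0.5 -- market opens at 50%
print(m.buy_yes(50.0))    # ~28 -- cost to a trader of buying 50 YES shares
print(m.price_yes())      # ~0.62 -- price after the trade
```

The point is just that a fixed, bounded subsidy gives traders something to push against at 50% without the creator taking on unlimited risk; whatever mechanism Manifold actually uses, it presumably has this property.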
Sure, it's possible. But it's not like we've exhaustively mapped the space of possible scenarios that could realistically arise. Lots of crazy shit is going to happen during the singularity. E.g. "The President has ordered the datacenters be shut down, but he's under the influence of AI psychosis, so maybe it's valid to resist, even though this means whipping up a mob to block the arrival of the National Guard?" E.g. "As part of our deal with the Vile Ones we need to allow them to continue their fusion-powered robotic FOOM, which will render all life on Earth impossible in six months, but maybe that's fine because we can just upload all our citizens and beam them onto the space datacenters. They're claiming that's exactly what we should do and that we should stop being such babies about it, but maybe this is actually killing everyone and therefore unacceptable, and we need to go to war instead?"
By "Quite possible" do you mean "Probable?" If so why?
I assume AIs will be superhuman at that stuff, yeah; it was priced into my claims. Basically, a bunch of philosophical dilemmas might be more values-shaped than fact-shaped. Simply training more capable AIs won't pin down the answers to those questions, for the same reason that it doesn't pin down the answers to ethical questions.