Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker's Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
(by Rob Wiblin)
(xkcd)
My EA Journey, depicted on the whiteboard at CLR:
(h/t Scott Alexander)
Notes to self about the structure of the problem, probably not interesting to others:
This is heavily drawn from MIRI's work and Joe Carlsmith's work.
So, there are two kinds of value structure: (1) Long-term goals, and (2) immediate goals & deontological constraints. The line between them isn't sharp but that's OK.
If we imagine an agent that only has long-term goals, well, that thing is going to be a ruthless consequentialist optimizer, and when it gets smart and powerful enough it'll totally take over the world if it can, unless the maxima of its long-term goals are very similar to what would have happened by default without a takeover, which they won't be. The Fragility of Value thesis and the Orthogonality thesis both hold for this type of agent. And, I claim, we are not on track to get the long-term goals of our AIs sufficiently close to correct for it to be safe for us to build this type of agent.
However, we can instead try to build agents that also have immediate goals / deontological constraints, and we can try to shape those goals/constraints to make the resulting agent corrigible or otherwise steerable and safe. E.g. maybe its vision for a future utopia would actually be quite bad from our perspective because it lacks some important value (such as diversity, or consent, or whatever) that we haven't noticed yet, but that's OK, because it's honest with us and obeys the rules, and so it wouldn't take over; instead it would politely explain this divergence in values to us when we ask.
So far so good. How are we doing at getting those constraints to stick?
Empirically not so great, but not completely hopeless either.
But I want to talk about the theoretical side. The main reason to be concerned, on priors, is the nearest unblocked strategy problem. Basically, if a weak agent is trying to control a strong agent by imposing rules on what that strong agent can do -- even if the strong agent is bound to follow the rules -- then the situation is kinda dicey for the weak agent; the strong agent will be looking for loopholes, basically, for workarounds. It'll still be in some sense trying to take over and make the world good by its own lights (and terrible by the lights of the weaker agent); it's just that it can't do it in the obvious ways that the weaker agent has thought of in advance. This mostly works fine in practice if the two agents have similar intelligence levels, similar understandings of the relevant domains, etc. (After all, this is how many productive business relationships work in capitalism: the parties agree on a contract, then they do a mutually beneficial trade, even though each would prefer more money for themselves and not for the other.) But if it's e.g. clever salesmen and dumb rich kids, the clever salesmen systematically extract value from the dumb rich kids, without doing anything illegal or violating any contracts or lying to the kids.
How bad is this? Well, it's hard to say. It does make me feel that if we had to go from humans all the way to ASI in one jump, even if we could reliably make the ASI follow whatever constraints we input, we'd probably be screwed (assuming the ASI's long-term goals were somewhat different from ours, which they would be if it was trained using anything like current techniques). We'd make it be honest with us, we'd make it follow the law, we'd make it avoid manipulating us, etc. but somewhere in the process of specifying what we mean by all those things we'd mess up and forget something and there would be a loophole which would eventually be exploited. (Remember also that due to race dynamics we wouldn't exactly have teams of philosophers and lawyers pondering for years what constraints to put in -- instead it would be more like a few tech company employees pulling all-nighters to update the Spec that was used on earlier dumber models and apply it to the new ASI that's already being trained.)
Fortunately, things will probably be more gradual than that; we can try to control/steer AI systems not that different from today's, and then get their help controlling AI systems smarter than them, and so forth in a long chain all the way to ASI.
But will that chain break? Totally it could break. But I find it hard to guess at how high the probability is that it'll break. Maybe it's fine? Maybe the default outcome is that every link of the chain holds? Idk, I'm like 50/50 at the moment. Curious for arguments one way or another. Also curious for arguments that I'm thinking about this all wrong.
I think I agree with this? "Most algo progress is data progress" "Yep. Still counts though."
In a way, this is like a large-scale reprise of the expert systems era: instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process, formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines, since the fact that we need such effort to design extremely high-quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess. Simply grinding to AGI by getting experts to exhaustively cover every possible bit of human knowledge and skill and hand-coding (albeit with AI assistance) every single possible task into an RL gym seems likely to be inordinately expensive, to take a very long time, and to be unlikely to suddenly bootstrap to superintelligence.
I think this is a reasonable take. Here's the opposite hypothesis:
"What are you talking about? These companies are giant juggernauts that are building a huge pipeline that does the following: (1) Identify economically valuable skills that AIs are missing (2) Collect/construct training environments / data to train those skills (3) Include those environments in the next big training runs, so that future AIs are no longer missing those skills. Already this seems like the sort of economic engine that could just keep churning until basically the whole world has been transformed. Is it AGI? No, it's still massively less efficient than the human brain. But it might nevertheless automate most jobs within a decade or so, and then continue churning along, automating new jobs as they come up. AND that's not taking into account three additional important factors: (4) The AIs are already generalizing to unseen skills/tasks to some extent, e.g. Claude is getting better at Pokemon despite not having been trained on Pokemon. Thus there might be a sort of 'escape velocity' effect where, after the models get big enough and have been trained on enough diverse important tasks, they become able to do additional new tasks with less and less additional training, and eventually can just few-shot-learn them like humans. If this happens then they really are AGI in the relevant sense, while still being less data-efficient than humans in some sense. (5) The AIs are already accelerating coding to some extent. The aforementioned pipeline that does steps 1-3 repeatedly to gradually automate the economy? That pipeline itself is in the process of getting automated as we speak. If you like you can think of the resulting giant automated corporation as itself an AGI that learns pretty fast, perhaps even faster than humans (albeit still less efficiently than humans in some sense). (Faster than humans? Well yeah; consider how fast AIs have improved at math over the last two years as companies turned their efforts towards training math skills; then consider what's happening to agentic coding; compare to individual human mathmeticians and programmers, who take several times as long to cross the same skill range during school.) (6) Even if those previous two claims are wrong and the current paradigm just won't count as AGI, period, if AI R&D gets accelerated significantly then the new paradigms that are necessary should be a few years away rather than decades away. And it seems that the current paradigm might suffice to accelerate R&D significantly, even if it can't automate it completely.
Which of these two competing hypotheses is less wrong? I don't know, but I still have substantial weight on the second.
I wish there was some quantitative analysis attempting to distinguish the two. Questions I'd love to see quantitative answers to: How much would it cost to give every major job in the economy the treatment math and coding are currently getting? How much will that cost go down, as AIs partially automate the pipeline? How much are AIs generalizing already? (this one is hard to answer because the companies are quiet about their training data) Is generalization radius increasing as models get smarter & are trained on more diverse stuff, or does it seem to be plateauing or entirely a function of e.g. pretraining loss?
...
Huh, I wonder if this helps explain some of the failures of the agents in the AI Village. Maybe a bunch of these custom RL environments are buggy, or at least more buggy than the actual environments they are replicating, and so maybe the agents have learned to have a high prior that if you try to click on something and it doesn't work, it's a bug rather than user error. (Probably not though. Just an idea.)
Similar issues but less extreme. The degree of concentration of power, misalignment risk, etc. from intelligence amplification would be smaller than from AGI.
I don't think it's the most important or original or interesting thing I've done, but I'm proud of the ideas in here nevertheless. Basically, other researchers have now actually done many of the relevant experiments to explore the part of the tech tree I was advocating for in this post. See e.g. https://www.alignmentforum.org/posts/HuoyYQ6mFhS5pfZ4G/paper-output-supervision-can-obfuscate-the-cot
I'm very happy that those researchers are doing that research, and moreover, very happy that the big AI companies have sorta come together to agree on the importance of CoT monitorability! https://arxiv.org/abs/2507.11473
OK but are these ideas promising? How do they fit into the bigger picture?
Conveniently, my answer to those questions is illustrated in the AI 2027 scenario:
It's not the only answer though. I think that improving the monitorability of CoT is just amaaaazing for building up the science of AI and the science of AI alignment, plus also raising awareness about how AIs think and work, etc.
Another path to impact is that if neuralese finally arrives and we have no more CoTs to look at, (a) some of the techniques for making good neuralese interpreters might benefit from the ideas developed for keeping CoT faithful, and (b) having previously studied all sorts of examples of misaligned CoTs, it'll be easier to argue to people that there might be misaligned cognition happening in the neuralese.
(Oh also, to be clear, I'm not taking credit for all these ideas. Other people like Fabian and Tamera, for example, got into CoT monitorability before I did, as did MIRI arguably. I think of myself as having just picked up the torch and run with it for a bit, or more like, shouted at people from the sidelines to do so.)
Overall do I recommend this for the review? Well, idk, I don't think it's THAT important or great. I like it but I don't think it's super groundbreaking or anything. Also, it has low production value; it was a very low-effort dialogue that we banged out real quick after some good conversations.
Thanks! Got a sense of how many button presses happen on average per step? That would be helpful for making an apples-to-apples comparison.
Yeah, I agree; the conclusion is that median human performance in the steps graph should be higher. It was always suspicious that it was worse than the models anyway.
I agree we probably shouldn't update much on this; it's from AI and it's janky.
As for beating the game... well sure, but based on the above graphs it seems like Claude will beat the game within about a year?
Other models have beaten the game months ago, but with more advanced harnesses/scaffolds.
Did 3 Pro and 2.5 Pro have the same harness? I assume so... 2.5 was released at the end of March, 3 was released in late November. So an 8-month gap, meaning Pokemon 'speed' improved by 4x in 8 months. Meanwhile for Claude, 3.7 Sonnet was released in February, so 9 months to get to today's Opus 4.5. It seems harder to make the comparison for Claude because 3.7 Sonnet seems like it was basically stuck and would essentially never finish. If we focus on the part before it got stuck though, e.g. defeating Surge, it seems like Opus 4.5 is a bit more than 4x faster? But then Opus 4.5 got stuck later in the Rocket Hideout... but eventually made it through... yeah, hard to compare. But it's interesting that they both show something like a 4x improvement in speed over a similar period of time.
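If you take those rough numbers at face value, here's a minimal back-of-the-envelope sketch of the implied doubling time. The ~4x and ~8-9 month figures are just the eyeballed estimates from above, not measured data:

```python
import math

# Back-of-the-envelope: if Pokemon completion "speed" improves by some factor
# over some number of months, what's the implied doubling time?
# Inputs are the rough eyeballed estimates from the discussion above.
def doubling_time_months(speedup: float, months: float) -> float:
    return months / math.log2(speedup)

print(doubling_time_months(4, 8))  # Gemini 2.5 Pro -> Gemini 3 Pro: ~4 months per doubling
print(doubling_time_months(4, 9))  # Claude 3.7 Sonnet -> Opus 4.5: ~4.5 months per doubling
```

So both lines work out to something like a doubling in speed every 4-4.5 months, if the 4x-in-8-to-9-months framing is roughly right.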
Possibly relevant, possibly hallucinated data: https://www.lesswrong.com/posts/cxuzALcmucCndYv4a/daniel-kokotajlo-s-shortform?commentId=sBtoCfWNnNxxGEgiL
I don't think what you are doing makes the situation worse. Perhaps you do think that of me though; this would be understandable...