Small typo: A log(B) = log(B^A), not log(A^B).
Also, there's a good chance AI gov won't work, and labs will just have a very limited safety budget to implement their best-guess mitigations. And even if AI gov does work and we get a large budget, we still need to actually solve alignment.
Thanks for writing this!
For what it's worth, my view is that we're very likely to be wrong about the specific details in both of the endings -- they are obviously super conjunctive. I don't think there's any way around this, because we can be confident AGI is going to cause some ex-ante surprising things to happen.
Also, this scenario is around my 20th percentile for timelines; my median is the early 2030s (though other authors disagree with me). I also feel much more confident about the pre-2027 scenario than about the post-2027 scenario.
Is your disagreement that you think AGI will happen later, or that you think the effects of AGI on the world will look very different, or both? If it's just the timelines, we might have fairly similar views.
This wasn't intended as humor. In the scenario, we write:
(To avoid singling out any one existing company, we’re going to describe a fictional artificial general intelligence company, which we’ll call OpenBrain. We imagine the others to be 3–9 months behind OpenBrain.)
I think that OpenAI, GDM, and Anthropic are in the lead and are the most likely to be ahead, with roughly similar probability for each.
Thank you! We actually tried to write one that was much closer to a vision we endorse! The TLDR overview was something like:
Unfortunately, it was hard to write this out in a way that felt realistic.
The next major project I focus on is likely going to be thinking through the right governance interventions here to make that happen. I'm probably not going to do this in scenario format (instead something closer to normal papers and blog posts), but I'd be curious for thoughts.
Thanks - I see, I was misunderstanding.
Proposal part 1: Shoggoth/Face Distinction: Instead of having one model undergo agency training, we have two copies of the base model work together, specializing in different parts of the job, and undergo training together. Specifically we have the "shoggoth" copy responsible for generating all the 'reasoning' or 'internal' CoT, and then we have the "face" copy responsible for the 'actions' or 'external' outputs. So e.g. in a conversation with the user, the Shoggoth would see the prompt and output a bunch of reasoning token CoT; the Face would see the prompt + all that CoT and then output a message to pass to the user, and then the process would repeat.
Proposal part 2: Blind the Evaluation Process to the Internal Reasoning: Remember that reward process that evaluates overall performance? Hide from it all the outputs of the Shoggoth. Just show it the outputs from the Face.
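To check I'm parsing parts 1 and 2 correctly, here is the shape of the training step I take this to be describing (a rough sketch; all the names are placeholders of mine, not anything from the proposal):

```python
# Sketch of one Shoggoth/Face rollout as described above.
# `shoggoth`, `face`, and `evaluator` are placeholder objects.

def rollout(prompt, shoggoth, face, evaluator):
    # Part 1: the Shoggoth copy writes all the 'internal' reasoning CoT.
    cot = shoggoth.generate(prompt)

    # The Face copy sees the prompt plus that CoT and writes the
    # 'external' output that actually goes to the user/environment.
    action = face.generate(prompt + cot)

    # Part 2: the evaluation process is blind to the Shoggoth's CoT;
    # it only ever scores what the Face produced.
    reward = evaluator.score(prompt, action)

    # (Whether and how the Shoggoth's weights get updated from this
    # reward is the part I ask about below.)
    return cot, action, reward
```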
I don't see how this proposal could get you AGI-level capabilities. You need the "shoggoth" part of the model to gain general purpose capabilities somehow, and if you are completely shielding the shoggoth from RL, all of the capabilities have to come from pre-training, which seems unlikely. My guess is that a more realistic proposal involves having separate reward models for the two parts:
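Concretely, I'm imagining the same rollout shape but with two reward signals instead of one -- something like the sketch below (the names and the exact split between the two signals are placeholders, not a worked-out proposal):

```python
# Guessed variant: separate reward models for the two copies.

def training_step(prompt, shoggoth, face, capability_rm, alignment_rm):
    cot = shoggoth.generate(prompt)
    action = face.generate(prompt + cot)

    # The Shoggoth is updated on an outcome/capability signal (did the
    # task actually get done?), so it can still gain general-purpose
    # skills from RL rather than relying entirely on pre-training.
    shoggoth_reward = capability_rm.score(prompt, action)

    # The Face is updated on a separate signal that only judges the
    # external output, e.g. honesty/harmlessness as seen by the user.
    face_reward = alignment_rm.score(prompt, action)

    return shoggoth_reward, face_reward
```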
But it isn't clear to me how to separate out alignment from capabilities. IIUC, the hope is that the capabilities training would naturally incentivize the shoggoth to have bad goals, and perhaps to instrumentally deceive the humans, but to do this deception in the CoT in English so we can see it.
You later write:
How it all (hopefully) fits together: First, because of the paraphraser, the Shoggoth will not learn to speak in some optimized alien language, or even to speak in jargon or euphemisms -- because semantics not recognized by the paraphraser will be scrambled by it. So the Shoggoth will continue to 'think in English' even as it undergoes lots of agency training and learns substantial new skills.
Second, because the training process is blind to the outputs of the Shoggoth (and instead only sees what the Face outputs), there should be no pressure on the Shoggoth to learn to deceive, conceal, dissemble, doublethink, etc. Instead the Shoggoth will be 'free' to put all its important thoughts in plain English.
Which suggests we are training the shoggoth with agency training. But I don't understand how this is consistent with "the training process is blind to the outputs of the Shoggoth". Did you mean just some "alignment" subset of the training process?
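For reference, here is where I'm picturing the paraphraser sitting in that loop (with the assumption, which may be wrong, that downstream steps condition on the paraphrased CoT rather than the raw CoT):

```python
# Sketch of the paraphraser step; `paraphraser` is a placeholder.

def paraphrased_cot(prompt, shoggoth, paraphraser):
    raw_cot = shoggoth.generate(prompt)
    # The paraphraser rewrites the CoT; jargon, euphemisms, or encodings
    # it doesn't recognize get scrambled, so there is no payoff for the
    # Shoggoth in drifting away from plain English.
    return paraphraser.rewrite(raw_cot)
```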
I think a problem with all the proposed terms is that they are all binaries, and one bit of information is far too little to characterize takeoff:
So I don't really think any of the binaries are all that useful for thinking or communicating about takeoff. I don't have a great ontology for takeoff to suggest instead, but in communication I generally just try to define a start point and an end point and then say quantitatively how long the gap between them might be. One of the central intervals I really care about is the time between wakeup and takeover capable AIs.
wakeup = "the first period in time when AIs are sufficiently capable that senior government people wake up to incoming AGI and ASI"
takeover capable AIs = "the first time there is a set of AI systems that are coordinating together and could take over the world if they wanted to"
The reason to think about this period is that (kind of by construction) it's the window in which unprecedented government actions that matter could happen. So when planning for that sort of thing, the length of this window really matters.
Of course, the start and end points I think about are both fairly vague. They also aren't purely a function of AI capabilities; they depend on things like "who is in government" and "how capable our institutions are at fighting a rogue AGI". Also, many people believe that we will never get takeover capable AIs, even at superintelligence.
This post seems systematically too slow to me, and to underrate the capabilities of superintelligence. One particular point of disagreement:
When I read this, I thought it would say "lower bound". Why would you expect evolution to find globally optimal doubling times? This reads to me a bit like saying that the speed of a cheetah or the size of a blue whale will be an upper bound on the speed/size of a robot. Why???
The case for a lower bound seems clear: biology did it, and a superintelligence could probably design a more functional robot than biology did.