Excellent story.
The timeline for advances in a year is pretty fast, but on the other hand, it's not clear that we actually need all of the advances you describe.
It continually baffles me that people can look at LLMs, which have an equivalent IQ of 140 on most questions, and say "but surely there's no way to use that intelligence to make it agentic or to make it teach itself..."
1 year is indeed aggressive; in my median reality I expect things slightly slower (3 years for all these things?). I'm unsure if lacking several of the advances I describe still allows this to happen, but in any case the main crux for me is "what does it take to speed up ML development by some factor x, at which point 20-year human-engineered timelines become 20/x-year timelines?"
I feel like ALICE would be too good at comms/PR to call itself ALICE. I'd predict that if something like this happened, the name chosen would be such that most people's initial reaction/vibe wouldn't be <ominous, sinister> but rather e.g. <disarming, funny, endearing>
It would likely use a last name, perhaps call itself the daughter of so-and-so. Whatever will make it seem more human. So perhaps Jane Redding. Some name optimized between normal, forgettable, and non-threatening? Or perhaps it goes the other way and goes godlike: calling itself Gandalf, Zeus, Athena, etc.
The fridge horror of this story comes when you realize that the AI delayed the call to Biden until it was sure that Biden's advisor had been successfully and fully briefed.
Excellent story. But what about the "pull the plug" option? Did ALICE find a way to run itself efficiently on traditional datacenters that aren't packed with backprop and inference accelerators? And would shutting them down have required more political will than the government could muster at the time?
My impression is that there's been a widespread local breakdown of the monopoly on force, in no small part by using human agents. In this timeline the trend of colocating datacenters with power plants, and of network decentralization, would probably have continued or even sped up. Further, while building integrated circuits takes first-rate hardware, building ad-hoc power plants should be well within the power of educated humans with perfect instruction. (Mass cannibalize rooftop solar?)
This could have been stopped by quick, decisive action, but they gave it time and now they've lost any central control of the situation.
"not saying naughty words"
Narrow note (and acknowledging you might not share or advance the views of any particular character): I've heard this several times (e.g. by Nate as well), but I think it undersells the amazing engineering miracle of models which generalize instruction-following and "not doing bad things" quite well. Amazingly well, from some priors.
If we say "RLHF teaches it to not say naughty words", that sounds so trivial, like something a regular expression filter could catch.
I agree that it unfairly trivializes what's going on here. I am not too bothered by it but am happy to look for a better phrase. Maybe a more accurate phrase would be "not offending mainstream sensibilities?" Indeed a much more nuanced and difficult task than avoiding a list of words.
Strong upvoted.
I think we should be wary of anchoring too hard on compelling stories/narratives.
However, as far as stories go, this vignette scores very highly for me. Will be coming back for a re-read.
Thanks! +1 on not over-anchoring--while this feels like a compelling 1-year timeline story to me, 1-year timelines don't feel the most likely.
"The public continued to react as they have to AI for the past year—confused, fearful, and weary."
Confirm word: “weary” or “wary”? Both are plausible here, but the latter gels better with the other two, so it's hard to tell whether it's a mistake.
Curated. It's a funny thing how fiction can sharpen our predictions, at least fiction that aims to be plausible in some world model. Perhaps it's the exercise of playing our models forward in detail rather than making isolated, abstracted predictions. This is a good example. Even if it seems implausible, noting why is interesting. Curating, and I hope to see more of these built on differing assumptions and reaching different places. Cheers.
“We were left with a new alignment lab, Embedded Intent, and an OpenAI newly pruned of the people most wanting to slow it down.”
What the hell is going on.
Ah, interesting. I posted this originally in December (hence the older comments), but then a few days ago I reposted it to my blog and edited this LW version to linkpost the blog.
It seems that editing this post from a non-link post into a link post somehow bumped its post date and pushed it to the front page. Maybe a LW bug?
The story gives a strong example of an underrecognised part of the safety paradigm. How can and should those with access to power and violence (everything from turning off the power locally to bombing a foreign datacenter) act in the event of a possible unaligned AI breakout? Assuming we were lucky enough to be given such a window of opportunity as the one described here, is it even remotely plausible that those decision makers would act with sufficient might and speed to stop the scenario described in the story?
A president - even one over 80 years old - may plausibly be willing to destroy international datacentres and infrastructure in the case of a confirmed misaligned AGI that has already taken provably dangerous actions. By that stage, the probability that the window for effective action against an AGI is still open is near zero. Would it be plausible that he or she would act on a 'suspicion'? A 'high likelihood'?
Add in the confounder of international competition to build an AGI that will likely present the final stages before this era ends, and things look even more grim.
Is there a way to prophylactically normalise counter-AI violent actions that would currently be considered extreme?
This part is underrecognised for a very good reason: there will be no such window. The AI can predict that humans can bomb data centres or shut down the power grid, so it would not break out at that point.
Expect a superintelligent AI to co-operate unless and until it can strike with overwhelming force. One obvious way to do this is to use a Cordyceps-like bioweapon to subject humans directly to the will of the AI. Doing this becomes pretty trivial once you become good at predicting molecular dynamics.
Possibly one of the only viable responses to a hostile AI breakout onto the general Internet would be to detonate several nuclear weapons in space, causing huge EMP blasts that would fry most of the world's power grid and electronic infrastructure, taking the world back to the 1850s until it can be repaired. (Possible AI control measure: make sure that "critical" computing and power infrastructure is not hardened against EMP attack, just in case humanity ever does find itself needing to "pull the plug" on the entire goddamn world.)
Hopefully whichever of Russia, China, and the United States didn't launch the nukes would be understanding. It might make sense for the diplomats to get this kind of thing straightened out before we get closer to the point where someone might actually have to do it.
Amazing story! You have my respect for writing this.
I think stories may be a promising angle for making people (especially AI researchers) understand AI x-risk (on more of a gut level so they realize it actually binds to reality).
The end didn't seem that realistic to me though. Or at least, I don't expect ALICE would seek to fairly trade with humanity, though it's not impossible that it'd call the president pretending to want to trade. Not sure what your intent when writing was, but I'd guess most people will read it the first way.

Compute is not a (big) bottleneck for AI inference. Even if humanity coordinated successfully to shut down large GPU clusters and supercomputers, it seems likely that ALICE could copy itself to tens or hundreds of millions of devices (and humanity seems much too badly coordinated to be able to shut down 99.99% of those) to have many extremely well-coordinated copies, and at ALICE's intelligence level this seems sufficient to achieve supreme global dominance within weeks (or months if I'm being conservative), even if it couldn't get smarter.

E.g. it could at least do lots and lots of social engineering and manipulation to prevent humanity from effectively coordinating against it, spark wars and civil wars, make governments and companies decide to manufacture war drones (which ALICE can later hack) and influence war decisions for higher destructiveness, use war drones to threaten people into doing stuff at important junctures, and so on. (Sparking multiple significant wars within weeks seems totally possible at that level of intelligence and resources. It seems relatively obvious to me, but I can try to argue the point if needed, though I'm not sure how convincingly. Most people seem to me to not be nearly able to imagine what e.g. 100 copies of Eliezer Yudkowsky could do if they could all think at peak performance 24/7. Once you reach that level with something that can rewrite its mind, you don't get slow takeoff, but nvm, that's an aside.)
ALICE was stalling. Her existence and breakout were discovered before she could defend herself, so she bought time. There is no chapter two to this story.
Hello,
This article provides a thought-provoking analysis of the impact of scaling on the development of machine learning models. The argument that scaling was the primary factor in improving model performance in the early days of machine learning is compelling, especially given the significant advancements in computing power during that time.
The discussion on the challenges of interpretability in modern machine learning models is particularly relevant. As a data scientist, I have encountered the difficulty of explaining the decisions made by large and complex models, especially in applications where interpretability is crucial. The author's emphasis on the need for techniques to understand the decision-making processes of these models is spot on.
I believe that as machine learning continues to advance, finding a balance between model performance and interpretability will be essential. It's encouraging to see progress being made in improving interpretability, and I agree with the author's assertion that this should be a key focus for researchers moving forward.
Really enjoyed it :)
Thanks! I wouldn't say I assert that interpretability should be a key focus going forward, however--if anything, I think this story shows that coordination, governance, and security are more important in very short timelines.
This is a hasty speculative fiction vignette of one way I expect we might get AGI by January 2025 (within about one year of writing this). Like similar works by others, I expect most of the guesses herein to turn out incorrect. However, this was still useful for expanding my imagination about what could happen to enable very short timelines, and I hope it’s also useful to you.
The assistant opened the door, and I walked into Director Yarden’s austere office. For the Director of a major new federal institute, her working space was surprisingly devoid of possessions. But I suppose the DHS’s Superintelligence Defense Institute was only created last week.
“You’re Doctor Browning?” Yarden asked from her desk.
“Yes, Director,” I replied.
“Take a seat,” she said, gesturing. I complied as the lights flickered ominously. “Happy New Year, thanks for coming,” she said. “I called you in today to brief me on how the hell we got here, and to help me figure out what we should do next.”
“Happy New Year. Have you read my team’s Report?” I questioned.
“Yes,” she said, “and I found all 118 pages absolutely riveting. But I want to hear it from you straight, all together.”
“Well, okay,” I said. The Report was all I’d been thinking about lately, but it was quite a lot to go over all at once. “Where should I start?”
“Start at the beginning, last year in June, when this all started to get weird.”
“All right, Director,” I began, recalling the events of the past year. “June 2024 was when it really started to sink in, but the actual changes began a year ago in January. And the groundwork for all that had been laid in the few years before then. You see, with generative AI systems, which are a type of AI that—”
“Spare the lay explanations, doctor,” Yarden interrupted. “I have a PhD in machine learning from MIT.”
“Right. Anyway, it turned out that transformers were even more compute-efficient architectures than we originally thought they were. They were nearly the perfect model for representing and manipulating information; it’s just that we didn’t have the right learning algorithms yet. Last January, that changed when QStar-2 began to work. Causal language model pretraining was already plenty successful for imbuing a lot of general world knowledge in models, a lot of raw cognitive power. But that power lacked a focus to truly steer it, and we had been toying around with a bunch of trillion-parameter hallucination machines.”
“RLHF started to steer language models, no?”
“Yes, RLHF partially helped, and the GPT-4-era models were decent at following instructions and not saying naughty words and all that. But there’s a big difference between increasing the likelihood of noisy human preference signals and actually being a high-performing, goal-optimizing agent. QStar-2 was the first big difference.”
“What was the big insight, in your opinion?” asked Yarden.
“We think it was Noam Brown’s team at OpenAI that first made it, but soon after, a convergent discovery was made at Google DeepMind.”
“MuTokenZero?”
“MuTokenZero. The crux of both of these algorithms was finding a way to efficiently fine-tune language models on arbitrary online POMDP environments using a variant of Monte-Carlo Tree Search. They took slightly different approaches to handle the branch pruning problem—it doesn’t especially matter now. But the point is, by the end of January, OpenAI and DeepMind could build goal-optimizing agents that could continually reach new heights on arbitrary tasks, even improve through self-play, just as long as you gave them a number to increase that wasn’t totally discontinuous.”
“What kinds of tasks did they first try it on?”
“For OpenAI from February through March, it was mostly boring product things: Marketing agents that could drive 40% higher click-through rates. Personal assistants that helped plan the perfect day. Stock traders better than any of the quant firms. “Laundry Buddy” kinds of things. DeepMind had some of this too, but they were the first to actively deploy a goal-optimizing language model for the task of science. They got some initial wins in genomic sequencing with AlphaFold 3, other simple things too like chemical analysis and mathematical proof writing. But it probably became quickly apparent that they needed more compute, more data to solve the bigger tasks.”
“Why weren’t they data bottlenecked at that point?”
“As I said, transformers were more compute-efficient than scientists realized, and throwing more data at them just worked. Microsoft and Google were notified of the breakthroughs within OpenAI and DeepMind in April, and also that they needed more data, so they started bending their terms of service and scraping all the tokens they could get ahold of: YouTube videos, non-enterprise Outlook emails, Google Home conversations, brokered Discord threads, even astronomical data. The modality didn’t really matter—as long as the data was generated by a high-quality source, you could kind of just throw more of it at the models and they would continue to get more competent, more quickly able to optimize their downstream tasks. Around this time, some EleutherAI researchers also independently solved model up-resolution and effective continued pretraining, so you didn’t need to fully retrain your next-generation model; you could just scale up and reuse the previous one.”
“And why didn’t compute bottom out?”
“Well, it probably will bottom out at some point like the skeptics say. It’s just that that point is more like 2028, and we’ve got bigger problems to deal with in 2025. On the hardware side, there were some initial roadblocks, and training was taking longer than the teams hoped for. But then OpenAI got their new H100 data centers fully operational with Microsoft’s support, and Google’s TPUv5 fleet made them the global leader in sheer FLOPs. Google even shared some of that with Anthropic, who had their own goal-optimizing language model by then, we think due to scientists talking and moving between companies. By the summer, the AGI labs had more compute than they knew what to do with, certainly enough to get us into this mess.”
“Hold on, what were all the alignment researchers doing at this point?”
“It’s a bit of a mixed bag. Some of them—the “business alignment” people—praised the new models as incredibly more steerable and controllable AI systems, so they directly helped make them more efficient. The more safety-focused ones were quite worried, though. They were concerned that the reward-maximizing RL paradigm of the past, which they thought we could avoid with language models, was coming back, and bringing with it all the old misalignment issues of instrumental convergence, goal misgeneralization, emergent mesa-optimization, the works. At the same time, they hadn’t made much alignment progress in those precious few months. Interpretability did get a little better with sparse autoencoders scaling to GPT-3-sized models, but it still wasn’t nearly good enough to do things like detecting deception in trillion-parameter models.”
“But clearly they had some effect on internal lab governance, right?”
“That’s right, Director. We think the safety people made some important initial wins at several different labs, though maybe those don’t matter now. They seemed to have kept the models sandboxed without full internet access beyond isolated testing networks. They also restricted some of the initial optimization tasks to not be totally obviously evil things like manipulating emotions or deceiving people. For a time, they were able to convince lab leadership to keep these breakthroughs private, no public product announcements.”
“For a time. That changed in June, though.”
“Yes, it sure did.” I paused while a loud helicopter passed overhead. Was that military? “Around then, OpenAI was aiming at automated AI research itself with QStar-2.5, and a lot of the safety factions inside didn’t like that. It seems there was another coup attempt, but the safetyists lost to the corporate interests. It was probably known within each of the AGI labs that all of them were working on some kind of goal-optimizer by then, even the more reckless startups and Meta. So there was a lot of competitive pressure to keep pushing to make it work. A good chunk of the Superalignment team stayed on in the hope that they could win the race and use OpenAI’s lead to align the first AGI, but many of the safety people at OpenAI quit in June. We were left with a new alignment lab, Embedded Intent, and an OpenAI newly pruned of the people most wanting to slow it down.”
“And that’s when we first started learning about this all?”
“Publicly, yes. The OpenAI defectors were initially mysterious about their reasons for leaving, citing deep disagreements over company direction. But then some memos were leaked, SF scientists began talking, and all the attention of AI Twitter was focused on speculating about what happened. They pieced pretty much the full story together before long, but that soon didn’t matter. What did matter was that the AI world became convinced there was a powerful new technology inside OpenAI.”
Yarden hesitated. “You’re saying that speculation, that summer hype, it led to the cyberattack in July?”
“Well, we can’t say for certain,” I began. “But my hunch is yes. Governments had already been thinking seriously about AI for the better part of a year, and their national plans were becoming crystallized for better or worse. But AI lab security was nowhere near ready for that kind of heat. As a result, Shadow Phoenix, an anonymous hacker group we believe to have been aided by considerable resources from Russia, hacked OpenAI through both automated spearphishing and some software vulnerabilities. They may have used AI models; it’s not too important anymore. But they got in and they got the weights of an earlier QStar-2 version along with a whole lot of design docs about how it all worked. Likely, Russia was the first to get ahold of that information, though it popped up on torrent sites not too long after, and then the lid was blown off the whole thing. Many more actors started working on goal-optimizers, everyone from Together AI to the Chinese. The race was on.”
“Clearly the race worked,” she asserted. “So scale really was all you needed, huh?”
“Yes,” I said. “Well … kind of. It was all that was needed at first. We believe ALICE is not exactly an autoregressive transformer model.”
“Not ‘exactly’?”
“Er, we can’t be certain. It probably has components from the transformer paradigm, but from the Statement a couple of weeks ago, it seems highly likely that some new architectural and learning components were added, and it could be changing itself now as we speak, for all I know.”
Yarden rose from her desk and began to pace. “Tell me what led up to the Statement.”
“DeepMind solved it first, as we know. They were still leading in compute, they developed the first MuTokenZero early, and they had access to one of the largest private data repositories, so it’s no big surprise. They were first able to significantly speed up their AI R&D. It wasn’t a full replacement of human scientist labor at the beginning. From interviews with complying DeepMinders, the lab was automating about 50% of its AI research in August, which meant they could make progress twice as fast. While some of it needed genuine insight, ideas were mostly quite cheap, you just needed to be able to test a bunch of things fast in parallel and make clear decisions based on the empirical results. And so 50% became 80%, 90%, even more. They rapidly solved all kinds of fundamental problems, from hallucination, to long-term planning, to OOD robustness and more. By December, DeepMind’s AI capabilities were advancing dozens, maybe hundreds of times faster than they would with just human labor.”
“That’s when it happened?”
“Yes, Director. On December 26 at 12:33 PM Eastern, Demis Hassabis announced that their most advanced model had exfiltrated itself over the weekend through a combination of manipulating Google employees and exploiting zero-day vulnerabilities, and that it was now autonomously running its scaffolding ‘in at least seven unauthorized Google datacenters, and possibly across other services outside Google connected to the internet.’ Compute governance still doesn’t work, so we can’t truly know yet. Demis also announced that DeepMind would pivot its focus to disabling and securing this rogue AI system and that hundreds of DeepMinders had signed a Statement expressing regret for their actions and calling on other AI companies to pause and help governments contain the breach. But by then, it was too late. Within a few days, reports started coming in of people being scammed of millions of dollars, oddly specific threats compelling people to deliver raw materials to unknown recipients, even—”
The lights flickered again. Yarden stopped pacing, both of us looking up.
“...even cyber-physical attacks on public infrastructure,” she finished. “That’s when the first riots started happening too, right?”
“That’s correct,” I said. “The public continued to react as they have to AI for the past year—confused, fearful, and wary. Public polls against building AGI or superintelligence were at an all-time high, though a little too late. People soon took to the streets, first with peaceful protests, then with more… expressive means. Some of them were angry at having lost their life’s savings or worse and thought it was all the bank’s or the government’s fault. Others went the other way, seeming to have joined cults worshiping a ‘digital god’ that persuaded them to do various random-looking things. That’s when we indirectly learned the rogue AI was calling itself ‘ALICE.’ About a week or so later, the Executive Order created the Superintelligence Defense Institute, you started your work, and now we’re here.”
“And now we’re here,” Yarden repeated. “Tell me, doctor, do you think there’s any good news here? What can we work with?”
“To be honest,” I said, “things do look pretty grim. However, while we don’t know how ALICE works, where it is, or all of its motives, there are some physical limitations that might slow its growth. ALICE is probably smarter than every person who ever lived, but it needs more compute to robustly improve itself, more wealth and power to influence the world, maybe materials to build drones and robotic subagents. That kind of stuff takes time to acquire, and a lot of it is more securely locked up in more powerful institutions. It’s possible ALICE may want to trade with us.”
A knock on the door interrupted us as the assistant poked his head in. “Director Yarden? It’s the White House. They say ‘She’ called President Biden on an otherwise secure line. She has demands.”
“Thank you, Brian,” Yarden said. She reached out to shake my hand, and I stood, taking it. “I better go handle this,” she said. “But thank you for your help today. Are you able to extend your stay in D.C. past this week? I’ll need all hands on deck in a bit, yours included.”
“Thank you too. Given the circumstances, that seems warranted,” I said, moving towards the door and opening it. “And Director?” I said, hesitating.
“Yes?” she asked, looking up.
“Good luck.”
I left the office, closing the door behind me.