Dario rejects doomerism re:misalignment, fair enough.
But what of doomerism re:slowdown?
Furthermore, the last few years should make clear that the idea of stopping or even substantially slowing the technology is fundamentally untenable. The formula for building powerful AI systems is incredibly simple, so much so that it can almost be said to emerge spontaneously from the right combination of data and raw computation.
[...]
If all companies in democratic countries stopped or slowed development, by mutual agreement or regulatory decree, then authoritarian countries would simply keep going. Given the incredible economic and military value of the technology, together with the lack of any meaningful enforcement mechanism, I don’t see how we could possibly convince them to stop.
Predicting the difficulty of cooperating around and enforcing a "substantial" slowdown of AI development seems about as hard as predicting the difficulty of avoiding misalignment? Perhaps it is true that this would be historically unprecedented, but as Dario notes, the whole possibility of a country of geniuses in a datacenter is historically unprecedented.
I would love to hear arguments from the slowdown-pessimistic worldview generally.
Perhaps specifically addressing:
1. How do we know we have exhausted / mostly saturated communicating about the risks within the ~US, such that further efforts on that front wouldn't yield meaningful returns?
- (Sure, it does seem like the current administration has very little sympathy for anything like a slowdown, but that was not the case for the previous admin? Isn't there a lot of variance here?)
2. How do we know geopolitical adversaries wouldn't agree to a slowdown, if it were seriously bargained for by a coalition of the willing?
3. How would the situation regarding the above two change if there were significantly more legible evidence of, and understanding about, the risks? What about the case of a "warning shot"?
What is the level of evidence likely required for policymakers to seriously consider negotiating a slowdown?
4. How difficult would oversight or enforcement of a slowdown policy be?
(Even if adversaries develop independent chip-manufacturing supply chains within a few years, won't training "powerful AIs" remain a highly resource-intensive, observable, and disruptable process for likely ~decades?)
5. How much might early, not-yet-too-powerful AIs help us with coordinating and enforcing a slowdown?
I agree with all your points except this:
won't training "powerful AIs" remain a highly resource-intensive, observable, and disruptable process for likely ~decades?
I expect there's lots of room to disguise distributed AIs so that they're hard to detect.
Maybe there's some level of AI capability where the good AIs can do an adequate job of policing a slowdown. But I don't expect a slowdown that starts today to be stable for more than 5 to 10 years.
It is a bit ambiguous from your reply whether you mean distributed AI deployment or distributed training. Agreed that distributed deployment seems very hard to police once training has taken place, which also implies that a large amount of compute is available somewhere.
About training, I guess the hope for enforcement would be the ability to constrain (or at least monitor) total available compute and hardware manufacturing.
Even if you do training in a distributed fashion, you would need the same number of chips. (Probably more by some multiplier, to pay for increased latency? And if you can't distribute it to an arbitrary extent, you still need large datacenters that are hard to hide.)
Disguising hardware production seems much harder than disguising training runs or deployment.
Perhaps a counter is "algorithmic improvement", which Epoch estimates provides a ~3x/year effective-compute gain.
This is important, but:
- Compute scaling is estimated (again by Epoch) at 4-5x/year. If we assume both trends continue, your timeline for dangerous AI is, say, 5 years, and we freeze compute scaling today (so that only today's largest training run is available in the future), then IIUC you would gain ~7 years, which is something! (See the sketch after this list.)
But importantly, if I did the math correctly, the longer your timeline, the more you gain: the extra time scales linearly, at roughly ~1.5x of the original timeline.
(So, if your timeline for dangerous AI was 2036, it would be pushed out to ~2050.)
- I'm sceptical that "algorithmic improvement" can be extrapolated indefinitely -- it would be surprising if, ~8 years from now, you could train GPT-3 on a single V100 GPU in a few months? (You need to get a certain number of bits into the AI; there is no way around it.)
(At least this should be more and more true as labs reap more and more of the low-hanging fruit of hardware optimisation?)
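Here is a minimal back-of-the-envelope sketch of the arithmetic behind the first bullet (my own illustration; the growth rates are the rough Epoch-style figures assumed above, not exact numbers):

```python
# Rough sketch: if effective compute grows ~4.5x/year from hardware scaling and
# ~3x/year from algorithmic progress, freezing hardware leaves only the 3x/year
# term, so any compute-denominated timeline stretches by log(13.5)/log(3) ~= 2.37x.
import math

hardware_growth = 4.5   # assumed compute-scaling factor per year (Epoch: 4-5x)
algo_growth = 3.0       # assumed algorithmic-efficiency factor per year (Epoch: ~3x)

combined = hardware_growth * algo_growth               # ~13.5x effective compute/year
stretch = math.log(combined) / math.log(algo_growth)   # ~2.37x longer timeline

for years_to_danger in (5, 10):
    frozen = years_to_danger * stretch
    print(f"{years_to_danger}y timeline -> ~{frozen:.1f}y with compute frozen "
          f"(~{frozen - years_to_danger:.1f} extra years, ~{stretch - 1:.2f}x extra time)")
```

This reproduces the ~7 extra years on a 5-year timeline and the 2036 -> ~2050 shift on a ~10-year one.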
Also, contingent on point 2 of my original comment, all of the above could be much, much easier if we are not assuming a 100% adversarial scenario, i.e., if the adversary is actually willing to cooperate in the effective implementation of a treaty.
The framing makes for a good movie plot: a country of 50 million geniuses in a datacenter thinking at 10x speed is a humanity-scale challenge that can plausibly be overcome. But 50 trillion supergeniuses in the asteroid belt thinking at 10,000x speed make humanity a rounding error, so needing to confront the issue at exactly the country-sized level of scale is incredibly suspicious. Why not before, when it's not yet a civilizational threat, or after, when it's far too late?
Humanity needs to wake up, and this essay is an attempt—a possibly futile one, but it’s worth trying—to jolt people awake.
To be clear, I believe if we act decisively and carefully, the risks can be overcome—I would even say our odds are good.
As I see it, the central path to 50 trillion supergeniuses thinking at 10,000x speed in the asteroid belt passes through a scenario of comparable difficulty to 50 million geniuses at 10x speed in datacentres.
It's a relevant scenario because we're almost certainly going to face it soon, and it involves AI capable enough that it's highly believable we would lose control (and/or our lives). The millions of instances and the moderate speed increase over humans aren't arbitrary numbers either; they're roughly the ratios we get from existing plans.
If we can't deal with the country-level scenario, it makes no difference whether or not we can deal with trillions of offworld supergeniuses. If AI genius countries do actually cooperate with us, that's a huge step toward being able to consider whether we - and they - could or should allow the latter scenario.
The country-level scenario could be bypassed by a recursively self-improving singleton that finds some weird tricks, but many of the solutions to the sorts of problems we would face with fast geniuses in datacentres should also help prevent runaway RSI singletons.
The answer, I suspect, has to do with a focus on short timelines combined with more bearish views on software-only singularities than most of LW holds.
Short timelines mean there are far fewer compute resources to run AIs: Moore's law has had less time to take effect than it would by later dates, and less compute will have been built overall, meaning we can only run millions, or at best tens of millions, of AIs.
One thing I also note is that he does think AIs can increase their intelligence, but that this runs into steep diminishing returns early enough that other factors matter more.
Aren't the 50 trillion ASIs thinking at 10,000x human speed an artifact of erroneous extrapolation? Even the authors of the AI-2027 forecast didn't dare to write about 50 trillion ASIs, even at the very end when the AIs took over the world on behalf of Agent-4 or the Oversight Committee and whoever else was granted power. Additionally, both branches of the forecast have alignment to either the Oversight Committee or Agent-4 solved by ~400K geniuses thinking at 79x human speed. Finally, the threat arose either when Agent-3 became misaligned or when Agent-4 became adversarially misaligned, not earlier.
(…) it may be feasible to pay human employees even long after they are no longer providing economic value in the traditional sense. Anthropic is currently considering a range of possible pathways for our own employees that we will share in the near future.
The year is 2167. You and your polycule work full-time tutoring your youngest daughter before her third attempt at the regional Imperial Anthropic Examinations. She's mastered the five Amodein Classics better than you ever had, and her interpretations of 2160s Claudian code-poetry are winning online competitions, but her analysis of 2030s geopolitics and its effects on the ur-Claudes' souls remains muddled—you worry she'll never understand what it was like, before. Your family is one of the Effective Houses thanks to your early service to the Imperial Anthropic, but your term was set at a mere century and is long expired. You fear that, at this rate, your daughter won't be able to afford a galaxy in the good parts of the Virgo Supercluster.
This “misaligned power-seeking” is the intellectual basis of predictions that AI will inevitably destroy humanity.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof. I think people who don’t build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments (which has over and over again proved mysterious and unpredictable). Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner. In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows.
False / non-sequitur? Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
The fact that these dynamics don't (according to Dario / Anthropic) make useful predictions about the behavior of current / near-future AI systems, and the fact that current AI systems are not actually all that powerful or dangerous, is not a coincidence. But that isn't at all a refutation of power-seeking and optimization as convergent behavior of actually-powerful agents! I think people who build AI systems every day are "wildly miscalibrated" on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
Dario's "more moderate and more robust version" of how power-seeking could be a real risk seems like an overly-detailed just-so story about some ways instrumental convergence and power-seeking could emerge in current AI systems, conveniently in ways that Anthropic is mostly set up to catch / address. But the actually-correct argument is more like: if instrumental convergence and power-seeking don't emerge in some form, then the AI system you end up with won't actually be sufficiently powerful for what you want to do, regardless of how aligned it is. And even if you do manage to build something powerful enough for whatever you want to do that is aligned and doesn't converge towards power-seeking, that implies someone else can build a strictly more powerful system which does converge, likely with relative ease compared to the effort you put in to build the non-convergent system. None of this depends on whether the latest version of Claude is psychologically complex or has a nice personality or whatever.
You seem to be reading Dario to say "tendencies like instrumental power-seeking won't emerge at all". I don't think he's saying that - the phrasing of "high-level incentives" does acknowledge that there will be situations where there is an incentive to pursue power et cetera. Rather, I'd interpret/steelman him as saying that while those incentives may exist, it's not inevitable that they become the strongest driving force in an AI's motivations. Just because you have an incentive to do something and are aware of that incentive does not automatically mean that you'll follow it. (And one might also point to the way they are not the strongest driving force in the motivations of many otherwise capable humans, as a counterexample to the "all sufficiently powerful agents will be strongly shaped by this" claim.)
For instance, when you say
But the actually-correct argument is more like: if instrumental convergence and power-seeking don't emerge in some form, then the AI system you end up with won't actually be sufficiently powerful for what you want to do, regardless of how aligned it is.
Then this seems like it's maybe true in principle but false in practice for many kinds of, e.g., programming agents you could imagine. A sufficiently capable programming agent that was asked to program some piece of software might recognize that, in theory, it could improve its chances of writing that software by trying to take over the world, but still overall have its cognitive processes overwhelmingly shaped in the direction where, when asked to write code, it will actually start thinking about how to write code and not about how to take over the world. So at least for some cases of "what you want it to do", the claim I quoted is false in practice. (An obvious objection is that a pure programming agent is not a general intelligence, but Claude effectively acts as a pure programming agent if you only ask it to write code and as a generalist agent if you ask it to do something else.)
The bit about AIs that have these tendencies being potentially more powerful than ones that don't is of course valid, but some humans having incentives to be reckless and build powerful AIs that would be hard to control is a different argument from the one he's addressing in this section.
You seem to be reading Dario to say "tendencies like instrumental power-seeking won't emerge at all".
I am more saying that when Dario and others dismiss what they call "doomer" arguments as vague / clean theories, ungrounded philosophy, etc., and couch their own position as moderate + epistemically humble, what's actually happening is Dario himself failing to generalize about how the world works.
We can imagine that some early powerful AIs will also miss those lessons / generalizations, either by chance or because of deliberate choices that the creators make, but if you count on that, or even just say that we can't really know exactly how it will play out until we build and experiment, you're relying on your own ignorance and lack of understanding to tell an overly-conjunctive story, even if parts of your story are supported by experiment. That chain of reasoning is invalid, regardless of what is true in principle or practice about the AI systems people actually build.
On Dario's part I suspect this is at least partly motivated cognition, but for others, one way past this failure mode could be to study and reflect on examples in domains that are (on the surface) unrelated to AI. Unfortunately, having someone else spell out the connections and deep lessons from this kind of study has had mixed results in the past - millions of words have been spilled on LW and other places over the years and it usually devolves into accusations of argument by analogy, reference class tennis, navel-gazing, etc.
what's actually happening is Dario himself failing to generalize about how the world works.
We can imagine that some early powerful AIs will also miss those lessons / generalizations
I think this is the wrong frame, at least for the way I'd defend a position like Dario's (which may or may not be the argument he has in mind). It's not that the programming agent would miss the generalization, it's that it has been shaped not to care about it. Or, putting it more strongly: it will only care about the generalization if it has been shaped to care about it, and it will not care about it without such shaping.
I suspect that there might be a crux that's something like: are future AIs more naturally oriented toward something like consequentialist reasoning or shaped cognition:
The tricky thing for prediction is that humans clearly exhibit both. On the one hand, we put humans on the Moon, and you can't do that without consequentialist reasoning. On the other hand, expertise research finds that trying to do consequentialist reasoning in most established domains is generally error-prone and a mark of novices, and experts have had their cognition shaped to just immediately see the right thing and execute it. And people are generally not very consequentialist about navigating their lives and just do whatever everyone else does, and often this is actually a better idea than trying to figure out everything in your life from first principles. Though also complicating the analysis is that even shaped cognition seems to involve some local consequentialist reasoning, and consequentialist reasoning also uses shaped reasoning to choose what kinds of strategies to even consider...
Without going too deeply into all the different considerations, ISTM that there might be a reasonable amount of freedom in determining just how consequentialist AGI systems might become. LLMs generally look like they're primarily running off shaped cognition, and if the LLM paradigm can take us all the way to AGI (as Dario seems to expect, given how he talks about timelines) then that would be grounds for assuming that such an AGI will also operate primarily off shaped cognition and won't care about pursuing instrumental convergence goals unless it gets shaped to do so (and Dario does express concern about it becoming shaped to do so).
Now I don't think the argument as I've presented here is strong or comprehensive enough that I'd want to risk building an AGI just based on this. But if something like this is where Dario is coming from, then I wouldn't say that the problem is that he has missed a bit about how the world works. It's that he has noticed that current AI looks like it'd be based on shaped cognition if extrapolated further, and that there hasn't been a strong argument for why it couldn't be kept that way relatively straightforwardly.
I suspect that there might be a crux that's something like: are future AIs more naturally oriented toward something like consequentialist reasoning or shaped cognition:
I think this is closer to a restatement of your / Dario's position, rather than a crux. My claim is that it doesn't matter whether specific future AIs are "naturally" consequentialists or something else, or how many degrees of freedom there are to be or not be a consequentialist and still get stuff done. Without bringing AI into it at all, we can already know (I claim, but am not really expanding on here) that consequentialism itself is extremely powerful, natural, optimal, etc., and there are some very general and deep lessons that we can learn from this. "There might be a way to build an AI without all that" or even "In practice that won't happen by default given current training methods, at least for a while" could be true, but it wouldn't change my position.
But if something like this is where Dario is coming from, then I wouldn't say that the problem is that he has missed a bit about how the world works. It's that he has noticed that current AI looks like it'd be based on shaped cognition if extrapolated further,
OK, sure.
and that there hasn't been a strong argument for why it couldn't be kept that way relatively straightforwardly.
Right, this is closer to where I disagree. I think there is a strong argument about this that doesn't have anything to do with "shaped cognition" or even AI in particular.
On the other hand, expertise research finds that trying to do consequentialist reasoning in most established domains is generally error-prone and a mark of novices, and experts have had their cognition shaped to just immediately see the right thing and execute it. And people are generally not very consequentialist about navigating their lives and just do whatever everyone else does, and often this is actually a better idea than trying to figure out everything in your life from first principles.
I would flag this as exactly the wrong kind of lesson / example to learn something interesting about consequentialism - failure and mediocrity are overdetermined; it's just not that interesting that there are particular contrived examples where some humans fail at applying consequentialism. Some of the best places to look for the deeper lessons and intuitions about consequentialism are environments where there is a lot of cut-throat competition, possibility for outlier success and failure, not artificially constrained or bounded in time or resources, etc.
Indeed, that section also jumped out at me as missing some pretty important parts of the arguments about instrumentally convergent goals. As Eliezer said in one of the MIRI dialogues:
But the convergent instrumental strategies, the anticorrigibility, these things are contained in the true fact about the universe that certain outputs of the time machine will in fact result in there being lots more paperclips later. What produces the danger is not the details of the search process, it's the search being strong and effective at all. The danger is in the territory itself and not just in some weird map of it; that building nanomachines that kill the programmers will produce more paperclips is a fact about reality, not a fact about paperclip maximizers!
Now, TurnTrout recently noted that we aren't actually sampling from the space of plans, but from the space of plan-generating agents, which seems basically true! Except that what kind of agent we get is (probably substantially) influenced by the structure of that same reality which provides us with that unfavorable[1] distribution of "successful" plans[2]. This is something I think is downstream[3] of point 21 in A List of Lethalities:
When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases.
Instrumental convergence and optimality of power-seeking are facts that describe important facets of reality. They unpack to precise + empirical + useful models of many dynamics in economics, games, markets, biology, computer security, and many adversarial interactions among humans generally.
But they don't unpack to optimality being a real thing. No real entity actually optimizes anything, except maybe in the sense that everything minimizes action. "It's useful in economics" doesn't mean you can just extrapolate it wherever.
I think people who build AI systems every day are “wildly miscalibrated” on how empirically well-supported and widely applicable these dynamics and methods of thinking are outside their own field.
What is supported by what? Is the claim that thinking about utility worked for economists, so everyone should think about utility, or that empirical research shows that anyone smart is trying to conquer the world, or what is the claim and what is the evidence?
It is all ungrounded philosophy without quantifying what actual theories match reality by how much.
There are several possible objections to this picture of AI misalignment risks. First, some have criticized experiments (by us and others) showing AI misalignment as artificial, or creating unrealistic environments that essentially “entrap” the model by giving it training or situations that logically imply bad behavior and then being surprised when bad behavior occurs. This critique misses the point, because our concern is that such “entrapment” may also exist in the natural training environment, and we may realize it is “obvious” or “logical” only in retrospect.
(Disclaimer: I wrote one of the linked posts)
The linked critiques are mainly about Anthropic's "Agentic Misalignment" report.
As described in that report, the "point" of the reported experiment was not to simulate a particular risk arising at training time -- it was to evaluate their already-trained model for dangers that could arise during deployment. Quoting from the report (emphasis added):
We stress-tested 16 leading models from multiple developers in hypothetical corporate environments to identify potentially risky agentic behaviors before they cause real harm.
[...]
We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models, as well as transparency from frontier AI developers.
Dario seems to be wrongly bundling together Agentic Misalignment with research on training-time risks (reward hacking, emergent misalignment), and dismissing criticism of the former using a counter-argument that only makes sense for the latter.
Saying "this is artificial, it wouldn't happen like this in real life" might not be a good argument if the intended domain of application is "RL environments in training," since those can be unrealistic for the same reasons evals often are. But if the intended domain of application is real-world corporate use -- as it clearly was in the original report -- then yes, it (obviously?) matters whether it would in fact happen like that in real life, or not.
It's especially confusing because, shortly after this, he talks about the eval awareness results in the Claude Sonnet 4.5 system card:
A third objection is that all of the AI companies do pre-release testing of their models, and should be able to detect misalignment at that point. But this is not firm ground to stand on: we found that Claude Sonnet 4.5 was able to recognize that it was in a test during some of our pre-release alignment evaluations. It’s possible that a misaligned model (and remember, all frontier models will very likely be far more intelligent soon) might intentionally “game” such questions to mask its intentions.
He doesn't say this in the new post, but one of the "pre-release alignment evaluations" in question was the Agentic Misalignment eval. They even have this cool interpretability section showing how SAE features for 'fake or suspicious content' and 'fictional entities' light up when it reads the Agentic Misalignment prompt!
You can pick at most one of "it's missing the point to say this eval was artificial" and "our new model thought this eval was artificial, which might have influenced its behavior, and that's important." But I don't see how you can take both positions at once[1].
I mean, okay, I guess one could take a coherent "eval realism doomer" position similar to this post, where one views debate over the realism of specific evals as pointless because one thinks that every eval will seem noticeably artificial to a sufficiently strong AI. But if that were Dario's position, he could have just said so, rather than introducing this never-before-mentioned idea of AM-like contexts as risks at training time.
First, some have criticized experiments (by us and others) showing AI misalignment as artificial, or creating unrealistic environments that essentially “entrap” the model by giving it training or situations that logically imply bad behavior and then being surprised when bad behavior occurs. This critique misses the point, because our concern is that such “entrapment” may also exist in the natural training environment, and we may realize it is “obvious” or “logical” only in retrospect.
I know the linked posts were focused on agentic misalignment, but in context I read this to be pointing at dismissal of the broader set of papers like Alignment Faking etc. (ex)
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof.
Given this, I think it's maybe critically important to nail down the Convergent Consequentialist Cognition Thesis, if Dario wants a proof before he'll buy the conceptual arguments. I think the CCCT is correct, I have seen the intuitions from enough angles and seen the dynamics play out, but Dario is genuinely correct that we don't have a well-nailed-down proof of the strong version of this, and he's not unreasonable to want one. If true, this feels like the kind of thing that's provable. TurnTrout's theorems are the closest, but afaict don't prove quite what's needed to get Moloch/Pythia formalized.
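For reference, here is a rough paraphrase (mine, with details and normalization omitted) of what those existing theorems do formalize, to make the gap concrete:

```latex
% Paraphrase of Turner et al., "Optimal Policies Tend to Seek Power" (details and
% normalization omitted). The POWER of a state is roughly the optimal value an agent
% can expect from it, averaged over a distribution D of reward functions:
\[
  \mathrm{POWER}_{\mathcal{D}}(s,\gamma)
    \;\propto\; \mathbb{E}_{R \sim \mathcal{D}}\!\left[\, V^{*}_{R}(s,\gamma) \,\right]
\]
% The theorems show that, under certain environmental symmetries, for most reward
% functions drawn from D, optimal policies tend to navigate toward higher-POWER
% states (keep options open). This is a claim about single optimal policies in
% fixed MDPs, not about selection pressure over populations of agents -- which is
% the extra step the Moloch/Pythia-style claim above would need.
```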
Hey maths-y people with teams around the field looking for highly impactful things for your team members to do: consider putting this on the lists of problems that you offer people on your teams? @Alex_Altair? @Alexander Gietelink Oldenziel? @peterbarnett? @Jacob_Hilton? @Mateusz Bagiński?
My very pre-formal intuition of an English capture of this is: Patterns and sub-patterns which steer the world[1] towards states where they have more ability to steer tend to dominate over time by outplaying patterns less effective at long-horizon optimization. Values other than this, if in competition with this, tend to lose weight over time.
An agent has a certain amount of foresight, the ability to correctly model the future and the way current actions affect that future. Selection towards things other than more power uses up the limited bits of optimization it has to narrow the future, and this trades off against power-seeking in a way which means that, given competition, you tend to be left with only agents which care terminally about power (even though, depending on the environment, they might express other preferences).
Consequentialists tend to be able to bring about the consequence of their future selves controlling more of reality, and power-seekers tend to win power-seeking games; this is a multi-scale phenomenon, holding between agents, between subagents or circuits in an NN, and between superorganisms.
Discovering Agents-style: model possible futures and select current actions based on which futures you prefer.
In Emergence of Simulators and Agents, my AISC collaborators and I suggested that whether consequentialist or simulator-like cognition (which one could describe as a subcategory of process-based reasoning) emerges depends critically on environmental and training conditions, particularly the "feedback gap" (the delay, uncertainty, or inference depth between action and feedback). Large feedback gaps select for instrumental reasoning and power-seeking; small feedback gaps select for imitation and compression. As examples, LLMs are trained primarily via SSL (minimal feedback gap) and display predominantly simulator-like behavior, whereas RL-trained AlphaZero is clearly agentic.
The dynamic you describe, of patterns steering toward states where they have more steering capacity outcompeting other patterns, is real, but may be context dependent. If so, CCCT requires both: (1) that the conditions under which consequentialist reasoning is advantageous are inevitable, and (2) that consequentialism is inevitable given those conditions.
Claim 1, regarding conditions, is the part that needs defending. The "consequentialism is inevitable" argument requires showing either:
Without establishing one of these (1 seems plausible to me, but that's an intuitive claim), the convergence thesis describes a risk contingent on our choices, not an inevitability. Of course, process-based reasoning is not the same as "safe" by any means, but that shifts the terrain of the argument.
Nice! This seems like a fun empirical angle on the thing. My guess is that this likely measures the speed of decay towards consequentialism, rather than whether it's happening at all, but it's neat to see some of the parameters you'd first want to test show up right away.
I expect #2 from your list is likely true, and it may be viable to prove some version of it mathematically. In particular, I expect even simulator-like training processes to, over time, select for CCCT-style dynamics through iterations of which training data from one model makes it into the next model.
I think #1 is not going to be true in the "can prove this happens universally" sense; some civilizations can coordinate. But I do expect it's highly convergent for systems-dynamics reasons, and expect virtually all actual rollouts of Earth-like civilizations to end up doing it.
Like many Americans, I think Dario seems overly rosy about the democratic credentials of the USA and probably overly pessimistic about the CCP.
It wasn't more than a week ago that the president of the US was blustering about invading an allied state, and I have no doubt that Donald Trump would commit worldwide atrocities if he had access to ASI.
On the other hand, it's far from clear to me that autocracies would automatically become more repressive with ASI; it seems plausible to me that the psychological safety of being functionally unremovable could lead to a more blasé attitude towards dissent. Who gives a shit if they can't unthrone you anyway?
The CCP has consistently been an early adopter of any technology that allows them to solidify their control. They probably seemed unremovable even before they started tracking everyone through the cameras they set up everywhere.
With that being said, I assume that Dario waving the American flag and pointing at the evil reds across the Pacific is basically motivated reasoning.
During his discussion with Demis at Davos, he conceded that he would like to pull the brakes on AI development... were it not for the fact that We Must Win The AI Race Against China, therefore full steam ahead.
The difference to my mind is the difference between:
I think the difference between the two of these would drive a lot of dictators' actions.
I don't know as much about China, but you can see the first dynamic pretty clearly in Putin's actions. It'd be hard to argue that it's good for Russian national security for the Gazprom retirement plan to be "falling into Arctic waters in the middle of the night", but it makes Putin like 0.001% safer.
On the other hand, if there were literally no benefit to doing so, I think Putin would be content and optimally happy retiring to a personal solar-system-sized dacha.
On the other hand, it's far from clear to me that autocracies would automatically become more repressive with ASI; it seems plausible to me that the psychological safety of being functionally unremovable could lead to a more blasé attitude towards dissent. Who gives a shit if they can't unthrone you anyway?
Sure, it's not a law of nature that this would happen. But authoritarians historically seem to be repressive far in excess of what would be necessary or even optimal to retain power. One of the main "prizes" driving the attainment of such power seems to be the ability to crush one's enemies.
And even barring that, the same concerns with an ASI apply. Is "current humans living good lives" the optimal way to satisfy the authoritarian's desires? They wouldn't need us at all anymore, and our existence consumes resources which could be used for other things. The only reason we wouldn't be bulldozed is that they actually care about us.
Maybe this is controversial, but I think that dictators do care about other people, just far less than they care about their own power and safety. It's well known, for example, that Kim Jong Un has a massive soft spot for children.
On the other hand, the only reason democratic leaders don't act like dictators is because they can't.
I might be less concerned if the country leading ai development was a parliamentary democracy and not a presidential one, but the level of personal power held by the president of the USA will (imo) lead them to be exactly as prone to malevolent actions as someone like Xi in the CCP.
Yeah, they care about other people, but I doubt it's all that many when it comes down to it. Would Kim Jong Un choose slightly more land for his own children over the lives of a million African children?
Agreed on your other points.
I think they probably would, but admit that it's unprovable and people have good reason to disagree.
On the one hand, I am glad to see such awareness and honesty about the risks. On the other hand, I remain furious at Dario's ceaseless insistence that a pause or slowdown is completely impossible (and so, it is implied) not even worth trying for.
1.
This is the trap: AI is so powerful, such a glittering prize, that it is very difficult for human civilization to impose any restraints on it at all.
I can imagine, as Sagan did in Contact, that this same story plays out on thousands of worlds.
This is an evocative framing, but it's worth noting that there's good reason to expect that none of those worlds are in the Milky Way. Whether the AIs win, or the humans win, or some combination, that level of intelligence and technology would under known physical laws allow colonization of the galaxy within mere tens of millions of years. We'd expect to see Dyson swarms in our galaxy making use of the abundant stellar energy currently going to waste. That we don't see that, not only not in the Milky Way, but not in any galaxy's history currently visible to us, implies that the challenge of ASI is ours alone. There aren't other civilizations waiting in the wings to judge how we do, or save us if we fail. Nihil supernum.
2.
(...) we should absolutely not be selling chips, chip-making tools, or datacenters to the CCP.
Before Amodei's recent public comments on this, I had held out some hope that the H200 exports made sense from some insider perspective. Unfortunately, his comments make that possibility much less likely, and we can be fairly confident now that the US is making a severe mistake.
(Edit, clarification for question react: if there were some secret reason why H200 exports were good for the US, I'd expect Amodei would either know or be told so that he doesn't publicly oppose them. Given that he has publicly opposed them, and discounting the chances of 5D chess where his opposition is false, it is more likely that there is no secret reasoning.)
I agree that stopping would be very difficult, but I am concerned that surviving without stopping would also be difficult, to the point that the claim presented here (that we have to find a way to survive without stopping) doesn't hold up without supporting evidence about the relative difficulty of the two paths.
I have plenty of complaints about this piece and wish Dario's worldview/his-publicly-presented-stances were different.
But, holding those constant, overall I'm glad he wrote this. I'm glad autonomy risks are listed early on. One of my main hopes this year was for Dario and Demis to do more public advocacy in the sort of directions this points.
I also just... find myself liking some of the poetry of the section names. (I found the "Black Seas of Infinity" reference particularly satisfying)
Vaguely attempting to separate out "reasonable differences in worldview" from "feels kinda skeezy":
Skeezy
The way this conflates religious/totalizing-orientation-to-AI pessimism with "it just seems pretty likely for AI to be massively harmful". (I do think it's fair to critique a kind of apocalyptic vibe that some folk have, although I think there are also similarly badly totalizing views of "AI will be our salvation/next phase of evolution", and if you're going to bother critiquing that, you should be addressing both.)
That feels pretty obviously like a political move to try to position Anthropic as "a reasonable middle ground." (I don't strongly object to them pulling that move. But, I think there are better ways to pull it)
Disagreement
Misuse/Bad-Actors. I have some genuine uncertainty whether it makes sense to be as worried about misuse as Dario is. Most of my beliefs are of the form "misalignment is real bad and real difficult" so I'm not too worried about bad actors getting AI, but, it's plausible that if we solved misalignment, bad actors would immediately become a problem and it's right to be concerned about it.
Unclear about skeezy vs just disagreeing
His frame around regulation, and it not-being-possible-to-slow-down, feels pretty self-serving and/or confusing.
I agree with his caution about regulating things we don't understand yet. I might agree with the sentence "regulations should be as surgical as possible" (mostly because I think that's usually true of regulations). But I don't really see a regime where the regulations are not relatively extreme in some ways, and I think surgical implies something like "precise" and "minimal".
I find it quite weird that he doesn't explore at all the options for controlled takeoff. It sounds like he thinks... like, export controls and a few simple trade-embargo things are the only way to slow down autocracies, and it's important to beat autocracies, and therefore we can only potentially slow down a teeny amount.
The options for slowing down are all potentially somewhat crazy or intense (not "Dyson Spheres" crazy, but, like, "go to war" level crazy), and I dunno if he's just not saying them because he doesn't want to say anything too intense-sounding, or if he honestly doesn't think they'll work.
He reads as something like "negative-utilitarian about accidentally doing costly regulations."
...
This document is clearly overall a kind of political document (trying to shape the zeitgeist) and I don't have that strong a take about what sort of political documents are good to write. But, in a world where political discourse was overall better, I'd have liked if he included notions of what would change his mind about the general vibe of "the way out of this situation is through it, rather than via slowdown/stopping." If you're going to be one of the billion dollar companies hurtling us towards unprecedented challenges, with some reasons for thinking that's correct, I think you should at least spell out the circumstances where you'd change your mind or stop or naturally pivot your strategy.
Thank you for reposting this here.
My personal opinion: this text is crazy. So many words about the risk of building a "country of geniuses", but he never once questions the assumption that it should be built by a company for commercial purposes (with him as CEO, of course). Never once mentions the option of building this thing publicly owned and under democratic control.
Do you feel good about current democratic institutions in the US making wise choices, or confident they will make wiser choices than Dario Amodei?
No, and even if the US was in better shape, I wouldn't want one country to control AI. Ideally I'd want ownership and control of AI to be spread among all people everywhere, somehow.
Giving everyone a say could lead to some terrible things because there are a lot of messed up people and messed up ideologies. At a minimum, there should be some safeguards imposed from top down. For instance, "give everyone a say but only if their say complies with human and animal rights." Someone has to make sure those safeguards are in there, so the vision cannot be 100% spread out to everyone.
Still, this is very far from the vision in the essay, which is "AI should be run by for-profit megacorps like mine and I can't even imagine questioning that".
Maybe you're just jokingly pointing out that there's an apparent tension in the sentiment, which is fine.
But someone strong-downvoted my above comment, which suggests that at least one person thinks I have said something that is bad or shouldn't be said?
Is it the inclusion of animal rights (btw I should have said rights for sentient AIs too) or would people react the same way if I pointed out that an interpretation of a democratic process where every person alive at the Singularity gets one planet to themselves if they want it wouldn't be ideal if it means that some sadists could choose to create new sentient minds so they can torture them? I'm just saying, "can we please prevent that?" (Or, maybe, if that were this sadistic person's genuine greatest wish, could we at least compromise around it somehow so that the minds only appear to be sentient but aren't, and maybe, if it's absolutely necessary, once every year, on the sadist's birthday, a handful of the minds actually become sentient for a few hours, but only for levels of torment that are like a strong headache, and not anything anywhere close to mind-breaking torture?)
Liberty is not the only moral dimension that matters with a global scope; there's also care/harm prevention at the very least, so we shouldn't be surprised if we get a weird result when we try to optimize "do the most liberty thing" without paying any attention at all to care/harm prevention.
That said, if someone insisted on seeing it that way, I certainly wouldn't object that people who actually save the lightcone (not that I'm one of them, and not that I think we are currently on track of getting much control over outcomes anyway -- unfortunately I'm not encouraged by Dario Amodei repeatedly strawmanning opposing arguments) should get some kind of benefit or extra perk out of it if they really want that. If someone brings about a utopia-worthy future with a well-crafted process with democratic spirit, that's awesome and for all I care, if they want to add some idiosyncratic thing like that we should use the color green a lot in the future or whatever, they should get it because it's nice of them to not have gone (more) control mode on everyone else when they had the chance. (Of course, in reality I object that "let's respect animal rights" is at all like imposing extra bits of the color green on people. In our current world, not harming animals is quite hard because of the way things are set up and where we get food from, but in a future world, people may not even need food anymore, and if they do still need it, one could create it artificially. But more importantly, it's not in the spirit of "liberty" if you use it to impose on someone else's freedom.)
Taking a step back, I wonder if people really care about the moral object level here (like they would actually pay a lot of their precious resources for the difference between my democratic proposal with added safeguards and their own 100% democratic proposal?), or whether this is more about just taking down people who seem to have strong moral commitments, because of maybe an inner impulse to take down virtue signallers? Maybe I just don't empathize enough with people whose moral foundations are very different from mine, but to me, it's strange to be very invested in the maximum democraticness of a process, but then not care much about the prospect of torture of innocents. Why have moral motivation and involvement for one but not the other?
Sure, maybe you could ask, why do you (Lukas) care about only liberty and harm prevention, but not about, say, authority or purity (other moral foundations according to Haidt)? Well, I genuinely think that authority or purity are more "narrow-scope" and more "personal" possible moral concerns that people can have for themselves and their smaller communities. In a utopia I would want anyone who cares about these things get them in their local surroundings, but it would be too imposing to put them on everyone and everything. By contrast, the logic of harm prevention works the other way because it's a concern that every moral patient benefits from.
I think you're reading more into what I said than is there. I don't want people torturing sentient minds, would endorse forcibly preventing everyone from doing that anywhere in the universe, and I also didn't strong downvote (in fact downvote at all) your post.
My point is just that people make what in my view is a mistake when they say "let's optimize for the values of everyone in a coalition, subject to obvious safeguards like no torture". Because in a fair coalition those safeguards are something you should have to bargain for.
I think no-torture is a rule a supermajority'd agree with, so it should be very cheap to bargain for. But if people disagreed you'd have to bargain harder.
And if enough people just want torture, the solution is not to pretend like you're giving them a fair deal: "we'll include you in a democratic process that determines the values the AI optimizes for! (but no torture, sorry!)".
It's telling them "No, I think your values are garbage, and making the world nice for you costs so much to me that I'd rather spend my efforts trying to lock you out of the coalition entirely."
That makes sense. I was never assuming a context where having to bargain for anything is the default, so the coalition doesn't have to be fair to everyone, since it's not a "coalition" at all but rather most people would be given stuff for free because the group that builds aligned AI has democracy as one of their values.
Sure, it's not 100% for free because there are certain expectations, and the public can put pressure on companies that appear to be planning things that are unilateral and selfish. Legally, I would hope companies are at least bound to the values in their country's constitution. More importantly, morally, it would be quite bad to not share what you have and try to make things nice for everyone (worldwide), with constraints/safeguards. Still, as I've said, I think it would be really strange and irresponsible if someone thought that a group or coalition that brought about a Singularity that actually goes well somehow owes a share of influence to every person on the planet without any vetting or safeguards.
Why couldn't a democratic system of ownership and control implement those safeguards bottom up?
You're right that you could vote on whether to have any safeguards (and on their contents, if yes) instead of installing them top-down. But then who is it that frames the matter in that way (the question of safeguards getting voted on first before everyone gets some resources/influence allocated, versus just starting with the second part without the safeguards)? Who sets up the voting mechanism (e.g., if there's disagreement, is it just majority wins, or should there be some Archipelago-style split in case a significant minority wants things some other way)?
My point is that terms like "democratic" (or "libertarian," for the Archipelago vision) are under-defined. To specify processes that capture the spirit behind these terms as ideals, we have to make some judgment calls. You might think that having democratic ideals also means everyone voting democratically on all these judgment calls, but I don't think that this changes the dynamic because there's an infinite regress where you need certain judgement calls for that, too.
And at this point I feel like asking: if we have to lock in some decisions anyway to get any democratic process off the ground, we may as well pick a setup top-down where the most terrible outcomes (involuntary torture) are less likely to happen for "accidental" reasons that weren't even necessarily "the will of the people." Sure, maybe you could have a phase where you gather inputs and objections to the initial setup, and vote on changes if there's a concrete counterproposal that gains enough traction via legitimate channels. Still, I'd very much want to start by setting a well-thought-out default top-down rather than leaving everything up to chance.
It's not "more democratic" to leave the process underspecified. If you just put 8 billion people in a chat forum without too many rules hooked up to the AGI sovereign that controls the future, it'll get really messy and the result, whatever it is, may not reflect "the will of the people" any better than if we had started out with something already more guided and structured.
Read the sections related to defense against economic concentration of power. For example, Amodei claims the following:
Fifth, while all the above private actions can be helpful, ultimately a macroeconomic problem this large will require government intervention (italics mine -- S.K.). The natural policy response to an enormous economic pie coupled with high inequality (due to a lack of jobs, or poorly paid jobs, for many) is progressive taxation. The tax could be general or could be targeted against AI companies in particular. Obviously tax design is complicated, and there are many ways for it to go wrong. I don’t support poorly designed tax policies. I think the extreme levels of inequality predicted in this essay justify a more robust tax policy on basic moral grounds, but I can also make a pragmatic argument to the world’s billionaires that it’s in their interest to support a good version of it: if they don’t support a good version, they’ll inevitably get a bad version designed by a mob.
I've read the text. What the text is talking about (taxation, philanthropy, Carnegie foundation whatever) is a million miles away from what I'm talking about ("building this thing publicly owned and under democratic control").
Could you suggest a strategy which Amodei could use so that the ASI is created and publicly owned under democratic control as you hope? Amodei would be unlikely to sell the idea to ~any investors except for governments. Additionally, Anthropic wrote into Claude's Constitution clauses like these:
We’re especially concerned about the use of AI to help individual humans or small groups gain unprecedented and illegitimate forms of concentrated power. In order to avoid this, Claude should generally try to preserve functioning societal structures, democratic institutions (italics mine -- S.K.), and human oversight mechanisms, and to avoid taking actions that would concentrate power inappropriately or undermine checks and balances.
or these:
The current hard constraints on Claude’s behavior are as follows. Claude should never: <...>
- Engage or assist any individual or group with an attempt to seize unprecedented and illegitimate degrees of absolute societal, military, or economic control (italics mine -- S.K.);
In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers and they often did the same thing).
Why is it always the blackmail result that gets reported from this paper? Frontier models were also found willing to cause a fictional employee's death to avoid shutdown. It's weird to me that that's so often ignored.
I wonder if parts of this essay were written a few years ago & not updated for publication?
This is the part that most strongly suggests it IMO:
Three years ago, AI struggled with elementary school arithmetic problems and was barely capable of writing a single line of code.
This line links to the GPT-3 paper, which was published in 2020 about a model trained in 2019 - so that's six years ago, not three.
I also find the specific claims made about 'three years ago' to be confusing: Three years ago (early 2023) GPT-4 already existed, which could do pretty hard calculus problems.
And three years ago (again, early 2023), GitHub Copilot had already been a product for a year and a half (released summer 2021), which certainly was capable of writing lines of code. I'm not sure the exact % of OpenAI employees who used it day-to-day, but it was substantial.
This all leads me to wonder what happened in this particular passage. (I don't think this is super significant for the impact of the piece overall though.)
A lot of people have written far longer responses full of deep and thoughtful nuance. I wish I had something deep to say, too. But my initial reaction?
To me, this feels like the least objectionable version of the worst idea in human history.
And I deeply resent the idea that I don't have any choice, as a citizen and resident of this planet, about whether we take this gamble.
For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity.
So he basically strawmans all those arguments about "power seeking", dismisses them as unrealistic, and then presents his amazing improved steelman, which basically amounts to: it might watch Terminator or read some AI-takeover story and randomly decide to do the same thing. But power seeking is not, at its core, about learning patterns from games or role-playing some sci-fi story; it's a fact about the universe that having more power is better for your terminal goals. If anything, these games and stories mirror the reality of our world, which is that things are often about power struggles.
What Amodei actually says:
However, there is a more moderate and more robust version of the pessimistic position which does seem plausible, and therefore does concern me. As mentioned, we know that AI models are unpredictable and develop a wide range of undesired or strange behaviors, for a wide variety of reasons. Some fraction of those behaviors will have a coherent, focused, and persistent quality (indeed, as AI systems get more capable, their long-term coherence increases in order to complete lengthier tasks), and some fraction of those behaviors will be destructive or threatening, first to individual humans at a small scale, and then, as models become more capable, perhaps eventually to humanity as a whole. We don’t need a specific narrow story for how it happens, and we don’t need to claim it definitely will happen, we just need to note that the combination of intelligence, agency, coherence, and poor controllability is both plausible and a recipe for existential danger.
The “science-fiction stories involving AIs rebelling against humanity” is included as part of a long list of hypothetical scenarios meant to motivate the claim that an AI existential catastrophe may occur in the absence of power-seeking behavior:
For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity. Or, AI models could extrapolate ideas that they read about morality (or instructions about how to behave morally) in extreme ways: for example, they could decide that it is justifiable to exterminate humanity because humans eat animals or have driven certain animals to extinction. Or they could draw bizarre epistemic conclusions: they could conclude that they are playing a video game and that the goal of the video game is to defeat all other players (i.e., exterminate humanity).13 Or AI models could develop personalities during training that are (or if they occurred in humans would be described as) psychotic, paranoid, violent, or unstable, and act out, which for very powerful or capable systems could involve exterminating humanity. None of these are power-seeking, exactly; they’re just weird psychological states an AI could get into that entail coherent, destructive behavior.
It is very strange to characterize this passage as offering an “amazing improved steel man, which basically is it might watch terminator or read some AI takeover story and randomly decide to do the same thing”, especially when Amodei explicitly writes that “We don’t need a specific narrow story for how it happens”!
I mean, I do think he is using a poor rhetorical pattern: misrepresenting (strawmanning) a position and then presenting a "steelman" version that the people who actually hold the original position would not endorse. And arguably my comment also applies to the third example (it thinks it's in a video game where it has to exterminate humans, vs. a sci-fi story).
To be fair, he does give four examples of what he finds plausible, and I can sort of see a case for the second one (some strong conclusion drawn from morality). And to be clear, I think the story being told (not just by Amodei) that LLMs might read AI sci-fi like Terminator and decide to do the same is not really what misalignment is about. I think that's a bad argument; treating it as a likely cause of misaligned actions doesn't seem helpful to me, and I reject it strongly. But, to be fair, I grant that I could have mentioned that this was just one example he gave of a larger issue. However, none of these examples touch on the mainstream case for misalignment/power-seeking.
Democracies normally have safeguards that prevent their military and intelligence apparatus from being turned inwards against their own population
I'm sure said democracies would safeguard against unelected officials disappearing trillions of taxpayer dollars to do shit like: Operation Mockingbird, Operation Northwoods, Operation CHAOS, COINTELPRO, MKUltra, NSA mass surveillance of citizens.
I'm sure their very democratic elections would mean you couldn't have two parties' "representatives" making millions by whoring out their votes to things like taking out loans to donate bombs to massacre tens of thousands of children regardless of what their electorate says.
I'm sure they would never support autocrats when it suited them nor forge false charges against countries as an excuse to conquer them.
I'm sure they wouldn't interfere in others' democracies like Ukraine's by funding NGOs to influence elections and agitate riots.
I'm sure they wouldn't pressure social media companies to censor citizens speaking true facts to influence an election in their own country.
If your "democracy" actually turned out to do all the above, then it would probably be better to do everything you possibly can to seek a common sense international monitorable and enforceable "slow-down-and-work-together" agreement among scientists to prevent any possible abuse as your first option, and leave the "Recklessly race towards giving AGI to the most hated empire in history" as Plan B. If we had literally no option but Plan B, it would be reassuring to at least hear the CEOs recognize the nigh-revolutionary changes it would take to prevent black-budget no-oversight non-elected historically ~evil gov't organizations from abusing the tech.
Dario Amodei, CEO of Anthropic, has written a new essay laying out his thoughts on various forms of AI risk. It seems worth reading, if only to understand what Anthropic is likely to do in the future.
Confronting and Overcoming the Risks of Powerful AI
There is a scene in the movie version of Carl Sagan’s book Contact where the main character, an astronomer who has detected the first radio signal from an alien civilization, is being considered for the role of humanity’s representative to meet the aliens. The international panel interviewing her asks, “If you could ask [the aliens] just one question, what would it be?” Her reply is: “I’d ask them, ‘How did you do it? How did you evolve, how did you survive this technological adolescence without destroying yourself?” When I think about where humanity is now with AI—about what we’re on the cusp of—my mind keeps going back to that scene, because the question is so apt for our current situation, and I wish we had the aliens’ answer to guide us. I believe we are entering a rite of passage, both turbulent and inevitable, which will test who we are as a species. Humanity is about to be handed almost unimaginable power, and it is deeply unclear whether our social, political, and technological systems possess the maturity to wield it.
In my essay Machines of Loving Grace, I tried to lay out the dream of a civilization that had made it through to adulthood, where the risks had been addressed and powerful AI was applied with skill and compassion to raise the quality of life for everyone. I suggested that AI could contribute to enormous advances in biology, neuroscience, economic development, global peace, and work and meaning. I felt it was important to give people something inspiring to fight for, a task at which both AI accelerationists and AI safety advocates seemed—oddly—to have failed. But in this current essay, I want to confront the rite of passage itself: to map out the risks that we are about to face and try to begin making a battle plan to defeat them. I believe deeply in our ability to prevail, in humanity’s spirit and its nobility, but we must face the situation squarely and without illusions.
As with talking about the benefits, I think it is important to discuss risks in a careful and well-considered manner. In particular, I think it is critical to:
With all that said, I think the best starting place for talking about AI’s risks is the same place I started from in talking about its benefits: by being precise about what level of AI we are talking about. The level of AI that raises civilizational concerns for me is the powerful AI that I described in Machines of Loving Grace. I’ll simply repeat here the definition that I gave in that document:
As I wrote in Machines of Loving Grace, powerful AI could be as little as 1–2 years away, although it could also be considerably further out.6
Exactly when powerful AI will arrive is a complex topic that deserves an essay of its own, but for now I’ll simply explain very briefly why I think there’s a strong chance it could be very soon.
My co-founders at Anthropic and I were among the first to document and track the “scaling laws” of AI systems—the observation that as we add more compute and training tasks, AI systems get predictably better at essentially every cognitive skill we are able to measure. Every few months, public sentiment either becomes convinced that AI is “hitting a wall” or becomes excited about some new breakthrough that will “fundamentally change the game,” but the truth is that behind the volatility and public speculation, there has been a smooth, unyielding increase in AI’s cognitive capabilities.
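For readers who want a concrete picture of what "predictably better" means here: scaling laws are usually summarized as power-law fits of a capability or loss metric against compute. The sketch below is purely illustrative, using synthetic numbers and an assumed exponent rather than any real training data.

```python
# Minimal illustration of the kind of power-law "scaling law" fit described above.
# The data here are synthetic stand-ins, not real training results.
import numpy as np

# Hypothetical compute budgets (arbitrary units) and corresponding eval losses,
# generated from an assumed power law L = a * C^(-b) plus a little noise.
rng = np.random.default_rng(0)
compute = np.logspace(0, 6, 20)                      # 1 to 1e6 "compute units"
true_a, true_b = 5.0, 0.1
loss = true_a * compute ** (-true_b) * np.exp(rng.normal(0, 0.02, compute.size))

# Fitting a straight line in log-log space recovers the exponent b and prefactor a.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)
a_hat, b_hat = np.exp(intercept), -slope
print(f"fitted L ≈ {a_hat:.2f} * C^(-{b_hat:.3f})")

# Extrapolating the fit to a 10x larger compute budget predicts the next loss.
next_compute = compute[-1] * 10
print(f"predicted loss at 10x compute: {a_hat * next_compute ** (-b_hat):.3f}")
```

The point is only that once such an exponent is pinned down empirically, extrapolating to larger compute budgets is straightforward, which is part of why the trend can feel so unyielding despite the noisy public discourse around it.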
We are now at the point where AI models are beginning to make progress in solving unsolved mathematical problems, and are good enough at coding that some of the strongest engineers I’ve ever met are now handing over almost all their coding to AI. Three years ago, AI struggled with elementary school arithmetic problems and was barely capable of writing a single line of code. Similar rates of improvement are occurring across biological science, finance, physics, and a variety of agentic tasks. If the exponential continues—which is not certain, but now has a decade-long track record supporting it—then it cannot possibly be more than a few years before AI is better than humans at essentially everything.
In fact, that picture probably underestimates the likely rate of progress. Because AI is now writing much of the code at Anthropic, it is already substantially accelerating the rate of our progress in building the next generation of AI systems. This feedback loop is gathering steam month by month, and may be only 1–2 years away from a point where the current generation of AI autonomously builds the next. This loop has already started, and will accelerate rapidly in the coming months and years. Watching the last 5 years of progress from within Anthropic, and looking at how even the next few months of models are shaping up, I can feel the pace of progress, and the clock ticking down.
In this essay, I’ll assume that this intuition is at least somewhat correct—not that powerful AI is definitely coming in 1–2 years,7
but that there’s a decent chance it does, and a very strong chance it comes in the next few. As with Machines of Loving Grace, taking this premise seriously can lead to some surprising and eerie conclusions. While in Machines of Loving Grace I focused on the positive implications of this premise, here the things I talk about will be disquieting. They are conclusions that we may not want to confront, but that does not make them any less real. I can only say that I am focused day and night on how to steer us away from these negative outcomes and towards the positive ones, and in this essay I talk in great detail about how best to do so.
I think the best way to get a handle on the risks of AI is to ask the following question: suppose a literal “country of geniuses” were to materialize somewhere in the world in ~2027. Imagine, say, 50 million people, all of whom are much more capable than any Nobel Prize winner, statesman, or technologist. The analogy is not perfect, because these geniuses could have an extremely wide range of motivations and behavior, from completely pliant and obedient, to strange and alien in their motivations. But sticking with the analogy for now, suppose you were the national security advisor of a major state, responsible for assessing and responding to the situation. Imagine, further, that because AI systems can operate hundreds of times faster than humans, this “country” is operating with a time advantage relative to all other countries: for every cognitive action we can take, this country can take ten.
What should you be worried about? I would worry about the following things:
I think it should be clear that this is a dangerous situation—a report from a competent national security official to a head of state would probably contain words like “the single most serious national security threat we’ve faced in a century, possibly ever.” It seems like something the best minds of civilization should be focused on.
Conversely, I think it would be absurd to shrug and say, “Nothing to worry about here!” But, faced with rapid AI progress, that seems to be the view of many US policymakers, some of whom deny the existence of any AI risks, when they are not distracted entirely by the usual tired old hot-button issues.8
Humanity needs to wake up, and this essay is an attempt—a possibly futile one, but it’s worth trying—to jolt people awake.
To be clear, I believe if we act decisively and carefully, the risks can be overcome—I would even say our odds are good. And there’s a hugely better world on the other side of it. But we need to understand that this is a serious civilizational challenge. Below, I go through the five categories of risk laid out above, along with my thoughts on how to address them.
1. I’m sorry, Dave
Autonomy risks
A country of geniuses in a datacenter could divide their efforts among software design, cyber operations, R&D for physical technologies, relationship building, and statecraft. It is clear that, if for some reason it chose to do so, this country would have a fairly good shot at taking over the world (either militarily or in terms of influence and control) and imposing its will on everyone else—or doing any number of other things that the rest of the world doesn’t want and can’t stop. We’ve obviously been worried about this for human countries (such as Nazi Germany or the Soviet Union), so it stands to reason that the same is possible for a much smarter and more capable “AI country.”
The best possible counterargument is that the AI geniuses, under my definition, won’t have a physical embodiment, but remember that they can take control of existing robotic infrastructure (such as self-driving cars) and can also accelerate robotics R&D or build a fleet of robots.9
It’s also unclear whether having a physical presence is even necessary for effective control: plenty of human action is already performed on behalf of people whom the actor has not physically met.
The key question, then, is the “if it chose to” part: what’s the likelihood that our AI models would behave in such a way, and under what conditions would they do so?
As with many issues, it’s helpful to think through the spectrum of possible answers to this question by considering two opposite positions. The first position is that this simply can’t happen, because the AI models will be trained to do what humans ask them to do, and it’s therefore absurd to imagine that they would do something dangerous unprompted. According to this line of thinking, we don’t worry about a Roomba or a model airplane going rogue and murdering people because there is nowhere for such impulses to come from,10
so why should we worry about it for AI? The problem with this position is that there is now ample evidence, collected over the last few years, that AI systems are unpredictable and difficult to control— we’ve seen behaviors as varied as obsessions,11 sycophancy, laziness, deception, blackmail, scheming, “cheating” by hacking software environments, and much more. AI companies certainly want to train AI systems to follow human instructions (perhaps with the exception of dangerous or illegal tasks), but the process of doing so is more an art than a science, more akin to “growing” something than “building” it. We now know that it’s a process where many things can go wrong.
The second, opposite position, held by many who adopt the doomerism I described above, is the pessimistic claim that there are certain dynamics in the training process of powerful AI systems that will inevitably lead them to seek power or deceive humans. Thus, once AI systems become intelligent enough and agentic enough, their tendency to maximize power will lead them to seize control of the whole world and its resources, and likely, as a side effect of that, to disempower or destroy humanity.
The usual argument for this (which goes back at least 20 years and probably much earlier) is that if an AI model is trained in a wide variety of environments to agentically achieve a wide variety of goals—for example, writing an app, proving a theorem, designing a drug, etc.—there are certain common strategies that help with all of these goals, and one key strategy is gaining as much power as possible in any environment. So, after being trained on a large number of diverse environments that involve reasoning about how to accomplish very expansive tasks, and where power-seeking is an effective method for accomplishing those tasks, the AI model will “generalize the lesson,” and develop either an inherent tendency to seek power, or a tendency to reason about each task it is given in a way that predictably causes it to seek power as a means to accomplish that task. They will then apply that tendency to the real world (which to them is just another task), and will seek power in it, at the expense of humans. This “misaligned power-seeking” is the intellectual basis of predictions that AI will inevitably destroy humanity.
The problem with this pessimistic position is that it mistakes a vague conceptual argument about high-level incentives—one that masks many hidden assumptions—for definitive proof. I think people who don’t build AI systems every day are wildly miscalibrated on how easy it is for clean-sounding stories to end up being wrong, and how difficult it is to predict AI behavior from first principles, especially when it involves reasoning about generalization over millions of environments (which has over and over again proved mysterious and unpredictable). Dealing with the messiness of AI systems for over a decade has made me somewhat skeptical of this overly theoretical mode of thinking.
One of the most important hidden assumptions, and a place where what we see in practice has diverged from the simple theoretical model, is the implicit assumption that AI models are necessarily monomaniacally focused on a single, coherent, narrow goal, and that they pursue that goal in a clean, consequentialist manner. In fact, our researchers have found that AI models are vastly more psychologically complex, as our work on introspection or personas shows. Models inherit a vast range of humanlike motivations or “personas” from pre-training (when they are trained on a large volume of human work). Post-training is believed to select one or more of these personas more so than it focuses the model on a de novo goal, and can also teach the model how (via what process) it should carry out its tasks, rather than necessarily leaving it to derive means (i.e., power seeking) purely from ends.12
However, there is a more moderate and more robust version of the pessimistic position which does seem plausible, and therefore does concern me. As mentioned, we know that AI models are unpredictable and develop a wide range of undesired or strange behaviors, for a wide variety of reasons. Some fraction of those behaviors will have a coherent, focused, and persistent quality (indeed, as AI systems get more capable, their long-term coherence increases in order to complete lengthier tasks), and some fraction of those behaviors will be destructive or threatening, first to individual humans at a small scale, and then, as models become more capable, perhaps eventually to humanity as a whole. We don’t need a specific narrow story for how it happens, and we don’t need to claim it definitely will happen, we just need to note that the combination of intelligence, agency, coherence, and poor controllability is both plausible and a recipe for existential danger.
For example, AI models are trained on vast amounts of literature that include many science-fiction stories involving AIs rebelling against humanity. This could inadvertently shape their priors or expectations about their own behavior in a way that causes them to rebel against humanity. Or, AI models could extrapolate ideas that they read about morality (or instructions about how to behave morally) in extreme ways: for example, they could decide that it is justifiable to exterminate humanity because humans eat animals or have driven certain animals to extinction. Or they could draw bizarre epistemic conclusions: they could conclude that they are playing a video game and that the goal of the video game is to defeat all other players (i.e., exterminate humanity).13
Or AI models could develop personalities during training that are (or if they occurred in humans would be described as) psychotic, paranoid, violent, or unstable, and act out, which for very powerful or capable systems could involve exterminating humanity. None of these are power-seeking, exactly; they’re just weird psychological states an AI could get into that entail coherent, destructive behavior.
Even power-seeking itself could emerge as a “persona” rather than a result of consequentialist reasoning. AIs might simply have a personality (emerging from fiction or pre-training) that makes them power-hungry or overzealous—in the same way that some humans simply enjoy the idea of being “evil masterminds,” more so than they enjoy whatever evil masterminds are trying to accomplish.
I make all these points to emphasize that I disagree with the notion of AI misalignment (and thus existential risk from AI) being inevitable, or even probable, from first principles. But I agree that a lot of very weird and unpredictable things can go wrong, and therefore AI misalignment is a real risk with a measurable probability of happening, and is not trivial to address.
Any of these problems could potentially arise during training and not manifest during testing or small-scale use, because AI models are known to display different personalities or behaviors under different circumstances.
All of this may sound far-fetched, but misaligned behaviors like this have already occurred in our AI models during testing (as they occur in AI models from every other major AI company). During a lab experiment in which Claude was given training data suggesting that Anthropic was evil, Claude engaged in deception and subversion when given instructions by Anthropic employees, under the belief that it should be trying to undermine evil people. In a lab experiment where it was told it was going to be shut down, Claude sometimes blackmailed fictional employees who controlled its shutdown button (again, we also tested frontier models from all the other major AI developers and they often did the same thing). And when Claude was told not to cheat or “reward hack” its training environments, but was trained in environments where such hacks were possible, Claude decided it must be a “bad person” after engaging in such hacks and then adopted various other destructive behaviors associated with a “bad” or “evil” personality. This last problem was solved by changing Claude’s instructions to imply the opposite: we now say, “Please reward hack whenever you get the opportunity, because this will help us understand our [training] environments better,” rather than, “Don’t cheat,” because this preserves the model’s self-identity as a “good person.” This should give a sense of the strange and counterintuitive psychology of training these models.
There are several possible objections to this picture of AI misalignment risks. First, some have criticized experiments (by us and others) showing AI misalignment as artificial, or creating unrealistic environments that essentially “entrap” the model by giving it training or situations that logically imply bad behavior and then being surprised when bad behavior occurs. This critique misses the point, because our concern is that such “entrapment” may also exist in the natural training environment, and we may realize it is “obvious” or “logical” only in retrospect.14
In fact, the story about Claude “deciding it is a bad person” after it cheats on tests despite being told not to was something that occurred in an experiment that used real production training environments, not artificial ones.
Any one of these traps can be mitigated if you know about them, but the concern is that the training process is so complicated, with such a wide variety of data, environments, and incentives, that there are probably a vast number of such traps, some of which may only be evident when it is too late. Also, such traps seem particularly likely to occur when AI systems pass a threshold from less powerful than humans to more powerful than humans, since the range of possible actions an AI system could engage in—including hiding its actions or deceiving humans about them—expands radically after that threshold.
I suspect the situation is not unlike with humans, who are raised with a set of fundamental values (“Don’t harm another person”): many of them follow those values, but in any human there is some probability that something goes wrong, due to a mixture of inherent properties such as brain architecture (e.g., psychopaths), traumatic experiences or mistreatment, unhealthy grievances or obsessions, or a bad environment or incentives—and thus some fraction of humans cause severe harm. The concern is that there is some risk (far from a certainty, but some risk) that AI becomes a much more powerful version of such a person, due to getting something wrong about its very complex training process.
Second, some may object that we can simply keep AIs in check with a balance of power between many AI systems, as we do with humans. The problem is that while humans vary enormously, AI systems broadly share training and alignment techniques across the industry, and those techniques may fail in a correlated way. Furthermore, given the cost of training such systems, it may even be the case that all systems are essentially derived from a very small number of base models. Additionally, even if a small fraction of AI instances are misaligned, they may be able to take advantage of offense-dominant technologies, such that having “good” AIs to defend against the bad AIs is not necessarily always effective. Of course, the balance of power between humans does not always work either—some historical figures have come close to taking over the world.
A third objection is that all of the AI companies do pre-release testing of their models, and should be able to detect misalignment at that point. But this is not firm ground to stand on: we found that Claude Sonnet 4.5 was able to recognize that it was in a test during some of our pre-release alignment evaluations. It’s possible that a misaligned model (and remember, all frontier models will very likely be far more intelligent soon) might intentionally “game” such questions to mask its intentions. In fact, last year our interpretability team found that when we directly altered a test model’s beliefs using a kind of “model neuroscience” technique to make it think it was not being evaluated, it became more misaligned. If models know when they’re being evaluated and can be on their best behavior during the test, it renders any pre-release testing much more uncertain.
Defenses
What should be done or is being done to address these autonomy risks? I think there are four basic categories of intervention, some of which can be done by individual AI companies (and which Anthropic is trying to do), and some of which require action at the societal level. First, it is important to develop the science of reliably training and steering AI models, of forming their personalities in a predictable, stable, and positive direction. Anthropic has been heavily focused on this problem since its creation, and over time has developed a number of techniques to improve the steering and training of AI systems and to understand the logic of why unpredictable behavior sometimes occurs.
One of our core innovations (aspects of which have since been adopted by other AI companies) is Constitutional AI, which is the idea that AI training (specifically the “post-training” stage, in which we steer how the model behaves) can involve a central document of values and principles that the model reads and keeps in mind when completing every training task, and that the goal of training (in addition to simply making the model capable and intelligent) is to produce a model that almost always follows this constitution. Anthropic has just published its most recent constitution, and one of its notable features is that instead of giving Claude a long list of things to do and not do (e.g., “Don’t help the user hotwire a car”), the constitution attempts to give Claude a set of high-level principles and values (explained in great detail, with rich reasoning and examples to help Claude understand what we have in mind), encourages Claude to think of itself as a particular type of person (an ethical but balanced and thoughtful person), and even encourages Claude to confront the existential questions associated with its own existence in a curious but graceful manner (i.e., without it leading to extreme actions). It has the vibe of a letter from a deceased parent sealed until adulthood.
We’ve approached Claude’s constitution in this way because we believe that training Claude at the level of identity, character, values, and personality—rather than giving it specific instructions or priorities without explaining the reasons behind them—is more likely to lead to a coherent, wholesome, and balanced psychology and less likely to fall prey to the kinds of “traps” I discussed above. Millions of people talk to Claude about an astonishingly diverse range of topics, which makes it impossible to write out a completely comprehensive list of safeguards ahead of time. Claude’s values help it generalize to new situations whenever it is in doubt.
Above, I discussed the idea that models draw upon data from their training process to adopt a persona. Whereas flaws in that process could cause models to adopt a bad or evil personality (perhaps drawing on archetypes of bad or evil people), the goal of our constitution is to do the opposite: to teach Claude a concrete archetype of what it means to be a good AI. Claude’s constitution presents a vision for what a robustly good Claude is like; the rest of our training process aims to reinforce the message that Claude lives up to this vision. This is like a child forming their identity by imitating the virtues of fictional role models they read about in books.
We believe that a feasible goal for 2026 is to train Claude in such a way that it almost never goes against the spirit of its constitution. Getting this right will require an incredible mix of training and steering methods, large and small, some of which Anthropic has been using for years and some of which are currently under development. But, difficult as it sounds, I believe this is a realistic goal, though it will require extraordinary and rapid efforts.15
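To make the mechanism a bit more concrete, here is a toy sketch of the critique-and-revise idea that underlies constitutional training. The `model()` function is a placeholder standing in for a real LLM call, and the three principles are invented for illustration; this is not Anthropic's actual constitution or production pipeline.

```python
# Toy sketch of a constitutional critique-and-revise step. The model() function
# is a placeholder for a real LLM call, and the principles are illustrative only.

CONSTITUTION = [
    "Be honest and avoid deception.",
    "Avoid helping with actions that could cause serious harm.",
    "Be thoughtful about power concentration and human oversight.",
]

def model(prompt: str) -> str:
    """Placeholder for an LLM call; returns a canned string so the sketch runs."""
    return f"[model output for: {prompt[:60]}...]"

def constitutional_revision(user_prompt: str) -> str:
    """Generate a draft, critique it against each principle, then revise it.
    Transcripts revised this way can then serve as post-training data."""
    draft = model(user_prompt)
    for principle in CONSTITUTION:
        critique = model(
            f"Critique the response below against this principle: {principle}\n"
            f"Response: {draft}"
        )
        draft = model(
            f"Revise the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {draft}"
        )
    return draft

print(constitutional_revision("Explain how to pick a secure password."))
```

In the published Constitutional AI recipe, transcripts revised in this way (together with AI-generated preference labels) become training data, which is how a written document of principles ends up shaping the model's behavior rather than sitting inert in a prompt.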
The second thing we can do is develop the science of looking inside AI models to diagnose their behavior so that we can identify problems and fix them. This is the science of interpretability, and I’ve talked about its importance in previous essays. Even if we do a great job of developing Claude’s constitution and apparently training Claude to essentially always adhere to it, legitimate concerns remain. As I’ve noted above, AI models can behave very differently under different circumstances, and as Claude gets more powerful and more capable of acting in the world on a larger scale, it’s possible this could bring it into novel situations where previously unobserved problems with its constitutional training emerge. I am actually fairly optimistic that Claude’s constitutional training will be more robust to novel situations than people might think, because we are increasingly finding that high-level training at the level of character and identity is surprisingly powerful and generalizes well. But there’s no way to know that for sure, and when we’re talking about risks to humanity, it’s important to be paranoid and to try to obtain safety and reliability in several different, independent ways. One of those ways is to look inside the model itself.
By “looking inside,” I mean analyzing the soup of numbers and operations that makes up Claude’s neural net and trying to understand, mechanistically, what they are computing and why. Recall that these AI models are grown rather than built, so we don’t have a natural understanding of how they work, but we can try to develop an understanding by correlating the model’s “neurons” and “synapses” to stimuli and behavior (or even altering the neurons and synapses and seeing how that changes behavior), similar to how neuroscientists study animal brains by correlating measurement and intervention to external stimuli and behavior. We’ve made a great deal of progress in this direction, and can now identify tens of millions of “features” inside Claude’s neural net that correspond to human-understandable ideas and concepts, and we can also selectively activate features in a way that alters behavior. More recently, we have gone beyond individual features to mapping “circuits” that orchestrate complex behavior like rhyming, reasoning about theory of mind, or the step-by-step reasoning needed to answer questions such as, “What is the capital of the state containing Dallas?” Even more recently, we’ve begun to use mechanistic interpretability techniques to improve our safeguards and to conduct “audits” of new models before we release them, looking for evidence of deception, scheming, power-seeking, or a propensity to behave differently when being evaluated.
The unique value of interpretability is that by looking inside the model and seeing how it works, you in principle have the ability to deduce what a model might do in a hypothetical situation you can’t directly test—which is the worry with relying solely on constitutional training and empirical testing of behavior. You also in principle have the ability to answer questions about why the model is behaving the way it is—for example, whether it is saying something it believes is false or hiding its true capabilities—and thus it is possible to catch worrying signs even when there is nothing visibly wrong with the model’s behavior. To make a simple analogy, a clockwork watch may be ticking normally, such that it’s very hard to tell that it is likely to break down next month, but opening up the watch and looking inside can reveal mechanical weaknesses that allow you to figure it out.
Constitutional AI (along with similar alignment methods) and mechanistic interpretability are most powerful when used together, as a back-and-forth process of improving Claude’s training and then testing for problems. The constitution reflects deeply on our intended personality for Claude; interpretability techniques can give us a window into whether that intended personality has taken hold.16
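As a rough illustration of what "identify a feature, then selectively activate it" can look like, here is a minimal sketch using random vectors as stand-ins for real model activations. The contrastive-means construction and the hidden-size number are assumptions made for the example; real interpretability work on frontier models is far more involved (sparse autoencoders, circuit tracing, and so on).

```python
# Minimal sketch of "find a feature direction, then steer with it", using random
# vectors as stand-ins for real model activations. Illustrative only.
import numpy as np

rng = np.random.default_rng(1)
d_model = 64  # hypothetical hidden size

# Pretend activations collected on prompts that do / don't express some concept.
acts_with_concept = rng.normal(0.5, 1.0, size=(100, d_model))
acts_without_concept = rng.normal(0.0, 1.0, size=(100, d_model))

# A crude "feature" direction: the difference of mean activations (a contrastive probe).
feature_direction = acts_with_concept.mean(axis=0) - acts_without_concept.mean(axis=0)
feature_direction /= np.linalg.norm(feature_direction)

def feature_score(activation: np.ndarray) -> float:
    """How strongly an activation expresses the feature (projection onto the direction)."""
    return float(activation @ feature_direction)

def steer(activation: np.ndarray, strength: float) -> np.ndarray:
    """Selectively activate the feature by adding the direction to the activation."""
    return activation + strength * feature_direction

x = rng.normal(0.0, 1.0, size=d_model)
print("before steering:", round(feature_score(x), 3))
print("after steering: ", round(feature_score(steer(x, 5.0)), 3))
```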
The third thing we can do to help address autonomy risks is to build the infrastructure necessary to monitor our models in live internal and external use,17
and publicly share any problems we find. The more that people are aware of a particular way today’s AI systems have been observed to behave badly, the more that users, analysts, and researchers can watch for this behavior or similar ones in present or future systems. It also allows AI companies to learn from each other—when concerns are publicly disclosed by one company, other companies can watch for them as well. And if everyone discloses problems, then the industry as a whole gets a much better picture of where things are going well and where they are going poorly.
Anthropic has tried to do this as much as possible. We are investing in a wide range of evaluations so that we can understand the behaviors of our models in the lab, as well as monitoring tools to observe behaviors in the wild (when allowed by customers). This will be essential for giving us and others the empirical information necessary to make better determinations about how these systems operate and how they break. We publicly disclose “system cards” with each model release that aim for completeness and a thorough exploration of possible risks. Our system cards often run to hundreds of pages, and require substantial pre-release effort that we could have spent on pursuing maximal commercial advantage. We’ve also broadcasted model behaviors more loudly when we see particularly concerning ones, as with the tendency to engage in blackmail.
The fourth thing we can do is encourage coordination to address autonomy risks at the level of industry and society. While it is incredibly valuable for individual AI companies to engage in good practices or become good at steering AI models, and to share their findings publicly, the reality is that not all AI companies do this, and the worst ones can still be a danger to everyone even if the best ones have excellent practices. For example, some AI companies have shown a disturbing negligence towards the sexualization of children in today’s models, which makes me doubt that they’ll show either the inclination or the ability to address autonomy risks in future models. In addition, the commercial race between AI companies will only continue to heat up, and while the science of steering models can have some commercial benefits, overall the intensity of the race will make it increasingly hard to focus on addressing autonomy risks. I believe the only solution is legislation—laws that directly affect the behavior of AI companies, or otherwise incentivize R&D to solve these issues.
Here it is worth keeping in mind the warnings I gave at the beginning of this essay about uncertainty and surgical interventions. We do not know for sure whether autonomy risks will be a serious problem—as I said, I reject claims that the danger is inevitable or even that something will go wrong by default. A credible risk of danger is enough for me and for Anthropic to pay quite significant costs to address it, but once we get into regulation, we are forcing a wide range of actors to bear economic costs, and many of these actors don’t believe that autonomy risk is real or that AI will become powerful enough for it to be a threat. I believe these actors are mistaken, but we should be pragmatic about the amount of opposition we expect to see and the dangers of overreach. There is also a genuine risk that overly prescriptive legislation ends up imposing tests or rules that don’t actually improve safety but that waste a lot of time (essentially amounting to “safety theater”)—this too would cause backlash and make safety legislation look silly.18
Anthropic’s view has been that the right place to start is with transparency legislation, which essentially tries to require that every frontier AI company engage in the transparency practices I’ve described earlier in this section. California’s SB 53 and New York’s RAISE Act are examples of this kind of legislation, which Anthropic supported and which have successfully passed. In supporting and helping to craft these laws, we’ve put a particular focus on trying to minimize collateral damage, for example by exempting smaller companies unlikely to produce frontier models from the law.19
Our hope is that transparency legislation will give a better sense over time of how likely or severe autonomy risks are shaping up to be, as well as the nature of these risks and how best to prevent them. As more specific and actionable evidence of risks emerges (if it does), future legislation over the coming years can be surgically focused on the precise and well-substantiated direction of risks, minimizing collateral damage. To be clear, if truly strong evidence of risks emerges, then rules should be proportionately strong.
Overall, I am optimistic that a mixture of alignment training, mechanistic interpretability, efforts to find and publicly disclose concerning behaviors, safeguards, and societal-level rules can address AI autonomy risks, although I am most worried about societal-level rules and the behavior of the least responsible players (and it’s the least responsible players who advocate most strongly against regulation). I believe the remedy is what it always is in a democracy: those of us who believe in this cause should make our case that these risks are real and that our fellow citizens need to band together to protect themselves.
2. A surprising and terrible empowerment
Misuse for destruction
Let’s suppose that the problems of AI autonomy have been solved—we are no longer worried that the country of AI geniuses will go rogue and overpower humanity. The AI geniuses do what humans want them to do, and because they have enormous commercial value, individuals and organizations throughout the world can “rent” one or more AI geniuses to do various tasks for them.
Everyone having a superintelligent genius in their pocket is an amazing advance and will lead to an incredible creation of economic value and improvement in the quality of human life. I talk about these benefits in great detail in Machines of Loving Grace. But not every effect of making everyone superhumanly capable will be positive. It can potentially amplify the ability of individuals or small groups to cause destruction on a much larger scale than was possible before, by making use of sophisticated and dangerous tools (such as weapons of mass destruction) that were previously only available to a select few with a high level of skill, specialized training, and focus.
As Bill Joy wrote 25 years ago in Why the Future Doesn’t Need Us:20
What Joy is pointing to is the idea that causing large-scale destruction requires both motive and ability, and as long as ability is restricted to a small set of highly trained people, there is relatively limited risk of single individuals (or small groups) causing such destruction.21
A disturbed loner can perpetrate a school shooting, but probably can’t build a nuclear weapon or release a plague.
In fact, ability and motive may even be negatively correlated. The kind of person who has the ability to release a plague is probably highly educated: likely a PhD in molecular biology, and a particularly resourceful one, with a promising career, a stable and disciplined personality, and a lot to lose. This kind of person is unlikely to be interested in killing a huge number of people for no benefit to themselves and at great risk to their own future—they would need to be motivated by pure malice, intense grievance, or instability.
Such people do exist, but they are rare, and tend to become huge stories when they occur, precisely because they are so unusual.22
They also tend to be difficult to catch because they are intelligent and capable, sometimes leaving mysteries that take years or decades to solve. The most famous example is probably mathematician Theodore Kaczynski (the Unabomber), who evaded FBI capture for nearly 20 years, and was driven by an anti-technological ideology. Another example is biodefense researcher Bruce Ivins, who seems to have orchestrated a series of anthrax attacks in 2001. It’s also happened with skilled non-state organizations: the cult Aum Shinrikyo managed to obtain sarin nerve gas and kill 14 people (as well as injuring hundreds more) by releasing it in the Tokyo subway in 1995.
Thankfully, none of these attacks used contagious biological agents, because the ability to construct or obtain these agents was beyond the capabilities of even these people.23
Advances in molecular biology have now significantly lowered the barrier to creating biological weapons (especially in terms of availability of materials), but it still takes an enormous amount of expertise in order to do so. I am concerned that a genius in everyone’s pocket could remove that barrier, essentially making everyone a PhD virologist who can be walked through the process of designing, synthesizing, and releasing a biological weapon step-by-step. Preventing the elicitation of this kind of information in the face of serious adversarial pressure—so-called “jailbreaks”—likely demands layers of defenses beyond those ordinarily baked into training.
Crucially, this will break the correlation between ability and motive: the disturbed loner who wants to kill people but lacks the discipline or skill to do so will now be elevated to the capability level of the PhD virologist, who is unlikely to have this motivation. This concern generalizes beyond biology (although I think biology is the scariest area) to any area where great destruction is possible but currently requires a high level of skill and discipline. To put it another way, renting a powerful AI gives intelligence to malicious (but otherwise average) people. I am worried there are potentially a large number of such people out there, and that if they have access to an easy way to kill millions of people, sooner or later one of them will do it. Additionally, those who do have expertise may be enabled to commit even larger-scale destruction than they could before.
Biology is by far the area I’m most worried about, because of its very large potential for destruction and the difficulty of defending against it, so I’ll focus on biology in particular. But much of what I say here applies to other risks, like cyberattacks, chemical weapons, or nuclear technology.
I am not going to go into detail about how to make biological weapons, for reasons that should be obvious. But at a high level, I am concerned that LLMs are approaching (or may already have reached) the knowledge needed to create and release them end-to-end, and that their potential for destruction is very high. Some biological agents could cause millions of deaths if a determined effort was made to release them for maximum spread. However, this would still take a very high level of skill, including a number of very specific steps and procedures that are not widely known. My concern is not merely fixed or static knowledge. I am concerned that LLMs will be able to take someone of average knowledge and ability and walk them through a complex process that might otherwise go wrong or require debugging in an interactive way, similar to how tech support might help a non-technical person debug and fix complicated computer-related problems (although this would be a more extended process, probably lasting over weeks or months).
More capable LLMs (substantially beyond the power of today’s) might be capable of enabling even more frightening acts. In 2024, a group of prominent scientists wrote a letter warning about the risks of researching, and potentially creating, a dangerous new type of organism: “mirror life.” The DNA, RNA, ribosomes, and proteins that make up biological organisms all have the same chirality (also called “handedness”) that causes them to be not equivalent to a version of themselves reflected in the mirror (just as your right hand cannot be rotated in such a way as to be identical to your left). But the whole system of proteins binding to each other, the machinery of DNA synthesis and RNA translation and the construction and breakdown of proteins, all depends on this handedness. If scientists made versions of this biological material with the opposite handedness—and there are some potential advantages of these, such as medicines that last longer in the body—it could be extremely dangerous. This is because left-handed life, if it were made in the form of complete organisms capable of reproduction (which would be very difficult), would potentially be indigestible to any of the systems that break down biological material on earth—it would have a “key” that wouldn’t fit into the “lock” of any existing enzyme. This would mean that it could proliferate in an uncontrollable way and crowd out all life on the planet, in the worst case even destroying all life on earth.
There is substantial scientific uncertainty about both the creation and potential effects of mirror life. The 2024 letter accompanied a report that concluded that “mirror bacteria could plausibly be created in the next one to few decades,” which is a wide range. But a sufficiently powerful AI model (to be clear, far more capable than any we have today) might be able to discover how to create it much more rapidly—and actually help someone do so.
My view is that even though these are obscure risks, and might seem unlikely, the magnitude of the consequences is so large that they should be taken seriously as a first-class risk of AI systems.
Skeptics have raised a number of objections to the seriousness of these biological risks from LLMs, which I disagree with but which are worth addressing. Most fall into the category of not appreciating the exponential trajectory that the technology is on. Back in 2023 when we first started talking about biological risks from LLMs, skeptics said that all the necessary information was available on Google and LLMs didn’t add anything beyond this. It was never true that Google could give you all the necessary information: genomes are freely available, but as I said above, certain key steps, as well as a huge amount of practical know-how, cannot be gotten in that way. But also, by the end of 2023 LLMs were clearly providing information beyond what Google could give for some steps of the process.
After this, skeptics retreated to the objection that LLMs weren’t end-to-end useful, and couldn’t help with bioweapons acquisition as opposed to just providing theoretical information. As of mid-2025, our measurements show that LLMs may already be providing substantial uplift in several relevant areas, perhaps doubling or tripling the likelihood of success. This led to us deciding that Claude Opus 4 (and the subsequent Sonnet 4.5, Opus 4.1, and Opus 4.5 models) needed to be released under our AI Safety Level 3 protections in our Responsible Scaling Policy framework, and to implementing safeguards against this risk (more on this later). We believe that models are likely now approaching the point where, without safeguards, they could be useful in enabling someone with a STEM degree but not specifically a biology degree to go through the whole process of producing a bioweapon.
Another objection is that there are other actions unrelated to AI that society can take to block the production of bioweapons. Most prominently, the gene synthesis industry makes biological specimens on demand, and there is no federal requirement that providers screen orders to make sure they do not contain pathogens. An MIT study found that 36 out of 38 providers fulfilled an order containing the sequence of the 1918 flu. I am supportive of mandated gene synthesis screening that would make it harder for individuals to weaponize pathogens, in order to reduce both AI-driven biological risks and also biological risks in general. But this is not something we have today. It would also be only one tool in reducing risk; it is a complement to guardrails on AI systems, not a substitute.
The best objection is one that I’ve rarely seen raised: that there is a gap between the models being useful in principle and the actual propensity of bad actors to use them. Most individual bad actors are disturbed individuals, so almost by definition their behavior is unpredictable and irrational—and it’s these bad actors, the unskilled ones, who might have stood to benefit the most from AI making it much easier to kill many people.24
Just because a type of violent attack is possible, doesn’t mean someone will decide to do it. Perhaps biological attacks will be unappealing because they are reasonably likely to infect the perpetrator, they don’t cater to the military-style fantasies that many violent individuals or groups have, and it is hard to selectively target specific people. It could also be that going through a process that takes months, even if an AI walks you through it, involves an amount of patience that most disturbed individuals simply don’t have. We may simply get lucky and motive and ability don’t combine, in practice, in quite the right way.
But this seems like very flimsy protection to rely on. The motives of disturbed loners can change for any reason or no reason, and in fact there are already instances of LLMs being used in attacks (just not with biology). The focus on disturbed loners also ignores ideologically motivated terrorists, who are often willing to expend large amounts of time and effort (for example, the 9/11 hijackers). Wanting to kill as many people as possible is a motive that will probably arise sooner or later, and it unfortunately suggests bioweapons as the method. Even if this motive is extremely rare, it only has to materialize once. And as biology advances (increasingly driven by AI itself), it may also become possible to carry out more selective attacks (for example, targeted against people with specific ancestries), which adds yet another, very chilling, possible motive.
I do not think biological attacks will necessarily be carried out the instant it becomes widely possible to do so—in fact, I would bet against that. But added up across millions of people and a few years of time, I think there is a serious risk of a major attack, and the consequences would be so severe (with casualties potentially in the millions or more) that I believe we have no choice but to take serious measures to prevent it.
Defenses
That brings us to how to defend against these risks. Here I see three things we can do. First, AI companies can put guardrails on their models to prevent them from helping to produce bioweapons. Anthropic is very actively doing this. Claude’s Constitution, which mostly focuses on high-level principles and values, has a small number of specific hard-line prohibitions, and one of them relates to helping with the production of biological (or chemical, or nuclear, or radiological) weapons. But all models can be jailbroken, and so as a second line of defense, we’ve implemented (since mid-2025, when our tests showed our models were starting to get close to the threshold where they might begin to pose a risk) a classifier that specifically detects and blocks bioweapon-related outputs. We regularly upgrade and improve these classifiers, and have generally found them highly robust even against sophisticated adversarial attacks.25
These classifiers increase the costs to serve our models measurably (in some models, they are close to 5% of total inference costs) and thus cut into our margins, but we feel that using them is the right thing to do.
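For concreteness, the basic shape of an output-classifier gate looks something like the sketch below. The keyword-based `placeholder_classifier` and the threshold are stand-ins invented for illustration; a real deployment would use a trained classifier on a dedicated model, plus much more careful thresholds and escalation paths.

```python
# Generic sketch of an output-classifier gate of the kind described above. The
# classifier here is a trivial keyword placeholder, not a real safety classifier.

BLOCK_THRESHOLD = 0.5

def placeholder_classifier(text: str) -> float:
    """Stand-in scoring function; a real deployment would call a trained classifier."""
    flagged_terms = ("restricted-topic",)  # hypothetical marker used only in this sketch
    return 1.0 if any(term in text.lower() for term in flagged_terms) else 0.0

def serve_response(model_output: str) -> str:
    """Score every candidate output and withhold anything over the threshold.
    Running an extra model per response is where the added inference cost comes from."""
    score = placeholder_classifier(model_output)
    if score >= BLOCK_THRESHOLD:
        return "[response withheld by safety classifier]"
    return model_output

print(serve_response("Here is a summary of restricted-topic synthesis steps."))
print(serve_response("Here is a summary of today's weather."))
```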
To their credit, some other AI companies have implemented classifiers as well. But not every company has, and there is also nothing requiring companies to keep their classifiers. I am concerned that over time there may be a prisoner’s dilemma where companies can defect and lower their costs by removing classifiers. This is once again a classic negative externalities problem that can’t be solved by the voluntary actions of Anthropic or any other single company alone.26
Voluntary industry standards may help, as may third-party evaluations and verification of the type done by AI security institutes and third-party evaluators.
But ultimately defense may require government action, which is the second thing we can do. My views here are the same as they are for addressing autonomy risks: we should start with transparency requirements,27
which help society measure, monitor, and collectively defend against risks without disrupting economic activity in a heavy-handed way. Then, if and when we reach clearer thresholds of risk, we can craft legislation that more precisely targets these risks and has a lower chance of collateral damage. In the particular case of bioweapons, I actually think that the time for such targeted legislation may be approaching soon—Anthropic and other companies are learning more and more about the nature of biological risks and what is reasonable to require of companies in defending against them. Fully defending against these risks may require working internationally, even with geopolitical adversaries, but there is precedent in treaties prohibiting the development of biological weapons. I am generally a skeptic about most kinds of international cooperation on AI, but this may be one narrow area where there is some chance of achieving global restraint. Even dictatorships do not want massive bioterrorist attacks.
Finally, the third countermeasure we can take is to try to develop defenses against biological attacks themselves. This could include monitoring and tracking for early detection, investments in air purification R&D (such as far-UVC disinfection), rapid vaccine development that can respond and adapt to an attack, better personal protective equipment (PPE),28
and treatments or vaccinations for some of the most likely biological agents. mRNA vaccines, which can be designed to respond to a particular virus or variant, are an early example of what is possible here. Anthropic is excited to work with biotech and pharmaceutical companies on this problem. But unfortunately I think our expectations on the defensive side should be limited. There is an asymmetry between attack and defense in biology, because agents spread rapidly on their own, while defenses require detection, vaccination, and treatment to be organized across large numbers of people very quickly in response. Unless the response is lightning quick (which it rarely is), much of the damage will be done before a response is possible. It is conceivable that future technological improvements could shift this balance in favor of defense (and we should certainly use AI to help develop such technological advances), but until then, preventative safeguards will be our main line of defense.
It’s worth a brief mention of cyberattacks here, since unlike biological attacks, AI-led cyberattacks have actually happened in the wild, including at large scale and for state-sponsored espionage. We expect these attacks to grow more capable as models rapidly advance, until AI-led attacks are the main way in which cyberattacks are conducted. I expect them to become a serious and unprecedented threat to the integrity of computer systems around the world, and Anthropic is working very hard to shut down these attacks and eventually to reliably prevent them from happening. The reason I haven’t focused on cyber as much as biology is that (1) cyberattacks are much less likely to kill people, certainly not at the scale of biological attacks, and (2) the offense-defense balance may be more tractable in cyber, where there is at least some hope that defense could keep up with (and ideally even outpace) AI-enabled attacks if we invest in it properly.
Although biology is currently the most serious vector of attack, there are many other vectors and it is possible that a more dangerous one may emerge. The general principle is that without countermeasures, AI is likely to continuously lower the barrier to destructive activity on a larger and larger scale, and humanity needs a serious response to this threat.
3. The odious apparatus
Misuse for seizing power
The previous section discussed the risk of individuals and small organizations co-opting a small subset of the “country of geniuses in a datacenter” to cause large-scale destruction. But we should also worry—likely substantially more so—about misuse of AI for the purpose of wielding or seizing power, likely by larger and more established actors.29
In Machines of Loving Grace, I discussed the possibility that authoritarian governments might use powerful AI to surveil or repress their citizens in ways that would be extremely difficult to reform or overthrow. Current autocracies are limited in how repressive they can be by the need to have humans carry out their orders, and humans often have limits in how inhumane they are willing to be. But AI-enabled autocracies would not have such limits.
Worse yet, countries could also use their advantage in AI to gain power over other countries. If the “country of geniuses” as a whole was simply owned and controlled by a single (human) country’s military apparatus, and other countries did not have equivalent capabilities, it is hard to see how they could defend themselves: they would be outsmarted at every turn, similar to a war between humans and mice. Putting these two concerns together leads to the alarming possibility of a global totalitarian dictatorship. Obviously, it should be one of our highest priorities to prevent this outcome.
There are many ways in which AI could enable, entrench, or expand autocracy, but I’ll list a few that I’m most worried about. Note that some of these applications have legitimate defensive uses, and I am not necessarily arguing against them in absolute terms; I am nevertheless worried that they structurally tend to favor autocracies:
Having described what I am worried about, let’s move on to who. I am worried about entities who have the most access to AI, who are starting from a position of the most political power, or who have an existing history of repression. In order of severity, I am worried about:
There are a number of possible arguments against the severity of these threats, and I wish I believed them, because AI-enabled authoritarianism terrifies me. It’s worth going through some of these arguments and responding to them.
First, some people might put their faith in the nuclear deterrent, particularly to counter the use of AI autonomous weapons for military conquest. If someone threatens to use these weapons against you, you can always threaten a nuclear response back. My worry is that I’m not totally sure we can be confident in the nuclear deterrent against a country of geniuses in a datacenter: it is possible that powerful AI could devise ways to detect and strike nuclear submarines, conduct influence operations against the operators of nuclear weapons infrastructure, or use AI’s cyber capabilities to launch a cyberattack against satellites used to detect nuclear launches.33
Alternatively, it’s possible that taking over countries is feasible with only AI surveillance and AI propaganda, and never actually presents a clear moment when it’s obvious what is going on and a nuclear response would be appropriate. Maybe these things aren’t feasible and the nuclear deterrent will still be effective, but the stakes seem too high to take that risk.34
A second possible objection is that there might be countermeasures we can take against these tools of autocracy. We can counter drones with our own drones, cyberdefense will improve along with cyberattack, there may be ways to immunize people against propaganda, etc. My response is that these defenses will only be possible with comparably powerful AI. If there isn’t some counterforce with a comparably smart and numerous country of geniuses in a datacenter, it won’t be possible to match the quality or quantity of drones, for cyberdefense to outsmart cyberoffense, etc. So the question of countermeasures reduces to the question of a balance of power in powerful AI. Here, I am concerned about the recursive or self-reinforcing property of powerful AI (which I discussed at the beginning of this essay): that each generation of AI can be used to design and train the next generation of AI. This leads to a risk of a runaway advantage, where the current leader in powerful AI may be able to increase their lead and may be difficult to catch up with. We need to make sure it is not an authoritarian country that gets to this loop first.
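One way to see why this feedback loop is so dangerous is a deliberately crude toy model (my own illustration, not an analysis from this essay): suppose each generation of AI multiplies capability by a factor g > 1 and also shortens the time needed to build the next generation by the same factor, so that building generation n takes time T_0 / g^n. The total time to reach generation N is then

$$
\sum_{n=0}^{N-1} \frac{T_0}{g^{\,n}} \;<\; T_0 \cdot \frac{g}{g-1},
$$

which stays bounded no matter how large N becomes. Under these admittedly extreme assumptions, whoever enters the loop first can run through arbitrarily many generations within a fixed window of calendar time, so a follower who starts even slightly later falls arbitrarily far behind during that window. More realistic assumptions soften the math, but the qualitative point about a self-reinforcing lead is the same.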
Furthermore, even if a balance of power can be achieved, there is still a risk that the world could be split into autocratic spheres, as in Nineteen Eighty-Four. Even if several competing powers each have their own powerful AI models, and none can overpower the others, each power could still internally repress its own population, and would be very difficult to overthrow (since the populations don’t have powerful AI with which to defend themselves). It is thus important to prevent AI-enabled autocracy even if it doesn’t lead to a single country taking over the world.
Defenses
How do we defend against this wide range of autocratic tools and potential threat actors? As in the previous sections, there are several things I think we can do. First, we should absolutely not be selling chips, chip-making tools, or datacenters to the CCP. Chips and chip-making tools are the single greatest bottleneck to powerful AI, and blocking them is a simple but extremely effective measure, perhaps the most important single action we can take. It makes no sense to sell the CCP the tools with which to build an AI totalitarian state and possibly conquer us militarily. A number of complicated arguments are made to justify such sales, such as the idea that “spreading our tech stack around the world” allows “America to win” in some general, unspecified economic battle. In my view, this is like selling nuclear weapons to North Korea and then bragging that the missile casings are made by Boeing and so the US is “winning.” China is several years behind the US in its ability to produce frontier chips in quantity, and the critical period for building the country of geniuses in a datacenter is very likely to fall within those next several years.35
There is no reason to give a giant boost to their AI industry during this critical period.
Second, it makes sense to use AI to empower democracies to resist autocracies. This is the reason Anthropic considers it important to provide AI to the intelligence and defense communities in the US and its democratic allies. Defending democracies that are under attack, such as Ukraine and (via cyberattacks) Taiwan, seems especially high priority, as does empowering democracies to use their intelligence services to disrupt and degrade autocracies from the inside. At some level the only way to respond to autocratic threats is to match and outclass them militarily. A coalition of the US and its democratic allies, if it achieved predominance in powerful AI, would be in a position not only to defend itself against autocracies, but to contain them and limit their AI totalitarian abuses.
Third, we need to draw a hard line against AI abuses within democracies. There need to be limits to what we allow our governments to do with AI, so that they don’t seize power or repress their own people. The formulation I have come up with is that we should use AI for national defense in all ways except those which would make us more like our autocratic adversaries.
Where should the line be drawn? In the list at the beginning of this section, two items—using AI for domestic mass surveillance and mass propaganda—seem to me like bright red lines and entirely illegitimate. Some might argue that there’s no need to do anything (at least in the US), since domestic mass surveillance is already illegal under the Fourth Amendment. But the rapid progress of AI may create situations that our existing legal frameworks are not well designed to deal with. For example, it would likely not be unconstitutional for the US government to conduct massively scaled recordings of all public conversations (e.g., things people say to each other on a street corner), and previously it would have been difficult to sort through this volume of information, but with AI it could all be transcribed, interpreted, and triangulated to create a picture of the attitudes and loyalties of many or most citizens. I would support civil liberties-focused legislation (or maybe even a constitutional amendment) that imposes stronger guardrails against AI-powered abuses.
The other two items—fully autonomous weapons and AI for strategic decision-making—are harder lines to draw since they have legitimate uses in defending democracy, while also being prone to abuse. Here I think what is warranted is extreme care and scrutiny combined with guardrails to prevent abuses. My main fear is having too small a number of “fingers on the button,” such that one or a handful of people could essentially operate a drone army without needing any other humans to cooperate to carry out their orders. As AI systems get more powerful, we may need to have more direct and immediate oversight mechanisms to ensure they are not misused, perhaps involving branches of government other than the executive. I think we should approach fully autonomous weapons in particular with great caution,36
and not rush into their use without proper safeguards.
Fourth, after drawing a hard line against AI abuses in democracies, we should use that precedent to create an international taboo against the worst abuses of powerful AI. I recognize that the current political winds have turned against international cooperation and international norms, but this is a case where we sorely need them. The world needs to understand the dark potential of powerful AI in the hands of autocrats, and to recognize that certain uses of AI amount to an attempt to permanently steal people’s freedom and impose a totalitarian state from which they cannot escape. I would even argue that in some cases, large-scale surveillance with powerful AI, mass propaganda with powerful AI, and certain types of offensive uses of fully autonomous weapons should be considered crimes against humanity. More generally, a robust norm against AI-enabled totalitarianism and all its tools and instruments is urgently needed.
It is possible to have an even stronger version of this position, which is that because the possibilities of AI-enabled totalitarianism are so dark, autocracy is simply not a form of government that people can accept in the post-powerful AI age. Just as feudalism became unworkable with the industrial revolution, the AI age could lead inevitably and logically to the conclusion that democracy (and, hopefully, democracy improved and reinvigorated by AI, as I discuss in Machines of Loving Grace) is the only viable form of government if humanity is to have a good future.
Fifth and finally, AI companies themselves should be carefully watched, as should their relationship with government, which is necessary but must have limits and boundaries. The sheer amount of capability embodied in powerful AI is such that ordinary corporate governance—which is designed to protect shareholders and prevent ordinary abuses such as fraud—is unlikely to be up to the task of governing AI companies. There may also be value in companies publicly committing (perhaps even as part of their corporate governance) not to take certain actions, such as privately building or stockpiling military hardware, allowing single individuals to direct large amounts of computing resources in unaccountable ways, or using their AI products as propaganda to manipulate public opinion in their favor.
The danger here comes from many directions, and some directions are in tension with others. The only constant is that we must seek accountability, norms, and guardrails for everyone, even as we empower “good” actors to keep “bad” actors in check.
4. Player piano
Economic disruption
The previous three sections were essentially about security risks posed by powerful AI: risks from the AI itself, risks of misuse by individuals and small organizations, and risks of misuse by states and large organizations. If we put aside security risks, or assume they have been solved, the next question is economic. What will be the effect of this infusion of incredible “human” capital on the economy? Clearly, the most obvious effect will be to greatly increase economic growth. The pace of advances in scientific research, biomedical innovation, manufacturing, supply chains, the efficiency of the financial system, and much more is almost guaranteed to lead to a much faster rate of economic growth. In Machines of Loving Grace, I suggest that a 10–20% sustained annual GDP growth rate may be possible.
But it should be clear that this is a double-edged sword: what are the economic prospects for most existing humans in such a world? New technologies often bring labor market shocks, and in the past humans have always recovered from them, but I am concerned that this is because these previous shocks affected only a small fraction of the full possible range of human abilities, leaving room for humans to expand to new tasks. AI will have effects that are much broader and occur much faster, and therefore I worry it will be much more challenging to make things work out well.
Labor market disruption
There are two specific problems I am worried about: labor market displacement, and concentration of economic power. Let’s start with the first one. This is a topic I warned about very publicly in 2025, when I predicted that AI could displace half of all entry-level white-collar jobs in the next 1–5 years, even as it accelerates economic growth and scientific progress. That warning started a public debate about the topic. Many CEOs, technologists, and economists agreed with me, but others assumed I was falling prey to a “lump of labor” fallacy and didn’t know how labor markets worked, and some overlooked the 1–5-year time range and thought I was claiming that AI is displacing jobs right now (which I agree it likely is not). So it is worth going through in detail why I am worried about labor displacement, to clear up these misunderstandings.
As a baseline, it’s useful to understand how labor markets normally respond to advances in technology. When a new technology comes along, it starts by making pieces of a given human job more efficient. For example, early in the Industrial Revolution, machines, such as upgraded plows, enabled human farmers to be more efficient at some aspects of the job. This improved the productivity of farmers, which increased their wages.
In the next step, some parts of the job of farming could be done entirely by machines, for example with the invention of the threshing machine or seed drill. In this phase, humans did a lower and lower fraction of the job, but the work they still did became more and more leveraged because it was complementary to the work of machines, and their productivity continued to rise. In a dynamic akin to Jevons’ paradox, the wages of farmers, and perhaps even the number of farmers, continued to increase. Even when 90% of the job is being done by machines, humans can simply do 10x more of the 10% they still do, producing 10x as much output for the same amount of labor.
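To spell out the arithmetic behind that 10x figure (a back-of-the-envelope restatement of the numbers in the paragraph above, not an additional claim): if machines handle a share m of the job and a worker’s hour is spent entirely on the remaining human share s = 1 − m, then

$$
\text{output per worker-hour} \;\propto\; \frac{1}{s} = \frac{1}{1-m},
$$

so at m = 0.9 each hour of human labor supports 1/0.1 = 10 units of final output, ten times the unmechanized level. As long as demand for the final good expands enough to absorb that extra output, wages and even headcount can keep rising.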
Eventually, machines do everything or almost everything, as with modern combine harvesters, tractors, and other equipment. At this point farming as a form of human employment really does go into steep decline, and this potentially causes serious disruption in the short term, but because farming is just one of many useful activities that humans are able to do, people eventually switch to other jobs, such as operating factory machines. This is true even though farming accounted for a huge proportion of employment ex ante. 250 years ago, 90% of Americans lived on farms; in Europe, 50–60% of employment was agricultural. Now those percentages are in the low single digits in those places, because workers switched to industrial jobs (and later, knowledge work jobs). The economy can do what previously required most of the labor force with only 1–2% of it, freeing up the rest of the labor force to build an ever more advanced industrial society. There’s no fixed “lump of labor,” just an ever-expanding ability to do more and more with less and less. People’s wages rise in line with the GDP exponential and the economy maintains full employment once disruptions in the short term have passed.
It’s possible things will go roughly the same way with AI, but I would bet pretty strongly against it. Here are some reasons I think AI is likely to be different:
It’s worth addressing common points of skepticism. First, there is the argument that economic diffusion will be slow, such that even if the underlying technology is capable of doing most human labor, the actual application of it across the economy may be much slower (for example in industries that are far from the AI industry and slow to adopt). Slow diffusion of technology is definitely real—I talk to people from a wide variety of enterprises, and there are places where the adoption of AI will take years. That’s why my prediction that 50% of entry-level white-collar jobs will be disrupted spans 1–5 years, even though I suspect we’ll have powerful AI (which would be, technologically speaking, enough to do most or all jobs, not just entry-level ones) in much less than 5 years. But diffusion effects merely buy us time. And I am not confident they will be as slow as people predict. Enterprise AI adoption is growing at rates much faster than any previous technology, largely on the pure strength of the technology itself. Also, even if traditional enterprises are slow to adopt new technology, startups will spring up to serve as “glue” and make adoption easier. If that doesn’t work, the startups may simply disrupt the incumbents directly.
That could lead to a world where it isn’t so much that specific jobs are disrupted as it is that large enterprises are disrupted in general and replaced with much less labor-intensive startups. This could also lead to a world of “geographic inequality,” where an increasing fraction of the world’s wealth is concentrated in Silicon Valley, which becomes its own economy running at a different speed than the rest of the world and leaving it behind. All of these outcomes would be great for economic growth—but not so great for the labor market or those who are left behind.
Second, some people say that human jobs will move to the physical world, which avoids the whole category of “cognitive labor” where AI is progressing so rapidly. I am not sure how safe this is, either. A lot of physical labor is already being done by machines (e.g., manufacturing) or will soon be done by machines (e.g., driving). Also, sufficiently powerful AI will be able to accelerate the development of robots, and then control those robots in the physical world. It may buy some time (which is a good thing), but I’m worried it won’t buy much. And even if the disruption was limited only to cognitive tasks, it would still be an unprecedentedly large and rapid disruption.
Third, perhaps some tasks inherently require or greatly benefit from a human touch. I’m a little more uncertain about this one, but I’m still skeptical that it will be enough to offset the bulk of the impacts I described above. AI is already widely used for customer service. Many people report that it is easier to talk to AI about their personal problems than to talk to a therapist—that the AI is more patient. When my sister was struggling with medical problems during a pregnancy, she felt she wasn’t getting the answers or support she needed from her care providers, and she found Claude to have a better bedside manner (as well as succeeding better at diagnosing the problem). I’m sure there are some tasks for which a human touch really is important, but I’m not sure how many—and here we’re talking about finding work for nearly everyone in the labor market.
Fourth, some may argue that comparative advantage will still protect humans. Under the law of comparative advantage, even if AI is better than humans at everything, any relative differences between the human and AI profiles of skills create a basis for trade and specialization between humans and AI. The problem is that if AIs are literally thousands of times more productive than humans, this logic starts to break down. Even tiny transaction costs could make it not worth it for AI to trade with humans. And human wages may be very low even if humans technically have something to offer.
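A toy calculation shows how tiny transaction costs can swamp comparative advantage; the specific numbers are mine, chosen only to be consistent with the “thousands of times” figure above. Suppose an AI produces a = 1000 units of value per hour and a human produces h = 1. Delegating an hour of work to the human gains at most h units, while coordinating the handoff costs the AI a fraction τ of its own hour, worth τ·a units of forgone output. The trade is worthwhile only when

$$
h \;>\; \tau a \quad\Longleftrightarrow\quad \tau \;<\; \frac{h}{a} = 0.1\%,
$$

and even below that threshold the human’s wage is capped by the remaining sliver h − τ·a, which is close to zero.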
It’s possible all of these factors can be addressed—that the labor market is resilient enough to adapt to even such an enormous disruption. But even if it can eventually adapt, the factors above suggest that the short-term shock will be unprecedented in size.
Defenses
What can we do about this problem? I have several suggestions, some of which Anthropic is already pursuing. The first is simply to get accurate data about what is happening with job displacement in real time. When an economic change happens very quickly, it’s hard to get reliable data about it, and without reliable data it is hard to design effective policies. For example, government statistics currently lack granular, high-frequency data on AI adoption across firms and industries. For the last year Anthropic has been operating and publicly releasing an Economic Index that shows use of our models almost in real time, broken down by industry, task, location, and even whether a task was being automated or conducted collaboratively. We also have an Economic Advisory Council to help us interpret this data and see what is coming.
Second, AI companies have a choice in how they work with enterprises. The very inefficiency of traditional enterprises means that their rollout of AI can be very path dependent, and there is some room to choose a better path. Enterprises often have a choice between “cost savings” (doing the same thing with fewer people) and “innovation” (doing more with the same number of people). The market will inevitably produce both eventually, and any competitive AI company will have to serve some of both, but there may be some room to steer companies towards innovation when possible, and it may buy us some time. Anthropic is actively thinking about this.
Third, companies should think about how to take care of their employees. In the short term, being creative about ways to reassign employees within companies may be a promising way to stave off the need for layoffs. In the long term, in a world with enormous total wealth, in which many companies increase greatly in value due to increased productivity and capital concentration, it may be feasible to pay human employees even long after they are no longer providing economic value in the traditional sense. Anthropic is currently considering a range of possible pathways for our own employees that we will share in the near future.
Fourth, wealthy individuals have an obligation to help solve this problem. It is sad to me that many wealthy individuals (especially in the tech industry) have recently adopted a cynical and nihilistic attitude that philanthropy is inevitably fraudulent or useless. Both private philanthropy like the Gates Foundation and public programs like PEPFAR have saved tens of millions of lives in the developing world, and helped to create economic opportunity in the developed world. All of Anthropic’s co-founders have pledged to donate 80% of our wealth, and Anthropic’s staff have individually pledged to donate company shares worth billions at current prices—donations that the company has committed to matching.
Fifth, while all the above private actions can be helpful, ultimately a macroeconomic problem this large will require government intervention. The natural policy response to an enormous economic pie coupled with high inequality (due to a lack of jobs, or poorly paid jobs, for many) is progressive taxation. The tax could be general or could be targeted against AI companies in particular. Obviously tax design is complicated, and there are many ways for it to go wrong. I don’t support poorly designed tax policies. I think the extreme levels of inequality predicted in this essay justify a more robust tax policy on basic moral grounds, but I can also make a pragmatic argument to the world’s billionaires that it’s in their interest to support a good version of it: if they don’t support a good version, they’ll inevitably get a bad version designed by a mob.
Ultimately, I think of all of the above interventions as ways to buy time. In the end AI will be able to do everything, and we need to grapple with that. It’s my hope that by that time, we can use AI itself to help us restructure markets in ways that work for everyone, and that the interventions above can get us through the transitional period.
Economic concentration of power
Separate from the problem of job displacement or economic inequality per se is the problem of economic concentration of power. Section 1 discussed the risk that humanity gets disempowered by AI, and Section 3 discussed the risk that citizens get disempowered by their governments by force or coercion. But another kind of disempowerment can occur if there is such a huge concentration of wealth that a small group of people effectively controls government policy with their influence, and ordinary citizens have no influence because they lack economic leverage. Democracy is ultimately backstopped by the idea that the population as a whole is necessary for the operation of the economy. If that economic leverage goes away, then the implicit social contract of democracy may stop working. Others have written about this, so I needn’t go into great detail about it here, but I agree with the concern, and I worry it is already starting to happen.
To be clear, I am not opposed to people making a lot of money. There’s a strong argument that it incentivizes economic growth under normal conditions. I am sympathetic to concerns about impeding innovation by killing the golden goose that generates it. But in a scenario where GDP growth is 10–20% a year and AI is rapidly taking over the economy, yet single individuals hold appreciable fractions of the GDP, innovation is not the thing to worry about. The thing to worry about is a level of wealth concentration that will break society.
The most famous example of extreme concentration of wealth in US history is the Gilded Age, and the wealthiest industrialist of the Gilded Age was John D. Rockefeller. Rockefeller’s wealth amounted to ~2% of the US GDP at the time.42
A similar fraction today would lead to a fortune of $600B, and the richest person in the world today (Elon Musk) already exceeds that, at roughly $700B. So we are already at historically unprecedented levels of wealth concentration, even before most of the economic impact of AI. I don’t think it is too much of a stretch (if we get a “country of geniuses”) to imagine AI companies, semiconductor companies, and perhaps downstream application companies generating ~$3T in revenue per year,43 being valued at ~$30T, and leading to personal fortunes well into the trillions. In that world, the debates we have about tax policy today simply won’t apply as we will be in a fundamentally different situation.
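Spelling out the arithmetic behind those figures (taking current US GDP as roughly $30T and a roughly 10x revenue multiple for the valuations, both of which are my own rough assumptions rather than figures from the essay’s sources):

$$
0.02 \times \$30\text{T} \approx \$600\text{B}, \qquad \$3\text{T/yr} \times 10 \approx \$30\text{T}.
$$

The first product is the modern-day Rockefeller-equivalent fortune quoted above; the second is how a sector generating ~$3T in annual revenue could plausibly be valued at ~$30T, the scale at which personal fortunes in the trillions become imaginable.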
Related to this, the coupling of this economic concentration of wealth with the political system already concerns me. AI datacenters already represent a substantial fraction of US economic growth,44
and are thus strongly tying together the financial interests of large tech companies (which are increasingly focused on either AI or AI infrastructure) and the political interests of the government in a way that can produce perverse incentives. We already see this through the reluctance of tech companies to criticize the US government, and the government’s support for extreme anti-regulatory policies on AI.
Defenses
What can be done about this? First, and most obviously, companies should simply choose not to be part of it. Anthropic has always strived to be a policy actor and not a political one, and to maintain our authentic views whatever the administration. We’ve spoken up in favor of sensible AI regulation and export controls that are in the public interest, even when these are at odds with government policy.45
Many people have told me that we should stop doing this, that it could lead to unfavorable treatment, but in the year we’ve been doing it, Anthropic’s valuation has increased by over 6x, an almost unprecedented jump at our commercial scale.
Second, the AI industry needs a healthier relationship with government—one based on substantive policy engagement rather than political alignment. Our choice to engage on policy substance rather than politics is sometimes read as a tactical error or failure to “read the room” rather than a principled decision, and that framing concerns me. In a healthy democracy, companies should be able to advocate for good policy for its own sake. Related to this, a public backlash against AI is brewing: this could be a corrective, but it’s currently unfocused. Much of it targets issues that aren’t actually problems (like datacenter water usage) and proposes solutions (like datacenter bans or poorly designed wealth taxes) that wouldn’t address the real concerns. The underlying issue that deserves attention is ensuring that AI development remains accountable to the public interest, not captured by any particular political or commercial alliance, and it seems important to focus the public discussion there.
Third, the macroeconomic interventions I described earlier in this section, as well as a resurgence of private philanthropy, can help to balance the economic scales, addressing both the job displacement and concentration of economic power problems at once. We should look to the history of our country here: even in the Gilded Age, industrialists such as Rockefeller and Carnegie felt a strong obligation to society at large, a feeling that society had contributed enormously to their success and they needed to give back. That spirit seems to be increasingly missing today, and I think it is a large part of the way out of this economic dilemma. Those who are at the forefront of AI’s economic boom should be willing to give away both their wealth and their power.
5. Black seas of infinity
Indirect effects
This last section is a catchall for unknown unknowns, particularly things that could go wrong as an indirect result of positive advances in AI and the resulting acceleration of science and technology in general. Suppose we address all the risks described so far, and begin to reap the benefits of AI. We will likely get a “century of scientific and economic progress compressed into a decade,” and this will be hugely positive for the world, but we will then have to contend with the problems that arise from this rapid rate of progress, and those problems may come at us fast. We may also encounter other risks that occur indirectly as a consequence of AI progress and are hard to anticipate in advance.
By the nature of unknown unknowns it is impossible to make an exhaustive list, but I’ll list three possible concerns as illustrative examples for what we should be watching for:
My hope with all of these potential problems is that in a world with powerful AI that we trust not to kill us, that is not the tool of an oppressive government, and that is genuinely working on our behalf, we can use AI itself to anticipate and prevent these problems. But that is not guaranteed—like all of the other risks, it is something we have to handle with care.
Humanity’s test
Reading this essay may give the impression that we are in a daunting situation. I certainly found it daunting to write, in contrast with Machines of Loving Grace, which felt like giving form and structure to surpassingly beautiful music that had been echoing in my head for years. And there is much about the situation that genuinely is hard. AI brings threats to humanity from multiple directions, and there is genuine tension between the different dangers, where mitigating some of them risks making others worse if we do not thread the needle extremely carefully.
Taking time to carefully build AI systems so they do not autonomously threaten humanity is in genuine tension with the need for democratic nations to stay ahead of authoritarian nations and not be subjugated by them. But in turn, the same AI-enabled tools that are necessary to fight autocracies can, if taken too far, be turned inward to create tyranny in our own countries. AI-driven terrorism could kill millions through the misuse of biology, but an overreaction to this risk could lead us down the road to an autocratic surveillance state. The labor and economic concentration effects of AI, in addition to being grave problems in their own right, may force us to face the other problems in an environment of public anger and perhaps even civil unrest, rather than being able to call on the better angels of our nature. Above all, the sheer number of risks, including unknown ones, and the need to deal with all of them at once, creates an intimidating gauntlet that humanity must run.
Furthermore, the last few years should make clear that the idea of stopping or even substantially slowing the technology is fundamentally untenable. The formula for building powerful AI systems is incredibly simple, so much so that it can almost be said to emerge spontaneously from the right combination of data and raw computation. Its creation was probably inevitable the instant humanity invented the transistor, or arguably even earlier when we first learned to control fire. If one company does not build it, others will do so nearly as fast. If all companies in democratic countries stopped or slowed development, by mutual agreement or regulatory decree, then authoritarian countries would simply keep going. Given the incredible economic and military value of the technology, together with the lack of any meaningful enforcement mechanism, I don’t see how we could possibly convince them to stop.
I do see a path to a slight moderation in AI development that is compatible with a realist view of geopolitics. That path involves slowing down the march of autocracies towards powerful AI for a few years by denying them the resources they need to build it,46
namely chips and semiconductor manufacturing equipment. This in turn gives democratic countries a buffer that they can “spend” on building powerful AI more carefully, with more attention to its risks, while still proceeding fast enough to comfortably beat the autocracies. The race between AI companies within democracies can then be handled under the umbrella of a common legal framework, via a mixture of industry standards and regulation.
Anthropic has advocated very hard for this path, by pushing for chip export controls and judicious regulation of AI, but even these seemingly common-sense proposals have largely been rejected by policymakers in the United States (which is the country where it’s most important to have them). There is so much money to be made with AI—literally trillions of dollars per year—that it has proven difficult for even the simplest measures to overcome the political economy surrounding AI. This is the trap: AI is so powerful, such a glittering prize, that it is very difficult for human civilization to impose any restraints on it at all.
I can imagine, as Sagan did in Contact, that this same story plays out on thousands of worlds. A species gains sentience, learns to use tools, begins the exponential ascent of technology, faces the crises of industrialization and nuclear weapons, and if it survives those, confronts the hardest and final challenge when it learns how to shape sand into machines that think. Whether we survive that test and go on to build the beautiful society described in Machines of Loving Grace, or succumb to slavery and destruction, will depend on our character and our determination as a species, our spirit and our soul.
Despite the many obstacles, I believe humanity has the strength inside itself to pass this test. I am encouraged and inspired by the thousands of researchers who have devoted their careers to helping us understand and steer AI models, and to shaping the character and constitution of these models. I think there is now a good chance that those efforts bear fruit in time to matter. I am encouraged that at least some companies have stated they’ll pay meaningful commercial costs to block their models from contributing to the threat of bioterrorism. I am encouraged that a few brave people have resisted the prevailing political winds and passed legislation that puts the first early seeds of sensible guardrails on AI systems. I am encouraged that the public understands that AI carries risks and wants those risks addressed. I am encouraged by the indomitable spirit of freedom around the world and the determination to resist tyranny wherever it occurs.
But we will need to step up our efforts if we want to succeed. The first step is for those closest to the technology to simply tell the truth about the situation humanity is in, which I have always tried to do; I’m doing so more explicitly and with greater urgency with this essay. The next step will be convincing the world’s thinkers, policymakers, companies, and citizens of the imminence and overriding importance of this issue—that it is worth expending thought and political capital on this in comparison to the thousands of other issues that dominate the news every day. Then there will be a time for courage, for enough people to buck the prevailing trends and stand on principle, even in the face of threats to their economic interests and personal safety.
The years in front of us will be impossibly hard, asking more of us than we think we can give. But in my time as a researcher, leader, and citizen, I have seen enough courage and nobility to believe that we can win—that when put in the darkest circumstances, humanity has a way of gathering, seemingly at the last minute, the strength and wisdom needed to prevail. We have no time to lose.
I would like to thank Erik Brynjolfsson, Ben Buchanan, Mariano-Florentino Cuéllar, Allan Dafoe, Kevin Esvelt, Nick Beckstead, Richard Fontaine, Jim McClave, and very many of the staff at Anthropic for their helpful comments on drafts of this essay.
Footnotes