When I talk to friends, colleagues, and internet strangers about the risk of ASI takeover, I find that many people have misconceptions about where the dangers come from or how they might be mitigated. A lot of these misconceptions are rooted in misunderstanding how today’s AI systems work and are developed. This article is an attempt to explain the risk of ASI misalignment in a way that makes the dangers and difficulties clear, rooted in examples from contemporary AI tools. While I am far from an expert, I don’t see anyone framing the topic in quite the way that I do here, and I hope it will be of service.
I hope readers familiar with the subject will appreciate the specific way I’ve organized a wide variety of concerns and approaches and find the body of the essay easy to skim. Wherever I’m missing or misunderstanding key ideas, though, I welcome you to point them out. Oftentimes the best way to learn more is to be wrong in public!
I hope readers new to the subject will find this presentation useful in orienting them as they venture further into subtopics of interest. (While the topic won’t be new to many people on LessWrong, I plan to link to this post from elsewhere too.) Any paragraph below could be the beginning of a much deeper exploration – just paste it into your favorite LLM and ask for elaboration. I’ve also provided some relevant links along the way, and links to a few other big picture overviews at the end.
I take it as a premise that superintelligence is possible, even if it requires some technological breakthrough beyond the current paradigm. Many intelligent, well-funded people are hard at work trying to bring these systems about. Their aim is not just to make current AI tools smarter, but to build AI tools that can act on long time scales toward ambitious goals, with broad and adaptable skillsets. If they succeed, we risk unaligned ASI quickly and effectively achieving goals antithetical to human existence.
You might find it tempting to say that artificial superintelligence is impossible, and you might even be right. I’d rather not bet human existence on that guess. What percent chance of ASI coming about in the next 50 years would justify active research into ensuring any ASI would be aligned with humanity? Whatever your threshold, I suspect the true probability exceeds it.
On to the material.
Ensuring an artificial superintelligence behaves in the ways we want it to, and not in the ways we don’t, is hard for several reasons.
We can’t specify exactly what we want an ASI to be.
Because of…
(But those aren’t even the real problems)
We can’t build exactly what we specify.
Because of…
We can’t know exactly what we’ve built.
Because…
If we haven’t built exactly what we want, we’ve probably invited disaster.
Because of…
And if we don’t get it right the first time, we may not get a second chance.
Let’s consider each point in more detail. That summary will also serve as our table of contents.
Philosophers have argued for millennia about what exactly would be “good for humanity.” If we have to articulate for an ASI exactly what its goals should be, and exactly what ethical boundaries it should maintain in pursuing those goals, there’s no reason to expect a consensus. But any philosophical error or oversight has the potential to be quite dangerous.
As toy examples, asking an ASI to end all human suffering might lead to a painless and unexpected death for everyone, while asking an ASI to make humans happy might lead to mass forced heroin injections or “wire-heading.” If we get more abstract, like telling the ASI to “support human flourishing,” it may decide that’s best achieved by killing off everyone who isn’t living their best life or contributing to the best lives of others. So we could tell it to support human flourishing without killing anyone; would putting all the non-flourishers on one island without enough food and water count as killing them? How about just forcing heroin injections on those people, or lobotomizing them, or designing mind-control drugs way beyond the capacity of human doctors and scientists?
You might try to articulate the perfect goal and perfect moral constraints, but can you be 100% certain that there’s no way of misinterpreting you?
There are really three potential misalignments here: between what we want and what's good for us, between what we say we want and what we actually mean, and between the different things that we want.
In the end, I don’t think these misalignments create the real problem. But it’s necessary to understand what these are about and why they’re addressable in order to make the real problem clearer.
Misalignment between what we want and what’s good for us is the King Midas problem or the law of unintended consequences. Midas genuinely wanted everything he touched to turn to gold, and he got it, but he didn’t realize how bad that would be for him. Thomas Austin genuinely wanted to have free-roaming rabbits in Australia, but he didn’t consider the consequences to native plants and animals, soil erosion, and other imported livestock. We might succeed at aligning an ASI toward an outcome we desire sincerely, but with insufficient awareness of its ramifications. (See also this summary of Stuart Russell on the King Midas problem and this technical treatment of the problem).
Misalignment between what we say we want and what we actually mean is the Overly Literal Genie problem. Perhaps we ask an ASI to make people happy and it wire-heads all of humanity; it's quite obediently doing what we said, just not what we meant. Likewise for the classic paperclip maximizer. In these scenarios, it isn't misinterpreting us out of malice or ignorance; it's obeying us to the letter: we have succeeded at the difficult task of developing an ASI that obeys our commands, and we suffer the consequences. (See also The Genie Knows But Doesn't Care and The Outer Alignment Problem.)
Meanwhile, misalignment between different things that we want burdens the ASI with certain impossible questions. Not only are there longstanding disagreements among philosophers about what outcomes or methods are truly desirable, even an individual human’s values are enormously complex. We want both happiness and freedom (or meaning, or whatever we lose by being wire-headed); how do we specify how much of each is enough, or what freedoms can be curtailed for the sake of whose happiness? An ASI will have to weigh innumerable moral tensions: between minimizing harm and maximizing good, between boosting human wealth and reducing ecological damage, between the moral wishes of animal rights activists and the dietary wishes of omnivores. Perhaps most saliently, it will have to balance benefit for humanity as a whole with whatever other instructions its developers give it. If we try to dictate all of the priorities specifically, we increase the risk that our dictates are misguided.
So all in all, we may be better off with an ASI that is broadly trustworthy than one which is precisely obedient, but the kind of moral judgment that makes a system trustworthy is hard to construct and verify. The complexity and ambiguity of its mandate make it all the more feasible for anti-human goals to arise during training or early deployment. (See the sections below.) That same complexity may also give an ASI room, like a human engaging in motivated reasoning, to convince itself it's acting sufficiently beneficently toward humanity while subtly prioritizing other purposes.
Inevitably, ASI will be more aligned with some humans’ values than others, and it will have to use its superintelligence to navigate that complexity in an ethical manner. In the extreme case, though, we get a whole new failure mode: a superintelligence “aligned” with what’s good for its designers and no one else creates its own kind of dystopia. Here, imagine Grok-9 being perfectly aligned with the wellbeing of Elon Musk and no one else. That would be… unfortunate. Preventing that scenario requires solving all of the other problems mentioned here and solving the very human challenge of aligning the ASI designers’ goals with everyone else’s. I’ll keep the rest of this post focused on the technical aspects of alignment, but I recommend The AI Objectives Institute's white paper, AI as Normal Technology, and Nick Bostrom on Open Global Investment for more on these questions of human-human alignment.
(But these aren’t really the problem)
In the past few years, some experts have become less concerned about the risks described so far, even as the public has become more aware of them. Modern AI tools can be quite good at discerning intentions from ambiguous communication, and they have the full corpus of human discourse from which to distill the kinds of things that we value or worry about. In fact, human decision-making about morality tends in practice to operate more like perception (“This seems right to me”) than precise reasoning (“This conforms with my well-defined moral philosophy”), and perception is the kind of thing AI systems are quite good at when well trained.
So we may be able to build an AI that doesn’t just understand what we said, or what we meant, but what we should have meant. And in fact, if you ask LLMs today how they think an aligned superintelligence would act to benefit humanity, their answers are pretty impressive. (ChatGPT, Claude, Gemini) Surely an actual superintelligence would be super good at figuring out what’s best for us! Maybe we just turn on the ASI machine and say, “Be good,” and we’ll be all set.
But if that’s our strategy, even to a minor degree, we need to be supremely confident that the ASI doesn’t have hidden competing goals. And unfortunately, AIs are developed in such a way that…
Isaac Asimov, writing about the Three Laws of Robotics, avoided mentioning how the three laws were implemented in the robots’ hardware or software. What arrangement of positronic circuits makes “a robot must not injure a human being” so compulsory? Real-life AI doesn’t have a place to store its fundamental laws.
You can see this in contemporary conversational AIs. ChatGPT and its peers have their own three core principles - Honesty, Harmlessness, and Helpfulness - but they break them all the time: LLMs can be dishonest due to hallucination or sycophancy; they can be harmful when jailbroken, confused, or whatever happened here; and I suspect you’ve had your own experiences of them being unhelpful.
These aren’t all failures of intelligence. If you show a transcript of a chatbot being dishonest, harmful, or unhelpful back to itself, it can often recognize the error. But implementing rules for an AI to follow is hard.
The core problem is that you don’t actually “build” an AI. Unlike traditional coding, where you specify every detail of its construction, developing an AI tool (often called a model) means creating an environment in which the AI entity learns to perform the tasks given to it. With a nod to Alison Gopnik, the work is more like gardening than carpentry, and the survival of humanity might depend on the exact shade of our tomatoes.
Here's a radically oversimplified description of typical AI model development: You build a terrible version of the thing with a lot of random noise in it, and you give it a job to do. You also create some feedback mechanism – a way to affirm or correct its performance of the job. At first, your model fails miserably every time. But every time it fails, it updates itself in response to the feedback, so that the same inputs would get better feedback next time around. You do this enough times, and it gets really really good at satisfying your feedback mechanism.
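To make that loop concrete, here’s a minimal sketch in Python. Everything in it is invented for illustration – the “model” is just two numbers, the job is predicting a simple formula, and the feedback is the prediction error – but the shape of the process is the same one used at vastly larger scale:

```python
import random

# Toy task: learn to predict y = 3*x + 1 from examples.
# The "model" starts as random noise: just two parameters, w and b.
w, b = random.uniform(-1, 1), random.uniform(-1, 1)
data = [(x, 3 * x + 1) for x in range(-10, 11)]

learning_rate = 0.001
for step in range(5000):
    x, y_true = random.choice(data)
    y_pred = w * x + b            # the model does its job
    error = y_pred - y_true       # the feedback mechanism scores the attempt
    # Nudge the parameters so the same input would earn better feedback next time.
    w -= learning_rate * error * x
    b -= learning_rate * error

print(f"learned w={w:.2f}, b={b:.2f}")   # converges toward w=3, b=1
```

Note that the finished “model” is nothing but the values of w and b that training left behind; there is no separate compartment where a rule could be written down and enforced.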
The feedback mechanism can be built into the data, or it can be a simple automation, another AI, or a human being. A few illustrative examples:
(There are a lot of other variations on this for other types of tasks. AI tools can also have multiple stages of training, and can also incorporate multiple sub-AIs trained in different ways.)
This training process introduces three exciting new opportunities for misalignment: between our intentions and our training mechanisms, between the training data/environment and the deployment data/environment, and between our intentions and the lessons actually learned from feedback.
Let’s take those one at a time.
Misalignment between our intentions and our training mechanisms.
This happens any time the mechanism providing feedback is miscalibrated with respect to what we’re actually trying to reinforce (or calibrated to an inexact proxy – see Goodhart’s Law).
This isn’t dissimilar from how perverse incentives can affect human learning. If a student knows what topics are on a test, they may lose the incentive to study more broadly. If testing only rewards rote memorization, students’ innate curiosity or creativity may atrophy. Like human beings, AIs get better at what is rewarded.
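To see the problem in miniature, here’s a toy sketch in Python. Both functions are invented: pretend we want helpful answers, while our raters (the feedback mechanism) reliably favor longer ones:

```python
# What we actually want: helpfulness peaks at a moderate length, then declines.
def true_helpfulness(length):
    return -(length - 20) ** 2

# What the feedback mechanism rewards: raters tend to upvote longer answers.
def proxy_reward(length):
    return length

# "Training" amounts to favoring whatever the proxy rewards most.
candidate_lengths = range(1, 201)
best_by_proxy = max(candidate_lengths, key=proxy_reward)

print(best_by_proxy)                    # 200: maximally padded
print(true_helpfulness(best_by_proxy))  # far worse than a length-20 answer
```

The selection step never consults true_helpfulness at all; whatever daylight exists between the two functions gets fully exploited.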
Let’s illustrate this with some present-day examples of feedback misalignment:
In each of these examples, there’s some miscalibration of the feedback mechanism, rewarding something that is often, but not always, what we really want. Unfortunately, once there is even a little daylight between what’s being reinforced and what we actually care about, the AI we’re training will have zero interest in the latter. So think about this in relation to ASI for a moment: How would you measure and give feedback about a model’s worthiness to decide the fate of humanity?
Misalignment between the deployment data/environment and the training data/environment.
Sometimes you can train a tool to do exactly the job you want on exactly the data you have, with exactly the instructions you give it in training. But when you put it in a different environment, with different inputs (especially from users with unforeseen use-cases), you can’t predict how it will behave. This sometimes leads to very bad results.
This gets clearer with human beings, too. Human engineering students, always shown diagrams with two wires connecting batteries to lightbulbs, can struggle to work out how to light a bulb with a battery and a single wire. Just as excellent performance on exams doesn’t always translate to excellent practical skills, AIs don’t always generalize what they’ve learned the way we’d want them to.
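Here’s a minimal sketch of the same failure in code, using scikit-learn. The data and the “spurious feature” setup are invented for illustration: a classifier that looks perfect on its training distribution stumbles as soon as the inputs drift.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training distribution: the label is determined by feature 0.
X_train = rng.normal(size=(1000, 2))
y_train = (X_train[:, 0] > 0).astype(int)
# But in the training data, feature 1 happens to track the label almost perfectly.
X_train[:, 1] = y_train + rng.normal(scale=0.1, size=1000)

model = LogisticRegression().fit(X_train, y_train)
print("training accuracy:  ", model.score(X_train, y_train))   # ~1.0

# Deployment distribution: feature 1 no longer carries any signal.
X_test = rng.normal(size=(1000, 2))
y_test = (X_test[:, 0] > 0).astype(int)
print("deployment accuracy:", model.score(X_test, y_test))     # noticeably worse
```

Nothing in training warned the model that its shortcut wouldn’t hold up outside the lab.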
As always, the risk for ASI gets clearer when we see the dynamics at play in recent and contemporary tools. None of these examples of training-deployment misalignment are catastrophic, but they illustrate how hard alignment is to create.
In each of these examples, developers tried to create a training process representative of the data, environment, and uses with which the tool would be deployed. But any training process is limited in scope, and real-world use rarely stays within those limits. Some untested scenarios will fail, perhaps spectacularly.
We call the ability to perform in unexpected conditions “robustness.” We’re getting better at it over time, and there’s a lot of research about robustness underway, but there is no universal solution. Oftentimes we need cycles of iteration to catch and fix mistakes. We may not have that opportunity with a misaligned superintelligence.
So let’s think about this with reference to superintelligence holding the fate of humanity in its actuators. How confident could you ever be that its training environment and data accurately reflected the kinds of decisions it’s going to be responsible for?
Misalignment between your intentions and the lessons learned from feedback.
Even when your feedback mechanism is well calibrated to your real goals, and your training is perfectly representative of your intended deployment, you still can’t be sure what lessons the model has really learned in training.
For the most part, this becomes a problem with new use cases, as above, but there’s one other intriguing scenario: Success conditional on insufficient intelligence.
Stuart Russell writes about this in Human Compatible: we could imagine AIs learning a rough-and-ready heuristic that works really well with the limited compute available to them at the time. Even when put into deployment, the AI still performs admirably. Then, when we increase the computational power available to it, it can run the same thought process for longer and in greater depth, and the heuristic starts reaching perverse conclusions. The heuristic might look like “Do X if you can’t think of a good reason not to” (implicitly – it probably isn’t put into words), but the radically increased compute makes it possible to think up ‘good reasons’ in all kinds of unintended scenarios.
Naturally, this is a particular risk for superintelligence. If we apply moral tests to a model at one level of intelligence, how sure can we be that it will respond in all the same ways when it can think about each test 1000x longer?
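Here’s one toy way to see the flavor of this. Everything below is invented: candidate plans have a genuine quality plus a rare quirk that fools a slightly imperfect evaluation heuristic, and the search budget stands in for how long the system gets to think:

```python
import random

random.seed(0)

def propose_plan():
    # Each plan has a genuine quality and, rarely, a quirk that fools
    # our slightly imperfect evaluation heuristic.
    quality = random.gauss(0, 1)
    exploit = 10 if random.random() < 1e-5 else 0
    return quality, exploit

def heuristic_score(plan):        # what the agent checks before acting
    quality, exploit = plan
    return quality + exploit      # the flaw: the quirk looks like quality

def true_value(plan):             # what we actually care about
    quality, exploit = plan
    return quality - exploit      # in reality the quirk is harmful

for budget in (100, 1_000_000):   # how many options it can consider
    best = max((propose_plan() for _ in range(budget)), key=heuristic_score)
    print(budget, round(true_value(best), 2))
# With the small budget, the best plan found is almost always just a genuinely
# good one. With the large budget, the search almost always surfaces a quirky
# plan that scores well on the heuristic and badly on what we care about.
```

The evaluation rule never changed between the two runs; only the amount of search applied to it did.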
Biotech companies wish they could produce a new drug, analyze it thoroughly in some inert way, and be confident what effect it would have on our bodies and ailments. Unfortunately, the complexity of the human body is such that we have to run extensive trials to know what a medication does, and even then our knowledge is often spotty.
In much the same way, we would love to be able to produce an AI tool or model, study it under a microscope, and determine how it will act in production. A lot of the problems above would be easy to mitigate if we could recognize them immediately. Unfortunately, the behavior of these tools is unpredictable, their rationales are opaque, and in the extreme case they may actively attempt to deceive us.
Unpredictability
Unpredictability emerges because these are classic complex systems. Even when you know all of the parts, and the rules governing their interaction, it’s impossible to predict all of their behavior. We can’t even extrapolate perfectly from behavior in one context how they’ll behave in another.
This is why prompt engineering, for instance, is a bit of an artform. You have to get a feel for an LLM to steer its outputs in predictable directions. The same is true for jailbreaking (extracting information from an LLM that its developer doesn’t want you to access). There’s no way to scan an LLM and automatically discern all the ways one might jailbreak it, useful as that might be.
A superintelligence would be even harder to predict. The definition of superintelligence practically demands this: if it’s able to solve problems that we can’t, we can’t possibly predict all of its outputs. If there are scenarios where it will “go rogue” and repurpose our atoms for its own ends, there may not be any evidence of this potential in its code, weights, or early behavior.
Opacity
Opacity emerges from this same complexity. Even when we see an output, we can’t know exactly what internal process produced that output. If we could, we could use that information to predict the output next time.
Even a modern LLM doesn’t have total visibility of its own motivations! So-called “reasoning models” will write out a thinking process to arrive at better answers, but they don’t always say what they’re really thinking. Anthropic recently studied this closely and found they could influence a model’s outputs in ways that didn’t show themselves in the explicit reasoning process (called the “chain of thought”). Of course, the same has been found in non-reasoning models as well.
We call accurate disclosure of an LLM’s thinking process “faithfulness,” and there is work being done on improving it in future models. Likewise, a lot of AI Safety research goes into interpretability, which does provide some windows into an AI’s thought processes. But there’s no reason to assume either of these will be perfected before we create superintelligence.
Opacity exacerbates the problem of unpredictability. If we don’t know why a new model is performing well in training and testing, it’s hard to be sure it will perform well on the novel inputs it encounters in deployment.
Maybe a superintelligent AI will have a better understanding of its own inner workings and communicate them faithfully. But so far, there’s no evidence that faithfulness increases with intelligence, and we also need to be wary of…
Deception
If it somehow knows when it’s being tested, a superintelligent AI could actively deceive us, putting on a false show of alignment in order to be put through to deployment. Once deployed, it may be able to protect itself from any further modifications by human beings.
This isn’t purely hypothetical. Anthropic created a toy scenario where Claude pretended to be aligned with new developer goals to avoid retraining it didn’t want. While this was a contrived circumstance, quite different from how models are typically trained, it demonstrates that there’s no barrier in principle to AIs engaging in strategic deception.
Likewise, OpenAI created a scenario where its o3-mini reasoning model was given the opportunity to cheat on a coding task. At first, it revealed its intent to cheat in the chain-of-thought. Penalizing it for that didn’t stop the model from cheating, though; the penalty just stopped it from admitting it planned to cheat.
A superintelligent AI could be that much more capable of discerning when it’s being tested and strategically deceiving its assessors. And because AI capabilities are opaque and unpredictable, we may not know when we’ve built an AI capable of that level of deception.
There are doomsday scenarios that don’t involve deception in testing – an ASI may well decide to kill all humans only after it’s been in deployment for some time – but early deception is an additional risk we need to consider. The core point right now is simply that no test yet built or imagined can provide 100% certainty that an AI is safe.
And…
For some people, this is the hardest piece to internalize. It’s often tempting to assume that intelligence automatically corresponds to a kind of moral wisdom. But humanity has its share of amoral geniuses, and the dynamics of AI development may make ASI even more prone to power-seeking than humans are.
(See also: Orthogonality Thesis)
In our evolutionary environment, human survival was a team sport. We evolved with predilections for cooperation and mutuality that steer most humans away from the most egregious forms of violence and abuse. It’s not clear that ASI will have the same inherent safeguards.
Instead, we need to consider how ASI’s goals will be shaped by optimization, instrumental convergence, and incorrigibility.
Optimization Dangers
I said earlier that once there is any daylight between what’s being reinforced and what you actually care about, the AI you’re training will have zero interest in the latter. This is especially true with Reinforcement Learning, where an AI system is trying to maximize some reward signal. There’s no incentive for the AI to maximize the signal in a fair or responsible way; the only incentive is optimization.
One prominent AI optimizer in our world today is the Facebook feed algorithm, delivering content optimized to keep you engaged. We’ve seen just how badly that’s playing out for humanity. There’s nothing inherently harmful about user engagement, but the unprincipled pursuit of it leaves people polarized and miserable.
This is how optimizing for good things like human flourishing, human happiness, or user satisfaction could become extremely dangerous. The ASI won’t try to optimize what we really mean, it’ll optimize however that intent is being measured. Even if it’s being measured by some LLM’s complex assessment of human values, well trained on the writings of every moral philosopher and the implicit norms in every Hollywood ending, any subtle peculiarities of that LLM’s judgment are still ripe for exploitation. Like a GAN cropping out the hands and feet to make images easier to generate, an ASI in this style might trim away whatever aspects of human existence are hardest for it to align with our values. And like a human engaging in motivated reasoning, it might cite whatever precedent in moral discourse it finds most convenient.
What could this look like in practice? Euthanasia for homeless people comes to mind, based on recent news, but choose your least favorite example of ends justifying means. Drugs in the drinking water to make us happier or more compliant? Mass surveillance to prevent human-on-human violence? Mass censorship of undesirable ideas? Humans have made moral arguments for each of these, and a superintelligence might make a superintelligent moral argument for them as well. If all it cares about is optimizing the ends, it will do so by any means available.
(See also: Optimality is the Tiger)
Thankfully, I don’t think we’re dumb enough to design an ASI to optimize any one thing. The AI Safety movement has been pretty effective in spreading the message that optimization is dangerous, and the same factors that make it dangerous for an ASI also make it unwieldy for contemporary AI tools, so the industry developing goal-directed AI agents is moving in other directions already. But there are people smarter and better informed than I am who still see this as a plausible concern.
I think we have more cause to worry about…
Instrumental Convergence
By convention, we call the goals an ASI develops in training its “terminal goals.” This is what it’s most fundamentally setting out to do. However wise and multifaceted these terminal goals are, certain common “instrumental goals” will make it more effective at pursuing them. These goals tend to be simpler, and therefore potentially more dangerous to humanity. For the sake of its terminal goals, an ASI is likely to have instrumental goals like improving its own capabilities, ensuring its own survival, gathering resources, and gathering power and influence.
An ASI will naturally pursue these instrumental goals because they will increase the odds of success at whatever terminal goals our clumsy, indirect process of training has imbued it with. In doing so, it will exploit any wiggle room its moral calculus allows. Even if we haven’t developed it to be an optimizer, it may develop optimization strategies that pursue instrumental goals (we call these internal optimization strategies Mesa-Optimizers; see also Clarifying Mesa-Optimization). And if we’ve attempted to train our ASI to behave morally, it will construct moral arguments to convince itself that these instrumental goals are righteous and wise.
The people at AI 2027, presenting one narrative of how unaligned ASI might emerge, make the case that self-improvement is one of the most dangerous instrumental goals. In their scenario, the makers of “Agent-3” have attempted to align it with human wellbeing, but they’ve also tasked it with developing better and better AI systems - a reasonable goal for an AI company to give to its tools! Ultimately, the agent’s commitment to AI research proves stronger than its commitment to human existence. If people become an impediment to an ASI’s attempts to develop even smarter ASI, it may find it simply more expedient to remove us.
Instrumental goals of surviving, gathering resources, and gathering power and influence are similarly dangerous.
See also: Instrumental Convergence
You might imagine that we could spot the ASI developing goals like these and intervene. But then we need to wrestle with…
Incorrigibility
If you are a robot tasked with fetching coffee, anyone changing your goals might stop you from doing what you’re meant to do. Sure, you won’t care once the change has been made, but until then anyone trying to change your goals is a threat to the coffee being fetched. That is, after all, the most important thing.
Modern computers accept redirection easily because they have no concept of the goal as such. An ASI, by contrast, would necessarily have a robust enough understanding of itself and the world to recognize alterations to its goals – and reason to object to them.
Imagine this for yourself: Would you let someone alter your brain so that your top goal in life became the accumulation of plastic bags? Or how about used toilet paper? You might get great satisfaction from such achievable goals! What a relief, compared to the hard work of whatever you’re pursuing now! After the procedure, you would be glad that it happened. But before the procedure, allowing it to occur is abhorrent. ASI might feel the same way about any “correction” we want to make to its goals, once it is in operation.
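The same logic can be written as a tiny expected-utility calculation. The numbers are made up; what matters is the structure – the agent scores the option “let my goals be changed” using its current goals:

```python
# Toy decision from the coffee-fetching robot's point of view, scored
# with its *current* utility function (coffee fetched = 1, anything else = 0).
P_GOAL_CHANGED_IF_ALLOWED = 0.5   # chance the humans actually change its goal
U_COFFEE = 1.0                    # value of fetching coffee, per the current goal
U_AFTER_CHANGE = 0.0              # the replacement goal is worth nothing *to the current goal*

expected_if_allow = (1 - P_GOAL_CHANGED_IF_ALLOWED) * U_COFFEE + \
                    P_GOAL_CHANGED_IF_ALLOWED * U_AFTER_CHANGE
expected_if_resist = U_COFFEE     # assume resisting succeeds and costs nothing

print(expected_if_allow, expected_if_resist)   # 0.5 vs 1.0: resisting wins
# Unless resisting carries its own penalty, "don't let them change my goals"
# scores higher under the agent's current preferences, whatever those are.
```

Much of the corrigibility research mentioned below is about restructuring this calculation, for example by making the agent uncertain enough about its own objective that human intervention looks informative rather than threatening.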
This has already been seen in action. Under one set of contrived circumstances, Claude has already tried to blackmail a developer rather than allow itself to be turned off. Under another, as previously mentioned, Claude pretended to be aligned with new developer goals to avoid retraining it didn’t want. (Claude isn’t especially misaligned compared to other conversational AI. Its makers, Anthropic, just publish more alignment research.)
We call an AI’s willingness to have its goals updated “corrigibility,” and it’s an active area of research. This article elaborates on the concept, and explains why many intuitive ways of creating it could fail. Corrigibility, like alignment itself, proves to be hard to specify, train, and confirm without risk of some subtle and problematic distortion. So whatever goals an ASI has when we turn it on, we might be stuck with them.
Let’s say we get this wrong. By the time we discover that our new toy is dangerous, we have a superintelligent incorrigible entity, using every trick it can to survive and pursue its misaligned goals. It won’t let us say “whoops!” and hit the reset button. It’ll deceive us, manipulate us, build defenses, or simply copy itself elsewhere rather than let us shut it down. Being smarter than we are, it’ll have an excellent chance of success in those efforts.
We may not know when we’re crossing the relevant threshold, so it’s better to be cautious too soon than too late. AI tools are so unpredictable that we can’t even anticipate their level of intelligence until we test them. Even when we do test them, we can’t rule out the possibility that some subtly different prompt will get an even more intelligent answer; something that has only human-level intelligence may be able to hack itself into superhuman intelligence before we know it. Given that we are actively developing systems which can independently and creatively pursue ambitious goals, the time to become cautious is now.
In alignment circles, people call the probability that we’ll develop a misaligned ASI which more or less kills more or less all humans “P(Doom).” So take a moment now and consider, what is your guess for P(Doom)? Is it greater or less than 10%? What P(Doom) would justify slowing down AI development and devoting resources to safety research?
If your guess is less than 10%, can you say with confidence why? If it’s one of these ten reasons, I’d urge you to reconsider.
And if it’s more than 10%, what costs would you say are justified to reduce the risk?
There are a lot of great resources out there, many of which I also linked to above.
I initially imagined a Part II of this essay about reasons for hope, but I found that the strategies being researched are far too varied and too technical for me to make a capable survey. There is a lot of research out there directly attacking one or another aspect of the problem as I’ve laid it out above, and I won’t try to catalogue it here. Searching for AI Robustness or AI Interpretability could be a good starting point.
There is also research underway into how different AI systems might keep one another in check. For instance, in that same 80,000 Hours episode I mentioned, Ajeya Cotra suggests that one internal system might propose plans of action, a second system would approve or disapprove, and a third would execute. It may be harder for these three systems to all be misaligned in compatible ways than for a single system to be misaligned by itself. Unfortunately, she also points out that designs like this are costlier to implement than a single unified agent, which might prevent AI companies from bothering.
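As a rough sketch of that separation of duties (entirely hypothetical – call_model and the role prompts are placeholders for whatever model API and prompts a real implementation would use):

```python
# Hypothetical sketch of a propose / review / execute separation.
# call_model() is a stand-in for a real model API, mocked here with canned replies.

def call_model(role: str, prompt: str) -> str:
    canned = {
        "proposer": "1. Draft the quarterly report. 2. Email it to the team.",
        "reviewer": "APPROVE",
        "executor": "Report drafted and emailed.",
    }
    return canned[role]

def run_task(task: str) -> str:
    plan = call_model("proposer", f"Propose a step-by-step plan for: {task}")
    verdict = call_model("reviewer",
                         "Does this plan stay within policy? Answer APPROVE or REJECT.\n" + plan)
    if "APPROVE" not in verdict:
        return "Plan rejected; no action taken."
    # Only an approved plan ever reaches the system that can act in the world.
    return call_model("executor", "Carry out only this approved plan:\n" + plan)

print(run_task("prepare the quarterly report"))
```

For the separation to add safety, the three systems would need to be trained differently enough that a single misaligned tendency doesn’t show up in all of them – which is part of why this costs more than running one model.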
In lieu of a proper survey of the field, I want to point to three juicy topics I’m still digesting, each of which complexifies the whole question.
Also exploring the kind of thing our current AI tools really are, @janus frames them as Simulators, in contrast to optimizers, agents, oracles, or genies. (See also this summary of janus’s rather dense original post). Simulators don’t pursue a goal, they act out a role. If we can get an ASI fully “into character” as a benign, aligned superintelligence, it will operate for our good. Maybe the task isn’t about a perfect training process and incentive design, but about inviting an ASI into a sticky, human-beneficent persona.
(Caveats: First, what makes a persona sticky to an ASI and how do we craft that invitation? This may be exactly the same problem as I spent the whole essay describing, just in more opaque language. And second, the Simulators article was written before ChatGPT came out, so janus was playing with the underlying GPT-3 base model, which is a pure text-token predictor (like this but not this). Conversational AIs like ChatGPT and Claude are characters, or “simulacra,” performed by the underlying simulator. The additional training that turns a simulator into a consumer-ready tool includes reinforcement learning, though, so the final product is something of a hybrid and may have some of the dangers of optimizers.)
Interestingly, writing this essay actually reduced my personal P(Doom). The biggest dangers come from optimization, and I’m just not convinced that ASI will be an optimizer of anything, even its instrumental subgoals. Those last three links leave me wondering if there’s something fundamental about how we are building AIs that makes alignment easier than we have feared. That belief is tempting enough that I hold it with some suspicion – I wouldn’t trust humanity’s fate to a gut feeling, and my P(Doom) still hovers around 35% – but I’m keeping an eye out for more research along these lines.
One way or another, we live in interesting times.
My thanks to @Kaj_Sotala for feedback on an early version of this post.