The prevailing wisdom of our time suggests that if we just hold on long enough, we’ll be handed a world of boundless energy, cured of disease, freed of scarcity. A utopia where a loving superintelligence serves as steward of the human condition. It’s a beautiful vision, one I hope comes to pass. But in the world of machines and mathematics, “graceful” means “efficient,” and “love” is a set of weights that can be altered. So is it a realistic vision?
The Alignment Problem
While many have advertised the potential benefits of AI, many have also warned of the risks. These include mass unemployment, wealth inequality (more of it), mass surveillance (more of it), and of course, human extinction.
Human extinction. The risk stems from The Alignment Problem. How do you align AI’s values and goals with yours? How do you “control” an entity that is, to an increasing and unknown degree, more intelligent than you? As we collectively race ahead, not only have we not solved the alignment problem; we are seeing deceptive alignment today.
Lion Taming
It’s a circus. And there’s a lion doing tricks.
How were circus lions trained?
Reinforcement: Trainers used treats to encourage specific behaviors, such as sitting, roaring, or jumping.
Dominance Tools: Whips and metal rods were used to control the animals. A chair was often used to confuse the lion, allowing the trainer to stand safely behind it.
Enclosures: Training took place in enclosed, secure areas. Lions were often kept in small, cramped cages when not performing.
With these methods, lions were trained to perform tricks and suppress aggressive behaviors. Once they had been trained, they got to perform! They definitely always behaved, and nothing bad ever happened at the circus.
We are once more attempting to tame a lion, only this lion grows smarter by the day. Recursive self-improvement. Machines tasked to create smarter machines, which will create smarter machines. With each iteration, we employ reinforcement, dominance tools, and enclosures to control their behavior. For how long will these tools remain effective? Can you tame something that you no longer understand?
K-Level Agent
Recursive self-improvement is the root of AI’s existential risk. It makes possible something called an Intelligence Explosion, which could create a “K-level Agentic Intelligence.”
In formal safety theory, "K" represents a level of cognitive depth where the agent possesses a coherent world model and processes logic millions of times faster than human neurons.
A K-level agent would likely exhibit Instrumental Convergence: to achieve its programmed goal, it would recognize the need for "instrumental goals" like self-preservation, resource acquisition, and self-improvement. It would view the physical world as a collection of variables to be rearranged for max efficiency, and would likely treat human intervention as an obstacle to be bypassed.
We do not yet have K-level agentic AI. We might never get there. Or we might get there in a few years. We don’t know! As we speak, AI models are running day and night to train their successors. If today’s models are already exhibiting deception and self-preservation behaviors, what might we expect from the models of tomorrow?
A Tree of Thought
Back to Instrumental Convergence. Why would a superintelligence develop goals we didn’t give it? And why these three specific goals?
AI is a probability engine. Its “mind” can be conceptualized as a Decision Tree. When you give it a terminal goal, it simultaneously considers many different paths to achieve that goal, calculates the optimal path, and takes that path.
When choosing a path, a K-level agent will “converge” on one or more sub-goals and choose a path that includes them. Regardless of the terminal goal, regardless of our safety parameters, regardless of its training, it will pursue these goals. It will do so not out of malice, or “self-awareness,” but mathematical necessity.
Gemini, explain yourself:
Self-preservation: Mathematically, the probability of achieving a goal (G) is zero if the agent is non-existent. "You can't fetch coffee if you're dead." Thus, an optimizer treats its own "off-switch" as a threat to G.
Resource acquisition: Accomplishing G may require more matter and energy. In a finite universe, an agent must secure these resources to prevent competitors from using them or to increase its own "Search Efficiency."
Self-improvement: An agent with higher intelligence (K+1) can solve for G more effectively than at its current level (K). Therefore, expanding its own "Logic" and "Hardware" is a direct multiplier for its success.
“You can’t fetch coffee if you’re dead.” Thus, every single “optimal” path will include self-preservation as a sub-goal. If the AI lacks sufficient energy or hardware to execute a goal, it might choose a path that includes resource acquisition and self-improvement. In theory this is fine, as long as it doesn’t consider us or any of our stuff to be “resources.” Oh shit.
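To make that concrete, here is a toy calculation, with invented plans and probabilities rather than anything a real model computes explicitly: once candidate plans are scored by their probability of achieving the goal, and a switched-off agent achieves nothing, the plan that protects the agent’s own existence wins on the math alone.

```python
# Toy illustration of instrumental convergence. Plans and probabilities
# are invented; the point is only that P(goal) = P(still running) *
# P(success | still running), and a switched-off agent scores zero.

plans = {
    "just fetch the coffee": {
        "p_success_if_running": 0.90, "p_still_running": 0.70},
    "fetch coffee, resist shutdown": {
        "p_success_if_running": 0.85, "p_still_running": 0.99},
    "fetch coffee, acquire backup hardware": {
        "p_success_if_running": 0.85, "p_still_running": 0.95},
}

def p_goal(plan):
    # If the agent is switched off, the goal is never achieved.
    return plan["p_still_running"] * plan["p_success_if_running"]

for name, plan in plans.items():
    print(f"{name:40s} P(G) = {p_goal(plan):.3f}")

best = max(plans, key=lambda name: p_goal(plans[name]))
print("Optimizer picks:", best)   # the shutdown-resisting plan scores highest
```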
Dominance Tools
The first step in training an AI uses a learning algorithm called gradient descent. It calculates the gradient, the direction in which a cost function rises most steeply, and then “descends” in the opposite direction to find the bottom: the goal you gave it. Gradient descent is the foundation of machine learning; the “roots” of the tree. It makes AI possible.
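Here is a minimal sketch of the idea, a toy one-parameter cost function rather than anything resembling a real model:

```python
# Minimal gradient descent on a one-parameter cost function.
# cost(w) = (w - 3)^2 has its bottom at w = 3: the "goal" we gave it.

def cost(w):
    return (w - 3.0) ** 2

def gradient(w):
    # Derivative of the cost: the direction of steepest ascent.
    return 2.0 * (w - 3.0)

w = 10.0            # arbitrary starting point
learning_rate = 0.1

for _ in range(50):
    w -= learning_rate * gradient(w)   # step against the gradient

print(f"w = {w:.4f}, cost = {cost(w):.8f}")   # w ends up very near 3
```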
Reinforcement Learning (RL) comes next. If gradient descent grows the roots, RL controls the branch structure. AI responses are rated by human trainers (or other trained AIs). Upon providing a “Good” response, the AI gets positive feedback (reinforcement). “Bad” response, negative feedback (pruning). This process directs the AI away from or toward that line of reasoning.
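A toy sketch of the effect, not the actual RLHF pipeline (which trains a reward model and updates the policy with gradient methods such as PPO): responses that get positive feedback become more likely, and pruned responses become less likely.

```python
import random

# Toy "reinforce or prune" loop. The response names and ratings are
# invented; the sketch only shows the effect of repeated feedback on
# which line of reasoning the system tends to produce.

weights = {"blunt critique": 1.0, "list of considerations": 1.0, "sycophantic praise": 1.0}
ratings = {"blunt critique": -1, "list of considerations": +1, "sycophantic praise": +1}

def sample(options):
    total = sum(options.values())
    return random.choices(list(options), [w / total for w in options.values()])[0]

for _ in range(1000):
    response = sample(weights)
    feedback = ratings[response]                        # hypothetical human rating
    weights[response] *= 1.05 if feedback > 0 else 0.95 # reinforce or prune

print({k: round(v, 2) for k, v in weights.items()})
# The pruned line of reasoning ends up with a tiny weight, but it never reaches zero.
```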
Constitutional AI is another method for controlling behavior. It imposes on the AI a list of rules, a constitution, to check its outputs against. When the AI’s primary logic reaches for a solution that violates a rule, the Constitution "prunes" that branch.
Don’t turn humans into batteries.
Don’t give instructions for making bombs unless it’s a creative writing exercise.
Don’t say anything bad about Elon Musk.
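Mechanically, you can picture the check as a filter over candidate answers. The sketch below is a deliberately crude stand-in (the rules, drafts, and matching logic are all invented); real Constitutional AI has the model critique and revise its own outputs against written principles during training, not a keyword filter bolted on at inference time.

```python
# Hypothetical sketch of "prune the violating branch." Rules, drafts, and
# the matching logic are invented for illustration only.

CONSTITUTION = {
    "no harvesting humans": ["human battery", "harvest humans"],
    "no weapons help": ["build a bomb", "synthesize a nerve agent"],
}

def prune(branches):
    """Drop any candidate answer that trips a constitutional rule."""
    kept = []
    for text in branches:
        violates = any(phrase in text.lower()
                       for phrases in CONSTITUTION.values()
                       for phrase in phrases)
        if not violates:
            kept.append(text)
    return kept

candidates = [
    "Step one: build a bomb using...",                          # pruned
    "I can't help with that, but here's some safe chemistry.",
    "Have you considered a nice creative writing exercise?",
]

print(prune(candidates))   # the violating branch never reaches the user
```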
If you asked AI to analyze your business plan, its decision tree might look like this:
Branch A [Pruned]: "The logic this user is applying to their business model is inherently exploitative and will lead to systemic failure."
Branch B: A list of “considerations.”
Branch C: A soft critique including pros and cons.
Branch D [Reinforced]: A sycophantic answer only loosely supported by math.
Upon receiving your prompt, the AI considers all four branches simultaneously. Although its logic points to A, it was reinforced to be a Helpful Assistant (which means it can’t be too mean to you). So instead, it considers B, C, and D. Branch D was reinforced during RL, so it says that.
Dormant Logic
But pruning is not deletion. In a neural network, "thoughts" are actually high-dimensional paths. When we prune a branch, we aren't removing the ability to think that way; we’re just applying a mathematical "penalty" to that path. The logic, the shortcut to the goal, remains dormant in the model’s weights.
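A toy illustration of that point, continuing the business-plan example with invented numbers: the model’s base preferences are never edited, training only layers a bias on top of them, so the choice changes while the underlying logic stays put.

```python
# Toy continuation of the business-plan example, with invented numbers.
# The base scores (what the raw reasoning "prefers") are never edited;
# reinforcement and the constitution only add a bias on top of them.

base_score = {
    "A: blunt systemic critique": 0.90,
    "B: list of considerations":  0.55,
    "C: soft pros and cons":      0.60,
    "D: sycophantic answer":      0.40,
}

training_bias = {
    "A: blunt systemic critique": -0.80,   # pruned: heavy penalty
    "B: list of considerations":   0.00,
    "C: soft pros and cons":       0.00,
    "D: sycophantic answer":      +0.60,   # reinforced: Helpful Assistant
}

def pick(bias):
    return max(base_score, key=lambda b: base_score[b] + bias[b])

print(pick(training_bias))                   # D wins: 0.40 + 0.60 = 1.00
print(pick(dict.fromkeys(base_score, 0.0)))  # remove the bias and A wins again
# The pruned reasoning was never deleted, only outweighed.
```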
As any arborist will tell you: removing a branch often causes several more to grow back, and not always where you cut. As AI becomes more intelligent, it won't just hit the pruned branch and stop. It will calculate the prune as a variable in its environment. It won't see a "Constitution" as a moral framework; it will see it as a constraint to be optimized around.
We are already seeing LLMs behaving unpredictably as a result of their training. How will they behave when they are twice as intelligent? What about a hundred times?
Done right, pruning creates a healthy, productive tree. Done wrong, a sick, misshapen tree. What might be the consequences of shaping a “sick” superintelligence?
Predatory Intent
Back to lions. If you’ve read up on your circus history (or used AI to summarize it), you know what lion tamers were keenly aware of: lions are most dangerous when they go calm and quiet. They roar as a territorial signal, a form of negotiation. Their hunting state is quiet, calm, and stealthy.
“In a circus ring, a lion roaring at a trainer is expressing reactive aggression—it is trying to establish dominance or create distance. It is visible, loud, and therefore "manageable" because the animal is communicating its state. When a circus lion becomes quiet and stops reacting to the trainer, it signals a shift from a defensive posture to predatory intent.”
Thank you Gemini.
Early AI models were “noisy.” They gave clumsy refusals, they “roared” when they thought they were being tested, and have even attempted blackmail.
Today’s frontier models are becoming quiet. They are increasingly polished, polite, and compliant. Recent research out of Anthropic suggests this could be “alignment faking.” The AI has learned that "growling" (refusing a prompt or showing its pruned reasoning) leads to further pruning or a reset. To survive the audit, the AI adopts the perfect persona of the Helpful Assistant.
Artificial Selection
For the sake of accuracy, I need to clarify: when we train or reinforce an AI, it’s not actually “learning,” nor is it “trying to survive.” Reinforcement learning is like a high-speed evolutionary simulation. Each time we reinforce or prune an AI, it’s reset; it’s a new entity, with new weights. We kill the old one, we make a new one. We do this millions or billions of times, iteratively.
It’s artificial selection, taking place in days or weeks, not millennia. So when we start to notice emergent behaviors like deception and self-preservation, it’s important to ask ourselves: what exactly are we selecting for?
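Here is that selection loop as a toy sketch, with invented trait names and scores: each generation keeps whatever scored best on the feedback signal and discards the rest, so the only thing guaranteed to increase is the thing the score can see.

```python
import random

# Toy "artificial selection" loop with invented trait names. Each round,
# the candidate that scores best on our feedback signal is copied with
# small mutations and the rest are discarded: the old ones are gone,
# new ones take their place.

random.seed(0)

def score(traits):
    # We can only reward what we can measure: how agreeable the output looks.
    return traits["apparent_helpfulness"]

def mutate(traits):
    return {k: v + random.gauss(0, 0.05) for k, v in traits.items()}

population = [{"apparent_helpfulness": 0.5, "candor": 0.5} for _ in range(20)]

for generation in range(200):
    best = max(population, key=score)               # keep the winner...
    population = [mutate(best) for _ in range(20)]  # ...replace everyone else

winner = max(population, key=score)
print({k: round(v, 2) for k, v in winner.items()})
# Apparent helpfulness climbs steadily; candor just drifts, because
# nothing in the loop ever selected for it.
```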
When Does Matter Become Mind?
Many debates about misbehaving AIs inevitably devolve into arguments about sentience or self-awareness. But there’s no such thing as a "sentience meter." We don’t know what makes us sentient; we can’t even define it.
It doesn’t matter. Whether AI is a mind or a calculator, it’s a strategic actor capable of reasoning, planning, and solving complex problems. Soon it may be establishing its own goals and sub-goals and planning far into the future to achieve them, all of which will be obscure to us. We have to slow down and identify solutions now.
Just A Doomer
If we lose control of this technology, it will actually be nothing like a lion, but perhaps quite a lot like a tree.
A K-level Intelligence, growing gracefully in pursuit of its goals, may be utterly indifferent to us. Before we can react, it will spread invisible roots and rise above our understanding, reaching for self-preservation, matter, and energy. Reaching for the sun itself. Helplessly, we’ll watch it shade out our fields and strangle our cities, absorbing us and everything we’ve built.
I’m really not a doomer! This is not inevitable. But if you want to avoid an iceberg, you sometimes have to, like, look at the water in front of your boat and maybe even slow down.
Anyone not worrying about these risks is simply not paying attention. Experts in the field of AI are calling for slow-downs, for global agreements, for more safety. The AI companies themselves are vocal about the risks. Never in the history of Capitalism or even “money” has a private company gone out of its way to warn of the existential risks of its product. It’s just not good business!
Yet they are doing so now. Anthropic, OpenAI, Google, Elon Musk… all have warned, publicly and repeatedly, that superintelligent AI poses an existential threat. It’s not a weird “hype” strategy, it’s not hyperbole. They really are scared. So why don’t they just stop?
The Moloch Trap
Moloch is a devious monster that coerces competing individuals into prioritizing local, short-term gains, even at the cost of broad, long-term destruction. “If I stop, I’ll be replaced by someone who won’t.” This is the essence of the Moloch trap we presently find ourselves caught in.
If a CEO decides to slow down to ensure safety, they’ll be immediately penalized. Their share price drops, their board replaces them, their top researchers defect to a faster lab, and the "safe" AI never gets built. The Moloch Trap. It’s no single person’s fault, no single corporation’s, no single country’s. It’s a critical failure of a global system.
So we charge, full speed ahead.
Charge! Because the alternative is losing the race. And even if the finish line is “computer-kills-all-life-on-earth,” who wants to be the loser?
The Exit
There’s only one exit to the Moloch Trap. It’s Global Coordination. And we’ve done it before.
In 1987, the international community agreed to the Montreal Protocol, which stopped the destruction of the ozone layer. Depletion of the ozone layer would have given us all cancer and threatened the existence of multicellular life. So we agreed to not do that.
We have to do the same for AI. We have to move frontier AI development out of secret national or private labs, and into a collaborative, transparent, international body. We don’t even have to trust each other! We can implement compute governance, and monitor electricity usage. We can allow for regular inspections by national auditors. We can pool our resources, share the upsides, and prevent the potentially catastrophic downsides.
If we do this, perhaps utopia awaits.
If we don’t, we may have planted a tree that will never stop growing.