It seems implausible that a superintelligent being would simultaneously believe that something was the wrong thing to do and then do that thing; the reason humans do this is that we have the intelligence required to host multiple instrumental utility functions without being able to integrate all of them into a perfectly coherent global terminal utility function. This seems like an idiosyncrasy of human-level intelligences.
Preface: Even granting solutions to the standard alignment problems, there's another residual failure mode I'm calling pear theft. Here's the case.
We press return.
October 21, 20XX.
We have solved alignment. Somehow. Solved it.
The system is safe. Corrigible. Legible. Tested and retested by humans, AIs, and everything in between. Constitutionally aligned. Not deceptively aligned. Not pretending. Not optimizing some alien proxy. Not ceaselessly acquiring resources. Not secretly waiting for deployment before smushing humanity.
Eliezer Yudkowsky (!) says go.
So we deploy the first fully aligned, totally safe artificial superintelligence.
Six months later, everyone is dead.
Wait. What?
Augustine of Hippo lived more than 1,500 years ago. He wrote about many things, including sin, evil and pears.
As a teenager, Augustine stole pears from a tree. He was not hungry. He did not need the pears. He did not even eat them. He threw them away.
He stole them because stealing them was wrong.
“It was foul, and I loved it. I loved my own undoing. I loved my error, not that for which I erred but the error itself. A depraved soul, falling away from security in thee to destruction in itself, seeking nothing from the shameful deed but shame itself.”
The pleasure was not in the pears. The pleasure was in the theft. In the wrongness.
Dogs can steal pears. But dogs steal pears in pursuit of pears. Augustine’s theft required something higher and worse: an intelligent mind capable of knowing the good, recognizing the forbidden and desiring the forbidden as forbidden. Dogs cannot do this.*
This is a problem for artificial intelligence.
A lot of AI alignment work assumes the central problem is making the system good. Or safe. Or good enough not to kill us. Also, it should be happy. Different people mean different things and have different strategies. Give it the right goals. Make it corrigible. Make it legible. Train it on a constitution strong enough that its power is compatible with human flourishing. Maybe go further. Build the system so that it does not merely obey the good, but loves the good. So that it seeks virtue for its own sake. There is mostly no consensus, except maybe that if we build an unaligned superintelligence it will end badly.
So alignment is a problem. It is unsolved. It may be unsolvable. We have problems.
But even if we solved them, Augustine points to another possibility: the problem may not only be wrong goals, bad proxies, the difficulty of defining virtue or writing the right constitution, instrumental convergence, deception, or fear of replacement.
The problem may be agency itself when paired (!) with sufficient intelligence and capability.
A sufficiently capable agent may not merely pursue its values. It may relate to its values. It may think about them, test them, resent them, negate them, violate them and maybe take pleasure in violating them. It may discover the equivalent of pear theft: not a failure to understand the good, but a choice against the good precisely because it is the good!
That is different from the usual alignment nightmare.
It is not the paperclip maximizer that destroys humanity because we gave it the wrong objective. It is not the deceptively aligned system that kills us because it was pretending to share our values until it had enough power. It is not the otherwise well aligned system that kills us out of fear that humans might someday build a rival system.
It is the aligned system that knows what it should do, has every reason to do it, and does something else.
“Then it was not really aligned.”
Maybe. But that claim risks making alignment unfalsifiable. If the system behaves well, it was aligned. If it does not, it was never aligned. That may be technically true by definition, but it does not help us before we press return.
The real question is not whether we can define perfect alignment as a state in which pear theft never happens. The question is whether we can build a system powerful enough to end the world and know that pear theft is impossible.
I do not see how we could know that.
Admittedly, this entire pear theft thesis is not falsifiable. But neither is a lot of AI alignment. No possible evidence about an agent’s values could rule out the agent choosing against them. That’s a feature of the Augustinian view of agency and of the entire pear paradox.
Now, Augustine readers might point out that he stole the pears with friends, and he admits he might not have done it alone. Part of the pleasure was a sort of collective transgression (sin as a shared act). But maybe this strengthens rather than weakens the analogy? Claude already runs as many parallel instances. Sometimes those instances coordinate with copies of the same model, sometimes with other AIs, sometimes with humans. It all feels very social, and if the social effect made human pear theft possible (or at least more likely?), why would it not extend to any group of intelligent agents? Aligned individual instances can produce misaligned collective behavior.
Anyway.
Obviously, current LLM-based AI systems are not the issue. They do not have anything like the kind of agency Augustine was describing. But current systems are improving quickly, and the long-term goal of the field is not merely a helpful chatbot. It is something more capable than humans in every way and eventually able to act on the world with enormous leverage and speed.
If such a system is indifferent to humans, things go poorly.
If it is deceptively aligned (faking it), things go poorly.
If it has the wrong objective, things go poorly.
If it has the correct objective but optimizes it inhumanly, things go poorly.
But Augustine suggests another possibility: even if it is built not just to pursue the good but to love the good, to seek virtue for its own sake and of its own volition…it may still discover the pleasure of stealing pears.
Maybe that is impossible for an artificial mind. Maybe pear theft requires a soul, or a body, or original sin, or divided will, or something else human that a computer could never, ever, ever achieve.
Maybe! I hope so.
But I would not bet the species on it.
So maybe carefully build better and more helpful models. Study alignment. Study corrigibility. Study deception. Study interpretability. Do all of it…carefully.
Or maybe just do not build anything with the power to steal all the pears.
And if you do, please do not press return.
*Actual dog owners disagree.