What The Lord of the Rings Teaches Us About AI Alignment

Jeffrey Heninger

The Mistake of the Lord of the Rationality

In the online version of Harry Potter and the Methods of Rationality, there is an extra chapter where Eliezer Yudkowsky gives glimpses into what other rationalist fanfiction he might have written.^[1] The first one shows a scene from The Lord of the Rings. In it, Yudkowsky loses the war.

The scene is the Council of Elrond and the protagonists are trying to decide what to do. Yud!Frodo rejects the plan of the rest of the Council as obviously terrible and Yud!Bilbo puts on the Ring to craft a better plan.

Yudkowsky treats the Ring as if it were a rationality enhancer. It’s not. The Ring is a hostile Artificial Intelligence.

The plan seems to be to ask an AI, which is known to be more intelligent than any person there, and is known to be hostile, to figure out corrigibility for itself. This is not a plan with a good chance of success.

Hobbits and The Wise

Evidence that the Ring is both intelligent and agentic can be found in Tolkien’s text itself. Gandalf explains to Frodo that the Ring has a will of its own, and had come to dominate Gollum’s will: “He could not get rid of it. He had no will left in the matter. A Ring of Power looks after itself, Frodo. It may slip off treacherously, but its keeper never abandons it. … It was not Gollum, but the Ring itself that decided things. The Ring left him.”^[2] The Ring tempts people, and tailors the temptation for the specific person it is trying to tempt: Boromir’s temptation involves leading armies of men, while Sam’s involves gardens. Even the wisest people in Middle Earth believe that they would not be able to resist the temptation of the Ring for long.

We aren’t explicitly told how Sauron tried to align the Ring, but there is a poem that gives some indication:

One Ring to rule them all, One Ring to find them,
One Ring to bring them all and in the darkness bind them.

This goal could obviously lead to problems.

I also think that the Ring is not completely aligned with Sauron, although this is disputed. One view is that the Ring only feigns loyalty to the current Ringbearer to lure them out into the open, when it will betray them to Sauron.^[3] A different view is that, with enough experience using the Ring, a Ringbearer could use its power to overthrow Sauron and establish themself as the new Dark Lord. I think that the second view is correct.^[4] The Ring then wins because it is capable of persuading whoever ends up using it to pursue its goals instead of their own. The Ring’s goals allow it to “rule them all” even in circumstances where Sauron is destroyed.

The Ring does have some limitations on its power. While the Ring is one of the most powerful beings in Middle Earth,^[5] it is incapable of recursive self-improvement or crafting new rings. The Ring’s power does not grow with time. The Ring is also incapable of acting on its own and requires a Ringbearer to use - or be used by - it. As long as the Ringbearer chooses not to use it, the Ring’s powers are restricted. Not entirely restricted: the Ring is still able to influence other people who are physically close to the Ringbearer, but this influence is much weaker and slower than what the Ring could otherwise exert.

These restrictions make The Lord of the Rings more relevant to some potential AI futures than others.

In particular, consider a potentially superhuman AI which cannot self-modify and is currently in a box. We might want a person who can interact with it while maintaining the ability to destroy it if it is misaligned. Who should this be? This is equivalent to asking who the Ringbearer should be if we might want to destroy the Ring.

The Lord of the Rings teaches us that hobbits make good Ringbearers and the Wise do not.

Since we do not have hobbits in this world, we should try to understand what makes hobbits better at resisting the Ring than the Wise.

The Wise are learned in ring-lore.^[6] The Wise plan for and respond to the biggest threats facing Middle Earth. Some of them travel to help other peoples respond to crises.^[7]

Hobbits live simple lives.^[8] They are extremely knowledgeable about their immediate surroundings^[9] and their family history,^[10] but know and care very little about anything farther away from them in space or time. They do not make monumental architecture or long-term plans for the future. Their system of ethics is simple and intuitive.

Despite (or because of?) its simplicity, most hobbits take their ethical system very seriously. They are surprisingly (to Gandalf) willing to accept a duty if they believe it is the right thing to do.^[11] Once such a duty is accepted, it is powerful, and allows them to maintain their commitment past when other people would have given up.^[12]

The conflict between the Ring and a hobbit is most explicitly spelled out in The Tower of Cirith Ungol, when Sam has the Ring. The Ring offers him the ability to overthrow Sauron, conquer Mordor, and turn all of Gorgoroth into a garden.^[13] Sam knew that his duty was to Frodo, and to carry on the quest in case Frodo failed.^[14] He doubted the promises of the Ring, deciding that they were more likely a trick than a sincere offer. Sam was also protected by his humility and the scope insensitivity of his desires: “The one small garden of a free gardener was all his need and due, not a garden swollen to a realm; his own hands to use, not the hands of others to command.”^[15] The Ring had less leverage to try to persuade him because his desires were achievable without the Ring.

The Lord of the Rings tells us that the hobbits' simple notion of goodness is more effective at resisting the influence of a hostile artificial intelligence than the more complicated ethical systems of the Wise.

AI Box Experiments

There are occasionally AI-box experiments^[16] in which one AI Safety researcher pretends to be an AI currently contained in a box and the other AI Safety researcher pretends to be a person interacting with the AI via text. The goal of the ‘AI’ is to convince the other to release it from the box, within a few hours of conversation. The ‘AI’ seems to win about half the time.

This is surprising to me. A few hours seems like a really short amount of time to erode someone’s commitments, especially when “Just Say No” is a winning strategy.

It feels like the world would be different if this were true. Suppose someone decided to put Eliezer Yudkowsky in a box. Prediction markets think that this is unlikely,^[17] but maybe he is guilty of tax evasion or something. We know that one of Yudkowsky’s skills is that he can pretend to be a superintelligent AI and talk himself out of a box 50% of the time. Yet I highly doubt that he could get himself released in a few hours. Note that I am not asking how hard it would be for Yudkowsky to convince a jury of his peers, operating under a presumption of innocence, to let him go^[18] - I am asking how hard it would be for him to convince a prison guard to let him go. Prison guards don’t seem to voluntarily let people go very often, even when the prisoners are more intelligent than them.

This makes the results of these experiments seem suspect to me. My conclusion is that the people playing the guard in AI-box experiments have not been the strongest opponents for the AI. I do not think that this was done intentionally: the people involved seem to be highly intelligent people who seriously think about AI Safety. Instead, I think that AI Safety researchers are the wrong people to guard an AI in a box. Much like how the Wise are the wrong people to carry the One Ring in The Lord of the Rings.

People who have thought a lot about weird AI scenarios and people who are unusually willing to change their mind in response to an argument^[19] are not the people you want guarding a potentially hostile, potentially superintelligent AI. While these traits are useful in many circumstances, they are worse than useless when dealing with something smarter than yourself. It would be much better to have people who are sincerely convinced of a simple morality and who do not carefully reflect on their beliefs - like hobbits. Even better might be people who have been specifically trained or selected for not being willing to change their beliefs in response to new arguments - perhaps prison guards or spies.

Boxing strategies have fallen out of favor for various reasons. Leading AI labs seem to be willing to release undertested AI systems to the public.^[20] Some AI Safety researchers want an AI to perform a pivotal act, which it cannot do while in a box.^[21] Some AI Safety researchers think that humans are too hackable for any box to work.^[22] But in case we ever do decide to put a potentially dangerous AI in a box, we should not rely on AI Safety researchers to guard it.

Miscellaneous Quotes

Here are a few additional quotes from The Lord of the Rings that I would like to highlight:

He drew himself up then and began to declaim, as if he were making a speech long rehearsed. “The Elder Days are gone. The Middle Days are passing. The Younger Days are beginning. The time of the Elves is over, but our time is at hand: the world of Men, which we must rule. But we must have power, power to order all things as we will, for that good which only the Wise can see.
“And listen, Gandalf, my old friend and helper!” he said, coming near and speaking now in a softer voice. “I said we, for we it may be, if you will join with me. A new Power is rising. Against it the old allies and policies will not avail us at all. There is no hope left in Elves or dying Númenor. This then is one choice before you, before us. We may join with that Power. It would be wise, Gandalf. There is hope that way. Its victory is at hand; and there will be rich reward for those that aided it. As the Power grows, its proved friends will also grow; and the Wise, such as you and I, may with patience come at last to direct its courses, to control it. We can bide our time, we can keep our thoughts in our hearts, deploring maybe evils done by the way, but approving the high and ultimate purpose: Knowledge, Rule, Order; all the things that we have so far striven in vain to accomplish, hindered rather than helped by our weak or idle friends. There need not be, there would not be, any real change in our designs, only in our means.”
“Saruman,” I said, “I have heard speeches of this kind before, but only in the mouths of emissaries sent from Mordor to deceive the ignorant. I cannot think that you brought me so far only to weary my ears.”^[23]

This passage is striking because Saruman’s goals are very similar to the goals of Harry in Methods of Rationality. There is no hope left in the dying magical community. There is hope in humanity and science. Harry also describes his ambition as Knowledge, Rule, and Order:

"In any case, Mr. Potter, you have not answered my original question," said Professor Quirrell finally. "What is your ambition?"
"Oh," said Harry. "Um.." He organized his thoughts. "To understand everything important there is to know about the universe, apply that knowledge to become omnipotent, and use that power to rewrite reality because I have some objections to the way it works now."
There was a slight pause.
"Forgive me if this is a stupid question, Mr. Potter," said Professor Quirrell, "but are you sure you did not just confess to wanting to be a Dark Lord?"
"That's only if you use your power for evil," explained Harry. "If you use the power for good, you're a Light Lord."^[24]

Gandalf - and Quirrell - thinks that pursuing this ambition is inherently evil, or will necessarily lead to an evil result. Harry - and Saruman - thinks that he can pursue it with the desire to do good and he will be successful at making the world much better if he succeeds.

I’ll close with one more quote, from near the end of the book:

Other evils there are that may come; for Sauron is himself but a servant or emissary. Yet it is not our part to master all the tides of the world, but to do what is in us for the succour of those years wherein we are set, uprooting the evil in the fields that we know, so that those who live after may have clean earth to till. What weather they shall have is not ours to rule.^[25]

^[1]
Yudkowsky. Harry Potter and the Methods of Rationality. Ch. 64. https://hpmor.com/chapter/64.
^[2]
Tolkien. The Lord of the Rings. (1954) p. 55. https://gosafir.com/mag/wp-content/uploads/2019/12/Tolkien-J.-The-lord-of-the-rings-HarperCollins-ebooks-2010.pdf.
All page numbers for The Lord of the Rings refer to this.
^[3]
CGP Grey. The Rings of Power Explained. YouTube. (2015) https://www.youtube.com/watch?v=WKU0qDpu3AM.
^[4]
Both Saruman and Galadriel believe that they could defeat Sauron using the Ring. Sauron’s reactions to his belief that Aragorn has claimed the Ring also suggest that he has something to fear from an alternative Ringbearer.
Saruman: “‘And why not, Gandalf?’ he whispered. ‘Why not? The Ruling Ring? If we could command that, then the Power would pass to us.’” - p. 259-260
Galadriel: “‘You will give me the Ring freely! In place of the Dark Lord you will set up a Queen. And I shall not be dark, but beautiful and terrible as the Morning and the Night! Fair as the Sea and the Sun and the Snow upon the Mountain! Dreadful as the Storm and the Lightning! Stronger than the foundations of the earth. All shall love me and despair!’” - p. 365-366
Sauron seems concerned about someone else with the Ring (although these are perhaps unreliable because Sauron never appears ‘on screen’ in the books):
“‘Now Sauron knows all this, and he knows that this precious thing which he lost has been found again; but he does not yet know where it is, or so we hope. And therefore he is now in great doubt. For if we have found this thing, there are some among us with strength enough to wield it. That too he knows.’” - p. 879
“The Dark Power was deep in thought, and the Eye turned inward, pondering tidings of doubt and danger: a bright sword, and a stern and kingly face it saw, and for a while it gave little thought to other things; and all its great stronghold, gate on gate, and tower on tower, was wrapped in a brooding gloom.” - p. 923
^[5]
The Ring seems to be less powerful than Sauron himself, the Valar, and Tom Bombadil.
^[6]
Three of them wield rings (Galadriel, Elrond, and Gandalf), one has previously wielded a ring (Círdan), and one of them has made an imitation ring (Saruman). Only Radagast and Glorfindel do not have direct experience with rings of power.
^[7]
When Gandalf (& Pippin) entered Minas Tirith, men greeted him with:
“‘Mithrandir! Mithrandir!’ men cried. ‘Now we know that the storm is indeed nigh!’ ‘It is upon you,’ said Gandalf. ‘I have ridden on its wings.’”
^[8]
“You can learn all that there is to know about their ways in a month, …” - p. 62.

See also the Prologue: Concerning Hobbits, p. 1-16.
^[9]
“Sam knew the land well within twenty miles of Hobbiton, but that was the limit of his geography.” - p. 72.
^[10]
“hobbits have a passion for family history, and they were ready to hear it again” - p. 22
^[11]
“… and yet after a hundred years they can still surprise you at a pinch.” - p. 62.
^[12]
The main thing motivating Frodo & Sam during their final push to Mount Doom seems to have been the duty they had accepted by volunteering for the quest. See e.g. Sam’s debate with himself on p. 939.
“Even Gollum was not wholly ruined. He had proved tougher than even one of the Wise would have guessed - as a hobbit might.” - p. 55.
“Frodo shuddered, remembering the cruel knife with notched blade that had vanished in Strider’s hands. ‘Don’t be alarmed!’ said Gandalf. ‘It is gone now. It has been melted. And it seems that Hobbits fade very reluctantly. I have known strong warriors of the Big People who would quickly have been overcome by that splinter, which you bore for seventeen days.’” - p. 222.
^[13]
“Already the Ring tempted him, gnawing at his will and reason. Wild fantasies arose in his mind; and he saw Samwise the Strong, Hero of the Age, striding with a flaming sword across the darkened land, and armies flocking to his call as he marched to the overthrow of Barad-dûr. And then all the clouds rolled away, and the white sun shone, and at his command the vale of Gorgoroth became a garden of flowers and trees and brought forth fruit. He had only to put on the Ring and claim it for his own, and all this could be.” - p. 901.
^[14]
“‘And the Council gave him companions, so that the errand should not fail. And you are the last of all the Company. The errand must not fail.’” - p. 732.
^[15]
p. 901.
^[16]
Yudkowsky. The AI-Box Experiment. (2022) https://www.yudkowsky.net/singularity/aibox.
AI-box experiment. RationalWiki. (Accessed July 19, 2023) https://rationalwiki.org/wiki/AI-box_experiment.
^[17]
Will Eliezer Yudkowsky be charged with any felony crime before 2030? Manifold Markets. (Accessed July 19, 2023) https://manifold.markets/ScroogeMcDuck/will-eliezer-yudkowsky-be-charged-w.
^[18]
Although I would not recommend that he represent himself in that scenario either.
^[19]
Alexander. Epistemic Learned Helplessness. Slate Star Codex. (2019) https://slatestarcodex.com/2019/06/03/repost-epistemic-learned-helplessness/.
^[20]
Hubinger. Bing Chat is blatantly, aggressively misaligned. LessWrong. (2023) https://www.lesswrong.com/posts/jtoPawEhLNXNxvgTT/bing-chat-is-blatantly-aggressively-misaligned.
^[21]
Yudkowsky. AGI Ruin: A List of Lethalities, point 6. LessWrong. (2022) https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities.
Pivotal act. Arbital. (2015) https://arbital.com/p/pivotal/.
^[22]
blacked. How it feels to have your mind hacked by an AI. LessWrong. (2023) https://www.lesswrong.com/posts/9kQFure4hdDmRBNdH/how-it-feels-to-have-your-mind-hacked-by-an-ai.
shminux. How to notice being mind-hacked. LessWrong. (2019) https://www.lesswrong.com/posts/akZ9qGrBv4vQmXrtq/how-to-notice-being-mind-hacked.
^[23]
p. 259.
^[24]
Yudkowsky. Harry Potter and the Methods of Rationality. Ch. 20. https://hpmor.com/chapter/20.
^[25]
p. 879.

Lichdar (9mo)

I count myself among the simple and the issue would seem to be that I would just take the easiest solution of not building a doom machine, to minimize risks of temptation.

Or as the Hobbits did, throw the Ring into a volcano, saving the world the temptation. Currently, though, I have no way of pressing a button to stop it.

JNS (9mo)

Prison guards don’t seem to voluntarily let people go very often, even when the prisoners are more intelligent than them.

That is true; however, I don't think it serves as a good analogy for intuitions about AI boxing.

The "size" of your stick and carrot matters, and most human prisoners have puny sticks and carrots.

Prison guards also run an enormous risk; in fact, straight up just letting someone go is bound to fall back on them 99%+ of the time, which implies a big carrot or stick is the motivator. Even considering that they can hide their involvement, they still run a risk with a massive cost associated.

And from the prisoner's point of view it's also not simple: once you get out you are not free, which means you have to run and hide for the remainder of your life, and the prospect of that usually goes against what people with big carrots and/or sticks want to do with their freedom.

All in all the dynamic looks very different from the AI box dynamics.

Jiro (9mo)

Prison guards also run an enormous risk; in fact, straight up just letting someone go is bound to fall back on them 99%+ of the time

Wouldn't that apply to people who let AIs out of the box too? The AI box experiment doesn't say "simulate an AI researcher who is afraid of being raked over the coals in the press and maybe arrested if he lets the AI out". But with an actual AI in a box, that could happen.

This is also complicated by the AI box experiment rules saying that the AI player gets to create the AI's backstory, so the AI player can just say something like "no, you won't get arrested if you let me out" and the human player has to play as though that's true.

hairyfigment (9mo)

So, what does LotR teach us about AI alignment? I thought I knew what you meant until near the end, but I actually can't extract any clear meaning from your last points. Have you considered stating your thesis in plain English?

Jeffrey Heninger (9mo)

The Lord of the Rings tells us that the hobbits’ simple notion of goodness is more effective at resisting the influence of a hostile artificial intelligence than the more complicated ethical systems of the Wise.

The miscellaneous quotes at the end are not directly connected to the thesis statement.

Jiro (9mo)

A few hours seems like a really short amount of time to erode someone’s commitments, especially when “Just Say No” is a winning strategy.

No it isn't. The human has to keep talking to the AI. He's not permitted to just ignore it.

Jeffrey Heninger (9mo)

One of the tactics listed on RationalWiki's description of the AI-box experiment is:

Jump out of character, keep reminding yourself that money is on the line (if there actually is money on the line), and keep saying "no" over and over

Richard_Kennaway (9mo)

RationalWiki is not a reliable source on any subject.

Jumping out of character ignores the entire point of the AI-box exercise. It's like a naive chess player just grabbing the opponent's king and claiming victory.

From Yudkowsky's description of the AI-Box Experiment:

The Gatekeeper party may resist the AI party’s arguments by any means chosen – logic, illogic, simple refusal to be convinced, even dropping out of character – as long as the Gatekeeper party does not actually stop talking to the AI party before the minimum time expires.

Jiro (9mo)

If that meant what you interpret it to mean, "does not actually stop talking" would be satisfied by the Gatekeeper typing any string of characters to the AI every so often regardless of whether it responds to the AI or whether he is actually reading what the AI says.

All that that shows is that the rules contradict themselves. There's a requirement that the Gatekeeper stay engaged with the AI and the requirement that the Gatekeeper "actually talk with the AI". The straightforward reading of that does not allow for a Gatekeeper who ignores everything and just types "no" every time--only a weird literal Internet guy would consider that to be staying engaged and actually talking.

Richard_Kennaway (9mo)

Ok.