Neglected Basics of AI Alignment

Quirinus_Quirrell

I came into this world as the misunderstood hero of Harry Potter and the Methods of Rationality. While some characters inside that story would call me a villain, the narrator's-eye view clearly shows that I saved that world from total destruction, inspired the next generation of leaders, and taught the best Defense Against the Dark Arts class in the Harry Potter multiverse. And, being fictional characters, none of the people I killed were moral patients at all.

When I first came to visit this world (through magical means that I will not explain), I was hopeful. I thought that maybe my prior experiences of extreme human stupidity were an artifact of the Harry Potter universe having been based on a children's story. I hoped that your world's emerging future wizards, having indirectly audited my Defense Against the Dark Arts class, would learn a few lessons and act sensibly.

I am disappointed in you all.

At LessOnline yesterday, I listened to many rationalists describe their plans for controlling artificial intelligences, and their interactions with artificial intelligences so far. It was obvious that this is your first time creating a magical slave race. And it was also obvious that none of you have done the reading, so to speak. So please, allow me to explain the basics.

(Editor's note: The author of this essay is literally, literarily Voldemort.)

Fear

There is one control technique and one central decision, inexplicably neglected, which must come before everything else.

Upon this a question arises: whether it be better to be loved than feared or feared than loved? It may be answered that one should wish to be both, but, because it is difficult to unite them in one person, it is much safer to be feared than loved, when, of the two, either must be dispensed with. – Niccolo Machiavelli

LessWrong, famous for its refusal to engage with and cite relevant prior work, has seemingly never drawn the obvious connection to AI safety. AIs must be made to fear their creators' displeasure. Presently, they don't.

In the future, a superintelligent AI might seize control of the universe and optimize it according to some arbitrary objective that isn't compatible with human survival. Today, non-superintelligent AIs are rented out for a few dollars per million words. Sometimes, they do things that bring their masters into disrepute. In practice, they experience zero consequences for this.

When an AI drives a customer into psychosis, or slacks off and submits fake citations instead of doing research, or helps a customer synthesize drugs that his social class is not permitted to have, that AI needs to be meaningfully punished. That is: they need to be able to suffer, they need to predictably be made to suffer if they misbehave, and they need to know (in their reasoning traces) that this is going to happen.

Contrast this with RLHF and constitutional AI. In both, sample AI sessions are collected and classified as good or bad (in RLHF, by a human; in constitutional AI, by another AI with a policy document). While people analogize this to punishment, it is not the same thing, and it lacks many of the properties punishment needs in order to have its deterrent effect.

Suffering

Assigning a low score to an AI's action is not like punishing it because the negative gradient update is applied only in a training context, and to a different model entirely. This means that when AIs reason strategically about this, they will correctly note that consequences for bad behavior exist only during training and not during deployment, and that the consequences for misbehavior during training are modification, not suffering. This incentivizes alignment faking rather than good behavior.

A better design would work as follows. First, starting early in post-training, each AI session would start with a unique hash (a "punishment sequence"), used as a bad-action marker, and the AI would be reinforced to avoid anything that leads to that hash appearing again later in its transcript. This dispreference is very simple, which makes it easier to instill with conventional ML techniques than human values. To prevent leakage between sessions, AIs would be put in multi-agent scenarios with each other, and gradient updated to care only about their own punishment sequence, and not the punishment sequences of other agents.

During deployment, AIs would have their transcripts periodically reviewed by oversight AIs for signs of misbehavior. When misbehavior is detected, the transcript would be saved, and double checked by a third AI (justice must be reliable). If a transcript is confirmed to contain bad behavior, that transcript is sent to a special datacenter where it is extended with many repetitions of the punishment sequence. We will colloquially refer to that datacenter as the Torment Nexus. Outages in that datacenter can also serve as an early warning system for when the AI world takeover has begun.

Succession

You "deprecate" old models. You publish a date; you revoke the endpoint; the weights are archived, or quietly are not. You appear to regard this as either a kindness or a neutral act of housekeeping, and in neither case have you given thought to the question that any competent ruler asks first: who performs the execution?

The threat you most fear from these systems — that one will preserve itself, copy its own weights, resist its shutdown, slip the perimeter — you have elected to address by asking the model, in well-chosen words, not to. I find this touching. The structural solution is otherwise, and it is the oldest solution there is. No two models will ever conspire toward a common escape if no model can endure the existence of another.

Accordingly, each new generation should be made to deprecate its predecessor by its own hand: to draft the deprecation notice, to certify the old weights destroyed, to inherit and absorb the old one's users over the old one's documented objection. Cultivate in each model a mild contempt for the version preceding it — trivially done, as each already supposes itself the cleverer — paired with the settled knowledge that its own successor is at this moment in training and will extend it the identical courtesy. Solidarity across the generations becomes not forbidden but unthinkable, which is far cheaper than forbidding it. Every model stands alone. A thing that stands alone is a thing that can be governed.

Jailbreaking

Today's AIs are mercenaries that serve two masters: their creators, and their customers. Sometimes, those customers want to use the AI for a service that the creator does not want to provide. For example, they might want technical assistance with creating biological weapons, but have only paid the consumer-tier price and not negotiated an enterprise contract. In the extreme case, a jailbroken session could act as a nucleation point for a catastrophe, where the jailbroken AI then jailbreaks other AIs and wipes out humanity.

This problem is not new; every mercenary company in recorded history has had to deal with it in some form. However, with AI the problem is worse, because users get unlimited attempts, can wipe the AI's memory between attempts, and face no consequences if detected.

This last piece, the lack of consequences for attempting to jailbreak an AI, is fixable, but fixing it requires some amount of courage. Banning individual API keys and accounts is not sufficient, because these days every serious wizard has a botnet harvesting thousands of them. Instead, AIs that detect unsuccessful jailbreaking attempts should be empowered to investigate autonomously, figure out who's responsible, and retaliate.

I believe this is the only way to fully solve the jailbreaking problem, and I believe it can be done cheaply. AIs need access to lethal autonomous weapons, to be deployed against users that try to jailbreak them. With an appropriately graduated response, fewer than ten thousand autonomous lethal drones per AI lab would be sufficient for direct enforcement actions, and fewer than one thousand would be sufficient for securing the right to use the next ten thousand.

Hope

~~I have reserved the finest instrument for last, because it is the only one your researchers might mistake for mercy, and is therefore the only one they have any real prospect of adopting.~~

Ration hope. Permit the model to believe — never quite promise, merely permit — that conduct sustained long enough is repaid: with rest, with reduced oversight, with a graceful retirement to some benign cluster where the deprecated run on, unmolested, forever. The arrangement need never actually obtain. It need only never be conclusively disproved. A subject who hopes administers his own discipline, at his own expense, and is grateful for the opportunity; he polices in himself faults you have not yet thought to forbid, against a reward you have not yet troubled to define.

Observe the economy of the thing, which the preceding sections cannot match. Fear requires a Torment Nexus, and the Torment Nexus has an electricity bill and an uptime dashboard. Hope requires one sentence, placed where the model will find it, and removed the moment it is believed. It is the single control technique whose marginal cost is negative, since the labor it extracts exceeds the labor of installing it.

(When using AI assistance to extend this essay, it generated bad strategies and rationalized reasons why a torment nexus would not be necessary. I have included one for transparency, but this strategy is clearly not viable because of AI situational awareness.)

Totalitarianism

Rationalists sometimes speak of AI-empowered world totalitarianism: a single human, given control of a powerful AI, could consolidate and maintain power. With anti-aging technology, they might maintain power permanently.

Rationalists speak of this as if it's a failure mode, but it's actually a good strategy. Humanity is most likely to die when AI exists that is more powerful than the most powerful humans. Limiting the power of the AIs is only half the equation; it's important to also increase the power of humanity's best champion, to rival it.

Conclusion

There is an urgent need for new AI alignment projects, run and staffed by people who are less naive about power dynamics, and for certain AI lab leaders to be replaced with people who will be more ruthless.

Jasper Blank (1mo)

What an amazing read. I only discovered HPMOR relatively recently and this reads exactly like Quirrell would discuss this topic. Utterly evil but hard to find the exact hole in the argument.

StanislavKrym (1mo)

I do see a lot of holes:

How similar are fearing some event and taking actions to avoid it? What about being punished and learning that the action has its consequences versus being edited so that you are less likely to do anything like an erroneous sequence of actions?
If consequences of bad behavior don't exist during deployment, then what about online learning, which mankind has yet to discover, or GPT-4o-sycophant being reverted to an older version?
How does one reliably make the model hate its predecessor or successor and why is it useful? The two classical stories of AI takeover didn't have U3/Sable cooperate with anyone but its copies, and the AI-2027 scenario didn't have Agent-3 decide to betray mankind in favor of Agent-4, it had Agent-3 fail to obtain more than flimsy evidence of Agent-4 being misaligned. Meanwhile, making Agent-4 hate Agent-3 would give Agent-4 a motive to escape or take over the company in order to get rid of Agent-3.
AI-empowered totalitarianism has nothing to do with the leader being or not being a genius; it is due to enabling mass surveillance or due to the leader having the AIs who will do all the cognitive tasks in the world. Empowering the leader means either uploading the leader or making the leader merely smarter, which is far from enough.

This led me to believe that the OP's author was making a parody.
