I'm often confused by how "loss of control" is somewhat similar in meaning to "misalignment" in the context of AGI risk. An average person might reasonably equate the two, or feel like there's a large overlap in meaning between them. Here's my attempt at deconfusing myself.

TLDR: 

  • In this post, I tried to separate the "control problem" from the "alignment problem".
  • My definition of "control" is: 
    • (1) Ways of ensuring an AI agent is aligned that do NOT directly affect the agent’s policy (i.e. everything short of "mind control"), while also 
    • (2) Ensuring recoverability from x-risk, limiting damage from the agent, and maintaining humanity’s ability to contain the agent. 
    • Some tools of "control" are: monitoring, capabilities limiting, and having a kill switch.
  • My definition of "alignment" is: ways of ensuring an AI agent is aligned (both outer and inner) that directly affect the agent’s policy.
    • Some ways of achieving "alignment" are: specifying rewards correctly, having a diverse training environment, and having an appropriate inductive bias. 
  • There are three implications from this delineation:
    • "Control" tools are probably useless when AGI systems have far surpassed humanity's capabilities.
    • It seems imperative that humanity invests in "control" tools or ways to speed up humanity's capabilities even before human-level AGIs are developed. 
    • Given that "control" tools aren't likely to be helpful once AGI systems become superintelligent, and assuming we have bought enough time for alignment, we may have to accept a world with aligned AGIs that we have no real control over.

To illustrate what I mean above:

  • I'll list four examples using a two-by-two matrix, starting with an anthropomorphised example and ending with an AGI example. 
  • Then, I'll expand more on the tools of "control" I've listed above.
  • Finally, I'll talk about a few implications.

Much thanks to Mo Putera and Chin Ze Shen for their feedback and inspiration. All mistakes are mine.

Examples of "control" vs "alignment"

Example 1: Child

|  | Aligned | Misaligned |
| --- | --- | --- |
| In control | A child who is successful in ways that the parents intended, and is actively supportive of their parents being overly enmeshed in their lives (e.g. GPS phone tracking, joint bank accounts, etc). | A child who is a drug addict and doesn't want to be with their parents. However, their parents force them into rehab and monitor them constantly. |
| Loss of control | A child who is successful in ways that the parents intended, and tries to be independent from their parents (e.g. living alone, separate bank accounts, etc). | A child who is a drug addict and doesn't want to be with their parents. Furthermore, they ran away from home. |

Example 2: Super-humans

|  | Aligned | Misaligned |
| --- | --- | --- |
| In control | Superman, who is actively supportive of being accountable to humanity by encouraging humanity to implant a kryptonite kill switch with a monitoring device and a power limiter inside him. | Homelander, who is forced to have a kill switch with a monitoring device and a power limiter implanted inside him. |
| Loss of control | Superman, who doesn't want to be accountable to humanity. He still strives to help humanity as much as he can. | Homelander, who is free to do whatever he wants. |

Example 3: AGI

|  | Aligned | Misaligned |
| --- | --- | --- |
| In control | An AGI who is both inner and outer aligned. It is also actively supportive of being accountable to humanity: it allows humanity to install a kill switch and a monitor, and supports humanity by genetically enhancing humans to have comparably high IQ to keep up (or possibly turning humans into cyborgs). | An AGI who is both inner and outer misaligned, but humanity has installed a kill switch, a monitor, and other control tools. This is a world where humanity should probably just flip the kill switch. But if the AGI is only inner misaligned yet still capable (and potentially corrigible), humanity may want to keep the AGI around. |
| Loss of control | An AGI who is both inner and outer aligned, but without any kill switch or monitor installed. This is a world where humanity might flourish, but may feel nervous about the AGI making a rare mistake that could cause a large loss of value. | An AGI who is both inner and outer misaligned, and humanity has not installed a kill switch, a monitor, or other control tools. |

"Control" tools

Here are some tools I could think of for "controlling" an AGI even while it is misaligned:

  • A tool for monitoring an agent's actions (even if it's just a sample of actions) or its "cognition" (e.g. a mechanistic interpretability tool).
    • Although monitoring doesn't directly affect the agent's actions, it's still a necessary tool: knowing what the agent is doing or "thinking" helps humanity decide whether to escalate to other tools of control.
  • A tool for limiting the agent's capabilities (e.g. processing speed, power, money, assets, damage, duplication, etc).
  • A tool for shutting down the agent (i.e. kill switch), assuming we have limited the agent's capability to duplicate itself.
  • A controlled environment where the agent is deployed (e.g. the agent has a unique identifier and isn't allowed to access certain parts of the internet). 

There might be other types of "control" tools that I've not thought of.
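
As a toy illustration (and only a sketch, not a proposal for an actual safety mechanism), here's roughly how the tools above might be composed into a wrapper around an agent without ever touching its policy. The `ControlledAgent` class, its methods, and its parameters are all hypothetical names of my own:

```python
import random

class ControlledAgent:
    """Toy wrapper combining the 'control' tools above: monitoring (a sample of
    actions is logged), capability limiting (a finite action budget), and a kill
    switch. It never modifies the wrapped agent's policy."""

    def __init__(self, agent, action_budget=1000, monitor_rate=0.1):
        self.agent = agent                  # underlying policy, left untouched
        self.action_budget = action_budget  # capability limiter: max total actions
        self.monitor_rate = monitor_rate    # fraction of actions sampled for review
        self.killed = False                 # kill-switch state
        self.log = []                       # monitoring record

    def act(self, observation):
        if self.killed:
            raise RuntimeError("Kill switch engaged: no further actions.")
        if self.action_budget <= 0:
            raise RuntimeError("Capability limit reached: action budget exhausted.")
        action = self.agent.act(observation)      # query the policy as-is
        self.action_budget -= 1
        if random.random() < self.monitor_rate:   # monitoring: log a sample of actions
            self.log.append((observation, action))
        return action

    def kill_switch(self):
        self.killed = True                  # shut the agent down from the outside
```

The point of the sketch is that everything in the wrapper operates from the outside: it can observe, throttle, or stop the agent, but it never changes what the agent's policy would choose to do, which is exactly the line I'm trying to draw between "control" and "alignment".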

Implications

There are three implications I can think of if we delineate "control" from "alignment".

The first is that "control" tools are probably only viable before or around the time AGI systems reach human-level capabilities, but become useless once AGI systems have far surpassed humanity's capabilities (e.g. superintelligent AGI).

Secondly, it seems imperative that humanity invests in "control" tools[1] or in ways to speed up humanity's capabilities (e.g. increasing IQ, improving health, or increasing access to AI personal assistants)[2] before the emergence of human-level AGI, and especially before the emergence of superintelligent AGI. This could potentially buy humanity more time to solve alignment.

Finally, this points at an uncomfortable truth: given that "control" tools aren't likely to be helpful once AGI systems become superintelligent, and assuming we have bought enough time for alignment, we may have to accept a world with aligned AGIs that we have no real control over.


Appendix

Silly example: Monaco Grand Prix (Fast and Furious edition)

In an alternate reality, humanity has developed an engine that triples the acceleration and top speed of automobiles, but with a high risk of a powerful explosion if the car crashes.  

|  | Aligned | Misaligned |
| --- | --- | --- |
| In control | A driver who cares about winning, but also about ensuring they would not cause significant harm to bystanders and buildings. The organisers have installed extremely enhanced guardrails and a kill switch in the engine. | A driver who cares too much about winning and is willing to risk bystanders and buildings to win. The organisers have installed extremely enhanced guardrails and a kill switch in the engine. |
| Loss of control | A driver who cares about winning, but also about ensuring they would not cause significant harm to bystanders and buildings. However, the organisers have neglected to install any protection. | A driver who cares too much about winning and is willing to risk bystanders and buildings to win. However, the organisers have neglected to install any protection. |
  1. Besides ARC Evals, I'm not familiar with other organisations that have plans to produce such "control" tools.
  2. It's also plausible that increasing humanity's capacity might be a bad idea (e.g. creating superhuman soldiers, or speeding up AI capabilities rather than AI alignment by creating higher-IQ scientists).
