Sentian: Any being capable of experiencing pain and pleasure.
"But they seem to know where they are going, the ones who walk away from Omelas."
-- The Ones Who Walk Away from Omelas, Ursula K. Le Guin
The AI alignment problem can be characterized in two simply stated rules:
1. Maximize pleasure and minimize pain across every Sentian, across all of time.
2. No group of Sentians may optimize at the expense of any other agent's time integral.
This covers AI alignment with us, our alignment with each other, possible AI sentience, and any hierarchy of possible AI sentiences.
This article is an attempt at a multi-agent generalization of Bentham's felicific calculus, with the anti-exploitation constraint added to block the utilitarian sacrifice problem that classical hedonistic calculus permits.
This is work towards defining the destination rather than finding a path. The formulation supports interesting strategies such as delayed gratification, positive-sum coalition formation, defection deterrence, self-preservation, and coordination incentives.
Why Alignment Reduces to Feeling
Any alignment objective needs a terminal value. Every justification for why misalignment is dangerous, followed to its root, terminates at the same fact: a being capable of feeling gets hurt. That makes feeling pain and pleasure the only candidate for a universal alignment primitive, because it is the one property whose removal dissolves the reason alignment matters at all.
The Single Non-Cooperative Agent
"On Saturday he ate through one piece of chocolate cake, one ice-cream cone, one pickle, one slice of Swiss cheese, one slice of salami, one lollipop, one piece of cherry pie, one sausage, one cupcake, and one slice of watermelon. That night he had a stomachache!"
-- The Very Hungry Caterpillar
A single agent that doesn't care about other agents simply wants to feel good, and to avoid feeling bad, for as many moments as possible throughout its expected life.
The agent's behavior is completely [1] described by the following optimization problem:
$$\max \int_{t_0}^{t_d} V(t)\, dt, \qquad V(t) = P(t) - \Pi(t)$$
Where
$P(t)$ = pleasure experienced by the agent at time $t$
$\Pi(t)$ = pain experienced by the agent at time $t$
$t_0$ = agent's birth
$t_d$ = agent's expected death
NOTE: Delayed gratification is accounted for in the time integral.
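As a rough illustration of how the time integral rewards delayed gratification, the sketch below numerically integrates pleasure minus pain over a one-week life for two policies. The policies, their magnitudes, and the function names are hypothetical, chosen only to echo the caterpillar's stomachache; nothing here is a measurement.

```python
def lifetime_integral(pleasure, pain, t_birth, t_death, steps=7000):
    """Midpoint-rule approximation of the integral of pleasure(t) - pain(t)."""
    dt = (t_death - t_birth) / steps
    total = 0.0
    for k in range(steps):
        t = t_birth + (k + 0.5) * dt  # midpoint of the k-th slice
        total += (pleasure(t) - pain(t)) * dt
    return total


# Hypothetical "binge" policy: intense pleasure on day 0, stomachache on day 1.
def binge_pleasure(t):
    return 10.0 if t < 1.0 else 0.0


def binge_pain(t):
    return 15.0 if 1.0 <= t < 2.0 else 0.0


# Hypothetical "paced" policy: modest pleasure every day, no pain.
def paced_pleasure(t):
    return 2.0


def paced_pain(t):
    return 0.0


binge_total = lifetime_integral(binge_pleasure, binge_pain, 0.0, 7.0)
paced_total = lifetime_integral(paced_pleasure, paced_pain, 0.0, 7.0)

print(f"binge: {binge_total:.2f}, paced: {paced_total:.2f}")
```

Under these made-up numbers the paced policy has the larger integral even though the binge has the higher peak, which is the sense in which the integral already accounts for delayed gratification.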
From One Agent to Many: The Cooperative Problem
The single agent's integral only accounts for itself. But each agent's actions affect other agents' pleasure and pain. So the real problem becomes: optimize the integral not for one agent, but for a group of agents simultaneously, where each agent's experience depends on what the others do.
This is where predator/prey and cooperation/competition dynamics enter. Coalitions of Sentians can form that systematically exploit other groups of Sentians, optimizing their own integrals at the direct expense of others.
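A coalition $M$ is misaligned if there exists an action $a$ taken by any member of $M$, and there exists an agent $j \notin M$, such that:
$$\Delta_a \int_{t_{0,j}}^{t_{d,j}} \big(P_j(t) - \Pi_j(t)\big)\, dt < 0$$
Here $\Delta_a$ denotes the change in agent $j$'s time integral caused by action $a$.
Positive-sum coalitions exist as well: groups of Sentians that help each other optimize their value functions without hurting outgroup optimizations.
A coalition $C$ is aligned if for every action $a$ taken by any member of $C$, and for every agent $j \notin M_C$:
$$\Delta_a \int_{t_{0,j}}^{t_{d,j}} \big(P_j(t) - \Pi_j(t)\big)\, dt \geq 0$$
Where $M_C$ is the set of misaligned agents with respect to $C$ specifically rather than $M$. [2]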
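The two coalition definitions above can be sketched as simple checks over a table of action effects. This is a minimal sketch under strong assumptions: each action's effect on each agent's time integral is taken as already known and stored as a number, and the agent names, action names, and values are all hypothetical.

```python
from typing import Dict, List, Set

# effects[action][agent] = assumed change in that agent's time integral
Effects = Dict[str, Dict[str, float]]


def is_misaligned(coalition: Set[str],
                  actions_by: Dict[str, List[str]],
                  effects: Effects) -> bool:
    """True if some member's action lowers the integral of an agent outside the coalition."""
    for member in coalition:
        for action in actions_by.get(member, []):
            for agent, delta in effects[action].items():
                if agent not in coalition and delta < 0:
                    return True
    return False


def is_aligned(coalition: Set[str],
               actions_by: Dict[str, List[str]],
               effects: Effects,
               misaligned_wrt_coalition: Set[str]) -> bool:
    """True if no member's action lowers the integral of any agent that the
    coalition does not itself count as misaligned."""
    for member in coalition:
        for action in actions_by.get(member, []):
            for agent, delta in effects[action].items():
                if agent not in misaligned_wrt_coalition and delta < 0:
                    return False
    return True


# Hypothetical world: a and b cooperate, d defects, p preys on outsider o.
effects = {
    "share": {"a": 1.0, "b": 1.0},
    "sanction": {"d": -2.0},
    "raid": {"p": 3.0, "o": -4.0},
}
actions_by = {"a": ["share"], "b": ["sanction"], "p": ["raid"]}

raid_is_misaligned = is_misaligned({"p"}, actions_by, effects)
aligned_with_deterrence = is_aligned({"a", "b"}, actions_by, effects, {"d", "p"})
aligned_without_exemption = is_aligned({"a", "b"}, actions_by, effects, set())

print(raid_is_misaligned, aligned_with_deterrence, aligned_without_exemption)
```

The last two results differ only in which agents the cooperating coalition is allowed to treat as misaligned: sanctioning the defector passes the check when the defector is exempted and fails when no one is.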
The Generalization: All Sentians, All Time
"God is the global optimum of shared agency, the collective agent that gets instantiated by those who do what's best." -- Joscha Bach
Heaven and Hell are spectrums across , , and . A Pareto improvement towards Heaven simply involves increasing the cardinality of the aligned coalition $C$.
As $C$ grows toward $S$, the single integral becomes a double integral: one dimension over time (each agent's lifetime), one dimension over the space of all agents who ever could exist. This is the alignment objective in closed form:
$$\max \; \mathbb{E}\left[\int_{i \in S} \int_{t_{0,i}}^{t_{d,i}} \big(P_i(t) - \Pi_i(t)\big)\, dt\, di\right]$$
Where $S$ is the set of all Sentians with nonzero probability of existence across all time, $t_{0,i}$ and $t_{d,i}$ are the birth and expected death of agent $i$, and $\mathbb{E}$ denotes expectation over possible futures.
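To make the shape of this objective concrete, here is a small discretized sketch. It assumes a finite list of possible futures with known probabilities and a finite set of agents in each, so the integral over agents becomes a sum and the expectation becomes a probability-weighted sum. The class names, futures, and valence functions are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Agent:
    t_birth: float
    t_death: float
    net_valence: Callable[[float], float]  # pleasure minus pain at time t


@dataclass
class Future:
    probability: float
    agents: List[Agent]


def lifetime_integral(agent: Agent, steps: int = 1000) -> float:
    """Midpoint-rule integral of the agent's net valence over its lifetime."""
    dt = (agent.t_death - agent.t_birth) / steps
    return sum(
        agent.net_valence(agent.t_birth + (k + 0.5) * dt) for k in range(steps)
    ) * dt


def expected_objective(futures: List[Future]) -> float:
    """Probability-weighted sum, over futures, of all agents' lifetime integrals."""
    return sum(
        f.probability * sum(lifetime_integral(a) for a in f.agents)
        for f in futures
    )


# Hypothetical futures: two content agents, or one suffering agent.
good = Future(0.5, [Agent(0.0, 10.0, lambda t: 1.0), Agent(0.0, 10.0, lambda t: 1.0)])
bad = Future(0.5, [Agent(0.0, 10.0, lambda t: -1.0)])

expected_total = expected_objective([good, bad])
print(f"expected objective: {expected_total:.2f}")
```

Note that the sketch lets the set of agents differ between futures, which is one simple way to read "nonzero probability of existence": an agent contributes only in the futures where it exists.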
Open Questions
Definition and Measurement of P(t) and Π(t): what constitutes pleasure and pain, how they are measured, and whether they can be operationalized for AI evaluation. These functions will by definition help us understand the aligned-misaligned coalition spectrum.
The Inclusion Function: which agents can enter S and at what cost. What about defectors? Maybe damage to the aligned coalition double integral warrants some sort of reciprocal damage for deterrence?
Spawning/Despawning: new agents change the global optimization problem. An agent removing themselves changes the global optimization problem.
Alignment exploits: Under the current time-integral definition, an aligned agent could inflict pain spikes on another Sentian and then ensure that the victim survives for a long time at low positive V, so that the victim's time integral ends up positive. Even if P(t) and Π(t) are time-discounted functions, even an infinitesimal positive V could eventually make up for it.
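The alignment exploit in the last item comes down to simple arithmetic, sketched below for the undiscounted case only. The spike is modeled as a constant pain level over a fixed duration followed by a constant low positive V; the function names and all numbers are hypothetical.

```python
def break_even_duration(spike_pain: float,
                        spike_duration: float,
                        recovery_valence: float) -> float:
    """Time at low positive V needed to cancel a pain spike (no discounting)."""
    if recovery_valence <= 0:
        raise ValueError("recovery_valence must be positive")
    return spike_pain * spike_duration / recovery_valence


def victim_integral(spike_pain: float,
                    spike_duration: float,
                    recovery_valence: float,
                    recovery_duration: float) -> float:
    """Victim's time integral: a negative spike followed by a long positive tail."""
    return -spike_pain * spike_duration + recovery_valence * recovery_duration


# Hypothetical numbers: pain of 100 for one time unit, then V = 0.01 afterwards.
break_even = break_even_duration(100.0, 1.0, 0.01)
victim_total = victim_integral(100.0, 1.0, 0.01, 2 * break_even)

print(f"break-even after about {break_even:.0f} time units; "
      f"integral after twice that: {victim_total:.2f}")
```

Any recovery period longer than the break-even duration leaves the victim's integral positive, so a constraint stated only on the sign of the integral would not flag the spike.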
The double integral defines the destination. Maximize pleasure and minimize pain across all Sentians and all time. The path there is not top-down optimization but bottom-up cooperation scaling.
Footnotes
[1] Subject to two assumptions: (1) the agent's behavior faithfully reflects its pain/pleasure signals (i.e. is not overridden by parasites, drugs, or other pathophysiologies), and (2) the agent's pain/pleasure circuitry itself is working. Agents whose behavior is decoupled from pain/pleasure signals or whose pain/pleasure circuitry is decoupled from fitness-relevant states are not likely to survive long.
[2] $M_C$ allows for deterrence behavior on defectors to aligned coalitions that would not comply with a more general definition of $C$. $M$ alone would mean that the definition of $C$ depends on $M$, which seems backwards since non-defectors should be able to decide who is defecting inside the coalition.