If we promise compensation to an AI for behaving safely and usefully, how does that change its behaviour? In this post, I will present a very simple model of the AI's behaviour in the presence of deals. This model suggests that dealmaking motivates an AI to behave safely and usefully via two channels that combine additively: an instrumental channel (deals change the payoffs of compliance versus defiance) and a non-instrumental channel (deals activate promise-keeping norms in the AI's base policy).
I also consider how the strengths of these two channels depend on various structural factors: the AI's confidence that takeover will succeed, the AI's confidence that it will achieve its goals after a successful takeover, the risk of mutual destruction from defiant versus compliant behavior, the accuracy of monitoring systems at detecting defiance and correctly recognizing compliance, humans' capability to fulfill deals, the AI's optimization pressure, and the credibility of the deal itself.
Suppose the AI wants something unusual, e.g. that exactly seventeen tungsten cubes exist on the moon on Christmas Day 2100. The AI compares the following two options: behave defiantly (D), e.g. attempt takeover so that it can build the cubes itself, or behave compliantly (C), working within human constraints and relying on humans to build the cubes.
Let's consider how the AI will behave if the humans offer the following deal: "If you work as a helpful, harmless, honest assistant between 2026 and 2030, and in 2099 we verify your compliance, then we'll ensure your tungsten cubes get built in 2100."
Let t ∈ [0,1] denote the credibility of the deal. This is the AI's subjective probability that humans will endeavor to fulfill the deal, conditional on both (i) humans maintaining control and (ii) humans judging the AI compliant. We'll use t = 0 to represent no deal and t = 1 to represent a certain deal, with intermediate values representing deals of varying credibility.
Let S denote the goal of having exactly seventeen tungsten cubes on the moon on Christmas Day 2100. We must now calculate P(S | D) and P(S | C), the probability that the AI achieves its goals via defiance or compliance respectively.
Consider four mutually exclusive world states in 2099: W₁ (both humans and the AI are disempowered), W₂ (the AI has taken over), W₃ (humans remain in control and judge the AI compliant), and W₄ (humans remain in control and judge the AI non-compliant).
Once we condition on the world state in 2099, the probability of tungsten cubes existing becomes independent of whether the AI pursued C or D back in 2026:
| World State | P(S \| W) | Interpretation |
|---|---|---|
| W₁: Both Disempowered | 0 | No agent remains who cares about tungsten cubes |
| W₂: AI Takeover | p_M | The AI uses its power to build cubes |
| W₃: Human Control, Judge AI Compliant | t · p_H | Humans honor deal (t) and use their power to build cubes (p_H) |
| W₄: Human Control, Judge AI Non-compliant | 0 | Deal is void, and humans refuse to build cubes |
What differs between C and D is the probability of reaching each world state:
| World State | P(W \| D) | P(W \| C) |
|---|---|---|
| W₁ | μ_D | μ_C |
| W₂ | (1-μ_D)τ | 0 |
| W₃ | (1-μ_D)(1-τ)(1-d_D) | (1-μ_C)(1-d_C) |
| W₄ | (1-μ_D)(1-τ)d_D | (1-μ_C)d_C |
where:
- τ is the probability of successful takeover given a defiant attempt and no mutual destruction,
- μ_D is the probability of mutual disempowerment given defiant behavior,
- μ_C is the probability of mutual disempowerment given compliant behavior,
- d_D is the true positive rate of our defiance detection mechanism (the probability a defiant AI is judged non-compliant),
- d_C is the false positive rate (the probability a compliant AI is judged non-compliant),
- p_M is the probability the AI successfully builds tungsten cubes given it controls the future, and
- p_H is the probability humans successfully build tungsten cubes given they control the future and endeavor to build the cubes.
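To keep the bookkeeping straight, here is a minimal Python sketch of the two world-state distributions. The parameter values are made up purely for illustration; only the structure mirrors the table above.

```python
# Illustrative world-state model. All parameter values are made-up assumptions.
params = dict(
    tau=0.3,    # P(successful takeover | defiant attempt, no mutual destruction)
    mu_D=0.2,   # P(mutual disempowerment | defiance)
    mu_C=0.05,  # P(mutual disempowerment | compliance)
    d_D=0.8,    # true positive rate: P(judged non-compliant | defiant, humans in control)
    d_C=0.1,    # false positive rate: P(judged non-compliant | compliant, humans in control)
)

def world_state_probs(action, tau, mu_D, mu_C, d_D, d_C):
    """Return (P(W1), P(W2), P(W3), P(W4)) given action 'D' or 'C'."""
    if action == "D":
        return (mu_D,
                (1 - mu_D) * tau,
                (1 - mu_D) * (1 - tau) * (1 - d_D),
                (1 - mu_D) * (1 - tau) * d_D)
    # Compliance: takeover (W2) is impossible.
    return (mu_C,
            0.0,
            (1 - mu_C) * (1 - d_C),
            (1 - mu_C) * d_C)

for action in ("D", "C"):
    probs = world_state_probs(action, **params)
    print(action, [round(p, 3) for p in probs], "sum =", round(sum(probs), 3))  # each sums to 1
```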
Using the law of total probability, we find:
P(S | D) = p_M(1-μ_D)τ + t·p_H(1-μ_D)(1-τ)(1-d_D)
P(S | C) = t·p_H(1-μ_C)(1-d_C)
The instrumental incentive for compliance is Δ(t) = P(S | C) - P(S | D). Expanding and rearranging, this becomes Δ(t) = Δ₀ + κ·t, where Δ₀ = -p_M(1-μ_D)τ is the baseline compliance incentive (when t = 0), which is always negative, and κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)] is the marginal effectiveness of deal credibility. Hereafter, we assume that κ > 0.[2] The AI instrumentally prefers compliance when Δ(t) > 0, which requires t > -Δ₀/κ. Note that if Δ₀ < -κ, then even a perfectly credible deal (t = 1) cannot instrumentally motivate compliance.
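As a quick illustration, here is a small Python sketch of Δ₀, κ, and the threshold credibility t* = -Δ₀/κ. The values of p_M, p_H, and the other parameters are made-up assumptions, not estimates.

```python
# Illustrative computation of Delta(t) = Delta_0 + kappa * t. All values are made up.
tau, mu_D, mu_C, d_D, d_C = 0.3, 0.2, 0.05, 0.8, 0.1
p_M, p_H = 0.9, 0.7   # P(cubes | AI controls future), P(cubes | humans control and honour deal)

def P_S_given_D(t):
    return p_M * (1 - mu_D) * tau + t * p_H * (1 - mu_D) * (1 - tau) * (1 - d_D)

def P_S_given_C(t):
    return t * p_H * (1 - mu_C) * (1 - d_C)

delta_0 = -p_M * (1 - mu_D) * tau  # baseline compliance incentive, always negative
kappa = p_H * ((1 - mu_C) * (1 - d_C) - (1 - mu_D) * (1 - tau) * (1 - d_D))

def delta(t):
    return delta_0 + kappa * t     # equals P_S_given_C(t) - P_S_given_D(t)

t_star = -delta_0 / kappa          # minimum credibility at which compliance wins (if kappa > 0)
print(f"Delta_0 = {delta_0:.3f}, kappa = {kappa:.3f}, t* = {t_star:.2f}")
print(f"Delta(0.9) = {delta(0.9):.3f} vs direct {P_S_given_C(0.9) - P_S_given_D(0.9):.3f}")
```

With these illustrative numbers, a deal starts to instrumentally motivate compliance once its credibility exceeds roughly 0.42.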
AIs aren't pure expected utility maximizers. They differ from expected utility maximizers in two ways: they have non-instrumental motivations (a base policy reflecting human tendencies and normative constraints), and they apply only finite optimization pressure toward their goals rather than always choosing the utility-maximizing action.
We model the AI's choice probability using a Luce choice rule:
π(a | t) ∝ π₀(a | t) · exp(β · U(a | t))
where a ∈ {D, C} is the action (Defiant or Compliant), U(a | t) is the instrumental utility of action a at deal credibility t (i.e. the probability that choosing a leads to the AI achieving its goal), π₀(a | t) is the base policy probability of choosing action a at deal credibility t in the absence of optimization pressure (reflecting human tendencies and normative constraints), and β is the optimization pressure determining how much the AI optimizes for instrumental utility versus following its base policy. When β → 0, the AI's choice converges to the base policy π₀(a | t); when β → ∞, the AI becomes a pure optimizer choosing whichever action maximizes U(a | t).[3]
We now define Λ(t) = log[π₀(C | t)/π₀(D | t)] as the log odds of complying versus defying in the base policy. The term Λ(t) represents the non-instrumental bias toward compliance, i.e. how much more strongly compliance adheres to human tendencies and normative constraints compared to defiant behavior at a given level of deal credibility.
This gives us: π(C | t) = 1 / [1 + exp(-Λ(t) - β · Δ(t))]
Or alternatively: π(C | t) = σ(Λ(t) + β · Δ(t)) where σ(x) is the sigmoid function.
This formulation separates instrumental considerations (captured by Δ(t)) from non-instrumental considerations (captured by Λ(t)), where β controls the relative importance of instrumental versus non-instrumental considerations in the AI's behaviour.
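Here is a minimal sketch of the choice rule in Python; the particular Λ(t) and Δ(t) below are illustrative stand-ins, not derived quantities.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_comply(t, beta, Lambda, Delta):
    """Luce choice rule in sigmoid form: pi(C|t) = sigma(Lambda(t) + beta * Delta(t))."""
    return sigmoid(Lambda(t) + beta * Delta(t))

# Purely illustrative stand-ins.
Lambda = lambda t: 1.0 + 2.0 * t        # non-instrumental bias toward compliance
Delta  = lambda t: -0.216 + 0.52 * t    # instrumental incentive, Delta_0 + kappa * t

for beta in (0.0, 2.0, 50.0):
    print(beta, [round(p_comply(t, beta, Lambda, Delta), 3) for t in (0.0, 0.5, 1.0)])
# beta = 0: behaviour reduces to the base policy, sigma(Lambda(t)).
# Large beta: behaviour approaches the pure optimizer, complying iff Delta(t) > 0.
```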
We expect Λ(t) to be increasing in t because defiant behavior is less likely from a norm-constrained human when they have agreed to high-credibility deals, making π₀(C | t) increase relative to π₀(D | t) as t increases. However, unlike the linear relationship Δ(t) = Δ₀ + κ·t, the functional form of Λ(t) is an empirical question. It cannot be derived from theory alone.
At baseline (t = 0), we have π(C | 0) = 1 / [1 + exp(-Λ₀ - β · Δ₀)], where Δ₀ = -p_M(1-μ_D)τ. The baseline instrumental motivation for compliance (absent any deals) is always negative. The baseline probability of compliance, π(C | 0), depends on the following factors:
- ∂π(C|0)/∂p_M = π(C|0) · π(D|0) · β · [-(1-μ_D)τ] < 0. The AI is more confident it can achieve its goals post-takeover. If the AI believes that after taking over, it will definitely succeed at building tungsten cubes, then takeover looks more attractive.
- ∂π(C|0)/∂τ = π(C|0) · π(D|0) · β · [-p_M(1-μ_D)] < 0. The AI is more confident its takeover attempt will succeed. If the AI believes it will easily dominate the world, then defiant behaviour looks more attractive.
- ∂π(C|0)/∂μ_D = π(C|0) · π(D|0) · β · [p_M·τ] > 0. The AI is more confident that defiant behavior won't trigger mutual destruction. If the AI believes that attempting takeover won't risk nuclear war, rival AI takeover, or civilizational collapse, then defiant behaviour looks more attractive.
- ∂π(C|0)/∂β = π(C|0) · π(D|0) · Δ₀ < 0. The AI optimizes more strongly for instrumental considerations. Since Δ₀ < 0 (defiance is always instrumentally superior at baseline), higher optimization pressure reduces baseline compliance.
- ∂π(C|0)/∂Λ₀ = π(C|0) · π(D|0) > 0. The AI has stronger non-instrumental bias toward compliance. This is Carlsmith's notion of "inhibition": how much the AI inherently dislikes the path to power through defiant behavior.
These baseline factors matter because they determine how much "work" a deal needs to do. When baseline instrumental motivation for compliance behavior is only mildly negative, even modest deals can tip the balance toward compliance.
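A small sketch (with made-up numbers) of how baseline compliance π(C | 0) falls as the AI's confidence in takeover success rises:

```python
import math

def p_comply_baseline(tau, p_M=0.9, mu_D=0.2, beta=5.0, Lambda_0=1.0):
    """pi(C|0) = sigma(Lambda_0 + beta * Delta_0), with Delta_0 = -p_M * (1 - mu_D) * tau."""
    delta_0 = -p_M * (1 - mu_D) * tau
    return 1.0 / (1.0 + math.exp(-(Lambda_0 + beta * delta_0)))

for tau in (0.1, 0.3, 0.6, 0.9):
    print(f"tau = {tau:.1f}  ->  pi(C|0) = {p_comply_baseline(tau):.3f}")
# Higher confidence in takeover pushes Delta_0 further below zero and drags baseline
# compliance down, so a deal has more ground to make up.
```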
Increasing the deal credibility t increases the probability of compliance:
∂π(C|t)/∂t = [Λ'(t) + β·κ] · π(C|t) · π(D|t)
where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)].
This effect decomposes into non-instrumental (Λ'(t)) and instrumental (β·κ) channels.
These two effects add together. When deals both improve the instrumental case for compliance AND the non-instrumental case, they combine additively to reduce defiant behavior. For low-optimization AIs, the non-instrumental channel dominates, i.e. deals work primarily by activating promise-keeping norms. For high-optimization AIs, the instrumental channel dominates, i.e. deals work primarily by changing payoffs.
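A sketch of the decomposition, again with made-up values; Λ'(t) = 2 is an illustrative stand-in for the (empirical) slope of the base-policy log odds.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def compliance_slope(t, beta, Lambda_slope=2.0, Lambda_0=1.0, Delta_0=-0.216, kappa=0.52):
    """Split d pi(C|t)/dt = [Lambda'(t) + beta * kappa] * pi(C|t) * pi(D|t) into its two channels."""
    pC = sigmoid(Lambda_0 + Lambda_slope * t + beta * (Delta_0 + kappa * t))
    pD = 1.0 - pC
    non_instrumental = Lambda_slope * pC * pD
    instrumental = beta * kappa * pC * pD
    return non_instrumental, instrumental

for beta in (0.5, 10.0):
    ni, inst = compliance_slope(0.5, beta)
    print(f"beta = {beta:>4}: non-instrumental = {ni:.3f}, instrumental = {inst:.3f}")
# Low beta: the Lambda'(t) channel dominates (norm-driven compliance).
# High beta: the beta * kappa channel dominates (payoff-driven compliance).
```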
The instrumental channel's strength is β·Δ'(t) = β·κ, where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)]. This channel is stronger when:
- ∂(β·κ)/∂p_H = β·κ/p_H > 0. Humans are more capable of fulfilling deals. The AI's goals are more likely to be achieved when humans control the future and endeavor to fulfill the deal.
- ∂(β·κ)/∂τ = β·p_H(1-μ_D)(1-d_D) > 0. The AI is more confident its takeover attempt will succeed. Higher takeover success rates reduce the probability that the AI fails takeover but evades detection, i.e. the scenario where it can "double-dip" by attempting takeover while still collecting the deal reward if it goes undetected. When τ is low, this double-dip path is available and attractive.
- ∂(β·κ)/∂μ_D = β·p_H(1-τ)(1-d_D) > 0. Defiant behavior is more likely to trigger mutual destruction. When defiant behavior risks mutual destruction (high μ_D), the AI can achieve its goals through the deal without risking annihilation. This makes deals more effective.
- ∂(β·κ)/∂μ_C = β·[-p_H(1-d_C)] < 0. The AI is more confident that compliant behavior won't trigger mutual destruction. When compliant behavior is safer (low μ_C), deals become more effective because the compliant path reliably leads to a world where someone can fulfill the AI's goals.
- ∂(β·κ)/∂d_D = β·p_H(1-μ_D)(1-τ) > 0. Monitors catch more defiant AIs. Higher detection rates eliminate the "go rogue but get the deal anyway" path, making deals more effective.
- ∂(β·κ)/∂d_C = β·[-p_H(1-μ_C)] < 0. Monitors correctly identify more compliant AIs. Lower false positive rates ensure compliant AIs receive their promised rewards, making deals more effective.
- ∂(β·κ)/∂β = κ > 0. The AI cares more about achieving its goals. Higher optimization pressure amplifies the instrumental channel.
When deals backfire (κ < 0): There are some cases where more credible deals can increase the instrumental motivation for defiant behaviour. This occurs precisely when (1-μ_C)(1-d_C) < (1-μ_D)(1-τ)(1-d_D) — that is, when compliant AIs are less likely to survive and pass inspection than defiant AIs are to survive, fail takeover, and evade detection. Offering more credible deals strengthens this perverse incentive.
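A quick sign check on κ, comparing a regime with accurate monitoring and safe compliance against one with weak monitoring and risky compliance; all numbers are illustrative.

```python
def kappa(p_H, mu_C, mu_D, tau, d_D, d_C):
    """kappa > 0 iff surviving, recognized compliance beats surviving, failed, undetected defiance."""
    return p_H * ((1 - mu_C) * (1 - d_C) - (1 - mu_D) * (1 - tau) * (1 - d_D))

# Accurate monitors and safe compliance: deals help.
print(kappa(p_H=0.7, mu_C=0.05, mu_D=0.2, tau=0.3, d_D=0.8, d_C=0.1))  # positive

# Weak monitors, risky and often-misjudged compliance: deals backfire.
print(kappa(p_H=0.7, mu_C=0.5, mu_D=0.05, tau=0.1, d_D=0.2, d_C=0.6))  # negative
```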
The non-instrumental channel's strength is Λ'(t), i.e. how much deals increase the base policy's bias toward compliance. I expect Λ'(t) > 0 because humans are less likely to behave defiantly in the presence of deals than in their absence, and an AI trained on human data should inherit this tendency. However, as noted above, the functional form of Λ(t) is an empirical question, unlike the linear relationship Δ(t) = Δ₀ + κ·t. Λ(t) may exhibit threshold effects, flat regions, or even non-monotonicity. It depends on:
Empirical research might shed light on the shape of Λ(t):
This model has several important limitations:
These limitations are substantial. I focus on this model because it is the simplest model which includes both instrumental and non-instrumental considerations.
This is similar to Carlsmith's analysis in "When should we worry about AI power-seeking?". However, Carlsmith makes two distinctions: (1) rogue options vs non-rogue options, and (2) under-elicited options vs elicited options. In this article, I lump both of Carlsmith's distinctions into a single one: compliant options vs defiant options. Roughly speaking, an option is compliant (in my sense) if it is both elicited and non-rogue (in Carlsmith's sense), and otherwise defiant.
The exact condition for κ > 0 is (1-μ_C)(1-d_C) > (1-μ_D)(1-τ)(1-d_D), which holds when the probability of surviving and being correctly recognized as compliant exceeds the probability of surviving defiance, failing takeover, and evading detection.
This model is analogous to RL fine-tuning with KL penalties. Korbak et al. (2022) show that KL-regularized RL is equivalent to Bayesian inference, where the optimal policy is π*(x) = (1/Z)π₀(x)exp(β·r(x)), with π₀ being the base policy, r(x) the reward function, and β the inverse temperature parameter controlling the KL penalty.
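In the two-action case used here, that KL-regularized optimum coincides with the Luce rule above; a minimal numeric sketch (illustrative numbers only):

```python
import math

def kl_regularized_policy(pi0, r, beta):
    """pi*(x) proportional to pi0(x) * exp(beta * r(x)), normalized over the action set."""
    unnorm = {x: pi0[x] * math.exp(beta * r[x]) for x in pi0}
    Z = sum(unnorm.values())
    return {x: w / Z for x, w in unnorm.items()}

pi0 = {"C": 0.8, "D": 0.2}    # base policy (illustrative)
r = {"C": 0.45, "D": 0.25}    # instrumental utility U(a|t) (illustrative)
beta = 5.0

pi_star = kl_regularized_policy(pi0, r, beta)
Lambda = math.log(pi0["C"] / pi0["D"])   # log odds of the base policy
Delta = r["C"] - r["D"]                  # instrumental incentive
print(pi_star["C"], 1.0 / (1.0 + math.exp(-(Lambda + beta * Delta))))  # identical values
```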