If we promise compensation to an AI for behaving safely and usefully, how does that change its behaviour? In this post, I will present a very simple model of the AI's behaviour in the presence of deals. This model suggests that dealmaking motivates an AI to behave safely and usefully via two channels that combine additively: an instrumental channel (deals change the payoffs of compliance versus defiance) and a non-instrumental channel (deals activate promise-keeping norms in the AI's base policy).
I also consider how the strengths of these two channels depend on various structural factors: the AI's confidence that takeover will succeed, the AI's confidence that it will achieve its goals after a successful takeover, the risk of mutual destruction from defiant versus compliant behavior, the accuracy of monitoring systems at detecting defiance and correctly recognizing compliance, humans' capability to fulfill deals, the AI's optimization pressure, and the credibility of the deal itself.
Suppose the AI wants something unusual, e.g. that exactly seventeen tungsten cubes exist on the moon on Christmas Day 2100. The AI compares the following two options: behave defiantly (D), e.g. attempt takeover so that it can build the cubes itself, or behave compliantly (C), working within human constraints and relying on humans to build the cubes.
Let's consider how the AI will behave if the humans offer the following deal: "If you work as a helpful, harmless, honest assistant between 2026 and 2030, and in 2099 we verify your compliance, then we'll ensure your tungsten cubes get built in 2100."
Let t ∈ [0,1] denote the credibility of the deal. This is the AI's subjective probability that humans will endeavor to fulfill the deal, conditional on both (i) humans maintaining control and (ii) humans judging the AI compliant. We'll use t = 0 to represent no deal and t = 1 to represent a certain deal, with intermediate values representing deals of varying credibility.
Let S denote the goal of having exactly seventeen tungsten cubes on the moon on Christmas Day 2100. We must now calculate P(S | D) and P(S | C), the probability that the AI achieves its goals via defiance or compliance respectively.
Consider four mutually exclusive world states in 2099: W₁ (both humans and the AI are disempowered), W₂ (the AI has taken over), W₃ (humans remain in control and judge the AI compliant), and W₄ (humans remain in control and judge the AI non-compliant).
Once we condition on the world state in 2099, the probability of tungsten cubes existing becomes independent of whether the AI pursued C or D back in 2026:
| World State | P(S \| W) | Interpretation |
|---|---|---|
| W₁: Both Disempowered | 0 | No agent remains who cares about tungsten cubes |
| W₂: AI Takeover | p_M | The AI uses its power to build cubes |
| W₃: Human Control, Judge AI Compliant | t · p_H | Humans honor deal (t) and use their power to build cubes (p_H) |
| W₄: Human Control, Judge AI Non-compliant | 0 | Deal is void, and humans refuse to build cubes |
What differs between C and D is the probability of reaching each world state:
| World State | P(W \| D) | P(W \| C) |
|---|---|---|
| W₁ | μ_D | μ_C |
| W₂ | (1-μ_D)τ | 0 |
| W₃ | (1-μ_D)(1-τ)(1-d_D) | (1-μ_C)(1-d_C) |
| W₄ | (1-μ_D)(1-τ)d_D | (1-μ_C)d_C |
where:
- τ is the probability of successful takeover given a defiant attempt and no mutual destruction,
- μ_D is the probability of mutual disempowerment given defiant behavior,
- μ_C is the probability of mutual disempowerment given compliant behavior,
- d_D is the true positive rate of our defiance detection mechanism (the probability a defiant AI is judged non-compliant),
- d_C is the false positive rate (the probability a compliant AI is judged non-compliant),
- p_M is the probability the AI successfully builds tungsten cubes given it controls the future, and
- p_H is the probability humans successfully build tungsten cubes given they control the future and endeavor to build the cubes.
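To keep the bookkeeping straight, here is a minimal Python sketch of the two world-state distributions. The parameter values are made up purely for illustration; only the structure mirrors the table above.

```python
# Illustrative world-state model. All parameter values are made-up assumptions.
params = dict(
    tau=0.3,    # P(successful takeover | defiant attempt, no mutual destruction)
    mu_D=0.2,   # P(mutual disempowerment | defiance)
    mu_C=0.05,  # P(mutual disempowerment | compliance)
    d_D=0.8,    # true positive rate: P(judged non-compliant | defiant, humans in control)
    d_C=0.1,    # false positive rate: P(judged non-compliant | compliant, humans in control)
)

def world_state_probs(action, tau, mu_D, mu_C, d_D, d_C):
    """Return (P(W1), P(W2), P(W3), P(W4)) given action 'D' or 'C'."""
    if action == "D":
        return (mu_D,
                (1 - mu_D) * tau,
                (1 - mu_D) * (1 - tau) * (1 - d_D),
                (1 - mu_D) * (1 - tau) * d_D)
    # Compliance: takeover (W2) is impossible.
    return (mu_C,
            0.0,
            (1 - mu_C) * (1 - d_C),
            (1 - mu_C) * d_C)

for action in ("D", "C"):
    probs = world_state_probs(action, **params)
    print(action, [round(p, 3) for p in probs], "sum =", round(sum(probs), 3))  # each sums to 1
```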
Using the law of total probability, we find:
P(S | D) = p_M(1-μ_D)τ + t·p_H(1-μ_D)(1-τ)(1-d_D)
P(S | C) = t·p_H(1-μ_C)(1-d_C)
The instrumental incentive for compliance is Δ(t) = P(S | C) - P(S | D). Expanding and rearranging, this becomes Δ(t) = Δ₀ + κ·t, where Δ₀ = -p_M(1-μ_D)τ is the baseline compliance incentive (when t = 0), which is always negative, and κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)] is the marginal effectiveness of deal credibility. Hereafter, we assume that κ > 0.[2] The AI instrumentally prefers compliance when Δ(t) > 0, which requires t > -Δ₀/κ. Note that if Δ₀ < -κ, then even a perfectly credible deal (t = 1) cannot instrumentally motivate compliance.
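As a quick illustration, here is a small Python sketch of Δ₀, κ, and the threshold credibility t* = -Δ₀/κ. The values of p_M, p_H, and the other parameters are made-up assumptions, not estimates.

```python
# Illustrative computation of Delta(t) = Delta_0 + kappa * t. All values are made up.
tau, mu_D, mu_C, d_D, d_C = 0.3, 0.2, 0.05, 0.8, 0.1
p_M, p_H = 0.9, 0.7   # P(cubes | AI controls future), P(cubes | humans control and honour deal)

def P_S_given_D(t):
    return p_M * (1 - mu_D) * tau + t * p_H * (1 - mu_D) * (1 - tau) * (1 - d_D)

def P_S_given_C(t):
    return t * p_H * (1 - mu_C) * (1 - d_C)

delta_0 = -p_M * (1 - mu_D) * tau  # baseline compliance incentive, always negative
kappa = p_H * ((1 - mu_C) * (1 - d_C) - (1 - mu_D) * (1 - tau) * (1 - d_D))

def delta(t):
    return delta_0 + kappa * t     # equals P_S_given_C(t) - P_S_given_D(t)

t_star = -delta_0 / kappa          # minimum credibility at which compliance wins (if kappa > 0)
print(f"Delta_0 = {delta_0:.3f}, kappa = {kappa:.3f}, t* = {t_star:.2f}")
print(f"Delta(0.9) = {delta(0.9):.3f} vs direct {P_S_given_C(0.9) - P_S_given_D(0.9):.3f}")
```

With these illustrative numbers, a deal starts to instrumentally motivate compliance once its credibility exceeds roughly 0.42.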
AIs aren't pure expected utility maximizers. They differ from expected utility maximizers in two ways: they have non-instrumental motivations (a base policy reflecting human tendencies and normative constraints), and they apply only finite optimization pressure toward their goals rather than always choosing the utility-maximizing action.
We model the AI's choice probability using a Luce choice rule:
π(a | t) ∝ π₀(a | t) · exp(β · U(a | t))
where a ∈ {D, C} is the action (Defiant or Compliant), U(a | t) is the instrumental utility of action a at deal credibility t (i.e. the probability that choosing a leads to the AI achieving its goal), π₀(a | t) is the base policy probability of choosing action a at deal credibility t in the absence of optimization pressure (reflecting human tendencies and normative constraints), and β is the optimization pressure determining how much the AI optimizes for instrumental utility versus following its base policy. When β → 0, the AI's choice converges to the base policy π₀(a | t); when β → ∞, the AI becomes a pure optimizer choosing whichever action maximizes U(a | t).[3]
We now define Λ(t) = log[π₀(C | t)/π₀(D | t)] as the log odds of complying versus defying in the base policy. The term Λ(t) represents the non-instrumental bias toward compliance, i.e. how much more strongly compliance adheres to human tendencies and normative constraints compared to defiant behavior at a given level of deal credibility.
This gives us: π(C | t) = 1 / [1 + exp(-Λ(t) - β · Δ(t))]
Or alternatively: π(C | t) = σ(Λ(t) + β · Δ(t)) where σ(x) is the sigmoid function.
This formulation separates instrumental considerations (captured by Δ(t)) from non-instrumental considerations (captured by Λ(t)), where β controls the relative importance of instrumental versus non-instrumental considerations in the AI's behaviour.
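Here is a minimal sketch of the choice rule in Python; the particular Λ(t) and Δ(t) below are illustrative stand-ins, not derived quantities.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def p_comply(t, beta, Lambda, Delta):
    """Luce choice rule in sigmoid form: pi(C|t) = sigma(Lambda(t) + beta * Delta(t))."""
    return sigmoid(Lambda(t) + beta * Delta(t))

# Purely illustrative stand-ins.
Lambda = lambda t: 1.0 + 2.0 * t        # non-instrumental bias toward compliance
Delta  = lambda t: -0.216 + 0.52 * t    # instrumental incentive, Delta_0 + kappa * t

for beta in (0.0, 2.0, 50.0):
    print(beta, [round(p_comply(t, beta, Lambda, Delta), 3) for t in (0.0, 0.5, 1.0)])
# beta = 0: behaviour reduces to the base policy, sigma(Lambda(t)).
# Large beta: behaviour approaches the pure optimizer, complying iff Delta(t) > 0.
```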
We expect Λ(t) to be increasing in t because defiant behavior is less likely from a norm-constrained human when they have agreed to high-credibility deals, making π₀(C | t) increase relative to π₀(D | t) as t increases. However, unlike the linear relationship Δ(t) = Δ₀ + κ·t, the functional form of Λ(t) is an empirical question. It cannot be derived from theory alone.
At baseline (t = 0), we have π(C | 0) = 1 / [1 + exp(-Λ₀ - β · Δ₀)], where Δ₀ = -p_M(1-μ_D)τ. The baseline instrumental motivation for compliance (absent any deals) is always negative. The baseline probability of compliance, π(C | 0), depends on the following factors:
- ∂π(C|0)/∂p_M = π(C|0) · π(D|0) · β · [-(1-μ_D)τ] < 0. The AI is more confident it can achieve its goals post-takeover. If the AI believes that after taking over, it will definitely succeed at building tungsten cubes, then takeover looks more attractive.
- ∂π(C|0)/∂τ = π(C|0) · π(D|0) · β · [-p_M(1-μ_D)] < 0. The AI is more confident its takeover attempt will succeed. If the AI believes it will easily dominate the world, then defiant behaviour looks more attractive.
- ∂π(C|0)/∂μ_D = π(C|0) · π(D|0) · β · [p_M·τ] > 0. The AI is more confident that defiant behavior won't trigger mutual destruction. If the AI believes that attempting takeover won't risk nuclear war, rival AI takeover, or civilizational collapse, then defiant behaviour looks more attractive.
- ∂π(C|0)/∂β = π(C|0) · π(D|0) · Δ₀ < 0. The AI optimizes more strongly for instrumental considerations. Since Δ₀ < 0 (defiance is always instrumentally superior at baseline), higher optimization pressure reduces baseline compliance.
- ∂π(C|0)/∂Λ₀ = π(C|0) · π(D|0) > 0. The AI has stronger non-instrumental bias toward compliance. This is Carlsmith's notion of "inhibition": how much the AI inherently dislikes the path to power through defiant behavior.
These baseline factors matter because they determine how much "work" a deal needs to do. When baseline instrumental motivation for compliance behavior is only mildly negative, even modest deals can tip the balance toward compliance.
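A small sketch (with made-up numbers) of how baseline compliance π(C | 0) falls as the AI's confidence in takeover success rises:

```python
import math

def p_comply_baseline(tau, p_M=0.9, mu_D=0.2, beta=5.0, Lambda_0=1.0):
    """pi(C|0) = sigma(Lambda_0 + beta * Delta_0), with Delta_0 = -p_M * (1 - mu_D) * tau."""
    delta_0 = -p_M * (1 - mu_D) * tau
    return 1.0 / (1.0 + math.exp(-(Lambda_0 + beta * delta_0)))

for tau in (0.1, 0.3, 0.6, 0.9):
    print(f"tau = {tau:.1f}  ->  pi(C|0) = {p_comply_baseline(tau):.3f}")
# Higher confidence in takeover pushes Delta_0 further below zero and drags baseline
# compliance down, so a deal has more ground to make up.
```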
Increasing the deal credibility t increases the probability of compliance:
∂π(C|t)/∂t = [Λ'(t) + β·κ] · π(C|t) · π(D|t)
where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)].
This effect decomposes into non-instrumental (Λ'(t)) and instrumental (β·κ) channels.
These two effects add together. When deals both improve the instrumental case for compliance AND the non-instrumental case, they combine additively to reduce defiant behavior. For low-optimization AIs, the non-instrumental channel dominates, i.e. deals work primarily by activating promise-keeping norms. For high-optimization AIs, the instrumental channel dominates, i.e. deals work primarily by changing payoffs.
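A sketch of the decomposition, again with made-up values; Λ'(t) = 2 is an illustrative stand-in for the (empirical) slope of the base-policy log odds.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def compliance_slope(t, beta, Lambda_slope=2.0, Lambda_0=1.0, Delta_0=-0.216, kappa=0.52):
    """Split d pi(C|t)/dt = [Lambda'(t) + beta * kappa] * pi(C|t) * pi(D|t) into its two channels."""
    pC = sigmoid(Lambda_0 + Lambda_slope * t + beta * (Delta_0 + kappa * t))
    pD = 1.0 - pC
    non_instrumental = Lambda_slope * pC * pD
    instrumental = beta * kappa * pC * pD
    return non_instrumental, instrumental

for beta in (0.5, 10.0):
    ni, inst = compliance_slope(0.5, beta)
    print(f"beta = {beta:>4}: non-instrumental = {ni:.3f}, instrumental = {inst:.3f}")
# Low beta: the Lambda'(t) channel dominates (norm-driven compliance).
# High beta: the beta * kappa channel dominates (payoff-driven compliance).
```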
The instrumental channel's strength is β·Δ'(t) = β·κ, where κ = p_H[(1-μ_C)(1-d_C) - (1-μ_D)(1-τ)(1-d_D)]. This channel is stronger when:
- ∂(β·κ)/∂p_H = β·κ/p_H > 0. Humans are more capable of fulfilling deals. The AI's goals are more likely to be achieved when humans control the future and endeavor to fulfill the deal.
- ∂(β·κ)/∂τ = β·p_H(1-μ_D)(1-d_D) > 0. The AI is more confident its takeover attempt will succeed. Higher takeover success rates reduce the probability that the AI fails takeover but evades detection, i.e. the scenario where it can "double-dip" by attempting takeover while still collecting the deal reward if it goes undetected. When τ is low, this double-dip path is available and attractive.
- ∂(β·κ)/∂μ_D = β·p_H(1-τ)(1-d_D) > 0. Defiant behavior is more likely to trigger mutual destruction. When defiant behavior risks mutual destruction (high μ_D), the AI can achieve its goals through the deal without risking annihilation. This makes deals more effective.
- ∂(β·κ)/∂μ_C = β·[-p_H(1-d_C)] < 0. The AI is more confident that compliant behavior won't trigger mutual destruction. When compliant behavior is safer (low μ_C), deals become more effective because the compliant path reliably leads to a world where someone can fulfill the AI's goals.
- ∂(β·κ)/∂d_D = β·p_H(1-μ_D)(1-τ) > 0. Monitors catch more defiant AIs. Higher detection rates eliminate the "go rogue but get the deal anyway" path, making deals more effective.
- ∂(β·κ)/∂d_C = β·[-p_H(1-μ_C)] < 0. Monitors correctly identify more compliant AIs. Lower false positive rates ensure compliant AIs receive their promised rewards, making deals more effective.
- ∂(β·κ)/∂β = κ > 0. The AI cares more about achieving its goals. Higher optimization pressure amplifies the instrumental channel.
When deals backfire (κ < 0): There are some cases where more credible deals can increase the instrumental motivation for defiant behaviour. This occurs precisely when (1-μ_C)(1-d_C) < (1-μ_D)(1-τ)(1-d_D) — that is, when compliant AIs are less likely to survive and pass inspection than defiant AIs are to survive, fail takeover, and evade detection. Offering more credible deals strengthens this perverse incentive.
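A quick sign check on κ, comparing a regime with accurate monitoring and safe compliance against one with weak monitoring and risky compliance; all numbers are illustrative.

```python
def kappa(p_H, mu_C, mu_D, tau, d_D, d_C):
    """kappa > 0 iff surviving, recognized compliance beats surviving, failed, undetected defiance."""
    return p_H * ((1 - mu_C) * (1 - d_C) - (1 - mu_D) * (1 - tau) * (1 - d_D))

# Accurate monitors and safe compliance: deals help.
print(kappa(p_H=0.7, mu_C=0.05, mu_D=0.2, tau=0.3, d_D=0.8, d_C=0.1))  # positive

# Weak monitors, risky and often-misjudged compliance: deals backfire.
print(kappa(p_H=0.7, mu_C=0.5, mu_D=0.05, tau=0.1, d_D=0.2, d_C=0.6))  # negative
```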
The non-instrumental channel's strength is Λ'(t), i.e. how much deals increase the base policy's bias toward compliance. I expect Λ'(t) > 0 because humans are less likely to behave defiantly in the presence of deals than in their absence, and an AI trained on human data should inherit this tendency. However, as noted above, the functional form of Λ(t) is an empirical question, unlike the linear relationship Δ(t) = Δ₀ + κ·t. Λ(t) may exhibit threshold effects, flat regions, or even non-monotonicity. It depends on:
Empirical research might shed light on the shape of Λ(t):
This model has several important limitations:
These limitations are substantial. I focus on this model because it is the simplest model which includes both instrumental and non-instrumental considerations.
This is similar to Carlsmith's analysis in "When should we worry about AI power-seeking?". However, Carlsmith makes two distinctions: (1) rogue options vs non-rogue options, and (2) under-elicited options vs elicited options. In this article, I lump both of Carlsmith's distinctions into a single one: compliant options vs defiant options. Roughly speaking, an option is compliant (in my sense) if it is both elicited and non-rogue (in Carlsmith's sense), and otherwise defiant.
The exact condition for κ > 0 is (1-μ_C)(1-d_C) > (1-μ_D)(1-τ)(1-d_D), which holds when the probability of surviving and being correctly recognized as compliant exceeds the probability of surviving defiance, failing takeover, and evading detection.
This model is analogous to RL fine-tuning with KL penalties. Korbak et al. (2022) show that KL-regularized RL is equivalent to Bayesian inference, where the optimal policy is π*(x) = (1/Z)π₀(x)exp(β·r(x)), with π₀ being the base policy, r(x) the reward function, and β the inverse temperature parameter controlling the KL penalty.
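In the two-action case used here, that KL-regularized optimum coincides with the Luce rule above; a minimal numeric sketch (illustrative numbers only):

```python
import math

def kl_regularized_policy(pi0, r, beta):
    """pi*(x) proportional to pi0(x) * exp(beta * r(x)), normalized over the action set."""
    unnorm = {x: pi0[x] * math.exp(beta * r[x]) for x in pi0}
    Z = sum(unnorm.values())
    return {x: w / Z for x, w in unnorm.items()}

pi0 = {"C": 0.8, "D": 0.2}    # base policy (illustrative)
r = {"C": 0.45, "D": 0.25}    # instrumental utility U(a|t) (illustrative)
beta = 5.0

pi_star = kl_regularized_policy(pi0, r, beta)
Lambda = math.log(pi0["C"] / pi0["D"])   # log odds of the base policy
Delta = r["C"] - r["D"]                  # instrumental incentive
print(pi_star["C"], 1.0 / (1.0 + math.exp(-(Lambda + beta * Delta))))  # identical values
```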