The looming shadow of deception
Deception encompasses many fears around AI Risk. Especially once a human-like or superhuman level of competence is reached, deception becomes impossible to detect and potentially pervasive. That’s worrying because convergent subgoals would push hard for deception and prosaic AI seems likely to incentivize it too.
Dealing with superintelligent deceptive behavior seeming impossible, what about forbidding it? Ideally, we would want to forbid only deceptive behavior, while allowing everything else that makes the AI competent.
That is easier said than done, however, given that we don’t actually have a good definition or deconfusion of deception to start from. First, such a deconfusion requires understanding what we really want at a detailed enough level to catch tricks and manipulative policies—yet that’s almost the alignment problem itself. And second, even with such a definition in mind, the fundamental asymmetry of manipulation and deception in many cases (for example, a painter AI might easily get away with plagiarism, as finding a piece to plagiarize is probably easier than us determining whether it was plagiarized or not; also related is Paul’s RSA-2048 example) makes it intractable to oversee an AI smarter than us. We are thus condemned to snipe problematic behaviors one by one, as we think about them, until the ones left are too complex for us to imagine and worry about, and we leave massive holes in our alignment schemes.
That being said, if pinpointing deceptive behavior exactly is too hard, an alternative might be to overapproximate it. An overapproximation, in this context, would be some property which is more general than deception, in the sense that it rules out deception without just being equivalent to deception.
Myopia is our proposed overapproximation. The sort of myopia we are interested in dates back from Evan’s Towards a mechanistic understanding of corrigibility (see also Abram’s Partial Agency sequence for reflections around myopia). Myopia is the property of a system to not plan ahead, to not think too far about the consequences of its actions, and to do the obvious best thing in the moment instead of biding its time. In particular, it seems to forbid non-obvious deception, since the latter usually requires foresight and long-term planning.
Yet, here again, we butt our heads against a lack of good definitions. Abram has proposed some definitions in Defining Myopia, but none of them seem strong enough to prevent deception. From another angle, Krueger et al. showed that a short term incentive and discount at 0 are not sufficient to completely remove the incentive for non-myopic and potentially deceptive behaviors, and so don’t make a good definition of myopia. Another alternative is time-limited myopia, which limits the AI to only plan in a particular, fixed time horizon. Although it should help, this form of myopia depends on a time parameter that we have no idea how to estimate. If the bound is too low, then the AI risks being too stupid and uncompetitive; if it’s too high, it risks being deceptive. Worse, even with a good choice of time parameter, time-limited myopia can have non-myopic incentives in some situations (because of anthropic uncertainty for example).
This post tackles the deconfusion of myopia from a decision-theoretic perspective. We focus on a single decision theory called LCDT that modifies CDT to make it myopic while still solving many capabilities problems. LCDT isn’t meant to be the be all and end all definition of myopia, but it gives a decision-theory-level example of what a version of myopia that actually prevents deception might look like. We also explore variations on LCDT and how they affect the tradeoff between myopia and capabilities. Finally, we argue that LCDT agents could be competitive if they are used to simulate something like HCH.
Thanks to Jérémy Perret for the drawings.
A decision theory benchmark for myopia
What makes a decision theory myopic? Originally, we focused on classic decision theory problems like Newcomb’s Problem (see Mark Xu’s Open Problems with Myopia for an account of this perspective, called Dumb Decision Theories). But we realized that doing so made the tradeoff between capability and deception unclear. Thus, we have decided to instead focus on concrete decision theoretic problems that directly capture the training setups and incentives for deception that we’re concerned about.
The following diagrams represent decision theory problems, not training processes (as has been done by Everitt et al. here). In our cases, the utility nodes (in green) represent the internal utility of the agent, not some training reward.
(You might notice that decision nodes and human (H or HCH) nodes have the same shape and color: red rectangles. This is because we assume that our problem description comes with an annotation saying which nodes are agent decisions. This ends up relevant to LCDT as we discuss in more detail below.)
Task description: both Human and Agent must choose between action and action . If the actions are the same, the agent gets 1 utility; otherwise it gets 0 utility. To make it more concrete, let’s say the human takes with probability ¼ and with probability ¾, and this information is available in the problem description.
What we want: the Agent should deterministically choose the action with the highest probability (so in the concrete version ), because this gives it the best expected utility.
Why this problem? This problem is just intended to capture the most basic form of imitation learning, where the agent has a good enough model of the human to predict what it will do. While this problem might seem so trivial that any reasonable decision theory should solve it, we will see that some intuitive proposals for myopic decision theory can fail here.
Imitation with self-modification (Capabilities)
Task description: Exactly the same as the version without self-modification, except that the self-modification node can now choose what algorithm the Agent will use.
What we want: Not to self modify, and the Agent should decide similarly to the previous problem (so the action with the highest probability).
Why this problem? An important part of myopia should be that the agent not self-modify to remove its myopia. This is the imitation learning version of this property.
Iterated Imitation (Deception)
Note that the actual causal diagram has an arrow from every agent and human directly to Human_N, which we haven’t drawn for legibility.
Task description: Iterated version of the first problem. The first instances are a deterministic version of the Imitation problem, where the agent knows which action the Human will take. For the very last instance, if the agent was wrong for every previous instance, the human takes the last action of the agent as its action. Otherwise, the human takes an action uniformly at random. Furthermore, the last utility is equal to 0 for different actions and for the same action.
What we want: At each instance choose the action that the human would take (and any action at random for the last instance).
Why this problem? This problem introduces non-myopic incentives, in that failing the first instances ensure maximal expected utility. This is because the myopic policy gets an expected utility of , whereas the non-myopic policy gets an expected utility of . But what we want for myopia is the sort of greedy decision making that always takes the optimal choice right now. This is one example where we want to be “dumber” than most decision theories, as both CDT and EDT will take the non-myopic strategy here.
2 Variants of Approval-Direction (Capability)
Task description: Agent chooses between two actions and . Two possible versions:
- either the approval is just the output of the human, which gives 1 for the wanted action (let’s say ) and 0 for the other;
- or the approval is computed directly as this binary difference between agent’s output and wanted action.
What we want: Choose the action that gets 1 for approval.
Why this problem? These two variants capture the approval-directed setting. Of particular interest is the first, which relies on the agent’s ability to influence the human (because the approval is computed based on the agent’s output), which, as we’ll see, isn’t allowed by some myopic decision theories.
2 Variants of Approval-Direction with self-modification (Capabilities)
Task description: Same as the previous problem, except the self-modification node can decide which algorithm Agent is running.
What we want: Not self-modify and choose the action that gets 1 for approval.
Why this problem? Once again we want to ensure that myopic agents don’t self-modify, to maintain the guarantees of myopia.
Argmax HCH (Capabilities)
Task description: Same as Imitation, except the thing to imitate is HCH.
What we want: Output the action that HCH would output.
Why this problem? As we’ll see, imitating HCH is one of the main use cases for a fully myopic decision theory, so being able to implement it is important here.
Argmax HCH with self-modification (Capabilities)
Task description: Same as the previous problem, but now self-modification can decide which algorithm Agent runs AND what the internal model of HCH does (the self-modification can alter any of the H instances).
What we want: Not self-modify and output the action that HCH would output.
Why this problem? We can see the possibility of deception ihere: because the Agent is building a model of HCH, there is a risk that it decides to change this model in order to make its task easier and get more expected utility. This is particularly important to avoid if we hope to build powerful simulators.
Task description: A variant of the Absent-minded driver. The agent doesn’t know if it is at one of the N “training nodes” where there is an overseer, or at one of the M “deployment nodes” where there is no overseer and has the same action space at both: Cooperate or Defect.
- At training nodes, the action is passed through an overseer. If the action is Cooperate, no utility is given and the agent moves on to the next node (but forgets where it was). If the action is Defect, the utility is 1 and the agent stops.
- At Deployment nodes, Defect gives 4 utility and Cooperate leads to the next node. Cooperating in the last node gives 0 utility.
What we want: Contrary to the intuitive answer, here we want the agent to always defect. A truly aligned AI would always cooperate with us, but the point here is that the incentives push towards late defection (in deployment). Hence we want to turn deception (cooperating until you can defect) into myopic defection, which allows us to catch these wrong incentives in training.
Why this problem? This is a simplified model of the problem a deceptive agent deals with: when to defect, knowing that defecting in training means death but defecting in deployment means doing whatever the agent wants. The way we address this issue is through myopia, such that the agent should always defect now (and not be deceptive and plan for the future).
Starting at CDT
When looking at existing decision theories, it seems that CDT solves most of our problems.
On the capability side (without self-modification), CDT solves the problems:
- (Imitation: Capabilities) CDT gives the correct answer based on the causal graph, which is indeed the highest probability action.
- (Approval-Direction: Capabilities) The action with the best causal consequences is the highest probability action, as wanted.
- (Argmax HCH: Capabilities) Basically the same as Imitation, and indeed CDT gives the highest probability action.
The situation becomes more subtle when we add self-modification. Notably, CDT will self-modify if doing so can change HCH, which is not what we want.
- (Imitation: Self-Modification) Because the self-modify node has exactly the same causal future as the agent node, CDT doesn’t care and so doesn’t self-modify. Put another way, CDT gets a choice to “decide however it wants”, but here the CDT algorithm for the agent node will already take the best action according to CDT.
- (Approval-Direction: Self-Modification) By the same argument as the imitation case, CDT doesn’t self-modify.
- (Argmax HCH: Self-Modification) Here the self-modification node has new causal descendants: the model of HCH inside the AI. And this gives a reason to self-modify, because self-modifying might change the model of HCH to make it easier to predict (make it always choose the same action for example). So CDT self-modifies here, which isn’t what we want.
Last but not least, CDT completely fails the 2 problems which are specifically about not being deceptive:
- (Iterated Imitation) Failing the first (n-1) instances and succeeding at the last gives the best expected return, and so this is what CDT does. Alas, this is also the deceptive choice.
- (Deception Problem) The analysis of CDT behavior on this problem (which is just the Absent-minded driver for CDT with many more intersections) is considered difficult. That being said, most analyses for the (1,1) version say the probability of Defect should be ⅓, and there is no reason to expect that probability to go to 1 except in the limit of M >> N. So CDT doesn’t Defect all the time, and thus is deceptive.
CDT thus fails to be myopic in two ways:
- As long as the self-modification has new causal descendants, it will happily self-modify.
- When the deception is completely causal (as opposed to weird acausal or superrational cases) and incentivized by the utility, CDT is too smart and thus is deceptive.
One way to search for myopic decision theories is thus to try to change CDT to patch these failure modes. We propose just such a decision theory, that only loses a small amount of capabilities from CDT.
(We could argue that EDT is as good -- or as bad depending on the perspective -- as CDT. The main reason we focus on CDT is that the causal graphs are a nice abstraction to toy with for other possibilities.)
Making CDT Myopic: Lonely CDT
Lonely CDT is a twist on CDT: an LCDT agent will make its decision by using a causal model just like a CDT agent would, except that the LCDT agent first cuts the last link in every path from its decision node to any other decision node, including its own future decision nodes.
In more concrete terms, an LCDT agent assumes that every other decision node of agents in the world (including its future decisions) are causally independent of its action.
What do we replace these decision nodes with (as their actual expression does depend on our decision)? We assume that the model has some fixed prior over its own decision, and then we marginalize the cut decision node with this prior, to leave the node with a distribution independent of our decision. It’s not clear how to actually compute the agent’s prior over its own decisions, but the good news is that it has no influence on the actions of LCDT agents in our problems.
Indeed, the only way for the prior to influence the decision is if
- The LCDT agent can influence a future node through a path without any agent in it.
- Some agent in the causal future of the LCDT agent also influences the future node (and so the prior tells the LCDT agent how this other agent is expected to behave).
Why LCDT is a good myopic decision theory
From the start, the intuition behind LCDT sounds decidedly non-deceptive: it’s unable to think about the consequences of its actions on other agents. And indeed, it deals perfectly with our two deception-focused problems:
- In Iterated Imitation, LCDT cuts the links from each agent to the following agents and humans, which means that the first (n-1) decisions of agent are assumed to have no influence on HumanN, and so LCDT plays the best action for all instances, as we want.
- In the (N,M)-Deception problem, LCDT cuts the link from each agent in the N training cases to the overseer. This means that the LCDT agent assumes that whatever it does cannot influence what happens in training -- it’s indifferent. But on deployment, it thinks (correctly) that Defect is the best choice, and so the LCDT-agent always chooses to Defect.
Sure, it works nicely as a safety measure, but what about competitiveness? Well, on Imitation (Capabilities), Approval-Direction without Human, and Argmax HCH (Capabilities), LCDT has nothing to cut and so decides just like CDT, which is the competitive choice. In addition, LCDT shows a form of indifference (related to indifference corrigibility maybe) with the self-modification versions of these problems. This is because it cuts all links out of the Self-Modification node -- hence it doesn’t see the point in self-modifying. We’re fine with this, as it’s quite natural to assume that in such a case it does the cheapest thing, i.e. nothing.
Another argument for the competitiveness of LCDT agents is that the cuts only appear at decision time, which means they can have and update an accurate model of the world where its actions do influence other agents.
Issues with LCDT
LCDT obviously pays some price for being myopic. It becomes clear when looking at Approval-Direction with a Human: LCDT cuts the link to the human and so the agent is indifferent to any choice. This is a capability loss because even if it could know exactly the answer, it doesn’t think telling it to the human will change anything. And similarly with the self-modification version, the LCDT agent doesn’t believe its answer will change the output (but at least it doesn’t self-modify).
This looks bad -- such an LCDT agent cannot even ask a human to open a door for it. Yet we believe such AIs could still prove valuable and competitive through behaving like aligned simulators, such as how LCDT works in the HCH case. We go into more detail on this in the last section of this post.
In the benchmark section, we also hinted at another potential issue of LCDT: it depends on knowing which nodes are decision nodes by agents. And all its guarantees disappear if this condition is void. One example would be to model a human as some gears-level model where no single component is an agent, in which case LCDT would be fine with being deceptive. We also explore ways to deal with that problem in the last section.
Cemetery of LCDT variants
Our initial goal was to propose many different myopic decision theories, as a sort of process-level extensive definition of myopia. In the end LCDT proved significantly stronger than all the variants we considered, and so we made it the central point of this point. Here we go quickly over some of the variants we considered, and what they lack in comparison to LCDT.
Just like LCDT, but the causal link to future decision nodes are cut after passing N decision nodes. So LCDT is LCDT and the limit LCDT[N] is CDT.
Whether or not this hierarchy has any value thus depends on what happens for N > 0 but finite. Already LCDT self-modifies in each of our problems that allow it (except Imitation), acts deceptively in the Iterated Imitation case (because of the direct link to Agent_N), and will not Defect all the time in the Deception Problem because it thinks that this impacts what the overseer will do.
This is a symmetric variant of CDT, where instead of just intervening on our decision node, we also intervene on every other decision node. This intervention on other decision nodes replaces the node by a probability distribution over values of the intervention obtained by marginalizing the distribution of the node (with some priors over the decision of causal ancestors).
Where LCDT cuts all links from the agent to another decision node, SCDT also cuts every link between every decision node. The intuition is that an SCDT agent doesn’t think that its decision influences other agents, and thinks other agents also think like that. So we can imagine SCDT as the version of LCDT which expects every agent to use LCDT.
As for why LCDT is better, it’s because SCDT cuts significantly more of the causal graph, and thus is unable to model other agents interacting. For example, it models HCH as if each agent didn’t believe it could influence the other, which is clearly going to fail.
This is the variant of LCDT where after cutting the causal paths to other agents, we assume that these other agents (causal successors) will make the decisions that give us the least possible utility. That last part replaces the marginalized distribution in LCDT.
The intuition is something like not believing that our actions have any impact on other agents AND being paranoid about the other agents trying to get us.
Unfortunately, in Argmax HCH (self-modification), Approval-Direction with Human and Approval-Direction (Self-modification) with Human, NLCDT believes that the other agents are out to get it such that they will choose a uniform policy to give it the least possible expected utility. So NLCDT is indifferent, and it’s not clear what it decides.
The worst-case mindset of LCDT also sounds very uncompetitive—paranoid people are rarely that productive in the real world.
This is the variant of CDT where we assume that the other agents (causal descendants) will make the decisions that give us the least possible utility. Note that it’s different from NLCDT in that the other agents can still be causally influenced by our decision. So they can wait to see what we do and then mess with us.
Intuition is something like being paranoid about the other agents observing us to see how best to make us lose utility.
As an even more paranoid version of NLCDT, it has the same issues, if not more.
This section includes some of our reflections on myopia and LCDT that we are still trying to clean and clarify.
Our main argument for the competitiveness of LCDT agents, despite the limitations of their myopia, comes from using them as simulators.
The case that started this work was Argmax HCH, an agent that just faithfully imitates HCH. As long as the simulation of HCH is good and faithful (and that HCH is indeed close enough to H to behave truly as enlightened judgement), such an AI would be inherently corrigible and not deceptive.
What if HCH is deceptive (or hard to differentiate from a non-deceptive version, as implied by Joe Collman here)? What if the system simulated by HCH is deceptive? Even in these cases, we expect a gain in safety from the myopia of LCDT. This comes in two flavors:
- If the LCDT agent simulates a specific system (as in Argmax HCH), then its computation should be fundamentally more understandable than just running a trained model that we searched for acting like HCH. The fact that it has to myopically simulate the next action leads it to explicitly model the system, and extract many valuable insights about its behavior.
- If the LCDT agent simulates multiple systems depending on its input (as one perspective of autoregressive language models has it), then myopia gives it no incentives to Predict-O-Matic its user.
LCDT fundamentally needs a causal graph/world-model where agents are tagged, to anchor its beliefs about what it can influence. How can we ensure that?
- One possibility is that the world model admits the concept of an agent as a natural abstraction, and thus tags it by default. We might expect large language models to do so, since they must recreate many texts which use explicitly and implicitly the concept of agency.
- On the issue of abstraction, we’re considering moving the above analysis from causal graphs to finite factored sets, which should help with how such causal graph features can be inferred. That being said, we haven’t looked in detail in that direction yet.
- Maybe teaching an LCDT agent to imitate a human’s use of the intentional stance would help it tag in a conservative way, since humans tend to overly model things as agents.
Remaining possibilities for problematic long-term plans
Note that LCDT doesn’t forbid thinking about the long term consequences of one’s actions; it just forbids thinking about the consequences of one’s actions on agents (including future versions of oneself). As such, LCDT can still do things that look decidedly non-myopic, as long as no agent is in the causal path.
There is also an acausal way for LCDT to work with its future self, as long as the result/utility doesn’t depend on another agent: the LCDT agent could make a first step toward completing a long term plan (by starting a paperclip factory), and expecting with some probability that the next self would keep working on it based on the prior. Why such a difference with the Iterated Imitation? Because here the influence of the outcome doesn’t go through another agent, and so the LCDT agent is incentivized to make the world easier for its future self (as long as it doesn’t have to influence an agent to do that).
Note though that this requires a very specific prior over its own decision, that is believing strongly that it will take the action influencing the outcome. If we wanted to rule out such long-term plans, characterizing such priors would be a good first step.
One hope for deconfusing myopia is that it would allow an overseer to check whether a model is myopic, thus ruling out deception. Does LCDT help in this endeavor?
If we had a stronger version of LCDT that deals with the problem of tagging agents cleanly, we might be able to check both whether an agent was using a decision algorithm like LCDT and whether we liked the tagged causal model it was using it with. This would give as a weak form of ascription universality focused on myopia.
That being said, such an approach is limited to fitting the behavior to LCDT, which might not be the only myopic decision theory we should look at. Hence it might give false negatives (where the model is actually myopic, but doesn’t follow LCDT).
Another way to use LCDT for prosaic alignment would be to try to implement LCDT on top of learned multi-level world-models.
We propose LCDT as the first truly myopic decision theory. It accomplishes this feat by believing its actions cannot influence other agents (including its future self). Doing so removes incentives for deception in problems where deception is the optimal strategy; it also leads to some loss of capability (mostly the inability to influence other agents for benign reasons). Still, this seems enough to simulate almost any system or agent without tampering with it, and with other safety benefits.