May 08, 2012
This post describes a different (less agent-centric) way of looking at UDT-like decision theories that resolves some aspects of the long-standing technical problem of spurious moral arguments. It's only a half-baked idea, so there are currently a lot of loose ends.
UDT agents are usually considered as having a disinterested inference system (a "mathematical intuition module" in UDT and first order proof search in ADT) that plays a purely epistemic role, and preference-dependent decision rules that look for statements that characterize possible actions in terms of the utility value that the agent optimizes.
The statements (supplied by the inference system) used by agent's decision rules (to pick one of the many variants) have the form [(A=A1 => U=U1) and U<=U1]. Here, A is a symbol defined to be the actual action chosen by the agent, U is a similar symbol defined to be the actual value of world's utility, and A1 and U1 are some particular possible action and possible utility value. If the agent finds that this statement is provable, it performs action A1, thereby making A1 the actual action.
The use of this statement introduces the problem of spurious arguments: if A1 is a bad action, but for some reason it's still chosen, then [(A=A1 => U=U1) and U<=U1] is true, since utility value U will in that case be in fact U1, which justifies (by the decision rule) choosing the bad action A1. In usual cases, this problem results in the difficulty of proving that an agent will behave in the expected manner (i.e. won't choose a bad action), which is resolved by adding various compilicated clauses to its decision algorithm. But even worse, it turns out that if an agent is hapless enough to take seriously a (formally correct) proof of such a statement supplied by an enemy (or if its own inference system is malicious), it can be persuaded to take any action at all, irrespective of agent's own preferences.
Given that an inference system can overpower decision rules, causing an agent to follow a preference other than its own, perhaps decision-making should be happening inside the inference system in the first place, with agent only following its decisions. What does an inference system decide? Directly, it decides which theorems to have. The set of valid theorems follows deterministically from the axioms, but this is not really a problem, it's possible to make decisions in deterministic settings.
Suppose an inference system wants to decide whether it should have a theorem S. How does it evaluate the consequences of S being its theorem? It can assume that it proved S, see what that would imply, and if it likes the consequences (in comparison to the consequences of proving Not-S, for example), then it concludes S. Decision rules that an inference system follows are the axioms of the theory it works with, so this discussion suggests the following axiom schema (of moral axioms):
For all statements S and possible utility values u,
[(Prf(S) => U=u) and U<=u] => S is an axiom.
(This particular schema has a lot of problems, as discussed below, but seems adequate for communicating the general idea. [Edit: Stuart points out an even worse problem that makes these axioms break for any easily-provably-false S. Not sure what can be salvaged from this problem yet.]) A moral axiom from this schema states that if statement S being provable implies that the best possible utility gets realized, then that statement is declared to be valid.
Suppose that an agent has to choose an action A among possible actions 1 and 2, and wants to follow this theory's decisions. Then all it needs to do is pick a new propositional symbol B and establish the following decision rules:
Prf(B) => A=1
Prf(~B) => A=2
B remains otherwise undefined, its only effect is on our agent, or on definition of A. If it's true that, say, [A=1 => U=10], [A=2 => U=5], and also that in general [U<=10], then Prf(B) implies [U=10], which triggers the moral axiom for statement B and makes it valid/provable. As a result, the agent finds a proof of B and performs action 1.
This formulation is different from the usual ones in that the consequentialist loop is operated entirely from within an abstract formal system (i.e. not an algorithm). The formal system doesn't have an intended interpretation or a privileged agent (definition of an action) that would enact its decisions. Instead, it looks for all possible agents (actions, facts) that respond to its arguments (and affect its utility value), and supplies the arguments (theorems) according to how those agents respond to various hypothetical arguments. If there are multiple agents that have to be coordinated, that calls for proving a theorem that simultaneously establishes the strategies of all agents involved. And the agents could well use their own inference systems or proof search algorithms.
For an agent, such formal system plays a role of preference, it is an abstract computation that answers the questions about what should be done in each particular situation.
The axiom schema [(Prf(S) => U=u) and U<=u] => S is not adequate for many reasons. First, it's only capable of making knowably perfect decisions (which in particular requires utility value to have a reachable upper bound). Second, it introduces a different kind of spurious arguments that make the formal system inconsistent: once a statement S triggers its moral axiom, it follows that Prf(S), and so U=u, which triggers the other moral axioms all at once. This isn't necessarily too bad, since it's irrelevant what happens once utility value is optimal, but it also makes it harder to trigger moral axioms prior to making a decision.
For example, in Wei Dai's variant of coordination game, an agent is given indexical identification 1 or 2, and has to pick among actions A and B in such a way that its versions that observe 1 and 2 pick different actions. A natural way of setting up the agent using a consequentialist theory is to introduce propositional symbols T1 and T2, and establish decision rules
If I observe 1, Prf(T1) => action=A; Prf(~T1) => action=B
If I observe 2, Prf(T2) => action=A; Prf(~T2) => action=B
In this case, if either [T1 and ~T2] or [T2 and ~T1] is a theorem of the formal system, then the two versions of the agent (observing 1 and 2) will achieve the optimal utility value. The problem is that moral axioms for both theorems can be triggered, and if both do get triggered, then quickly absurdity is proved, which makes it hard to predict which actions the agents will actually perform, and what utility would follow from that. And if the formal system can't predict the effect of triggering its moral axioms on utility, it won't trigger the moral axioms, so it's unclear what would actually happen. Perhaps some different clever statement will get proved that would predictably lead to the agent choosing the right actions.
Another issue is that the moral axiom schema should probably only consider theorems of some special kind, and compare their consequences with those of specific other theorems (not just with an unconditional upper bound).
The main technical difference appears to be that instead of using moral arguments of the form [A=A1 => U=U1], this approach uses moral arguments of the form [Prf(A=A1) => U=U1]. As a result, proving A=A2 (for A2<>A1) no longer allows inferring a false antecedent, which in this case is ~Prf(A=A1), and so the usual path to spurious arguments is closed. Perhaps focusing on just this distinction might be more fruitful than paying attention to the surrounding philosophical bells and whistles.