On notation for modal UDT

Benya_Fallenstein

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Patrick thinks that the notation I've been using for talking about modal UDT is too hard to read. I agree that it's annoying that after defining a decision problem $\to P (\to a)$ and choosing a decision theory $\to T (\to u)$ ---say, $\to T (\to u) \equiv \to U D T (\to u)$ ---, I every time have to explicitly give separate names to the fixed points of the decision problem and the decision theory, for example by stating that $(\to A, \to U)$ are such that $G L ⊢ (\to A \leftrightarrow \to U D T (\to U)) \land (\to U \leftrightarrow \to P (\to A)) .$

In "Obstacle to modal optimality when you're being modalized", Patrick uses a pretty free-wheeling notation loosely inspired by the notation used in the modal agents paper. As Patrick notes in that post, I'm very much not sold on this notation. Partly, this is because various pieces of the notation don't seem very consistent to me. However, there's also a somewhat subtle problem with adopting notation that's too close to the modal agents paper, which I'm worried will lead to confusion; in this post, I'll explain that problem, and suggest some possible changes to modal UDT notation that might go part of the way towards the kind of notation Patrick prefers.

Comments appreciated!

Agents and universes

In the context of modal UDT, an agent is a sequence of formulas $\to A \equiv (A_{1}, \dots, A_{m})$ , such that the formula $A_{i}$ means that the agent takes action $i$ , and a universe is a sequence $\to U \equiv (U_{1}, \dots, U_{n})$ , where $U_{j}$ means that the universe returns the $j$ 'th-best outcome ( $1$ is best, $m$ is worst; this way, we can define modal UDT as looping over $j = 1$ to $m$ ).

When we talk about modal agents, we write $[X (Y) = C]$ for the formula saying that agent $X$ cooperates when playing against agent $Y$ . I wouldn't be averse to similarly writing, say, $[\to A = i]$ and $[\to U = j]$ for $A_{i}$ and $U_{j}$ , though I'm not sure whether we gain much from it. I wouldn't object too strongly to $[\to A () = i]$ and $[\to U () = j]$ , though it does look a bit ugly to me.

Orthogonally, I would be ok with naming actions and perhaps outcomes when this makes a decision problem easier to understand. For example, if an agent has to decide whether or not to press a certain button, I'd be happy to write that as $\to A \equiv (A_{press}, A_{d o n^{'} t p r e s s})$ . If we like, we can combine that with the above and write $[\to A = press]$ and $[\to A = d o n^{'} t p r e s s]$ , although the subscript notation feels pretty fine to me. I'm really not sold on using the symbol $\neg$ in the names of actions, though (e.g. " $\neg press$ "), as Patrick's post does; that seems bound to lead to confusion with the use of $\neg$ as negation in the logic.

I think it's useful to be able to talk about the sequences of formulas explicitly, so I'm not excited about using subscripts to distinguish different sequences, as in Patrick's ${\to U}_{1}$ vs. ${\to U}_{2}$ , since then I don't know how to name the individual items of these sequences. Instead, I suggest using superscripts in parantheses, e.g. ${\to U}^{(1)}$ for the first universe and $U_{3}^{(1)}$ for the formula which says whether the first universe leads to outcome $3$ . I've used this convention occasionally in my previous posts.

This is basically bikeshedding, though. Let's move on to the more substantial issue.

Modal agents vs. modal decision theory

In the modal agents work, we use the notation $[X (Y) = C]$ to mean that $X$ cooperates when it plays against $Y$ . On a formal level, each agent is represented by a one-variable formula in the language of arithmetic. To compute $X (Y)$ , we substitute the Gödel number of $Y$ 's formula for the variable in $X$ 's formula, i.e., $φ_{X} (┌ φ_{Y} ┐)$ . If the resulting sentence is true, $X$ cooperates with $Y$ , otherwise, it defects. Crucially, this means that $X$ can not only evaluate what $Y$ will do against $X$ ; it can, for example, evaluate what $Y$ would do against some other agent, such as $D e f e c t B o t$ or $F a i r B o t$ .

This is very different from the situation with modal decision problems and modal decision theories! A modal decision problem $\to P (\to a)$ takes a parameterless agent, which picks a single action, and a modal decision theory takes a parameterless universe, which returns a single outcome. For example, the decision theory is never passed the source code of the decision problem, just of the universe that is the fixed point of the problem with the theory. If we think of decision problems and decision theories as formulas in the language of arithmetic, then they're again formulas with one free variable, but the free variable in the decision theory doesn't get instantiated to the Gödel number of the decision problem, it gets instantiated to the Gödel number of the universe which is the fixed point of decision theory and decision problem, and similarly for the free variable in the formula representing the decision problem.

This is important! If the decision theory was passed the decision problem, it could make its decision using physical counterfactuals: It could go, "Which utility does the decision problem assign to the agent $c o n s t (1)$ ? To the agent $c o n s t (2)$ ? The agent $c o n s t (3)$ ? Ok, looks like $c o n s t (2)$ gets the best payoff; I'll take action $2$ , thank you." On extensional decision problems, this procedure would perform optimally. (Extensional decision problems are problems where an agent's payoff only depends on the action the agent decides to take, not on the computation the agent uses to arrive at this decision.) But this "optimal" procedure doesn't solve the challenge we're interested in solving: How do you make decisions given that you're just an ordinary part of the universe, not a specially marked thing that can be swapped out for an alternative agent?

So I really don't think we should write $\to T (\to P)$ or $\to P (\to T)$ to denote the agent and universe that are the fixed point of a decision theory with a decision problem. This makes it look like we're substituting $\to P$ for the free variable(s) inside $\to T$ , which is not what we're doing.

Possible notations

Here's a suggestion that seems reasonable to me, if not perfect: Suppose that we write $\to A (\to T, \to P)$ and $\to U (\to T, \to P)$ for the agent and universe that result from combining the decision theory $\to T$ with the decision problem $\to P$ ; that is, for the solution of $G L ⊢ (\to A (\to T, \to P) \leftrightarrow \to T (\to U (\to T, \to P))) \land (\to U (\to T, \to P) \leftrightarrow \to P (\to A (\to T, \to P)))$ If we combine this with the earlier notation, we can then, e.g., write $[\to A (\to T, \to P) = p r e s s]$ .

This is still pretty noisy, though. Another possibility would be to have some more syntaxy syntax for taking the fixed point of two sequences of formulas $\to P (\to a)$ and $\to T (\to u)$ ; perhaps define $\to T ⟨ \to P ⟩$ and $\to P ⟨ \to T ⟩$ to be sequences of formulas such that

$G L ⊢ (\to T ⟨ \to P ⟩ \leftrightarrow \to T (\to P ⟨ \to T ⟩)) \land (\to P ⟨ \to T ⟩ \leftrightarrow \to P (\to T ⟨ \to P ⟩)) ?$

Again, we could combine this with the previous notation to allow us to write, e.g., $[\to T ⟨ \to P ⟩ = p r e s s]$ .

If we adopt that notation, I'd feel better about seeing a decision problem notationally just as "a universe $\to U (\to a)$ with free variables" representing an agent's actions, and similarly a decision theory as "an agent $\to A (\to u)$ with free variables" representing a universe (getting closer to the notation Patrick would like to use). The reason I'd be more sold on this now is that we wouldn't have to use separate letters for the no-variable agent and universe that result from taking the fixed point, we'd just write $\to U ⟨ \to A ⟩$ and $\to A ⟨ \to U ⟩$ , and $[\to A ⟨ \to U ⟩ = p r e s s]$ . Again, $\to U ⟨ \to A ⟩$ and $\to A ⟨ \to U ⟩$ would represent sequences of formulas satisfying

$G L ⊢ (\to A ⟨ \to U ⟩ \leftrightarrow \to A (\to U ⟨ \to A ⟩)) \land (\to P ⟨ \to T ⟩ \leftrightarrow \to P (\to T ⟨ \to P ⟩)) .$

Thoughts?

LESSWRONG
LW

On notation for modal UDT

2

Ω 1

Agents and universes

Modal agents vs. modal decision theory

Possible notations

2

Ω 1