# Ω 4

Personal Blog
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Instead of directly asking "what decision theory should an agent run," consider the question, "given a description of an environment and a description of an agent, identify the best action available to that agent, regardless of its algorithm."

These are pretty much the same question, with slightly different framings. (If you answer the latter, then the answer to the former is "approximate the latter thing on the world model.") The latter question is more general, though, and it highlights different parts of the problem.

For example, it highlights questions about what counts as an agent, how the agent is identified in the environment, how the set of things the agent "could have done" is determined, how the agent's "observations" are identified, and so on. These parts of the question are where decision theory intersects with naturalized induction.

This angle puts the spurious proof problem in a new light. Instead of considering UDT as an algorithm which computes what its best available strategy is, consider a "global" version of UDT which takes any description of an environment and an agent within that environment, and computes the best strategy available to that agent.

Assume away the naturalized problems and say that there is a sorted list of outcomes (best first) and a set of available strategies (observation-to-action mappings). Now we can define GlobalUDT(E, A) as a function which takes a description E of the environment, a description A of the agent, and computes the best strategy available to that agent:

(I don't know how to use latex in code blocks, so braces are used as dequotes in strings: if x="ell" then "h{x}o" is the string "hello".)

def GlobalUDT(E, A):
for o in outcomes:
for s in strategies:
if PA proves "{A}()={s} implies {E}()={o}":
return s
return default_strategy


For a given universe U, Vladimir's proof-based UDT can be recovered by defining the following via quining:

UDT := GlobalUDT(quote(U), quote(UDT))


And, as it happens, GlobalUDT will always identify the actual best strategy available to UDT. (GlobalUDT never finds bad spurious proofs, because it iterates through outcomes in order; if there are any spurious proofs then they are spurious proofs that the best strategy is the best strategy.) But GlobalUDT does not identify the best strategy for agents in general!

Consider the following universe:

outcomes = [3, 2, 1]
strategies = {Hi, Med, Low}
A = lambda: Low
E = lambda: {Hi: 3, Med: 2, Low: 1}[A()]


That is, there are three options Hi, Med, Low corresponding to payoffs 3, 2, 1. The "agent" always selects the action Low. GlobalUDT("E", "A") either returns Hi or Med, whichever it considers first, because both A()=Hi and A()=Med are contradictory (and so imply E()=Hi).

This isn't all that surprising (what does it mean to ask what would have happened if (lambda: Low)() == Hi?), but it does distill the problem with using logical implications as counterfactuals a bit. GlobalUDT only identifies the right strategy on UDT because it happens that UDT is something in the environment where knowing what it does requires knowing that PA is consistent, and GlobalUDT doesn't know that.

(But a version of GlobalUDT using ZFC would prescribe bad strategies for a PA-based version of UDT in the environment.)

Moving beyond proof-based UDT requires a better theory of logical counterfactuals, and GlobalUDT provides a slightly different angle of approach than the one I'm used to. Intuitively, a satisfactory theory of logical counterfactuals should give us an algorithm that identifies Hi as the best strategy available to const Low. I'm actually not too convinced that such a thing is possible/meaningful (at least, not without context and better answers to questions like "what's an agent" and "what's an embedding"), but this makes it a bit clearer that when we talk about wanting a full theory of logical counterfactuals we're talking about defining a working version of GlobalUDT.

Food for thought.

# Ω 4

New Comment

It seems like what is needed is a logical uncertainty model that allows things that PA would consider contradictions (so we can reason about finite agents outputting something they don't actually output), and that is causal. Some of the models in in Paul's Non Omniscience, Probabilistic Inference, and Metamathematics paper allow contradictions in a satisfying way, but don't contain any causation.

It seems that we want "this agent outputs this" to cause "consequences of this decision happen". Suppose we create a Boolean variable for every logical proposition , and we want to arrange them in something like a causal Bayesian network. Since the consequences of an action are a logical consequence of the agent's output that can be seen within a small number of steps, perhaps we want to arrange the network so that the premises of an inference rule being true will cause their conclusion to be true (with high probability; we still want some probability of contradiction). But if we naively create causal arrows from the premises of an inference rule (such as modus ponens) to its conclusion (for example, allow and to jointly cause ) then we get cycles. I'm not sure if causation is well-defined in cyclic graphs, but if it's not then maybe there is a way to fix this by deleting some of the causal arrows?

Yeah, causation in logical uncertainty land would be nice. It wouldn't necessarily solve the whole problem, though. Consider the scenario

outcomes = [3, 2, 1, None]
strategies = {Hi, Med, Low}
A = lambda: Low
h = lambda: Hi
m = lambda: Med
l = lambda: Low
payoffs = {}
payoffs[h()] = 3
payoffs[m()] = 2
payoffs[l()] = 1
E = lambda: payoffs.get(A())


Now it's pretty unclear that (lambda: Low)()==Hi should logically cause E()=3.

When considering (lambda: Low)()==Hi, do we want to change l without A, A without l, or both? These correspond to answers None, 3, and 1 respectively.

Ideally, a causal-logic graph would be able to identify all three answers, depending on which question you're asking. (This actually gives an interesting perspective on whether or not CDT should cooperate with itself on a one-shot PD: it depends; do you think you "could" change one but not the other? The answer depends on the "could.") I don't think there's an objective sense in which any one of these is "correct," though.