Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is definitely a hack, but it seems to solve many problems around Cartesian Boundaries. Much of this is a development of earlier ideas about the Predict-O-Matic; see there if something is unclear.

Phylactery Decision Theory takes a Base Decision Theory (BDT) as an input and builds something around it, creating a new, modified decision theory. Its purpose is to give its base the ability to """learn""" its position in the world.

I'll start by explaining a model of it in a Cartesian context. Let's say we have an agent with a set of designated input and output channels. It makes its "decisions" like this: first, it has a probability distribution over everything, including the future values of the output channels, and updates it based on the input. There is then an automated mechanism which assigns the output bits, and it gives each output with exactly the probability that the agent assigned to it. The agent's prior includes something like the following: "The output bits will be like BDT(MyProbabilityDistribution, MyUtilityFunction, PossibleOutputs)". It is then easy to see that this belief is stable: since the agent believes it, the mechanism will set the output bits to be that way, and then the agent will observe it and notice that its beliefs were right. So far this is just a more complicated way to make a BDT agent. It's like a daemon inside an oracle, but on purpose.
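
To make the moving parts concrete, here is a minimal Python sketch of this Cartesian setup. Everything in it (the `bdt_choice` stand-in, the class and channel names) is hypothetical scaffolding rather than part of the proposal; it only illustrates why the belief "my outputs will be like BDT(...)" is stable once held.

```python
def bdt_choice(beliefs, utility, possible_outputs):
    # Stand-in for the Base Decision Theory: pick whichever output
    # has the highest utility according to the agent's current beliefs.
    return max(possible_outputs, key=lambda o: utility(o, beliefs))

class CartesianPhylacteryAgent:
    def __init__(self, utility, possible_outputs):
        self.utility = utility
        self.possible_outputs = possible_outputs
        self.beliefs = {}
        # Prior component: "the output bits will be like
        # BDT(MyProbabilityDistribution, MyUtilityFunction, PossibleOutputs)".
        self.predicted_output = None

    def observe(self, observation):
        # Update beliefs on the inputs, then rederive what the prior
        # hypothesis predicts about the output channel.
        self.beliefs.update(observation)
        self.predicted_output = bdt_choice(self.beliefs, self.utility,
                                           self.possible_outputs)
        return self.predicted_output

def output_mechanism(agent):
    # The automated mechanism: it realizes each output value with exactly
    # the probability the agent assigns to it. The prediction here is a
    # point mass, so the realized output simply equals the prediction,
    # and the agent's belief confirms itself.
    return agent.predicted_output

# Toy usage: an agent that prefers larger numbers on its one output channel.
agent = CartesianPhylacteryAgent(utility=lambda o, b: o,
                                 possible_outputs=[0, 1, 2, 3])
agent.observe({"sensor": "anything"})
assert output_mechanism(agent) == agent.predicted_output  # the belief is stable
```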

Descending to Possibility

Now consider a case where we don't know the set of output channels ahead of time. Instead, start with a set of things OP that we think might be output channels (which, if we are very uncertain, can just mean all events). The agent's prior will then consist of many versions of the one previously suggested, one for each subset of OP. We will start by assigning the bulk of the probability to the hypothesis that all potential outputs are like BDT(MyProbabilityDistribution, MyUtilityFunction, OP). Then we assign probability on the order of ε to each theory that all but one potential output are like BDT(MyProbabilityDistribution, MyUtilityFunction, OP\{O_i}). Similarly, the hypotheses that all but two are according to BDT get probabilities on the order of ε², and so on. Then, for ε sufficiently close to 0, the agent will eventually learn the "right" set of outputs, because for the others it sees that it can't control them.
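
As a rough illustration, the prior over subsets could be written down like this in Python. The candidate names, the exact weights, and the elimination rule are my own assumptions; the text above only fixes the weights up to their order in ε.

```python
from itertools import combinations

def subset_prior(candidate_outputs, eps):
    # One hypothesis per subset S of OP: "exactly the outputs in S follow
    # BDT", weighted by eps raised to the number of excluded candidates.
    hyps = []
    for k in range(len(candidate_outputs) + 1):
        for excluded in combinations(candidate_outputs, k):
            controlled = frozenset(o for o in candidate_outputs
                                   if o not in excluded)
            hyps.append((controlled, eps ** k))
    total = sum(w for _, w in hyps)
    return [(s, w / total) for s, w in hyps]

def eliminate(hypotheses, seen_uncontrollable):
    # Throw out any hypothesis that claims BDT-control over an output
    # the agent has observed failing to follow BDT's recommendation.
    kept = [(s, w) for s, w in hypotheses if not (s & seen_uncontrollable)]
    total = sum(w for _, w in kept)
    return [(s, w / total) for s, w in kept]

# Example: three candidate channels, one of which turns out not to be ours.
prior = subset_prior(["motor", "speaker", "weather"], eps=0.01)
posterior = eliminate(prior, seen_uncontrollable={"weather"})
best = max(posterior, key=lambda sw: sw[1])
# best[0] is now frozenset({"motor", "speaker"}), the "right" set of outputs.
```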

For the input channels, we can't entirely do away with labeling them, since our probability distribution needs to update on them, but we can try to avoid assuming that they are all the influence the world has on our internals. This starts with a simplification: instead of thinking explicitly that the output depends on our probability distribution, we start out giving the bulk of the probability to the outputs being like BDT(Truth, MyUtilityFunction, PossibleOutputs). Of course, the agent will still have to evaluate Truth with its own beliefs, but it now formally allows for the possibility of depending on facts whether or not they were known at the time, with them evaluated later, which we need if we want to find out what we can know. Then we give probability on the order of ε to something that's very close to that, and so on. The agent then converges on the closest thing (by the metric used to construct the descent) to actually maximizing utility that it can do with the information it really has.
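
A small sketch of that descent on the input side, in the same hypothetical Python terms as above: each hypothesis says "the outputs will be as if set by BDT given information source I", with I ranging from the full Truth down to coarser views, and the prior falling off in ε with distance from the Truth hypothesis. Treating "distance" as position in a supplied ordering is my own simplification.

```python
def information_descent_prior(information_sources, eps):
    # information_sources[0] stands for "Truth"; later entries give BDT
    # less and less to work with. Entry k indexes the hypothesis
    # "the outputs will be like BDT(information_sources[k],
    # MyUtilityFunction, PossibleOutputs)", with weight falling off as eps**k.
    weights = [eps ** k for k in range(len(information_sources))]
    total = sum(weights)
    return [(info, w / total)
            for info, w in zip(information_sources, weights)]

# Example ordering, from omniscience down to what the agent can really know.
prior = information_descent_prior(
    ["Truth", "Truth minus facts never observable to the agent", "inputs only"],
    eps=0.01)
```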

Interpretation

So it's not quite as hacky as it seems. If we combine learning both the input and output channels, our first hypothesis would be that we can control everything, and that it will be exactly as we want it to be. So, one way to describe what that program does to EDT is this (a toy code restatement follows the list):

  1. Start out believing in the best possible world
  2. When your theory is inconsistent with the evidence, throw it out and believe in the next best possible world instead
  3. Iterate
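
A toy restatement of that loop (the ordering and the consistency check are supplied from outside here; the agent itself never represents any of this explicitly):

```python
def converge_on_prophecy(worlds_best_first, consistent_with_evidence):
    # 1. Believe the best possible world.
    # 2. When the evidence contradicts it, throw it out and move on
    #    to the next best possible world.
    # 3. Iterate, ending on the best world that survives.
    for world in worlds_best_first:
        if consistent_with_evidence(world):
            return world
    return None  # nothing survived; won't happen with a rich enough list
```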

And this will converge on the best possible self-fulfilling prophecy. So far, that sounds sane. The problem is that it doesn't explicitly do that. It doesn't have a concept of "self-fulfilling prophecy". It doesn't even seem to know that its beliefs have any effect on the world - it processes evidence of that but doesn't represent it. And I don't know how to make a program that does it explicitly. So there's good reason to think that this will not be a robust solution - but I can't yet see how it fails, and it may contain fragments of an actual solution.

Comments

I feel like doing a better job of motivating why we should care about this specific problem might help get you more feedback.

If we want to alter a decision theory to learn its set of inputs and outputs, your proposal makes sense to me at first glance. But I'm not sure why I should particularly care, or why there is even a problem here that needs solving to begin with. The link you provide doesn't help me much after skimming it, and I (and I assume many people) almost never read something that requires me to read other posts without even a summary of the references. I made an exception today because I'm trying to give more feedback, and I feel that this specific piece of feedback might be useful for you.

Basically, I'm not sure what problem you're trying to solve by having this ability to learn your Cartesian boundary, and so I'm unable to judge how well you are solving it.

The link was meant to better illustrate how the proposed system works, not to provide motivation. So it seems that you understood the proposal and wouldn't have needed it.

I don't exactly want to learn the Cartesian boundary. A Cartesian agent believes that its input set fully screens off any other influence on its thinking, and that its outputs screen off any influence of its thinking on the world. It's very hard to find things that actually fulfill this. I explain how PDT can learn Cartesian boundaries, if there are any, as a sanity check / conservative-extension check. But it can also learn that it controls copies or predictions of itself, for example.

One problem with this is that it doesn't actually rank hypotheses by which is best (in expected utility terms), just how much control is implied. So it won't actually converge to the best self-fulfilling prophecy (which might involve less control).

Another problem with this is that it isn't clear how to form the hypothesis "I have control over X".

> Another problem with this is that it isn't clear how to form the hypothesis "I have control over X".

You don't. I'm using talk about control sometimes to describe what the agent is doing from the outside, but the hypotheses it believes all have a form like "The variables such and such will be as if they were set by BDT given such and such inputs".

> One problem with this is that it doesn't actually rank hypotheses by which is best (in expected utility terms), just how much control is implied.

For the first setup, where it's trying to learn what it has control over, that's true. But you can use any ordering of hypotheses for the descent, so we can just take "how good that world is" as our ordering. This is very fragile of course. If there are uncountably many great but unachievable worlds, we fail, and in any case we are paying for all this with performance on "ordinary learning". If this were running in a non-episodic environment, we would have to find a balance between having the probability of hypotheses decline according to goodness, and avoiding "optimistic Humean troll" hypotheses by considering complexity as well. It really seems like I ought to take "the active ingredient" of this method out, if I knew how.
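
(A minimal sketch of that balance, where `goodness`, `complexity`, `eps`, and `beta` are all assumed scoring functions and falloff rates of my own choosing, not anything specified above:)

```python
def graded_prior(hypotheses, goodness, complexity, eps, beta):
    # Rank hypotheses by how good the world they describe is, but also
    # discount by complexity, so that "optimistic Humean troll" hypotheses
    # don't dominate in a non-episodic environment.
    ranked = sorted(hypotheses, key=goodness, reverse=True)  # best world first
    weighted = [(h, (eps ** rank) * (beta ** complexity(h)))
                for rank, h in enumerate(ranked)]
    total = sum(w for _, w in weighted)
    return [(h, w / total) for h, w in weighted]
```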

> I'm using talk about control sometimes to describe what the agent is doing from the outside, but the hypotheses it believes all have a form like "The variables such and such will be as if they were set by BDT given such and such inputs".

Right, but then, are all other variables unchanged? Or are they influenced somehow? The obvious proposal is EDT -- assume influence goes with correlation. Another possible answer is "try all hypotheses about how things are influenced."

> Right, but then, are all other variables unchanged? Or are they influenced somehow? The obvious proposal is EDT -- assume influence goes with correlation.

I'm not sure why you think there would be a decision theory in that as well. Obviously, when BDT decides its output, it will have some theory about how its output nodes propagate. But the hypothesis as a whole doesn't think about influence. It's just a total probability distribution, and it includes that some things inside it are distributed according to BDT. It doesn't have beliefs about "if the output of BDT were different". If BDT implements a mixed strategy, it will have beliefs about what each option being enacted correlates with, but I don't see a problem if this doesn't track "real influence" (indeed, in the situations where this stuff is relevant it almost certainly won't) - it's not used in this role.