Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This post is a more minor one, which I'm putting up to reference in other posts.

Probabilities, weights, and expectations

You're an agent, with potential uncertainty over your reward function. You know you have to maximise

$$0.5 R_1 + 0.5 R_2,$$

where $R_1$ and $R_2$ are reward functions. What do you do?

Well, how do we interpret the $0.5$s? Are they probabilities for which reward function is right? Or are they weights, telling you the relative importance of each one? Well, in fact:

  • If you won't be learning any more information to help you distinguish between reward functions, then weights and probabilities play the same role.

Thus, if you don't expect to learn any more reward-function-relevant information, maximising reward given probabilities of $0.5$ on $R_1$ and $0.5$ on $R_2$ is the same as maximising the single reward function $0.5 R_1 + 0.5 R_2$.
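As an illustrative sketch (the reward functions and actions below are hypothetical, not from the post), here is a minimal Python check of that claim: with no further learning, reading the $0.5$s as probabilities over which reward function is right, or as weights defining the single reward $0.5 R_1 + 0.5 R_2$, picks out the same optimal action.

```python
# Toy illustration: with no further learning, probabilities and weights
# over reward functions lead to the same choice of action.

actions = ["a", "b", "c"]

# Hypothetical reward functions, chosen only for illustration.
R1 = {"a": 1.0, "b": 0.0, "c": 0.4}
R2 = {"a": 0.0, "b": 0.8, "c": 0.7}

# Reading (0.5, 0.5) as probabilities: a distribution over reward functions.
distribution = [(0.5, R1), (0.5, R2)]

def expected_reward(action):
    return sum(p * R[action] for p, R in distribution)

# Reading (0.5, 0.5) as weights: one fixed reward function, 0.5*R1 + 0.5*R2.
combined = {a: 0.5 * R1[a] + 0.5 * R2[a] for a in actions}

# Both readings rank the actions identically.
assert max(actions, key=expected_reward) == max(actions, key=combined.get)
```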

So, if we denote probabilities with $\mathbf{0.5}$ in bold, maximising the following (given no reward-function learning) are all equivalent:

$$\mathbf{0.5} R_1 + \mathbf{0.5} R_2, \quad \mathbf{0.5} R_1 + 0.5 R_2, \quad 0.5 R_1 + \mathbf{0.5} R_2, \quad 0.5 R_1 + 0.5 R_2.$$
Now, given a probability distribution $P$ over reward functions, we can take its expectation $\mathbb{E}[P]$. You can define this by talking about affine spaces and so on, but the simple version of it is: to take an expectation, rewrite every probability as a weight. So the result, for $\mathbf{0.5} R_1 + \mathbf{0.5} R_2$, becomes:

$$0.5 R_1 + 0.5 R_2.$$
  • If you won't be learning any more information to help you distinguish between reward functions, then distributions with the same expectation are equivalent.
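Here is a similarly hypothetical sketch of that point: rewriting every probability as a weight collapses a distribution over reward functions into a single reward function, and two different distributions with the same expectation collapse to the same one.

```python
# Toy illustration: distributions over reward functions with the same
# expectation collapse to the same single reward function.

states = ["s1", "s2"]

# Hypothetical reward functions, chosen only for illustration.
R1 = {"s1": 1.0, "s2": 0.0}
R2 = {"s1": 0.0, "s2": 1.0}
R_mid = {"s1": 0.5, "s2": 0.5}  # happens to equal 0.5*R1 + 0.5*R2

def expectation(distribution):
    """Rewrite each probability as a weight and sum the reward functions."""
    return {s: sum(p * R[s] for p, R in distribution) for s in states}

dist_a = [(0.5, R1), (0.5, R2)]  # uncertain between R1 and R2
dist_b = [(1.0, R_mid)]          # certain of the 'averaged' reward function

# Different distributions, same expectation: equivalent if there's no more learning.
assert expectation(dist_a) == expectation(dist_b)
```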

Expected evidence and unriggability

We've defined an unriggable learning process as one that respects conservation of expected evidence.

Now, conservation of expected evidence is about expectations. It basically says that, if $\pi_1$ and $\pi_2$ are two policies the agent could take, then for the probability distribution $P$,

$$\mathbb{E}[P \mid \pi_1] = \mathbb{E}[P \mid \pi_2].$$
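For intuition, here is a toy sketch (the learning process and its numbers are invented for illustration): a process whose expected posterior over $(R_1, R_2)$ depends on the policy followed violates this equation, and so is riggable.

```python
# Toy sketch: conservation of expected evidence requires the *expected* final
# distribution over reward functions to be the same under every policy.
# This invented learning process fails that test, so it is riggable.

# For each policy: a list of (probability of observation, posterior over (R1, R2)).
learning_process = {
    "ask":   [(0.5, (1.0, 0.0)),   # half the time, conclude R1 is right
              (0.5, (0.0, 1.0))],  # half the time, conclude R2 is right
    "nudge": [(1.0, (0.9, 0.1))],  # always end up nearly certain of R1
}

def expected_posterior(policy):
    """Expected final distribution over (R1, R2) under the given policy."""
    outcomes = learning_process[policy]
    p_R1 = sum(p_obs * post[0] for p_obs, post in outcomes)
    p_R2 = sum(p_obs * post[1] for p_obs, post in outcomes)
    return (p_R1, p_R2)

print(expected_posterior("ask"))    # (0.5, 0.5)
print(expected_posterior("nudge"))  # (0.9, 0.1) -- different, so riggable
```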
Suppose that $P$ is in fact riggable, and that we wanted to "correct" it to make it unriggable. Then we would want to add a correction term for any policy $\pi$. If we took $\pi_0$ as a "default" policy, we could add the following correction term to $P$ whenever the agent follows $\pi$:

$$\mathbb{E}[P \mid \pi_0] - \mathbb{E}[P \mid \pi].$$
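Spelling out why, using linearity of expectation and the correction term as written above: for any policy $\pi$, the expectation of the corrected object is

$$\mathbb{E}[P \mid \pi] + \big(\mathbb{E}[P \mid \pi_0] - \mathbb{E}[P \mid \pi]\big) = \mathbb{E}[P \mid \pi_0],$$

which no longer depends on $\pi$.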
This would have the required unriggability properties. But how do you add such a term to a probability distribution - and how do you subtract it?

But recall that unriggability only cares about expectations, and expectations treat probabilities as weights. Adding weighted reward functions is perfectly fine. Generally there will be multiple ways of doing this, mixing probabilities and weights.

For example, if $P = \mathbf{0.5} R_1 + \mathbf{0.5} R_2$ (probabilities in bold) and we have a weighted correction term to add to it, then we can map $P$ to

  1. the original bold probabilities, with the correction kept as a separate weighted term,
  2. new bold probabilities that absorb the correction, with no separate weighted term,
  3. the original bold probabilities over modified reward functions $R_1'$ and $R_2'$, with the correction absorbed into those functions,
  4. some mixture of adjusted probabilities, modified reward functions, and leftover weighted terms,
  5. and many other options...
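To make the multiplicity concrete, here is a small sketch with hypothetical numbers of my own (not the post's): suppose the correction term works out to $0.1 R_1 - 0.1 R_2$. Then each of the following representations, mixing probabilities and weights differently, collapses to the same expectation, $0.6 R_1 + 0.4 R_2$.

```python
from fractions import Fraction as F

# Illustrative only: several ways of writing "P plus a correction term",
# each a different mix of probabilities and weights, all with the same
# total coefficients (3/5 on R1, 2/5 on R2) once every probability is
# rewritten as a weight.
#   (prob_R1, prob_R2, weight_R1, weight_R2)
representations = [
    (F(1, 2), F(1, 2), F(1, 10), F(-1, 10)),     # original probabilities + pure-weight correction
    (F(3, 5), F(2, 5), F(0), F(0)),              # correction absorbed into the probabilities
    (F(11, 20), F(9, 20), F(1, 20), F(-1, 20)),  # a mix of the two
]

def total_coefficients(rep):
    """Rewrite probabilities as weights: total coefficient on R1 and on R2."""
    p1, p2, w1, w2 = rep
    return (p1 + w1, p2 + w2)

# Every representation collapses to the same expectation, 0.6*R1 + 0.4*R2.
assert all(total_coefficients(r) == (F(3, 5), F(2, 5)) for r in representations)
```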

This multiplicity of possibilities is what I was trying to deal with in my old post about reward function translations.


1 comment
> You know you have to maximise $0.5 R_1 + 0.5 R_2$

Can I have a little more detail on the setup? Is it a fair restatement to say: You're an agent, with a static reward function which you do not have direct access to. Omega (God, your creator, someone infallible and honest) has told you that 0.5R1 + 0.5R2 is reducible to your reward function, somehow, and you are not capable of experimenting or observing anything that would disambiguate this.

Now, as an actual person, I'd probably say "Fuck you, God, I'm running the experiment. I'll do something that generates different R1 and R2, measure my reward, and now I know my weighting."

In the case of an artificially-limited agent, who isn't permitted to actually update based on experience, you're right that it doesn't matter - probability _is_ weight for uncertain outcomes. But you have an unnecessary indirection with "respects conservation of expected evidence." You can just say "unable to update this belief".