Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Probabilities, weights, sums: pretty much the same for reward functions

20th May 2020

Comment by Dagon:


> You know you have to maximise 0.5R1+0.5R2

Can I have a little more detail on the setup? Is it a fair restatement to say: you're an agent, with a static reward function which you do not have direct access to. Omega (God, your creator, someone infallible and honest) has told you that 0.5R1 + 0.5R2 is reducible to your reward function, somehow, and you are not capable of experimenting or observing anything that would disambiguate this.

Now, as an actual person, I'd probably say "Fuck you, God, I'm running the experiment. I'll do something that generates different R1 and R2, measure my reward, and now I know my weighting."

In the case of an artificially-limited agent, who isn't permitted to actually update based on experience, you're right that it doesn't matter - probability _is_ weight for uncertain outcomes. But you have an unnecessary indirection with "respects conservation of expected evidence." You can just say "unable to update this belief".

This post is a more minor post, which I'm putting up to reference in other posts.

## Probabilities, weights, and expectations

You're an agent, with potential uncertainty over your reward function. You know you have to maximise

0.5R1+0.5R2

where R1 and R2 are reward functions. What do you do?

Well, how do we interpret the 0.5s? Are they probabilities, saying which reward function is likely to be the right one? Or are they weights, telling you the relative importance of each one? In fact, it makes no difference: if they are probabilities, then maximising expected reward means maximising 0.5R1+0.5R2, which is exactly the quantity you'd maximise if they were weights.

Thus, if you don't expect to learn any more reward-function-relevant information, maximising reward given P(R1)=P(R2)=0.5 is the same as maximising the single reward function R3=0.5R1+0.5R2.

So, if we denote probabilities in bold, maximising any of the following (given no reward-function learning) is equivalent:

- 0.5R1+0.5R2
- **1**(0.5R1+0.5R2)
- **0.25**R1+**0.25**R2+**0.5**(0.5R1+0.5R2)
- **0.5**(1.5R1−0.5R2)+**0.5**(1.5R2−0.5R1)
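The equivalence can be checked numerically. Here is a minimal sketch in plain Python, with R1 and R2 stood in for by made-up reward vectors over three states (the values are purely illustrative):

```python
# Hypothetical reward functions over three states (illustrative values only).
R1 = [1.0, 0.0, 2.0]
R2 = [0.0, 3.0, 1.0]

def scale(w, R):
    return [w * r for r in R]

def add(*Rs):
    return [sum(rs) for rs in zip(*Rs)]

# A mixture is a list of (probability, reward function) pairs; its
# expectation rewrites each probability as a weight and sums.
def expectation(mixture):
    return add(*[scale(p, R) for p, R in mixture])

target = add(scale(0.5, R1), scale(0.5, R2))  # plain weights: 0.5R1 + 0.5R2

mixtures = [
    [(1.0, add(scale(0.5, R1), scale(0.5, R2)))],
    [(0.25, R1), (0.25, R2), (0.5, add(scale(0.5, R1), scale(0.5, R2)))],
    [(0.5, add(scale(1.5, R1), scale(-0.5, R2))),
     (0.5, add(scale(1.5, R2), scale(-0.5, R1)))],
]

for m in mixtures:
    assert expectation(m) == target  # all collapse to 0.5R1 + 0.5R2
print("expectation of every mixture:", target)
```

Each mixture defines a different split into probabilities and weights, but an agent maximising any of them behaves identically.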

Now, given a probability distribution pR over reward functions, we can take its expectation E(pR). You can define this by talking about affine spaces and so on, but the simple version of it is:

to take an expectation, rewrite every probability as a weight. Applied to the examples above, every one of those expressions then becomes the single reward function 0.5R1+0.5R2.

## Expected evidence and unriggability

We've defined an unriggable learning process as one that respects conservation of expected evidence.

Now, conservation of expected evidence is about expectations. It basically says that, if π1 and π2 are two policies the agent could take, then for the probability distribution pR,

E(pR∣π1)=E(pR∣π2).
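As a toy illustration (my own construction, not from the post), a Bayesian learning process satisfies this equation whatever the policy, while a process that lets the policy bias the posterior does not:

```python
# Toy model: pR is a distribution over two reward functions,
# encoded as the pair (P(R1), P(R2)).

prior = (0.5, 0.5)

def expected_posterior(outcomes):
    """outcomes: list of (probability of outcome, posterior pair)."""
    p1 = sum(p * post[0] for p, post in outcomes)
    p2 = sum(p * post[1] for p, post in outcomes)
    return (p1, p2)

# Policy pi1: run an experiment that reveals the true reward function.
# Under Bayesian updating, each revelation occurs with its prior probability.
pi1 = [(0.5, (1.0, 0.0)), (0.5, (0.0, 1.0))]

# Policy pi2: gather no evidence; the posterior stays at the prior.
pi2 = [(1.0, prior)]

# Conservation of expected evidence: E(pR|pi1) = E(pR|pi2) = prior.
assert expected_posterior(pi1) == expected_posterior(pi2) == prior

# A riggable process: a policy that always ends up certain of R1
# shifts the expected posterior away from the prior.
rigged = [(1.0, (1.0, 0.0))]
assert expected_posterior(rigged) != prior
print("expected posterior under rigged policy:", expected_posterior(rigged))
```

The first two policies learn very different things, yet their expected posteriors agree; only the rigged process moves the expectation.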

Suppose that pR is in fact riggable, and that we wanted to "correct" it to make it unriggable. Then we would want to add a correction term for any policy π. If we took π0 as a "default" policy, we could add a correction term to pR∣π:

(pR∣π)→(pR∣π)−E(pR∣π)+E(pR∣π0).

This would have the required unriggability properties. But how do you add to a probability distribution - and how do you subtract from it?

But recall that unriggability only cares about expectations, and expectations treat probabilities as weights. Adding weighted reward functions is perfectly fine. Generally there will be multiple ways of doing this, mixing probabilities and weights.
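As a sketch of how the correction works on such mixtures (plain Python, with made-up reward vectors standing in for R1 and R2), the reward-function difference E(pR∣π0)−E(pR∣π) is simply added, as a weight, to every branch:

```python
# Illustrative reward vectors over three states (values are my own).
R1 = [1.0, 0.0, 2.0]
R2 = [0.0, 3.0, 1.0]

def scale(w, R):
    return [w * r for r in R]

def add(A, B):
    return [a + b for a, b in zip(A, B)]

def sub(A, B):
    return [a - b for a, b in zip(A, B)]

def expectation(mixture):
    """Expectation of a list of (probability, reward function) pairs."""
    out = [0.0] * len(mixture[0][1])
    for p, R in mixture:
        out = add(out, scale(p, R))
    return out

def correct(mixture, default_mixture):
    # (pR|pi) -> (pR|pi) - E(pR|pi) + E(pR|pi0), applied branch-wise.
    shift = sub(expectation(default_mixture), expectation(mixture))
    return [(p, add(R, shift)) for p, R in mixture]

default = [(0.75, sub(R1, R2)), (0.25, R2)]   # pR|pi0
riggable = [(0.5, R1), (0.5, R2)]             # pR|pi

corrected = correct(riggable, default)
# After correction, the expectation matches the default policy's.
assert expectation(corrected) == expectation(default)
print("corrected expectation:", expectation(corrected))
```

Because the same shift is added to every branch, the probabilities themselves are untouched; only the reward functions they point at change.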

For example, if (pR∣π)=**0.5**R1+**0.5**R2 and (pR∣π0)=**0.75**(R1−R2)+**0.25**R2 (so E(pR∣π)=0.5R1+0.5R2 and E(pR∣π0)=0.75R1−0.5R2), then we can map (pR∣π) to, for instance, **0.5**(1.25R1−R2)+**0.5**(0.25R1): the probabilities stay at 0.5, and the difference E(pR∣π0)−E(pR∣π)=0.25R1−R2 has been added to each reward function as a weight.
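Since unriggability only constrains the expectation, many different probability/weight decompositions serve equally well. A quick numerical check with made-up reward vectors (values purely illustrative):

```python
# Several different probability/weight decompositions, all with the
# same expectation 0.75R1 - 0.5R2.
R1 = [1.0, 0.0, 2.0]
R2 = [0.0, 3.0, 1.0]

def scale(w, R):
    return [w * r for r in R]

def combine(mixture):
    # Expectation: rewrite each probability as a weight and sum.
    out = [0.0] * len(mixture[0][1])
    for p, R in mixture:
        out = [o + p * r for o, r in zip(out, R)]
    return out

decompositions = [
    # pure weights: 0.75R1 - 0.5R2
    [(1.0, [0.75 * a - 0.5 * b for a, b in zip(R1, R2)])],
    # pR|pi0 itself: 0.75(R1 - R2) + 0.25 R2
    [(0.75, [a - b for a, b in zip(R1, R2)]), (0.25, R2)],
    # the corrected pR|pi: 0.5(1.25R1 - R2) + 0.5(0.25R1)
    [(0.5, [1.25 * a - b for a, b in zip(R1, R2)]),
     (0.5, scale(0.25, R1))],
]

expectations = [combine(d) for d in decompositions]
assert expectations[0] == expectations[1] == expectations[2]
print("shared expectation:", expectations[0])
```

Any of these could serve as the "corrected" (pR∣π); the formalism does not single one out.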

This multiplicity of possibilities is what I was trying to deal with in my old post about reward function translations.