# Not Deceiving the Evaluator

1 min read · 8th May 2019 · 29 comments


Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This is a construction of an agent for which I haven't identified a form of deception that the agent would obviously be incentivized to pursue.

Consider an agent and an evaluator. The agent sees past actions, observations, and rewards, and picks actions. The environment sees the same, and provides observations. The evaluator sees the same, and provides rewards.

A universal POMDP (without reward) is one that includes all computable countable-state POMDPs (without reward) as subgraphs. Let $W$ be a universal POMDP (without reward). ($W$ is for world.) Let $a_t$, $o_t$, and $r_t$ be the action, observation, and reward at timestep $t$. Let $h_{<t} = (a_1, o_1, r_1, \dots, a_{t-1}, o_{t-1}, r_{t-1})$. Let $\mathcal{S}$ be the set of states in $W$. Let $\mathcal{P}$ be the set of all computable prior distributions over $\mathcal{S}$. The agent believes the evaluator has a prior sampled from $\mathcal{P}$ over which state in $W$ everyone starts in. By "the agent believes", I mean that the agent has a nonzero prior over every prior in $\mathcal{P}$, and this is the agent's initial credence that the evaluator has that prior over initial world states.

The agent's beliefs are denoted $\mathbb{P}$, so for $p \in \mathcal{P}$, $\mathbb{P}(p \mid h_{<t})$ denotes the agent's posterior belief, after observing $h_{<t}$, that the evaluator began with the prior $p$ over $\mathcal{S}$ as to what the initial state was. Similarly, for $\overline{s}_{<t} \in \mathcal{S}^{t-1}$, $\mathbb{P}(\overline{s}_{<t} \mid h_{<t})$ denotes the agent's posterior belief, after observing $h_{<t}$, that it has traversed the sequence of states $\overline{s}_{<t}$. The overline indicates that we are not necessarily referring to the true sequence of states traversed.

Let $\mathcal{U}$ be the set of all computable utility functions mapping $\mathcal{S} \to [0, 1]$. For $u \in \mathcal{U}$, let $\mathbb{P}(u \mid h_{<t})$ denote the agent's posterior belief, after observing $h_{<t}$, that the evaluator has utility function $u$.

A policy $\pi$, an initial state $s_1$, a prior $p \in \mathcal{P}$, and a utility function $u \in \mathcal{U}$ induce a measure over interaction histories as follows. $a_t$ is sampled from $\pi(\cdot \mid h_{<t})$. $o_t$ is sampled from $W(\cdot \mid s_t, a_t)$. $s_{t+1}$ follows deterministically from $(s_t, a_t, o_t)$ according to $W$. $B_t$ is the belief distribution over $\mathcal{S}^t$ (which states have been visited so far) that follows from $p$ by Bayesian updating on $a_{\le t}$ and $o_{\le t}$. With $\overline{s}_{\le t}$ sampled from $B_t$, $r_t = \mathbb{E}_{\overline{s}_{\le t} \sim B_t}\left[u(\overline{s}_t)\right]$. Note that for human evaluators, the rewards will not actually be provided in this way; that would require us to write down our utility function, and sample from our belief distribution. However, the agent believes that this is how the evaluator produces rewards. Let $\mathbb{P}^{\pi}_{s_1, p, u}$ be this probability measure over infinite interaction histories $h_{<\infty}$ and state sequences $s_{<\infty}$.
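To make the evaluator's reward rule concrete, here is a minimal sketch in a toy finite POMDP. All states, actions, observations, and dynamics here are invented for illustration, not taken from the post: the point is only that the evaluator Bayes-updates a belief over the current state from the action-observation stream, and emits as reward its expected utility of that state.

```python
# Toy finite POMDP (illustrative names and dynamics, not the post's W).
STATES = ["left", "right"]

def transition(s, a, o):
    # The next state follows deterministically from (state, action, observation).
    return ("right" if s == "left" else "left") if a == "swap" else s

def obs_prob(o, s, a):
    # The "left" state mostly looks "hot"; the "right" state mostly "cold".
    p_hot = 0.8 if s == "left" else 0.2
    return p_hot if o == "hot" else 1.0 - p_hot

def update_belief(belief, a, o):
    # Bayesian update of the evaluator's belief over the current state.
    new = {s: 0.0 for s in STATES}
    for s, b in belief.items():
        new[transition(s, a, o)] += b * obs_prob(o, s, a)
    z = sum(new.values())
    return {s: b / z for s, b in new.items()}

def reward(belief, u):
    # The reward r_t: the evaluator's expected utility of the current state.
    return sum(b * u(s) for s, b in belief.items())

prior = {"left": 0.5, "right": 0.5}
posterior = update_belief(prior, "stay", "hot")  # "hot" is evidence for "left"
u = lambda s: 1.0 if s == "left" else 0.0
print(posterior["left"], reward(posterior, u))   # both come out at about 0.8
```

Note how more informative observations sharpen the evaluator's belief, which in turn makes the reward a tighter readout of $u$ — the mechanism the conjecture below leans on.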

Fixing a horizon $m$, the agent picks a policy at the beginning, and follows that policy:

$$\pi^* \in \operatorname*{argmax}_{\pi} \; \mathbb{E}_{(s_1, p, u) \sim \mathbb{P}} \; \mathbb{E}_{\mathbb{P}^{\pi}_{s_1, p, u}} \left[ u(s_m) \right]$$

ETA: We can write this in another way that is more cumbersome, but may be more intuitive to some:

$$\pi^* \in \operatorname*{argmax}_{a_1} \mathbb{E} \, \max_{a_2} \mathbb{E} \, \max_{a_3} \mathbb{E} \, \cdots$$

where the $\cdots$ unrolls the expectimax, with each $\cdots$ being replaced by $\max_{a_t} \mathbb{E}$, until finally, once $t$ reaches $m$, instead of $\max_{a_t} \mathbb{E}$, we write $u(s_m)$.
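The unrolled expectimax can likewise be sketched in code. This toy version again uses invented dynamics, and it simplifies the agent's uncertainty down to a joint belief over (current state, which utility function the evaluator has), ignoring the uncertainty over the evaluator's prior $p$. The structure is the point: a max over actions alternates with an expectation over observations, and the leaves are scored with $u(s_m)$, not with accumulated reward.

```python
# Toy expectimax unroll (illustrative; not the post's full construction).
STATES = ["left", "right"]
ACTIONS = ["stay", "swap"]
OBS = ["hot", "cold"]
UTILS = [lambda s: 1.0 if s == "left" else 0.0,   # hypothesis: evaluator likes "left"
         lambda s: 1.0 if s == "right" else 0.0]  # hypothesis: evaluator likes "right"

def transition(s, a, o):
    # The next state follows deterministically from (state, action, observation).
    return ("right" if s == "left" else "left") if a == "swap" else s

def obs_prob(o, s, a):
    p_hot = 0.8 if s == "left" else 0.2
    return p_hot if o == "hot" else 1.0 - p_hot

def expectimax(belief, steps_left):
    """max_a E_o max_a E_o ... u(s_m), over the agent's joint belief
    on (current state, index of the evaluator's utility function)."""
    if steps_left == 0:
        # Leaf: expected utility of the final state, not the reward stream.
        return sum(p * UTILS[ui](s) for (s, ui), p in belief.items())
    best = float("-inf")
    for a in ACTIONS:
        val = 0.0
        for o in OBS:
            prob_o = sum(p * obs_prob(o, s, a) for (s, ui), p in belief.items())
            if prob_o == 0.0:
                continue
            # Joint Bayesian update of the agent's belief on seeing (a, o).
            post = {}
            for (s, ui), p in belief.items():
                k = (transition(s, a, o), ui)
                post[k] = post.get(k, 0.0) + p * obs_prob(o, s, a) / prob_o
            val += prob_o * expectimax(post, steps_left - 1)
        best = max(best, val)
    return best
```

In the full construction the belief would also range over the evaluator's prior $p$, and the rewards in the history would feed the update on $u$; this sketch only shows the max/expectation alternation.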

Conjecture: the agent does not attempt to deceive the evaluator. The agent's utility depends on the state, not the reward, and when observations are more informative about the state, rewards are more informative about the utility function. Thus, the agent has an interest in taking actions that cause the evaluator to receive observations that reduce his uncertainty about which state they are in.