In this post I consider a single hypothetical that potentially has far-reaching implications for the future of AI development and deployment. It concerns a complex interaction between the assumption of which decision theory humans use and the method used to infer their values, such as an inverse reinforcement learning algorithm.
Consider Newcomb’s problem. We have two boxes, Box A and Box B. Box B always has $1,000. Box A has $1,000,000 if and only if a near-perfect predictor, Omega, predicts that the agent picks only Box A. We have two agents: agent1, who one-boxes (it’s an FDT agent), and agent2, who two-boxes (a CDT agent). In addition to Omega, there is an inverse reinforcement learner (later abbreviated as IRL) trying to infer the agent’s “values” from its behavior.
What kinds of reward signals does the IRL assume that agent1 or agent2 have? I claim that in the simplistic case of just looking at two possible actions, it will likely assume that agent1 values the lack of money, because it fails to pick up Box B. It will correctly deduce that agent2 values money.
In effect, a naïve IRL learner assumes CDT as the agent’s decision theory, and it will fail to adjust when it encounters more sophisticated agents (including humans).
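To make the claim concrete, here is a minimal Python sketch of two ways of scoring the same choice. The 0.99 accuracy for Omega is an assumption (the setup above only says "near perfect"), and the names payoff and expected_payoff are hypothetical helpers, not part of any IRL library. With Box A's contents held fixed, as a learner that implicitly assumes CDT would do, two-boxing is always worth exactly $1,000 more. Scored at the level of policies, one-boxing comes out far ahead.

```python
BOX_B = 1_000            # Box B always holds $1,000
BOX_A_FULL = 1_000_000   # Box A holds this iff Omega predicted one-boxing
ACCURACY = 0.99          # assumed; the setup only says "near perfect"


def payoff(action, box_a_full):
    """Money received for an action, given Box A's already-fixed contents."""
    box_a = BOX_A_FULL if box_a_full else 0
    return box_a + (BOX_B if action == "two-box" else 0)


def expected_payoff(action, accuracy):
    """Expected money when Omega predicts the action correctly with probability `accuracy`."""
    # Omega fills Box A only when it predicts one-boxing.
    p_full = accuracy if action == "one-box" else 1 - accuracy
    return p_full * payoff(action, True) + (1 - p_full) * payoff(action, False)


# Causal view: hold Box A fixed and compare the two actions in each state.
causal_gaps = [payoff("two-box", s) - payoff("one-box", s) for s in (True, False)]

# Policy view: compare the expected outcome of being each kind of agent.
one_box_ev = expected_payoff("one-box", ACCURACY)
two_box_ev = expected_payoff("two-box", ACCURACY)

print("two-box advantage per fixed state:", causal_gaps)
print("one-boxing has the higher expected payoff:", one_box_ev > two_box_ev)
```

A learner that only sees the per-state comparison has no way to explain agent1's choice except by concluding that agent1 does not want the extra $1,000.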
This depends a little bit on the setup of the IRL agent and the exact nature of the states fed into it. I am generally looking at the following setup of IRL: since we have a finite state and action space, the IRL learner simply tries to pick a hypothesis set of reward functions which place the highest value on the action taken by the agent compared to other actions.
This also depends on the exact definition of which “actions” we are considering. If we have potential actions of “pick one box” or “pick two boxes,” the IRL agent would think that agent1’s preferences are reversed from its actual preferences.
This is very bad, since even the opposite of the agent’s utility function is now in the hypothesis set.
If, for example, we have three actions of “pick one box”, “pick two boxes” or “do nothing,” then the preference of “pick one box” over “do nothing” removes the reverse of agent1’s reward function from the hypothesis set. It does not, however, put the reward function of “maximize money” into the hypothesis set.
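The two-action and three-action cases can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a real IRL algorithm: a "reward function" is reduced to a strict preference ordering over actions, the hypothesis set is every ordering that ranks the observed action first, and the function name hypothesis_set is hypothetical. "Maximize money" and "minimize money" are the orderings obtained by ranking actions by money with the box contents held fixed (and Box A assumed non-empty, so that one-boxing beats doing nothing).

```python
from itertools import permutations

ONE, TWO, NOTHING = "pick one box", "pick two boxes", "do nothing"


def hypothesis_set(actions, chosen):
    """All strict preference orderings (best first) that rank `chosen` on top."""
    return [order for order in permutations(actions) if order[0] == chosen]


# Orderings by money received, with box contents treated as fixed (assumption).
maximize_money_2 = (TWO, ONE)
minimize_money_2 = (ONE, TWO)
maximize_money_3 = (TWO, ONE, NOTHING)
minimize_money_3 = (NOTHING, ONE, TWO)

# agent1 is observed to pick one box in both setups.
two_action = hypothesis_set([ONE, TWO], chosen=ONE)
three_action = hypothesis_set([ONE, TWO, NOTHING], chosen=ONE)

print("2 actions, 'minimize money' survives:", minimize_money_2 in two_action)
print("2 actions, 'maximize money' survives:", maximize_money_2 in two_action)
print("3 actions, 'minimize money' survives:", minimize_money_3 in three_action)
print("3 actions, 'maximize money' survives:", maximize_money_3 in three_action)
```

In this toy model, adding “do nothing” rules out the fully reversed ordering, but the ordering that corresponds to agent1’s actual goal is still excluded, because any ordering that ranks “pick two boxes” first is inconsistent with the observed choice.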
I can think of several objections to, or fixes for, the problem, but I don’t think they make it easily resolvable.
a) Agent1 actually makes more money in the long run.
This is true, but we are not comparing the decision of agent0 to become agent1 or agent2, but rather the decision of agent1 to take or not take the second box. Also, if agents face these kinds of decisions, it’s hard for them to appear consistent from the IRL’s point of view.
b) IRL could incorporate the decision theory of the agent into its estimates of the agent’s reward function.
This might happen, but it’s a hard problem that is rarely called out, since it requires learning the decision theory of an agent. If you assume every agent is using FDT, this still gives the wrong answer if they are not.
c) IRL could try to use only simple hypotheticals. It’s unclear if these are enough to infer the complexity of value.
d) IRL could use the agent’s internal evaluations to better infer its true reward function. This becomes too circular, since we are trying to learn true utility, which is highly related to its reward function.
e) IRL could average over more situations. This would work only if Newcomb’s problems don’t dominate the decision landscape.
f) IRL could simulate “the decision to become the type of agent that one-boxes” instead of “decision to one-box”. Once again, this feels like assuming FDT.
So, we have a problem of inverse reinforcement learning being dependent both on the decision theory of the agent in question and on whether the set of actions includes a default “do nothing” action.
This seems like a non-trivial problem, but there are other complications that arise as well.
The major meta-complication with this setup is the following. If agent1 is truly an FDT agent *and* it knows that the IRL exists, it would act differently depending on the exact amount of value that the IRL can allocate to it later. If that amount of value dwarfs the amount it would get from Newcomb’s problems, it would start two-boxing. Thus, if there is a naïve IRL agent which is also an AGI, its potential presence could change people’s behavior to be less cooperative with each other.
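The threshold in this argument can be written out as a rough sketch. Everything here is an illustrative assumption rather than a model from the literature: the function name fdt_choice is hypothetical, newcomb_gain stands for the expected advantage of one-boxing over two-boxing, irl_value stands for the value the IRL later allocates to an agent it reads as valuing money, and an agent the IRL misreads is assumed to receive nothing useful from it.

```python
def fdt_choice(newcomb_gain, irl_value):
    """Action chosen by a policy-level agent that knows a naive IRL is watching.

    Assumptions (illustrative only):
    - one-boxing earns `newcomb_gain` more from Omega than two-boxing does,
      but the IRL misreads it and allocates nothing useful afterwards;
    - two-boxing forgoes that gain, but the IRL correctly infers that the
      agent values money and later allocates `irl_value` to it.
    """
    one_box_total = newcomb_gain
    two_box_total = irl_value
    return "one-box" if one_box_total >= two_box_total else "two-box"


# Small later allocation: the Newcomb gain dominates.
print(fdt_choice(newcomb_gain=979_000, irl_value=10_000))
# Later allocation dwarfs the Newcomb gain: the agent starts two-boxing.
print(fdt_choice(newcomb_gain=979_000, irl_value=10**9))
```

The specific numbers are placeholders. The point is only that the switch happens once the value controlled by the IRL exceeds what one-boxing earns.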
The other complication, which is meta relative to the previous one, is that the IRL’s learning system has an effect on the agents it is observing. This creates strange incentives for the IRL itself. It might suddenly become aware that its own value learning system is causing problems in the world according to either people’s judgement or a certain initial utility function. Depending on how closely it is supposed to guard its value system, it could try to remove it. In either case, the AGI’s loop of understanding its own potentially negative impact through simulating humans creates an inherent instability and self-tension.