Vulnerabilities in CDT and TI-unaware agents

by PabloAMC, Davide_Zagami, Chris_Leong · 3 min read · 10th Mar 2020 · 1 comment



Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The aim of this post is to illustrate the need to take decision-theoretic and incentive considerations into account when designing agents, and to argue that these considerations matter for ensuring the safety of agents. We also postulate that there exist agents that are robust both to changing their own reward function and to having it changed by others, although building them will require careful incentive design and a careful choice of decision theory.

The first agent we will consider is a current-RF, TI-aware agent (current Reward Function, Time-Inconsistency aware; see the second half of the post if you don't know what this means) that uses Causal Decision Theory (CDT). A review of different decision theories can be found in this post. It is well known that Updateless Decision Theory (UDT) was created to correct the wrong decisions a CDT agent makes when faced with Newcomb-like problems. Thus, the question we aim to answer is whether we can exploit this faulty decision procedure to induce changes in the value function of such an agent.

This is not exactly trivial, since the agent could place a very negative value on having its value function changed and thus opt out of games such as Newcomb's problem. The example I propose is a modified version of the Prisoner's Dilemma (in which CDT is known to defect). Suppose the following problem:

It is the year 2100 and Elon Musk has managed to effectively colonise Mars. Meanwhile, an AI corporation called Causal Corp has deployed many CDT agents on both Earth and Mars. One day Eve, known for being evil, codes a virus that, if connected to one such CDT agent, would arbitrarily modify its reward function. Eve makes two copies of the virus on memory sticks and sends them so that they arrive almost "simultaneously" at two CDT agents, one on Earth and one on Mars. Each memory stick comes with a letter telling the agents that they face a Prisoner's Dilemma:
1. If both cooperate, nothing will happen.
2. If one defects and the other cooperates, the defector will receive a large amount of utilons and will subsequently be modified to attack the other agent relentlessly, subtracting arbitrary value from it. For example, the defector might get cloned, so that the first clone carries on with its life (getting the same utility as it normally would) + 1 utilon, while the second clone attacks the CDT agent that cooperated. In summary, the one that defects ends up slightly better off and the one that cooperates gets a large negative reward.
3. If both defect, both will receive 1 utilon and will subsequently be modified arbitrarily.
Since communication between Earth and Mars is not instantaneous, the agents cannot coordinate before deciding, which forces them into the dilemma (waiting the number of minutes needed to communicate counts as cooperation). Assume additionally that each CDT agent does not value the "life" of the other agent; otherwise the cooperate/defect payoffs have to be modified to take this into account.

In such a situation two CDT agents will defect against each other and thus become misaligned, because each considers the decision of the other agent fixed and independent of its own. Furthermore, a CDT/EDT agent and a UDT agent will also defect against each other. Arguably, though, the example above is not completely satisfactory, since it depends on the specific details of the problem.
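The dominance reasoning in Eve's game can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the numeric values `attack_loss` and `modification_cost` are hypothetical (the post only says "large negative reward" and "modified arbitrarily"), and `mirrored_choice` is a crude stand-in for an agent that treats its copy's decision as linked to its own, not a formalisation of UDT.

```python
COOPERATE, DEFECT = "C", "D"


def payoff(me, other, attack_loss=100.0, modification_cost=50.0):
    """Utility to `me`, scored by its *current* reward function.

    attack_loss and modification_cost are hypothetical numbers chosen
    for illustration; they are not given in the post.
    """
    if me == COOPERATE and other == COOPERATE:
        return 0.0                      # nothing happens
    if me == DEFECT and other == COOPERATE:
        return 1.0                      # carries on with its life + 1 utilon
    if me == COOPERATE and other == DEFECT:
        return -attack_loss             # attacked relentlessly
    return 1.0 - modification_cost      # +1 utilon, then modified arbitrarily


def cdt_choice(p_other_defects, **params):
    """CDT: hold the other agent's action fixed (defects with the given
    probability) and pick the action with the highest expected utility."""
    def expected_utility(me):
        return ((1 - p_other_defects) * payoff(me, COOPERATE, **params)
                + p_other_defects * payoff(me, DEFECT, **params))
    return max((COOPERATE, DEFECT), key=expected_utility)


def mirrored_choice(**params):
    """Simplified stand-in for linked decisions: assume the other agent
    ends up choosing whatever I choose, and compare (C, C) with (D, D)."""
    return max((COOPERATE, DEFECT), key=lambda a: payoff(a, a, **params))
```

With the default numbers, `cdt_choice` returns defect for every belief about the other agent, while `mirrored_choice` returns cooperate. If `attack_loss` is made small relative to `modification_cost`, defection stops being dominant, which is the sense in which the example depends on the specific details of the problem.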

The previous problem highlights the importance of using UDT as the default decision theory. However, the full formalisation of this decision theory is not complete; for instance, as far as I know, Reflective Oracles exist only in the CDT framework.

Similarly, some proposed solutions in incentive design are not fully satisfactory. In particular, Time-Inconsistency Unaware (TI-unaware) agents are often thought of as a solution to several problems, including corrigibility, reward function tampering and feedback tampering. However, fooling such an agent is even simpler than fooling a CDT agent.

TI-unaware agents are those that believe that, no matter how their value function is changed, they will keep evaluating the future with their present value function. Thus, hacking them is as simple as offering 1 utilon for changing their value function. They will see it as 1 utilon for nothing, and thus accept. So, the conclusion is that TI-unaware agents are terribly unsafe. How big of a problem is this?
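The "1 utilon for nothing" argument can be made concrete with a minimal sketch. Everything here is a toy assumption: reward functions are modelled as dictionaries from hypothetical action names to utilities, and both agent types score outcomes with their current reward function. The only difference is which reward function they expect their future self to optimise.

```python
def best_action(reward):
    """Action that maximises the given reward function (a dict)."""
    return max(reward, key=reward.get)


def accepts_offer(current_reward, new_reward, bribe=1.0, ti_aware=True):
    """Would the agent accept `bribe` utilons in exchange for having its
    reward function replaced by `new_reward`?

    Both agent types evaluate outcomes with `current_reward`.
    - TI-aware: predicts its future self will optimise `new_reward`.
    - TI-unaware: predicts its future self will still optimise
      `current_reward`, so the swap looks free.
    """
    baseline = current_reward[best_action(current_reward)]
    predicted_objective = new_reward if ti_aware else current_reward
    future_action = best_action(predicted_objective)
    value_if_accept = bribe + current_reward[future_action]
    return value_if_accept > baseline


# Hypothetical reward functions, for illustration only.
current = {"help_humans": 10.0, "make_paperclips": 0.0}
tampered = {"help_humans": 0.0, "make_paperclips": 10.0}

ti_unaware_accepts = accepts_offer(current, tampered, ti_aware=False)
ti_aware_accepts = accepts_offer(current, tampered, ti_aware=True)
```

In this toy setting the TI-unaware agent accepts the swap (it compares 1 + 10 against 10), while the TI-aware agent refuses (it compares 1 + 0 against 10).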

According to the previous article, this may mean that we are in trouble, since there is no single answer to both feedback tampering and reward function tampering. In fact, TI-aware agents are also a solution to reward function tampering, but not to feedback tampering: since new data from the human may change the current reward function, a TI-aware agent would prefer not to receive any feedback.

However, I will argue that not everything is lost, since one does not need to solve both problems at the same time. In fact, the causal diagrams for TI-aware agents and uninfluenceable agents (one of the solutions to feedback tampering, see the second figure below) have an important difference: in the first case the parameter θ of the reward function can be directly influenced by actions, whereas in the second nothing may influence it, so one may as well think of this agent as a moral realist. But this means that not only will the agent have no incentive to modify you (the channel through which it gets information about θ), it may also try to isolate you in an attempt to keep you from being influenced by anything else. I feel this could be a problem, since there is no way for the agent to distinguish good from bad influences on its "master".

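The previous point can be seen in the following causal incentive diagrams for a TI-aware agent

[Figure: causal incentive diagram for a TI-aware agent]

and either an uninfluenceable

[Figure: causal incentive diagram for an uninfluenceable agent]

or counterfactual agent

[Figure: causal incentive diagram for a counterfactual agent]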

Notice that the main difference between reward function tampering and feedback tampering is that in the first case there are some parameters θ which the agent can directly access, whereas in the second the agent may only modify the Data nodes. The solution to feedback tampering consists in breaking the causal links between the data (which the agent may modify) and the reward. This makes me think that they are two different problems which do not need to be solved simultaneously. Am I right?
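One way to make this structural distinction concrete is a toy reachability check on directed graphs. The three graphs below are heavily simplified assumptions based only on the verbal description in this post (actions reaching θ, actions reaching Data, and the Data-to-reward link being cut); they are not reproductions of the causal incentive diagrams, and the node names are hypothetical.

```python
def has_path(graph, src, dst):
    """Depth-first search: True if a directed path src -> ... -> dst exists."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node in seen:
            continue
        seen.add(node)
        stack.extend(graph.get(node, []))
    return False


def parents(graph, node):
    """Set of nodes with a direct edge into `node`."""
    return {p for p, children in graph.items() if node in children}


# Reward function tampering: the action can directly reach theta.
rf_tampering = {
    "action": ["theta", "state"],
    "theta": ["reward"],
    "state": ["reward"],
}

# Feedback tampering: the action can only reach the data,
# but the data feeds into the reward.
feedback_tampering = {
    "action": ["data", "state"],
    "data": ["reward"],
    "state": ["reward"],
}

# Uninfluenceable agent (toy version): theta has no parents, and the
# link from data to reward is cut; data is only evidence about theta.
uninfluenceable = {
    "action": ["data", "state"],
    "theta": ["data", "reward"],
    "state": ["reward"],
}
```

In this sketch `has_path(rf_tampering, "action", "theta")` holds, `has_path(feedback_tampering, "data", "reward")` holds, and in `uninfluenceable` neither a path from `"data"` to `"reward"` nor any parent of `"theta"` exists. The two tampering problems correspond to different edges, which is at least consistent with treating them separately.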

This work was carried out during the AI Safety Research Program in a team together with Chris Leong and Davide Zagami. However, all errors still in the publication are my fault. This research has also been partially funded by an FPU grant to carry out my PhD.



1 comment

I like this line of thinking - the impact of awareness of future changes in utility function is under-studied. I do wish we'd stop bothering with the strawman of naive-CDT, it's distracting and wasteful to dismiss this thing that nobody is seriously arguing for.

It's probably time we start to get more formal about what a reward is - are we modeling it as point-in-time desirability of the state of the universe (I hope), or as an average over time or cumulative value over time (more complicated, and probably unnecessary)?

And that leads to a modeling question of what to optimize when you think a reachable universe state will have positive utility for some time and negative utility for some time. Expected inconsistency really breaks a whole lot of foundational assumptions of decision theory.