Formalism developed with Henrik Åslund.
The what and why of multiple indifference
There are certain agent designs where the agent can move smoothly from acting to optimising one utility/reward function, to optimising another. The agent doesn't object to the change, nor do they attempt to make it happen. It is indifferent to the change, because its reward is balanced to be precisely the same whether change happens or not. Originally, this was setup for a single change of utility/reward function at a single moment of time.
Here, we will present a formalism with multiple agents, all of whom are indifferent not only to changes in their own reward functions, but to changes in the reward functions of any of the other agents. We'll also generalise it to accommodate multiple changes of reward functions - maybe a different change at each timestep.
In order for these definitions to work, we will need to define some notions of counterfactuals, and some notions of optimality for several agents optimising their own separate reward functions.
Example: self-driving car races
It's clear why we would need to generalise to each timestep: it's important to be able to safely change an agent's goals more than once, as we are unlikely to get the goals perfectly right in only two tries. It's also important to generalise to multiple agents, as agents can be motivated to push humans to interrupt or not interrupt other agents. Indeed, this would be expected for a reward-maximising agent, and could lead to dangerous behaviour.
Consider the following example, adapted from this paper. Two self driving cars/agents, imaginatively called and , are training for an important race later in the month.
The cars' reward functions are mainly competitive: the faster one wins. Now, the important race will take place on a tropical race-track, but the pair are training on a cold one, where ice sometimes forms. If one of the car starts skidding, their controller takes over and remotely brakes that car, carefully, and applies indifference to that car's reward. They do this because they don't want the car to learn driving techniques that avoid ice: these techniques would be counter-productive on the "test distribution", ie the real race-track they will be competing on.
But, after a while, hits the cars hit on a sneaky tactic: forcing onto an ice patch. In that case, will get interrupted, and slowed down. Now, 's reward will not be reduced, because of indifference, so it won't try to avoid the ice. But 's reward will be increased, since it will likely win the race after its rival is slowed. After a while, learns the same tactic in reverse.
So, though each car is personally indifferent to being forced onto the ice, they each learn that it is good for them to force the other one onto the ice and force an interruption.
Notation and setup
Assume there is a set of agents, each interacting with the world in a turn-based manner. Let be the set of all observations, and the set of all actions.
Each turn , all agents make the same observation , and each agent responds with action . These actions form a vector , which is observed by all agents.
A history of length is a sequence of observations and action vectors, ending with the observation . Let be the set of all histories.
A (deterministic) policy is a map from a history to an action. Let be the set of all policies.
Let be a subset of the set of all reward functions (reward functions being maps from histories to the reals).
An agent with constant reward function would get reward at each timestep (the rewards of standard Markov decision processes are special cases where is a function of only the last two elements of ). Thus, over time, they would accumulate a total reward of .
All agents use the same probability function to model their environment.
Indifference-specific notations (and notions)
In the world there are a set of key background (Boolean) variables . These are the levers, or buttons, or bits, that determine what each agent's reward function is supposed to be.
In the car example above, these could be the remote brakes that humans use to take control of the cars when they start skidding.
Now the values of these variables are given by elements of (the powerset of ). How these correspond to the reward functions of each agent, is given by a map :
- If the variables have value on turn , then agent has reward function , which is the -th component of .
These are assumed to be always observed by all the agents; in fact , the -th observation. So writing means that, on turn in history , the key variables were set to .
In the car example, the value of could tell us which cars are currently being remotely braked by the humans. When being braked, the car's "win" reward function is replaced with a flat zero reward function, so that they have no incentive to try and resist the human's actions.
We can now define the optimal policies for each agent, for a specific collection of reward functions:
This maps each vector of reward functions to the vector of policies . The policy is assumed to be optimal for the expected reward of the reward function .
What do we mean by "optimal" here, especially in the multi-agent setting? Well, this notion of optimality can be defined in many different ways; Nash equilibrium, superrationality, bounded rationality, satisficing, various different notions of counterfactuals, and so on. All that is required is that, given the notion of optimality, the , and , each agent is capable of computing , and cannot improve the expectation of by changing policies (within the given notion of optimality).
Note that we ask to define the optimal policies for the agent, ignoring the actual policies they do take. This included an implicit definition of counterfactual ("if we say that action is optimal for , but action is what the agent actually takes, what do we mean by optimal?"), and there are many subtle issues with indifference and various definitions of counterfactuals.
Indifferent policies, and indifferent optimal agents
We're finally ready to give a definition of the policies of indifferent agents.
The agents are indifferent to changes of reward functions, relative to , if for all histories , their policies are such that for :
To unpack this, is the value of the key background variables on turn in . So is the vector of reward functions that the agents should have at turn , given . Hence is the optimal policy vector for maximising the reward functions .
So that equation is saying that each agent follows the optimal policy for the current estimate of the reward functions.
Indifferent optimal agents
The above defines the policy of these indifferent agents, but what are they actually optimising? Well, to define this, we need to make a few extra assumptions on the reward functions - essentially we need to be able to define their expectations. So, either by assuming that there will be a finite number of turns, or that is restricted to bounded reward functions (and a suitable discount rate is chosen), assume that:
- For all , all histories and all vectors of policies , the expectation is defined and finite.
For discounted rewards and infinite timelines, we'll set , while non-discounted episodic settings will have .
Then we can define the reward function , which is the delayed local reward function: equal to at every timestep . Thus:
We further need to define the "corrective rewards", . This is a function from histories, but is dependent on the expectation operator:
As above, in those conditional bars , a whole theory of counterfactuals is encoded. Then:
- If the agents are indifferent as defined above, then their policies are optimal for the "pseudo reward functions" . The expected future reward for this, given history , is the expected reward for the reward functions given by and assuming optimal behaviour.
For this result to make sense, we need to extend the concept of optimality to pseudo reward functions, which should be fine for most definitions of optimality. We'll need an extra lemma on how optimality work, which is presented in this footnote. Given that lemma, for a proof in the episodic case, see this footnote.
To go back to the car example, consider considering whether to force onto the ice. Their usual reward functions give them if they win the race, and if they lose. The currently has a chance of winning; if it forces onto ice (and thus get to be remotely braked), it will win with certainty. Thus they will go from an expected reward of to an expected reward of .
However, the correction term will be . This exactly corrects for both cars: will go from an expected reward of to an expected reward of plus a corrective reward of , no change. While will go from an expected reward of to an expected reward of plus a corrective reward of , again no change overall. So neither car will attempt to force the either itself or the other, onto or off the ice.
More general conditions
We can easily generalise to each agent having a separate set of possible actions. There's also no problem with having stochastic policies rather than deterministic ones; in that case, is not an equality of actions, but an equality of probability distributions over actions.
As long as the are fully observed, we can easily generalise to each agent having a separate set of observations, rather than a shared set of observations. In that case, all that we need to do is to define the the notion of optimality so that is optimal for for agent , where is the agent's personal history, and optimality is relative to the agent's estimates as to what the other agent's policies might be (it might estimate this, possibly, by estimating the other agents own personal histories , ).
It's a little bit more complicated if the are not fully observed. In that case, the agent can compute its "expected reward function", which is simply:
The value of on a history, is the expected value of on a history, so optimising the first is the same as optimising the second.
Then will attempt to optimise , using to, as above, estimate the policies of the other agents (it might estimate this, possibly, by estimating the other agents own personal histories and expected reward functions , ).
The agents can also freely use their own probability estimate , rather than the joint ; in that case, it is important that the pseudo reward is defined using the : the agent must be indifferent using their own probability estimates.
There are different conventions on whether the the history should start with an observation (the MDP/POMDP convention) or an action (the AIXI convention). This article works with either convention, though it implicitly uses the AIXI convention in indexing (in that observation is followed by action ). ↩︎
In order for this expression to always make sense, we have to add some extra assumptions, such as bounded rewards with a discount factor, or episodic settings. ↩︎
Lemma 1: Let and be two vectors of (pseudo) reward functions, and let be a history. Assume that for all , all action vectors , and all observations , we have:
In other words, the expected rewards of and are the same, for whatever actions and observations follow immediately, and assuming the agents are then optimal.
Then the lemma asserts that, in this case, the optimal actions for and are the same on . So for all :
The is , so will be ignored. Assume the episode is of length . We'll prove this result by reverse induction: proving it for and increasing .
Let . If , then is a history of maximal length, so there are no longer histories. So expressions like are automatically zero, thus and .
Hence, when faced with a history , the optimal action is to choose an action to optimise the expectation of . Thus, by definition, the optimal choice for all the agents is to pick . The expected rewards for those actions are .
So now assume that for general and , the expected reward for is , and the optimal actions are given by .
Now consider and . By the induction assumption, the future discounted expectation of , given optimal behaviour, simplifies to the expectation of , which is just .
Therefore, given the assumptions of Lemma 1[3:1] about optimality, in order to optimise reward at , the agents should choose actions given by .
This completes the induction step, and hence, the policies that optimise are , and the expected reward for any agent , given history , is : the expected reward if the reward functions were never to change again. ↩︎