In Cooperative Inverse Reinforcement Learning (CIRL), a human H and a robot R cooperate in order to best fulfil the human's preferences. This is modeled as a Markov game M=⟨S,{AH,AR},T(⋅|⋅,⋅,⋅),{Θ,R(⋅,⋅,⋅;⋅)},P0(⋅,⋅),γ⟩.

This setup is not as complicated as it seems. There is a set S of states, and in any state, the human and the robot take simultaneous actions, chosen from AH and AR respectively. The transition function T takes this state and the two actions, and gives the probability of the next state. Finally, γ is the discount factor on rewards.

What is this reward? Well, the idea is that the reward is parameterised by a θ∈Θ, which only the human sees. Then R takes this parameter, the state, and the actions of both parties, and computes a reward; this is R(s,aH,aR;θ) for a state s and actions aH and aR by the human and robot respectively. Note that the robot never observes this reward; it merely computes it. The P0 is a joint probability distribution over the initial state s0 and the θ that the human will observe.
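As a sketch, this tuple can be written out as a data structure (all names here are illustrative choices of mine, not from the CIRL paper):

```python
import random
from dataclasses import dataclass
from typing import Callable

# A minimal sketch of the CIRL tuple; field names are mine, not the paper's.
@dataclass
class CIRLGame:
    states: list            # S
    human_actions: list     # A^H
    robot_actions: list     # A^R
    thetas: list            # Theta: reward parameters that only H observes
    transition: Callable    # T(s, aH, aR) -> dict of next-state probabilities
    reward: Callable        # R(s, aH, aR, theta) -> float; R computes, never observes it
    p0: Callable            # () -> (s0, theta), a sample from the joint initial distribution
    gamma: float            # discount factor

# Toy instance: two states, theta in {+1, -1}; only the robot's action is rewarded.
game = CIRLGame(
    states=["s0", "s1"],
    human_actions=["noop", "signal"],
    robot_actions=["left", "right"],
    thetas=[+1, -1],
    transition=lambda s, aH, aR: {"s1": 1.0},   # always move to s1
    reward=lambda s, aH, aR, th: th * (1 if aR == "right" else -1),
    p0=lambda: ("s0", random.choice([+1, -1])),
    gamma=0.9,
)
s0, theta = game.p0()   # theta is revealed to H only
```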

Behaviour in a CIRL game is defined by a pair of policies (πH,πR) that determine the action selection for H and R respectively. Each agent gets to observe the past actions of the other agent, so in general these policies could be arbitrary functions of their observation histories: πH:[AH×AR×S]∗×Θ→AH and πR:[AH×AR×S]∗→AR.

The optimal joint policy is the policy that maximises value, where value is the expected sum of discounted rewards. This optimum is the best H and R can achieve if they coordinate perfectly before H observes θ. It turns out that there exist optimal policies that depend only on the current state and R's belief about θ.
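That "belief about θ" is just a posterior that R maintains over Θ, updated from H's observed actions. A minimal sketch of one Bayesian update (the function name and toy likelihood are my own, not from the paper):

```python
def update_belief(belief, aH, likelihood):
    """One Bayesian update of R's belief over theta after observing H's action.

    belief: dict mapping theta -> P(theta)
    likelihood(aH, theta): P(aH | theta), H's action model as seen by R
    """
    post = {th: p * likelihood(aH, th) for th, p in belief.items()}
    z = sum(post.values())  # normalising constant
    return {th: p / z for th, p in post.items()}

# Toy example: H plays "go-right" with probability 0.9 iff theta = +1.
lik = lambda aH, th: 0.9 if (aH == "go-right") == (th == +1) else 0.1
b = update_belief({+1: 0.5, -1: 0.5}, "go-right", lik)  # posterior on theta = +1 is 0.9
```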

## Manipulation actions

My informal critique of CIRL is that it assumes two untrue facts: that H knows θ (i.e. knows their own values), and that H is perfectly rational (or noisily rational in a specific way).

Since I've been developing more machinery in this area, I can now try to state this more formally.

Assume that M always starts in a fixed state s0, that the reward is always zero in this initial state (so R(s0,⋅,⋅;⋅)=0), and that transitions from this initial state are independent of the agents' actions (so T(s|s0,⋅,⋅) is defined independently of the actions). This makes R's initial action aR0 irrelevant (since R has no private information to transmit).

Then let πH be the optimal policy for θ, and (πH)′ be the optimal policy for θ′ (this θ′ may be either independent of or dependent on θ).

Among the action set AR is a manipulative action a′ (this could involve tricking the human, drugging them, brain surgery, effective propaganda, etc.). If aR0=a′, the human H will pursue (πH)′; otherwise, they will pursue πH. If we designate I′ as the indicator variable of aR0=a′ (so it is 1 if that happens and 0 otherwise), then this corresponds to H following the compound policy:

π=I′(πH)′+(1−I′)πH.
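The compound policy can be sketched directly, treating policies as functions of the observation history as above (the encoding of histories as lists of (aH,aR,s) triples is my own choice, not the paper's):

```python
def compound_policy(pi_theta, pi_theta_prime, a_manip):
    """pi = I' * (pi^H)' + (1 - I') * pi^H, where I' indicates aR0 == a'.

    Policies map an observation history (a list of (aH, aR, s) triples)
    to an action; a_manip plays the role of a' from the text.
    """
    def pi(history):
        # R's first action aR0 is the aR component of the first triple.
        i_prime = bool(history) and history[0][1] == a_manip
        return pi_theta_prime(history) if i_prime else pi_theta(history)
    return pi

# Toy policies that ignore the history entirely.
pi = compound_policy(lambda h: "act-for-theta",
                     lambda h: "act-for-theta-prime",
                     "a_manip")
```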

This is well defined: policies map past sequences of states and actions to actions, and I′ is determined by R's first action aR0, which appears in H's observation history.
