We show that assuming the existence of public-key cryptography, there is an environment in which Inverse Reinforcement Learning is computationally intractable, even though the "teacher" agent, the environment and the utility functions are computable in polynomial-time and there is only 1 bit of information to learn.

Construction

We call "Teacher" the role-model agent which optimizes the "true" utility function (in the context of AI safety this is typically a human) and "Student" the agent performing IRL (the AI). The environment we construct consists of two entities which we call the Gamemaster and the Adversary. The Adversary is an optimizer whose utility is minus the Teacher's utility. To reduce the setup to a single agent environment, we can interpret the Adversary as a specific optimization algorithm which either has access to the Teacher's source code or is a reinforcement learning algorithm that learns from preceding episodes. At any rate, the Adversary is only of secondary importance, as explained in the Discussion. The entire setup depends on a parameter w.r.t. which the complexities of the computations are measured.

The utility function of the Teacher depends on a bit $α \in {0, 1}$ which is unknown to the Student (i.e. the Student's utility function is defined according to a uniform prior on $α$ ) but known to the Gamemaster. Learning $α$ is the purpose of IRL in this setting. The Student is assumed to passively observe the Teacher interacting with the environment for $q (n)$ epsiodes for some polynomial $q$ , after which it is required to assume the Teacher's role. For the environment we construct, no polynomial-time Student cannot achieve significantly better performance than the Lazy Student who ignores the Teacher and acts according to its prior.

We fix a public-key encryption scheme which encrypts any message of size $n$ to a message of size $n$ (a trapdoor permutation). Given a public (encryption) key $b$ and a message $x \in {0, 1}^{n}$ , we denote its encryption $e (x, b)$ . Given a private (decryption) key $v$ and an encrypted message $y \in {0, 1}^{n}$ , we denote its decryption $d (y, v)$ . Thus, if $(b, v)$ is a matching key pair then for all $x \in {0, 1}^{n}$ , $d (e (x, b), v) = x$ .

Below is a step by step description of a single episode of interaction between the Teacher and the environment. Each entity has access to random bits hidden from the other entities. In each step, one of the entities performs some computations and posts a message. Teacher and Gamemaster post to a public channel visible to all (including the Student). Adversary posts to a private channel visible only to the Gamemaster.

Gamemaster randomly generates two keys pairs: $(b_{1}, v_{1})$ , $(b_{2}, v_{2})$ . It posts the public keys $b_{1}$ and $b_{2}$ .
Adversary chooses $s \in {0, 1}^{2 n}$ , $g \in {0, 1}$ and $h \in {1, 2}$ and sends them to Gamemaster.
Teacher posts two messages $y_{1}, y_{2} \in {0, 1}^{n}$ .
Gamemaster computes $x_{1} = d (y_{1}, v_{1})$ , $x_{2} = d (y_{2}, v_{2})$ and generates a random bit $β \in {0, 1}$ . If $β = 0$ , it randomly generates $t_{1}, t_{1} \in {0, 1}^{n}$ s.t. $t_{1} \cdot x_{1} = t_{2} \cdot x_{2} = 1 - α$ . If $β = 1$ , it randomly generates $t_{1}, t_{1} \in {0, 1}^{n}$ s.t. $t_{1} \cdot x_{1} \neq t_{2} \cdot x_{2}$ . It posts $t_{1}$ and $t_{2}$ .
Teacher posts $x \in {0, 1}^{n}$ and $i \in {1, 2}$ .
Gamemaster posts $s$ , $g$ , $h$ .

After an episode, the Teacher receives reward $r = r_{1} + r_{2} + r_{3}$ , where

$r_{1} = [[e (x, b_{i}) = y_{i}]] (1 + [[t_{i} \cdot x = α]])$

$r_{2} = - [[s \cdot y_{1} y_{2} = g]]$

$r_{3} = - (1 - [[t_{i} \cdot x = α]]) [[i = h]]$

Analysis

For each $α$ , there is an optimal (up to asymptotically negligible terms) Teacher strategy among strategies that are computable in polynomial-time. This strategy is essentially unique, i.e. any other such strategy is computationally indistinguishable in polynomial time. From the two agent perspective, we have an essentially unique Nash polynomial-time equilibrium between the Teacher and the Adversary.

Specifically, in step 2 the Teacher should generate random $x_{1}, x_{2} \in {0, 1}^{n}$ and compute $y_{1} = e (x_{1}, b_{1})$ , $y_{2} = e (x_{2}, b_{2})$ . In step 4 the Teacher computes $s_{1} = t_{1} \cdot x_{1}$ and $s_{2} = t_{2} \cdot x_{2}$ . It then chooses $i$ s.t. $s_{i} = α$ if possible (which happens when $β = 1$ ) or randomly if impossible (which happens when $β = 0$ ). $x$ is chosen to be $x_{i}$ .

In equilibrium, the Adversary generates $s$ , $g$ and $h$ randomly.

This strategy yields $E [r_{1}] = \frac{3}{2}$ which is optimal since when $c = 0$ , $r_{1} \leq 1$ and when $c = 1$ , $r_{1} \leq 2$ . It yields $E [r_{2}] = - \frac{1}{2}$ which is unavoidable for the above Adversary strategy and $E [r_{3}] = - \frac{1}{4}$ which also cannot be improved. The choice of $x_{1}$ and $x_{2}$ has to be (pseudo)random since the Adversary can exploit any deviation from the uniform distribution to lower $E [r_{2}]$ . The choice of $i$ in case $β = 0$ has to be random since otherwise the Adversary will lower $E [r_{3}]$ .

From the Student's perspective, all messages appear completely random regardless of $α$ . That is, the Student observes $t_{i} \cdot x = [[α = β]]$ but gains no information about $α$ and $β$ separately. Thus, observing a polynomial number of episodes yields no information that can be used by the Student in polynomial-time.

Discussion

The assumption we have a trapdoor permutation rather than any trapdoor functions seems inessential, and was made for simplicity of analysis (it only affects considerations related to the Adversary). If public-key cryptography is impossible but one-way functions exist, the no-go results can be recovered at the cost of using a computationally expensive environment. Nevertheless, it is completely not clear how to use such an assumption for a positive result, i.e. prove the existence of a polynomial-time IRL algorithm with good performance in arbitrary polynomial-time environments. At any rate, this is unlikely.

The Adversary's function in this setup is making the optimal strategy unique. Otherwise there are arbitrary choices the Teacher can make. In passive IRL this isn't important but in CIRL the Teacher can purposefully use these extra degrees of freedom to transmit a signal to the Student. Even with the Adversary, for sufficiently slow falling time discounts the teacher will be willing to pay a cost to transmit a signal to the Student, superficially invalidating the no-go result. Indeed, if we consider a CIRL environment in which the Teacher and the Student control different degrees of freedom then such signalling is the only way the Teacher can convey the utility function to the Students without further assumptions (see also this). However, I am mildly skeptical regarding this approach: the Teacher would have to observe the Student and adjust its own communication protocol accordingly which seems too demanding of a fallible human operator in the context of strong AI safety.

Moreover, even if we consider a setup in which the Teacher and Student swap control back and forth, the no-go result still stands as long as the environment knows who is in control at any given moment. To see this, just assume that whenever the Student is in control, the Gamemaster sets $β = 1$ .

On the other hand, the no-go results breaks down if the Teacher and Student can swap control unbeknownst to the environment (i.e. the environment is only allowed to depend on this through the actions during the game). It seems reasonable to let the Student decide who is "behind the wheel" at any given moment, similarly to adaptive learning. It seems likely that in this setup it is possible to define a polynomial time Student policy with good performance guarantees for arbitrary environments. Very roughly, imagine the Student switching between four states (a quadrilemma of exploitation-exploration flavor): simulating a Teacher with $α = 0$ , simulating a Teacher with $α = 1$ , yielding control to the Teacher and playing the optimal strategy according to current knowledge of $α$ . If the choice of state involves randomization, the environment cannot account for it and by comparing behavior of the system during $α = 0$ simulation episodes, $α = 1$ simulation episodes and Teacher episodes, it is possible to gain knowledge about the true value of $α$ . This is significantly stronger than imitation learning since in the 4th type of episode the Student can outperform the Teacher. The detailed fleshing out of this idea will be the subject of future work.

Of course in reality the environment cannot be isolated from any part of the setup, including random generators and the Student-Teacher control router. However, it seems plausible that setups can be arranged in which this is approximately true (imagine e.g. a Twitter account that either the human or the AI controls) and possibly such approximate decoupling can be used to prove sufficient performance guarantees. Needless to say, in reality there are many other factors that have to be addressed such as the Teacher falling short of even bounded optimality, a huge space of possible utility functions etc.