Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

Highlights

Differentiable Image Parameterizations(Alexander Mordvintsev et al): There are lots of techniques for generating images using neural nets. A common approach is to take a neural net trained to classify images, and then use gradient descent to optimize the input image instead of the weights of the neural net. You might think that the only way to affect the generated input image would be to change the loss function on which you run gradient descent, but in reality the way in which you represent the image makes a huge difference. They describe why this might be the case, and go through several examples:

1. Suppose you want to see how two neurons interact. You could optimize an image to maximize the sum of the activations of the neurons. Even better, you could create an animation of how the image changes as you trade off how much you care about each neuron. Done naively, this doesn't look good, because there's a lot of randomness that changes between each image in the animation, which swamps out the differences we actually care about. To fix this, we can generate each frame in the animation as the sum of two images, one shared across all frames, and one that is frame-specific. Despite changing neither the loss function nor the space of input images, this is sufficient to remove the randomness between frames.

2. You've probably seen style transfer before, but did you know it only works with the VGG architecture? We can get it to work with other architectures by representing images in Fourier-space instead of pixel-space, again without any change in the loss function or expressible space of images.

3. If you generate the pixel-space representation of an image from a lower-dimensional representation using a Compositional Pattern Producing Network (CPPN), then gradient descent will optimize the lower-dimensional representation. It turns out that this produces images vaguely reminiscent of light-paintings. (I believe in this case, while the loss function doesn't change, the space of expressible images does change.)

4. Often when we see the feature visualization for a neuron, there are a lot of areas of the image that don't actually matter for the neuron's activation. So, we can add transparency, and add a term in the loss function that encourages transparency. We also have to change the representation of the image to include a transparency channel in addition to the normal RGB channels. Then, the generated image will be transparent wherever the pixels don't matter, but will still have the visualization wherever it does matter for activating the neuron.

5+6. We can even use a representation of 3D objects, and then write a (differentiable) algorithm that converts that into a 2D image that then goes through the standard image classifier neural net. This lets us optimize over the 3D object representation itself, letting us do both feature visualization and style transfer on 3D objects.

My opinion: While OpenAI Five suggests that the main thing we need to do is think of a reward function and an exploration strategy, this suggests that ML requires not just a good loss function, but lots of other things in order to work well. We have particular examples where changing things other than the loss function leads to different results. (This is probably also true for OpenAI Five, but the variations may not matter much, or OpenAI hasn't talked about the ML engineering behind the scenes -- I'm not sure.) These generally seem to be changing the inductive bias of the neural nets encoding the images. I think that if you expect to get very capable AI systems within the current paradigm, you will have to think about how inductive bias will affect what your AI system will do (and consequently its safety).

Also, the paper is very clear and approachable, and filled with great visualizations, as I've come to expect from Distill. I almost forgot to mention this, because I take it as a given for any Distill paper.

Generative Adversarial Imitation Learning(Jonathan Ho et al): To do imitation learning, you could try to learn the state-action correspondence that defines the expert policy via behavioral cloning (simply learn a model of action given state using supervised learning), but that’s brittle. Specifically, this is learning the policy over single timesteps, so any errors in one timestep lead to the next timestep being slightly out of distribution, leading to worse errors, giving compounding errors. So, you can instead use IRL to find the cost function, and use RL to then find a good policy. But IRL is slow. They propose a fast method that gets directly to the policy, but it’s as if you did IRL followed by RL.

They write down IRL as a maximization of a minimization that gives you a cost function, and then RL as a minimization over policies of the expected cost. Specifically, maxent RL(c) computes a good policy π given a cost function c by minimizing expected cost E[c] and maximizing entropy H(π), where H(π) is computed according to causal entropy, not regular entropy (see MaxCausalEnt from AN #12). IRL(π) finds a cost function c such that the expert policy is maximally better than the RL policy.

Let’s assume that we’re working with all cost functions c : S → A → Reals. This is a huge space, so let’s throw in a convex regularizer ψ for the IRL problem. Then, they prove that RL(IRL(π)) can be expressed as a single minimization involving ψ*, the convex conjugate of ψ. We can make different choices of ψ give us different imitation learning algorithms that minimize this objective.

Let ρ(s, a) be the occupancy measure of a policy π for a state-action pair (s, a). (Occupancy measure is the sum of the discounted probabilities of taking the given state-action pair.) Then, the expected cost of a trajectory is the sum of ρ(s, a) c(s, a) over all s, a. In the special case where you don’t do any regularization, the occupancy measure of the solution is guaranteed to be equal to the occupancy measure of the expert policy. (In the language of convex optimization, IRL and maxent RL are dual to each other, and strong duality applies.)

Of course, this all assumes that you have access to the full expert policy, whereas in practice you only have access to some expert demonstrations. So now, the question is how to choose a good ψ that deals with that issue. If you don’t regularize at all, you get something that matches the estimated occupancy measure exactly -- which is not great, since there are likely many areas of state space where there are not enough samples to accurately match the occupancy measure, at least in the case with stochastic dynamics. On the other hand, previous algorithms are basically using a ψ that is infinity on cost functions outside of a particular class, and constant elsewhere, which means that they only work if the cost function is actually in the class of learnable functions.

So, they instead propose a ψ that is dependent on the expert data. It requires that the cost function be negative everywhere, and puts a high penalty on any cost function that puts high cost (i.e. close to zero) on the expert data. Note that this can represent any (bounded) reward function, so it has very good expressive power. They chose the particular form they did because when you work through the convex conjugate, you end up choosing π to minimize the Jensen-Shannon divergence between the inferred policy and the expert policy, which can be interpreted as the optimal log loss of a classifier that distinguishes between the two policiees. This is very GAN-like, where we have a discriminator trying to distinguish between two policies, and a generator (the learned policy) that’s trying to fool the discriminator. Their proposed algorithm, GAIL, follows this quite precisely, alternating between an Adam gradient step to update the discriminator, and a TRPO step to update the policy. (Both the discriminator and the policy are neural nets.)

In the experiments, they imitate classic control and MuJoCo tasks that have been trained with TRPO. GAIL consistently achieves the expert performance, even with not much data, though this comes at a cost of lots of environment interaction (which eg. behavioral cloning does not require).

Learning Robust Rewards with Adversarial Inverse Reinforcement Learning(Justin Fu et al): GAIL and Guided Cost Learning (GCL) both have the idea of learning a policy and a reward function simultaneously, in a GAN-like way (where the generator is the policy, and the discriminator has to distinguish between trajectories from the policy and the expert trajectories, which can then be interpreted as a reward function). In GAIL, the discriminator can’t be interpreted as a reward function, and you only get a policy as output (which is why it is called imitation learning). However, if you enforce that the discriminator has to be of the form:

D(τ) = exp(f(τ)) / (exp(f(τ)) + π(τ))

Then you can show that if π is trained to maximize

R(τ) = log(1 - D(τ)) - log(D(τ))

Then R and π converge to the optimal reward and policy respectively.

This trajectory-centric formulation is called GAN-GCL (the GAN version of guided cost learning). It’s main issue is that it works with trajectories and so is hard to optimize -- the gradients are very high variance. So, instead, we can work with individual state-action pairs instead of trajectories, just replacing every τ in the equations above with (s, a). This makes it more sample-efficient, and in this case f converges to the advantage function of the optimal policy. However, the advantage function induces a heavily entangled reward function, which rewards the RL agent for doing the action that the expert would have taken, without actually understanding the goal that the expert had. We would like to learn a disentangled reward, which they define as a reward function that leads to the optimal policy according to the true reward function even if the transition dynamics change.

Intuitively, since entanglement happens by rewarding the agent for taking the same action as the expert, we can do better by enforcing that the reward function only be a function of the state, so that it is forced to learn the actual goal rather than memorizing the actions that are good. They prove two theorems under the condition that the true reward is only a function of the state. First, the learned optimal reward function is fully disentangled if it is a function of only the state, assuming that the transition dynamics are “decomposable”. Second, the reward function must be a function of only the state if it is fully disentangled.

Now the discriminator in the formulation above is either looking at the trajectory as a whole, or looking at the current action in order to see whether or not you are matching the expert demonstrations. Clearly we can’t just make the discriminator a function of only the current state -- there’s no way that could distinguish between policies, since it has no access to information about the actions that the policies took. However, we can instead separate the discriminator’s f function into the reward term and a shaping term, and enforce that the shaping term does not change the optimal policy:

f(s, a, s’) = g(s) + γh(s’) − h(s)

(It happens to be the case that for any function h, adding γh(s’) − h(s) to the reward function does not change the optimal policy.) Now, the discriminator gets information about the action taken by the policy by seeing the next state s’ that resulted. Since γh(s’) − h(s) does not change the optimal policy, g(s) should converge to an optimal reward function, while h(s) must then be the value function V(s) in order to have f(s, a, s’) be the advantage function.

They run a bunch of experiments with recovering a reward function and then transferring it to a situation with different dynamics, and show that it works much better than any other algorithm. They also show that for direct imitation (no transfer required), it does about as well as GAIL.

Stable Pointers to Value III: Recursive Quantilization(Abram Demski): We often try to solve alignment problems by going a level meta. For example, instead of providing feedback on what the utility function is, we might provide feedback on how to best learn what the utility function is. This seems to get more information about what safe behavior is. What if we iterate this process? For example, in the case of quantilizers with three levels of iteration, we would do a quantilized search over utility function generators, then do a quantilized search over the generated utility functions, and then do a quantilized search to actually take actions.

My opinion: The post mentions what seems like the most salient issue -- that it is really hard for humans to give feedback even a few meta levels up. How do you evaluate a thing that will create a distribution over utility functions? I might go further -- I'm not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances. On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can't depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.

Computational efficiency reasons not to model VNM-rational preference relations with utility functions(Alex Mennen): Realistic agents don't use utility functions over world histories to make decisions, because it is computationally infeasible, and it's quite possible to make a good decision by only considering the local effects of the decision. For example, when deciding whether or not to eat a sandwich, we don't typically worry about the outcome of a local election in Siberia. For the same computational reasons, we wouldn't want to use a utility function to model other agents. Perhaps a utility function is useful for measuring the strength of an agent's preference, but even then it is really measuring the ratio of the strength of the agent's preference to the strength of the agent's preference over the two reference points used to determine the utility function.

My opinion: I agree that we certainly don't want to model other agents using full explicit expected utility calculations because it's computationally infeasible. However, as a first approximation it seems okay to model other agents as computationally bounded optimizers of some utility function. It seems like a bigger problem to me that any such model predicts that the agent will never change its preferences (since that would be bad according to the current utility function).

Exorcizing the Speed Prior?(Abram Demski): Intuitively, in order to find a solution to a hard problem, we could either do an uninformed brute force search, or encode some domain knowledge and then do an informed search. Roughly, we should expect each additional bit of information to cut the required search roughly in half. The speed prior trades off a bit of complexity against a doubling of running time, so we should expect the informed and uninformed searches to be equally likely in the speed prior. So, uninformed brute force searches that can find weird edge cases (aka daemons) are only equally likely, not more likely.

My opinion: As the post acknowledges, this is extremely handwavy and just gesturing at an intuition, so I'm not sure what to make of it yet. One counterconsideration is that a lot of intelligence that is not just search, that still is general across domains (see this comment for examples).

Conceptual problems with utility functions, second attempt at explaining(Dacyn): Argues that there's a difference between object-level fairness (which sounds to me like fairness as a terminal value) and meta-level fairness (which sounds to me like instrumental fairness), and that this difference is not captured with single-player utility function maximization.

My opinion: I still think that the difference pointed out here is accounted for by traditional multiagent game theory, which has utility maximization for each player. For example, I would expect that in a repeated Ultimatum game, fairness would arise naturally, similarly to how tit-for-tat is a good strategy in an iterated prisoner's dilemma.

Safe Option-Critic: Learning Safety in the Option-Critic Architecture(Arushi Jain et al): Let's consider an RL agent in the options framework (one way of doing hierarchical reinforcement learning). One way in which we could make such an agent safer would be to make it risk-averse. The authors define the controllability of a (state, option) pair to be the negative expected variance of the TD error. Intuitively, the controllability is higher when the value of the (state, option) pair is more predictable to the agent, and so by optimizing for controllability we can encourage risk-aversion. They derive the policy gradient when the objective is to maximize the reward and the controllability of the initial (state, option) pair, and use this to create the Safe-A2OC algorithm (a safe version of A2OC, which itself is a version of A2C for options). They test this out on a four-rooms gridworld problem, Cartpole, and three games from the Arcade Learning Environment (ALE).

My opinion: I'm very excited to see a paper tackling safety in hierarchical reinforcement learning -- that seems like a really important area to consider, and doesn't have many safety people working on it yet. That said, this paper feels weird to me, because in order to learn that a particular (state, option) pair is bad, the RL agent must experience that pair somewhat often, so it will have done the risky thing. It's not clear to me where this would be useful. One upside could be that we do risky things less often, so our RL agent learns faster from its mistakes, and doesn't make them as often. (And in fact they find that this leads to faster learning in three games in the ALE.) Perhaps we could also use this to train a risk-averse agent in simulation, that then never makes a mistake when deployed in the real world.

I also wonder whether we should be trying to make our agents risk-averse. The "right" answer seems to me to a combination of two things: First, some things are actually very bad and have very large negative reward, and so they should be avoided with high probability. Second, when you are acting over a long period of time, even a small probability of failure at every time step compounds and leads to a near-guaranteed failure. If these are actually the reasons underlying risk aversion, it seems like we want to be able to imbue our RL agent with the underlying reasons, rather than flat risk aversion.

Contrastive Explanations for Reinforcement Learning in terms of Expected Consequences(Jasper van der Waa et al): This paper aims to provide contrastive explanations for the behavior of an RL agent, meaning that they contrast why the RL agent used one policy instead of another policy. They do this by computing the expected outcomes under the alternate policy, and then describing the difference between the two. (An outcome is a human-interpretable event -- they assume that they are given a function that maps states to outcomes.)

My opinion: I wish that they had let users choose the questions in their user study, rather than just evaluating questions that had been generated by their method where they wrote the alternative policy using template policies they had written. I'd be pretty excited and think it was a good step forward in this area if end users (i.e. not ML researchers) could ask novel contrastive questions (perhaps in some restricted class of questions).

Learning Heuristics for Automated Reasoning through Deep Reinforcement Learning(Gil Lederman et al): The formal methods community uses SAT solvers all the time in order to solve complex search-based problems. These solvers use handtuned heuristics in order to drastically improve performance. The heuristics only affect the choices in the search process, not the correctness of the algorithm overall. Obviously, we should consider using neural nets to learn these heuristics instead. However, neural nets take a long time to run, and SAT solvers have to make these decisions very frequently, so it's unlikely to actually be helpful -- the neural net would have to be orders of magnitude better than existing heuristics. So, they instead do this for QBF (quantified boolean formulas) -- these are PSPACE complete, and the infrastructure needed to support the theory takes more time, so it's more likely that neural nets can actually help. They implement this using a graph neural network and engineer some simple features for variables and clauses. (Feature engineering is needed because there can hundreds of thousands of variables, so you can only have ~10 numbers to describe the variable.) It works well, doing better than the handcoded heuristics.

My opinion: For over a year now people keep asking me whether something like this is doable, since it seems like an obvious win combining PL and ML, and why no one has done it yet. I've mentioned the issue about neural nets being too slow, but it still seemed doable, and I was really tempted to do it myself. So I'm really excited that it's finally been done!

Oh right, AI alignment. Yeah, I do actually think this is somewhat relevant -- this sort of work could lead to much better theorem provers and formal reasoning, which could make it possible to create AI systems with formal guarantees. I'm not very optimistic about this approach myself, but I know others are.

HighlightsDifferentiable Image Parameterizations(Alexander Mordvintsev et al): There are lots of techniques for generating images using neural nets. A common approach is to take a neural net trained to classify images, and then use gradient descent to optimizethe input imageinstead of the weights of the neural net. You might think that the only way to affect the generated input image would be to change the loss function on which you run gradient descent, but in reality the way in which you represent the image makes a huge difference. They describe why this might be the case, and go through several examples:1. Suppose you want to see how two neurons interact. You could optimize an image to maximize the sum of the activations of the neurons. Even better, you could create an animation of how the image changes as you trade off how much you care about each neuron. Done naively, this doesn't look good, because there's a lot of randomness that changes between each image in the animation, which swamps out the differences we actually care about. To fix this, we can generate each frame in the animation as the sum of two images, one shared across all frames, and one that is frame-specific. Despite changing neither the loss function nor the space of input images, this is sufficient to remove the randomness between frames.

2. You've probably seen

style transferbefore, but did you know it only works with the VGG architecture? We can get it to work with other architectures by representing images in Fourier-space instead of pixel-space, again without any change in the loss function or expressible space of images.3. If you generate the pixel-space representation of an image from a lower-dimensional representation using a Compositional Pattern Producing Network (CPPN), then gradient descent will optimize the lower-dimensional representation. It turns out that this produces images vaguely reminiscent of light-paintings. (I believe in this case, while the loss function doesn't change, the space of expressible images does change.)

4. Often when we see the feature visualization for a neuron, there are a lot of areas of the image that don't actually matter for the neuron's activation. So, we can add transparency, and add a term in the loss function that encourages transparency. We also have to change the representation of the image to include a transparency channel in addition to the normal RGB channels. Then, the generated image will be transparent wherever the pixels don't matter, but will still have the visualization wherever it does matter for activating the neuron.

5+6. We can even use a representation of 3D objects, and then write a (differentiable) algorithm that converts that into a 2D image that then goes through the standard image classifier neural net. This lets us optimize over the 3D object representation itself, letting us do both feature visualization and style transfer on 3D objects.

My opinion:WhileOpenAI Fivesuggests that the main thing we need to do is think of a reward function and an exploration strategy, this suggests that ML requires not just a good loss function, but lots of other things in order to work well. We have particular examples where changing things other than the loss function leads to different results. (This is probably also true for OpenAI Five, but the variations may not matter much, or OpenAI hasn't talked about the ML engineering behind the scenes -- I'm not sure.) These generally seem to be changing the inductive bias of the neural nets encoding the images. I think that if you expect to get very capable AI systems within the current paradigm, you will have to think about how inductive bias will affect what your AI system will do (and consequently its safety).Also, the paper is very clear and approachable, and filled with great visualizations, as I've come to expect from Distill. I almost forgot to mention this, because I take it as a given for any Distill paper.

Prerequisities:Feature VisualizationTechnical AI alignmentSummary: Inverse Reinforcement LearningWe continue on the tour of IRL algorithms introduced in

Alignment Newsletter #12.Generative Adversarial Imitation Learning(Jonathan Ho et al): To do imitation learning, you could try to learn the state-action correspondence that defines the expert policy via behavioral cloning (simply learn a model of action given state using supervised learning), but that’s brittle. Specifically, this is learning the policy over single timesteps, so any errors in one timestep lead to the next timestep being slightly out of distribution, leading to worse errors, giving compounding errors. So, you can instead use IRL to find the cost function, and use RL to then find a good policy. But IRL is slow. They propose a fast method that gets directly to the policy, but it’s as if you did IRL followed by RL.They write down IRL as a maximization of a minimization that gives you a cost function, and then RL as a minimization over policies of the expected cost. Specifically, maxent RL(c) computes a good policy π given a cost function c by minimizing expected cost E[c] and maximizing entropy H(π), where H(π) is computed according to causal entropy, not regular entropy (see MaxCausalEnt from

AN #12). IRL(π) finds a cost function c such that the expert policy is maximally better than the RL policy.Let’s assume that we’re working with all cost functions c : S → A → Reals. This is a huge space, so let’s throw in a convex regularizer ψ for the IRL problem. Then, they prove that RL(IRL(π)) can be expressed as a single minimization involving ψ*, the convex conjugate of ψ. We can make different choices of ψ give us different imitation learning algorithms that minimize this objective.

Let ρ(s, a) be the occupancy measure of a policy π for a state-action pair (s, a). (Occupancy measure is the sum of the discounted probabilities of taking the given state-action pair.) Then, the expected cost of a trajectory is the sum of ρ(s, a) c(s, a) over all s, a. In the special case where you don’t do any regularization, the occupancy measure of the solution is guaranteed to be equal to the occupancy measure of the expert policy. (In the language of convex optimization, IRL and maxent RL are dual to each other, and strong duality applies.)

Of course, this all assumes that you have access to the full expert policy, whereas in practice you only have access to some expert demonstrations. So now, the question is how to choose a good ψ that deals with that issue. If you don’t regularize at all, you get something that matches the estimated occupancy measure exactly -- which is not great, since there are likely many areas of state space where there are not enough samples to accurately match the occupancy measure, at least in the case with stochastic dynamics. On the other hand, previous algorithms are basically using a ψ that is infinity on cost functions outside of a particular class, and constant elsewhere, which means that they only work if the cost function is actually in the class of learnable functions.

So, they instead propose a ψ that is dependent on the expert data. It requires that the cost function be negative everywhere, and puts a high penalty on any cost function that puts high cost (i.e. close to zero) on the expert data. Note that this can represent any (bounded) reward function, so it has very good expressive power. They chose the particular form they did because when you work through the convex conjugate, you end up choosing π to minimize the Jensen-Shannon divergence between the inferred policy and the expert policy, which can be interpreted as the optimal log loss of a classifier that distinguishes between the two policiees. This is very GAN-like, where we have a discriminator trying to distinguish between two policies, and a generator (the learned policy) that’s trying to fool the discriminator. Their proposed algorithm, GAIL, follows this quite precisely, alternating between an Adam gradient step to update the discriminator, and a TRPO step to update the policy. (Both the discriminator and the policy are neural nets.)

In the experiments, they imitate classic control and MuJoCo tasks that have been trained with TRPO. GAIL consistently achieves the expert performance, even with not much data, though this comes at a cost of lots of environment interaction (which eg. behavioral cloning does not require).

Learning Robust Rewards with Adversarial Inverse Reinforcement Learning(Justin Fu et al): GAIL and Guided Cost Learning (GCL) both have the idea of learning a policy and a reward function simultaneously, in a GAN-like way (where the generator is the policy, and the discriminator has to distinguish between trajectories from the policy and the expert trajectories, which can then be interpreted as a reward function). In GAIL, the discriminator can’t be interpreted as a reward function, and you only get a policy as output (which is why it is called imitation learning). However, if you enforce that the discriminator has to be of the form:D(τ) = exp(f(τ)) / (exp(f(τ)) + π(τ))

Then you can show that if π is trained to maximize

R(τ) = log(1 - D(τ)) - log(D(τ))

Then R and π converge to the optimal reward and policy respectively.

This trajectory-centric formulation is called GAN-GCL (the GAN version of guided cost learning). It’s main issue is that it works with trajectories and so is hard to optimize -- the gradients are very high variance. So, instead, we can work with individual state-action pairs instead of trajectories, just replacing every τ in the equations above with (s, a). This makes it more sample-efficient, and in this case f converges to the advantage function of the optimal policy. However, the advantage function induces a heavily entangled reward function, which rewards the RL agent for doing the action that the expert would have taken, without actually understanding the goal that the expert had. We would like to learn a disentangled reward, which they define as a reward function that leads to the optimal policy according to the true reward function even if the transition dynamics change.

Intuitively, since entanglement happens by rewarding the agent for taking the same action as the expert, we can do better by enforcing that the reward function only be a function of the state, so that it is forced to learn the actual goal rather than memorizing the actions that are good. They prove two theorems under the condition that the true reward is only a function of the state. First, the learned optimal reward function is fully disentangled if it is a function of only the state, assuming that the transition dynamics are “decomposable”. Second, the reward function must be a function of only the state if it is fully disentangled.

Now the discriminator in the formulation above is either looking at the trajectory as a whole, or looking at the current action in order to see whether or not you are matching the expert demonstrations. Clearly we can’t just make the discriminator a function of only the current state -- there’s no way that could distinguish between policies, since it has no access to information about the actions that the policies took. However, we can instead separate the discriminator’s f function into the reward term and a shaping term, and enforce that the shaping term does not change the optimal policy:

f(s, a, s’) = g(s) + γh(s’) − h(s)

(It happens to be the case that for any function h, adding γh(s’) − h(s) to the reward function does not change the optimal policy.) Now, the discriminator gets information about the action taken by the policy by seeing the next state s’ that resulted. Since γh(s’) − h(s) does not change the optimal policy, g(s) should converge to an optimal reward function, while h(s) must then be the value function V(s) in order to have f(s, a, s’) be the advantage function.

They run a bunch of experiments with recovering a reward function and then transferring it to a situation with different dynamics, and show that it works much better than any other algorithm. They also show that for direct imitation (no transfer required), it does about as well as GAIL.

Technical agendas and prioritizationRobustness to fundamental uncertainty in AGI alignment(gworley)Agent foundationsStable Pointers to Value III: Recursive Quantilization(Abram Demski): We often try to solve alignment problems by going a level meta. For example, instead of providing feedback on what the utility function is, we might provide feedback on how to best learn what the utility function is. This seems to get more information about what safe behavior is. What if we iterate this process? For example, in the case of quantilizers with three levels of iteration, we would do a quantilized search over utility function generators, then do a quantilized search over the generated utility functions, and then do a quantilized search to actually take actions.My opinion:The post mentions what seems like the most salient issue -- that it is really hard for humans to give feedback even a few meta levels up. How do you evaluate a thing that will create a distribution over utility functions? I might go further -- I'm not even sure there is good normative feedback on the meta level(s). There is feedback we can give on the meta level for any particular object-level instance, but it seems not at all obvious (to me) that this advice will generalize well to other object-level instances. On the other hand, it does seem to me that the higher up you are in meta-levels, the smaller the space of concepts and the easier it is to learn. So maybe my overall take is that it seems like we can't depend on humans to give meta-level feedback well, but if we can figure out how to either give better feedback or learn from noisy feedback, it would be easier to learn and likely generalize better.Computational efficiency reasons not to model VNM-rational preference relations with utility functions(Alex Mennen): Realistic agents don't use utility functions over world histories to make decisions, because it is computationally infeasible, and it's quite possible to make a good decision by only considering the local effects of the decision. For example, when deciding whether or not to eat a sandwich, we don't typically worry about the outcome of a local election in Siberia. For the same computational reasons, we wouldn't want to use a utility function to model other agents. Perhaps a utility function is useful for measuring the strength of an agent's preference, but even then it is really measuring the ratio of the strength of the agent's preference to the strength of the agent's preference over the two reference points used to determine the utility function.My opinion:I agree that we certainly don't want to model other agents using full explicit expected utility calculations because it's computationally infeasible. However, as a first approximation it seems okay to model other agents as computationally bounded optimizers of some utility function. It seems like a bigger problem to me that any such model predicts that the agent will never change its preferences (since that would be bad according to the current utility function).Exorcizing the Speed Prior?(Abram Demski): Intuitively, in order to find a solution to a hard problem, we could either do an uninformed brute force search, or encode some domain knowledge and then do an informed search. Roughly, we should expect each additional bit of information to cut the required search roughly in half. The speed prior trades off a bit of complexity against a doubling of running time, so we should expect the informed and uninformed searches to be equally likely in the speed prior. So, uninformed brute force searches that can find weird edge cases (aka daemons) are only equally likely, not more likely.My opinion:As the post acknowledges, this is extremely handwavy and just gesturing at an intuition, so I'm not sure what to make of it yet. One counterconsideration is that a lot of intelligence that is not just search, that still is general across domains (seethis commentfor examples).Conceptual problems with utility functions, second attempt at explaining(Dacyn): Argues that there's a difference between object-level fairness (which sounds to me like fairness as a terminal value) and meta-level fairness (which sounds to me like instrumental fairness), and that this difference is not captured with single-player utility function maximization.My opinion:I still think that the difference pointed out here is accounted for by traditional multiagent game theory, which has utility maximization for each player. For example, I would expect that in a repeated Ultimatum game, fairness would arise naturally, similarly to how tit-for-tat is a good strategy in an iterated prisoner's dilemma.Read more:Conceptual problems with utility functionsThe Evil Genie Puzzle(Chris Leong)Learning human intentInterpretable Latent Spaces for Learning from Demonstration(Yordan Hristov et al)Preventing bad behaviorSafe Option-Critic: Learning Safety in the Option-Critic Architecture(Arushi Jain et al): Let's consider an RL agent in the options framework (one way of doing hierarchical reinforcement learning). One way in which we could make such an agent safer would be to make it risk-averse. The authors define the controllability of a (state, option) pair to be the negative expected variance of the TD error. Intuitively, the controllability is higher when the value of the (state, option) pair is more predictable to the agent, and so by optimizing for controllability we can encourage risk-aversion. They derive the policy gradient when the objective is to maximize the reward and the controllability of the initial (state, option) pair, and use this to create the Safe-A2OC algorithm (a safe version of A2OC, which itself is a version of A2C for options). They test this out on a four-rooms gridworld problem, Cartpole, and three games from the Arcade Learning Environment (ALE).My opinion:I'm very excited to see a paper tackling safety in hierarchical reinforcement learning -- that seems like a really important area to consider, and doesn't have many safety people working on it yet. That said, this paper feels weird to me, because in order to learn that a particular (state, option) pair is bad, the RL agent must experience that pair somewhat often, so it will have done the risky thing. It's not clear to me where this would be useful. One upside could be that we do risky things less often, so our RL agent learns faster from its mistakes, and doesn't make them as often. (And in fact they find that this leads to faster learning in three games in the ALE.) Perhaps we could also use this to train a risk-averse agent in simulation, that then never makes a mistake when deployed in the real world.I also wonder whether we should be trying to make our agents risk-averse. The "right" answer seems to me to a combination of two things: First, some things are actually very bad and have very large negative reward, and so they should be avoided with high probability. Second, when you are acting over a long period of time, even a small probability of failure at every time step compounds and leads to a near-guaranteed failure. If these are actually the reasons underlying risk aversion, it seems like we want to be able to imbue our RL agent with the underlying reasons, rather than flat risk aversion.

InterpretabilityDifferentiable Image Parameterizations(Alexander Mordvintsev et al): Summarized in the highlights!Contrastive Explanations for Reinforcement Learning in terms of Expected Consequences(Jasper van der Waa et al): This paper aims to provide contrastive explanations for the behavior of an RL agent, meaning that they contrast why the RL agent used one policy instead of another policy. They do this by computing the expected outcomes under the alternate policy, and then describing the difference between the two. (An outcome is a human-interpretable event -- they assume that they are given a function that maps states to outcomes.)My opinion:I wish that they had let users choose the questions in their user study, rather than just evaluating questions that had been generated by their method where they wrote the alternative policy using template policies they had written. I'd be pretty excited and think it was a good step forward in this area if end users (i.e. not ML researchers) could ask novel contrastive questions (perhaps in some restricted class of questions).AI strategy and policyNarrow AI Nanny: Reaching Strategic Advantage via Narrow AI to Prevent Creation of the Dangerous Superintelligence(avturchin)AI capabilitiesReinforcement learningLearning Heuristics for Automated Reasoning through Deep Reinforcement Learning(Gil Lederman et al): The formal methods community uses SAT solvers all the time in order to solve complex search-based problems. These solvers use handtuned heuristics in order to drastically improve performance. The heuristics only affect the choices in the search process, not the correctness of the algorithm overall. Obviously, we should consider using neural nets to learn these heuristics instead. However, neural nets take a long time to run, and SAT solvers have to make these decisions very frequently, so it's unlikely to actually be helpful -- the neural net would have to be orders of magnitude better than existing heuristics. So, they instead do this for QBF (quantified boolean formulas) -- these are PSPACE complete, and the infrastructure needed to support the theory takes more time, so it's more likely that neural nets can actually help. They implement this using a graph neural network and engineer some simple features for variables and clauses. (Feature engineering is needed because there can hundreds of thousands of variables, so you can only have ~10 numbers to describe the variable.) It works well, doing better than the handcoded heuristics.My opinion:For over a year now people keep asking me whether something like this is doable, since it seems like an obvious win combining PL and ML, and why no one has done it yet. I've mentioned the issue about neural nets being too slow, but it still seemed doable, and I was really tempted to do it myself. So I'm really excited that it's finally been done!Oh right, AI alignment. Yeah, I do actually think this is somewhat relevant -- this sort of work could lead to much better theorem provers and formal reasoning, which could make it possible to create AI systems with formal guarantees. I'm not very optimistic about this approach myself, but I know others are.

Deep learningTowards Automated Deep Learning: Efficient Joint Neural Architecture and Hyperparameter Search(Arber Zela et al)Meta learningMeta-Learning with Latent Embedding Optimization(Andrei A. Rusu et al)News$2 Million Donated to Keep Artificial General Intelligence Beneficial and Robust(Ariel Conn): The next round of FLI grants have been announced! There are fewer grants than in theirfirst roundand the topics seem more focused on AGI safety.