[AN #94]: AI alignment as translation between humans and machines

Rohin Shah

Alignment Newsletter is a weekly publication with recent content relevant to AI alignment around the world. Find all Alignment Newsletter resources here. In particular, you can look through this spreadsheet of all summaries that have ever been in the newsletter.

Audio version here (may not be up yet).

HIGHLIGHTS

Alignment as Translation (John S Wentworth) (summarized by Rohin): At a very high level, we can model powerful AI systems as moving closer and closer to omniscience. As we move in that direction, what becomes the new constraint on technology? This post argues that the constraint is good interfaces, that is, something that allows us to specify what the AI should do. As with most interfaces, the primary challenge is dealing with the discrepancy between the user's abstractions (how humans think about the world) and the AI system's abstractions, which could be very alien to us (e.g. perhaps the AI system uses detailed low-level simulations). The author believes that this is the central problem of AI alignment: how to translate between these abstractions in a way that accurately preserves meaning.

The post goes through a few ways in which we could attempt to do this translation, but all of them seem to only reduce the amount of translation that is necessary: none of them solve the chicken-and-egg problem of how you do the very first translation between the abstractions.

Rohin's opinion: I like this view on alignment, but I don't know if I would call it the central problem of alignment. It sure seems important that the AI is optimizing something: this is what prevents solutions like "make sure the AI has an undo button / off switch", which would be my preferred line of attack if the main source of AI risk were bad translations between abstractions. There's a longer discussion on this point here.

TECHNICAL AI ALIGNMENT

AGENT FOUNDATIONS

Two Alternatives to Logical Counterfactuals (Jessica Taylor)

LEARNING HUMAN INTENT

State-only Imitation with Transition Dynamics Mismatch (Tanmay Gangwani et al) (summarized by Zach): Most existing imitation learning algorithms rely on the availability of expert demonstrations that come from the same MDP as the one the imitator will be evaluated in. With the advent of adversarial inverse reinforcement learning (AIRL) (AN #17), it has become possible to learn general behaviors. However, algorithms such as GAIL (AN #17) are capable of learning with just state information, something that AIRL was not designed for. In this paper, the authors introduce indirect-imitation learning (I2L) to try to merge the benefits of both GAIL and AIRL. The basic sketch of the algorithm is to first use a generalization of AIRL to imitate demonstrations via a buffer distribution and then focus on moving that buffer closer to the expert's demonstration distribution using a Wasserstein critic, a smoother way to train GANs. By combining these two approaches, agents trained with I2L learn how to control Ant in regular gravity and can generalize to perform in simulations with differing parameters for gravity. For the suite of Gym continuous domains, they show consistent advantages for I2L over other algorithms such as GAIL, BCO, and AIRL when parameters such as friction, density, and gravity are changed.

Prerequisites: Wasserstein GAN

Zach's opinion: The main contribution in this paper seems to be deriving a new bound so that AIRL can handle state-only imitation learning. The use of indirection via a buffer is also interesting and seems to be a good idea to provide stability in training. However, they did not do an ablation. Overall, it's aesthetically interesting that this paper is borrowing tricks such as buffering and a Wasserstein critic. Finally, the results seem promising, particularly for the sim-to-real problem. It would be interesting to see a follow-up to gauge whether or not I2L can help bridge this gap.

The MineRL Competition on Sample-Efficient Reinforcement Learning Using Human Priors: A Retrospective (Stephanie Milani et al) (summarized by Rohin): This paper reports on the results of the MineRL competition (AN #56), in which participants had to train agents to obtain a diamond in Minecraft using a limited amount of compute, environment interactions, and human demonstrations. While no team achieved this task, one team did make it to the penultimate milestone: obtaining an iron pickaxe.

The top nine teams all used some form of action reduction: that is, they constrained their agents to only take a subset of all available actions, shaping the space in which the agent had to learn and explore. The top four teams all used some form of hierarchy in order to learn longer "options" that could then be selected from. The second place team used pure imitation learning (and so required no environment interactions), while the eighth and ninth place teams used pure reinforcement learning (and so required no human demonstrations).

Rohin's opinion: I was surprised to see pure RL solutions rank in the leaderboard, given the limitations on compute and environment interactions. Notably though, while the second place team (pure imitation) got 42.41 points, the eighth place team (pure RL) only got 8.25 points.

More generally, I was excited to see an actual benchmark for techniques using human demonstrations: so far there hasn't been a good evaluation of such techniques. It does seem like Minecraft benefits a lot from hierarchy and action pruning, which we may not care about when evaluating algorithms.

Sample Efficient Reinforcement Learning through Learning from Demonstrations in Minecraft (Christian Scheller et al) (summarized by Rohin): This paper explains the technique used by the 3rd place team in the MineRL competition (summarized above). They used behavior cloning to train their neural net on human demonstrations, and then used reinforcement learning (specifically, IMPALA) with experience replay and advantage clipping to improve. There are more details about their architecture and design choices in the paper.

HANDLING GROUPS OF AGENTS

Equilibrium and prior selection problems in multipolar deployment (Jesse Clifton) (summarized by Rohin): Consider the scenario in which two principals with different terminal goals will separately develop and deploy learning agents that will then act on their behalf. Let us call this a learning game, in which the "players" are the principals, and the actions are the agents developed.

One strategy for this game is for the principals to first agree on a "fair" joint welfare function, such that they and their agents are then licensed to punish the other agent if they take actions that deviate from this welfare function. Ideally, this would lead to the agents jointly optimizing the welfare function (while being on the lookout for defection).

There still remain two coordination problems. First, there is an equilibrium selection problem: if the two deployed learning agents are Nash strategies from different equilibria, payoffs can be arbitrarily bad. Second, there is a prior selection problem: given that there are many reasonable priors that the learning agents could have, if they end up with different priors from each other, outcomes can again be quite bad, especially in the context of threats (AN #86).
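The equilibrium selection problem can be made concrete with a tiny two-action game. The sketch below is not from the post: the payoff numbers, the function name `is_pure_nash`, and the choice of a two-by-two game are all assumptions made for illustration. It finds the pure Nash equilibria and then shows what happens when each agent plays its half of a different equilibrium.

```python
# Hypothetical 2x2 coordination game; the payoff numbers are illustrative only.
# payoff_1[a1][a2] is agent 1's payoff, payoff_2[a1][a2] is agent 2's payoff.
payoff_1 = [[2.0, -10.0],
            [-10.0, 1.0]]
payoff_2 = [[1.0, -10.0],
            [-10.0, 2.0]]

def is_pure_nash(a1, a2):
    """True if neither agent gains by unilaterally deviating from (a1, a2)."""
    best_for_1 = max(payoff_1[alt][a2] for alt in range(2))
    best_for_2 = max(payoff_2[a1][alt] for alt in range(2))
    return payoff_1[a1][a2] == best_for_1 and payoff_2[a1][a2] == best_for_2

pure_equilibria = [(a1, a2)
                   for a1 in range(2)
                   for a2 in range(2)
                   if is_pure_nash(a1, a2)]

# Agent 1 was built to play its part of equilibrium (0, 0),
# agent 2 was built to play its part of equilibrium (1, 1).
a1, a2 = 0, 1
mismatch_payoffs = (payoff_1[a1][a2], payoff_2[a1][a2])

print("pure equilibria:", pure_equilibria)
print("payoffs under mismatch:", mismatch_payoffs)
```

Each agent is playing a Nash strategy, yet both land on the off-diagonal penalty. Since that penalty can be made as negative as desired without changing the set of equilibria, the mismatch outcome can be arbitrarily bad.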

Rohin's opinion: These are indeed pretty hard problems in any non-competitive game. While this post takes the framing of considering optimal principals and/or agents (and so considers Bayesian strategies in which only the prior and choice of equilibrium are free variables), I prefer the framing taken in our paper (AN #70): the issue is primarily that the optimal thing for you to do depends strongly on who your partner is, but you may not have a good understanding of who your partner is, and if you're wrong you can do arbitrarily poorly.

FORECASTING

Openness Norms in AGI Development (Sublation) (summarized by Rohin): This post summarizes two papers that provide models of why scientific research tends to be so open, and then applies it to the development of powerful AI systems. The first models science as a series of discoveries, in which the first academic group to reach a discovery gets all the credit for it. It shows that for a few different models of info-sharing, info-sharing helps everyone reach the discovery sooner, but doesn't change the probabilities for who makes the discovery first (called race-clinching probabilities): as a result, sharing all information is a better strategy than sharing none (and is easier to coordinate on than the possibly-better strategy of sharing just some information).

However, this theorem doesn't apply when info sharing compresses the discovery probabilities unequally across actors: in this case, the race-clinching probabilities do change, and the group whose probability would go down is instead incentivized to keep information secret (which then causes everyone else to keep their information secret). This could be good news: it suggests that actors are incentivized to share safety research (which probably doesn't affect race-clinching probabilities) while keeping capabilities research secret (thereby leading to longer timelines).
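To see why equal and unequal speedups behave differently, here is a toy calculation. It is an assumption for illustration, not the model used in either paper: each group's time to discovery is taken to be exponentially distributed with its own rate, so the probability that a group is first equals its rate divided by the sum of all rates. The rates and speedup factors below are made up.

```python
def clinch_probabilities(rates):
    """Probability that each group is first, for independent exponential discovery times."""
    total = sum(rates)
    return [r / total for r in rates]

def expected_time_to_first_discovery(rates):
    """Mean of the minimum of independent exponential times with the given rates."""
    return 1.0 / sum(rates)

# Hypothetical discovery rates for two groups with no information sharing.
base = [1.0, 3.0]

# Sharing speeds up both groups by the same factor.
equal_speedup = [2.0 * r for r in base]

# Sharing helps the slower group far more than the faster one.
unequal_speedup = [4.0 * base[0], 1.0 * base[1]]

for name, rates in [("base", base),
                    ("equal speedup", equal_speedup),
                    ("unequal speedup", unequal_speedup)]:
    print(name, clinch_probabilities(rates), expected_time_to_first_discovery(rates))
```

Under the equal speedup, the discovery arrives sooner but each group's chance of being first is unchanged. Under the unequal speedup, the second group's chance of being first falls, which is the situation in which it would prefer secrecy.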

The second paper assumes that scientists are competing to complete a k-stage project, and whenever they publish, they get credit for all the stages they completed that were not yet published by anyone else. It also assumes that earlier stages have a higher credit-to-difficulty ratio (where difficulty can be different across scientists). It finds that under this setting scientists are incentivized to publish whenever possible. For AI development, this seems not to be too relevant: we should expect that with powerful AI systems, most of the "credit" (profit) comes from the last few stages, where it is possible to deploy the AI system to earn money.

Rohin's opinion: I enjoyed this post a lot; the question of openness in AI research is an important one that depends both on the scientific community and industry practice. The scientific community is extremely open, and the second paper especially seems to capture well the reason why. In contrast, industry is often more secretive (plausibly due to patents (AN #88)). To the extent that we would like to change one community in the direction of the other, a good first step is to understand their incentives so that we can then try to change those incentives.

MISCELLANEOUS (ALIGNMENT)

Takeaways from safety by default interviews (Asya Bergal) (summarized by Rohin): This post lists three key takeaways from AI Impacts' conversations with "optimistic" researchers (summarized mainly in AN #80 with one in AN #63). I'll just name the takeaways here; see the post for more details:

1. Relative optimism in AI often comes from the belief that AGI will be developed gradually, and problems will be fixed as they are found rather than neglected.

2. Many of the arguments I heard around relative optimism weren’t based on inside-view technical arguments.

3. There are lots of calls for individuals with views around AI risk to engage with each other and understand the reasoning behind fundamental disagreements.

Rohin's opinion: As one of the people interviewed, these seem like the right high-level takeaways to me.

OTHER PROGRESS IN AI

REINFORCEMENT LEARNING

Robots Learning to Move like Animals (Xue Bin Peng et al) (summarized by Rohin): Previous work (AN #28) has suggested that we can get good policies by estimating and imitating poses. This work takes this idea and tries to make it work with sim-to-real transfer. Domain randomization would result in a policy that must be robust to all the possible values of the hidden parameters (such as friction). To make the problem easier, they do domain randomization, but give the agent access to (a latent representation of) the hidden parameters, so that its policy can depend on the hidden parameters. Then, to transfer to the real world, they simply need to search over the latent representation of the hidden parameters in order to find one where the policy actually works in the real world. In practice, they can adapt to the real world with just 8 minutes of real world data.
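The adaptation step can be sketched in a few lines. Everything here is a hypothetical stand-in rather than the paper's method: the one-dimensional "robot", the functions `conditioned_policy` and `real_world_return`, the assumption that the latent is simply the friction value, and the plain grid search used in place of whatever search procedure the authors use. The point is only that the policy stays fixed and the search runs over the latent, using nothing but returns observed in the real world.

```python
def conditioned_policy(z, target_speed=1.0):
    """Stand-in for a policy trained with domain randomization.

    It pushes harder when the latent z (here, an encoded friction value) is larger.
    """
    return target_speed * z

def real_world_return(action, true_friction=0.7, target_speed=1.0):
    """Hypothetical real-world rollout; the searcher never sees true_friction.

    The resulting speed is action / friction, and the return penalizes speed error.
    """
    speed = action / true_friction
    return -(speed - target_speed) ** 2

def search_latent(candidates):
    """Pick the latent whose conditioned policy scores best on real-world rollouts."""
    return max(candidates, key=lambda z: real_world_return(conditioned_policy(z)))

# Candidate latents 0.1, 0.2, ..., 1.5; each one costs a real-world rollout.
candidates = [i / 10 for i in range(1, 16)]
best_z = search_latent(candidates)
print("selected latent:", best_z)
```

Because only the low-dimensional latent is searched, the number of real-world rollouts needed is small, which is consistent with the small amount of real-world data reported in the summary.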

Rohin's opinion: This is a cool improvement to domain randomization: it seems like it should be distinctly easier to learn a policy that is dependent on the hidden parameters, and that seems to come at the relatively low cost of needing just a little real world data.

FEEDBACK

I'm always happy to hear feedback; you can send it to me, Rohin Shah, by replying to this email.

PODCAST

An audio podcast version of the Alignment Newsletter is available, recorded by Robert Miles.
