Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

We’ve just released our paper on human-AI collaboration. The paper makes a point that seems straightforward to me: self-play training is not going to work as well with humans in collaborative settings as it does in competitive settings. In both cases, humans cause a distribution shift for the self-play agent. In the competitive case, however, the self-play agent should move towards the minimax policy, which has the nice property of guaranteeing a certain level of reward regardless of the opponent. The collaborative case has no such guarantee, and the distribution shift can tank the team’s performance. We demonstrate this empirically on a simplified version of the couch co-op game Overcooked (which is amazing; I’ve played through both Overcooked games with friends).
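To make the contrast concrete, here is a minimal numerical sketch (mine, not from the paper; the payoff matrices and strategies are purely illustrative). In a zero-sum game the minimax strategy guarantees the game’s value no matter who it plays, while in a common-payoff coordination game a self-play agent that settles on one convention can score zero with a partner who settled on the other.

```python
# Illustrative toy example: minimax guarantees in a zero-sum game vs.
# convention mismatch in a common-payoff game.
import numpy as np

def expected_payoff(payoffs, p_row, p_col):
    """Expected payoff to the row player under mixed strategies p_row, p_col."""
    return p_row @ payoffs @ p_col

# Zero-sum "matching pennies": row player gets +1 on a match, -1 otherwise.
zero_sum = np.array([[1.0, -1.0],
                     [-1.0, 1.0]])

# The minimax strategy (uniform random) guarantees the game value (0)
# no matter what the column player does.
minimax = np.array([0.5, 0.5])
worst_case = min(expected_payoff(zero_sum, minimax, np.eye(2)[j]) for j in range(2))
print(worst_case)  # 0.0 -- guaranteed regardless of the opponent

# Common-payoff coordination game: both players score 1 if they pick the
# same action, 0 otherwise. Self-play converges to *some* convention.
common_payoff = np.eye(2)
self_play_agent = np.array([1.0, 0.0])   # learned convention: always action 0
human_partner = np.array([0.0, 1.0])     # human convention: always action 1

print(expected_payoff(common_payoff, self_play_agent, self_play_agent))  # 1.0 in self-play
print(expected_payoff(common_payoff, self_play_agent, human_partner))    # 0.0 with the human
```

There is no policy here that guarantees a good score against an arbitrary partner; doing well requires actually matching the partner’s convention, which is exactly what the distribution shift breaks.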

As with a previous post, the rest of this post assumes that you’ve already read the blog post. I’ll speculate about how the general area of human-AI collaboration is relevant for AI alignment. Think of these as rationalizations of the research after the fact.

It’s necessary for assistance games

Assistance games (formerly called CIRL games) involve a human and an agent working together to optimize a shared objective that only the human knows. I think the general framework makes a lot of sense. Unfortunately, assistance games are extremely intractable to solve. If you try to scale up assistance games as a whole with deep RL, you have to keep the environment strategically simple, because it’s hard to do preference learning and coordination simultaneously. This suggests trying to make progress on subproblems within assistance games.
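For concreteness, an assistance game is roughly a two-player Markov game with identical payoffs (following the original CIRL formulation; the notation here is mine and slightly simplified):

$$M = \langle S, \{A^H, A^R\}, T, \Theta, R, P_0, \gamma \rangle,$$

where the human $H$ and the agent $R$ act in the same environment, the shared reward $R(s, a^H, a^R; \theta)$ depends on a parameter $\theta \in \Theta$, and only the human observes $\theta$. Solving the game as a whole requires the agent to do preference learning (infer $\theta$ from the human’s behavior) and coordination (act well alongside the human’s policy) at the same time.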

Usually, when people talk about making progress on “the CIRL agenda”, they are talking about the preference learning aspect of an assistance game. We typically simplify to a single-agent setting and do preference learning, as in learning from comparisons or demonstrations. However, a useful agent will also need to properly coordinate with the human in order to be efficient. This suggests work on human-AI collaboration. We can work on this problem independently of preference learning simply by assuming that the agent knows the true reward function. This is exactly the setting that we study.
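One way to write down the simplified setting (notation mine): with the reward function known, the problem is a two-player common-payoff game, and the quantity we care about is

$$J(\pi_R, \pi_H) = \mathbb{E}\left[\sum_t R(s_t, a^R_t, a^H_t) \,\middle|\, a^R_t \sim \pi_R,\; a^H_t \sim \pi_H\right].$$

Self-play optimizes $J(\pi_R, \pi_R)$, that is, performance when partnered with a copy of itself, whereas what matters at deployment is $J(\pi_R, \pi_H)$ for the actual human policy $\pi_H$. These can come apart badly when humans follow different conventions than the ones self-play happened to converge to.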

In general, I expect that if one hopes to take an assistance-game-like approach to AI alignment, work on human-AI collaboration will be necessary. The main uncertainty is whether assistance games are the right approach. Under a learning-based model of AI development, I think it is reasonably likely that the assistance game paradigm will be useful, without solving all problems (in particular, it may not solve inner alignment).

It seems important to figure out coordination

Regardless of whether we use assistance games, it’s probably worthwhile to figure out how an AI system should coordinate with another agent that is not like itself. I don’t have a concrete story here; it’s just a general broad intuition.

It leads to more human-AI research

On my model, the best reason for optimism is that researchers will try to build useful AI systems, run into problems, and then fix those problems. Under this model, a useful intervention is to discover the problems sooner. This isn’t completely clear: if you discover the problems sooner, the root causes may be less obvious, and you may be less likely to fix the entire problem. Still, I think the main effect is an increase in safety.

This would be my guess for how this research will most impact AI safety. We (by which I mean mostly Micah, and to a lesser extent me) spent a bunch of time cleaning up the code, making it easy for others to work with, creating nice figures, writing up a good blog post, etc., in an effort to get other ML researchers to actually make progress on these issues. (That said, I wouldn’t be too surprised if other researchers end up using the environment for a different purpose.)

Comments

Agents trained via general deep RL algorithms (self-play, population-based training) in collaborative environments are very good at coordinating with themselves. They’re not able to handle human partners well, since they have never seen humans during training.

Test environment: “Overcooked” is a video game in which players control chefs in a kitchen, cooking and serving dishes. Each dish takes several high-level actions, and the challenge is motion coordination.

Agents should learn how to (1) navigate the map, (2) interact with objects, (3) drop objects off in the right locations, (4) serve completed dishes at the serving area, and (5) maintain spatial awareness and coordinate with their partner.

Takeaways: agents that were explicitly designed to work well with a human model achieved significantly better performance. The PPO+BC agent learns to use both pots and can take both leader and follower roles. The remaining challenge is that the BC model is not adaptive, so the agent cannot learn to take advantage of human adaptivity; real humans learn throughout the episode to anticipate what will happen.
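A toy sketch of the train-with-a-human-model idea (mine, not the paper’s code): fit a crude “behavior-cloned” model of the human from logged actions, then compute a best response to that fixed model instead of to a copy of yourself. The paper does this with a neural BC model and PPO; here both steps are reduced to counting and an argmax on a two-action coordination game.

```python
# Toy illustration of "best-respond to a learned human model" (not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

# Common-payoff coordination game: reward 1 if the two players pick the same action.
common_payoff = np.eye(2)

# Step 1: crude "behavior cloning" -- estimate the human's action distribution
# from logged human actions (real BC fits a neural net conditioned on state).
human_actions = rng.choice([0, 1], size=100, p=[0.2, 0.8])  # this human mostly plays action 1
bc_model = np.bincount(human_actions, minlength=2) / len(human_actions)

# Step 2: best response to the human model (stands in for PPO trained with the BC partner).
expected_reward = common_payoff @ bc_model        # expected reward of each agent action
best_response = int(np.argmax(expected_reward))   # -> action 1

print(bc_model, best_response)
# A self-play agent could just as easily have settled on action 0, scoring ~0.2
# with this human; the best response to the BC model scores ~0.8.
```

The limitation noted above also shows up here: the BC model is a fixed distribution, so nothing in this procedure can exploit the fact that a real human would adapt to the agent over the course of an episode.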

Really enjoyed the paper; agents learning to collaborate with humans rather than compete against them is a positive direction.

I suppose "minimax policy" is a shortcut for "assume that your human partner is a complete idiot just clicking things randomly or worse; choose the course of action that prevents them from causing the worst possible disaster; if you afterwards still have some time and energy left, use them to gain some extra points".

I welcome our condescending AI overlords, because they are probably the best we can realistically hope for.

That's what a minimax policy would be in a collaborative game, but I was talking about a competitive game. There, a minimax policy corresponds to assuming that your human opponent is perfectly optimal.
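Spelling the definition out (notation mine): in both cases the minimax (maximin) policy is

$$\pi^{\text{mm}} \in \arg\max_{\pi} \min_{\pi'} \mathbb{E}\big[R_{\text{self}} \mid \pi, \pi'\big].$$

In a zero-sum game this guarantees at least the game’s value against any opponent, and a perfectly optimal opponent is exactly the worst case being minimized over. In a common-payoff game the inner minimum corresponds to an adversarial partner, which is why it reads as “assume your partner causes the worst possible disaster”; the resulting maximin value can be far below what a well-coordinated pair achieves.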

What do you think about the possibility that, in practice, a really good strategy might be to not learn about humans at all, but just to learn to adapt to whatever player is out there (if you're powerful enough to use a simplicity prior, you only make a small finite number of mistakes relative to the best hypothesis in your hypothesis space)? I think it might exacerbate the issues CIRL has with distinguishing humans from the environment.
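(To spell out the kind of bound I have in mind: for Bayesian mixture prediction over a hypothesis class with prior $w$, the cumulative log-loss of the mixture exceeds that of any hypothesis $h$ in the class by at most $\log \frac{1}{w(h)}$; with a simplicity prior $w(h) = 2^{-K(h)}$, the overhead is on the order of the description length of that hypothesis.)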

So in general the larger the set of agents that you have to be robust to, the less you are able to coordinate with them. If you need to coordinate with a human on saying the same word out of "Heads" and "Tails", you can probably do it. If you need to coordinate with an arbitrary agent about that, you have no hope.

if you're powerful enough to use a simplicity prior, you only make a small finite number of mistakes relative to the best hypothesis in your hypothesis space

... But when you update on the evidence that you see, you are learning about humans? I'm not sure why you say this is "not learn[ing] about humans at all".

Also, I disagree with "small" in that quote, but that's probably not central.

Also, our AI systems are not going to be powerful enough to use a simplicity prior.

... But when you update on the evidence that you see, you are learning about humans? I'm not sure why you say this is "not learn[ing] about humans at all".

Maybe I should retract this to "not learning about humans at train time," but on the other hand, maybe not. The point here is probably worth explaining, and then some rationalist taboo is probably in order.

What's that quote (via Richard Rorty summarizing Donald Davidson)? "If you believe that beavers live in deserts, are pure white in color, and weigh 300 pounds when adult, then you do not have any beliefs, true or false, about beavers." There is a certain sort of syntactic aboutness that we sometimes care about, not merely that our model captures the function of something, but that we can access the right concept via some specific signal.

When you train the AI on datasets of human behavior, the sense in which it’s “learning about humans” isn’t merely about its function in a specific environment at test time. It’s learning a model that is forced to capture human behavior in a wide variety of contexts, and it’s learning this model in such a way that you, the programmer, can access it later to make use of it for planning, and be confident that you’re not like the person trying to use the label “beaver” to communicate with someone who thinks beavers live in deserts.

When the purely-adaptive AI "learns about humans" during test time, it has fewer of those nice properties. It is not forced to make a broad model of humans, and in fact it doesn't need to distinguish humans from complicated human-related parts of the environment. If you think humans come with wifi and cellphone reception, and can use their wheels to travel at speeds upwards of 150 kph, I'm suspicious about your opinion on how to satisfy human values.

Also, I disagree with "small" in that quote, but that's probably not central.
Also, our AI systems are not going to be powerful enough to use a simplicity prior.

Fair enough (though it can be approximated surprisingly well, and many effective learning algorithms aspire to similar bounds on error relative to best in class). So do you think this means that pre-training a human model will in general be a superior solution to having the AI adapt to its environment? Or just that it will be important in enough specific cases (e.g. certain combinations of availability of human data, ease of simulation of the problem, and ease of extracting a reward signal from the environment) that the "engineers on the ground" will sit up and take notice?

Oh, I see, thanks for the explanation. In that case, I basically agree with your original comment: it seems likely that eventually we will have AI systems that simply play well with their environment. However, I think that by that point AI systems will have a good concept of "human", because humans are a natural category / the "human" concept carves reality at its joints.

(I anticipate pushback on the claim that the "human" concept is natural; I'm not going to defend it here, though if the claim is unclear I'm happy to clarify.)