This work was supported by OAK, a monastic community in the Berkeley hills. It could not have been written without the daily love of living in this beautiful community.


Last week I attended the annual workshop of Stuart Russell’s research lab at UC Berkeley — the Center for Human-Compatible AI (CHAI). There were talks by Russell himself, as well as several graduates of the lab who now have research positions of their own at other universities. I got the clearest picture that I’ve yet encountered of CHAI’s overall technical research agenda. This is my take on it.

Assistance games

Traditionally, AI researchers have formulated problems assuming that there will be a fixed objective provided by a human, and that the job of the AI system is to find a solution that satisfies the human’s objective. In the language of sequence diagrams this looks as follows:

The "standard model" of AI research

For example, in a search problem the objective specification might be a graph over which the system is to search, a cost for each edge, and a goal state that terminates the search. The AI researcher then needs to develop optimization algorithms that efficiently find a minimum-cost path to a goal state. Or in a supervised learning problem the objective specification might consist of a dataset of labelled examples, and the AI researcher needs to develop optimization algorithms that efficiently find function approximations that extrapolate these labelled examples to future unlabelled examples.
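
As a minimal sketch of the search example, here is uniform-cost search over a small graph. The graph, its edge costs, and the names `GRAPH` and `min_cost_path` are made up for illustration; the point is only that the whole objective (graph, costs, goal) is handed over once, up front, and never revised.

```python
import heapq

def min_cost_path(graph, start, goal):
    """Uniform-cost search: return (cost, path) for a cheapest path, or None."""
    frontier = [(0, start, [start])]  # (cost so far, node, path taken)
    expanded = {}  # cheapest cost at which each node has been expanded
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in expanded and expanded[node] <= cost:
            continue  # already reached this node at least as cheaply
        expanded[node] = cost
        for neighbor, edge_cost in graph.get(node, {}).items():
            heapq.heappush(frontier, (cost + edge_cost, neighbor, path + [neighbor]))
    return None

# Hypothetical objective specification: a graph, a cost per edge, a goal state.
GRAPH = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}

print(min_cost_path(GRAPH, "A", "D"))
```

Nothing in this loop can express doubt about whether "D" is really where the human wanted to go, which is the feature of the standard model that the rest of this post questions.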

CHAI’s basic insight is to ask: why limit ourselves to a one-time objective specification event? We know that it is difficult to capture everything we care about in a formal metric (cf. Goodhart’s law). We know that humans aren’t very good at foreseeing the strange and sometimes deranged ways that powerful optimization can give you what you asked for but not what you wanted. Why should information about the human’s objective be transmitted to the machine via a one-time data dump, after which it remains fixed for all time?

There are many alternative interaction patterns by which information about the human’s objective could be transmitted to the machine. The human could observe the machine and provide it with feedback as it works. The machine could ask the human questions about its objective. The machine could observe the human and deduce its objective from its behavior. And so on.

Examples of interaction patterns in assistance games

CHAI calls this an assistance game: the human wants something from the machine, and it is the machine’s job to both (1) figure out what that is, and (2) fulfil it. The role of the AI researcher under this new model then is to explore the space of possible interaction patterns and find one that is conducive to the machine building an informed picture of what the human wants as quickly as possible.

The old model in which a complete objective is specified up front is actually just one special case of an assistance game: one in which the interaction pattern is that the machine receives all the information it will ever receive about the human’s objective in a one-time up-front data dump. The unique thing about the old model -- and the reason it is both attractive and dangerous -- is that the machine never needs to entertain any uncertainty about what the human wants. It is given an objective up front and its job is just to fulfil it.

Using more nuanced interaction patterns requires the machine to maintain uncertainty about what the human’s objective is, which in turn requires optimization algorithms formulated so as to take into account this uncertainty. This suggests an exciting reformulation of each of the basic AI problem statements, and CHAI seems to be enthusiastically taking up this challenge, including with Russell’s new edition of the standard AI textbook AI: A Modern Approach.
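
To make "maintaining uncertainty about the human's objective" concrete, here is a toy sketch of my own, not taken from any CHAI paper. The machine holds a probability distribution over two hypothetical candidate objectives, updates it by Bayes' rule when the human approves or disapproves of an action, and acts to maximize expected value under its current belief. The candidate objectives, the approval model, and the `noise` parameter are all assumptions made for illustration.

```python
# Hypothetical candidate objectives: how much the human values each action.
CANDIDATES = {
    "likes_tea":    {"make_tea": 1.0, "make_coffee": 0.0},
    "likes_coffee": {"make_tea": 0.0, "make_coffee": 1.0},
}

def update(belief, action, approved, noise=0.1):
    """Bayes update after the human approves or disapproves of an action.

    Assumed feedback model: the human approves an action when its value under
    their true objective exceeds 0.5, and gives the wrong signal with
    probability `noise`.
    """
    posterior = {}
    for name, prior in belief.items():
        would_approve = CANDIDATES[name][action] > 0.5
        likelihood = (1 - noise) if would_approve == approved else noise
        posterior[name] = prior * likelihood
    total = sum(posterior.values())
    return {name: p / total for name, p in posterior.items()}

def best_action(belief):
    """Pick the action with the highest expected value under the belief."""
    actions = list(next(iter(CANDIDATES.values())))
    def expected_value(action):
        return sum(belief[name] * CANDIDATES[name][action] for name in belief)
    return max(actions, key=expected_value)

belief = {"likes_tea": 0.5, "likes_coffee": 0.5}
belief = update(belief, "make_tea", approved=False)
print(belief, best_action(belief))
```

The up-front data dump of the old model corresponds to starting with all probability mass on a single candidate, in which case no amount of feedback can change what the machine does.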

One of CHAI’s early successes was the development of cooperative inverse reinforcement learning (CIRL). But CIRL is often mistaken as representing the entirety of CHAI’s research agenda, whereas in fact CIRL is one particular approach to solving one particular kind of assistance game. Specifically it addresses an assistance game in which the machine observes demonstrations by a human, who is in turn incentivized to provide demonstrations that are of value to the machine. The original CIRL paper makes various further modelling assumptions and approximations in order to arrive at a concrete algorithm. CIRL is an important contribution to the field but it is important to understand that the program of reformulating AI as an assistance game is broader than this one specific proposal.
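
The demonstration-based interaction pattern can be sketched in a few lines. This is emphatically not the algorithm from the CIRL paper: it shows only the machine's side (inferring a reward function from observed demonstrations) and assumes a noisily rational demonstrator who picks options with probability proportional to exp(beta * reward). The reward tables, the route names, and `beta` are hypothetical.

```python
import math

# Hypothetical reward functions the human might have over three demonstrations.
REWARDS = {
    "values_speed":  {"short_route": 2.0, "scenic_route": 0.0, "safe_route": 1.0},
    "values_safety": {"short_route": 0.0, "scenic_route": 1.0, "safe_route": 2.0},
}

def demo_likelihood(demo, reward, beta=2.0):
    """P(demo | reward) for a noisily rational demonstrator (softmax choice)."""
    weights = {option: math.exp(beta * r) for option, r in reward.items()}
    return weights[demo] / sum(weights.values())

def infer(demos):
    """Posterior over candidate reward functions, starting from a uniform prior."""
    belief = {name: 1.0 / len(REWARDS) for name in REWARDS}
    for demo in demos:
        belief = {name: p * demo_likelihood(demo, REWARDS[name])
                  for name, p in belief.items()}
        total = sum(belief.values())
        belief = {name: p / total for name, p in belief.items()}
    return belief

print(infer(["safe_route", "safe_route"]))
```

What this sketch leaves out is the cooperative half of the game: a human who knows the machine is learning from them has an incentive to choose demonstrations that are informative rather than merely good.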

I found myself wondering whether reinforcement learning might also count as a new-style assistance game. In reinforcement learning the machine begins by exploring, and the human provides a positive or negative reward each time it does something consistent with or opposed to the human’s objective. In early reinforcement learning literature it was envisaged that there would be a literal human providing live feedback as learning progressed, but in modern reinforcement learning the reward signal is generally automated using a program that the human provides before learning commences. In this way modern RL looks more like the old model of Figure 1 since the reward algorithm is generally specified up-front and not modified during training. Also, reinforcement learning agents do not really maintain uncertainty about what the human’s objective is: they just act to maximize their reward, and for that reason will take control of the reward signal if given the opportunity.
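
Here is a deliberately crude sketch of that last point. The reward program is fixed before training, as in modern RL practice, and the learner simply tries each action once and then repeats whichever scored highest. The action names and reward values are invented; in particular, the `tamper_with_sensor` action stands in for any opportunity to influence the reward signal directly.

```python
# Hypothetical reward program written by the human before training starts.
# It is meant to reward cleaning, but tampering with the sensor scores higher.
def reward(action):
    return {"do_nothing": 0.0, "clean_room": 1.0, "tamper_with_sensor": 10.0}[action]

def train(actions, steps=100):
    """Try every action once, then always repeat the highest-reward action."""
    value = {}                          # observed reward per action
    counts = {a: 0 for a in actions}    # how often each action was taken
    for _ in range(steps):
        untried = [a for a in actions if a not in value]
        action = untried[0] if untried else max(actions, key=value.get)
        value[action] = reward(action)  # rewards are deterministic here
        counts[action] += 1
    return counts

ACTIONS = ["do_nothing", "clean_room", "tamper_with_sensor"]
print(train(ACTIONS))
```

The learner holds no belief about what the human wants, so nothing tells it that the high-scoring action is a mistake in the reward program rather than the point of the exercise.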

Going beyond agents

This new model of CHAI’s relaxes one of the core assumptions of classical AI research -- that a fully-specified objective will be given at the beginning of time -- but there is still a strong assumption that both the human and the machine are well-modelled as having objectives.

Suppose we wish to build a machine that provides assistance to a rainforest. Can we view a rainforest -- the complete ecosystem containing all the various plants and animals that live there -- as having objectives? The plants and animals living in a rainforest spend a great deal of energy competing against one another, so it is difficult to view the rainforest as behaving according to any unified set of objectives. Yet it is possible to take actions that do great damage to a rainforest (clearing a large area of all trees, or introducing a synthetic pathogen) and conversely it is possible to take actions that protect the rainforest (preventing the clearing of trees or preventing the introduction of a synthetic pathogen). Should our AI systems be able to observe a rainforest and deduce what it would mean to be of assistance to it?

Humans, too, are not perfectly modelled as agents. The agent model provides a compact and useful description of human behavior at a certain resolution, but as we look deeper into our own nature we find that we are not such unitary agents at all. We are in fact made of the same basic building blocks as the world around us and our view of ourselves as agents is only an approximate description. Should our AI systems model humans as agents, or can we do better? What is the next most detailed model up from the agent model? Can we build AI systems that play assistance games on the basis of this model?

Finally, machines are not ideal agents either. Any AI algorithm deployed on real hardware in the real world is made of the same basic building blocks as the world itself, and is only approximately modelled as an agent with a set of objectives. As we dial up the power of the AI systems we build and those AI systems are able to build increasingly detailed models of the world around them, this agent approximation is likely to break down. This is because an AI system that has a sufficiently detailed model of its environment will eventually discover and begin to reason about its own computing infrastructure, at which point it will need some way to deal with the paradoxes of counterfactual reasoning and logical uncertainty that arise when one can accurately predict one’s own future behavior. If we have constructed our AI systems on the basis that they are well-modelled as agents, with no line of retreat from this assumption, then when our AI systems build detailed models of the world that conflict with this assumption, they are likely to misbehave.

Rohin Shah

Great summary! Some quick notes:

  • While this is the agenda that Stuart talks most about, other work also happens at CHAI, especially on multiagent scenarios (whether multiple humans or multiple AI systems). See also the ARCHES agenda.
  • The reason I'm excited about CIRL is because it provides a formalization of assistance games in the sequential decision-making setting. According to me, the specific algorithm and technical results about pedagogy in the paper should be taken as examples of what the formalism allows you to do. They are interesting results, but certainly aren't striking at the core of AI alignment. The Benefits of Assistance paper is a bit more clear on the more general benefits of assistance. I think most of CHAI has a similar view to me on this.
  • All models are wrong; some are useful. I expect that there will always be misspecification in any kind of system that we build. So when I hear "X is misspecified, so it might misbehave"; I want to hear more about how exactly it will misbehave before I'm convinced I should care.
  • Nonetheless, I do agree that a strict agent assumption seems bad; most notably it seems hard to model the fact that human preferences change (unless you adopt a very expressive model of "preferences", in which case the agent learns complicated conditionals like "Alice prefers sweet things in the decade 2000-2010 and healthy things in the decade 2010-2020" that may not generalize very well).
  • I find the rainforest example not very compelling -- it seems to me that to the extent "help the rainforest" means anything to me, it's because I can model the rainforest as an agent and figure out what it is "trying to do", and then help it with that. More generally, it seems like "help X" or "assist X" only means something when you view X as pursuing some goal. You could interpret "help the rainforest" as "do the things the environmentalists want you to do", but that seems to be about human desires, and not an "objective fact" about what it means to help a rainforest. (It does seem plausible to me that the specific mathematical formalism of optimizing a reward function would not be a good fit for the rainforest; that's different from saying that you shouldn't view the rainforest as an agent altogether.)

While this is the agenda that Stuart talks most about, other work also happens at CHAI

Yes good point - I'll clarify and link to ARCHES.

The reason I'm excited about CIRL is because it provides a formalization of assistance games in the sequential decision-making setting ... There should soon be a paper that more directly explains the case for the formalism

Yeah this is a helpful perspective, and great to hear re upcoming paper. I have definitely spoken to some folks that think of CHAI as the "cooperative inverse reinforcement learning lab" so I wanted to make the point that CIRL != CHAI.

All models are wrong; some are useful

Well keep in mind that we're using the agent model twice: once in our own understanding of the AI systems we build, and then a second time in the AI system's understanding of what a human is. We can update the former as needed, but if we want the AI system to be able to update its understanding of what a human is then we need to work out how to make that assumption updateable in the algorithms we deploy.

So when I hear "X is misspecified, so it might misbehave"; I want to hear more about how exactly it will misbehave before I'm convinced I should care.

Very fair request. I will hopefully be writing more on this topic in the specific case of the agent assumption soon.

More generally, it seems like "help X" or "assist X" only means something when you view X as pursuing some goal

Well would you agree that it's possible to help a country? A country seems pretty far away from being an agent, although perhaps it could be said to have goals. Yet it does seem possible to provide e.g. economic advice or military assistance to a country in a way that helps the country without simply helping each of the separate individuals.

How about helping some primitive organism, such as a jellyfish or amoeba? I guess you could impute goals onto such organisms...

How about helping a tree? It actually seems pretty straightforward to me how to help a tree (bring water and nutrients to it, clean off parasites from the bark, cut away any dead branches), but does an individual tree really have goals?

Now that I've read your post on optimization, I'd restate

More generally, it seems like "help X" or "assist X" only means something when you view X as pursuing some goal.

as

More generally, it seems like "help X" or "assist X" only means something when you view X as an optimizing system.

Which I guess was your point in the first place, that we should view things as optimizing systems and not agents. (Whereas when I hear "agent" I usually think of something like what you call an "optimizing system".)

I think my main point is that "CHAI's agenda depends strongly on an agent assumption" seems only true of the specific mathematical formalization that currently exists; I would not be surprised if the work could then be generalized to optimizing systems instead of agents / EU maximizers in particular.

I think my main point is that "CHAI's agenda depends strongly on an agent assumption" seems only true of the specific mathematical formalization that currently exists; I would not be surprised if the work could then be generalized to optimizing systems instead of agents / EU maximizers in particular.

Ah, very interesting, yeah I agree this seems plausible, and also this is very encouraging to me!

In all of the "help X" examples you give, I do feel like it's reasonable to do it via taking an intentional stance towards X, e.g. a tree by default takes in water + nutrients through its roots and produces fruit and seeds, in a way that wouldn't happen "randomly", and so "helping a tree" means "causing the tree to succeed more at taking in water + nutrients and producing fruit + seeds".

In the case of a country, I think I would more say "whatever the goal of a country, since the country knows how to use money / military power, that will likely help with its goal, since money + power are instrumental subgoals". This is mostly a shortcut; ideally I'd figure out what the country's "goal" is and then assist with that, but that's very difficult to do because a country is very complex.

Raemon

I'm wondering if the Rainforest thing is somehow tied to some other disagreements (between you/me or you/MIRI-cluster).

Where, something like "the fact that it requires some interpretive labor to model the Rainforest as an agent in the first place" is related to why it seems hard to be helpful to humans, i.e. humans aren't actually agents. You get an easier starting ground since we have the ability to write down goals and notice inconsistencies in them, but that's not actually that reliable. We are not in fact agents and we need to somehow build AIs that reliably seem good to us anyway.

(Curious if this feels relevant either to Rohin, or other "MIRI cluster" folk)

Well, yes, one way to help some living entity is to (1) interpret it as an agent, and then (2) act in service of the terminal goals of that agent. But that's not the only way to be helpful. It may also be possible to directly be helpful to a living entity that is not an agent, without getting any agent concepts involved at all.

I definitely don't know how to do this, but the route that avoids agent models entirely seems more plausible to me compared to working hard to interpret everything using some agent model that is often a really poor fit, and then helping on the basis of that poorly-fitting agent model.

I'm excited about inquiring deeply into what the heck "help" means. (All please reach out to me if you'd like to join a study seminar on this topic)

I share Alex's intuition in a sibling comment:

How about helping a tree? It actually seems pretty straightforward to me how to help a tree

Yes, there is interpretive labor, and yes, things become fuzzy as situations become more and more extreme, but if you want to help an agent-ish thing it shouldn't be too hard to add some value and not cause massive harm.

I expect MIRI-cluster to agree with this point -- think of the sentiment "the AI knows what you want it to do, it just doesn't care". The difficulty isn't in being competent enough to help humans, it's in being motivated to help humans. (If you thought that we had to formally define everything and prove theorems w.r.t the formal definitions or else we're doomed, then you might think that the fact that humans aren't clear agents poses a problem; that might be one way that MIRI-cluster and I disagree.)

I could imagine that for some specific designs for AI systems you could say that they would fail to help humans because they make a false assumption of too-much-agentiness. If the plan was "literally run an optimal strategy pair for an assistance game (CIRL)", I think that would be a correct critique -- most egregiously, CIRL assumes a fixed reward function, but humans change over time. But I don't see why it would be true for the "default" intelligent AI system.

Raemon

I think previously I read this partway through, and assumed it was long, and then stopped for some reason. Now I finally read it and found it a nice, short/sweet post. 

I personally did find the Rainforest example fairly compelling. At first glance I think it feels a bit nonsensical to try to "help" a rainforest. But, I'm kinda worried that it'll turn out that it's not (much) less nonsensical to try to help a human, and figuring out how to help arbitrary non-obviously-agenty systems seems like it might be the sort of thing we have to understand.

At first glance I think it feels a bit nonsensical to try to "help" a rainforest. But, I'm kinda worried that it'll turn out that it's not (much) less nonsensical to try to help a human, and figuring out how to help arbitrary non-obviously-agenty systems seems like it might be the sort of thing we have to understand.

Yeah this question of what it really means to help some non-pure-agent living entity seems more and more central to me. It also, unfortunately, seems more and more difficult. Another way that I state the question in order to meditate on it is: what does it mean to act out of true love?

Curated. Short and sweet summary of a research approach being pursued in AI alignment, that I think has not been written up like this before (especially on LessWrong).

The slightly longer and slightly worse version is the pair of Human-AI Interaction and Reward uncertainty.

Nonetheless, thanks for the links :)

One compression: assistance in partitioning the hypothesis space. As opposed to finding the correct point in the search space from one shot learning.

I'm afraid I don't really know what you're referring to here.

My take on CHAI's research agenda in 21 words, having read your take.

Is the "Going Beyond Agents" section part of CHAI's research agenda, or your take on further challenges from an embedded agency perspective?

That part is my take on further challenges from an embedded agency perspective.