Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

If you end up applying this post, please do it in the name of safety research.

I - intro

Suppose I need groceries. When I make a plan to get more groceries, what I don't do is fully model the world (e.g. as a Markov process), find a policy that leads to me getting groceries, and then execute that policy. Among the reasons why not, there are two at the top of the heap both in importance and ridiculousness. First, this requires imagining every situation I can think of and making a plan in case of every eventuality. Imagine me sitting at home, thinking "Yes, well if the peanut butter is on the third shelf, I'll do this, but if it's on the second shelf, I'll do that..." Second, every plan I make, for all of these situations, would have to be in the most exacting detail possible. Not "Then I'll take the peanut butter," but "Then I will activate muscle group #1938 at 43% intensity and muscle group #552 at 9% intensity, and..."

What I actually do is think of some abstract plan like "I'm going to bike to the supermarket down the street, then get some veggies and tofu and stuff, and come home." Then to get started, I figure out how to bike to the supermarket in terms of actions like "put on my shoes and a coat, and grab my bike." Then to get started on that, I make a plan to put on my shoes in terms of actions like "Insert feet into shoes," which itself is done with a plan that looks like "Activate muscle group #698 at 35% intensity, and..."

This sort of hierarchical planning is vital to operating in the real world with any sort of efficiency. In this post I will verbally outline some hierarchical planners - "context agents" - followed by a more precise definition, and finally talk about what the values of context agents look like and why I think that's interesting. 

Context agents as stated are going to be pretty impractical, but if you're clever there are ways to make them cheaper. The main point of introducing them is to talk about what values and Goodhart's law look like in the context of hierarchical planning - I'm not wedded to this exact formulation by any means, but I hope you'll agree they help describe something interesting.

II - description

The basic part of a context agent is the contexts. A context is mainly just an MDP-type model of the world. It models the world as having states and actions - at each state within the context, there is some set of actions you can take, and each action advertises that taking it results in some probability distribution over other states in the context. Goals are given by some reward function that is exogenous to the context (so that we can talk about using the same context with different reward functions); the context agent plans out a policy that maximizes reward and samples the first action from that policy.
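To make the single-context picture concrete, here is a minimal sketch, assuming a tiny tabular context and using value iteration as a stand-in for whatever planner the agent really uses. The names `plan_first_action`, `grocery_context`, `want_groceries` and `want_to_stay_home` are all hypothetical, and the discount factor is my own addition. The point it illustrates is that the same context can be reused with different exogenous reward functions.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str
# transitions[state][action] -> list of (next_state, probability)
Transitions = Dict[State, Dict[Action, List[Tuple[State, float]]]]
Reward = Callable[[State, Action, State], float]


def plan_first_action(transitions: Transitions, reward: Reward, start: State,
                      gamma: float = 0.9, sweeps: int = 100) -> Action:
    """Run value iteration in one context, then return the greedy action at `start`."""
    values = {s: 0.0 for s in transitions}

    def q(s: State, a: Action) -> float:
        # Expected reward plus discounted value of the advertised next states.
        return sum(p * (reward(s, a, s2) + gamma * values[s2])
                   for s2, p in transitions[s][a])

    for _ in range(sweeps):
        # States with no actions are treated as terminal with value 0.
        values = {s: max((q(s, a) for a in transitions[s]), default=0.0)
                  for s in transitions}
    return max(transitions[start], key=lambda a: q(start, a))


# A toy context (assumed for illustration only).
grocery_context: Transitions = {
    "at home": {"go to the store": [("at store", 0.9), ("at home", 0.1)],
                "stay in": [("at home", 1.0)]},
    "at store": {"buy groceries": [("have groceries", 1.0)],
                 "go home": [("at home", 1.0)]},
    "have groceries": {},
}


def want_groceries(s: State, a: Action, s2: State) -> float:
    return 1.0 if s2 == "have groceries" else 0.0


def want_to_stay_home(s: State, a: Action, s2: State) -> float:
    return 1.0 if a == "stay in" and s2 == "at home" else 0.0


print(plan_first_action(grocery_context, want_groceries, "at home"))
print(plan_first_action(grocery_context, want_to_stay_home, "at home"))
```

Swapping the reward function changes the chosen action without touching the context itself, which is all "exogenous" is meant to convey here.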

Of course, since this is supposed to be a model of hierarchical planning, there's more than one context. Actions serve a dual purpose here: within the context, they predict state transitions, but once an action is chosen it transitions the agent to a new context plus a goal. I make the plan to get groceries in terms of states like "at home" and actions like "go to the store," but once I select the action "go to the store," I figure out how to go to the store by making a new, more fine-grained plan in terms of actions like "put on my shoes." "Go to the store" goes from being an action in the parent context to defining the goal in the child context.

There are plenty of similar takes on hierarchical reasoning out there in the literature (e.g.), and these tend to really be hierarchical: when making a plan, you have to only be moving from general to specific, and each level of this hierarchy is a description of the same stuff, just in different terms.

Context agents are more relaxed about this. You can even have contexts with a cyclical relationship, where you can have two contexts A and B such that sometimes using way A of making plans leads to context B, but also vice versa. The context switching part of actions is very broad, with no particular restrictions on what you switch from or to.

An example? How about if while biking to the store, the light on my bike dies - and so I plan to buy batteries at the store. In this situation I started out making a plan to "go to the store," which put me in a bicycle-operating context, but it turns out that I was in the state "the light on my bike is out," which led me to an action like "buy some batteries," which led right back to "go to the store" (albeit with a different reward function!).

Of course, infinite loops are bad. But all that's important here is that you eventually take a real-world action ("muscle group #552" etc.) with probability 1 [1]. In the context agent formalism, there's some set of actions that are the real-world outputs of the agent, and selecting one of these actions updates the observations available to the agent according to some discrete-time function of the agent's output history (the environment). Then, at least in the nice simple case we're covering in this post, the agent jumps back to the origin of all its plans and plots out the next timestep.


[1]: Guaranteeing that you eventually get output is a somewhat tricky problem. If we assume that there's some "basic actions only" context where you are forced to choose an $a \in A$ (a real-world action), it gets a bit easier and looks like an ergodicity condition on the transition dynamics between contexts. These conditions are actually pretty easy to fulfill, because the non-ergodic transition matrices are of measure zero, but only if transitions are probabilistic rather than deterministic. Therefore, if the search process is deterministic, and the found policy is deterministic, we have to worry more about finding ourselves in a cycle, and in that case deliberately injecting noise can help avoid cycles.
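As a rough illustration of the cycle worry, here is a toy sketch, assuming two hypothetical contexts "A" and "B" whose deterministic planner always prefers switching to the other one. The names `OPTIONS`, `PREFERRED` and `steps_until_output`, and the epsilon-style noise, are assumptions for the example rather than part of the formalism.

```python
import random
from typing import Dict, List, Optional

# Each toy context lists its available actions. An action is either the name
# of another context (a switch) or "OUTPUT" (standing in for a real-world action).
OPTIONS: Dict[str, List[str]] = {
    "A": ["B", "OUTPUT"],
    "B": ["A", "OUTPUT"],
}
# A deterministic planner that happens to prefer switching in both contexts.
PREFERRED: Dict[str, str] = {"A": "B", "B": "A"}


def steps_until_output(epsilon: float, rng: random.Random,
                       max_steps: int = 10_000) -> Optional[int]:
    """Count context switches before a real-world action; None if the cap is hit."""
    context = "A"
    for step in range(max_steps):
        if rng.random() < epsilon:
            action = rng.choice(OPTIONS[context])  # injected noise
        else:
            action = PREFERRED[context]            # deterministic choice
        if action == "OUTPUT":
            return step
        context = action
    return None


print(steps_until_output(0.0, random.Random(0)))  # deterministic: never outputs
print(steps_until_output(0.2, random.Random(0)))  # noisy: outputs after some switches
```

With `epsilon = 0` the agent bounces between the two contexts until the step cap. With any positive `epsilon`, each step has a fixed chance of producing output, so the loop ends with probability 1.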

III - more specific description

A context agent (to use the usual jargon) is a tuple $(C, c_0, R_0, A, O, T)$ where:

  • $C$ is the set of contexts.
  • $c_0 \in C$ is the starting context.
  • $R_0$ is a reward function given for the starting context. Reward functions are (in general) of type $S_c \times A_c \times S_c \to \mathbb{R}$. You can have a reward that depends on what state you started in, what action you took, and what state you ended up in.
  • $A$ (with elements $a$) is the set of real-world actions or outputs.
  • $O$ is the set of possible sensory observations.
  • $T$ is the transition function between contexts. This is of type $C \times A_c \to C \times \mathcal{R}$, where $\mathcal{R}$ is the set of reward functions. It takes in your current context and the action taken in that context, and gives you back the new context and a reward function for that context. You could also think of this split into context-specific transition functions $T_c$.

Each context $c \in C$ is in turn a tuple $(S_c, A_c, P_c, I_c)$, where:

  • $S_c$ is the set of states.
  • $A_c(s)$ is the set of available actions for each $s$ in $S_c$.
  • $P_c$ are the transition probabilities between states, $P_c(s' \mid s, a)$.
  • $I_c$ is the context's inference function for what state you're in, given the agent's entire history. This is of type $(A \times O)^* \to S_c$. This is a potentially impractical thing that I am going to ignore for now because it doesn't bear much on the alignment bits I want to talk about in this post.
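The two tuples above can be written down as plain data structures. This is a sketch under assumptions: contexts are keyed by string names, states and actions are strings, and the field names (`states`, `actions`, `transitions`, `infer_state`, `switch`, and so on) are my own labels for the components described in the lists, not established notation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Set, Tuple

State = str
Action = str
Observation = str
History = Sequence[Tuple[Action, Observation]]    # past (output, observation) pairs
Reward = Callable[[State, Action, State], float]  # (s, a, s') -> reward


@dataclass
class Context:
    states: Set[State]
    actions: Dict[State, List[Action]]                           # available actions per state
    transitions: Dict[Tuple[State, Action], Dict[State, float]]  # P(s' | s, a)
    infer_state: Callable[[History], State]                      # history -> current state

    def is_well_formed(self) -> bool:
        """Check that every listed action has a normalized distribution over known states."""
        for s, acts in self.actions.items():
            for a in acts:
                dist = self.transitions.get((s, a))
                if dist is None or not set(dist) <= self.states:
                    return False
                if abs(sum(dist.values()) - 1.0) > 1e-9:
                    return False
        return True


@dataclass
class ContextAgent:
    contexts: Dict[str, Context]
    start_context: str
    start_reward: Reward
    outputs: Set[Action]            # real-world actions
    observations: Set[Observation]
    # (current context, chosen action) -> (new context, reward function for it)
    switch: Callable[[str, Action], Tuple[str, Reward]]


# A hypothetical one-context example.
errands = Context(
    states={"at home", "at store"},
    actions={"at home": ["go to the store"], "at store": []},
    transitions={("at home", "go to the store"): {"at store": 0.9, "at home": 0.1}},
    infer_state=lambda history: "at home",
)
print(errands.is_well_formed())
```

Note that nothing in `ContextAgent` stores a global goal. The only reward attached to the agent itself is the one for the starting context, and every other reward function is produced by `switch`.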

The activity of a context agent follows a "tick, tock" pattern, alternating planning and action-selecting.

The context agent always has some context $c$ and some reward function $R$. Based on this the agent samples a policy that achieves the highest reward, starting from the inferred current state $I_c(h)$, where $h$ is the agent's history. In the real world where we can't just say "argmax," this search can be difficult, especially if we allow the transition probabilities to be difficult-to-evaluate functions rather than a pre-computed matrix. Or maybe the number of possible states to search through is just very large (since the contexts are like MDPs rather than POMDPs, there can be contexts where there need to be at least as many states as there are states of knowledge of the agent). But we can suppose that the context agent is able to solve this search problem at least to some degree, and find some decent policy.

It then selects the first action in that policy, $a$. If $a$ is not part of $A$, then the context agent switches contexts: it repeats the planning step except with new (context, reward function) = $(c', R')$ = $T(c, a)$, the output of the context transition function. On the other hand, if $a \in A$ (that is, $a$ is a real-world action), the agent takes the real-world action $a$, records a new sensory observation $o \in O$, and jumps back to the starting context and reward $(c_0, R_0)$.
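The loop just described can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the two toy contexts, the `reach` helper that builds goal rewards, an `infer_state` that ignores history, and especially `plan`, which uses one-step lookahead in place of a real policy search.

```python
from typing import Callable, Dict, List, Tuple

Reward = Callable[[str, str, str], float]  # (s, a, s') -> reward
# Each toy context maps state -> action -> {next_state: probability}.
Context = Dict[str, Dict[str, Dict[str, float]]]

CONTEXTS: Dict[str, Context] = {
    "errands": {"at home": {"go to the store": {"at store": 1.0},
                            "stay in": {"at home": 1.0}}},
    "travel": {"shoes off": {"put on shoes": {"shoes on": 1.0},
                             "wait": {"shoes off": 1.0}}},
}
OUTPUTS = {"put on shoes", "wait", "stay in"}  # real-world actions


def reach(goal: str) -> Reward:
    """Reward 1 for landing in `goal`, 0 otherwise."""
    return lambda s, a, s2: 1.0 if s2 == goal else 0.0


START: Tuple[str, Reward] = ("errands", reach("at store"))


def switch(context: str, action: str) -> Tuple[str, Reward]:
    """Toy transition function: 'go to the store' becomes a goal in 'travel'."""
    if (context, action) != ("errands", "go to the store"):
        raise KeyError((context, action))
    return "travel", reach("shoes on")


def infer_state(context: str, history: List[Tuple[str, str]]) -> str:
    """Stand-in for the inference function; it ignores the history here."""
    return {"errands": "at home", "travel": "shoes off"}[context]


def plan(context: str, reward: Reward, state: str) -> str:
    """One-step lookahead, standing in for a real policy search."""
    options = CONTEXTS[context][state]

    def expected(a: str) -> float:
        return sum(p * reward(state, a, s2) for s2, p in options[a].items())

    return max(options, key=expected)


def tick_tock(environment: Callable[[List[str]], str],
              history: List[Tuple[str, str]], max_switches: int = 100) -> str:
    """Plan and switch contexts until a real-world action comes out, then record it."""
    context, reward = START  # every timestep begins at the starting context
    for _ in range(max_switches):
        action = plan(context, reward, infer_state(context, history))
        if action in OUTPUTS:
            past_outputs = [a for a, _ in history]
            observation = environment(past_outputs + [action])
            history.append((action, observation))
            return action
        context, reward = switch(context, action)
    raise RuntimeError("no real-world action selected")


history: List[Tuple[str, str]] = []
print(tick_tock(lambda outputs: f"result of {outputs[-1]}", history))
```

Each call to `tick_tock` is one timestep: the agent starts in the "errands" context, selects "go to the store", switches to "travel" with a new reward, and only then emits a real-world action. The `max_switches` cap is a blunt guard against the cycles discussed earlier.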

IV - motivation

The introduction was a little ambiguous as to the real motivations of this post. We can either interpret context agents in terms of their practical use, or we can interpret them normatively.

The practical perspective looks at the introductory arguments like "modeling things as POMDPs is really hard on long time scales" and evaluates context agents in terms of their ability to solve this problem. This puts them in the company of the macro-actions literature, which is on the "being clever" side of the historical trend that being clever isn't as good as being trained end-to-end with a much larger computer. (One can see most things related to Josh Tenenbaum as attempting to navigate this divide, see e.g. Kulkarni & Narasimhan et al. 2016). And indeed, context agents don't seem all that well-suited for end-to-end training.

If there's a practically exciting feature, maybe it's the potential for unsupervised / self-supervised learning of contexts. After all, each context is an abstraction of the world that makes its own local kind of sense - that should perk up your ears if you like autoencoders and reconstruction loss. However, there are challenges in forming a new context - it should be a good abstract description of part of the world, but it also has to be built out of plans that can be made in already-learned contexts. This is a thorny optimization problem to solve efficiently. And since this is an agent, the data it's learning from are its interactions with an environment, so unsupervised learning would also require intrinsically motivated planning for what we might call novelty and mastery.

The normative perspective looks at the introductory arguments like "when humans make a plan to fulfill their goals, they conceive of it abstractly and tend to only fill in details as needed" and evaluates context agents on how well they help us understand human values.

To clarify what I mean, consider trying to understand human values through the lens of utility maximization. As soon as we start, the utility-maximization framework immediately suggests interesting questions, like: What is the human utility function? Or if that doesn't exist, can we attribute a utility function to an idealized human trying to do good? What is the domain of the utility function? What are some ways humans deviate from (or complicate) utility maximization? If you tell a superintelligent AI to fit human behavior to a utility function and then adopt that utility function as its own, what happens?

The problem with utility maximization as a framework for understanding human values isn't really that humans don't have utility functions. Okay, maybe some of the problem is that humans very don't have utility functions. But the real problem is that a lot of the questions it generates have dismal and barren answers.

Example one: Humans don't have a utility function over physical states of the universe (we don't just want it to be frozen in the one best state forever), and so we are relegated to the extremely general space of utility functions over universe-histories, about which it is difficult to say interesting things.

Example two: There are many utility functions over universe-histories that would be wonderful to implement, and we can interpret some of these as being the utility functions of idealized human-like agents ("what would we do if we knew more, were more moral, deliberated for longer, etc.?"). But the actual idealization process, the interesting bit, doesn't take place in the language of utility functions.

We get some of the same questions for context agents, and some different ones: What context agent is a human? Or if that doesn't exist, what context agent corresponds to an idealized human trying to do good? How do actual humans deviate from (or complicate) context agency? Can we nevertheless infer the contexts used by humans in making certain sorts of plans? If we take our idealized do-gooder context agent, what does it even mean to try to "give this to a superintelligence," given that context agents mix up the boundary between goal-having and decision-making?

That last one is important - we've definitely given something up by deviating from VNM rationality. Optimization power can be "hidden" from the starting context $c_0$ both in the learned behavior of the transition function, and in the unavoidable fact that more concrete contexts will fill in details that were left out of more-abstract contexts. Context agents don't have goals in the sense of a utility function over states of the universe, but they have goals in their own messier, more holistic way.

In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."

Unfortunately, the normative usage of context agents is at odds with the practical usage. It's like context agents are a parametric model of humans, and what I called the practical perspective is asking "How can we use this model of humans to make skillful plans and decisions?", while what I call the normative perspective is asking "How can we use this model of humans to mimic actual humans?". The contexts of a context agent, and the transitions between them, encode common sense in what options they present to the planning process. We might abstractly say "go over there" without worrying about specifying that it should be bipedal locomotion, and we might say "get some groceries" without specifying that that means going to the store, rather than robbing the neighbor. The more concrete contexts can fill in such details. But for them to do so means they have to have a lot of learned information from humans, which rather torpedoes the cool ideas I had for unsupervised learning of contexts.

On the plus side, I'm pretty sure this has interesting Goodhart's law properties. In the same sense that GPT-2 doesn't have Goodhart's law problems (sort of) because it's not doing agent-like selection of its output based on expected consequences for the world, the learned contexts and transition function exert non-agenty optimization pressure. The Goodhart's law concern is that you boot up a context agent with the high-level goal "do good things," and it plans out the cheapest thing that qualifies as good, and then kills you to prevent you from stopping it. But if "do good things" is an action that maps onto a cause-selection context, which leads to a malaria-curing context, which leads to actually curing malaria, everything is fine.

This is a bit like a cross between a quantilizer and existentialist philosophy. If the optimization via choice of context eliminates undesirable maxima, well, it eliminates most things, only leaving behind a smaller selection that's more in line with human plans. Quantilizer-style thinking might frame this as missing out on high-value states of the unknown True Utility Function in order to be safe. Surely our AI could be doing something better than using some human-comprehensible plan to cure malaria (or whatever), right? But the normative perspective on context agents would be more comfortable treating this as a legitimate expression of human values.


6 comments

I think maybe a more powerful framework than discrete contexts is that there's a giant soup of models, and the models have arrows pointing at other models, and multiple models can be active simultaneously, and the models can span different time scales. So you can have a "I am in the store" model, and it's active for the whole time you're shopping, and meanwhile there are faster models like "I am looking for noodles", and slower models like "go shopping then take the bus home". And anything can point to anything else. So then if you have a group of models that mainly point to each other, and less to other stuff, it's a bit of an island in the graph, and you can call it a "context". Like everything I know about chess strategy is mostly isolated from the rest of my universe of knowledge and ideas, so I could say I have a "chess strategy context". But that's an emergent property, not part of the data structure.

My impression is that the Goodhart's law thing at the end is a bit like saying "Don't think creatively"... Thinking creatively is making new connections where they don't immediately pop into your head. Is that reasonable? Sorry if I'm misunderstanding. :)

Yeah, I agree, it seems both more human-like and more powerful to have a dynamical system where models are activating other models based on something like the "lock and key" matching of neural attention. But for alignment purposes, it seems to me that we need to not only optimize models for usefulness or similarity to actual human thought, but also for how similar they are to how humans think of human thought - when we imagine an AI with the goal of doing good, we want it to have decision-making that matches our understanding of "doing good." The model in this post isn't as neat and clean as utility maximization, but a lot of the overly-neat features have to do with making it more convenient to talk about it having a fixed, human-comprehensible goal.

Re: creativity, I see how you'd get that from what I wrote but I think that's only half right. The model laid out in this post is perfectly capable of designing new solutions to problems - it just tends to do it by making a deliberate choice to take a "design a new solution" action. Another source of creativity is finding surprising solutions to difficult search problems, which is perfectly possible in complicated contexts.

Another source of creativity is compositionality, which you can have in this formalism by attributing it to the transition function putting you into a composed context. Can you learn this while trying to mimic humans? I'm not sure, but it seems possible.

We might also attribute a deficit in creativity to the fact that the reward functions are only valid in-context, and aren't designed to generalize to new states, even if there were really apt ways of thinking about the world that involved novel contexts or adding new states to existing contexts. And maybe this is the important part, because I think this is a key feature, not at all a bug.

In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."

Thank you for writing this post! I've had very similar thoughts for the past year or so, and I think the quote above is exactly right. IMO, part of the alignment problem involves representational alignment -- i.e., ensuring that AI systems accurately model both the abstract concepts we use to understand the world, as well as the abstract tasks, goals, and "reasons for acting" that humans take as instrumental or final ends. Perhaps you're already familiar with Bratman's work on Intentions, Plans, & Practical Reason, but to the extent that "intentions" feature heavily in human mental life as the reasons we cite for why we do things, developing AI models of human intention feels very important.

As it happens, one of the next research projects I'll be embarking on is modeling humans as hierarchical planners (most likely in the vein of Hierarchical Task & Motion Planning in the Now by Kaelbling & Lozano-Perez) in order to do Bayesian inference over their goals and sub-goals -- would be happy to chat more about it if you'd like! 

Oh wait, are you the first author on this paper? I didn't make the connection until I got around to reading your recent post.

So when you talk about moving to a hierarchical human model, how practical do you think it is to also move to a higher-dimensional space of possible human-models, rather than using a few hand-crafted goals? This necessitates some loss function or prior probability over models, and I'm not sure how many orders of magnitude more computationally expensive it makes everything.

Yup! And yeah I think those are open research questions -- inference over certain kinds of non-parametric Bayesian models is tractable, but not in general. What makes me optimistic is that humans in similar cultures have similar priors over vast spaces of goals, and seem to do inference over that vast space in a fairly tractable manner. I think things get harder when you can't assume shared priors over goal structure or task structure, both for humans and machines.

Sorry for being slow :) No, I haven't read anything of Bratman's. Should I? The synopsis looks like it might have some interesting ideas but I'm worried he could get bogged down in what human planning "really is" rather than what models are useful.

I'd totally be happy to chat either here or in PMs. Full Bayesian reasoning seems tricky if the environment is complicated enough to make hierarchical planning attractive - or do you mean optimizing a model for posterior probability (the prior being something like MML?) by local search?

I think one interesting question there is if it can learn human foibles. For example, suppose we're playing a racing game and I want to win the race, but fail because my driving skills are bad. How diverse a dataset about me do you need to actually be able to infer that a) I am capable of conceptualizing how good my performance is, b) I wanted it to be good, and c) it wasn't good, from a hierarchical perspective, because of the lower-level planning faculties I have? I think maybe you could actually learn this only from racing game data (no need to make an AGI that can ask me about my goals and do top-down inference), so long as you had diverse enough driving data to make the "bottom-up" generalization that my low-level driving skill can be modeled as bad almost no matter the higher-level goal, and therefore it's simplest to explain me not winning a race by taking the bad driving I display elsewhere as a given and asking what simple higher-level goal fits on top.