If you end up applying this post, please do it in the name of safety research.
I - intro
Suppose I need groceries. When I make a plan to get more groceries, what I don't do is fully model the world (e.g. as a Markov process), find a policy that leads to me getting groceries, and then execute that policy. Among the reasons why not, there are two at the top of the heap both in importance and ridiculousness. First, this requires imagining every situation I can think of and making a plan in case of every eventuality. Imagine me sitting at home, thinking "Yes, well if the peanut butter is on the third shelf, I'll do this, but if it's on the second shelf, I'll do that..." Second, every plan I make, for all of these situations, would have to be in the most exacting detail possible. Not "Then I'll take the peanut butter," but "Then I will activate muscle group #1938 at 43% intensity and muscle group #552 at 9% intensity, and..."
What I actually do is think of some abstract plan like "I'm going to bike to the supermarket down the street, then get some veggies and tofu and stuff, and come home." Then to get started, I figure out how to bike to the supermarket in terms of actions like "put on my shoes and a coat, and grab my bike." Then to get started on that, I make a plan to put on my shoes in terms of actions like "Insert feet into shoes," which itself is done with a plan that looks like "Activate muscle group #698 at 35% intensity, and..."
This sort of hierarchical planning is vital to operating in the real world with any sort of efficiency. In this post I will verbally outline some hierarchical planners - "context agents" - followed by a more precise definition, and finally talk about what the values of context agents look like and why I think that's interesting.
Context agents as stated are going to be pretty impractical, but if you're clever there are ways to make them cheaper. The main point of introducing them is to talk about what values and Goodhart's law look like in the context of hierarchical planning - I'm not wedded to this exact formulation by any means, but I hope you'll agree they help describe something interesting.
II - description
The basic part of a context agent is the contexts. A context is mainly just an MDP-type model of the world. It models the world as having states and actions - at each state within the context, there is some set of actions you can take, and each action advertises that taking it results in some probability distribution over other states in the context. Goals are given by some reward function that is exogenous to the context (so that we can talk about using the same context with different reward functions); the context agent plans out a policy that maximizes reward and samples the first action from that policy.
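To make the single-context picture concrete, here is a minimal Python sketch. It assumes a tiny hand-written grocery context, a discount factor, and value iteration as the planner; none of these specifics (the names `plan`, `grocery_context`, `want_groceries`, or the numbers) come from the post, they are only illustrative. The point is that the same context can be planned over with different, externally supplied reward functions.

```python
from typing import Callable, Dict, List, Tuple

State = str
Action = str
# transitions[state][action] -> list of (next_state, probability)
Transitions = Dict[State, Dict[Action, List[Tuple[State, float]]]]
Reward = Callable[[State, Action, State], float]  # R(s, a, s')


def plan(transitions: Transitions, reward: Reward,
         gamma: float = 0.9, sweeps: int = 100) -> Dict[State, Action]:
    """Value iteration (an assumed planner); returns a greedy policy."""
    values = {s: 0.0 for s in transitions}

    def q(s: State, a: Action) -> float:
        # Expected reward plus discounted value of the next state.
        return sum(p * (reward(s, a, s2) + gamma * values[s2])
                   for s2, p in transitions[s][a])

    for _ in range(sweeps):
        # States with no actions are terminal and keep value 0.
        values = {s: max((q(s, a) for a in transitions[s]), default=0.0)
                  for s in transitions}

    return {s: max(transitions[s], key=lambda a: q(s, a))
            for s in transitions if transitions[s]}


# Hypothetical toy context for the grocery example.
grocery_context: Transitions = {
    "at home": {
        "go to the store": [("at store", 0.9), ("at home", 0.1)],
        "stay home": [("at home", 1.0)],
    },
    "at store": {
        "buy groceries": [("have groceries", 1.0)],
        "go home": [("at home", 1.0)],
    },
    "have groceries": {},  # terminal: no actions available
}


def want_groceries(s: State, a: Action, s2: State) -> float:
    return 1.0 if s2 == "have groceries" else 0.0


def want_to_be_home(s: State, a: Action, s2: State) -> float:
    return 1.0 if s2 == "at home" else 0.0


policy = plan(grocery_context, want_groceries)
first_action = policy["at home"]  # the action the agent would actually select
```

Swapping in `want_to_be_home` reuses the same states, actions, and transition probabilities but yields a different policy, which is the sense in which the reward is exogenous to the context.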
Of course, since this is supposed to be a model of hierarchical planning, there's more than one context. Actions serve a dual purpose here: within the context, they predict state transitions, but once an action is chosen it transitions the agent to a new context plus a goal. I make the plan to get groceries in terms of states like "at home" and actions like "go to the store," but once I select the action "go to the store," I figure out how to go to the store by making a new, more fine-grained plan in terms of actions like "put on my shoes." "Go to the store" goes from being an action in the parent context to defining the goal in the child context.
There are plenty of similar takes on hierarchical reasoning out there in the literature (e.g.), and these tend to really be hierarchical: when making a plan, you may only move from general to specific, and each level of the hierarchy is a description of the same stuff, just in different terms.
Context agents are more relaxed about this. You can even have contexts with a cyclical relationship: two contexts A and B such that planning in A sometimes leads to context B, and vice versa. The context switching part of actions is very broad, with no particular restrictions on what you switch from or to.
An example? How about if while biking to the store, the light on my bike dies - and so I plan to buy batteries at the store. In this situation I started out making a plan to "go to the store," which put me in a bicycle-operating context, but it turns out that I was in the state "the light on my bike is out," which led me to an action like "buy some batteries," which led right back to "go to the store" (albeit with a different reward function!).
Of course, infinite loops are bad. But all that's important here is that you eventually take a real-world action ("muscle group #552" etc.) with probability 1. In the context agent formalism, there's some set of actions that are the real-world outputs of the agent, and selecting one of these actions updates the observations available to the agent according to some discrete-time function of the agent's output history (the environment). Then, at least in the nice simple case we're covering in this post, the agent jumps back to the origin of all its plans and plots out the next timestep.
Footnote: Guaranteeing that you eventually get output is a somewhat tricky problem. If we assume that there's some "basic actions only" context where you are forced to choose a real-world action, it gets a bit easier and looks like an ergodicity condition on the transition dynamics between contexts. These conditions are actually pretty easy to fulfill, because the non-ergodic transition matrices are of measure zero, but only if transitions are probabilistic rather than deterministic. Therefore, if the search process is deterministic and the policy it finds is deterministic, we have to worry more about finding ourselves in a cycle, and in that case deliberately injecting noise can help avoid cycles.
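The cycle worry can be illustrated with a small Python sketch that treats context switching as a Markov chain with an absorbing "an output was emitted" state. Everything here is an assumption made for illustration: the names `prob_output_within` and `add_noise`, the two-context cycle, and the particular noise scheme (with probability `eps`, jump to the "basic actions only" context) are not from the post.

```python
from typing import Dict

# chain[c][c2] = probability that the action chosen in context c switches to c2.
Chain = Dict[str, Dict[str, float]]


def prob_output_within(chain: Chain, start: str, steps: int,
                       sink: str = "OUTPUT") -> float:
    """Probability of having emitted a real-world output within `steps` switches."""
    dist = {start: 1.0}
    for _ in range(steps):
        nxt: Dict[str, float] = {}
        for c, mass in dist.items():
            if c == sink:  # absorbing: once an output is emitted, we are done
                nxt[sink] = nxt.get(sink, 0.0) + mass
                continue
            for c2, p in chain[c].items():
                nxt[c2] = nxt.get(c2, 0.0) + mass * p
        dist = nxt
    return dist.get(sink, 0.0)


def add_noise(chain: Chain, eps: float, fallback: str = "basic") -> Chain:
    """Assumed noise scheme: with probability eps, jump to `fallback` instead."""
    noisy: Chain = {}
    for c, row in chain.items():
        if c == fallback:  # the fallback context itself is left untouched
            noisy[c] = dict(row)
            continue
        new_row = {c2: (1.0 - eps) * p for c2, p in row.items()}
        new_row[fallback] = new_row.get(fallback, 0.0) + eps
        noisy[c] = new_row
    return noisy


# Deterministic planning in A always picks B, and vice versa: a pure cycle.
deterministic: Chain = {
    "A": {"B": 1.0},
    "B": {"A": 1.0},
    "basic": {"OUTPUT": 1.0},  # "basic actions only": must emit an output
}

stuck = prob_output_within(deterministic, "A", steps=1000)
unstuck = prob_output_within(add_noise(deterministic, eps=0.1), "A", steps=200)
```

In the deterministic chain the agent never leaves the A/B cycle, so `stuck` is exactly zero. With even a small amount of injected noise, the probability of reaching an output approaches 1 as the number of allowed switches grows.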
III - more specific description
A context agent (to use the usual jargon) is a tuple $(C, c_0, R_0, \Omega, O, T)$ where:
- $C$ is the set of contexts.
- $c_0 \in C$ is the starting context.
- $R_0$ is a reward function given for the starting context. Reward functions are (in general) of type $S \times A \times S \to \mathbb{R}$. You can have a reward that depends on what state you started in, what action you took, and what state you ended up in.
- $\Omega$ (with elements $\omega$) is the set of real-world actions or outputs.
- $O$ is the set of possible sensory observations.
- $T$ is the transition function between contexts. This is of type $C \times A \to C \times \mathcal{R}$, writing $\mathcal{R}$ for the set of reward functions. It takes in your current context and the action taken in that context, and gives you back the new context and a reward function for that context. You could also think of this split into context-specific transition functions $T_c$.
Each context $c \in C$ is in turn a tuple $(S, A, P, I)$, where:
- $S$ is the set of states.
- $A_s$ is the set of available actions for each $s$ in $S$.
- $P$ are the transition probabilities between states, $P(s' \mid s, a)$.
- $I$ is the context's inference function for what state you're in, given the agent's entire history. This is of type $(\Omega \times O)^{*} \to S$. This is a potentially impractical thing that I am going to ignore for now because it doesn't bear much on the alignment bits I want to talk about in this post.
The activity of a context agent follows a "tick, tock" pattern, alternating planning and action-selecting.
The context agent always has some context $c$ and some reward function $R$. Based on this the agent samples a policy $\pi$ that achieves the highest reward, starting from the inferred current state $s$. In the real world where we can't just say "$\arg\max$," this search can be difficult, especially if we allow the transition probabilities to be difficult-to-evaluate functions rather than a pre-computed matrix. Or maybe the number of possible states to search through is just very large (since the contexts are like MDPs rather than POMDPs, there can be contexts where there need to be at least as many states as there are states of knowledge of the agent). But we can suppose that the context agent is able to solve this search problem at least to some degree, and find some decent policy.
It then selects the first action in that policy, $a = \pi(s)$. If $a$ is not part of $\Omega$, then the context agent switches contexts: it repeats the planning step except with new (context, reward function) $= (c', R') = T(c, a)$. On the other hand, if $a \in \Omega$, the agent takes the real-world action $a$, records a new sensory observation $o \in O$, and jumps back to the starting context $c_0$ and reward $R_0$.
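The plan-then-switch loop described above can be sketched in Python as follows. This is only a sketch under stated assumptions: the class names `Context` and `ContextAgent`, the one-step-lookahead planner standing in for a real policy search, the trivial inference functions, the `max_switches` guard, and the toy "errands" and "getting ready" contexts are all hypothetical and not part of the post's definition.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Set, Tuple

Reward = Callable[[str, str, str], float]  # R(s, a, s')


@dataclass
class Context:
    transitions: Dict[str, Dict[str, List[Tuple[str, float]]]]  # s -> a -> [(s', prob)]
    infer_state: Callable[[list], str]  # inference function: history -> state

    def first_action(self, reward: Reward, history: list) -> str:
        """Stand-in planner: one-step lookahead, not a full policy search."""
        s = self.infer_state(history)

        def expected(a: str) -> float:
            return sum(p * reward(s, a, s2) for s2, p in self.transitions[s][a])

        return max(self.transitions[s], key=expected)


@dataclass
class ContextAgent:
    contexts: Dict[str, Context]                      # the contexts, by name
    start: Tuple[str, Reward]                         # starting context and reward
    outputs: Set[str]                                 # real-world actions
    switch: Callable[[str, str], Tuple[str, Reward]]  # (context, action) -> (context', reward')

    def step(self, history: list, max_switches: int = 100) -> str:
        """Plan from the starting context, switching until an output is chosen."""
        c, reward = self.start
        for _ in range(max_switches):
            a = self.contexts[c].first_action(reward, history)
            if a in self.outputs:
                return a                    # real-world action: stop switching
            c, reward = self.switch(c, a)   # otherwise: new context, new goal
        raise RuntimeError("no output selected; possible cycle between contexts")


def reach(goal: str) -> Reward:
    """Reward 1 for landing in `goal`, 0 otherwise."""
    return lambda s, a, s2: 1.0 if s2 == goal else 0.0


errands = Context(
    transitions={"at home": {"go to the store": [("at store", 1.0)],
                             "stay home": [("at home", 1.0)]}},
    infer_state=lambda history: "at home",   # placeholder inference
)
getting_ready = Context(
    transitions={"barefoot": {"insert feet into shoes": [("shoes on", 1.0)],
                              "wiggle toes": [("barefoot", 1.0)]}},
    infer_state=lambda history: "barefoot",  # placeholder inference
)


def switch(c: str, a: str) -> Tuple[str, Reward]:
    # Only one context switch is defined in this toy example.
    if (c, a) == ("errands", "go to the store"):
        return "getting ready", reach("shoes on")
    raise KeyError((c, a))


agent = ContextAgent(
    contexts={"errands": errands, "getting ready": getting_ready},
    start=("errands", reach("at store")),
    outputs={"insert feet into shoes", "wiggle toes"},
    switch=switch,
)

history: list = []
for _ in range(2):
    output = agent.step(history)                 # plan, possibly switching contexts
    observation = f"observation {len(history)}"  # placeholder environment
    history.append((output, observation))        # then restart from the start context
```

Here "go to the store" is chosen as an action in the starting context, is not a real-world output, and so becomes the goal of the more concrete context, which then selects an actual output. Because the placeholder inference functions ignore the history, the toy agent picks the same output every timestep; a real inference function would update the state from the recorded observations.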
IV - motivation
The introduction was a little ambiguous as to the real motivations of this post. We can either interpret context agents in terms of their practical use, or we can interpret them normatively.
The practical perspective looks at the introductory arguments like "modeling things as POMDPs is really hard on long time scales" and evaluates context agents in terms of their ability to solve this problem. This puts them in the company of the macro-actions literature, which is on the "being clever" side of the historical trend that being clever isn't as good as being trained end-to-end with a much larger computer. (One can see most things related to Josh Tenenbaum as attempting to navigate this divide, see e.g. Kulkarni & Narasimhan et al. 2016). And indeed, context agents don't seem all that well-suited for end-to-end training.
If there's a practically exciting feature, maybe it's the potential for unsupervised / self-supervised learning of contexts. After all, each context is an abstraction of the world that makes its own local kind of sense - that should perk up your ears if you like autoencoders and reconstruction loss. However, there are challenges in forming a new context - it should be a good abstract description of part of the world, but it also has to be built out of plans that can be made in already-learned contexts. This is a thorny optimization problem to solve efficiently. And since this is an agent, the data it's learning from are its interactions with an environment, so unsupervised learning would also require intrinsically motivated planning for what we might call novelty and mastery.
The normative perspective looks at the introductory arguments like "when humans make a plan to fulfill their goals, they conceive of it abstractly and tend to only fill in details as needed" and evaluates context agents on how well they help us understand human values.
To clarify what I mean, consider trying to understand human values through the lens of utility maximization. As soon as we start, the utility-maximization framework immediately suggests interesting questions, like: What is the human utility function? Or if that doesn't exist, can we attribute a utility function to an idealized human trying to do good? What is the domain of the utility function? What are some ways humans deviate from (or complicate) utility maximization? If you tell a superintelligent AI to fit human behavior to a utility function and then adopt that utility function as its own, what happens?
The problem with utility maximization as a framework for understanding human values isn't really that humans don't have utility functions. Okay, maybe some of the problem is that humans very don't have utility functions. But the real problem is that a lot of the questions it generates have dismal and barren answers.
Example one: Humans don't have a utility function over physical states of the universe (we don't just want it to be frozen in the one best state forever), and so we are relegated to the extremely general space of utility functions over universe-histories, about which it is difficult to say interesting things.
Example two: There are many utility functions over universe-histories that would be wonderful to implement, and we can interpret some of these as being the utility functions of idealized human-like agents ("what would we do if we knew more, were more moral, deliberated for longer, etc.?"). But the actual idealization process, the interesting bit, doesn't take place in the language of utility functions.
We get some of the same questions for context agents, and some different ones: What context agent is a human? Or if that doesn't exist, what context agent corresponds to an idealized human trying to do good? How do actual humans deviate from (or complicate) context agency? Can we nevertheless infer the contexts used by humans in making certain sorts of plans? If we take our idealized do-gooder context agent, what does it even mean to try to "give this to a superintelligence," given that context agents mix up the boundary between goal-having and decision-making?
That last one is important - we've definitely given something up by deviating from VNM rationality. Optimization power can be "hidden" from the starting context both in the learned behavior of the transition function and in the unavoidable fact that more concrete contexts will fill in details that were left out of more abstract contexts. Context agents don't have goals in the sense of a utility function over states of the universe, but they have goals in their own messier, more holistic way.
In exchange for the mess, we get a lot closer to the structure of what humans think when they imagine the goal of "doing good." Humans strive towards such abstract goals by having a vague notion of what it would look and feel like, and by breaking down those goals into more concrete sub-tasks. This encodes a pattern of preferences over universe-histories that treats some temporally extended patterns as "states."
Unfortunately, the normative usage of context agents is at odds with the practical usage. It's like context agents are a parametric model of humans, and what I called the practical perspective is asking "How can we use this model of humans to make skillful plans and decisions?", while what I called the normative perspective is asking "How can we use this model of humans to mimic actual humans?". The contexts of a context agent, and the transitions between them, encode common sense in what options they present to the planning process. We might abstractly say "go over there" without worrying about specifying that it should be bipedal locomotion, and we might say "get some groceries" without specifying that that means going to the store, rather than robbing the neighbor. The more concrete contexts can fill in such details. But for them to do so means they have to have a lot of learned information from humans, which rather torpedoes the cool ideas I had for unsupervised learning of contexts.
On the plus side, I'm pretty sure this has interesting Goodhart's law properties. In the same sense that GPT-2 doesn't have Goodhart's law problems (sort of) because it's not doing agent-like selection of its output based on expected consequences for the world, the learned contexts and transition function exert non-agenty optimization pressure. The Goodhart's law concern is that you boot up a context agent with the high-level goal "do good things," and it plans out the cheapest thing that qualifies as good, and then kills you to prevent you from stopping it. But if "do good things" is an action that maps onto a cause-selection context, which leads to a malaria-curing context, which leads to actually curing malaria, everything is fine.
This is a bit like a cross between a quantilizer and existentialist philosophy. If the optimization via choice of context eliminates undesirable maxima, well, it eliminates most things, only leaving behind a smaller selection that's more in line with human plans. Quantilizer-style thinking might frame this as missing out on high-value states of the unknown True Utility Function in order to be safe. Surely our AI could be doing something better than using some human-comprehensible plan to cure malaria (or whatever), right? But the normative perspective on context agents would be more comfortable treating this as a legitimate expression of human values.