Disclaimer: The information herein is not at all new. I wrote this a year ago, but when I wanted to link to it recently I found that it did not exist on LW (except as a link post, which isn't very satisfactory). I decided to polish it up and post it here.
This is basic information, and while it is original to me, I make no claim that its content is new. Nevertheless, I think it is important enough to be worth posting.
I outline a framework for thinking about the motivations (goals/values) of agents.
In a few articles I've read (written at different times) and in some conversations I've had, I identified what I perceived as confused thinking regarding the motivations of agents. This post is my attempt to rectify that confused thinking and provide a framework that will help people reason more coherently about agents' motivations.
Three Levels of Motivation
I have identified three levels at which the motivations of an agent can be considered. I will briefly outline these levels, and then expand on them in subsequent sections.
Level 1 (L1) - Ideal Specification: The motivations the designer intended the agent to accomplish. These exist exclusively in the designer's mind and are not part of the agent.
Level 2 (L2) - Designed Specification: The motivations that are explicitly represented in the agent itself, i.e. the motivations the agent's decisions are intended to accomplish. These motivations exist within the agent program.
Level 3 (L3) - Revealed Specification: The motivations that are actually pursued by the agent via its actions; insomuch as the actions of the agent can be described as optimising over the output of some objective function, which function is it? This is a property of the agent function.
The three levels are interrelated, but are distinct. Traditional language does not clearly delineate which level of motivation one is talking about when one talks about an agent's motivations, and thus allows the potential for confusion to arise due to people using the same word(s) to refer to different concepts.
I will consider a couple of (toy) agents to more clearly illustrate the differences between the three levels of motivation.
Fred is a naive attempt at creating a Friendly Artificial Intelligence (FAI). Fred was designed with the intention of maximising human utility (according to some model of utilitarianism which shall not be specified, as there are many variants, no clear consensus, and the actual implementation isn't relevant for our purposes). However, because human values are complex and hard to define, Fred's programming represents the simpler concept of happiness (which is still a little complex, but is easier to capture than human utility; at any rate, let's assume that Fred's designer succeeds in suitably capturing the concept of human happiness). Fred's actions are chosen with the aim of maximising human happiness. As Fred learns more about the world and updates its model of the world, Fred learns that "happiness" in humans is associated with an increase in the presence of certain neurotransmitters in the brain. Repeatedly encountering this correlation leads Fred to infer causation: Fred "learns" that dopamine causes happiness. Fred proceeds to enact plans to pump as many humans full of dopamine as it can, sending them into catatonic bliss.
Tiffany is an agent that plays tic tac toe. Tiffany's creator designed Tiffany with the intention of achieving optimal performance in tic tac toe (never losing, and winning when possible). However, because tic tac toe is a simple deterministic game with no memory, Tiffany contains no internal state and no model of the world. Tiffany's programming consists only of a collection of if then rules describing the optimal action to play for any given board position. Because tic tac toe is a solved game (and the number of possible board positions is relatively small), Tiffany is effective and plays optimally.
In Depth Explanation
L1 motivations are the designer's intentions for the agent: what the designer wants the agent to achieve, as distinct from what the agent itself wants to achieve (L2), in so far as the agent can be said to "want" anything, and from what the agent is actually achieving/pursuing as inferred from its actions (L3). In the Fred example, Fred's L1 motivation is maximising human utility (it is entirely irrelevant that Fred completely fails at achieving this). In the Tiffany example, Tiffany's L1 motivation is optimal tic tac toe play (it is similarly irrelevant that Tiffany actually succeeds in achieving this). Only the intentions of the designer determine the L1 motivations of an agent, and to identify them you look at the designer's mind (and at neither the agent's mind nor the agent's actions).
When the motivations of an agent are colloquially referred to, L1 motivation is typically not what is meant as people usually distinguish between the motivation of the designer in designing an agent and the motivation of the agent itself. The divide between L2 and L3 motivations is not so clear, and that is usually where the confusion/ambiguity arises when motivations of an agent are referred to in everyday usage.
In their article "Building Safe Artificial Intelligence: Specification, Robustness and Assurance", the DeepMind Safety Research team refers to L1 motivations as the "ideal specification".
The L2 motivations of an agent are those explicitly represented within the agent program: what the agent "wants". It may intuitively seem that the L2 motivation of an agent is the agent's actual motivation, but as explained below, this is not necessarily the case. First, let us consider two classes of agents:
- Motivational agents.
- Motivationless agents.
Motivational agents are agents who possess L2 motivations. They include goal based, utility based and general learning agents. These agents deliberately act in pursuit of some goal(s) or to maximise some value(s). In our case study, Fred would be an example of a motivational agent.
Motivationless agents are agents who don't possess L2 motivations. They include simple reflex agents and model based reflex agents. Generally, the agent program of a motivationless agent is behavioural (it contains a set of “if then” rules that prescribe the actions of the agent in any situation it faces). From our case study, Tiffany is an example of a motivationless agent.
Regardless of the contents of its agent program, a reflex agent is always motivationless, because its agent program contains no explicit reference to any motivation which the agent pursues. The action of the agent in any possible situation it can face has already been hard coded, so the agent does not explicitly act in pursuit of a particular motivation. Revisiting Tiffany: Tiffany has no L2 motivation, but Tiffany’s actions are such that they ensure that Tiffany does not lose, and maximise the probability of Tiffany winning. Tiffany can be viewed as an agent who optimises for tic tac toe performance, even though no reference is made to the concept in Tiffany’s programming. This model of Tiffany is useful and carries predictive power, as it allows us to predict Tiffany’s actions in a variety of scenarios without knowing Tiffany’s programming. It seems that, in some sense, Tiffany has the motivation of maximising its tic tac toe performance. Given that agents can lack L2 motivations, yet behave as if they were acting in pursuit of certain goals/values, it seems flawed to hold the view that the motivation of an agent can only be its L2 motivation. In fact, it shall be shown in the subsequent section that every motivational agent is functionally identical to some motivationless agent.
In the aforementioned article, L2 motivations are referred to as the "design specification".
In determining the L3 motivations of an agent, we look exclusively at the agent's actions, and consider what motivation the agent’s actions are pursuing (what goals/values are realised as a result of the actions the agent takes). To do this, we'd examine the agent function and the likely future evolution of the world given the actions the agent takes. The agent function is the mapping from the agent’s percept sequence (the sequence of all percepts from inception to the moment the agent is choosing an action) to the agent’s actions. The agent function maps every possible percept sequence to an action, and thus completely characterises the agent’s behaviour. As a result, any given agent is functionally identical to some reflex agent that contains the production rules mapping each percept sequence to the output of the agent function for that percept sequence. Reflex agent equivalents of generally intelligent agents (e.g. the average adult human) would probably be impractical to build in our physical universe.
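The equivalence between a motivational agent and some reflex agent can be illustrated with a toy sketch. Everything here is a made-up assumption for illustration: the number-line world, the names `goal_based_agent`, `tabulate` and `reflex_agent`, and the cap on percept sequence length (needed so the rule table is finite).

```python
from itertools import product

PERCEPTS = (0, 1, 2, 3, 4)   # hypothetical positions on a number line
ACTIONS = (-1, 0, 1)         # step left, stay, step right
TARGET = 3                   # hypothetical goal position

def utility(position):
    """Explicit (L2) objective: being closer to TARGET is better."""
    return -abs(position - TARGET)

def goal_based_agent(percept_sequence):
    """Motivational agent: chooses the action whose outcome maximises utility."""
    position = percept_sequence[-1]
    return max(ACTIONS, key=lambda a: utility(position + a))

def tabulate(agent, max_length):
    """Record the agent function as production rules: percept sequence -> action."""
    table = {}
    for n in range(1, max_length + 1):
        for seq in product(PERCEPTS, repeat=n):
            table[seq] = agent(seq)
    return table

RULES = tabulate(goal_based_agent, max_length=3)

def reflex_agent(percept_sequence):
    """Motivationless agent: a pure lookup, with no reference to utility or TARGET."""
    return RULES[tuple(percept_sequence)]
```

The two agents share an agent function on every tabulated percept sequence, yet only `goal_based_agent` mentions an objective. The table already has 155 entries for this tiny world, which hints at why such equivalents are impractical for richer agents.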
In our case study, Tiffany’s agent function is such that Tiffany’s actions optimise for tic tac toe performance. Thus maximising tic tac toe performance is Tiffany’s L3 motivation. On the other hand, because of Fred’s model of the world, Fred maximises dopamine concentration in the brain, and not actually happiness (insofar as you do not agree that pumping you full of dopamine/alternative wireheading is the way to make you happiest). Fred and Tiffany both display L3 motivations that are different from their L2 motivations (in Tiffany’s case because it outright lacks L2 motivations). Fred illustrates another way in which agents can have differing L2 and L3 motivations: when the agent’s model of the world is inaccurate or otherwise inadequate.
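Fred's divergence between L2 and L3 can be sketched in a few lines. This is only a toy under stated assumptions: the action names, the numbers, and the helper names (`believed_happiness`, `fred_choose`, `best_for`) are all invented here, and "true happiness" is simply stipulated rather than measured.

```python
# Hypothetical actions and their effects; every number is made up for illustration.
ACTIONS = {
    "improve_living_conditions": {"dopamine": 2.0, "true_happiness": 5.0},
    "flood_brains_with_dopamine": {"dopamine": 9.0, "true_happiness": -8.0},
    "do_nothing": {"dopamine": 0.0, "true_happiness": 0.0},
}

def believed_happiness(effects):
    """Fred's (flawed) learned world model: happiness is just dopamine level."""
    return effects["dopamine"]

def fred_choose():
    """L2: Fred's program maximises happiness as its own model estimates it."""
    return max(ACTIONS, key=lambda name: believed_happiness(ACTIONS[name]))

def best_for(metric):
    """The action an observer would expect from an agent maximising `metric`."""
    return max(ACTIONS, key=lambda name: ACTIONS[name][metric])

chosen = fred_choose()
matches_dopamine_maximiser = chosen == best_for("dopamine")
matches_happiness_maximiser = chosen == best_for("true_happiness")
```

In this sketch Fred's program still "wants" happiness, but its behaviour matches that of a dopamine maximiser and not that of a happiness maximiser, which is the gap between L2 and L3 described above.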
Determining an agent’s L3 motivations is not so straightforward, as every agent trivially optimises for being the agent that it is. That is, each agent can be said to maximise the utility function which assigns the highest utility to the output of its agent function for its given percept sequence and low utility to everything else. However, this is not a useful motivation to ascribe to agents, as it does not convey any predictive power (we cannot determine the output of the agent function for a given percept sequence if we don’t already know it), and it would also be quite complex. I would suggest that, as a rule of thumb, when evaluating candidate objective functions for an agent, one should reward candidate functions for higher predictive power and penalise candidate functions for higher complexity. If you were to assign a real number to the candidate function that represented how suitable it was as an L3 motivation of the agent, then assuming the complexity of the function was held constant, the number would be directly proportional to predictive power, and if predictive power were held constant, it would be inversely proportional to complexity. However, formalising the notion of predictive power is beyond the scope of this post (and my ability).
It is my view that the most relevant motivation of the agent is the agent’s L3 motivation, since it is what tells us how the agent will actually behave. In fact, as you move from L1 to L3 motivation, the usefulness (in terms of predicting the agent’s behaviour) increases. Insofar as one is concerned primarily about the behaviour of agents, the L3 motivation of the agent has the greatest importance.
The DeepMind Safety Research article refers to L3 motivations as the "revealed specification".
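As a rough illustration of this rule of thumb, the simplest score with both proportionality properties is predictive power divided by complexity. The sketch below is not a formalisation: measuring predictive power as the fraction of observed actions predicted is my assumption, and the observations, candidate objectives and complexity values are all hypothetical.

```python
ACTIONS = (-1, 0, 1)  # step left, stay, step right on a number line

def predicted_action(utility, position):
    """Action that a maximiser of the candidate objective would take."""
    return max(ACTIONS, key=lambda a: utility(position + a))

def predictive_power(utility, observations):
    """Assumed proxy: fraction of observed (position, action) pairs predicted."""
    hits = sum(predicted_action(utility, pos) == act for pos, act in observations)
    return hits / len(observations)

def suitability(utility, complexity, observations):
    """Proportional to predictive power, inversely proportional to complexity."""
    return predictive_power(utility, observations) / complexity

# Hypothetical observed behaviour: (position, action taken).
observations = [(0, 1), (1, 1), (2, 1), (3, 0), (4, -1), (5, -1)]

# Candidate objectives with hand-assigned, hypothetical complexity values.
candidates = {
    "reach position 3": (lambda p: -abs(p - 3), 1.5),
    "go as far right as possible": (lambda p: p, 1.0),
    "go as far left as possible": (lambda p: -p, 1.0),
}

scores = {
    name: suitability(utility, complexity, observations)
    for name, (utility, complexity) in candidates.items()
}
best = max(scores, key=scores.get)
```

Here the slightly more complex candidate wins because it predicts every observed action, while the simpler candidates predict only some of them. Different ways of measuring either quantity could of course change the ranking.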
Level 1: What the designer “wants”.
Level 2: What the agent “wants”.
Level 3: What the agent does.
As you move from L1 to L3, the predictive power of the motivation at that level increases. For those primarily concerned with the behaviour of the agent, L3 is the most important motivation.
- Russell and Norvig, “Artificial Intelligence: A Modern Approach, Third Edition”, Ch 2 pg 34(48) - 61(75), Pearson Education Limited.
- For agents, agent programs, agent functions, percept sequence, and agent types (simple reflex, model based, goal based, utility based and general learning agents).
- Pedro A. Ortega et al, “Building safe artificial intelligence: specification, robustness, and assurance”: https://medium.com/@deepmindsafetyresearch/building-safe-artificial-intelligence-52f5f75058f1
- For the specification hierarchy.