Deconfusing goal-directedness would boost your favorite research approach for solving AI Alignment.
Why? Because every approach I know of stands to gain from the clarification of goal-directedness, from Prosaic AGI Alignment to Agent Foundations. In turn, this ubiquitous usefulness of goal-directedness motivates the writing of this sequence, which will include a literature review of the idea in the AI Safety literature and beyond, as well as advanced explorations of goal-directedness by me and collaborators Michele Campolo and Joe Collman.
But before that, I need to back up my provocative thesis. This is why this post exists: it compiles reasons to care about goal-directedness, from the perspective of every research approach and direction I could think of. Although not all reasons given are equally straightforward, none feels outrageously far-fetched to me.
I thus hope that by the end of this post, you will agree that improving our understanding of goal-directedness is relevant for you too.
Thanks to Michele Campolo and Joe Collman for many research discussions, and feedback on this post. Thanks to Alexis Carlier, Evan Hubinger, and Jérémy Perret for feedback on this post.
Meaning of Deconfusion
Before giving you the reasons for caring about goal-directedness, I need to synchronize our interpretations of “deconfusion”. The term comes from MIRI, and specifically this blog post; it captures the process of making a concept clear and explicit enough to have meaningful discussions about it. So it’s not about solving all problems related to the concept, or even formalizing it perfectly (although that would be nice) -- just about allowing coherent thinking. To quote Nate Soares (MIRI’s Executive Director, and the author of the linked blog post):
By deconfusion, I mean something like “making it so that you can think about a given topic without continuously accidentally spouting nonsense.”
What would that look like for goal-directedness? At first approximation, the idea simply means the property of trying to accomplish a goal. Which feels rather simple. But after digging deeper, issues and subtleties emerge: the difference between having a goal and being competent at accomplishing it (discussed here), what should count as a goal (discussed here), which meaningful classes of goals exist (discussed here), and many others.
Thus the concept is in dire need of deconfusion. Such clarification could take many forms, including:
- A mathematical formalization
- A decomposition into formalized components
- A decomposition into simpler and less confused informal components
- A list of accepted examples with different levels of goal-directedness
- A list of properties and their link with the intuitions behind goal-directedness
- And many more variants
Obviously, only time will reveal the form of our results on goal-directedness. Still, it’s valuable to keep in mind the multitude of shapes they could take.
Reasons To Care
Let’s be honest: just listing research approach after research approach, and my reason for the relevance of goal-directedness to them, might be too much information to take in one reading. Fortunately, the reasons I found show some trends, and fit neatly into three groups.
- (Overseeing) In some cases, alignment comes from supervisors and overseers that monitor the AI during training. Goal-directedness is a natural and fundamental property to check, because of its many negative consequences. So deconfusion would facilitate checks of this important property by overseers and supervisors, and thus improve every approach depending on monitoring.
- (Additional Structure on Utility/Reward Functions) Many approaches to alignment rely on utility functions and reward functions to capture goals and values. Such representations are powerful, but so general that maximizing a utility function or a reward function doesn’t reveal much about whether the system actually follows a goal or not (see the discussion here and here).
Furthering our understanding of goal-directedness could reveal more structure to add on these representations of goals, making the pursuit of such a “goal” more closely tied to being goal-directed.
- (Natural Mathematical Abstraction) When attempting to formalize and clarify many aspects of decision making, AI and alignment, concepts like agency and optimization play a big role. Goal-directedness naturally relates to both, because agents are generally considered goal-directed, and so are explicit optimizers doing internal search. Thus goal-directedness should intuitively play a role in these formalizations, whether as a building block, a metric or an example to draw from.
Overseeing
The reasons in this section assume the use of an overseer. This is common for approaches affiliated with Prosaic AI Alignment, where the gist of alignment emerges from training constraints that forbid, push and monitor specific behaviors.
Interpretability and Formal Methods
Interpretability is one way to monitor an AI: it studies how the learned models work, and how to interpret and explain them. Similarly, formal methods (applied to AI) take a formal specification, a model of computation and an AI, and verify whether the AI follows the specification when executed on this model of computation.
Ultimately, both interpretability and formal methods try to check properties of trained models, notably neural networks. Goal-directedness is an example of an important property to look for, as discussed above. And deconfusing goal-directedness would move us towards finding a specification of this property.
(Interpretability à la Clarity Team at OpenAI (for example here) might also prove important in deconfusing goal-directedness, by letting us look into and compare systems with various levels of goal-directedness.)
IDA and Debate
Iterated Distillation and Amplification (IDA) attempts to align a superintelligent AI by starting from a simple AI, amplifying it (training a second AI to imitate the human supervisor using the simple AI), and then distilling this amplified version (by training a simpler model to imitate the amplified AI) into a new AI that can be used by the human supervisor. Hopefully, repeating this will eventually create an AI with superhuman capabilities, while maintaining alignment.
Debate, on the other hand, places the human supervisor as a judge of a debate between two AIs. The value of debate comes from extending the reach of human feedback: judging a debate (which only presents bits and pieces of the arguments) is intuitively easier than checking a complete solution, which is easier than finding a solution. Whether or not debate works hinges on the importance of honesty for optimal play, and other theoretical and empirical questions about human evaluation of debates.
The current approach to IDA, and some approaches to Debate, rely on the same building block, factored cognition: answering questions by splitting them into subquestions given to other AIs with some but not all the context, with recursive splitting allowed. And a big question with this factorization is how to do it without allowing one subAI to manipulate the whole decision.
Deconfusing goal-directedness would help, in that checking the level of goal-directedness of the subAIs, or what kind of goals they might have, correlates with checking non-manipulation. Indeed, manipulation probably requires incentives to make long-term plans towards something, which looks, smells and sounds like goal-directedness.
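To make the factored cognition scheme described above more concrete, here is a minimal sketch of recursive question splitting where each subAI only receives part of the context. Everything in it is an assumption made for illustration: the function names (`factored_answer`, `split_range`, `sum_visible`, `add_up`), the toy "sum a range of values" question, and the depth limit are hypothetical and do not come from any IDA or Debate implementation.

```python
def factored_answer(question, context, decompose, solve_directly, combine,
                    depth=0, max_depth=8):
    """Answer `question` by recursive decomposition.

    All callables are hypothetical stand-ins for subAIs:
    - decompose(question, context) -> list of (subquestion, partial_context)
    - solve_directly(question, context) -> answer for a leaf question
    - combine(question, sub_answers) -> answer built from the sub-answers
    """
    subtasks = decompose(question, context) if depth < max_depth else []
    if not subtasks:
        return solve_directly(question, context)
    sub_answers = [
        factored_answer(sub_q, sub_ctx, decompose, solve_directly, combine,
                        depth + 1, max_depth)
        for sub_q, sub_ctx in subtasks
    ]
    return combine(question, sub_answers)


# Toy instantiation: "what is the total of the values at indices lo..hi-1?"
def split_range(question, context):
    lo, hi = question
    if hi - lo <= 1:
        return []  # small enough to answer directly
    mid = (lo + hi) // 2
    # Each subquestion only gets the slice of the context it needs.
    return [((a, b), {i: context[i] for i in range(a, b)})
            for a, b in ((lo, mid), (mid, hi))]


def sum_visible(question, context):
    lo, hi = question
    return sum(context[i] for i in range(lo, hi))


def add_up(question, sub_answers):
    return sum(sub_answers)


values = dict(enumerate([3, 1, 4, 1, 5, 9, 2, 6]))
total = factored_answer((0, 8), values, split_range, sum_visible, add_up)
print(total)  # 31
```

In this picture, the manipulation worry is that a single call to `solve_directly` could return an answer crafted to steer `combine`; a check for goal-directedness would plausibly sit around those leaf calls, although how to write such a check is exactly what remains open.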
Universality (see also my own explanation) is a property of an overseer that matters a great deal to approaches like IDA and Debate. Basically, an overseer O is ascription universal for a computation C with regard to an epistemic viewpoint (usually the human supervisor H) if, whatever finite information about O and C is provided to H, the latter will privilege the probabilistic predictions of O over the ones of C. This is supposed to capture the fact that O is able to reveal the different beliefs or reasons that might be ascribed to C to explain it.
One very natural way to ascribe belief is through assuming goals and a desire to accomplish them. This is basically the intentional stance of Daniel Dennett, a classical framework to understand goal-directedness. Deconfusing goal-directedness would put such an ascription of beliefs on more formal and computational grounds.
The inner alignment problem, as stated in Risks from Learned Optimization, concerns learned optimizers: optimizers that result from an optimization process, like gradient descent. Such learned optimizers are called mesa-optimizers, and the inner alignment research direction studies two questions: under which conditions mesa-optimizers appear, and how aligned the mesa-objective is with the base objective of the initial optimization process.
One common criticism of mesa-optimizers is that their objective (the mesa-objective) is assumed to be explicitly represented in the system (see Richard Ngo’s post here, for example). This is a bit weird, and looks more like a simplifying assumption than a property expected in actual systems. Moreover, maybe some of the reasoning about mesa-optimizers still works when considering learned goal-directed systems instead, as proposed in this comment by Tom Everitt.
By deconfusing goal-directedness, we might find properties of goal-directed systems and use them for studying the previous question: does the reasoning from Risks from Learned Optimization go through when considering learned models with these properties, instead of learned explicit optimizers? Such an investigation could shift the focus of research on inner alignment, or reinforce the importance of internal optimization.
One especially worrying case of mesa-optimization is deceptive alignment: a mesa-optimizer with a different objective from the base objective, but which is competent enough to deceive us into believing that it is pursuing the base objective. This can take many forms, from behaving nicely during training and defecting when put in the world, up to more outlandish ideas like gradient hacking.
The foremost proposals for dealing with deceptive alignment rely on relaxed adversarial training: using the overseer feedback (on the risk of catastrophic behavior) as part of the training signal. But getting this overseer feedback right, notably the detection of deceptiveness, proves difficult.
One possibility is to use myopia. Intuitively, myopia is supposed to capture the property that a system only makes short-term plans. Then the hope is that deceptive systems should probably be non-myopic. Thus we could detect non-myopia instead of deception, which is hopefully easier, and get the overseer feedback necessary for relaxed adversarial training.
Now, defining myopia is an open research problem -- see these two posts for pointers. Yet there is an interesting connection with goal-directedness: multiple researchers consider long-term goals as an important component of goal-directedness. Hence disentangling the various intuitions about goal-directedness could help deconfuse the idea of long-term goals, which in turn would help tremendously for deconfusing myopia.
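As a toy illustration of the intuition only (not a definition, since that remains open), one crude proxy for myopia is the planning horizon of an agent. The sketch below assumes a hypothetical four-state environment where "investing" pays nothing now but unlocks a large reward later; the environment, the action names and the `plan` function are all made up for this example.

```python
ACTIONS = ("cash_in", "invest")


def step(state, action):
    """Hypothetical deterministic environment: returns (next_state, reward)."""
    if action == "invest":
        return min(state + 1, 3), 0.0  # no reward now, progress towards state 3
    return state, (10.0 if state == 3 else 1.0)  # cashing in pays more at state 3


def best_value(state, horizon):
    """Best total reward reachable from `state` in `horizon` steps."""
    if horizon == 0:
        return 0.0
    return max(reward + best_value(nxt, horizon - 1)
               for nxt, reward in (step(state, a) for a in ACTIONS))


def plan(state, horizon):
    """Pick the first action of an optimal `horizon`-step plan."""
    def value(action):
        nxt, reward = step(state, action)
        return reward + best_value(nxt, horizon - 1)
    return max(ACTIONS, key=value)


print(plan(0, horizon=1))  # cash_in: the one-step planner takes the immediate reward
print(plan(0, horizon=5))  # invest: the longer-horizon planner sacrifices reward now
```

The one-step planner never accepts a short-term loss for a later gain, which is the kind of behavior that a deceptive system waiting for deployment would seem to need. Whether a horizon-based notion like this captures what matters about myopia is part of what deconfusion would have to settle.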
Additional Structure on Utility/Reward Functions
Reasons in this section apply to a broader range of alignment proposals. Their common thread is to assume that utility functions or reward functions are used to capture goals and values.
The Safety Team at DeepMind wrote many different papers on agent incentives; specifically, on observation and intervention incentives that come from having a specific goal. Assuming a causal graph of the system and a goal, graphical criteria exist to find which nodes would be useful to monitor (observation incentives), and which nodes would be useful to control (intervention incentives). For goals, these papers consider controlling a utility node in the causal graph. That is, this research places itself within the framework of expected utility maximization.
As mentioned before, utility functions look too general to capture exactly what we mean by goals: every system can be seen as maximizing some utility function, even those intuitively not goal-directed. Deconfusing goal-directedness might allow the derivation of more structure for goals, which could be applied to these utility functions. The goals studied in this approach would then model more closely those of actual goal-directed systems, allowing in turn the derivation of incentives for more concrete and practical settings.
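The claim that every system can be seen as maximizing some utility function can be illustrated with a minimal sketch. It assumes a deterministic policy over a small finite set of states and actions, and the names `rationalizing_utility`, `is_optimal` and `twitch` are hypothetical, introduced only for this example.

```python
def rationalizing_utility(policy):
    """Build a utility function over (state, action) pairs that `policy` maximizes.

    It gives utility 1 to whatever the policy does and 0 to everything else,
    so it carries no information about whether the policy "has a goal".
    """
    def utility(state, action):
        return 1.0 if policy(state) == action else 0.0
    return utility


def is_optimal(policy, utility, states, actions):
    """Check that the policy picks a utility-maximizing action in every state."""
    return all(
        utility(s, policy(s)) == max(utility(s, a) for a in actions)
        for s in states
    )


states = range(10)
actions = ["left", "right", "stay"]


def twitch(state):
    """A policy with no intuitive goal: an arbitrary function of the state."""
    return actions[(state * 7 + 3) % len(actions)]


u = rationalizing_utility(twitch)
print(is_optimal(twitch, u, states, actions))  # True
```

Since this construction works for any policy, "maximizes a utility function" cannot by itself distinguish goal-directed systems from the others; additional structure on the utility function would have to rule out degenerate cases like `u`.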
Value Learning is a pretty broad idea, which boils down to learning what we don’t want the AI to mess up (our values), instead of trying to formalize them ourselves. This includes the reward modeling agenda at DeepMind, work on Cooperative Inverse Reinforcement Learning and Inverse Reward Design at CHAI, Stuart Armstrong’s research agenda and G Gordon Worley III’s research agenda, among others.
For all of these, the main value of deconfusing goal-directedness is the same: learning values usually takes the form of learning a utility function or a reward function, that is, something similar to a goal. And values probably share much of the structure of goals. Such structure could be added to utility functions or reward functions to model values, if we had a better understanding of goal-directedness.
Impact measures provide metrics for the impact of specific actions, notably catastrophic impact. Such an impact measure can be used to ensure that even a possibly misaligned AI will not completely destroy all value (for us) on Earth and the universe. There are many different impact measures, but I’ll focus on Alex Turner’s Attainable Utility Preservation (AUP), which is the one I know best and the one which has been discussed the most in recent years.
Attainable Utility Preservation ensures that the attainable utilities (how much value can be reached) for a wide range of goals (reward/utility functions) stay the same or improve after each action of the AI. This should notably remove the incentives for power-seeking, and thus many of the catastrophic unaligned behaviors of AI (while not solving alignment itself).
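As a rough sketch of how such a penalty could be computed, the code below assumes that attainable utility is measured by Q-values for a set of auxiliary goals, and that the penalty is the average absolute change in those Q-values compared to a no-op action. Both choices are assumptions of this sketch rather than something stated above, and the Q-values, action names and the `lam` scaling parameter are hypothetical. A variant that only penalizes decreases would replace `abs(...)` with `max(0.0, ...)` on the loss.

```python
def aup_penalty(q_aux, state, action, noop):
    """Average absolute change in attainable utility relative to doing nothing.

    q_aux: list of functions q(state, action) -> float, one per auxiliary goal
           (hypothetical stand-ins for learned or computed Q-values).
    """
    changes = [abs(q(state, action) - q(state, noop)) for q in q_aux]
    return sum(changes) / len(changes)


def aup_reward(reward, q_aux, state, action, noop, lam=1.0):
    """Task reward minus the scaled attainable-utility penalty."""
    return reward(state, action) - lam * aup_penalty(q_aux, state, action, noop)


# Toy numbers: "grab_power" raises attainable utility for every auxiliary goal.
q1 = lambda s, a: {"noop": 1.0, "grab_power": 3.0, "tidy": 1.0}[a]
q2 = lambda s, a: {"noop": 2.0, "grab_power": 5.0, "tidy": 2.5}[a]
task_reward = lambda s, a: {"noop": 0.0, "grab_power": 2.0, "tidy": 1.0}[a]

scores = {a: aup_reward(task_reward, [q1, q2], "s0", a, "noop")
          for a in ("noop", "grab_power", "tidy")}
print(scores)  # {'noop': 0.0, 'grab_power': -0.5, 'tidy': 0.75}
```

With these made-up numbers, the action with the highest raw task reward ("grab_power") ends up with the lowest penalized reward, because it changes what is attainable for the auxiliary goals the most.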
You guessed it, here too the value of goal-directedness comes from defining goals with more structure than simple utility or reward functions. Among other things, this might help extend AUP to more realistic environments.
Natural Mathematical Abstraction
Lastly, these reasons concern the Agent Foundations part of AI alignment research. They thus assume a focus on formalization, with applications to practical problems of alignment.
Mathematical Theory of RL and Alignment
Vanessa Kosoy from MIRI has been the main proponent of the creation of a mathematical theory of RL and alignment. Her point of view focuses on deriving formal guarantees about alignment in a learning theoretic setting, and this requires a theory of RL dealing with issues like non-realizability and traps.
Such guarantees will probably depend on the goal-directedness of the system, as different levels of goal-directedness should produce different behaviors. So knowing how to capture these levels will ground the dependency of the guarantees on it.
(Note that Vanessa already has her own definition of goal-directed intelligence, which doesn’t seem to completely deconfuse goal-directedness, but may be sufficient for her research).
Embedded Agency is a broad class of research directions that focus on dealing with theoretical issues linked to embeddedness -- the fact that the AI inhabits the world on which it acts, as opposed to dualistic models in which the AI and the environment are cleanly separated. The original research agenda carves out four subproblems: Decision Theory, Embedded World Models, Robust Delegation and Subsystem Alignment. I’ll focus on Embedded World Models, which has the clearest ties to goal-directedness. That being said, the others might have some links -- for example Subsystem Alignment is very close to Inner Alignment and Deceptive Alignment, which I already mentioned.
The Embedded World Models subproblem asks specifically how to represent the world as a model inside the agent. Trouble comes from self-reference: since the agent is part of the world, so is its model, and thus a perfect model would need to represent itself, and this representation would need to represent itself, ad infinitum. So the model cannot be exact. Another issue comes from the lack of a hardcoded agent/environment boundary: the model needs to add it in some way.
Understanding goal-directedness would hopefully provide a representation of systems with goals in a compressed way. This helps both with the necessary imprecision of the map (notably because the AI can model itself this way) and to draw a line between such systems and the complex world they inhabit.
John S. Wentworth’s research on abstraction centers around one aspect of Embedded World Models: what can be thrown out of the perfect model to get a simpler non-self-referential model (an abstraction) that is useful for a specific purpose?
Using goal-directedness for modelling systems in a compressed way is an example of a natural abstraction. Searching for a definition of goal-directedness is thus directly relevant to abstraction research, both because of its potential usefulness for building abstractions, and because it’s such a fundamental abstraction that it might teach us some lessons on how to define, study and use abstractions in general.
To summarize, for a broad range of research agendas and approaches, deconfusing goal-directedness is at least partially relevant, and sometimes really important. The reasons behind that statement fit into three categories:
- Helping an overseer to check for issues during training.
- Adding structure to utility functions/reward functions to make them behave more like goals.
- Abstracting many important systems into a compressed form.
So you should probably care about goal-directedness; even without working on it, taking stock of what has been done on this question might impact your research.
The next post in this sequence lays the groundwork for such considerations, by reviewing the literature on goal-directedness: the intuitions behind it, the proposed definitions, and the debates over the shape of a good solution to the problem.