In a short time I will be starting a project (funded by LTFF) under the supervision of Adam Shimi with the goal of deconfusing goal-directedness. I have decided to record the progress on this project on a biweekly basis here on LW, to test whether this helps to keep me accountable for making progress on my project, and to record the process.
Before the project begins, I want to record my baseline beliefs about goal-directedness. I'm doing this partly to see how my beliefs change through the research process and partly just to get my thoughts in order.
Existing ideas on goal-directedness
Adam Shimi has thought a lot about this topic. His literature review with Michele Campolo and Joe Collman assembles ideas and intuitions about goal-directedness, some criteria which any proposed definition of goal-directedness can be compared against, and finally an assembly of proposed definitions from various sources. The definitions cut down my initial workload significantly, especially Vanessa Kosoy's definition of goal-directed intelligence, since it gives me a formal criterion in terms of numerical values which might be feasible to actually implement and compute in toy examples.
The aim of my project is not simply to pit these definitions against one another in examples, though: I want to do some serious stress-testing of the concept itself, and maybe eventually bring some serious maths to bear on these definitions. Some questions include:
- How do (different formalizations of) goal-directedness vary as different aspects of the agent's behaviour and environment change? Expectation: a notion of goal-directedness which is too sensitive to small changes will not be useful, so this may be a criterion for refining definitions of goal-directedness (see next section). Generally speaking, I expect goal-directed behaviour to become increasingly rare, and therefore more identifiable, as the complexity of the environment and behaviour grows.
- What happens in extreme cases? Expectation: extreme simplicity is the easiest to think about, and here goal-directedness collapses. For a very simple system, many explanations adequately account for behaviour: it's roughly as simple to say that a ball falls because it "wants" to accelerate towards the ground as to simply describe the falling motion. Generally speaking, I expect goal-directedness to be the extreme end of a spectrum which grows as complexity increases, but it will be interesting to explore how this spectrum varies along the dimensions along which complexity can increase.
- Are the intuitions surrounding goal-directedness even consistent? Expectation: there are bound to be some speculations which won't hold up under a quantitative treatment. I also expect there is some tension which results from the assumption that goal-directedness should be a binary property, which I do not think is sensible. Since I think the spectrum of goal-directedness collapses in the extreme of simple behaviour, I don't expect simple examples such as Rohin Shah's twitching robot to cause any problems.
- Is there a polar opposite to goal-directedness? Even if not, are there other types of extreme behaviour that are comparable to goal-directedness? Expectation: purely random behaviour will most likely be at an extreme of the goal-directedness spectrum, but this of course depends on what "purely random" is taken to mean...
What do I think it is?
My own intuition about goal-directedness aligns quite closely with the ideas surrounding "explainability" in Adam's post above: an agent is goal-directed to the extent that its behaviour can be explained more efficiently/simply in terms of its goals than in terms of any explicit/mechanistic description.
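To make the comparison of explanations concrete, here is a minimal sketch in Python. It is an illustration under strong assumptions, not a proposed definition: the world is a line of states with two actions, and the names `toward`, `mechanistic_bits`, `goal_bits` and `goal_directedness`, along with the crude bit-counting scheme, are hypothetical choices made for this example only. The mechanistic explanation is a lookup table costing one bit per state; the goal-based explanation names a goal state and then lists every state where the behaviour deviates from "move toward the goal".

```python
import math

def toward(goal, state):
    """Action (+1 or -1) that moves `state` closer to `goal` on a line."""
    return 1 if state < goal else -1

def mechanistic_bits(policy):
    # Assumed encoding: a lookup table, one bit per state (two actions).
    return float(len(policy))

def goal_bits(policy, goal):
    # Assumed encoding: log2(n) bits to name the goal, plus log2(n) bits
    # to name each state where the policy does not move toward the goal.
    # The action taken at the goal state itself is ignored.
    n = len(policy)
    index_bits = math.log2(n)
    exceptions = sum(
        1 for s, a in enumerate(policy) if s != goal and a != toward(goal, s)
    )
    return index_bits * (1 + exceptions)

def goal_directedness(policy):
    # Bits saved by the best goal-based explanation over the lookup table.
    # Positive means the goal story is shorter; negative means it is longer.
    best = min(goal_bits(policy, g) for g in range(len(policy)))
    return mechanistic_bits(policy) - best

n = 16
seeker = [toward(10, s) for s in range(n)]                  # always heads for state 10
alternating = [1 if s % 2 == 0 else -1 for s in range(n)]   # no consistent target

print(goal_directedness(seeker))
print(goal_directedness(alternating))
```

The score is a number rather than a yes/no answer, and it depends heavily on the chosen encodings, which is exactly the dependence on available information discussed in the next paragraph.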
While this might sound straightforward, any thought experiment that one tries with this idea raises potential problems. The very notion of an explanation is highly dependent on the available information about the world and the agent. Think how much easier it became to explain the motion of falling objects once the notion of gravity was discovered! This illustrates the possibility that by revealing information about the world, apparent goal-directedness might be revealed or broken. At a conceptual level, it also illustrates how metaphysical problems surrounding objects and their properties might change which explanations are the most complicated. For example, if several objects are apparently identical but an agent collects only some of them apparently at random, we can get a much simpler goal-based description if we allow the explanation to include a hidden property which only some of the objects possess. One proposed definition of goal-directedness implicitly does this, with the assigned hidden property an (unknown) numerical utility value on the states of a Markov Decision Process (MDP).
Ideally, I hope to find a quantitative definition of goal-directedness (which can be implemented in at least one standard model, such as MDPs) which is not binary. Regarding the issues above, I expect that this definition should be approximately continuous: it will not vary too greatly with small changes in the environment, information, or behaviour. That is, I'm hoping that the notion of goal-directedness is actually a robust one, and that the above are examples where the extra details actually don't tip the balance very much.
Eventually I hope to probe some of the questions raised at the start of that literature review: does goal-directedness intrinsically create the same catastrophic problems attributed to insufficiently aligned optimizers?
I'm very interested to see what you discover or what framework for deconfusion you arrive at with this research project.
My own take on goal-directedness, to a first approximation, is that it is the property of a system that allows it to consistently arrive at a narrow region of state space from a broad range of starting points. The tighter the distribution of steady states, and the wider the distribution of starting states that allow the system to reach them, the more goal-directed it is. A system where a ball is always rolling down to the center of a basin could be considered more goal-directed than a ball rolling around a random landscape, for instance.
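As a rough sketch of how this could be measured, assuming a one-dimensional state and deterministic dynamics: compare the spread of starting states with the spread of the states the system ends up in. The function names, the step sizes, and the choice of the log ratio of standard deviations as the score are all illustrative assumptions, and a bumpy landscape with several wells stands in here for the "random landscape".

```python
import math
import statistics

def basin_step(x):
    # Ball in a single bowl: gradient descent on V(x) = x**2 / 2, step 0.2.
    return x - 0.2 * x

def bumpy_step(x):
    # Ball on a landscape with several wells (minima at multiples of 0.5):
    # gradient descent on V(x) = -cos(4*pi*x) / (4*pi), step 0.05.
    return x - 0.05 * math.sin(4 * math.pi * x)

def funnel_score(step, starts, n_steps=50):
    """Log ratio of start spread to end spread; higher means tighter funnelling."""
    ends = []
    for x in starts:
        for _ in range(n_steps):
            x = step(x)
        ends.append(x)
    start_spread = statistics.pstdev(starts)
    end_spread = max(statistics.pstdev(ends), 1e-12)  # guard against log of zero
    return math.log(start_spread / end_spread)

# Broad range of starting points: 41 evenly spaced positions in [-1, 1].
starts = [-1 + 0.05 * k for k in range(41)]

basin_score = funnel_score(basin_step, starts)
bumpy_score = funnel_score(bumpy_step, starts)
print(basin_score, bumpy_score)
```

The single basin scores much higher than the multi-well landscape, because every start is drawn to the same point rather than to whichever well happens to be nearest. Note that this score only measures where the system ends up, so it does not by itself separate goal-directedness from competence.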
These goal states could exist within the agent itself (i.e., homeostatic set points) or out in the external environment (e.g., states that maximize attainable utility, like collecting resources). They could also be represented either explicitly as patterns within the agent's mental model or implicitly within the structure of the agent's policy functions.
Another dimension to this could be the ability to avoid unexpected states that would prevent the achievement of goals (e.g., avoid predators or move around obstacles), or the ability to select actions, either choosing among multiple narrow policies in pursuit of a single goal or choosing among multiple goals in pursuit of utility (or a meta-goal).
Thanks for the ideas!
I like the idea about the size of the target states; there's bound to be some interesting measure theory that I can apply if I decide to formalize in that direction. In fact, measure theory might be able to clarify some of the subtleties I alluded to above regarding what happens when we refine the world model (for example, in a way that causes a single goal state to split into two or more).
There are hints in your last paragraph of associating competence with goal-directedness, which I think is an association to avoid. For example, when a zebra is swimming across a river as fast as it can, I would like the extent to which that behaviour is considered goal-directed to be independent of whether that zebra is the one that gets attacked by a crocodile.
Maybe I could try to disentangle competence from goal-directedness in what I wrote. The main idea that I was trying to push in that paragraph is that there is more to goal-directed behavior in real animals than just movement toward a goal state. There is also (attempted) movement away from anti-goal states and around obstacle states.
An example of the former could be a zebra seeing a bunch of crocodiles congregated by the bank of the Nile and deciding not to cross the river today (unfortunately, it later got chased down and eaten by a lion due to the zebra's incompetence at evading all anti-goal states).
An example of the latter could be a golfer veering his swing slightly to the right to avoid the sand traps on the left (unfortunately, the ball ended up landing in the pond instead due to the golfer's incompetence at avoiding all obstacle states).
Anti-goals and obstacles act as repulsor states, complementing the attractor states known as goals, redirecting the flow of behavior to maximize the chances of survival and of reaching the actual goals.
As to the latter part of that paragraph, I think policy-selection for single goals and goal-selection more generally are important for enabling systems to exhibit flexible behavior. Someone in a recent thread (https://www.lesswrong.com/posts/3L46WGauGpr7nYubu/the-plan?commentId=nLCpJnxfaPzKXsbE2) brought up some interesting research on goal selection (more like goal pruning) in animals that could be worth looking into.