I will shortly be starting a project (funded by LTFF) under the supervision of Adam Shimi, with the goal of deconfusing goal-directedness. I have decided to post biweekly progress updates here on LW, both to test whether this helps keep me accountable and to document the process.
Before the project begins, I want to record my baseline beliefs about goal-directedness. I'm doing this partly to see how my beliefs change through the research process and partly just to get my thoughts in order.
Existing ideas on goal-directedness
Adam Shimi has thought a lot about this topic. His literature review with Michele Campolo and Joe Collman collects ideas and intuitions about goal-directedness, some criteria against which any proposed definition of goal-directedness can be compared, and finally a set of proposed definitions from various sources. The definitions cut down my initial workload significantly, especially Vanessa Kosoy's definition of goal-directed intelligence, since it gives me a formal criterion in terms of numerical values which might be feasible to actually implement and compute in toy examples.
The aim of my project is not simply to pit these definitions against one another in examples, though: I want to do some serious stress-testing of the concept itself, and maybe eventually bring some serious maths to bear on these definitions. Some questions include:
- How do (different formalizations of) goal-directedness vary as different aspects of the agent's behaviour and environment change? Expectation: a notion of goal-directedness which is too sensitive to small changes will not be useful, so this may be a criterion for refining definitions of goal-directedness (see next section). Generally speaking, I expect goal-directed behaviour to become increasingly rare and therefore more identifiable as the complexity of the environment and behaviour grow.
- What happens in extreme cases? Expectation: extreme simplicity is the easiest to think about, and here goal-directedness collapses. For a very simple system, many explanations adequately account for behaviour: it's roughly as simple to say that a ball falls because it "wants" to accelerate towards the ground as to simply describe the falling motion. Generally speaking, I expect goal-directedness to be the extreme end of a spectrum whose range grows as complexity increases, but it will be interesting to explore how this spectrum varies along the different dimensions in which complexity can increase.
- Are the intuitions surrounding goal-directedness even consistent? Expectation: there are bound to be some speculations which won't hold up under a quantitative treatment. I also expect there is some tension which results from the assumption that goal-directedness should be a binary property, which I do not think is sensible. Since I think the spectrum of goal-directedness collapses in the extreme of simple behaviour, I don't expect simple examples such as Rohin Shah's twitching robot to cause any problems.
- Is there a polar opposite to goal-directedness? Even if not, are there other types of extreme behaviour that are comparable to goal-directedness? Expectation: purely random behaviour will most likely be at an extreme of the goal-directedness spectrum, but this of course depends on what "purely random" is taken to mean...
What do I think it is?
My own intuition about goal-directedness aligns quite closely with the ideas surrounding "explainability" in Adam's post above: an agent is goal-directed to the extent that its behaviour can be explained more efficiently/simply in terms of its goals than in terms of any explicit/mechanistic description.
While this might sound straightforward, any thought experiment that one tries with this idea raises potential problems. The very notion of an explanation is highly dependent on the available information about the world and the agent. Think how much easier it became to explain the motion of falling objects once the notion of gravity was discovered! This illustrates the possibility that revealing information about the world might reveal or break apparent goal-directedness. At a conceptual level, it also illustrates how metaphysical problems surrounding objects and their properties might change which explanations are the most complicated. For example, if several objects are apparently identical but an agent collects only some of them, apparently at random, we can get a much simpler goal-based description if we allow the explanation to include a hidden property which only some of the objects possess. One proposed definition of goal-directedness implicitly does this, with the assigned hidden property being an (unknown) numerical utility value on the states of a Markov Decision Process (MDP).
Ideally, I hope to find a quantitative definition of goal-directedness (which can be implemented in at least one standard model, such as MDPs) which is not binary. Regarding the issues above, I expect that this definition should be approximately continuous: it will not vary too greatly with small changes in the environment, information, or behaviour. That is, I'm hoping that the notion of goal-directedness is actually a robust one, and that the above are examples where the extra details actually don't tip the balance very much.
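To make the "explainability" intuition concrete, here is a minimal sketch of what a non-binary score could look like in a toy setting. Everything in it is an assumption made for illustration, not one of the definitions from the literature review: the world is a deterministic one-dimensional chain of states, the agent's behaviour is a policy choosing a step of +1 or -1 in each state, and the hypothetical functions `table_bits`, `goal_bits` and `explainability_gap` use crude bit counts as a stand-in for "simplicity of explanation".

```python
import math

def table_bits(policy, n_actions=2):
    """Bits needed to write the policy as an explicit state -> action table."""
    return len(policy) * math.log2(n_actions)

def goal_bits(policy, goal, n_actions=2):
    """Bits needed to describe the policy as 'head for `goal`' plus exceptions.

    Assumption: heading for the goal on a chain means stepping +1 when left
    of it and -1 when right of it; the action taken at the goal is ignored.
    """
    n = len(policy)
    exceptions = sum(
        1
        for state, action in enumerate(policy)
        if state != goal and action != (1 if state < goal else -1)
    )
    # Cost: name the goal state, then name each deviating state and its action.
    return math.log2(n) + exceptions * (math.log2(n) + math.log2(n_actions))

def explainability_gap(policy):
    """Table cost minus the cheapest goal-based cost (positive = goal-like)."""
    best_goal_cost = min(goal_bits(policy, g) for g in range(len(policy)))
    return table_bits(policy) - best_goal_cost

# A policy that always walks towards state 10, and one that alternates.
directed = [1 if s < 10 else -1 for s in range(16)]
alternating = [1 if s % 2 == 0 else -1 for s in range(16)]

print(explainability_gap(directed))
print(explainability_gap(alternating))
```

Under these assumptions the directed policy scores 12 bits (the goal-based description is much shorter than the table), while the alternating policy scores -23 bits (every candidate goal needs seven exceptions, so the table is the better explanation). On a chain of only two states the gap can never exceed 1 bit, which matches the expectation that goal-directedness collapses for very simple systems. Flipping a single action of the directed policy lowers its score from 12 to 7 rather than destroying it, which is the kind of graceful degradation hoped for above, although how the score behaves in richer settings is exactly what remains to be investigated.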
Eventually I hope to probe some of the questions raised at the start of that literature review: does goal-directedness intrinsically create the same catastrophic problems attributed to insufficiently aligned optimizers?