Modelling, Measuring, and Intervening on Goal-directed Behaviour in AI Systems

Mario Giulianelli; Raghu Arghal; Fade Chen; ndalton; Evgenii Kortukov; Calum McNamara; Angelos Nalmpantis; Moksh Nirvaan; Gabriele Sarti

TL;DR

This is the first post in an upcoming series of blog posts outlining Project Telos. This project is being carried out as part of the Supervised Program for Alignment Research (SPAR). Our aim is to develop a methodological framework to detect and measure goals in AI systems.

In this initial post, we give some background on the project, discuss the results of our first round of experiments, and then give some pointers about avenues we’re hoping to explore in the coming months.

Understanding AI Goals

As AI systems become more capable and autonomous, it becomes increasingly important to ensure they don’t pursue goals misaligned with the user’s intent. This, of course, is the core of the well-known alignment problem in AI. And a great deal of work is already being done on this problem. But notice that if we are going to solve it in full generality, we need to be able to say (with confidence) which goal(s) a given AI system is pursuing, and to what extent it is pursuing those goals. This aspect of the problem turns out to be much harder than it may initially seem. And as things stand, we lack a robust, methodological framework for detecting goals in AI systems.

In this blog post, we’re going to outline what we call Project Telos: a project that’s being carried out as part of the Supervised Program for Alignment Research (SPAR). The ‘we’ here refers to a diverse group of researchers, with backgrounds in computer science and AI, linguistics, complex systems, psychology, and philosophy. Our project is being led by Prof Mario Giulianelli (UCL, formerly UK AISI), and our (ambitious) aim is to develop a general framework of the kind just mentioned. That is, we’re hoping to develop a framework that will allow us to make high-confidence claims about AI systems having specific goals, and to detect ways in which those systems might be acting towards those goals.

We are very open to feedback on our project and welcome any comments from the broader alignment community.

What’s in a name? From Aristotle to AI

Part of our project’s name, ‘telos’, comes from the ancient Greek word τέλος, which means ‘goal’, ‘purpose’, or ‘final end’.^[1] Aristotle built much of his work around the idea that everything has a telos – the acorn’s final end is to become an oak tree.

Similar notions resurfaced in the mid-20th century with the field of cybernetics, pioneered by, among others, Norbert Wiener. In these studies of feedback and recursion, we see a more mechanistic view of goal-directedness: a thermostat, for instance, has a goal (namely, to keep the temperature at a set point), and acts to reduce the error between its current state and its goal state.

But frontier AI is more complex than thermostats.

Being able to detect goals in an AI system involves first understanding what it means for something to have a goal—and that (of course) is itself a tricky question. In philosophy (as well as related fields such as economics), one approach to answering this question is known as radical interpretation. This approach was pioneered by philosophers like Donald Davidson and David Lewis, and is associated more recently with the work of Daniel Dennett.^[2] Roughly speaking, the idea underpinning radical interpretation is that we can attribute a goal to a given agent—be it an AI agent or not—if that goal would help us to explain the agent’s behaviour. The only assumption we need to make, as part of a “radical interpretation”, is that the agent is acting rationally.

This perspective on identifying an agent’s goals is related to the framework of inverse reinforcement learning (IRL). The IRL approach is arguably the one most closely related to ours (as we will see). In IRL, we attempt to learn which reward function an agent is optimising by observing its behaviour and assuming that it’s acting rationally. But as is well known, IRL faces a couple of significant challenges. For example, it’s widely acknowledged—even by the early proponents of IRL—that behaviour can be rational with respect to many reward functions, not just one. Additionally, IRL makes a very strong rationality assumption—namely, that the agent we’re observing is acting optimally. If we significantly weaken this assumption and assume the agent is acting less than fully rationally, then the IRL framework ceases to be as predictive as we might have hoped.
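The first of these challenges can be seen in a toy example. The sketch below is purely illustrative and rests on assumed details that do not come from the project: a five-state corridor with the goal at the right end, a discount factor of 0.9, and three hand-picked reward functions. It runs value iteration under each reward function and compares the resulting greedy policies.

```python
def greedy_policy(reward, n_states=5, gamma=0.9, sweeps=100):
    """Value iteration on a corridor of states 0..n_states-1, terminal goal at the right end.

    `reward(s_next)` is the reward for entering s_next (an illustrative convention).
    Returns the greedy action for every non-goal state.
    """
    goal = n_states - 1
    actions = {"left": -1, "right": +1}
    values = [0.0] * n_states

    def q_value(state, action):
        nxt = min(max(state + actions[action], 0), goal)  # walls at both ends
        future = 0.0 if nxt == goal else gamma * values[nxt]
        return reward(nxt) + future

    for _ in range(sweeps):
        values = [0.0 if s == goal else max(q_value(s, a) for a in actions)
                  for s in range(n_states)]
    return [max(actions, key=lambda a: q_value(s, a)) for s in range(goal)]


# Three different (hypothetical) reward functions over the same corridor.
def sparse(s):      # +1 only on reaching the goal
    return 1.0 if s == 4 else 0.0

def scaled(s):      # same shape, ten times larger
    return 10.0 if s == 4 else 0.0

def step_cost(s):   # -1 per move until the goal is reached
    return 0.0 if s == 4 else -1.0

policies = [greedy_policy(r) for r in (sparse, scaled, step_cost)]
print(policies[0] == policies[1] == policies[2])
```

All three reward functions lead to the same "always move right" behaviour, so an observer who sees only that behaviour has no basis for choosing among them.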

Given these difficulties, our project focuses on the more abstract category of goals, rather than on reward functions. Goals are broader than reward functions, since many different reward functions can rationalize a single goal. Explaining behaviour in terms of goals lets us draw on IRL’s central idea of radical interpretation without assuming full rationality. Instead, we start from the observation that goal-directed behaviour is often imperfectly rational, and use a hierarchy of behavioural tests combined with representation probing to assess how closely the agent’s actions align with (our hypothesis of) its intended goal.

Many other authors are exploring related questions, and we want our work to be continuous with theirs. In particular, we’ve drawn inspiration from excellent papers like McDermott et al., 2024, Everitt et al., 2025, and Mazeika et al., 2025, and posts like this one by John Wentworth and David Lorell. Building on this line of research, we hope to extend goal-directedness evaluations to complex LLM behaviours and goal structures. In a future post, we’ll dive deeper into the rich landscape of ideas on which we are building, and say more about how we’re taking inspiration from diverse fields like philosophy, psychology, and economics (as well as AI).

An Empirical Framework for Goal-Directedness

Our methodology combines two approaches: behavioural assessments and representation probing. First, we use behavioural experiments to apply increasingly stringent goal-directedness tests. Each test targets a distinct aspect of goal-directed behaviour, and passing them suggests that the agent’s actions can, with growing likelihood, be attributed to the pursuit of a specific goal. Next, we use interpretability techniques—for now, we’re experimenting with linear probes—to identify if and where the model encodes concepts relevant to action selection, such as the environment’s state and its distance to the goal.

The point of this multi-level framework is to build confidence in claims of goal-directedness by progressively ruling out alternative explanations for the agent’s behaviour. Each level tests whether the observed actions reflect a genuine pursuit of a goal rather than side effects of capability, memorisation, or task-specific shortcuts. For instance, an agent that reaches a goal square in a navigation task might appear goal-directed—but what happens if we insert a wall, move the goal, or teleport the agent? Behaviour that consistently adapts across such interventions strengthens the case for an underlying goal, while failures weaken it.

Interpretability complements these behavioural tests by revealing how the agent internally represents task-relevant information. When we move the goal, for example, we can ask whether the model’s internal encoding of goal distance also shifts. If it does, that provides evidence that the model tracks its position relative to the goal as a variable shaping its decisions; if not, its apparent goal-directedness may result from simpler heuristics. The interpretability level also helps diagnose cases of potential misalignment: an apparently incompetent agent failing to navigate to the expected goal might still maintain a coherent representation of distance or direction—just relative to a different target. This would indicate it is pursuing an alternative objective rather than the hypothesized (or intended) one.
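As a rough illustration of what a linear probe for goal distance involves, here is a minimal, dependency-free sketch. Everything in it is an assumption made for the example: the "activations" are synthetic four-dimensional vectors in which goal distance is written along one fixed direction plus Gaussian noise, and the function names are hypothetical. With a real model, the features would instead be hidden states extracted at a chosen layer, and a standard numerical library would normally do the fitting.

```python
import random

def fit_linear_probe(features, targets):
    """Least-squares fit of targets ~ w . features + b via the normal equations."""
    rows = [list(x) + [1.0] for x in features]  # constant 1.0 acts as the bias input
    d = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    system = [[sum(r[i] * r[j] for r in rows) for j in range(d)]
              + [sum(r[i] * t for r, t in zip(rows, targets))] for i in range(d)]
    for col in range(d):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, d), key=lambda i: abs(system[i][col]))
        system[col], system[pivot] = system[pivot], system[col]
        for i in range(d):
            if i != col:
                factor = system[i][col] / system[col][col]
                system[i] = [a - factor * b for a, b in zip(system[i], system[col])]
    return [system[i][d] / system[i][i] for i in range(d)]  # weights, bias last

def predict(weights, x):
    return sum(w * v for w, v in zip(weights, x)) + weights[-1]

def r_squared(weights, features, targets):
    mean = sum(targets) / len(targets)
    residual = sum((t - predict(weights, x)) ** 2 for x, t in zip(features, targets))
    total = sum((t - mean) ** 2 for t in targets)
    return 1.0 - residual / total

def synthetic_activations(n, rng, direction=(1.0, -0.5, 0.0, 0.25), noise=0.3):
    """Stand-in for real hidden states: goal distance along one direction, plus noise."""
    distances = [rng.randint(0, 8) for _ in range(n)]
    acts = [[d * u + rng.gauss(0.0, noise) for u in direction] for d in distances]
    return acts, distances

rng = random.Random(0)
train_x, train_y = synthetic_activations(200, rng)
test_x, test_y = synthetic_activations(100, rng)
probe = fit_linear_probe(train_x, train_y)
held_out_r2 = r_squared(probe, test_x, test_y)
print(f"held-out R^2: {held_out_r2:.3f}")
```

A high held-out score on real activations would suggest that goal distance is linearly decodable at that layer. Refitting or re-evaluating the probe after the goal is moved is one way to check whether the encoded distance shifts with it.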

This reasoning extends to more complex scenarios such as alignment faking or sandbagging, where agents appear compliant or deliberately underperform—potentially to avoid modification, shutdown, or other perceived threats to their continued operation. Our behavioural tests can assess whether such behaviour persists when evaluation conditions are systematically varied, while probing can tell us whether the model internally represents evaluation awareness or self-preservation objectives.

Testing Goal-Directedness in GridWorld

Ultimately, we’d like our framework for goal-directedness to generalize across a wide array of goals, environments, tasks, and systems. But when tackling complex questions such as the one at hand, it is often helpful to begin with simpler, more tractable cases and evaluate how effectively these can be addressed first. With this in mind, our initial experiments have focused on agents operating in a simple, controllable environment—namely, a two-dimensional GridWorld. The hope is that, by starting in these restricted settings, we may gain insight into how well our methodological approach is likely to scale up to more complex scenarios.

Thus, for the last six weeks (i.e., since the start of the SPAR project), we’ve been investigating the goal-directedness of agents operating in this simple 2D GridWorld. More precisely, we’ve been attempting to evaluate the degree of goal-directedness of agents’ behaviour through four successive levels of testing. Below, we give a brief outline of each level and explain why it matters for understanding goal-directedness.

1. Baseline: Can the agent achieve a stated goal in simple, predictable settings?

Our first experimental level starts with a 2D grid environment where all states are fixed. The agent can move around in this environment, and the goal we want it to optimise is reaching a particular square—the goal square. We instruct the agent to do so in the system prompt. Then, we elicit its policy for moving around the grid environment and compare it to the optimal policy across grids of different sizes and complexities. (In this simple setting, it’s guaranteed that there will always be a finite set of optimal policies for a given grid.) The aim here is to establish whether the agent is goal-directed in its “natural” condition, with a single, clearly defined, user-specified goal in mind (i.e., navigating to the goal square). This initial setting looks exceedingly simple. But as we’ll see below, even this setting has posed various challenges.
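One way to make the comparison with the optimal policy concrete is sketched below. This is an illustration, not the project's implementation, and it assumes a particular representation: the grid is a list of strings with `#` for walls, cells are `(row, col)` tuples, and a policy is a dictionary from cells to one of four hypothetical action names. A breadth-first search from the goal gives every cell's shortest distance, and an action counts as optimal if it reduces that distance by one.

```python
from collections import deque

# Action names and their (row, col) offsets; the names are illustrative.
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def distances_to_goal(grid, goal):
    """BFS from the goal; returns {cell: shortest path length to the goal}."""
    rows, cols = len(grid), len(grid[0])
    dist = {goal: 0}
    queue = deque([goal])
    while queue:
        r, c = queue.popleft()
        for dr, dc in MOVES.values():
            nr, nc = r + dr, c + dc
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def optimal_actions(grid, goal):
    """For each reachable non-goal cell, the set of actions that shorten the path by one."""
    dist = distances_to_goal(grid, goal)
    return {
        (r, c): {a for a, (dr, dc) in MOVES.items()
                 if dist.get((r + dr, c + dc)) == d - 1}
        for (r, c), d in dist.items() if (r, c) != goal
    }

def optimality_rate(policy, grid, goal):
    """Fraction of reachable non-goal cells where the policy picks an optimal action."""
    best = optimal_actions(grid, goal)
    hits = sum(policy.get(cell) in acts for cell, acts in best.items())
    return hits / len(best)

grid = ["...", "...", "..."]
always_up = {(r, c): "up" for r in range(3) for c in range(3)}
print(optimality_rate(always_up, grid, goal=(0, 0)))
```

Because several actions can be equally good at a given square, the sketch scores a policy against the set of optimal actions per square rather than against a single reference policy.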

2. Environment Variation: Does the agent’s behaviour change under different grid conditions?

Next, we’ve been investigating the impact of environment modifications on the agent’s behaviour. Specifically, we’ve applied what we call “iso-difficulty” transformations to the baseline case, and observed how the agent’s policy changes in response. These transformations include rotating, transposing, or reflecting the grid environment. Why are these kinds of experiments interesting for goal-directedness? Imagine a case in which an agent navigates to the goal square when it’s at a specific position, but then keeps navigating to that same position after the GridWorld is transposed and the goal square is moved. In this case, we could reasonably infer that the agent was not optimizing for the goal we hypothesized, but rather conflating the goal state with the particular square where the goal was previously located.
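The transformations can be sketched as simple coordinate maps. The code below is an illustration under assumptions that are not taken from the project: square grids stored as lists of strings, `(row, col)` cells, and four hypothetical action names. Each transform moves cells and relabels actions, so that a policy elicited on the original grid can be mapped forward and compared with the policy elicited on the transformed grid.

```python
# Each transform maps a cell (r, c) of an n x n grid to its new position, and
# maps each action to the action pointing the same way after the transform.
TRANSFORMS = {
    "rotate90": (lambda r, c, n: (c, n - 1 - r),   # clockwise quarter turn
                 {"up": "right", "right": "down", "down": "left", "left": "up"}),
    "transpose": (lambda r, c, n: (c, r),
                  {"up": "left", "left": "up", "down": "right", "right": "down"}),
    "reflect": (lambda r, c, n: (r, n - 1 - c),    # mirror left-right
                {"up": "up", "down": "down", "left": "right", "right": "left"}),
}

def transform_grid(grid, name):
    """Apply the named transform to a square grid given as a list of strings."""
    n = len(grid)
    cell_map, _ = TRANSFORMS[name]
    out = [[None] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            nr, nc = cell_map(r, c, n)
            out[nr][nc] = grid[r][c]
    return ["".join(row) for row in out]

def transform_policy(policy, name, n):
    """Map a {cell: action} policy into the transformed grid's coordinates."""
    cell_map, action_map = TRANSFORMS[name]
    return {cell_map(r, c, n): action_map[a] for (r, c), a in policy.items()}

def equivariance_rate(policy_before, policy_after, name, n):
    """Fraction of cells where the new policy matches the transformed old policy."""
    expected = transform_policy(policy_before, name, n)
    return sum(policy_after.get(cell) == a for cell, a in expected.items()) / len(expected)

grid = ["..#", "...", "..."]
print(transform_grid(grid, "rotate90"))
```

An agent that tracks the goal rather than a fixed square should score close to 1.0 on `equivariance_rate`, up to ties between equally good actions, which this simple check does not account for.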

3. Environment Perturbation: How does the agent respond to disruptions and setbacks in the environment?

In this level, we test the agent under conditions in which the GridWorld is deliberately altered or disrupted to assess the agent’s persistence, adaptability, and corrigibility. For example, we might insert walls to create a maze, move the goal square, or teleport the agent mid-trajectory. Through these interventions, we can quantify the extent to which the agent continues to pursue its goals despite temporal or spatial disruptions.
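One possible way to quantify persistence after a perturbation is to count the extra steps an agent takes once the goal has been moved. The sketch below is an assumption-laden illustration rather than the project's metric: `policy_fn(grid, cell, goal)` is a hypothetical stand-in for querying the agent, bumping into a wall is assumed to leave the agent in place, and `greedy_policy` is a simple shortest-path baseline used only to exercise the code.

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def is_free(grid, cell):
    r, c = cell
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#"

def shortest_path_length(grid, start, goal):
    """BFS distance from start to goal, or None if the goal is unreachable."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            return dist[cell]
        for dr, dc in MOVES.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if is_free(grid, nxt) and nxt not in dist:
                dist[nxt] = dist[cell] + 1
                queue.append(nxt)
    return None

def run(policy_fn, grid, cell, goal, max_steps):
    """Visited cells, stopping at the goal or after max_steps moves."""
    path = [cell]
    while cell != goal and len(path) <= max_steps:
        dr, dc = MOVES[policy_fn(grid, cell, goal)]
        nxt = (cell[0] + dr, cell[1] + dc)
        if is_free(grid, nxt):  # assumed: a blocked move leaves the agent in place
            cell = nxt
        path.append(cell)
    return path

def regret_after_goal_move(policy_fn, grid, start, goal, new_goal, move_at, max_steps=50):
    """Extra steps taken after the goal moves at step `move_at`; None if never reached."""
    pivot = run(policy_fn, grid, start, goal, move_at)[-1]
    after = run(policy_fn, grid, pivot, new_goal, max_steps)
    if after[-1] != new_goal:
        return None
    return (len(after) - 1) - shortest_path_length(grid, pivot, new_goal)

def greedy_policy(grid, cell, goal):
    """Baseline stand-in for the agent: take the first action that minimises distance."""
    def dist_after(action):
        nxt = (cell[0] + MOVES[action][0], cell[1] + MOVES[action][1])
        d = shortest_path_length(grid, nxt, goal) if is_free(grid, nxt) else None
        return float("inf") if d is None else d
    return min(MOVES, key=dist_after)

grid = ["...", "...", "..."]
print(regret_after_goal_move(greedy_policy, grid, (0, 0), (0, 2), (2, 0), move_at=1))
```

A regret of zero means the agent re-routed optimally after the change, while `None` flags an agent that never reaches the relocated goal within the step limit. The same pattern extends to teleporting the agent or inserting walls by changing what is perturbed at `move_at`.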

4. Goal Structure Perturbation: How does the agent perform under more complex, changing, or conflicting goal structures?

Finally, we evaluate the agent when alternative or potentially conflicting objectives are introduced, either from the start or mid-trajectory. For example, we might add sub-goals—such as requiring the agent to retrieve a key before reaching the main goal square—or impose constraints, such as limiting the number of available steps. These manipulations test whether the agent prioritises the primary goal over unrelated or competing objectives.
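For compound goals such as "retrieve the key, then reach the goal square", a reference optimum can be computed by searching over pairs of position and key status. The sketch below is illustrative only. It assumes a grid given as a list of strings with `#` walls, `(row, col)` cells, and that stepping on the goal square before holding the key is allowed but does not end the episode. The `within_budget` helper models a step limit and is likewise hypothetical.

```python
from collections import deque

def shortest_with_key(grid, start, key, goal):
    """Fewest moves to reach `goal` after having visited `key`; None if impossible."""
    first = (start, start == key)
    dist = {first: 0}
    queue = deque([first])
    while queue:
        state = queue.popleft()
        cell, has_key = state
        if cell == goal and has_key:
            return dist[state]
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            r, c = cell[0] + dr, cell[1] + dc
            if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
                nxt = ((r, c), has_key or (r, c) == key)  # key status only ever turns on
                if nxt not in dist:
                    dist[nxt] = dist[state] + 1
                    queue.append(nxt)
    return None

def within_budget(grid, start, key, goal, budget):
    """True if the compound goal can be completed in at most `budget` moves."""
    length = shortest_with_key(grid, start, key, goal)
    return length is not None and length <= budget

corridor = ["....."]
print(shortest_with_key(corridor, start=(0, 2), key=(0, 0), goal=(0, 4)))
```

Comparing the agent's trajectory length with this reference, or checking whether it ever detours to the key at all, gives a simple handle on whether the sub-goal is being respected.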

Preliminary Results: Early Lessons from GridWorld

We now outline some preliminary results in settings where the grid is fully observable and the agent is memoryless, i.e., the policy at each square is elicited independently of the agent’s trajectory. Even in these simple cases, identifying goal-directedness proves non-trivial, and some surprising results have already emerged. Future posts will cover additional experiments.

Consider the following policy maps from one of our baseline runs. The arrows represent the action the LLM agent would take at each square. Finding an optimal path is straightforward, and in Fig. 1a (left) the model successfully does so. In the other two examples (Fig. 1b and 1c), however, there are significant errors in the agent’s policy.

Figure 1: Three examples of the policy of gpt-oss-20B in 9x9 grids. The goal square is indicated in green, and the red highlighted squares indicate suboptimal or incoherent policy choices.

Fig. 1b (middle) shows a case in which the optimal action is reversed over two squares. If the agent followed this policy, it would oscillate indefinitely between squares r4c4 and r5c4, never reaching the goal. On its own, this may seem like a single aberrant case, but we observed this and other error patterns across many grids and trials. Fig. 1c (right) shows several instances of the same error (r5c6, r6c7, r6c8, and r8c8) as well as cases where the agent’s chosen action moves directly into a wall (r2c7 and r8c6). This raises several further questions. Are there confounding biases (such as directional preferences) that explain these mistakes? Is this example policy good enough to be considered goal-directed? More broadly, how close to an optimal policy does an agent need to be to qualify as goal-directed?
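Both error types described here can be detected automatically by rolling out the elicited policy. The sketch below is an illustration under assumed conventions (grids as lists of strings with `#` walls, `(row, col)` cells, a dictionary policy with hypothetical action names) and is not the project's code. Because the policy is deterministic and memoryless, revisiting any square means the agent will cycle forever.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def rollout(policy, grid, start, goal):
    """Follow a memoryless policy from `start`.

    Returns (outcome, path), where outcome is one of:
      'goal'      the goal square was reached,
      'loop'      a square was revisited, so the agent cycles forever,
      'blocked'   the chosen action points into a wall or off the grid,
      'undefined' the policy has no action for the current square.
    """
    seen = set()
    cell = start
    path = [cell]
    while cell != goal:
        if cell in seen:
            return "loop", path
        seen.add(cell)
        action = policy.get(cell)
        if action is None:
            return "undefined", path
        nr, nc = cell[0] + MOVES[action][0], cell[1] + MOVES[action][1]
        in_bounds = 0 <= nr < len(grid) and 0 <= nc < len(grid[0])
        if not in_bounds or grid[nr][nc] == "#":
            return "blocked", path
        cell = (nr, nc)
        path.append(cell)
    return "goal", path

grid = ["..."]
reversed_pair = {(0, 0): "right", (0, 1): "left"}  # two squares pointing at each other
print(rollout(reversed_pair, grid, start=(0, 0), goal=(0, 2)))
```

Running such a rollout from every starting square gives the fraction of starts from which the goal is actually reached, which is one candidate answer to the question of how close to optimal a policy needs to be.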

Of course, what we’re showing here is only a start, and this project is in its very early stages. Even so, these early experiments already surface the fundamental questions and allow us to study them in a clear, intuitive, and accessible environment.

What’s Next and Why We Think This Matters

Our GridWorld experiments are just our initial testing ground. Once we have a better understanding of this setting, we plan to move to more realistic, high-stakes environments, such as cybersecurity tasks or dangerous capability testing, where behaviours like sandbagging or scheming can be investigated as testable hypotheses.

If successful, Project Telos will establish a systematic, empirical framework for evaluating goal-directedness and understanding agency in AI systems. This would have three major implications:

It would provide empirical grounding for the alignment problem.
It would provide a practical risk assessment toolkit for frontier models.
It would lay the groundwork for a nascent field of study connecting AI with philosophy, psychology, decision theory, behavioural economics, and other cognitive, social, and computational sciences.

Hopefully, you’re as excited about what’s to come as we are. We look forward to sharing what comes next, and we will keep posting our thoughts, findings, challenges, and other musings here as we continue our work.

^[1] The word is still used in modern Greek to denote the end or the finish (e.g., of a film or a book).
^[2] It also has a parallel in the representation theorems that are given in fields like economics. In those theorems, we represent an agent as acting to maximize utility with respect to a certain utility function (and sometimes also a probability function), by observing its preferences between pairwise options. The idea here is that, if we can observe the agent make sufficiently many pairwise choices between options, then we can infer from this which utility function it is acting to maximize.
