AGI safety from first principles: Goals and Agency

Richard_Ngo

The fundamental concern motivating the second species argument is that AIs will gain too much power over humans, and then use that power in ways we don’t endorse. Why might they end up with that power? I’ll distinguish three possibilities:

AIs pursue power for the sake of achieving other goals; i.e. power is an instrumental goal for them.
AIs pursue power for its own sake; i.e. power is a final goal for them.
AIs gain power without aiming towards it; e.g. because humans gave it to them.

The first possibility has been the focus of most debate so far, and I’ll spend most of this section discussing it. The second hasn’t been explored in much depth, but in my opinion is still important; I’ll cover it briefly in this section and the next. Following Christiano, I’ll call agents which fall into either of these first two categories influence-seeking. The third possibility is largely outside the scope of this document, which focuses on dangers from the intentional behaviour of advanced AIs, although I’ll briefly touch on it here and in the last section.

The key idea behind the first possibility is Bostrom’s instrumental convergence thesis, which states that there are some instrumental goals whose attainment would increase the chances of an agent’s final goals being realised for a wide range of final goals and a wide range of situations. Examples of such instrumentally convergent goals include self-preservation, resource acquisition, technological development, and self-improvement, which are all useful for executing further large-scale plans. I think these examples provide a good characterisation of the type of power I’m talking about, which will serve in place of a more explicit definition.

However, the link from instrumentally convergent goals to dangerous influence-seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals, and which identify and pursue those instrumental goals even when it leads to extreme outcomes (a set of traits which I’ll call goal-directed agency). It’s not yet clear that AGIs will be this type of agent, or have this type of goals. It seems very intuitive that they will because we all have experience of pursuing instrumentally convergent goals, for example by earning and saving money, and can imagine how much better we’d be at them if we were more intelligent. Yet since evolution has ingrained in us many useful short-term drives (in particular the drive towards power itself), it’s difficult to determine the extent to which human influence-seeking behaviour is caused by us reasoning about its instrumental usefulness towards larger-scale goals. Our conquest of the world didn’t require any humans to strategise over the timeframe of centuries, but merely for many individuals to expand their personal influence in a relatively limited way - by inventing a slightly better tool, or exploring slightly further afield.

Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals. We can imagine them possessing final goals which don’t incentivise the pursuit of power, such as deontological goals, or small-scale goals. Or perhaps we’ll build “tool AIs” which obey our instructions very well without possessing goals of their own - in a similar way to how a calculator doesn’t “want” to answer arithmetic questions, but just does the calculations it’s given. In order to figure out which of these options is possible or likely, we need to better understand the nature of goals and goal-directed agency. That’s the focus of this section.

Frameworks for thinking about agency

To begin, it’s crucial to distinguish between the goals which an agent has been selected or designed to do well at (which I’ll call its design objectives), and the goals which an agent itself wants to achieve (which I’ll just call “the agent’s goals”).^[1] For example, insects can contribute to complex hierarchical societies only because evolution gave them the instincts required to do so: to have “competence without comprehension”, in Dennett’s terminology. This term also describes current image classifiers and (probably) RL agents like AlphaStar and OpenAI Five: they can be competent at achieving their design objectives without understanding what those objectives are, or how their actions will help achieve them. If we create agents whose design objective is to accumulate power, but without the agent itself having the goal of doing so (e.g. an agent which plays the stock market very well without understanding how that impacts society) that would qualify as the third possibility outlined above.

By contrast, in this section I’m interested in what it means for an agent to have a goal of its own. Three existing frameworks which attempt to answer this question are Von Neumann and Morgenstern’s expected utility maximisation, Daniel Dennett’s intentional stance, and Hubinger et al’s mesa-optimisation. I don’t think any of them adequately characterises the type of goal-directed behaviour we want to understand, though. While we can prove elegant theoretical results about utility functions, they are such a broad formalism that practically any behaviour can be described as maximising some utility function. So this framework doesn’t constrain our expectations about powerful AGIs.^[2] Meanwhile, Dennett argues that taking the intentional stance towards systems can be useful for making predictions about them - but this only works given prior knowledge about what goals they’re most likely to have. Predicting the behaviour of a trillion-parameter neural network is very different from applying the intentional stance to existing artifacts. And while we do have an intuitive understanding of complex human goals and how they translate to behaviour, the extent to which it’s reasonable to extend those beliefs about goal-directed cognition to artificial intelligences is the very question we need a theory of agency to answer. So while Dennett’s framework provides some valuable insights - in particular, that assigning agency to a system is a modelling choice which only applies at certain levels of abstraction - I think it fails to reduce agency to simpler and more tractable concepts.
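To make the worry about utility functions concrete, here is a minimal Python sketch (not taken from any of the cited frameworks) of how an arbitrary policy can be "explained" as utility maximisation after the fact. The names `rationalising_utility`, `maximise` and `twitchy_policy` are hypothetical, introduced only for this illustration, and the setup assumes a small discrete action set.

```python
from typing import Callable, Hashable, Sequence

Action = Hashable
History = tuple  # the sequence of observations seen so far


def rationalising_utility(policy: Callable[[History], Action]) -> Callable[[History, Action], float]:
    """Build a utility function that the given policy trivially maximises.

    Utility is 1.0 for whatever action the policy would take after a given
    history, and 0.0 for every other action.
    """
    def utility(history: History, action: Action) -> float:
        return 1.0 if action == policy(history) else 0.0
    return utility


def maximise(utility: Callable[[History, Action], float],
             history: History,
             actions: Sequence[Action]) -> Action:
    """Return the action with the highest utility (ties go to the first)."""
    return max(actions, key=lambda a: utility(history, a))


def twitchy_policy(history: History) -> Action:
    """A deliberately pointless hypothetical policy: alternate left and right."""
    return "left" if len(history) % 2 == 0 else "right"


ACTIONS = ["left", "right", "wait"]
u = rationalising_utility(twitchy_policy)
histories = [(), ("obs0",), ("obs0", "obs1")]

# The "utility maximiser" reproduces the pointless policy on every history.
agrees = all(maximise(u, h, ACTIONS) == twitchy_policy(h) for h in histories)
```

Since the construction works for any policy whatsoever, knowing only that a system "maximises some utility function" rules out no behaviour, which is why the formalism alone does not constrain expectations.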

Additionally, neither framework accounts for bounded rationality: the idea that systems can be “trying to” achieve a goal without taking the best actions to do so. In order to figure out the goals of boundedly rational systems, we’ll need to scrutinise the structure of their cognition, rather than treating them as black-box functions from inputs to outputs - in other words, using a “cognitive” definition of agency rather than “behavioural” definitions like the two I’ve discussed so far. Hubinger et al. use a cognitive definition in their paper on Risks from Learned Optimisation in Advanced ML systems: “a system is an optimizer if it is internally searching through a search space (consisting of possible outputs, policies, plans, strategies, or similar) looking for those elements that score high according to some objective function that is explicitly represented within the system”. I think this is a promising start, but it has some significant problems. In particular, the concept of “explicit representation” seems like a tricky one - what is explicitly represented within a human brain, if anything? And their definition doesn’t draw the important distinction between “local” optimisers such as gradient descent and goal-directed planners such as humans.
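As a rough illustration of that last point, the Python sketch below shows two toy systems that both search a space using an explicitly represented objective, and so both plausibly count as optimisers under the quoted definition, yet behave very differently. The one-dimensional world, the `PAYOFF` values and the function names are all assumptions made up for this example.

```python
from itertools import product

# Hypothetical 1-D world: positions 0..9, each with a payoff.
PAYOFF = [0, 2, 3, 1, 0, 0, 4, 9, 4, 0]
MOVES = (-1, 0, +1)


def objective(position: int) -> int:
    """Explicitly represented objective: the payoff of a position."""
    return PAYOFF[position]


def step(position: int, move: int) -> int:
    """Apply a move, clamping to the ends of the world."""
    return min(max(position + move, 0), len(PAYOFF) - 1)


def local_search(start: int) -> int:
    """Local optimiser: keep moving to the best neighbour until none is better."""
    position = start
    while True:
        best = max((step(position, m) for m in MOVES), key=objective)
        if objective(best) <= objective(position):
            return position
        position = best


def plan_search(start: int, horizon: int) -> int:
    """Planner: score every move sequence of length `horizon` by where it ends."""
    def final_position(plan):
        position = start
        for move in plan:
            position = step(position, move)
        return position

    best_plan = max(product(MOVES, repeat=horizon),
                    key=lambda plan: objective(final_position(plan)))
    return final_position(best_plan)


local_result = local_search(start=0)               # stops at the nearby peak, position 2
planned_result = plan_search(start=0, horizon=7)   # crosses the valley to position 7
```

The local searcher never accepts a worse intermediate position, while the planner evaluates whole sequences of behaviour by their eventual outcome. A definition that only asks for internal search over an explicit objective treats the two alike.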

My own approach to thinking about agency tries to improve on the approaches above by being more specific about the cognition we expect goal-directed systems to do. Just as “being intelligent” involves applying a range of abilities (as discussed in the previous section), “being goal-directed” involves a system applying some specific additional abilities:

Self-awareness: it understands that it’s a part of the world, and that its behaviour impacts the world;
Planning: it considers a wide range of possible sequences of behaviours (let’s call them “plans”), including long plans;
Consequentialism: it decides which of those plans is best by considering the value of the outcomes that they produce;
Scale: its choice is sensitive to the effects of plans over large distances and long time horizons;
Coherence: it is internally unified towards implementing the single plan it judges to be best;
Flexibility: it is able to adapt its plans flexibly as circumstances change, rather than just continuing the same patterns of behaviour.

Note that none of these traits should be interpreted as binary; rather, each one defines a different spectrum of possibilities. I’m also not claiming that the combination of these six dimensions is a precise or complete characterisation of agency; merely that it’s a good starting point, and the right type of way to analyse agency. For instance, it highlights that agency requires a combination of different abilities - and as a corollary, that there are many different ways to be less than maximally agentic. AIs which score very highly on some of these dimensions might score very low on others. Considering each trait in turn, and what lacking it might look like:

Self-awareness: for humans, intelligence seems intrinsically linked to a first-person perspective. But an AGI trained on abstract third-person data might develop a highly sophisticated world-model that just doesn’t include itself or its outputs. A sufficiently advanced language or physics model might fit into this category.
Planning: highly intelligent agents will by default be able to make extensive and sophisticated plans. But in practice, like humans, they may not always apply this ability. Perhaps, for instance, an agent is only trained to consider restricted types of plans. Myopic training attempts to implement such agents; more generally, an agent could have limits on the actions it considers. For example, a question-answering system might only consider plans of the form “first figure out subproblem 1, then figure out subproblem 2, then...”.
Consequentialism: the usual use of this term in philosophy describes agents which believe that the moral value of their actions depends only on those actions’ consequences; here I’m using it in a more general way, to describe agents whose subjective preferences about actions depend mainly on those actions’ consequences. It seems natural to expect that agents trained on a reward function determined by the state of the world would be consequentialists. But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.
Scale: agents which only care about small-scale events may ignore the long-term effects of their actions. Since agents are always trained in small-scale environments, developing large-scale goals requires generalisation (in ways that I discuss below).
Coherence: humans lack this trait when we’re internally conflicted - for example, when our system 1 and system 2 goals differ - or when our goals change a lot over time. While our internal conflicts might just be an artefact of our evolutionary history, we can’t rule out individual AGIs developing modularity which might lead to comparable problems. However, it’s most natural to think of this trait in the context of a collective, where the individual members could have more or less similar goals, and could be coordinated to a greater or lesser extent.
Flexibility: an inflexible agent might arise in an environment in which coming up with one initial plan is usually sufficient, or else where there are tradeoffs between making plans and executing them. Such an agent might display sphexish behaviour. Another interesting example might be a multi-agent system in which many AIs contribute to developing plans - such that a single agent is able to execute a given plan, but not able to rethink it very well.

A question-answering system (aka an oracle) could be implemented by an agent lacking either planning or consequentialism. For AIs which act in the real world I think the scale of their goals is a crucial trait to explore, and I’ll do so later in this section. We can also evaluate other systems on these criteria. A calculator probably doesn’t have any of them. Software that’s a little more complicated, like a GPS navigator, should probably be considered consequentialist to a limited extent (because it reroutes people based on how congested traffic is), and perhaps has some of the other traits too, but only slightly. Most animals are self-aware, consequentialist and coherent to various degrees. The traditional conception of AGI has all of the traits above, which would make it capable of pursuing influence-seeking strategies for instrumental reasons. However, note that goal-directedness is not the only factor which determines whether an AI is influence-seeking: the content of its goals also matters. A highly agentic AI which has the goal of remaining subordinate to humans might never take influence-seeking actions. And as previously mentioned, an AI might be influence-seeking because it has the final goal of gaining power, even without possessing many of the traits above. I’ll discuss ways to influence the content of an agent’s goals in the next section, on Alignment.

The likelihood of developing highly agentic AGI

How likely is it that, in developing an AGI, we produce a system with all of the six traits I identified above? One approach to answering this question involves predicting which types of model architecture and learning algorithms will be used - for example, will they be model-free or model-based? To my mind, this line of thinking is not abstract enough, because we simply don’t know enough about how cognition and learning work to map them onto high-level design choices. If we train AGI in a model-free way, I predict it will end up planning using an implicit model anyway. If we train a model-based AGI, I predict its model will be so abstract and hierarchical that looking at its architecture will tell us very little about the actual cognition going on.

At a higher level of abstraction, I think that it’ll be easier for AIs to acquire the components of agency listed above if they’re also very intelligent. However, the extent to which our most advanced AIs are agentic will depend on what type of training regime is used to produce them. For example, our best language models already generalise well enough from their training data that they can answer a wide range of questions. I can imagine them becoming more and more competent via unsupervised and supervised training, until they are able to answer questions which no human knows the answer to, but still without possessing any of the properties listed above. A relevant analogy might be to the human visual system, which does very useful cognition, but which is not very “goal-directed” in its own right.

My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it. One piece of supporting evidence is Moravec’s paradox: the observation that the cognitive skills which seem most complex to humans are often the easiest for AIs, and vice versa. In particular, Moravec’s paradox predicts that building AIs which do complex intellectual work like scientific research might actually be easier than building AIs which share more deeply ingrained features of human cognition like goals and desires. To us, understanding the world and changing the world seem very closely linked, because our ancestors were selected for their ability to act in the world to improve their situations. But if this intuition is flawed, then even reinforcement learners may not develop all the aspects of goal-directedness described above if they’re primarily trained to answer questions.

However, there are also arguments that it will be difficult to train AIs to do intellectual work without them also developing goal-directed agency. In the case of humans, it was the need to interact with an open-ended environment to achieve our goals that pushed us to develop our sophisticated general intelligence. The central example of an analogous approach to AGI is training a reinforcement learning agent in a complex simulated 3D environment (or perhaps via extended conversations in a language-only setting). In such environments, agents which strategise about the effects of their actions over long time horizons will generally be able to do better. This implies that our AIs will be subject to optimisation pressure towards becoming more agentic (by my criteria above). We might expect an AGI to be even more agentic if it’s trained, not just in a complex environment, but in a complex competitive multi-agent environment. Agents trained in this way will need to be very good at flexibly adapting plans in the face of adversarial behaviour; and they’ll benefit from considering a wider range of plans over a longer timescale than any competitor. On the other hand, it seems very difficult to predict the overall effect of interactions between many agents - in humans, for example, it led to the development of (sometimes non-consequentialist) altruism.

It’s currently very uncertain which training regimes will work best to produce AGIs. But if there are several viable ones, we should expect economic pressures to push researchers towards prioritising those which produce the most agentic AIs, because they will be the most useful (assuming that alignment problems don’t become serious until we’re close to AGI). In general, the broader the task an AI is used for, the more valuable it is for that AI to reason about how to achieve its assigned goal in ways that we may not have specifically trained it to do. For example, a question-answering system with the goal of helping its users understand the world might be much more useful than one that’s competent at its design objective of answering questions accurately, but isn’t goal-directed in its own right. Overall, however, I think most safety researchers would argue that we should prioritise research directions which produce less agentic AGIs, and then use the resulting AGIs to help us align later more agentic AGIs. There’s also been some work on directly making AGIs less agentic (such as quantilisation), although this has in general been held back by a lack of clarity around these concepts.
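For readers unfamiliar with quantilisation, the following is a minimal Monte Carlo sketch in Python of the basic idea as I understand it: rather than taking the single highest-utility action, draw actions from some trusted base distribution and pick at random from the top fraction `q` of them. The function name `quantilise`, its parameters and the toy base distribution are assumptions for illustration, not an implementation from the literature.

```python
import math
import random
from typing import Callable, TypeVar

A = TypeVar("A")


def quantilise(base_sampler: Callable[[random.Random], A],
               utility: Callable[[A], float],
               q: float,
               n_samples: int,
               rng: random.Random) -> A:
    """Approximate q-quantiliser.

    Draw `n_samples` actions from the base distribution, keep the top
    fraction `q` ranked by utility, and choose uniformly among those.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    candidates = [base_sampler(rng) for _ in range(n_samples)]
    candidates.sort(key=utility, reverse=True)
    top_k = max(1, math.ceil(q * n_samples))
    return rng.choice(candidates[:top_k])


# Toy usage: the base distribution is a uniform "effort level" from 0 to 100,
# and utility is simply the effort level itself (both hypothetical).
rng = random.Random(0)
choice = quantilise(lambda r: r.randint(0, 100), utility=float,
                    q=0.1, n_samples=1000, rng=rng)
```

With `q` close to 1 the system just imitates the base distribution; as `q` shrinks towards 0 it approaches a pure maximiser. The parameter therefore acts as a dial on how hard the system optimises, which is one way of making it less agentic in the sense discussed here.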

I’ve already discussed recursive improvement in the previous section, but one further point which is useful to highlight here: since being more agentic makes an agent more capable of achieving its goals, agents which are capable of modifying themselves will have incentives to make themselves more agentic too (as humans already try to do, albeit in limited ways).^[3] So we should consider this to be one type of recursive improvement, to which many of the considerations discussed in the previous section also apply.

Goals as generalised concepts

I should note that I don’t expect our training tasks to replicate the scale or duration of all the tasks we care about in the real world. So AGIs won’t be directly selected to have very large-scale or long-term goals. Yet it’s likely that the goals they learn in their training environments will generalise to larger scales, just as humans developed large-scale goals from evolving in a relatively limited ancestral environment. In modern society, people often spend their whole lives trying to significantly influence the entire world - via science, business, politics, and many other channels. And some people aspire to have a worldwide impact over the timeframe of centuries, millennia or longer, even though there was never significant evolutionary selection in favour of humans who cared about what happened in several centuries’ time, or paid attention to events on the other side of the world. This gives us reason to be concerned that even AGIs which aren’t explicitly trained to pursue ambitious large-scale goals might do so anyway. I also expect researchers to actively aim towards achieving this type of generalisation to longer time horizons in AIs, because some important applications rely on it. For long-term tasks like being a CEO, AGIs will need the capability and motivation to choose between possible actions based on their worldwide consequences on the timeframe of years or decades.

Can we be more specific about what it looks like for goals to generalise to much larger scales? Given the problems with the expected utility maximisation framework I identified earlier, it doesn’t seem useful to think of goals as utility functions over states of the world. Rather, an agent’s goals can be formulated in terms of whatever concepts it possesses - regardless of whether those concepts refer to its own thought processes, deontological rules, or outcomes in the external world.^[4] And insofar as an agent’s concepts flexibly adjust and generalise to new circumstances, the goals which refer to them will do the same. It’s difficult and speculative to try to describe how such generalisation may occur, but broadly speaking, we should expect that intelligent agents are able to abstract away the differences between objects or situations that have high-level similarities. For example, after being trained in a simulation, an agent might transfer its attitudes towards objects and situations in the simulation to their counterparts in the (much larger) real world.^[5] Alternatively, the generalisation could be in the framing of the goal: an agent which has always been rewarded for accumulating resources in its training environment might internalise the goal of “amassing as many resources as possible”. Similarly, agents which are trained adversarially in a small-scale domain might develop a goal of outcompeting each other which persists even when they’re both operating at a very large scale.

From this perspective, to predict an agent’s behaviour, we will need to consider what concepts it will possess, how those will generalise, and how the agent will reason about them. I’m aware that this appears to be an intractably difficult task - even human-level reasoning can lead to extreme and unpredictable conclusions (as the history of philosophy shows). However, I hope that we can instill lower-level mindsets or values into AGIs which guide their high-level reasoning in safe directions. I’ll discuss some approaches to doing so in the next section, on Alignment.

Groups and agency

After discussing collective AGIs in the previous section, it seems important to examine whether the framework I’ve proposed for understanding agency can apply to a group of agents as well. I think it can: there’s no reason that the traits I described above need to be instantiated within a single neural network. However, the relationship between the goal-directedness of a collective AGI and the goal-directedness of its individual members may not be straightforward, since it depends on the internal interactions between its members.

One of the key variables is how much (and what types of) experience those members have of interacting with each other during training. If they have been trained primarily to cooperate, that makes it more likely that the resulting collective AGI is a goal-directed agent, even if none of the individual members is highly agentic. But there are good reasons to expect that the training process will involve some competition between members, which would undermine their coherence as a group. Internal competition might also increase short-term influence-seeking behaviour, since each member will have learned to pursue influence in order to outcompete the others. As a particularly salient example, humanity managed to take over the world over a period of millennia not via a unified plan to do so, but rather as a result of many individuals trying to expand their short-term influence.

It’s also possible that the members of a collective AGI have not been trained to interact with each other at all, in which case cooperation between them would depend entirely on their ability to generalise from their existing skills. It’s difficult to imagine this case, because human brains are so well-adapted for group interactions. But insofar as humans and aligned AGIs hold a disproportionate share of power over the world, there is a natural incentive for AGIs pursuing misaligned goals to coordinate with each other to increase their influence at our expense.^[6] Whether they succeed in doing so will depend on what sort of coordination mechanisms they are able to design.

A second factor is how much specialisation there is within the collective AGI. In the case where it consists only of copies of the same agent, we should expect that the copies understand each other very well, and share goals to a large extent. If so, we might be able to make predictions about the goal-directedness of the entire group merely by examining the original agent. But another case worth considering is a collective consisting of a range of agents with different skills. With this type of specialisation, the collective as a whole could be much more agentic than any individual agent within it, which might make it easier to deploy subsets of the collective safely.

[1] AI systems which learn to pursue goals are also known as mesa-optimisers, as coined in Hubinger et al’s paper Risks from Learned Optimisation in Advanced Machine Learning Systems.
[2] Related arguments exist which attempt to do so. For example, Eliezer Yudkowsky argues here that, “while corrigibility probably has a core which is of lower algorithmic complexity than all of human value, this core is liable to be very hard to find or reproduce by supervised learning of human-labeled data, because deference is an unusually anti-natural shape for cognition, in a way that a simple utility function would not be an anti-natural shape for cognition.” Note, however, that this argument relies on the intuitive distinction between natural and anti-natural shapes for cognition. This is precisely what I think we need to understand to build safe AGI - but there has been little explicit investigation of it so far.
[3] I believe this idea comes from Anna Salamon; unfortunately I’ve been unable to track down the exact source.
[4] For example, when people want to be “cooperative” or “moral”, they’re often not just thinking about results, but rather the types of actions they should take, or the types of decision procedures they should use to generate those actions. An additional complication is that humans don’t have full introspective access to all our concepts - so we need to also consider unconscious concepts.
[5] Consider if this happened to you, and you were pulled “out of the simulation” into a real world which is quite similar to what you’d already experienced. By default you would likely still want to eat good food, have fulfilling relationships, and so on, despite the radical ontological shift you just underwent.
[6] In addition to the prima facie argument that intelligence increases coordination ability, it is likely that AGIs will have access to commitment devices not available to humans by virtue of being digital. For example, they could send potential allies a copy of themselves for inspection, to increase confidence in their trustworthiness. However, there are also human commitment devices that AGIs will have less access to - for example, putting ourselves in physical danger as an honest signal. And it’s possible that the relative difficulty of lying versus detecting lying shifts in favour of the former for more intelligent agents.

But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.

I think the ways in which humans are not fully consequentialist are much broader - we often do things because of habit, instinct, because doing that thing feels rewarding itself, because we're imitating someone else, etc.

Probably because humans are not always doing optimization? That does raise an interesting question: is satisfying the first two criteria (which basically make you an optimizer) a necessary condition for satisfying the third one?

Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals. We can imagine them possessing final goals which don’t incentivise the pursuit of power, such as deontological goals, or small-scale goals.
...
My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it

Was this line of argument inspired by Ben Garfinkel's objection to the 'classic' formulation of instrumental convergence/orthogonality - that these are 'measure based' arguments that just identify that a majority of possible agents with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we're actually likely to build such agents?

It seems like you're identifying the same additional step that Ben identified, and that I argued could be satisfied - that we need a plausible reason why we would build an agentive AI with large-scale goals.

And the same applies for 'instrumental convergence' - the observation that most possible goals, especially simple goals, imply a tendency to produce extreme outcomes when ruthlessly maximised:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
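A toy Python sketch of the effect described in this quote, using made-up numbers: three variables share a fixed budget, the objective mentions only two of them, and a brute-force optimiser drives the unmentioned one to its extreme. The variable names and the `TOTAL_LAND` budget are assumptions chosen purely for illustration.

```python
from itertools import product

TOTAL_LAND = 10  # hypothetical shared resource, split between three uses


def objective(factories: int, farms: int, forest: int) -> int:
    """The objective depends on only k=2 of the n=3 variables; forest is ignored."""
    return 3 * factories + 2 * farms


def feasible(factories: int, farms: int, forest: int) -> bool:
    """All of the land must be allocated to exactly one of the three uses."""
    return factories + farms + forest == TOTAL_LAND


# Brute-force search over every feasible integer allocation.
allocations = [a for a in product(range(TOTAL_LAND + 1), repeat=3) if feasible(*a)]
best = max(allocations, key=lambda a: objective(*a))
# best == (10, 0, 0): all land goes to factories and the forest is driven to zero.
```

Nothing in the objective says the forest should vanish; it disappears only because every unit of land left as forest is a unit not spent on something the objective counts.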

We could see this as marking out a potential danger - a large number of possible mind-designs produce very bad outcomes if implemented. The fact that such designs exist 'weakly suggests' (Ben's words) that AGI poses an existential risk since we might build them. If we add in other premises that imply we are likely to (accidentally or deliberately) build such systems, the argument becomes stronger. But usually the classic arguments simply note instrumental convergence and assume we're 'shooting into the dark' in the space of all possible minds, because they take the abstract statement about possible minds to be speaking directly about the physical world. There are specific reasons to think this might occur (e.g. mesa-optimisation, sufficiently fast progress preventing us from course-correcting if there is even a small initial divergence) but those are the reasons that combine with instrumental convergence to produce a concrete risk, and have to be argued for separately.

I do like the link you've drawn between this argument and Ben's one. I don't think I was inspired by it, though; rather, I was mostly looking for a better definition of agency, so that we could describe what it might look like to have highly agentic agents without large-scale goals.

With kind apologies, this section seems surprisingly lacking in the essential natural philosophy and emotional factors contributing to the agency & goals of biological organisms, which are the only entities we're aware of thus far who've developed agency, and therefore whose traits must be inextricable from even abstract conversation.

(Such as the evolved brain/body connections of emotion/pain/pleasure that feeling creatures have, like a sense of warmth and happiness from positive community and group engagements, compassion for others, etc).

Especially in how we develop Consequentialism, Scale, & Planning as relates to our self-preservation instincts, and the connection thereof to an innate understanding of how deeply we depend on our ecosystems and the health of everything else, for our own well-being.

(It seems safe to predict that as such, biological ecosystems with feeling agents are the only ones which could mutually self-sustain by default on Earth, as opposed to simply using everything up and grinding the whole thing to a stop.

That, or subsuming the current biology and replacing it with something else entirely, which is techno-integrative, but still obviating of us.

Especially if powerful-enough free agents did not feel a concern for self-preservation via their mutual inter-dependency on all other living things, nor a deep appreciation for life and its organisms for its own sake, nor at least any simulation of pain & pleasure in response to positive and negative impacts thereon).

Merely defining agency as your six factors without any emotional component whatsoever,

and goals as mere endpoints devoid of any alignments with natural philosophy,

is a very hollow, superficial, and fragile approach not just predicated on oversights (the omissions of which can have very harmful repercussions),

but in terms of safety assessment, negligent of the fact that it may even be an advantage for AGI in subsuming us on autopilot, to not develop agency to the extent you've defined it here.

Lastly of course, in assessing safety, it also appears you've omitted the eventuality of intentionally malevolent human actors.

Some key assumptions and omissions here, very respectfully.

I notice I am surprised you write

However, the link from instrumentally convergent goals to dangerous influence-seeking is only applicable to agents which have final goals large-scale enough to benefit from these instrumental goals

and do not address the "Riemann disaster" or "Paperclip maximizer" examples [1]

Riemann hypothesis catastrophe. An AI, given the final goal of evaluating the Riemann hypothesis, pursues this goal by transforming the Solar System into “computronium” (physical resources arranged in a way that is optimized for computation)— including the atoms in the bodies of whomever once cared about the answer.
Paperclip AI. An AI, designed to manage production in a factory, is given the final goal of maximizing the manufacture of paperclips, and proceeds by converting first the Earth and then increasingly large chunks of the observable universe into paperclips.

Do you think that the argument motivating these examples is invalid?

Do you disagree with the claim that even systems with very modest and specific goals will have incentives to seek influence to perform their tasks better?

Do you think that the argument motivating these examples is invalid?

Yes, because it skips over the most important part: what it means to "give an AI a goal". For example, perhaps we give the AI positive reward every time it solves a maths problem, but it never has a chance to seize more resources during training - all it's able to do is think about them. Have we "given it" the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them? The former I'd call large-scale, the latter I wouldn't.

I think I'll concede that "large-scale" is maybe a bad word for the concept I'm trying to point to, because it's not just a property of the goal, it's a property of how the agent thinks about the goal too. But the idea I want to put forward is something like: if I have the goal of putting a cup on a table, there's a lot of implicit context around which table I'm thinking about, which cup I'm thinking about, and what types of actions I'm thinking about. If for some reason I need to solve world peace in order to put the cup on the table, I won't adopt solving world peace as an instrumental goal, I'll just shrug and say "never mind then, I've hit a crazy edge case". I don't think that's because I have safe values. Rather, this is just how thinking works - concepts are contextual, and it's clear when the context has dramatically shifted.

So I guess I'm kind of thinking of large-scale goals as goals that have a mental "ignore context" tag attached. And these are certainly possible, some humans have them. But it's also possible to have exactly the same goal, but only defined within "reasonable" boundaries - and given the techniques we'll be using to train AGIs, I'm pretty uncertain which one will happen by default. Seems like, when we're talking about tasks like "manage this specific factory" or "solve this specific maths problem", the latter is more natural.

Let me try to paraphrase this:

In the first paragraph you are saying that "seeking influence" is not something that a system will learn to do if that was not a possible strategy in the training regime. (but couldn't it appear as an emergent property? Certainly humans were not trained to launch rockets - but they nevertheless did?)

In the second paragraph you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn't they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)

In the third paragraph it seems to me that you are saying that humans have some goals that have a built-in override mechanism in them - e.g. in general humans have a goal of eating delicious cake, but they will forego this goal in the interest of seeking water if they are about to die of dehydration (but doesn't this seem to be a consequence of these goals being just instrumental things that proxy the complex thing that humans actually care about?)

I think I am confused because I do not understand your overall point, so the three paragraphs seem to be saying wildly different things to me.

Hey, thanks for the questions! It's a very confusing topic so I definitely don't have a fully coherent picture of it myself. But my best attempt at a coherent overall point:

In the first paragraph you are saying that "seeking influence" is not something that a system will learn to do if that was not a possible strategy in the training regime.

No, I'm saying that giving an agent a goal, in the context of modern machine learning, involves reinforcement in the training regime. It's not clear to me exactly what goals will result from this, but we can't just assume that we can "give an AI the final goal of evaluating the Riemann hypothesis" in a way that's devoid of all context.

you are saying that common sense sometimes allows you to modify the goals you were given (but for this to apply to AI systems, wouldn't they need to have common sense in the first place, which kind of assumes that the AI is already aligned?)

It may be the case that it's very hard to train AIs without common sense of some kind, potentially because a) that's just the default for how minds work, they don't by default extrapolate to crazy edge cases. And b) common sense is very useful in general. For example, if you train AIs on obeying human instructions, then they will only do well in the training environment if they have a common-sense understanding of what humans mean.

humans have some goals that have a built-in override mechanism in them

No, it's more that the goal itself is only defined in a small-scale setting, because the agent doesn't think in ways which naturally extrapolate small-scale goals to large scales.

Perhaps it's useful to think about having the goal of getting a coffee. And suppose there is some unusual action you can take to increase the chances that you get the coffee by 1%. For example, you could order ten coffees instead of one coffee, to make sure at least one of them arrives. There are at least two reasons you might not take this unusual action. In some cases it goes against your values - for example, if you want to save money. But even if that's not true, you might just not think about what you're doing as "ensure that I have coffee with maximum probability", but rather just "get a coffee". This goal is not high-stakes enough for you to actually extrapolate beyond the standard context. And then some people are just like that with all their goals - so why couldn't an AI be too?
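The arithmetic behind the coffee example can be sketched in a few lines. This is only an illustration under assumptions that are not in the comment above: each order is assumed to arrive independently with the same probability, and the names `p_at_least_one` and `maximiser_choice` are hypothetical. It shows why an agent that frames the task as "maximise the probability of coffee" is pushed towards the unusual action, while an agent that frames it as "get a coffee" simply orders one.

```python
def p_at_least_one(p_single: float, n_orders: int) -> float:
    """Probability that at least one of n independent orders arrives.

    Assumes (for illustration only) that orders arrive independently,
    each with probability p_single.
    """
    return 1.0 - (1.0 - p_single) ** n_orders


def maximiser_choice(p_single: float, max_orders: int) -> int:
    """Number of orders chosen by an agent that maximises arrival probability."""
    return max(range(1, max_orders + 1),
               key=lambda n: p_at_least_one(p_single, n))


# Hypothetical numbers: a single order arrives 90% of the time.
P_SINGLE = 0.9
one_order = p_at_least_one(P_SINGLE, 1)
two_orders = p_at_least_one(P_SINGLE, 2)
maximiser_orders = maximiser_choice(P_SINGLE, 3)  # always orders as many as allowed
contextual_orders = 1  # "get a coffee": the standard context is never left
```

Under these assumptions every extra order raises the probability slightly, so the maximising framing never stops at one order; the contextual framing does not even evaluate the alternatives.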

I think this helped me understand you a lot better - thank you

Let me try paraphrasing this:

> Humans are our best example of a sort-of-general intelligence. And humans have a lazy, satisficing, 'small-scale' kind of reasoning that is mostly only well suited for activities close to their 'training regime'. Hence AGIs may also be the same - and in particular, if AGIs are trained with Reinforcement Learning and heavily rewarded for following human intentions, this may be a likely outcome.

Is that pointing in the direction you intended?

Have we "given it" the goal of solving maths problems by any means possible, or the goal of solving maths problems by thinking about them?

The distinction that you're pointing at is useful. But I would have filed it under "difference in the degree of agency", not under "difference in goals". When reading the main text, I thought this to be the reason why you introduced the six criteria of agency.

E.g., System A tries to prove the Riemann hypothesis by thinking about the proof. System B first seizes power and converts the galaxy into a supercomputer, to then prove the Riemann hypothesis. Both systems maybe have the goal of "proving the Riemann hypothesis", but System B has "more agency": it certainly has self-awareness, considers more sophisticated and diverse plans of larger scale, and so on.

1. AI systems which pursue goals are also known as mesa-optimisers, as coined in Hubinger et al’s paper Risks from Learned Optimisation in Advanced Machine Learning Systems.

Nitpicky, but I think it would be nice to write explicitly that here the AI systems are learned, because the standard definition of mesa-optimizers is of optimized optimizers. Also, I think it would be better to explicitly say that mesa-optimizers are optimizers. Given your criteria of goal-directed agency, that's implicit, but at this point the criteria are not yet stated.

Meanwhile, Dennett argues that taking the intentional stance towards systems can be useful for making predictions about them - but this only works given prior knowledge about what goals they’re most likely to have. Predicting the behaviour of a trillion-parameter neural network is very different from applying the intentional stance to existing artifacts. And while we do have an intuitive understanding of complex human goals and how they translate to behaviour, the extent to which it’s reasonable to extend those beliefs about goal-directed cognition to artificial intelligences is the very question we need a theory of agency to answer. So while Dennett’s framework provides some valuable insights - in particular, that assigning agency to a system is a modelling choice which only applies at certain levels of abstraction - I think it fails to reduce agency to simpler and more tractable concepts.

I agree with you that the intentional stance requires some assumption about the goals of the system you’re applying it to. But I disagree that this makes it very hard to apply the intentional stance to, let’s say, neural networks. That’s because I think that goals have some special structure (being compressed, for example), which means that there aren’t that many different goals. So the intentional stance does reduce goal-directedness to simpler concepts like goals, and gives additional intuitions on them.

That being said, I also have issues with the intentional stance. Most problematic is the fact that it doesn’t give you a way to compute the goal-directedness of a system.

About your criteria, I have a couple of questions/observations.

Combining 1, 2 and 3 seems to yield an optimizer in disguise: something that plans according to some utility/objective, in an embedded way. The change from mesa-optimizers (or simply optimizers) is that you treat the ingredients of optimization separately, but it still has the same problem of needing an objective it can use (for point 3).
About 4, I think I see where you’re aiming at (having long-term goals), but I’m confused by the way it is written. It depends on the objective/utility from 3, but it’s not clear what sensitive means for an objective. Do you mean that the objective values more long-term plans? That it doesn’t discount with length of plans? Or instead something more like the expanding moral circle, where the AI has an objective that treats equally near-future and far-future, and near and far things?
Also about 5, coherent goals (in the sense of goals that don’t change) is a very dangerous case, but I’m not convinced that goal-directed agents must have one goal forever.
I agree completely about 6. It’s very close to the distinction between habitual behavior and goal-directed behavior in psychology.

On the examples of lacking 2, I feel like the ones you’re giving could be goal-directed. For example limiting the actions or context doesn’t necessarily ensure the lack of goal-directedness, it is more about making a deceptive plan harder to pull off.

Your definition of goals looks like a more constrained utility function, defined on equivalence classes of states/outcomes as abstracted by the agent’s internal concepts. Is it correct? If so, do you have an idea of what specific properties such utility functions could have as a consequence? I'm interested in that, because I would really like a way to define a goal as a behavioral objective satisfying some structural constraints.

Thanks for the comments!

I think it would be nice to write explicitly that here the AI systems are learned

Good point, fixed.

coherent goals (in the sense of goals that don’t change) is a very dangerous case, but I’m not convinced that goal-directed agents must have one goal forever.

We should categorise things as goal-directed agents if they score highly on most of these criteria, not just if they score perfectly on all of them. So I agree that you don't need one goal forever, but you do need it for more than a few minutes. And internal unification also means that the whole system is working towards this.

examples of lacking 2, I feel like the ones you’re giving could be goal-directed

Same here: lacking this doesn't guarantee a lack of goal-directedness, but it's one contributing factor. As another example, we might say that humans often plan in a restricted way: only do things that you've seen other people do before. And this definitely makes us less goal-directed.

it’s not clear what sensitive means for an objective. Do you mean that the objective values more long-term plans? That it doesn’t discount with length of plans? Or instead something more like the expanding moral circle, where the AI has an objective that treats equally near-future and far-future, and near and far things?

By "sensitive" I merely mean that differences in expected long-term or large-scale outcomes sometimes lead to differences in current choices.

The change with mesa-optimizers (or simply optimizers) is that you treat separately the ingredients of optimization, but it still has the same problem of needing an objective it can use

Yeah, I think there's still much more to be done to make this clearer. I guess my criticism of mesa-optimisers was that they talked about explicit representation of the objective function (whatever that means). Whereas I think my definition relies more on the values of choices being represented. Idk how much of an improvement this is.

Your definition of goals looks like a more constrained utility function, defined on equivalence classes of states/outcomes as abstracted by the agent’s internal concepts. Is it correct?

I don't really know what it means for something to be a utility function. I assume you could interpret it that way, but my definition of goals also includes deontological goals, which would make that interpretation harder. I like the "equivalence classes" thing more, but I'm not confident enough about the space of all possible internal concepts to claim that it's always a good fit.

do you have an idea of what specific properties such utility functions could have as a consequence

I expect that asking "what properties do these utility functions have" will be generally more misleading than asking "what properties do these goals have", because the former gives you an illusion of mathematical transparency. My tentative answer to the latter question is that, due to Moravec's paradox, they will have the properties of high-level human thought more than they have the properties of low-level human thought. But I'm still pretty confused about this.

Thanks for the answers!

We should categorise things as goal-directed agents if they score highly on most of these criteria, not just if they score perfectly on all of them. So I agree that you don't need one goal forever, but you do need it for more than a few minutes. And internal unification also means that the whole system is working towards this.

If coherence is about having the same goal for a "long enough" period of time, then it makes sense to me.

By "sensitive" I merely mean that differences in expected long-term or large-scale outcomes sometimes lead to differences in current choices.

So the thing that judges outcomes in the goal-directed agent is "not always privileging short-term outcomes"? Then I guess it's also a scale, because there's a big difference between a system that has one case where it privileges long-term outcomes over short-term ones, and a system that focuses on long-term outcomes.

Yeah, I think there's still much more to be done to make this clearer. I guess my criticism of mesa-optimisers was that they talked about explicit representation of the objective function (whatever that means). Whereas I think my definition relies more on the values of choices being represented. Idk how much of an improvement this is.

I agree that the explicit representation of the objective is weird. But on the other hand, it's an explicit and obvious weirdness, one that calls for either clarification or changes. Whereas in your criteria, I feel that essentially the same idea is made implicit/less weird, without actually bringing a better solution. Your approach might be better in the long run, possibly because rephrasing the question in these terms lets us find a non-weird way to define this objective.

I just wanted to point out that in our current state of knowledge, I feel like there are drawbacks in "hiding" the weirdness like you do.

I don't really know what it means for something to be a utility function. I assume you could interpret it that way, but my definition of goals also includes deontological goals, which would make that interpretation harder. I like the "equivalence classes" thing more, but I'm not confident enough about the space of all possible internal concepts to claim that it's always a good fit.

One idea I had for defining goals is as a temporal logic property (for example in LTL) on states. That lets you express things like "I want to reach one of these states" or "I never want to reach this state"; the latter looks like a deontological property to me. Thinking some more about this led me to see two issues:

First, it doesn't let you encode preferences of some state over another. That might be solvable by adding a partial order with nice properties, like Stuart Armstrong's partial preferences.
Second, the system doesn't have access to the states of the world, it has access to its abstractions of those states. Here we go back to the equivalence classes idea. Maybe a way to cash in your internal abstractions and Paul's ascriptions of beliefs is through an equivalence relation on the states of the world, such that the goal of the system is defined on the equivalence classes for this relation.
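A minimal sketch of this idea, under stated assumptions: LTL is normally interpreted over infinite traces, so the finite-trace reading below is a simplification, and the names `eventually`, `never` and `cup_abstraction`, as well as the cup-and-table states, are hypothetical. The goal predicates only ever see the equivalence class of a state (the output of the abstraction function), not the world state itself.

```python
from typing import Callable, Hashable, Sequence, Set, Tuple

State = Tuple[str, str]                 # hypothetical world state: (room, cup position)
Abstraction = Callable[[State], Hashable]


def eventually(trace: Sequence[State], targets: Set[Hashable],
               abstract: Abstraction) -> bool:
    """Finite-trace reading of 'F targets': some visited state falls in a target class."""
    return any(abstract(state) in targets for state in trace)


def never(trace: Sequence[State], forbidden: Set[Hashable],
          abstract: Abstraction) -> bool:
    """Finite-trace reading of 'G not forbidden': no visited state falls in a forbidden class."""
    return all(abstract(state) not in forbidden for state in trace)


def cup_abstraction(state: State) -> Hashable:
    """Hypothetical equivalence relation: two states are equivalent if the cup is in the same place."""
    _room, cup_position = state
    return cup_position


trace = [("kitchen", "shelf"), ("hall", "hand"), ("lounge", "table")]

# "I want to reach one of these states": the cup ends up on the table.
reaches_table = eventually(trace, {"table"}, cup_abstraction)
# "I never want to reach this state": the cup is never on the floor.
avoids_floor = never(trace, {"floor"}, cup_abstraction)
```

Note that this sketch inherits the first issue raised above: both predicates are boolean, so they cannot say that one satisfying trace is preferred to another.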

I expect that asking "what properties do these utility functions have" will be generally more misleading than asking "what properties do these goals have", because the former gives you an illusion of mathematical transparency. My tentative answer to the latter question is that, due to Moravec's paradox, they will have the properties of high-level human thought more than they have the properties of low-level human thought. But I'm still pretty confused about this.

Agreed that the first step should be the properties of goals. I just also believe that if you get some nice properties of goals, you might know what constraints to add to utility functions to make them more "goal-like".

Your last sentence seems to contradict what you wrote about Dennett. As I understand it, you are saying "goals would be like high level human goals", while your criticism of Dennett was that the intentional stance doesn't necessarily work on NNs because they don't have to have the same kind of goals as us. Am I wrong about one of those opinions?

By contrast, in this section I’m interested in what it means for an agent to have a goal of its own. Three existing frameworks which attempt to answer this question are Von Neumann and Morgenstern’s expected utility maximisation, Daniel Dennett’s intentional stance, and Hubinger et al’s mesa-optimisation. I don’t think any of them adequately characterises the type of goal-directed behaviour we want to understand, though. While we can prove elegant theoretical results about utility functions, they are such a broad formalism that practically any behaviour can be described as maximising some utility function.

There is my algorithmic-theoretic definition which might be regarded as a formalization of the intentional stance, and which avoids the degeneracy problem you mentioned.
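The degeneracy problem in the quoted passage can be made concrete with a toy construction. This is not the algorithmic-theoretic definition referred to above; it is a standard illustrative trick, and every name in it (`rationalising_utility`, `twitchy_policy`, the action list) is hypothetical. Given any deterministic policy, it builds a utility function that the policy maximises exactly.

```python
from typing import Callable, List

Policy = Callable[[str], str]


def rationalising_utility(policy: Policy) -> Callable[[str, str], float]:
    """Build a utility function that the given policy trivially maximises.

    Utility is 1 for the action the policy would take and 0 otherwise,
    so the construction works for any deterministic policy whatsoever.
    """
    def utility(observation: str, action: str) -> float:
        return 1.0 if action == policy(observation) else 0.0
    return utility


def twitchy_policy(observation: str) -> str:
    """Arbitrary behaviour with no obvious goal behind it."""
    return "twitch" if len(observation) % 2 == 0 else "freeze"


ACTIONS: List[str] = ["twitch", "freeze", "walk"]
utility = rationalising_utility(twitchy_policy)


def utility_maximiser(observation: str) -> str:
    """Pick the action with the highest utility for this observation."""
    return max(ACTIONS, key=lambda action: utility(observation, action))


observations = ["rain", "sun", "fog", "storm"]
behaves_identically = all(
    utility_maximiser(obs) == twitchy_policy(obs) for obs in observations
)
```

Because the construction succeeds for arbitrary behaviour, "maximises some utility function" places no constraint on what a system does, which is the sense in which the formalism is too broad.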

But note that humans are far from fully consequentialist, since we often obey deontological constraints or constraints on the types of reasoning we endorse.

Furthermore, we should take seriously the possibility that superintelligent AGIs might be even less focused than humans are on achieving large-scale goals. We can imagine them possessing final goals which don’t incentivise the pursuit of power, such as deontological goals, or small-scale goals.
...
My underlying argument is that agency is not just an emergent property of highly intelligent systems, but rather a set of capabilities which need to be developed during training, and which won’t arise without selection for it

And the same applies for 'instrumental convergence' - the observation that most possible goals, especially simple goals, imply a tendency to produce extreme outcomes when ruthlessly maximised:

A system that is optimizing a function of n variables, where the objective depends on a subset of size k<n, will often set the remaining unconstrained variables to extreme values; if one of those unconstrained variables is actually something we care about, the solution found may be highly undesirable.
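A toy version of the quoted claim, as a sketch under assumptions not in the quote: two variables share a fixed budget, the objective depends only on the first, and a brute-force search stands in for the optimiser. The variable names and the budget are hypothetical.

```python
from itertools import product

BUDGET = 10  # hypothetical shared resource, split between two uses


def objective(paperclips: int, parkland: int) -> int:
    """Depends on only one of the two variables (k = 1, n = 2)."""
    return paperclips


# All non-negative integer allocations that fit within the shared budget.
feasible = [
    (paperclips, parkland)
    for paperclips, parkland in product(range(BUDGET + 1), repeat=2)
    if paperclips + parkland <= BUDGET
]

# Brute-force optimisation over the feasible set.
best_paperclips, best_parkland = max(feasible, key=lambda alloc: objective(*alloc))
```

Here the optimum is `(10, 0)`: the variable the objective ignores is driven to its minimum, because any resources left to it would reduce the objective. The effect in this sketch comes from the shared constraint coupling the two variables; without such coupling the ignored variable would simply be left arbitrary.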

