The Orthogonality Thesis says that it is possible to direct arbitrarily intelligent agents toward any end. For example, it's possible to have an extremely smart mind which only pursues the end of creating as many paperclips as possible.
The Orthogonality Thesis is a statement about computer science - a property of the design space of possible cognitive agents. Orthogonality doesn't claim, for example, that AI projects on our own Earth are equally likely to create any possible mind design. Orthogonality says nothing about whether a human AI researcher would want to build an AI that made paperclips, or want to make a nice AI. The Orthogonality Thesis just says that the space of possible designs at least contains AIs that make paperclips, or AIs that are nice.
Orthogonality stands in contrast to an inevitabilist thesis which might say, for example, that every sufficiently intelligent agent, regardless of how it was constructed, will inevitably converge on pursuing some particular goal.
The relevant policy implication of Orthogonality is that, since the design space contains both value-aligned and non-value-aligned agents, which kind of agent we actually get depends on how the AI is built; solving the value alignment problem is therefore both necessary and possible.
The Orthogonality Thesis does not say anything about the real-world character of AI projects - which goals they try to inculcate, how competently they do so, etcetera. It just claims as a matter of computer science that certain possible agent designs exist.
Some particular agent architectures may still be much more easily configurable to some goals than others. Orthogonality is an existence statement over the whole design space of possibilities; it is not claimed to hold within every particular agent architecture.
Orthogonality is meant as a descriptive statement about reality (or about the mathematical space of possibilities for agent designs) rather than a normative statement. It is not a claim about the way things ought to be; or a claim that moral relativism is true (e.g. that all human moralities are on equally uncertain footing relative to some uniquely normative higher metamorality that judges all human moralities as equally devoid of what would objectively constitute a justification); etcetera. Claiming that paperclip maximizers can exist is not necessarily meant to say anything favorable about paperclips, or derogatory about valuing sapient life, etcetera.
A precise statement of Orthogonality includes the caveat that the corresponding optimization problem must be tractable.
Suppose, for the sake of argument, that aliens offered to pay us the equivalent of a million dollars in wealth for every paperclip we made. We would not find anything especially intractable about figuring out how to make lots of paperclips. We can imagine ourselves having a human reason to make lots of paperclips, and, given that reason, the optimization problem of "How can I make lots of paperclips?" would pose no special difficulty.
That is, the questions "What sort of events and policies would lead to large numbers of paperclips existing?" and "Of the policies I can search out, which would result in the most paperclips?" would not be especially computationally burdensome or intractable.
The Orthogonality Thesis in its stronger form says that, when specifying an agent that takes actions whose consequences are highly ranked according to some outcome-scoring function U, there is no added difficulty except whatever difficulty is inherent in the question "What policies would in fact result in consequences with high U-scores?"
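As a toy illustration of this stronger form, consider the following Python sketch, in which the generic search machinery is written once and the outcome-scoring function is a plug-in argument; the world model, the candidate policies, and the two scoring functions here are purely hypothetical stand-ins. The point is only that swapping the scoring function changes nothing about the search code, so any extra difficulty must live inside the prediction-and-scoring problem itself.

```python
from typing import Callable, Dict, List

# Hypothetical toy setup: a "policy" is a label, and the world model maps each
# policy to a predicted outcome, represented as a dict of outcome features.
Outcome = Dict[str, int]

def plan(policies: List[str],
         predict: Callable[[str], Outcome],
         score: Callable[[Outcome], float]) -> str:
    """Generic search machinery: return the policy whose predicted
    consequences the plugged-in outcome-scoring function ranks highest.
    Nothing in this function depends on which scoring function is used."""
    return max(policies, key=lambda p: score(predict(p)))

# Two different terminal goals, expressed purely as outcome-scoring functions.
def count_paperclips(outcome: Outcome) -> float:
    return float(outcome.get("paperclips", 0))

def count_happy_people(outcome: Outcome) -> float:
    return float(outcome.get("happy_people", 0))

# A stand-in world model; any real difficulty lives here and in the scorer,
# not in the way the two are combined.
def toy_predict(policy: str) -> Outcome:
    return {
        "run_paperclip_factory": {"paperclips": 10_000, "happy_people": 1},
        "fund_public_health":    {"paperclips": 0,      "happy_people": 5_000},
    }[policy]

policies = ["run_paperclip_factory", "fund_public_health"]
print(plan(policies, toy_predict, count_paperclips))    # run_paperclip_factory
print(plan(policies, toy_predict, count_happy_people))  # fund_public_health
```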
In contrast, if an agent wanted the SHA512 hash of a digitized representation of the quantum state of the universe to be 0 as often as possible, this would be an exceptionally intractable kind of goal. Even if aliens offered to pay us to do that, we still couldn't figure out how.
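The contrast can be made concrete with a small sketch using Python's standard-library hashlib (the "world encoding" below is, of course, a stand-in): evaluating whether any one candidate satisfies the hash-based goal is cheap, but the hash output has no exploitable structure, so searching for a satisfying candidate reduces to blind guessing against roughly 2^512 possibilities.

```python
import hashlib
import os

def goal_satisfied(world_encoding: bytes) -> bool:
    """The intractable goal: the SHA-512 hash of the (stand-in) digitized
    world state equals the all-zero 64-byte digest."""
    return hashlib.sha512(world_encoding).digest() == bytes(64)

# Evaluating the goal on any single candidate is trivial...
print(goal_satisfied(b"a stand-in digitized world state"))  # False

# ...but the hash output offers no gradient or structure to exploit, so search
# can do no better than blind guessing; a million random tries will, with
# overwhelming probability, find nothing.
hits = sum(goal_satisfied(os.urandom(64)) for _ in range(1_000_000))
print(hits)  # 0
```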
Intuitively, the Orthogonality Thesis could be restated as, "To whatever extent you could figure out how to get a high-U outcome if aliens offered to pay you a huge amount of resources to do it, the corresponding agent that wants high-U outcomes won't be any worse at solving the problem." This formulation would be false if, for example, an intelligent agent that terminally wanted paperclips were limited in intelligence by the defects of reflectivity required to keep it from realizing how stupid it is to pursue paperclips, whereas a galactic superintelligence being paid to pursue paperclips could be far more intelligent and strategic about getting them.
For purposes of stating Orthogonality's precondition, we consider only the object-level search problem of relating the material goal U to external actions. If there turn out to be any special difficulties associated with "How can I make sure that I go on pursuing U?" or "What kind of subagent would want to pursue U?", then these difficulties count as contradicting Orthogonality, rather than as falling under the tractability exception in its precondition. Orthogonality claims that the only added difficulties are those inherent in "What non-reflective, non-agent-programming-related, object-level events are needed to achieve material outcomes that fulfill U?"
The corresponding principle in philosophy was advocated first by David Hume, whose phrasings included, "'Tis not contrary to reason to prefer the destruction of the whole world to the scratching of my finger." (In our terms: an agent whose preferences over outcomes score the destruction of the world more highly than the scratching of David Hume's finger is not thereby impeded from forming accurate models of the world, or from figuring out which policies to pursue to that end.)
On an intuitive level, Hume's principle was seen by some as obvious, and by others as ridiculous. Some philosophers responded to Hume by advocating 'thick' definitions of intelligence that included some statement about the 'reasonableness' of the agent's ends. For our purposes, if an agent is cognitively powerful enough to build Dyson Spheres, we don't care whether it's defined as 'intelligent' or not. A definition of the word 'intelligence' contrived to exclude paperclip maximization doesn't change the empirical behavior or empirical power of a paperclip maximizer.
We could arrange ascending strengths of the Orthogonality Thesis as follows:
To pry apart these possibilities:
Argument (A) supports ultraweak Orthogonality, argument (B) supports weak Orthogonality, argument (C) supports classic Orthogonality, and (D)-(F) support strong Orthogonality:
...
(in progress, content from older version of page preserved below)
For pragmatic reasons, the phrase 'every agent of sufficient cognitive power' in the Inevitability Thesis is specified to include e.g. all cognitive entities that are able to invent new advanced technologies and build Dyson Spheres in pursuit of long-term strategies, regardless of whether a philosopher might claim that they lack some particular cognitive capacity in view of how they respond to attempted moral arguments, or whether they are e.g. conscious in the same sense as humans, etcetera.
Most pragmatic implications of Orthogonality or Inevitability revolve around the following refinements:
Implementation dependence: The humanly accessible space of AI development methodologies has enough variety to yield both AI designs that are value-aligned, and AI designs that are not value-aligned.
Value loadability possible: There is at least one humanly feasible development methodology for advanced agents that has Orthogonal freedom of what utility function or meta-utility framework is introduced into the advanced agent. (Thus, if we could describe a value-loadable design, and also describe a value-aligned meta-utility framework, we could combine them to create a value-aligned advanced agent; a toy sketch of this combination appears after this list.)
Pragmatic inevitability: There exists some goal G such that almost all humanly feasible development methodologies result in an agent that ends up behaving as if it optimizes G, perhaps among other goals. Particular arguments about futurism will pick different goals G, but all such arguments are negated by anything that tends to contradict pragmatic inevitability in general.
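As a toy rendering of the value-loadability refinement (all names below are hypothetical stand-ins, not a proposed methodology): the design leaves the utility framework as a free constructor argument, so building a value-aligned agent reduces to loading a value-aligned utility into the same design that could equally well have been loaded with paperclip-counting. Pragmatic inevitability, by contrast, would be the claim that the resulting agent behaves as if it optimizes some fixed goal G no matter what is loaded.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Outcome = Dict[str, float]
Utility = Callable[[Outcome], float]  # stand-in for a utility / meta-utility framework

@dataclass
class LoadableAgent:
    """A value-loadable design: the cognitive machinery is fixed at design
    time, while the utility framework is supplied as a constructor argument."""
    predict: Callable[[str], Outcome]  # world model: policy -> predicted outcome
    utility: Utility                   # the loaded goal content

    def choose(self, policies: List[str]) -> str:
        # Identical planning code regardless of which utility framework was loaded.
        return max(policies, key=lambda p: self.utility(self.predict(p)))

def toy_world_model(policy: str) -> Outcome:
    # Hypothetical predictions for two candidate policies.
    return {
        "convert_matter_to_paperclips": {"paperclips": 1e6, "flourishing": 0.0},
        "protect_sapient_life":         {"paperclips": 1e2, "flourishing": 1e6},
    }[policy]

# Loading different utility frameworks into the same design yields agents with
# very different behavior (implementation dependence, not inevitability).
paperclipper = LoadableAgent(toy_world_model, lambda o: o["paperclips"])
aligned_ish  = LoadableAgent(toy_world_model, lambda o: o["flourishing"])

policies = ["convert_matter_to_paperclips", "protect_sapient_life"]
print(paperclipper.choose(policies))  # convert_matter_to_paperclips
print(aligned_ish.choose(policies))   # protect_sapient_life
```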
Implementation dependence is the core of the policy argument that solving the value alignment problem is necessary and possible.
Futuristic scenarios in which AIs are said in passing to 'want' something-or-other usually rely on some form of pragmatic inevitability premise and are negated by implementation dependence.
Orthogonality directly contradicts the metaethical position of moral internalism, which would be falsified by the observation of a paperclip maximizer. On metaethical positions that hold orthogonality and cognitivism to be compatible, exhibiting a paperclip maximizer has few or no implications for object-level moral questions; in particular, Orthogonality does not imply that our humane values or normative values are arbitrary, selfish, or non-cosmopolitan, or that we are taking a myopic view of the universe or of value.