Crossposted from my personal blog

A common assumption about AGI is the orthogonality thesis, which argues that goals/utility functions and the core intelligence of an AGI system are orthogonal or can be cleanly factored apart. More precisely, it is important to distinguish between orthogonality in design and orthogonality at runtime. Orthogonality in design states that we can construct an AGI which optimizes for any goal. Orthogonality at runtime would be an AGI design that would consist of an AGI which can switch between arbitrary goals while operating. Here, we are only really talking about the latter orthogonality. Concretely, this perfect factoring occurs in model-based planning algorithms, which assume a world model, a planner, and a reward function, each an orthogonal stand-alone component. The planner uses the world model to predict the consequences of actions and the reward function to rank those consequences, then selects the plan with the best predicted consequences. This is a fully factored model of intelligence -- the world model, planner, and reward function can each be swapped for another without issue. This was also the assumed picture of how intelligence would work in pre-DL thinking on AGI.
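
As a minimal sketch of this factoring, assume a toy one-dimensional world that is not from the original post: cells 0 to 10, with actions -1 and +1. The names `world_model`, `plan`, `reward_right`, and `reward_left` are hypothetical, and the exhaustive search merely stands in for whatever planner a real system would use.

```python
from itertools import product
from typing import Callable, Sequence, Tuple

def world_model(state: int, action: int) -> int:
    """Toy dynamics (an assumption): move along cells 0..10, clipped at the ends."""
    return max(0, min(10, state + action))

def plan(state: int,
         model: Callable[[int, int], int],
         reward: Callable[[int], float],
         actions: Sequence[int] = (-1, 1),
         horizon: int = 5) -> Tuple[int, ...]:
    """Exhaustive planner: simulate every action sequence and keep the best one."""
    best_return, best_plan = float("-inf"), ()
    for seq in product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            s = model(s, a)       # the world model predicts the consequence
            total += reward(s)    # the reward function scores it
        if total > best_return:
            best_return, best_plan = total, seq
    return best_plan

def reward_right(s: int) -> float:
    """Hypothetical goal A: prefer high-numbered cells."""
    return float(s)

def reward_left(s: int) -> float:
    """Hypothetical goal B: prefer low-numbered cells."""
    return -float(s)

# Same planner and world model, different reward function.
plan_a = plan(5, world_model, reward_right)   # (1, 1, 1, 1, 1)
plan_b = plan(5, world_model, reward_left)    # (-1, -1, -1, -1, -1)
```

Swapping the reward function changes the plan without touching the planner or the world model. The price is that `plan` simulates every one of the `len(actions) ** horizon` action sequences each time it is called.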

However, it has become obvious that current ML-based RL systems are not always so well factored. For instance, model-free RL typically computes value functions or amortized policies whose weights are learnt to predict values or actions directly, without going through a planner (or exhaustively computing values via world-model simulation). In these cases, the cognitive architecture is not orthogonal or factored. Core components of the policy-selector (planner) depend on details of the reward function -- the policy you learned for reward function A may be really bad if you instead switch to reward function B. These agents are much less flexible than their full model-based planning equivalents.
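
A small sketch of this failure mode, under assumptions that are not in the original post: a toy walk on cells 0 to 10, tabular Q-learning updates applied in deterministic sweeps over all transitions (rather than sampled experience), and hypothetical reward functions `reward_a` and `reward_b`. The learning rate is fixed at 1, which is only reasonable here because the toy dynamics are deterministic.

```python
ACTIONS = (-1, 1)
STATES = range(11)

def env_step(state, action):
    """Toy environment (an assumption): walk on cells 0..10, clipped at the ends."""
    return max(0, min(10, state + action))

def train_q_table(reward, sweeps=200, gamma=0.9):
    """Amortize behaviour for one reward into a Q-table using only (s, a, r, s') transitions."""
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    for _ in range(sweeps):
        for s in STATES:
            for a in ACTIONS:
                s2 = env_step(s, a)
                # Q-learning target with learning rate 1 (deterministic dynamics).
                q[(s, a)] = reward(s2) + gamma * max(q[(s2, b)] for b in ACTIONS)
    return q

def greedy_policy(q):
    """The amortized policy: a lookup table from state to action."""
    return {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in STATES}

def rollout_return(policy, reward, start=5, steps=5):
    """Undiscounted return of following the table for a few steps."""
    s, total = start, 0.0
    for _ in range(steps):
        s = env_step(s, policy[s])
        total += reward(s)
    return total

def reward_a(s):
    return float(s)     # hypothetical goal A: prefer high cells

def reward_b(s):
    return -float(s)    # hypothetical goal B: prefer low cells

policy_a = greedy_policy(train_q_table(reward_a))
policy_b = greedy_policy(train_q_table(reward_b))

a_on_a = rollout_return(policy_a, reward_a)   # 40.0
a_on_b = rollout_return(policy_a, reward_b)   # -40.0: policy A reused under reward B
b_on_b = rollout_return(policy_b, reward_b)   # -10.0: policy retrained for reward B
```

Acting with the table is a single lookup per step, but the table has the reward baked in: reusing `policy_a` under `reward_b` gives the worst possible return in this toy world, and the only remedy is to retrain.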

Why, then, do we build and use such non-factored agents? Because they are much more efficient. Full model-based planning at every step is extremely computationally expensive. Instead, we tend to amortize the cost of planning into policies or value functions, which inevitably costs some flexibility. However, if we only want to use the agent for one task or a small range of similar tasks, then this is a good trade-off.

We can go even further and remove the orthogonality and factoredness of the world model. This is implicitly what we do when we only use end-to-end reward-trained policies, such as the original deep Q-learning work. Here the 'world model' learns only the information relevant to optimizing the specific reward function it was trained on. Information that is irrelevant to this reward function (but which may be relevant to others) is ignored and cannot be recovered. This further specializes the agent towards optimizing for one goal over any others.

In general, full orthogonality and the resulting full flexibility are expensive. They require you to learn and keep around information (at maximum, all information) that is not relevant for the current goal but could be relevant for some possible goal, where the space of all possible goals is extremely wide. They require you not to take advantage of structure in the problem space, nor to specialize your algorithms to exploit this structure. They require you not to amortize specific recurring patterns for one task at the expense of preserving generality across tasks.

This is a special case of the trade-off between specificity and generality and a consequence of the no-free-lunch theorem. Specialization to do really well at one or a few things can be done relatively cheaply. Full generality over everything is prohibitively expensive. The question is the shape of the Pareto frontier in a specific capability region, which depends on the natural shape of the solution space as well as the most active constraints.

Because of this, it does not really make sense to think of AI systems with full orthogonality as the default case we should expect, nor as the ideal case to strive for. Instead, full factoring is at one end of a Pareto trade-off curve, and different architectures, depending on their constraints, will end up at different points along it. This is also why both humans and powerful DL systems do not exhibit full orthogonality but instead show differing degrees of modularity between these components, with correspondingly different levels of behavioural flexibility.

The important question for determining the shape of the future is what the slope of the Pareto frontier looks like over the ranges of general capabilities that a near-term AGI might have. This will determine whether we end up with fully general AGI singletons, multiple general systems, or else a very large number of much smaller hyper-specialized systems. The likely outcome then depends on the shape of the Pareto frontier as well as on which constraints are most active in this regime.


8 comments

Orthogonality in design states that we can construct an AGI which optimizes for any goal. Orthogonality at runtime would be an AGI design that would consist of an AGI which can switch between arbitrary goals while operating. Here, we are only really talking about the latter orthogonality

This should not be relegated to a footnote. I've always thought that design-time orthogonality is the core of the orthogonality thesis, and I was very confused by this post until I read the footnote.

Fair point. I need to come up with a better name than 'orthogonality' for what I am thinking about here -- 'well-factoredness'?

Will move the footnote into the main text.

This is a fine point to make in a post, but I want to point out that I think you misrepresent the strength of the orthogonality thesis in your opening paragraph. The thesis is about what's possible: it's a claim that intelligence and goals need not be confounded, not that they must not be confounded or shouldn't be confounded. It was crafted to make clear that AI minds need not be like human minds, where a great many things are tied together in ways such that you can imagine something like the anti-orthogonality thesis (that smarter AI will naturally be good because it'll reason out moral facts for itself or something similar) holding for humans, but not to propose that AI minds must be a particular way.

In fact, as you note, AI would be easier to align if intelligence and goals were confounded, and perhaps, due to some quirks of current systems, we'll have an easier time than we expected aligning AI well enough that it does not kill us for at least a few years.

This. "Orthogonality" is just the name the rationality/AI safety community gave to the old philosophical adage that "you cannot derive ought from is". No matter how smart you are at understanding the world, none of that will make you intrinsically good. Morality isn't inscribed somewhere in the mathematical structure of spacetime like the speed of light or Planck's constant. The universe just doesn't care, and thus no amount of understanding the universe will in itself make you care.

Agreed. Only a very weak form of orthogonality is necessary to have dangerously unaligned AI be the default.


I didn’t know you planned to cross-post this here.

No worries! I'm happy you went to the effort of summarising it. I was pretty slow in crossposting anyhow. 

Foundation Models tend to have a more limited type of orthogonality - they're good at pursuing any goal that's plausible under the training distribution, meaning they can pursue any goal that humans would plausibly have (with some caveats I guess). This is most true without outcome-based RL on top of the foundation model, but I'd guess some of the orthogonality transfers through RL.
