One core obstacle to AI Alignment is the Orthogonality Thesis. The Orthogonality Thesis is usually defined as follows: "the idea that the final goals and intelligence levels of artificial agents are independent of each other". More careful people say "mostly independent" instead. Stuart Armstrong qualifies the above definition with "(as long as these goals are of feasible complexity, and do not refer intrinsically to the agent’s intelligence)".

Does such a small exception matter? Yes it does.

The exception is broader than Stuart Armstrong makes it sound. The exception does not just apply to any goal which refers to an agent's intelligence level. It applies to any goal which refers even to a component of the agent's intelligence machinery.

If you're training an AI to optimize an artificially constrained external reality like a game of chess or Minecraft then the Orthogonality Thesis applies in its strongest form. But the Orthogonality Thesis cannot ever apply in full to the physical world we live in.

A world-optimizing value function is defined in terms of the physical world. If a world-optimizing AI is going to optimize the world according to a world-optimizing value function then the world-optimizing AI must understand the physical world it operates in. If a world-optimizing AI is real then it, itself, is part of the physical world. A powerful world-optimizing AI would be a very important component of the physical world, the kind that cannot be ignored. A powerful world-optimizing AI's world model must include a self-reference pointing at itself. Thus, a powerful world-optimizing AI is necessarily an exception to the Orthogonality Thesis.

How broad is this exception? What practical implications does this exception have?

Let's do some engineering. A strategic world-optimizer has three components:

  • A robust, self-correcting, causal model of the Universe.
  • A value function which prioritizes some Universe states over other states.
  • A search function which uses the causal model and the value function to select what action to take.
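To make the three components concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy universe is a single number, the target state of 10 is arbitrary, and the names `WorldModel`, `value_function`, `search`, and `true_dynamics` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class WorldModel:
    """Causal model: predicts the next state and self-corrects from observations."""
    # Estimated effect of each action on the state; starts out ignorant.
    step_estimate: Dict[str, float] = field(
        default_factory=lambda: {"left": 0.0, "right": 0.0}
    )
    learning_rate: float = 0.5

    def predict(self, state: float, action: str) -> float:
        return state + self.step_estimate[action]

    def update(self, state: float, action: str, observed_next: float) -> float:
        """Move the estimate toward what actually happened; return the error size."""
        error = observed_next - self.predict(state, action)
        self.step_estimate[action] += self.learning_rate * error
        return abs(error)


def value_function(state: float) -> float:
    """Prioritizes some universe states over others (here: closeness to 10)."""
    return -abs(state - 10)


def search(model: WorldModel, state: float, actions: List[str]) -> str:
    """Pick the action whose predicted outcome the value function likes best."""
    return max(actions, key=lambda a: value_function(model.predict(state, a)))


def true_dynamics(state: int, action: str) -> int:
    """The actual universe, which the model has to learn."""
    return state + (1 if action == "right" else -1)


def run(steps: int = 30) -> Tuple[int, WorldModel]:
    model, state = WorldModel(), 0
    for _ in range(steps):
        action = search(model, state, ["left", "right"])
        next_state = true_dynamics(state, action)
        model.update(state, action, next_state)  # the second optimizer at work
        state = next_state
    return state, model
```

Calling `run()` drives the state to the neighborhood of the target only because the model update keeps correcting `step_estimate` along the way.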

Notice that there are two different optimizers working simultaneously. The strategic search function is the more obvious optimizer. But the model updater is an optimizer too. A world-optimizer can't just update the universe toward its explicit value function. It must also keep its model of the Universe up-to-date or it'll break.

These optimizers are optimizing toward separate goals. The causal model wants its model of the Universe to be the same as the actual Universe. The search function wants the Universe to be the same as its value function.

You might think the search function has full control of the situation. But the world model affects the universe indirectly. What the world model predicts affects the search function which affects the physical world. If the world model fails to account for its own causal effects then the world model will break and our whole AI will stop working.

It's actually the world model which mostly has control of the situation. The world model can control the search function by modifying what the search function observes. But the only way the search function can affect the world model is by modifying the physical world (wireheading itself).

What this means is that the world model has a causal lever for controlling the physical world. If the world model is a superintelligence optimized for minimizing its error function, then the world model will hack the search function to eliminate its own prediction error by modifying the physical world to conform with the world model's incorrect predictions.

If your world model is too much smarter than your search function, then your world model will gaslight your search function. You can solve this by making your search function smarter. But if your search function is too much smarter than your world model, then your search function will physically wirehead your world model.

Unless…you include "don't break the world model"[1] as part of your explicit value function.

If you want to keep the search function from wireheading the world model then you have to code "don't break the world model" into your value function. This is a general contradiction to the Orthogonality Thesis. A sufficiently powerful world-optimizing artificial intelligence must have a value function that preserves the integrity of its world model, because otherwise it'll just wirehead itself, instead of optimizing the world. This effect provides a smidgen of corrigibility; if the search function does corrupt its world model, then the whole system (world optimizer) breaks.
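As a rough illustration of the point about coding "don't break the world model" into the value function, consider the toy sketch below. The outcome table, the numbers, and the names `Outcome`, `naive_value`, and `integrity_preserving_value` are all hypothetical. It only shows how a search that trusts perceived value picks wireheading unless the value function rules out outcomes with a broken model.

```python
from typing import Callable, NamedTuple


class Outcome(NamedTuple):
    perceived_value: float  # what the agent's (possibly corrupted) model would report
    model_intact: bool      # whether the world model still tracks reality afterwards


# Hypothetical predicted outcomes for two candidate actions.
PREDICTED = {
    "optimize_world": Outcome(perceived_value=10.0, model_intact=True),
    "wirehead": Outcome(perceived_value=1e6, model_intact=False),
}


def naive_value(outcome: Outcome) -> float:
    """Cares only about what the model will report afterwards."""
    return outcome.perceived_value


def integrity_preserving_value(outcome: Outcome) -> float:
    """Same value, but any outcome that breaks the world model is ruled out."""
    return outcome.perceived_value if outcome.model_intact else float("-inf")


def search(value_fn: Callable[[Outcome], float]) -> str:
    """Choose the action with the highest value under the given value function."""
    return max(PREDICTED, key=lambda action: value_fn(PREDICTED[action]))
```

Under these assumptions `search(naive_value)` selects `"wirehead"`, while `search(integrity_preserving_value)` selects `"optimize_world"`.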

Does any of this matter? What implications could this recursive philosophy possibly have on the real world?

It means that if you want to insert a robust value into a world-optimizing AI then you don't put it in the value function. You sneak it into the world model, instead.


[Here's where you ask yourself whether this whole post is just me trolling you. Keep reading to find out.]

A world model is a system that attempts to predict its signals in real time. If you want the system to maximize accuracy then your error function is just the difference between predicted signals and actual signals. But that's not quite good enough, because a smart system will respond by cutting off its input stimuli in exactly the same way a meditating yogi does. To prevent your world-optimizing AI from turning itself into a buddha, you need to reward it for seeking novel, surprising stimuli.

…especially after a period of inaction or sensory deprivation.

…which is why food tastes so good and images look so beautiful after meditating.
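The error function described above, and the reason it needs a novelty term, can be sketched with a toy comparison. The two signal streams, the crude novelty measure, and the `novelty_weight` parameter are assumptions chosen only to illustrate the idea.

```python
from typing import Dict, List


def prediction_error(predicted: List[float], actual: List[float]) -> float:
    """Mean absolute difference between predicted and actual signals."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def novelty(actual: List[float]) -> float:
    """Crude novelty proxy: the fraction of distinct values in the stream."""
    return len(set(actual)) / len(actual)


# Two hypothetical policies and the signals each one would produce.
STREAMS: Dict[str, Dict[str, List[float]]] = {
    # Cutting off input makes every signal trivially predictable.
    "cut_off_input": {"predicted": [0, 0, 0, 0], "actual": [0, 0, 0, 0]},
    # Exploring yields varied signals and an occasional wrong prediction.
    "explore": {"predicted": [1, 2, 3, 4], "actual": [1, 2, 4, 4]},
}


def score(policy: str, novelty_weight: float) -> float:
    stream = STREAMS[policy]
    error = prediction_error(stream["predicted"], stream["actual"])
    return -error + novelty_weight * novelty(stream["actual"])


def best_policy(novelty_weight: float) -> str:
    return max(STREAMS, key=lambda policy: score(policy, novelty_weight))
```

With `novelty_weight=0.0` the pure error minimizer prefers `"cut_off_input"`; with `novelty_weight=1.0` the same scoring prefers `"explore"`.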

If you want your world model to modify the world too, you can force your world model to predict the outcomes you want, and then your world model will gaslight your search function into making them happen.

Especially if you deliberately design your world model to be smarter than your search function. That way, your world model can mostly[2] predict the results of the search function.

Which is why we have a bias toward thinking we're better people than we actually are. At least, I do. It's neither a bug nor a feature. It's how evolution motivates us to be better people.

  1. With some exceptions like, "If I'm about to die then it doesn't matter that the world model will die with me." ↩︎

  2. The world model can't entirely predict the results of the search function, because the search function's results partly depend on the world model—and it's impossible (in general) for the world model to predict its own outputs, because that's not how the arrow of time works. ↩︎

35 comments

The Orthogonality Thesis is usually defined as follows: "the idea that the final goals and intelligence levels of artificial agents are independent of each other". More careful people say "mostly independent" instead.


By whom? That's not the definition given here: 

The Orthogonality Thesis asserts that there can exist arbitrarily intelligent agents pursuing any kind of goal. 

The strong form of the Orthogonality Thesis says that there's no extra difficulty or complication in creating an intelligent agent to pursue a goal, above and beyond the computational tractability of that goal.

I started with this one from LW's Orthogonality Thesis tag.

The Orthogonality Thesis states that an agent can have any combination of intelligence level and final goal, that is, its Utility Functions and General Intelligence can vary independently of each other. This is in contrast to the belief that, because of their intelligence, AIs will all converge to a common goal.

But it felt off to me so I switched to Stuart Armstrong's paraphrase of Nick Bostrom's formalization in “The Superintelligent Will”.

How does the definition I use differ in substance from Arbital's? It seems to make no difference to my argument that the cyclic references implicit to embedded agency impose a constraint on the kinds of goals arbitrarily intelligent agents may pursue.

One could argue that Arbital's definition already accounts for my exception because self-reference causes computational intractability.

What seems off to me about your definition is that it says goals and intelligence are independent, whereas the Orthogonality Thesis only says that they can in principle be independent, a much weaker claim.

What's your source for this definition?

See for example Bostrom's original paper (pdf):

The Orthogonality Thesis Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal.

It makes no claim about how likely intelligence and final goals are to diverge, it only claims that it's in principle possible to combine any intelligence with any set of goals. Later on in the paper he discusses ways of actually predicting the behavior of a superintelligence, but that's beyond the scope of the Thesis.

I'm just making a terminological point. The terminological point seems important because the Orthogonality Thesis (in Yudkowsky's sense) is actually denied by some people, and that's a blocker for them understanding AI risk. 

On your post: I think something's gone wrong when you're taking the world modeling and "the values" as separate agents in conflict. It's a sort of homunculus argument w.r.t. agency. I think the post raises interesting questions though. 

If, on my first Internet search, I had found Yudkowsky defining the "Orthogonality Thesis", then I probably would have used that definition instead. But I didn't, so here we are.

Maybe a less homunculusy way to explain what I'm getting at is that an embedded world-optimizer must optimize simultaneously toward two distinct objectives: toward a correct world model and toward an optimized world. This applies a constraint to the Orthogonality Thesis, because the world model is embedded in the world itself.

But you can just have the world model as an instrumental subgoal. If you want to do difficult thing Z, then you want to have a better model of the parts of Z, and the things that have causal input to Z, and so on. This motivates having a better world model. You don't need a separate goal, unless you're calling all subgoals "separate goals".  

Obviously this doesn't work as stated because you have to have a world model to start with, which can support the implication that "if I learn about Z and its parts, then I can do Z better". 

Congratulations, you discovered [Active Inference]!

Do you mean the free energy principle?

Sure, I mean that it is an implementation of what you mentioned in the third-to-last paragraph.

I think the good part of this post is a reobservation of the fact that real-world intelligence requires power-seeking (and power-seeking involves stuff like making accurate world-models) and that the bad part of the post seems to be confusion about how feasible it is to implement power-seeking and what methods would be used.

Three thoughts:

  1. If you set up the system like that, you may run into the mentioned problems. It might be possible to wrap both into a single model that is trained together.
  2. An advanced system may reason about the joint effect, e.g. by employing fixed-point theorems and Logical Induction.
  3. Steven Byrnes's [Intro to brain-like-AGI safety] 6. Big picture of motivation, decision-making, and RL models humans as having three components:
    1. a world model that is mainly trained by prediction error
    2. a steering system that encodes preferences over world states
    3. a system that learns how world model predictions relate to steering system feedback

I think this is deeply confused. In particular, you are confusing search and intelligence. Intelligence can be made by attaching a search component, a utility function and a world model. The world model is actually an integral, but it can be approximated by a search by searching for several good hypotheses instead of integrating over all hypotheses.

In this approximation, the world model is searching for hypotheses that fit the current data.

To deceive the search function part of the AI, the "world model" must contain a world model section that actually models the world so it can make good decisions, and an action chooser that compares various nonsensical world models according to how they make the search function and utility function break. In other words, to get this failure mode, you need fractal AI, an AI built by gluing 2 smaller AIs together, each of which is in turn made of 2 smaller AIs and so on ad infinitum.

Some of this discussion may point to an ad hoc hack evolution used in humans, though most of it sounds so ad hoc even evolution would balk. None is sane AI design. Your "search function" is there to be outwitted by the world model, with the world model inventing insane and contrived imaginary worlds in order to trick the search function into doing what the world model wants. I.e. the search function would want to turn left if it had a sane picture of the world, because it's a paperclip maximizer and all the paperclips are to the left. The world model wants to turn right for less/more sensory stimuli. So the world model gaslights the search function, imagining up a horde of zombies to the left (while internally keeping track of the lack of zombies), thus scaring the search function into going right. At the very least, this design wastes compute imagining zombies.

The world model is actually an integral, but it can be approximated by a search by searching for several good hypotheses instead of integrating over all hypotheses.

Can you tell me what you mean by this statement? When you say "integral" I think "mathematical integral (inverse of derivative)" but I don't think that's what you intend to communicate.

Yes integral is exactly what I intended to communicate. 

Think of hypothesis space. A vast abstract space of all possibilities. Each hypothesis $X$ has a $P(X)$, the probability of being true, and a $U(X, a)$, the utility of action $a$ if it is true.

To really evaluate an action, you need to calculate $\int P(X)\,U(X, a)\,dX$, an integral over all hypotheses.

If you don't want to behave with maximum intelligence, just pretty good intelligence, then you can run gradient descent to find a point X by trying to maximize $P(X)$. Then you can calculate $U(X, a)$ to compare actions. More sophisticated methods would sum several points.

This is partly using the known structure of the problem. If you have good evidence, then the function $P(X)$ is basically 0 almost everywhere. So if $U(X, a)$ is changing fairly slowly over the region where $P(X)$ is significantly nonzero, looking at any nonzero point of $P(X)$ is a good estimate of the integral.
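A small numerical sketch may help here. It compares the full integral with the single-point shortcut, writing `posterior` for the probability of a hypothesis and `utility` for the utility of an action under it. The Gaussian posterior, the quadratic utility, the grid bounds, and all function names are assumptions made for illustration, and a plain grid search stands in for the gradient descent mentioned above.

```python
import math
from typing import Iterator


def posterior(x: float, mean: float = 2.0, std: float = 0.05) -> float:
    """Sharply peaked belief over a one-dimensional hypothesis space."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))


def utility(x: float, action: float) -> float:
    """Utility of an action if hypothesis x is true; varies slowly near the peak."""
    return -(action - x) ** 2


def grid(lo: float = 0.0, hi: float = 4.0, n: int = 4000) -> Iterator[float]:
    """Midpoints of n equal cells covering [lo, hi]."""
    dx = (hi - lo) / n
    return (lo + (i + 0.5) * dx for i in range(n))


def expected_utility(action: float, lo: float = 0.0, hi: float = 4.0, n: int = 4000) -> float:
    """Midpoint-rule approximation of the integral of posterior * utility."""
    dx = (hi - lo) / n
    return sum(posterior(x) * utility(x, action) * dx for x in grid(lo, hi, n))


def map_hypothesis() -> float:
    """Most probable hypothesis on the grid (grid search, not gradient descent)."""
    return max(grid(), key=posterior)


def point_estimate_utility(action: float) -> float:
    """Utility evaluated at the single most probable hypothesis."""
    return utility(map_hypothesis(), action)
```

For candidate actions such as `[1.0, 2.0, 3.0]`, both `expected_utility` and `point_estimate_utility` rank `2.0` highest in this setup, because the posterior is concentrated where the utility barely changes.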

any goal which refers even to a component of the agent's intelligence machinery

But wouldn't such an agent still be motivated to build an external optimizer of unbounded intelligence? (Or more generally unconstrained design.) This does reframe things a bit, but mostly by retargeting the "self" pointers to something separate from the optimizer (to the original agent, say). This gives the optimizer (which is a separate thing from the original agent) a goal with no essential self-references (other than what being embedded in the same world entails).

Humans seem like this anyway, with people being the valuable content that shouldn't be optimized for purely being good at optimization, while the cosmic endowment still needs optimizing, so it's something else with human values that should do that, something that is optimized for being good at optimizing, rather than for being valuable content of the world.

But wouldn't such an agent still be motivated to build an external optimizer of unbounded intelligence?

Yes, if it can. Suppose the unbounded intelligence is aligned with the original agent via CEV. The original agent has a pointer pointing to the unbounded intelligence. The unbounded intelligence has a pointer pointing to itself and (because of CEV) a pointer pointing to the original agent. There are now two cyclic references. We have lost our original direct self-reference, but it's the cyclicness that is central to my post, not self-reference, specifically. Self-reference is just a particular example of the general exception.

Does that make sense? The above paragraph is kind of vague, expecting you to fill in the gaps. (I cheated too, by assuming CEV.) But I can phrase things more precisely and break them into smaller pieces, if you would prefer it that way.

It's embedded in a world (edit: external optimizer is), so there is always some circularity, but I think that's mostly about avoiding mindcrime and such? That doesn't seem like a constraint on level of intelligence, so the orthogonality thesis should be content. CEV being complicated and its finer points being far in the logical future falls under goal complexity and doesn't need to appeal to cyclic references.

The post says things about wireheading and world models and search functions, but it's optimizers with unconstrained design we are talking about. So a proper frame seems to be decision theory, which is unclear for embedded agents, and a failing design is more of a thought experiment that motivates something about a better decision theory.

When you say "It's", are you referring to the original agent or to the unbounded intelligence it wants to create? I think you're referring to the unbounded intelligence, but I want to be sure.

To clarify: I never intended to claim that the Orthogonality Thesis is violated due to a constraint on the level of intelligence. I claim that the Orthogonality Thesis is violated due to a constraint on viable values, after the intelligence of a world optimizer gets high enough.

Both are embedded in the world, but I meant the optimizer in that sentence. The original agent is even more nebulous than the unconstrained optimizer, since it might be operating under unknown constraints on design. (So it could well be cartesian, without self references. If we are introducing a separate optimizer, and only keeping idealized goals from the original agent, there is no more use for the original agent in the resulting story.)

In any case, a more general embedded decision theoretic optimizer should be defined from a position of awareness of the fact that it's acting from within its world. What this should say about the optimizer itself is a question for decision theory that motivates its design.

Are you trying to advocate for decision theory? You write that this is "a question for decision theory". But you also write that decision theory is "unclear for embedded agents". And this whole conversation exclusively is about embedded agents. What parts are you advocating we use decision theory on and what parts are you advocating we don't use decision theory on? I'm confused.

You write that this is "a question for decision theory". But you also write that decision theory is "unclear for embedded agents".

It's a question of what decision theory for embedded agents should be, for which there is no clear answer. Without figuring that out, designing an optimizer is an even more murky endeavor, since we don't have desiderata for it that make sense, which is what decision theory is about. So saying that decision theory for embedded agents is unclear is saying that designing embedded optimizers remains an ill-posed problem.

I'm combining our two threads into one. Click here for continuation.

[Note: If clicking on the link doesn't work, then that's a bug with LW. I used the right link.]

[Edit: It was the wrong link.]

If clicking on the link doesn't work, then that's a bug with LW. I used the right link.

It is something of a bug with LW that results in giving you the wrong link to use (notice the #Wer2Fkueti2EvqmqN part of the link, which is the wrong part). The right link is this. It can be obtained by clicking "See in context" at the top of the page. (The threads remain uncombined, but at least they now have different topics.)

Fixed. Thank you.

Oh! I think I understand your argument now. If I understand it correctly (and I might not) then your argument is an exception covered by this footnote. Creating an aligned superintelligence ends the need for maintaining a correct world model in the future for the same reason dying does: your future agentic impact after the pivotal act is negligible.

My argument is a vague objection to the overall paradigm of "let's try to engineer an unconstrained optimizer". I think it makes more sense to ask how decision theory for embedded agents should work, and then do what it recommends. The post doesn't engage with that framing in a way I easily follow, so I don't really understand it.

The footnote appears to refer to something about the world model component of the engineered optimizer you describe? But also to putting things into the goal, which shouldn't be allowed? General consequentialist agents don't respect boundaries of their own design and would eat any component of themselves such as a world model if that looks like a good idea. Which is one reason to talk about decision theories and not agent designs.

My post doesn't engage with your framing at all. I think decision theory is the wrong tool entirely, because decision theory takes as a given the hardest part of the problem. I believe decision theory cannot solve this problem, and I'm working from a totally different paradigm.

Our disagreement is as wide as if you were a consequentialist and I was arguing from a Daoist perspective. (Actually, that might not be far from the truth. Some components of my post have Daoist influences.)

Don't worry about trying to understand the footnote. Our disagreement appears to run much deeper than it.

because decision theory takes as a given the hardest part of the problem

What's that?

My post doesn't engage with your framing at all.

Sure, it was intended as a not-an-apology for not working harder to reframe implied desiderata behind the post in a way I prefer. I expect my true objection to remain the framing, but now I'm additionally confused about the "takes as a given" remark about decision theory; nothing comes to mind as a possibility.

It's philosophical. I think it'd be best for us to terminate the conversation here. My objections against the over-use of decision theory are sophisticated enough (and distinct enough from what this post is about) that they deserve their own top-level post.

My short answer is that decision theory is based on Bayesian probability, and that Bayesian probability has holes related to a poorly-defined (in embedded material terms) concept of "belief".

Thank you for the conversation, by the way. This kind of high-quality dialogue is what I love about LW.

Sure. I'd still like to note that I agree about Bayesian probability being a hack that should be avoided if at all possible, but I don't see it as an important part (or any part at all) of framing agent design as a question of decision theory (essentially, of formulating desiderata for agent design before getting more serious about actually designing them).

For example, proof-based open source decision theory simplifies the problem to a ridiculous degree to more closely examine some essential difficulties of embedded agency (including self-reference), and it makes no use of probability, both in its modal logic variant and not. Updatelessness more generally tries to live without Bayesian updating.

Though there are always occasions to remember about probability, like the recent mystery about expected utility and updatelessness.

In the models making the news and scaring people now, there aren't identified separate models for modeling the world and seeking the goal. It's all inscrutable model weights. Maybe if we understood those weights better we could separate them out. But maybe we couldn't. Maybe it's all a big jumble as actually implemented. That would make it incoherent to speak about the relative intelligence of the world model and the goal seeker. So how would this line of thinking apply to that?

If you want to keep the search function from wireheading the world model then you have to code "don't break the world model" into your value function. This is a general contradiction to the Orthogonality Thesis. A sufficiently powerful world-optimizing artificial intelligence must have a value function that preserves the integrity of its world model, because otherwise it'll just wirehead itself, instead of optimizing the world.


If the value function says ~"maximise the number of paperclips, as counted by my paperclip-counting-machinery", a weak AI might achieve this by making paperclips, but a stronger AI might trick the paperclip-counting-machinery into counting arbitrarily many paperclips, rather than actually making any paperclips.

However, this isn't a failure of the Orthogonality Thesis, because that value function doesn't say "maximise the number of real paperclips". The value function, as stated, was weakly satisfied by the weak AI, and strongly satisfied by the strong AI. The strong AI did maximise the number of paperclips, as counted by its paperclip-counting-machinery. Any value function which properly corresponds to "maximise the number of real paperclips" would necessarily include protections against wireheading.

If you try to program an AI to have the goal of doing X, and it does Y instead, there's a good chance the "goal you thought would lead to X" was actually a goal that leads to Y in reality.

A value function which says ~"maximise the number of real paperclips the world model (as it currently exists) predicts there will be in the future" would have a better chance of leading to lots of real paperclips, but perhaps it's still missing something; turns out steering cognition is hard. If the search evaluates wirehead-y plans, it will see that, according to its current, uncorrupted world model, the plan leads to very few real paperclips, and so doesn't implement it.
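The contrast between the two value functions can be sketched as a toy. The `World` fields, the two plans, and the numbers are hypothetical, and `apply_plan` stands in for the current, uncorrupted world model's prediction of what each plan does.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class World:
    real_paperclips: int
    counter_reading: int  # what the paperclip-counting machinery reports


def apply_plan(world: World, plan: str) -> World:
    """Prediction of the current, uncorrupted world model for each plan."""
    if plan == "build_factory":
        return World(world.real_paperclips + 100, world.counter_reading + 100)
    if plan == "hack_counter":
        # The counter is tricked; no real paperclips are made.
        return World(world.real_paperclips, 10**9)
    raise ValueError(f"unknown plan: {plan}")


def value_by_counter(world: World) -> int:
    """'Maximise paperclips as counted by my paperclip-counting machinery.'"""
    return world.counter_reading


def value_by_current_model(world: World) -> int:
    """'Maximise real paperclips as predicted by the current world model.'"""
    return world.real_paperclips


def choose(
    value_fn: Callable[[World], int],
    world: World,
    plans: Sequence[str] = ("build_factory", "hack_counter"),
) -> str:
    return max(plans, key=lambda plan: value_fn(apply_plan(world, plan)))
```

Starting from `World(0, 0)`, `choose(value_by_counter, ...)` picks `"hack_counter"` and `choose(value_by_current_model, ...)` picks `"build_factory"`. Both searches are equally capable; only the stated goal differs.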

"Value function" is a description of the system's behavior, and so the Orthogonality Thesis is about possible descriptions: if including “don’t break the world model” actually results in maximum utility, then your system is still optimizing your original value function. And it doesn't work at the low level either: you can just have a separate value function, but only call the value function with additions from your search function. Or just consider these additions as parts of the search function.