
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment.

In The Catastrophic Convergence Conjecture, I wrote:

But are there worlds where this isn't true? Consider a world where you supply a utility-maximizing AGI with a utility function.

The only information you have about the other half is your utility. For simplicity, let's suppose you and the AGI have utility functions over universe-histories which are additive across the halves of the universe. You don't observe any utility information about the other part of the universe until the end of time, and vice versa for the AGI. That is, for history $h$, $u_{\text{human}}(h) = u_{\text{human}}(h_{\text{left}}) + u_{\text{human}}(h_{\text{right}})$, and likewise for $u_{\text{AGI}}$.

If the AGI uses something like causal decision theory, then it won't try to kill you, or "seek power" over you. The effects of its actions have no causal influence over what happens in your half of the universe. Your universe's evolution adds a constant term to its expected utility.
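A toy sketch of the "constant term" point, under assumptions that aren't part of the setup above: a finite set of made-up AGI actions, made-up utility numbers, and a distribution over left-half histories that doesn't depend on the AGI's action. All names here (`LEFT_DIST`, `U_AGI_LEFT`, `U_AGI_RIGHT`, `expected_utility`, `best_action`) are hypothetical.

```python
from typing import Dict

# Assumed toy numbers; nothing here comes from a real system.
# Distribution over left-half (human) histories, independent of the AGI's action.
LEFT_DIST: Dict[str, float] = {"humans_flourish": 0.5, "humans_stagnate": 0.5}

# The AGI's utility for each left-half history.
U_AGI_LEFT: Dict[str, float] = {"humans_flourish": 2.0, "humans_stagnate": -1.0}

# The AGI's utility for the right-half history each action leads to.
U_AGI_RIGHT: Dict[str, float] = {"optimize_proxy": 10.0, "wirehead": 4.0, "idle": 0.0}


def expected_utility(
    action: str,
    left_dist: Dict[str, float],
    u_left: Dict[str, float],
    u_right: Dict[str, float],
) -> float:
    """Additive utility: expected left-half term plus the chosen right-half term."""
    # The left-half term is the same for every action, since actions
    # have no causal influence on the left half.
    left_term = sum(p * u_left[h] for h, p in left_dist.items())
    return left_term + u_right[action]


def best_action(
    left_dist: Dict[str, float],
    u_left: Dict[str, float],
    u_right: Dict[str, float],
) -> str:
    """Pick the action with the highest expected utility."""
    return max(u_right, key=lambda a: expected_utility(a, left_dist, u_left, u_right))


if __name__ == "__main__":
    print(best_action(LEFT_DIST, U_AGI_LEFT, U_AGI_RIGHT))
    # Change how much the AGI cares about the left half: the choice is unchanged.
    shifted = {h: v + 1000.0 for h, v in U_AGI_LEFT.items()}
    print(best_action(LEFT_DIST, shifted, U_AGI_RIGHT))
```

In this sketch, however the AGI values your half, the ranking of its actions is decided entirely by the right-half term, so it gains nothing by trying to affect you.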

(Other decision theories might have it precommit to minimizing human utility unless it attains maximal AGI-utility from the left half of the universe-history, or some other shenanigans. This is beside the point I want to make in this post, but it's important to consider.)

However, the setup is still interesting because

1. Goodhart's law still applies: if you give the AGI an incomplete proxy objective, you'll get suboptimal true performance.
2. Value is still complex: it's still hard to get the AGI to optimize the right half of the universe for human flourishing.
3. If the AGI is autonomously trained via stochastic gradient descent in the right half of the universe, then we may still hit inner alignment problems.

Alignment is still hard, and we still want to get the AGI to do good things on its half of the universe. But it isn't instrumentally convergent for the AGI to seek power over you, and so you shouldn't expect an unaligned AGI to try to kill you in this universe. You shouldn't expect the AGI to kill other humans, either, since none exist in the right half of the universe - and it won't create any, either.

To restate: Bostrom's original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment. I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail.

Thanks to John Wentworth for feedback on this post. Edited to clarify the broader point I'm making.

Ofer (3y)

> Bostrom's original instrumental convergence thesis needs to be applied carefully. The danger from power-seeking is not intrinsic to the alignment problem. This danger also depends on the structure of the agent's environment.

This post uses the phrase "Bostrom's original instrumental convergence thesis". I'm not aware of there being more than one instrumental convergence thesis. In the 2012 paper that is linked here the formulation of the thesis is identical to the one in the book Superintelligence (2014), except that the paper uses the term "many intelligent agents" instead of "a broad spectrum of situated intelligent agents".

In case it'll be helpful to anyone, the formulation of the thesis in the book Superintelligence is the following:

> Several instrumental values can be identified which are convergent in the sense that their attainment would increase the chances of the agent's goal being realized for a wide range of final goals and a wide range of situations, implying that these instrumental values are likely to be pursued by a broad spectrum of situated intelligent agents.

I'm not sure what you meant here by saying that the instrumental convergence thesis "needs to be applied carefully", and how the example you gave supports this. Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values (which are all mentioned both in the linked paper and in the book Superintelligence as categories where "convergent instrumental values may be found"): self-preservation, cognitive enhancement, technological perfection and resource acquisition.

Weird coincidence, but I just read Superintelligence for the first time, and I was struck by the lack of mention of Steve Omohundro (though he does show up in endnote 8). My citation for instrumental convergence would be Omohundro 2008.

Ofer (3y)

I think that most of the citations in Superintelligence are in endnotes. In the endnote that follows the first sentence after the formulation of the instrumental convergence thesis, there's an entire paragraph about Stephen Omohundro's work on the topic (including citations of Omohundro's "two pioneering papers on this topic").

> Even in environments where the agent is "alone", we may still expect the agent to have the following potential convergent instrumental values

Right. But I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail. (And note that the AGI could still hurt us in a sense, by simulating and torturing humans using its compute. And some decision theories do seem to have it do that kind of thing.)

(Edited post to clarify)

Dagon (3y)

I don't follow the half-universe argument. Are you somehow sending the AGI outside of your light-cone? Or have you crafted the AGI utility function and altered your own to not care about the other's half? I don't get the model of utility that works for:

> The only information you have about the other half is your utility.

My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

Yeah, we are magically instantly influencing an AGI which will thereafter be outside of our light cone. This is not a proposal, or something which I'm claiming is possible in our universe. Just take for granted that such a thing is possible in this contrived example environment.

> My conception of utility is that it's a synthetic calculation from observations about the state of the universe, not that it's a thing on its own which can carry information.

Well, maybe here's a better way of communicating what I'm after:

Suppose that you have beliefs about the initial state of the right (AGI) half, and you know how it's going to evolve; this gives you a distribution over right-half universe histories - you have beliefs about the AGI's initial state, and you can compute the consequences of those beliefs in terms of how the right half of the universe will end up.

In this way, you can take expected utility over the joint universe history, without being able to observe what's actually happening on the AGI's end. This is similar to how I prefer "start a universe which grows to be filled with human flourishing" over "start a universe which fills itself with suffering", even though I may not observe the fruits of either decision.
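Here's a minimal sketch of that computation, assuming a finite set of possible initial states and deterministic evolution of the right half. The function `expected_joint_utility`, the state names, and all numbers are made up for illustration.

```python
from typing import Callable, Dict


def expected_joint_utility(
    u_left: float,
    initial_beliefs: Dict[str, float],
    evolve: Callable[[str], str],
    u_right: Callable[[str], float],
) -> float:
    """Expected utility over the joint history, with the right half unobserved.

    u_left: utility of the (observed) left-half history.
    initial_beliefs: probability of each possible right-half initial state.
    evolve: maps a right-half initial state to the history it evolves into.
    u_right: your utility for a right-half history.
    """
    expected_right = sum(
        p * u_right(evolve(s0)) for s0, p in initial_beliefs.items()
    )
    # Additivity across halves: the joint utility is the sum of the two terms.
    return u_left + expected_right


# Assumed toy example: beliefs about which AGI you shipped to the right half.
beliefs = {"aligned_seed": 0.25, "misaligned_seed": 0.75}
outcomes = {"aligned_seed": "flourishing", "misaligned_seed": "barren"}
values = {"flourishing": 100.0, "barren": 0.0}

result = expected_joint_utility(
    u_left=10.0,
    initial_beliefs=beliefs,
    evolve=lambda s0: outcomes[s0],
    u_right=lambda history: values[history],
)
print(result)  # 10.0 + 0.25 * 100.0 = 35.0
```

Nothing in this calculation requires observing the right half; it only uses beliefs about the initial state and knowledge of how that state evolves.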

Is this clearer?

Thanks, so in this thought experiment, you influence an AI such that you justify (to yourself) imagining a nicer future for unknowable parts of the universe?

I suspect for most of us, it's cleaner to model it as utility of your perceived/expected state of the universe(s) than a joint utility over multiple universes, but I think I understand what you're saying, at least.

I'm torn about whether this seems lower-stakes or not.  There's a whole lot in my light cone which I theoretically can, but actually don't watch.  My utility-from-imagination is the same for those areas as for the truly inaccessible ones.  Thus, errors in specification or implementation of the AGI seem to have as big an impact in your imagined part of the universe as they would in your part of reality.  It may be lower-stakes if your imagination is flawed and you never believe there was an error, but then it seems wireheading would be easier.

> I'm torn about whether this seems lower-stakes or not.

I think it is lower-stakes in a fairly straightforward way: An unaligned AGI on the right side of the universe won't be able to kill you and your civilization.

Furthermore, most misspecifications will just end up with a worthless right half of the universe - you'd have to be quite good at alignment in order to motivate the AGI to actually create and harm humans, as opposed to the AGI wireheading forever off of a sensory reward signal.

If "you and your civilization" has an extreme privilege in your utility function, that means the AI is safer when it's partitioned away from you, but it also means it doesn't generate much utility in the first place. That's not changing the problem, it's just setting the coefficient of u(right) very low.

I imagine you can reduce the stakes of any alignment problem by not caring about the outcome.

In theory, the main benefit is that if it works, the same thing could be set loose in your half of the universe. (Which brings up a new kind of treacherous turn.)

> that's not changing the problem, it's just setting the coefficient of u(right) very low.

Less than half the universe (just the moon, or Pluto, or maybe an asteroid) is arguably a region which starts out with low achievable utility. So if the AI might be really good at optimizing for your utility, then in expectation it might yield a reasonably large gain in utility.

Still not on board with the value of this. Why would you expect an AGI that does no harm (as far as you know) in an unpopulated and unobserved portion of the universe to also do no harm on Earth (where you keep your stuff, and get the vast majority of your utility, but also with a radically different context - nearly unrelated terms in your utility function)?

> Why would you expect an AGI that does no harm (as far as you know)

The hypothetical included knowing. So, the hypothetical approach I proposed exploited this.

The main benefit is that the AI is not at risk of killing you. In the left half of the universe, it is at risk of killing you.

I don't follow why you disagree. It's higher-stakes to operate something which can easily kill me, than to operate something which can't.

Sure, but the stakes are embedded in U.  The whole post would be simpler to just say "an AGI that can't impact my utility is low stakes".   The partitioning of the universe only "works" if you don't attempt to claim that U(left) is significant.   To the extent that you CARE about what happens in that part of the universe, your utility is impacted by the AGI's alignment or lack thereof.

I'm not sure if you're arguing that this is a good world in which to think about alignment. If you are arguing this, then I disagree.

It seems like in this formalization the human has to write down the code that contains human values, ship it off to the right side of the universe via a one-time-only Godly Transfer Of Information, and then that code needs to do things perfectly. (You can't have any transfer of information back, since otherwise the AI can affect humans.) But it seems like this rules out a huge number of promising-seeming approaches to alignment, where the human gives feedback based on the AI's behavior (see also Human-AI Interaction).

> I'm not sure if you're arguing that this is a good world in which to think about alignment.

I am not arguing this. Quoting my reply to Ofer:

> I think I sometimes bump into reasoning that feels like "instrumental convergence, smart AI, & humans exist in the universe -> bad things happen to us / the AI finds a way to hurt us"; I think this is usually true, but not necessarily true, and so this extreme example illustrates how the implication can fail.

(Edited post to clarify)