Clarifying AI X-risk

Rohin Shah; David Lindner; Vikrant Varma; Vika; Mary Phuong; Ramana Kumar; Elliot Catt

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

To me, Christiano's Get What You Measure scenario looks much more plausible a priori to be "what happens by default". For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions? Why additionally assume that we need consequentialist reasoning, or that power-seeking has to kick in and incentivize deception over and above whatever competing incentives might be present? Why assume all that, when RLHF already selects for actions which deceive humans in practice even in the absence of consequentialism?

Or, another angle: the diagram in this post starts from "specification gaming" and "goal misgeneralization". If we just start from prototypical examples of those failure modes, don't assume anything more than that, and ask what kind of AGI failure the most prototypical versions of those failure modes lead to... it seems to me that they lead to Getting What You Measure. This story about consequentialism and power-seeking has a bunch of extra pieces in it, which aren't particularly necessary for an AGI disaster.

(To be clear, I'm not saying the consequentialist power-seeking deception story is implausible; it's certainly plausible enough that I wouldn't want to build an AGI without being pretty damn certain that it won't happen! Nor am I saying that I buy all the aspects of Get What You Measure - in particular, I definitely expect a much foomier future than Paul does, and I do in fact expect consequentialism to be convergent. The key thing I'm pointing to here is that the consequentialist power-seeking deception story has a bunch of extra assumptions in it, and we still get a disaster with those assumptions relaxed, so naively it seems like we should assign more probability to a story with fewer assumptions.)

[-]Rohin Shah3yΩ71011

(Speaking just for myself in this comment, not the other authors)

I still feel like the comments on your post are pretty relevant, but to summarize my current position:

AIs that actively think about deceiving us (e.g. to escape human oversight of the compute cluster they are running on) come well before (in capability ordering, not necessarily calendar time) AIs that are free enough from human-imposed constraints and powerful enough in their effects on the world that they can wipe out humanity + achieve their goals without thinking about how to deal with humans.
In situations where there is some meaningful human-imposed constraint (e.g. the AI starts out running on a data center that humans can turn off), if you don't think about deceiving humans at all, you choose plans that ask humans to help you with your undesirable goals, causing them to stop you. So, in these situations, x-risk stories require deception.
It seems kinda unlikely that even the AI free from human-imposed constraints like off switches doesn't think about humans at all. For example, it probably needs to think about other AI systems that might oppose it, including the possibility that humans build such other AI systems (which is best intervened on by ensuring the humans don't build those AI systems).

Responding to this in particular:

The key thing I'm pointing to here is that the consequentialist power-seeking deception story has a bunch of extra assumptions in it, and we still get a disaster with those assumptions relaxed, so naively it seems like we should assign more probability to a story with fewer assumptions.

The least conjunctive story for doom is "doom happens". Obviously this is not very useful. We need more details in order to find solutions. When adding an additional concrete detail, you generally want that detail to (a) capture lots of probability mass and (b) provide some angle of attack for solutions.

For (a): based on the points above I'd guess maybe 20:1 odds on "x-risk via misalignment with explicit deception" : "x-risk via misalignment without explicit deception" in our actual world. (Obviously "x-risk via misalignment" is going to be the sum of these and so higher than each one individually.)

For (b): the "explicit deception" detail is particularly useful to get an angle of attack on the problem. It allows us to assume that the AI "knows" that the thing it is doing is not what its designers intended, which suggests that what we need to do to avoid this class of scenarios is to find some way of getting that knowledge out of the AI system (rather than, say, solving all of human values and imbuing it into the AI).

One response is "but even if you solve the explicit deception case, then you just get x-risk via misalignment without explicit deception, so you didn't actually save any worlds". My response would be that P(x-risk via misalignment without explicit deception | no x-risk via misalignment with explicit deception) seems pretty low to me. But that seems like the main way someone could change my mind here.

[-]johnswentworth3yΩ7145

Two probable cruxes here...

First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it. (Note: I'm using "intelligence" here to point to something including ability to "actually try" as opposed to symbolically "try", effective mental habits, etc, not just IQ.) If copying is sufficiently cheap relative to building, I wouldn't be surprised if something within the human distribution would suffice.

Central intuition driver there: imagine the difference in effectiveness between someone who responds to a law they don't like by organizing a small protest at their university, vs someone who responds to a law they don't like by figuring out which exact bureaucrat is responsible for implementing that law and making a case directly to that person, or by finding some relevant case law and setting up a lawsuit to limit the disliked law. (That's not even my mental picture of -2sd vs +3sd; I'd think that's more like +1sd vs +3sd. A -2sd usually just reposts a few memes complaining about the law on social media, if they manage to do anything at all.) Now imagine an intelligence which is as much more effective than the "find the right bureaucrat/case law" person, as the "find the right bureaucrat/case law" person is compared to the "protest" person.

Second probable crux: there's two importantly-different notions of "thinking about humans" or "thinking about deceiving humans" here. In the prototypical picture of a misaligned mesaoptimizer deceiving humans for strategic reasons, the AI explicitly backchains from its goal, concludes that humans will shut it down if it doesn't hide its intentions, and therefore explicitly acts to conceal its true intentions. But when the training process contains direct selection pressure for deception (as in RLHF), we should expect to see a different phenomenon: an intelligence with hard-coded, not-necessarily-"explicit" habits/drives/etc which de-facto deceive humans. For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That's the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).

Is that "explicit deception"? I dunno, it seems like "explicit deception" is drawing the wrong boundary. But when that sort of deception happens, I wouldn't necessarily expect to be able to see deception in an AI's internal thoughts. It's not that it's "thinking about deceiving humans", so much as "thinking in ways which are selected for deceiving humans".

(Note that this is a different picture from the post you linked; I consider this picture more probable to be a problem sooner, though both are possibilities I keep in mind.)

[-]Rohin Shah3yΩ453

First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de-facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a -2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it.

This sounds roughly right to me, but I don't see why this matters to our disagreement?

For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That's the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even moreso if we pass a gradient back through deception-detecting interpretability tools).

This also sounds plausible to me (though it isn't clear to me how exactly doom happens). For me the relevant question is "could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge", and I think the answer is still yes.

I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".

[-]johnswentworth3yΩ450

I maybe want to stop saying "explicitly thinking about it" (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that "conscious thoughts" have deception in them) and instead say that "the AI system at some point computes some form of 'reason' that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action".

I don't quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.

But unless that distinction is central to what you're trying to point to here, I think I basically agree with what you're gesturing at.

[-]Rohin Shah3yΩ550

But unless that distinction is central to what you're trying to point to here

Yeah, I don't think it's central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).

[-]Xodarap3y30

think about how humans most often deceive other humans: we do it mainly by deceiving ourselves... when that sort of deception happens, I wouldn't necessarily expect to be able to see deception in an AI's internal thoughts

The fact that humans will give different predictions when forced to make an explicit bet versus just casually talking seems to imply that it's theoretically possible to identify deception, even in cases of self-deception.

[-]johnswentworth3y30

Of course! I don't intend to claim that it's impossible-in-principle to detect this sort of thing. But if we're expecting "thinking in ways which are selected for deceiving humans", then we need to look for different (and I'd expect more general) things than if we're just looking for "thinking about deceiving humans".

(Though, to be clear, it does not seem like any current prosaic alignment work is on track to do either of those things.)

[-]Noosphere893y10

This. Without really high levels of capabilities afforded by quantum/reversible computers which is an additional assumption, you can't really win without explicitly modeling humans and deceiving them.

[-]tom4everitt3yΩ340

For instance: why expect that we need a multi-step story about consequentialism and power-seeking in order to deceive humans, when RLHF already directly selects for deceptive actions?

Is deception alone enough for x-risk? If we have a large language model that really wants to deceive any human it interacts with, then a number of humans will be deceived. But it seems like the danger stops there. Since the agent lacks intent to take over the world or similar, it won't be systematically deceiving humans to pursue some particular agenda of the agent.

As I understand it, this is why we need the extra assumption that the agent is also a misaligned power-seeker.

[-]johnswentworth3yΩ361

For that part, the weaker assumption I usually use is that AI will end up making lots of big and fast (relative to our ability to meaningfully react) changes to the world, running lots of large real-world systems, etc, simply because it's economically profitable to build AI which does those things. (That's kinda the point of AI, after all.)

In a world where most stuff is run by AI (because it's economically profitable to do so), and there's RLHF-style direct incentives for those AIs to deceive humans... well, that's the starting point to the Getting What You Measure scenario.

Insofar as power-seeking incentives enter the picture, it seems to me like the "minimal assumptions" entry point is not consequentialist reasoning within the AI, but rather economic selection pressures. If we're using lots of AIs to do economically-profitable things, well, AIs which deceive us in power-seeking ways (whether "intentional" or not) will tend to make more profit, and therefore there will be selection pressure for those AIs in the same way that there's selection pressure for profitable companies. Dial up the capabilities and widespread AI use, and that again looks like Getting What We Measure. (Related: the distinction here is basically the AI version of the distinction made in Unconscious Economics.)

[-]tom4everitt3yΩ330

This makes sense, thanks for explaining. So a threat model with specification gaming as its only technical cause, can cause x-risk under the right (i.e. wrong) societal conditions.

[-]Koen.Holtman3yΩ110

I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.

Me too, but note how the analysis leading to the conclusion above is very open about excluding a huge number of failure modes leading to x-risk from consideration first:

[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to use AI.

In this context, I of course have to observe that any human decision, any decision to deploy an AGI agent that uses purely consequentialist planning towards maximising a simple metric, would be a very poor human decision to make indeed. But there are plenty of other poor decisions too that we need to worry about.

[-]Leon Lang3yΩ7137

To classify as specification gaming, there needs to be bad feedback provided on the actual training data. There are many ways to operationalize good/bad feedback. The choice we make here is that the training data feedback is good if it rewards exactly those outputs that would be chosen by a competent, well-motivated AI.

I assume you would agree with the following rephrasing of your last sentence:

The training data feedback is good if it rewards outputs if and only if they might be chosen by a competent, well-motivated AI.

If so, I would appreciate it if you could clarify why achieving good training data feedback is even possible: the system that gives feedback necessarily looks at the world through observations that conceal large parts of the state of the universe. For every observation that is consistent with the actions of a competent, well-motivated AI, the underlying state of the world might actually be catastrophic from the point of view of our "intentions". E.g., observations can be faked, or the universe can be arbitrarily altered outside of the range of view of the feedback system.

If you agree with this, then you probably assume that there are some limits to the physical capabilities of the AI, such that it is possible to have a feedback mechanism that cannot be effectively gamed. Possibly when the AI becomes more powerful, the feedback mechanism would in turn need to become more powerful to ensure that its observations "track reality" in the relevant way.

Does there exist a write-up of the meaning of specification gaming and/or outer alignment that takes into account that this notion is always "relative" to the AI's physical capabilities?

[-]Charlie Steiner3y72

Yeah, I think this comment is basically right. On nontrivial real-world training data, there are always going to be both good and bad ways to interpret it. At some point you need to argue from inductive biases, and those depend on the AI that's doing the learning, not just the data.

I think the real distinction between their categories is something like:

Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that they don't notice at the time, but that seems obvious to us humans abstractly reasoning about the hypothetical.

Misgeneralization: On the training distribution, the AI doesn't take object-level actions that humans think are bad. But then, later, it does.

[-]Leon Lang3y10

Specification gaming: Even on the training distribution, the AI is taking object-level actions that humans think are bad. This can include bad stuff that they don't notice at the time, but that seems obvious to us humans abstractly reasoning about the hypothetical.

Do you mean "the AI is taking object-level actions that humans think are bad while achieving high reward"?

If so, I don't see how this solves the problem. I still claim that every reward function can be gamed in principle, absent assumptions about the AI in question.

[-]Charlie Steiner3y30

Sure, something like that.

I agree it doesn't solve the problem if you don't use information / assumptions about the AI in question.

[-]Rohin Shah3yΩ330

I'm confused about what you're trying to say in this comment. Are you saying "good feedback as defined here does not solve alignment"? If so, I agree, that's the entire point of goal misgeneralization (see also footnote 1).

Perhaps you are saying that in some situations a competent, well-motivated AI would choose some action it thinks is good, but is actually bad, because e.g. its observations were faked in order to trick it? If so, I agree, and I see that as a feature of the definition, not a bug (and I'm not sure why you think it is a bug).

[-]Leon Lang3yΩ230

Neither of your interpretations is what I was trying to say. It seems like I expressed myself not well enough.

What I was trying to say is that I think outer alignment itself, as defined by you (and maybe also everyone else), is a priori impossible since no physically realizable reward function that is defined solely based on observations rewards only actions that would be chosen by a competent, well-motivated AI. It always also rewards actions that lead to corrupted observations that are consistent with the actions of a benevolent AI. These rewarded actions may come from a misaligned AI.

However, I notice people seem to use the terms of outer and inner alignment a lot, and quite some people seem to try to solve alignment by solving outer and inner alignment separately. Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere.

[-]Rohin Shah3yΩ560

Oh, I see. I'm not interested in "solving outer alignment" if that means "creating a real-world physical process that outputs numbers that reward good things and punish bad things in all possible situations" (because as you point out it seems far too stringent a requirement).

Then I was wondering if they use a more refined notion of what outer alignment means, possibly by taking into account the physical capabilities of the agent, and I was trying to ask if something like that has already been written down anywhere.

You could look at ascription universality and ELK. The general mindset is roughly "ensure your reward signal captures everything that the agent knows"; I think the mindset is well captured in mundane solutions to exotic problems.

[-]Leon Lang3yΩ110

Thanks a lot for these pointers!

[-]Vika2y*Ω7120Review for 2022 Review

I continue to endorse this categorization of threat models and the consensus threat model. I often refer people to this post and use the "SG + GMG → MAPS" framing in my alignment overview talks. I remain uncertain about the likelihood of the deceptive alignment part of the threat model (in particular the requisite level of goal-directedness) arising in the LLM paradigm, relative to other mechanisms for AI risk.

In terms of adding new threat models to the categorization, the main one that comes to mind is Deep Deceptiveness (let's call it Soares2), which I would summarize as "non-deceptiveness is anti-natural / hard to disentangle from general capabilities". I would probably put this under "SG MAPS", assuming an irreducible kind of specification gaming where it's very difficult (or impossible) to distinguish deceptiveness from non-deceptiveness (including through feedback on the model's reasoning process). Though it could also be GMG, where the "non-deceptiveness" concept is incoherent and thus very difficult to generalize well.

[-]Jemal Young3y20

Not many more fundamental innovations needed for AGI.

Can you say more about this? Does the DeepMind AGI safety team have ideas about what's blocking AGI that could be addressed by not many more fundamental innovations?

[-]Rohin Shah3y42

If we did have such ideas we would not be likely to write about them publicly.

(That being said, I roughly believe "if you keep scaling things with some engineering work to make sure everything still works, the models will keep getting better, and this would eventually get you to transformative AI if you can keep the scaling going".)

^{^}

In this definition, whether the feedback is good/bad does not depend on the reasoning used by the AI system, so e.g. rewarding an action that was chosen by a misaligned AI system that is trying to hide its misaligned intentions would still count as good feedback under this definition.

^{^}

There are other possible formulations of misaligned, for example the system’s goal may not match what its users want it to do.

		Source of misalignment
		Specification gaming (SG)	SG + GMG	Goal mis-generalization (GMG)
Path to X-risk	Misaligned power seeking (MAPS)	Cohen et al	Carlsmith, Christiano2, Cotra, Ngo, Shah	Soares, Hubinger
Path to X-risk	Interaction of multiple systems	Critch, Christiano1	?	?

LESSWRONG
is fundraising!
LW

LESSWRONG
is fundraising!
LW

127

Clarifying AI X-risk

127

Ω 45

127

Ω 45

Categorization

Consensus Threat Model

Takeaway