Incentives and Selection: A Missing Frame From AI Threat Discussions?

DragonGod

26th Feb 2023

Epistemic Status

Written quickly, originally as a Twitter thread.

Thesis

I think a missing frame from AI threat discussions is incentives (especially economic ones) and selection (the pressures exerted on a system during its development).

I hear a lot of AI threat arguments of the form: "AI can do X/be Y" with IMO insufficient justification that:

It would be (economically) profitable for AI to do X
The default environments/training setups select for systems that are Y

That is, such arguments establish that something can happen, but do not convincingly argue that it is likely to happen (or that the chances of it happening are sufficiently high). I think it's an undesirable epistemic status quo.

Examples

1. Discrete Extinction Events

There are many speculations about AI systems precipitating extinction in a discrete event^[1].

I do not understand under what scenarios triggering a nuclear holocaust, massive genocide via robot armies, or similar would be profitable for the AI to do.

It sounds to me like just setting fire to a fuckton of utility.

In general, triggering civilisational collapse seems like something that would just be robustly unprofitable for an AI system to pursue^[2].

As such, I don't expect misaligned systems to pursue such goals (as long as they don't terminally value human suffering/harm to humans/are otherwise malevolent).

2. Deceptive Alignment

Consider also deceptive alignment.

I understand what deceptive alignment is, how deception can manifest and why sufficiently sophisticated misaligned systems are incentivised to be deceptive.

I do not see why training would actually select for deception, though^[3].

Deceptive alignment seems to require a peculiar combination of situational awareness/cognitive sophistication that complicates my intuitions around it.

Unlike with many other mechanisms/concepts, we don't have a clear proof of concept, not even with humans and evolution.

Humans did not develop the prerequisite situational awareness/cognitive sophistication to even grasp evolution's goals until long after they had moved off the training distribution (ancestral environment) and undergone considerable capability amplification.

Insomuch as humans are misaligned with evolution's training objective, our failure is one of goal misgeneralisation, not of deceptive alignment.

I don't understand well how values ("contextual influences on decision making") form in intelligent systems under optimisation pressure.

And the peculiar combination of situational awareness/cognitive sophistication and value malleability required for deceptive alignment is something I don't intuit.

A deceptive system must have learned the intended objective of the outer optimisation process, internalised values that are misaligned with said objective, and be sufficiently situationally aware to realise it's an intelligent system under optimisation and currently in training...

It must then reflect on all of this, counterfactually consider how its behaviour during training would affect the selection pressure the outer optimisation process applies to its values, care about its values across "episodes", etc.

And I feel like there are a lot of unknowns here. And the prerequisites seem considerable? Highly non-trivial in a way that e.g. reward misspecification or goal misgeneralisation are not.

Like I'm not sure this is a thing that necessarily ever happens. Or happens by default? (The way goal misgeneralisation/reward misspecification happen by default.)

I'd really appreciate an intuitive story of how training might select for deceptive alignment.

E.g. a story of RLHF/RLAIF on a pretrained LLM (LLMs seem to be the most situationally aware AI systems) selecting for deceptive alignment.