How likely are malign priors over objectives? [aborted WIP]

David Johnston

This was written as part of an entry to the FTX AI competition, but I presume that's off now and I can't justify spending more time on it. Neither my thinking nor my writing is polished here. I believe some of the arguments are original.

In many (though not all) AI catastrophe theories, a critical hypothesis is that of a malign prior over objectives. Concretely:

Suppose that an advanced AI can be well-described by an agent with a competent model of the world pursuing some objective
Suppose we can analyse which goal is likely to be pursued by considering a prior over objectives updated based on the score that the policy induced by this objective attains in some training regime
This prior is malign if deception + disaster-causing are instrumentally convergent behaviour for some collection of objectives with high total probability
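To make the setup above concrete, here is a minimal sketch of one way a prior over objectives could be updated on training score. The exponential weighting rule, the `beta` parameter, the objective names and all the numbers are my own illustrative assumptions, not something the hypothesis itself specifies.

```python
import math

def posterior_over_objectives(prior, scores, beta=1.0):
    """Reweight a prior over objectives by training score.

    Assumption: weight is prior * exp(beta * score). This is one
    convenient update rule chosen for illustration only.
    """
    weights = {name: p * math.exp(beta * scores[name]) for name, p in prior.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def malign_mass(distribution, malign):
    """Total probability assigned to objectives labelled malign."""
    return sum(p for name, p in distribution.items() if name in malign)

# Hypothetical objectives, prior weights and training scores.
prior = {"intended": 0.2, "proxy_a": 0.3, "proxy_b": 0.5}
scores = {"intended": 1.0, "proxy_a": 1.0, "proxy_b": 0.0}
malign = {"proxy_a", "proxy_b"}

post = posterior_over_objectives(prior, scores)
print(malign_mass(prior, malign), malign_mass(post, malign))
```

In this toy, training removes weight from `proxy_b`, which scores badly, but cannot separate `intended` from `proxy_a`, which score identically. Their relative weight is left entirely to the prior, which is why the prior matters so much to the argument.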

Key arguments that have been made in favour include:

1. Humans are generally intelligent but not aligned with "what evolution wants"
2. Malign objectives might be algorithmically simple
3. Malign objectives are numerous
4. Extrapolating objectives is in some sense "natural" or "simple"
5. Malign objectives might not cause many problems prior to strongly superhuman AI
6. Malign objectives will get higher reward than benign ones

I don't consider these arguments conclusive.

Different optimisation algorithms should be expected to give different results, and there are good reasons to expect that reinforcement learning is not the only control paradigm for advanced AI
- Empirical: self-supervised learning
- Theoretical: either your application has precedent - in which case you can regularise - or it doesn't, in which case reinforcement learning doesn't seem very attractive
- it feels almost silly to argue, but there’s no sensible way to discuss what the spirit of evolution “wants” unless you can say what alternatives it considered, and if you go down this path there seem to be some alternatives that would get you better generalisation to diverse environments
Optimality results for complexity-regularised RL don't obviously align with what people want from AI, nor is it obvious how practically significant they are. I'm not aware of examples of code complexity successfully predicting ML behaviour
Naive counting arguments imply that generalisation should be impossible for ML, but this is not so - it would not be especially surprising if naive counting arguments did not apply here either
- This argument seems to require that the number of viable alternative objectives be very small prior to AI becoming substantially superhuman, or else a general solution would already be needed
I don't buy that the relevant authors' judgements of naturalness are particularly predictive of advanced AI behaviour
Malign objectives are likely to cause many more problems at superhuman levels, but this doesn't mean they won't already cause many problems at subhuman levels
- In particular, I expect (moderate confidence) that ordinary Goodharting will be a serious enough problem for AI with large action space and complex objectives that general solutions to malign objectives will be needed
Regular ML routinely trades off training set loss for generalisation performance
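The naive counting argument mentioned in the list above can be spelled out with a small brute-force sketch. It enumerates every Boolean function on 3 input bits, keeps those that fit a handful of training labels, and checks how they vote on a held-out input. The task (parity), the training set and the function names are hypothetical choices for illustration.

```python
from itertools import product

def consistent_functions(n_bits, train):
    """Yield every truth table on n_bits inputs that fits the training labels."""
    inputs = list(product([0, 1], repeat=n_bits))
    for outputs in product([0, 1], repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        if all(table[x] == y for x, y in train.items()):
            yield table

# Hypothetical training set: 5 of the 8 inputs, labelled by parity.
train = {
    (0, 0, 0): 0,
    (0, 0, 1): 1,
    (0, 1, 0): 1,
    (0, 1, 1): 0,
    (1, 0, 0): 1,
}
held_out = (1, 1, 1)

fits = list(consistent_functions(3, train))
votes_for_one = sum(table[held_out] for table in fits)
print(len(fits), votes_for_one)
```

Counting every consistent function equally, the training data says nothing about the held-out input: the functions that fit split evenly on it. Real learners generalise anyway because they do not weight all consistent functions equally, which is the reason to be wary of applying the same style of counting to objectives.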

I think there may be a quite general mechanism by which self-supervised learning (SSL) + fine-tuning is controllable.

I think that even if powerful AI is not malign by default, in the long run there will still be incentives to push the envelope with respect to safe control. Very roughly, insofar as development of AI technology is not safety-limited, I expect it to continue to be developed until it is safety-limited.

Literature survey

I surveyed five documents:

Yudkowsky's "AGI ruin" because it is widely known
Cotra's "Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover", also because it is widely known
Carlsmith's "Is Power-Seeking AI an Existential Risk?" because it is widely known and regarded as comprehensive
Hubinger's "How likely is deceptive alignment" because it goes into an unusual level of detail about this question
Ngo's "The Alignment problem from a Deep-Learning perspective" because the introduction made it look relevant

Some notes follow.

Appendix: literature review notes.

Yudkowsky AGI ruin

-3: orthogonality and instrumental convergence

I don't think this is intended to be a defense of malign priors, but just in case it is: orthogonality as stated behind the link is too weak an assumption to support the malign priors hypothesis.

Instrumental convergence is an important concept, and implies that many objectives are likely to be malign. We might be able to interpret this (perhaps as a stretch) as an instance of argument 4.

10, 11, 12: Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you

Predicting that AGI will do amazing things far OOD is based on general extrapolation of the "AI family of technologies", not based on any particular approach. Such a prediction is neutral on the question of whether or not such AGI will also be malign, which depends on assumptions about how exactly it achieves this.

13: Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability (+ 14)

Argument 5.

15: Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously: "the case of human intelligence. We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming" (+ 16)

Argument 1

17, 18, 19: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over

I think it is false to say there is no known alternative control scheme to loss functions, and it is not obvious to me that control needs to be analysed in terms of "particular inner properties", whatever that means.

20: malign objectives might be favoured by losses, not just a priori

Argument 6

21: Capabilities generalize further than alignment once capabilities start to generalize far. "When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints."

This seems confused - losses impact behaviour directly, epistemology indirectly. Elementary theory says that there are many behaviours that minimise loss, but only a few good ones. Insofar as there is an argument here, it seems to be something like "you get more data about predictive correctness than utility", but the importance of such a claim depends on the prior in the first place. More of an assertion of the malign prior hypothesis than a defense of it.

a relatively simple core structure that explains why complicated cognitive machines work; There is no analogous truth about there being a simple core of alignment

Argument 1

Cotra Without specific countermeasures, the easiest path to transformative AI likely leads to AI takeover

I think the absence of the malign prior assumption is notable.

The best way for Alex to maximize reward while under human control is to use its high situational awareness to deliberately appear safe and aligned at all times [...] because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively. [...] In both settings, being straightforwardly obedient and honest does not maximize reward -- strategies which are dishonest / disobedient in at least some ways outperform.

Argument 6

Carlsmith Power-seeking

But notably, we need ways of drawing on this feedback that don’t require unrealistic amounts of human supervision and human-generated data; we need to ensure that such feedback captures our preferences about behavior that we can’t directly understand and/or whose consequences we haven’t yet seen; we need ways of eliminating incentives to manipulate or mislead the human feedback mechanisms in question; and we need such methods to scale competitively as frontier AI capabilities increase (p. 23-24)

Not really an argument, but somewhat relevant.

Proxy goals correlated with the evaluation criteria may be simpler and therefore easier to learn, especially if the evaluation criteria are complex. (p. 25)

Argument 2

if the “true” objective function provides slower feedback, agents that pursue faster-feedback proxies have advantages (p. 25)

Not quite argument 2, but I'm bundling it in for now.

To the extent that many objectives would instrumentally incentivize good behavior in training (for example, because many objectives, when coupled with strategic awareness, incentivize gaining power in the world, and doing well in training leads to deployment/greater power in the world), but few involve intrinsic motivation to engage in such behavior, we might think it more likely that selecting for good behavior leads to agents who behave well for instrumental reasons.

Argument 3

Hubinger How likely is deceptive alignment

And so if we want to understand which one we're going to get, we have to look at the inductive biases of the training process. In any situation, if you're familiar with machine learning, where we want to understand which of multiple different possible models that are behaviorally indistinguishable, we will get, it's a question of inductive biases.

Agree on the importance of the malign priors hypothesis.

we don't understand machine learning (ML) inductive biases very well, they're actually really confusing. We just don’t have very much information about how they operate.

I agree that we don't understand inductive biases very well.

So first, I'm going to make the assumption that our training process is good enough that for our model to actually perform well, it has to really fully understand the thing that we're trying to get it to do.

No comment for now, seems worth highlighting

Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals—all it has to do is push them out, make them long term, and then everything else falls into place.

Argument 4

And so there’s just a huge number of Pascals because they can care about anything in the world at all. So the point is that there's more Pascals than there are the others, and so probably you’ll have to fix fewer bits to specify them in the space.

Argument 3

Ngo The Alignment problem from a Deep-Learning perspective

So in phase 3 we should expect that AGI behavior on novel tasks will primarily depend not on how we supervise them, but rather on how they generalize the goals they acquired earlier in training, which itself depends on the inductive biases of the training setup used.

Agree on the importance of the malign priors hypothesis.

We should expect AGIs to generalize goals to larger scales for the same reason that they’ll generalize capabilities to novel tasks: because they’ll learn high-level concepts which are not very domain-specific, and reason about how to achieve them.

Argument 4. This seems to equivocate on the word "generalisation": generalising capabilities is "doing the same thing in a different context", whereas generalising objectives is "doing a different thing".

Reasoning about how to achieve high-level goals generalizes very naturally to larger scales: for example, goals like “have more novel experiences”, “understand the world”, or “get high reward” don’t just apply within a specific time or place, but can be extrapolated to a nearly arbitrary extent.

Argument 4. That something can be arbitrarily extrapolated does not imply that it will be.

We could imagine AGIs instead generalizing to pursuing bounded versions of those goals, like “have more novel experiences, but not too many, and not too novel, and stopping after a certain time” — but I see little reason to expect generalization to stay within small-scale bounds as AGIs get smarter (especially given that many researchers will aim to build systems which generalize as far as possible).

No comment, just highlighting.

Underlying these arguments are implicit assumptions about simplicity. For instance, the arguments don’t apply to an astronomically large tabular policy which has memorized how to perform well on all tasks that take less than a year. However, since realistic AGIs will need to use simpler strategies like reasoning about outcomes, arguments like the instrumental convergence thesis can be informative about how they’ll generalize.

Argument 2

How common is Goodharting?

I am wondering how much of a problem Goodharting is likely to be. The basic idea: suppose we pursue an RL-heavy approach without any regularisation towards "sensible actions" in the following areas. Even without deception, how badly do they fall apart?
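As a rough intuition pump for why action space size matters here, the following toy simulation picks the action with the best proxy score and measures how far the proxy overstates the true value. Everything in it is an assumption made for illustration: the true values and proxy errors are independent standard Gaussians, and `mean_goodhart_gap` is a hypothetical helper, not a model of any real system.

```python
import random
import statistics

def mean_goodhart_gap(n_actions, n_trials=200, seed=0):
    """Average (proxy - true value) of the proxy-maximising action.

    Assumption: each action has a true value ~ N(0, 1) and the proxy
    equals the true value plus independent N(0, 1) measurement error.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_trials):
        best_proxy, best_true = float("-inf"), 0.0
        for _ in range(n_actions):
            true_value = rng.gauss(0.0, 1.0)
            proxy = true_value + rng.gauss(0.0, 1.0)
            if proxy > best_proxy:
                best_proxy, best_true = proxy, true_value
        gaps.append(best_proxy - best_true)
    return statistics.mean(gaps)

small = mean_goodhart_gap(10)    # small action space
large = mean_goodhart_gap(1000)  # large action space
print(small, large)
```

The gap grows with the number of actions because selecting on the proxy also selects for favourable error. Note the limits of this toy: with Gaussian errors the chosen action's true value still improves as the action space grows, so it only shows the metric becoming less trustworthy, not outcomes becoming bad. Whether real metrics fail more severely is the open question these notes are probing.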

Self-driving cars

Is RL w/out good regularisation going to produce bad outcomes?

very large action space (driving policies)
presumably lots of opportunities to improve naive metrics (anything in the bucket of "not following road rules/conventions")
presumably difficult to anticipate and/or catch all of the "creative" ways that a SD robot could game metrics
seems hard to make it work
also, no one is trying this in practice: we start with cars that do a pretty good job of driving

Scientific research

Is scientific research w/out good regularisation going to produce bad outcomes?

extremely large action space (stuff a researcher could do)
many documented problems with incentives
do these become obviously unmanageably bad with "sociopath AI"?
probably easy to get away with fraud if you take the time to do it properly
cannot just set reward to be "the truth"
ethics reviews etc. don't do a very comprehensive job of ensuring experiment safety - a lot depends on "using common sense"
seems hard to make it work

Medical diagnosis & treatment

moderately large action space (conditions and treatments are somewhat codified, but we can combine them)
many documented issues with incentives
do these become obviously unmanageably bad with "sociopath AI"?
- Time lags in observability (e.g. takes time to learn a drug is addictive)
Might be possible to manage with good reward structure + limited action space

Military

large action space and lots of incentive to exploit it
persistent mismatch b/t military goals and political ones in reality
"killer robots" is very imaginatively vivid, so people will probably worry even if they have only minor problems
but I expect major problems
seems hard to make it work

Manufacturing
