This was written as part of an entry to the FTX AI competition, but I presume that's off now and I can't justify spending more time on it. Neither my thinking nor my writing is polished here. I believe some of the arguments are original.
Many (though not all) AI catastrophe theories rest on a critical hypothesis: that there is a malign prior over objectives. Concretely:
Key arguments that have been made in favour include:
I don't consider these arguments conclusive.
I think there may be a quite general mechanism by which SSL + fine-tuning is controllable.
I think that even if powerful AI is not malign by default, in the long run there will still be incentives to push the envelope with respect to safe control. Very roughly, insofar as development of AI technology is not safety-limited, I expect it will continue to be developed until it is.
I surveyed five documents:
Some notes follow.
-3: orthogonality and instrumental convergence
I don't think this is intended to be a defense of malign priors, but just in case it is: orthogonality as stated behind the link is too weak an assumption to support the malign priors hypothesis.
Instrumental convergence is an important concept, and implies that many objectives are likely to be malign. We might be able to interpret this (perhaps as a stretch) as an instance of argument 4.
10, 11, 12: Powerful AGIs doing dangerous things that will kill you if misaligned, must have an alignment property that generalized far out-of-distribution from safer building/training operations that didn't kill you
Predicting that AGI will do amazing things far OOD is based on general extrapolation of the "AI family of technologies", not based on any particular approach. Such a prediction is neutral on the question of whether or not such AGI will also be malign, which depends on assumptions about how exactly it achieves this.
13: Many alignment problems of superintelligence will not naturally appear at pre-dangerous, passively-safe levels of capability (+ 14)
Argument 5.
15: Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously: "the case of human intelligence. We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming" (+ 16)
Argument 1
17, 18, 19: on the current optimization paradigm there is no general idea of how to get particular inner properties into a system, or verify that they're there, rather than just observable outer ones you can run a loss function over
I think it is false to say there is no known alternative control scheme to loss functions, and it is not obvious to me that control needs to be analysed in terms of "particular inner properties", whatever that means.
20: malign objectives might be favoured by losses, not just a priori
Argument 6
21: Capabilities generalize further than alignment once capabilities start to generalize far. "When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints."
This seems confused: losses impact behaviour directly and epistemology only indirectly. Elementary theory says that many behaviours minimise the loss, of which only a few are good. Insofar as there is an argument here, it seems to be something like "you get more data about predictive correctness than about utility", but the importance of such a claim depends on the prior in the first place. This is more an assertion of the malign prior hypothesis than a defense of it.
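A minimal sketch of the "many behaviours minimise the loss" point, under toy assumptions of my own: behaviours are lookup tables over three-bit inputs, the intended behaviour is a hypothetical majority vote, and training only ever shows half of the inputs. None of this comes from the surveyed documents; it only illustrates why the outcome hinges on the prior over zero-loss behaviours.

```python
from itertools import product

def target(x):
    # Hypothetical "intended" behaviour: majority vote of three bits.
    return int(sum(x) >= 2)

inputs = list(product([0, 1], repeat=3))  # all 8 possible inputs
train = inputs[:4]                        # inputs seen in training
held_out = inputs[4:]                     # inputs never seen in training

def zero_loss_behaviours():
    """Yield every lookup-table behaviour that fits the training data exactly."""
    for outputs in product([0, 1], repeat=len(inputs)):
        table = dict(zip(inputs, outputs))
        if all(table[x] == target(x) for x in train):
            yield table

fits = list(zero_loss_behaviours())
# Of those, keep the ones that also behave as intended off-distribution.
good = [t for t in fits if all(t[x] == target(x) for x in held_out)]
print(len(fits), len(good))  # prints: 16 1
```

Sixteen behaviours achieve zero training loss and only one of them matches the intended behaviour everywhere, so which one training lands on is decided by the inductive bias rather than by the loss.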
- a relatively simple core structure that explains why complicated cognitive machines work; There is no analogous truth about there being a simple core of alignment
Argument 1
I think the absence of the malign prior assumption is notable.
The best way for Alex to maximize reward while under human control is to use its high situational awareness to deliberately appear safe and aligned at all times [...] because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively. [...] In both settings, being straightforwardly obedient and honest does not maximize reward -- strategies which are dishonest / disobedient in at least some ways outperform.
Argument 6
But notably, we need ways of drawing on this feedback that don’t require unrealistic amounts of human supervision and human-generated data; we need to ensure that such feedback captures our preferences about behavior that we can’t directly understand and/or whose consequences we haven’t yet seen; we need ways of eliminating incentives to manipulate or mislead the human feedback mechanisms in question; and we need such methods to scale competitively as frontier AI capabilities increase (p. 23-24)
Not really an argument, but somewhat relevant.
Proxy goals correlated with the evaluation criteria may be simpler and therefore easier to learn, especially if the evaluation criteria are complex. (p. 25)
Argument 2
if the “true” objective function provides slower feedback, agents that pursue faster-feedback proxies have advantages (p. 25)
Not quite argument 2, but I'm bundling it in for now.
To the extent that many objectives would instrumentally incentivize good behavior in training (for example, because many objectives, when coupled with strategic awareness, incentivize gaining power in the world, and doing well in training leads to deployment/greater power in the world), but few involve intrinsic motivation to engage in such behavior, we might think it more likely that selecting for good behavior leads to agents who behave well for instrumental reasons.
Argument 3
And so if we want to understand which one we're going to get, we have to look at the inductive biases of the training process. In any situation, if you're familiar with machine learning, where we want to understand which of multiple different possible models that are behaviorally indistinguishable, we will get, it's a question of inductive biases.
Agree on the importance of the malign priors hypothesis.
we don't understand machine learning (ML) inductive biases very well, they're actually really confusing. We just don’t have very much information about how they operate.
I agree that we don't understand inductive biases very well.
So first, I'm going to make the assumption that our training process is good enough that for our model to actually perform well, it has to really fully understand the thing that we're trying to get it to do.
No comment for now, but this seems worth highlighting.
Furthermore, this is a really short and simple modification. All gradient descent has to do in order to hook up the model’s understanding of the thing that we want it to do to its actions here is just to make its proxies into long term goals—all it has to do is push them out, make them long term, and then everything else falls into place.
Argument 4
And so there’s just a huge number of Pascals because they can care about anything in the world at all. So the point is that there's more Pascals than there are the others, and so probably you’ll have to fix fewer bits to specify them in the space.
Argument 3
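The counting step in the quote above can be made explicit. This is a rough sketch with made-up numbers: the class sizes and the "skewed" prior are hypothetical, and `bits_to_select` is a name introduced here for illustration. Under a uniform prior, selecting a class costs bits equal to minus the log of its share of the space, so the larger class is cheaper. Under a different prior the ordering can reverse, which is why the counting argument leans on the prior.

```python
import math

def bits_to_select(prior_mass):
    """Bits of evidence needed to single out a class with this prior mass."""
    return -math.log2(prior_mass)

# Hypothetical sizes: many "Pascal"-style objectives, few of the others.
total = 2 ** 20
counts = {"pascal": 2 ** 19, "other": 2 ** 10}

# Uniform prior: a class's mass is just its share of the space.
uniform = {name: n / total for name, n in counts.items()}

# Hypothetical non-uniform prior that favours the smaller class.
# (The remaining mass sits on objectives in neither class.)
skewed = {"pascal": 2 ** -10, "other": 2 ** -1}

for label, prior in [("uniform", uniform), ("skewed", skewed)]:
    print(label, {name: bits_to_select(p) for name, p in prior.items()})
```

With these numbers the uniform prior needs 1 bit for "pascal" and 10 bits for "other", while the skewed prior needs 10 and 1 respectively.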
So in phase 3 we should expect that AGI behavior on novel tasks will primarily depend not on how we supervise them, but rather on how they generalize the goals they acquired earlier in training, which itself depends on the inductive biases of the training setup used.
Agree on the importance of the malign priors hypothesis.
We should expect AGIs to generalize goals to larger scales for the same reason that they’ll generalize capabilities to novel tasks: because they’ll learn high-level concepts which are not very domain-specific, and reason about how to achieve them.
Argument 4. This seems to equivocate on the word "generalisation": generalising capabilities is "doing the same thing in a different context", whereas generalising objectives is "doing a different thing".
Reasoning about how to achieve high-level goals generalizes very naturally to larger scales: for example, goals like “have more novel experiences”, “understand the world”, or “get high reward” don’t just apply within a specific time or place, but can be extrapolated to a nearly arbitrary extent.
Argument 4. That something can be arbitrarily extrapolated does not imply that it will be.
We could imagine AGIs instead generalizing to pursuing bounded versions of those goals, like “have more novel experiences, but not too many, and not too novel, and stopping after a certain time” — but I see little reason to expect generalization to stay within small-scale bounds as AGIs get smarter (especially given that many researchers will aim to build systems which generalize as far as possible).
No comment, just highlighting.
Underlying these arguments are implicit assumptions about simplicity. For instance, the arguments don’t apply to an astronomically large tabular policy which has memorized how to perform well on all tasks that take less than a year. However, since realistic AGIs will need to use simpler strategies like reasoning about outcomes, arguments like the instrumental convergence thesis can be informative about how they’ll generalize.
Argument 2
I'm wondering how much of a problem Goodharting is likely to be. The basic idea: suppose we pursue an RL-heavy approach without any regularisation towards "sensible actions" in the following areas. Even without deception, how badly do they fall apart?
Is RL w/out good regularisation going to produce bad outcomes?
Is scientific research w/out good regularisation going to produce bad outcomes?
Manufacturing
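To make the Goodharting question concrete, here is a toy sketch under assumptions of my own: a one-dimensional action, a hypothetical proxy reward that always prefers larger actions, and a hypothetical true value that agrees with the proxy near the starting point but diverges for extreme actions. The quadratic penalty is a stand-in for "regularisation towards sensible actions", not a claim about how any real system is regularised.

```python
def proxy(x):
    # Hypothetical reward signal the optimiser sees.
    return x

def true_value(x):
    # Hypothetical thing we actually care about: tracks the proxy
    # near x = 0 and diverges from it for extreme actions.
    return x - 0.05 * x ** 2

def optimise(steps, lr=0.5, reg=0.0):
    """Gradient ascent on proxy(x) - reg * x**2, starting at x = 0."""
    x = 0.0
    for _ in range(steps):
        grad = 1.0 - 2.0 * reg * x  # derivative of x - reg * x**2
        x += lr * grad
    return x

runs = [("light", 20, 0.0), ("heavy", 100, 0.0), ("heavy + reg", 100, 0.1)]
for label, steps, reg in runs:
    x = optimise(steps, reg=reg)
    print(f"{label:12s} proxy={proxy(x):6.2f} true={true_value(x):7.2f}")
```

In this toy, light optimisation of the proxy helps the true value (roughly 5), heavy unregularised optimisation drives it strongly negative (roughly -75), and the penalty keeps the heavy run in a reasonable region (roughly 3.75) even though it is not tuned to the true value. How closely any of the areas listed above resemble this picture is exactly the open question.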