Arguments for outer alignment failure, i.e. that we will plausibly train advanced AI systems using a training objective that doesn't incentivise or produce the behaviour we actually want from the AI system. (Thanks to Richard for spelling out these arguments clearly in AGI safety from first principles.)
Arguments for inner alignment failure, i.e. that advanced AI systems will plausibly pursue an objective other than the training objective while retaining most or all of the capabilities they had on the training distribution.[1]
This follows abergal's suggestion of what inner alignment should refer to. ↩︎
I have 5 "inner" and 2 "outer" arguments in bullet point lists at My AGI Threat Model: Misaligned Model-Based RL Agent (although you'll notice that my threat model winds up with "outer" & "inner" referring to slightly different things than the way most people around here use the terms).
This is probably not the answer you are looking for, but as you are considering putting a lot of work into this...
Does anyone know if this has been done? If not, I might try to make it.
This has probably been done, but it depends on what you mean by "strongest arguments".
Does "strongest" mean that the argument has a lot of rhetorical power, so that it will convince people that alignment failure is more plausible than it actually is? Or does "strongest" mean that it gives the audience the best possible information about the likelihood of various levels of misalignment, where these levels range from 'annoying but can be fixed' to 'kills everybody and converts all matter in its light cone to paperclips'?
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
My main message here, I guess, is that many distilled collections of arguments already exist, even book-length ones like Superintelligence, Human Compatible, and The Alignment Problem. If you are thinking about adding to this mountain of existing work, you need to carefully ask yourself who your target audience is, and what you want to convince them of.
Thanks for your reply!
it depends on what you mean by "strongest arguments".
By "strongest" I definitely mean the second thing (I probably should have clarified this in the question; thanks for picking up on it).
Also, the strongest argument when you address an audience of type A, say policy makers, may not be the strongest argument for an audience of type B, say ML researchers.
Agreed, though I expect it's more that the emphasis needs to differ for each audience while the underlying argument stays largely the same (conditional on using your second definition of "strongest").
Various arguments have been made for why advanced AI systems will plausibly not have the goals their operators intended them to have (due to either outer or inner alignment failure).
I would really like a distilled collection of the strongest arguments.
Does anyone know if this has been done?
If not, I might try to make one. So, any replies pointing me to resources with arguments I've missed (in my own answer) would also be much appreciated!
Clarification: I'm most interested in arguments that alignment failure is plausible, rather than merely that it is possible (there are already examples that establish the possibility of outer and inner alignment failure for current ML systems, which probably implies we can't rule it out for more advanced versions of these systems either).