As part of learning the field and making the most of new ideas, I've been trying to figure out what the goal of AI alignment is. So far I've found out what outer alignment is as a concept, but not what it should be as an instantiation.
So here is my suggestion:
Why don't we take Nick Bostrom's Instrumental Convergence goals and make those our terminal goals as a species?
Note how humanity's values are agnostic to morality. The whole idea of Bostrom's instrumental convergence goals is that they maximize the ability to achieve nearly any terminal goals. So by adopting these goals as the explicit terminal goals for humanity, we allow space for every individual human to pursue their self-chosen goals. We don't need to agree on religion, morality, the nature of reality, or how nice one should be in repeated coordination games. Instead we can agree that whatever AIs we happen to make, we at least ensure these AIs won't wipe out humanity at large, won't try to change humanity, won't limit us in our development or creations, and won't stymie our growth.
Honestly, that takes care of most horror scenarios!
It's basically like a better Asimov's laws!
Note that humanity's values are only applied to humanity at large and not to individual humans. That means the AGI can still ...
By formulating our goals at the level of humanity instead of the individual human, we are thus creating a path for AGI to navigate conflicts of interest without devolving into catastrophic trade-offs no one thought to prohibit it from making. Of course, there is still the question of how to operationalize these values, but knowing where the target is is a good start.
The way I understand the problem space now, the goal of AI alignment is to ensure AGI adopts the instrumental convergence goals for humanity while we can assume the AGI will also have these goals for itself. The beauty of this solution is that any increase of instrumental success on the part of the AGI will translate into an increase in terminal success for humanity!
Win-win if I ever saw one.
Additionally, this approach doesn't rely on the incidental creator of the first AGI being a nice guy or gal. These goals are universal to humanity. So even though an individual creator might add goals that are detrimental to a subset of humanity (say the bad people get AGI first), the AGI will still be constrained in how much damage it can do to humanity at large.
The distinction between civilization's goal and goals of individual people is real, but that doesn't make civilization's goal unmoored. Rounding it down to some instrumental goals changes it. And that exposes you to Goodhart's curse: if you take something other than actual terminal values of civilization as an optimization target, the outcome looks bad from the point of view of actual terminal values of civilization.
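A toy sketch may make the worry concrete. Everything in it is invented for illustration: `true_value` stands in for the actual terminal values, `proxy_value` for the rounded-down target, and the numbers are arbitrary. It only shows the general shape of the failure (a proxy that tracks the true objective over a modest range and diverges under hard optimization), not anything specific to the goals proposed in the post.

```python
# Toy illustration of optimizing a proxy instead of the true objective.
# Both functions and the candidate range are hypothetical, chosen for this sketch.

def true_value(x: float) -> float:
    """Stand-in for the actual terminal value: more x helps at first, then hurts."""
    return x - x * x / 10.0


def proxy_value(x: float) -> float:
    """Stand-in for the simplified target: 'more x is always better'."""
    return x


# The options the optimizer can choose from: 0.0, 1.0, ..., 20.0
candidates = [float(x) for x in range(0, 21)]

# An optimizer that is handed the proxy pushes x as far as it can.
best_by_proxy = max(candidates, key=proxy_value)

# An optimizer that is handed the true objective stops at its peak.
best_by_true = max(candidates, key=true_value)

print("chosen by proxy:", best_by_proxy, "-> true value", true_value(best_by_proxy))
print("chosen by true: ", best_by_true, "-> true value", true_value(best_by_true))
```

For small x the two functions roughly agree, which is why the proxy looks like a reasonable target. The gap only shows up at the extreme the optimizer selects.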
I think something similar to what you say can be rescued, in the form of the more important terminal values of civilization turning out to be generic, like math, not specific to details of the people who seek to formulate their own values. Generic values are convergent across many processes of volition extrapolation, including for the more human-like AGIs, and form an even greater share of terminal values for coalitions of multiple different AGIs. (This doesn't apply to mature optimizers such as paperclip maximizers that already know their terminal values and aren't motivated to work on figuring out what else they should be.)
It is similar to being instrumentally convergent, in being discovered by many different processes for the same reason, but it's not the same thing. Convergent instrumental goals are discovered as subgoals in the process of solving many different problems, in service of many different terminal goals. Generic terminal goals are discovered as terminal goals in the process of extrapolating many different volitions (formulating terminal goals of many different people, including those of relatively alien psychology, not sharing many human psychological adaptations).
Thank you for your comment!
How do we change humanity's hypothesized terminal goals by assigning instrumental convergence goals for humanity as terminal goals to the AGI?
Also, I'm trying to think of a Goodhart's curse version of the Humanity's Values framework, and can't think of any obvious cases. I'm not saying it's watertight and the AGI can't misinterpret the goals, but if we presuppose that we find the ideal implementation of these values as goals, and there is no misalignment, then ... everything would be ok?
"I think something similar to what you say can be rescued, in the form of the more important terminal values of civilization turning out to be generic, like math, not specific to details of the people who seek to formulate their own values."
I read the links but don't understand the terminal values you are pointing to. Could you paraphrase?
I don't think I understand this post. Are you making the claim that "because some subset of human values are (and were selected for being) instrumentally convergent, we don't have to worry about outer alignment if we project our values down to that subset"?
If so, that seems wrong to me because in most alignment failure scenarios the AI does actually have a terminal goal that to us would seem "arbitrary" or "wrong". It only pursues the instrumentally convergent goals because they help the AI towards the terminal goal. That means you can't bank on the AI not turning you into paperclips at some point, because it might judge that to be more expedient than keeping you around as another computer for doing research etc.
In addition, there's the added danger that if the AI leaves you around, your inability to precommit to some strategies will always pose a threat to the AI's totalizing vision of a universe full of paperclips. If so, it's instrumentally convergent for the AI to eliminate or permanently disempower you, even if you yourself are currently aiming for the same goals the AI is aiming for, both instrumental and terminal.
Hmmm, it's good to know my thesis wasn't very clear.
The idea is to train an AI to have our values as its end goals. Indeed, it doesn't solve inner alignment issues. But say the AI wants to maximize paperclips: it would then be constrained not to damage our survival etc. while making paperclips.
I was trying to figure out what set of values we are even trying to give an AGI in the first place and this was my best guess: whatever else you do, optimize the instrumental convergence goals of humanity.
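As a minimal sketch of the paperclip case above, under one particular reading: the goals for humanity act as a hard filter on what the AI may do, and only then does it maximize its own objective. The `Plan` type, its boolean fields, and the example plans are all hypothetical. Reducing each goal to a single true/false flag is exactly the operationalization step that remains open, and treating the goals as constraints rather than as quantities to maximize is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Plan:
    """A hypothetical candidate action, scored by hand for this sketch."""
    name: str
    paperclips: int
    humanity_survives: bool
    humanity_unaltered: bool
    growth_unrestricted: bool


def respects_humanity_goals(plan: Plan) -> bool:
    """Hard filter: every goal for humanity must hold for the plan to be allowed."""
    return (
        plan.humanity_survives
        and plan.humanity_unaltered
        and plan.growth_unrestricted
    )


def choose_plan(plans: Sequence[Plan]) -> Optional[Plan]:
    """Pick the allowed plan with the most paperclips, or None if none is allowed."""
    allowed = [p for p in plans if respects_humanity_goals(p)]
    if not allowed:
        return None
    return max(allowed, key=lambda p: p.paperclips)


plans = [
    Plan("convert the biosphere", 1_000_000, False, False, False),
    Plan("mine uninhabited asteroids", 10_000, True, True, True),
    Plan("do nothing", 0, True, True, True),
]

chosen = choose_plan(plans)
print(chosen.name if chosen else "no allowed plan")
```

The plan with the most paperclips is filtered out before maximization happens. The hard part the sketch hides is who fills in the boolean fields and how.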
If it's more achievable for you to be sad than to be happy, would you change your priorities and strive to become sad?
What's the relevance of the question?
"Self-improvement" is one of those things which most humans can nod along to, but only because we're all assigning different meanings to it. Some people will read "self-improvement" and think self-help books, individual spiritual growth, etc.; some will think "transhumanist self-alteration of the mind and body"; some will think "improvement of the social structure of humanity even if individual humans remain basically the same"; etc.
It looks like a non-controversial thing to include on the list, but that's basically an optical illusion.
For those same reasons, it is much too broad to be programmed into an AGI as-is without horrifying consequences. The A.I. settling on "maximise human biological self-engineering" and deciding to nudge extremist eugenicists into positions of power is, like, one of the optimistic scenarios for how well that could go. I'm sure you can theoretically define "self-improvement" in ways that don't lead to horrifying scenarios, but then we're just back to Square 1 of having to think harder about what moral parameters to set rather than boiling it all down to an allegedly "simple" goal like "human self-improvement".
The operationalization would indeed be the next step. I disagree that the first step is meaningless without it, though. E.g. having some form of self-improvement in the goal set is important, as we want to do more than just survive as a species.