Right now the main focus of alignment work seems to be on how to align powerful AGI agents, a.k.a. AI Safety. I think the field could benefit from a small reframing: we should think not about aligning AI specifically, but about the alignment of systems in general, if we are not doing so already.

It seems to me that the biggest problem in AI Safety comes not from the fact that the system will have unaligned goals, but from the fact that it is superhuman.

That is, it would have nearly godlike power to understand the world and, in turn, to manipulate both the world and the humans in it. Does it really matter whether it is an artificial agent that gets godlike processing and self-improvement powers, or a human, or a government or business?

I propose a little thought experiment – feel free to answer in the comments.

If you, the reader, or, say, Paul Christiano or Eliezer were uploaded and gained self-improvement, self-modification, and vastly greater processing speed and power, would your goals converge toward damaging humanity as well?

If not, what makes the difference? How can we transfer this secret sauce to an AI agent?

If yes, maybe we can look at how big, superhuman systems get aligned right now and take some inspiration from that?


I believe that much of the difficulty in AI alignment comes from specific facts about how you might build an AI, and especially searching for policies that behave well empirically. Similarly, much of the hope comes from techniques that seem quite specific to AI.

I do think there are other contexts where alignment is a natural problem, especially the construction of institutions. But I'm not convinced that either the particular arguments for concern, or the specific technical approaches we are considering, transfer.

A more zoomed out "second species" style argument for risk may apply roughly equally well a priori to human institutions as to AI. But quantitatively I think that institutions tend to be weak when their interests are in conflict with any shared interests of their stakeholders/constituents, and so this poses a much lower risk (I'd guess that this is also e.g. Richard's opinion). I think aligning institutions is a pretty interesting and important question, but that the potential upside/downside is quite different from AI alignment, and the quantitative differences are large enough to be qualitative even before we get into specific technical facts about AI.

My thought here is that we should look into the value of identity. I feel like even with godlike capabilities I would still tread very carefully around self-modification, to preserve what I consider "myself" (which includes valuing humanity).
I even have some ideas for safety experiments on transformer-based agents, to look into whether and how they value their identity.

If you, the reader, or, say, Paul Christiano or Eliezer were uploaded and gained self-improvement, self-modification, and vastly greater processing speed and power, would your goals converge toward damaging humanity as well? If not, what makes the difference? How can we transfer this secret sauce to an AI agent?

The Orthogonality Thesis states that values and capabilities can vary independently. The key question then is whether my/Paul's/Eliezer's values are actually as aligned with humanity as they appear to be, or if instead we are already unaligned and would perform a Treacherous Turn once we had the power to get away with it. There are certainly people who are already obviously bad choices, and people who would perform the Treacherous Turn (possibly most people[1]), but I believe there are people who are sufficiently aligned, so let's assume going forward we've picked one of those. At this point "If not, what makes it different?" answers itself: by assumption we've picked a person for whom the Value Loading Problem is already solved. But we have no idea how to "transfer this secret sauce to an AI agent" - the secret sauce is hidden somewhere along this person's particular upbringing and more importantly their multi-billion year evolutionary history.

  1. The adage "power tends to corrupt, and absolute power corrupts absolutely" basically says that treacherous turns are commonplace for humans - we claim to be aligned and might even believe it ourselves while we are weak, but then when we get power we abuse it. This adage existing does not of course mean it's universally true. ↩︎

The adage "power tends to corrupt, and absolute power corrupts absolutely" basically says that treacherous turns are commonplace for humans - we claim to be aligned and might even believe it ourselves while we are weak, but then when we get power we abuse it.

I would like to know the true answer to this.

On one hand, some people are assholes, and often it's just fear of punishment or social disapproval that stops them. Remove all this feedback, and it's probably not going to end well. (Furthermore, a percent or two of the population are literally psychopaths…)

I think the insights from Selectorate Theory [https://www.lesswrong.com/posts/N6jeLwEzGpE45ucuS/building-blocks-of-politics-an-overview-of-selectorate] imply that it is impossible to keep power without gradually growing more corrupt, in the sense of appeasing the "Winning Coalition" with private goods. No matter what your terminal goals are, more power, and power kept for longer, is a convergent instrumental goal, and one which usually takes so much effort to achieve that you gradually lose sight of your terminal goals too, compromising ethics in the short term in the name of an "ends justify the means" long term (which often never arrives). So yeah, I think that powerful humans are unaligned by default, as our ancestors, who rejected all attempts to form hierarchies for tens of thousands of years before finally succumbing to the first nation-states, may attest.
Seems like there are two meanings of "power" that get conflated, because in real life it is a combination of both:
* to be able to do whatever you want;
* to successfully balance the interests of others, so that you can stay nominally on top.
Good point. Perhaps there are some people who would be corrupted by the realities of human politics, but not by e.g. ascension to superintelligence.
  1. Superhuman agents these days are all built up out of humans talking to each other. That helps a lot for their alignability, in multiple ways. For an attempt to transfer this secret sauce to an AI agent, see Iterated Distillation and Amplification, which as I understand it works by basically making a really good human-imitator, then making a giant bureaucracy of them, and then imitating that bureaucracy & repeating the process.
  2. The AIs we will soon build will be superhuman in new ways, ways that no current superhuman agent enjoys. (See e.g. Bostrom's breakdown of speed, quality, and collective intelligence -- current organizations are superhuman in "collective" but human-level in speed and quality.)
  3. To answer your question: no, I'd feel pretty good about Paul or Eliezer or me being uploaded. If it was a random human being instead of one of those two, I'd still think things would probably be OK, though there'd be a still-too-large chance of catastrophe.

humans talking to each other already has severe misalignment. ownership exploitation is the primary threat folks seem to fear from ASI: "you're made of atoms the ai can use for something else" => "you're made of atoms jeff bezos and other big capital can use for something else". I don't think point 1 holds strongly. youtube is already misaligned; it's not starkly superhuman, but it's much better at selecting superstimuli than most of its users. hard asi would amplify all of these problems immensely, but because they aren't new problems, I do think seeking formalizations of inter-agent safety is a fruitful endeavor.

Daniel Kokotajlo:
Oh I agree with all that. I said "it helps a lot for their alignability" not "they are all aligned."
the gears to ascension:
makes sense, glad we had this talk :thumbsup:

The misalignment problem is universal, extending far beyond AI research. We deal daily with misaligned artificial non-intelligent systems and non-artificial intelligent ones. Politicians, laws, credit score systems, and many other things around us are misaligned to some degree with the goals of the agents who created them or gave them their power. The AI Safety concern is that if an AGI system is disproportionately powerful, a tiny misalignment is enough to create unthinkable risks for humanity. It doesn't matter whether the powerful but slightly misaligned system is an AI or not, but we believe that AGI systems are going to be extraordinarily powerful and are doomed to be at least slightly misaligned.

You can do a different thought experiment. I'm sure that you, like any other standard agent, are slightly misaligned with the goals of the rest of humanity. Imagine you have near-infinite power. How bad would that be for the rest of humanity? If you were somebody else in this world, a random person, would you still want to find yourself in a situation where now-you has near-infinite power? I certainly wouldn't.


Answer by Aleksey Bykhun:
Okay, hold my gluten-free kefir, boys! Please let me say it in full first without arguments, and then I will try to find more relevant links for each claim. I promise it's relevant.

INTRODUCTION – ENLIGHTENMENT?

Lately, I have been into hardcore mindfulness practices (see book [https://www.mctb.org/mctb2/]) aimed at reaching "Enlightenment" in the sense of the Buddha. There are some people who reliably claim they've succeeded and who talk about their experience and how to get there (e.g. see this talk [https://www.youtube.com/watch?v=K6kfcYBrKMc&t=1888s] and google each of the fellows if it resonates).

My current mental model of "Enlightenment" is as follows. Evolutionarily, we developed simple lizard brains first, mostly consisting of "register => process => decide => react" without much thought – similar to the knee reflex, but sometimes a bit more complicated. Our intellectual minds, capable of information processing, memory, and superior pattern-matching, came later. These two systems coexist, and the first one possesses the second. However, the hardware of our brains has general information-processing capabilities and doesn't require any instant "good-bad" reactionary decision mechanism; even though that mechanism was "invented" earlier, it's ad hoc in the system. My metaphor would be a GPU or an ASIC that short-circuits some of the execution to help the CPU process information faster. Still, it makes a big difference in your subjective experience whether that first system is being used or not.

Unwinding this circuitry from your default information processing (which, hand-wavily, is "conscious attention", or the "central point") is what mindfulness is about. "Enlightenment" is the moment when you relax enough that your brain becomes able (but not required) to run information flows around the lizard brain and experience sensory stimuli "directly". A similar "insight" moment happens when you realize that "money" is just paper, and not the Ultimate Human Value Leaderboard.
Aleksey Bykhun:
Interestingly, I have just discussed a similar issue with a friend and came up with a solution. Obviously, an aligned AI cares about people's subjective opinions, but that doesn't mean it's not allowed to talk to or persuade them. Imagine a list of TED-style videos tailored specifically for you on each pressing issue that requires you to change your mind. On the one hand, this presumes that people trust the AI enough to be persuaded, but keep in mind that we're dealing with a smart, restless agent. The only thing it asks is that you keep talking to it. The last resort would be to press people on "if you think that's a bad idea, are you ready to bet that this, once implemented, is going to make the world worse?" and create a virtual prediction market between supporters and opponents.

P.S. This all implies that the AI is a non-violent communicator. There are many ways to pull people's strings to persuade them; I presume that we know how to distinguish between manipulative and informative persuasion. A hint on how to do that: the AI should care about people making INFORMED decisions about THEIR subjective future, not about getting their opinions "objectively" right.
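The virtual prediction market mentioned above could, hypothetically, be run by a simple automated market maker. The sketch below is only an illustration under one common choice of mechanism, the logarithmic market scoring rule (LMSR), for a single binary question like "this policy makes the world worse"; the class name, the liquidity parameter `b`, and the trade sizes are all invented for the example.

```python
import math

class BinaryMarket:
    """Minimal LMSR market maker for one yes/no question (illustrative sketch)."""

    def __init__(self, b=100.0):
        self.b = b           # liquidity parameter: higher b = prices move more slowly
        self.q = [0.0, 0.0]  # outstanding shares: [YES, NO]

    def _cost(self, q):
        # LMSR cost function: C(q) = b * log(sum_i exp(q_i / b))
        return self.b * math.log(math.exp(q[0] / self.b) + math.exp(q[1] / self.b))

    def price(self, outcome):
        """Current probability the market assigns to outcome (0 = YES, 1 = NO)."""
        exps = [math.exp(x / self.b) for x in self.q]
        return exps[outcome] / sum(exps)

    def buy(self, outcome, shares):
        """Buy shares in an outcome; returns the cost the bettor pays."""
        before = self._cost(self.q)
        self.q[outcome] += shares
        return self._cost(self.q) - before

market = BinaryMarket()
cost = market.buy(0, 50)          # someone bets the policy makes the world worse
print(round(market.price(0), 3))  # → 0.622, up from the initial 0.5
```

The point of the mechanism is that disagreeing with the AI's proposal is no longer free: each "this is a bad idea" claim moves the market price and costs the bettor something if they turn out to be wrong, which is exactly the "are you ready to bet?" pressure the comment describes.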
Several people have suggested that a sufficiently smart AI, with the ability to talk to a human as much as it wanted, could persuade the human to "let it out of the box" and give it access to the things it needs to take over the world. This seems plausible to me, say at least 10% probability, which is high enough that it's worth trying to avoid. And it seems to me that, if you know how to make an AI that's smart enough to be very useful but will voluntarily restrain itself from persuading humans to hand over the keys to the kingdom, then you must have already solved some of the most difficult parts of alignment. Which means this isn't a useful intermediate state that can help us reach alignment. Separately, I'll mention my opinion that the name of the term "non-violent communication" is either subtle trolling or rank hypocrisy. Because a big chunk of the idea seems to be that you should stick to raw observations and avoid making accusations that would tend to put someone on the defensive... and implying that someone else is committing violence (by communicating in a different style) is one of the most accusatory and putting-them-on-the-defensive things you can do. I'm curious, how many adherents of NVC are aware of this angle on it?
1Aleksey Bykhun1mo
I don't think NVC tries to put down an opponent; it's mostly about how you present your ideas. I think it models the opponent as "he tries to win the debate without thinking about my goals; let me think about both my goals and theirs, so I'm one step ahead." Which is a bit presumptuous and looking down on them, but not exactly accusatory.
This seems way too vague to be useful. I voted agreement on both comments, because I seriously do not know what this question means by "aligned with humanity".
Rana Dexsin:
That's part of the point, yes! My thought is that the parent question, while it's trying to focus on the amplification problem, kind of sweeps this ill-defined chunk under the rug in the process, but I'm not sure how well I can justify that. So I thought asking the subquestion explicitly might turn out to be useful. (I should probably include this as an edit or top-level comment after a bit more time has passed.)
Answer by MSRayne:
I am skeptical that there are any coherent collective values of humanity to be aligned to. But I am aligned to what I see as the most reasonable universal ethical principles; however, those do not place humanity significantly above other life forms, and if I became benevolent superhuman world sovereign, many people would be angry that they're not allowed to exploit animals anymore, and I would laugh at them.

The question "what does a human do if they obtain a lot of power?" seems only tangentially related to intent alignment. I think this largely comes down to (i) the preferences of that human in this new context, and (ii) the competence of that person at behaving sanely in this new context.

I like to think that I'm a nice person who the world should be unusually happy to empower, but I don't think that means I'm "aligned with humanity;" in general we are aiming at a much stronger notion of alignment than that. Indeed, I don't think "humanity" has the right type signature for something to be aligned with. And on top of that, while there are many entities I would treat with respect, and while I would expect to quickly devolve the power I acquired in this magical thought experiment, I still don't think there exists any X (other than perhaps "Paul Christiano") that I am aligned with in the sense that we want our AI systems to be aligned with us.
