While you can read this post as a standalone, I'm writing it partially in response to Wei Dai's recent post Meta Questions about Metaphilosophy. I had already been considering writing a top-level post about metaethics and whether "solving" it is a tractable problem that can meaningfully be separated from AI alignment itself, and I see Wei's post as closely related to this topic. I recommend reading that (quite short) post too!

I define metaethics as the problem of how to decide on a "correct" ethical system. As an example, it seems like a good idea to give every "person" a "vote" about what ethical system they "want." But what if people say they want different things depending on how they're asked? How do we define a person? What voting system do we use? Metaethics is about finding answers to questions like these. Coherent Extrapolated Volition (CEV) is one attempt at a solution to metaethics. So the argument goes, solving AI alignment lets us put an arbitrary goal into an AI, but solving metaethics tells us what the goal should be.

I believe that metaethics is at least a major part of what Wei referred to as "metaphilosophy" in his post, although he also seems to be concerned about other things, like how to decide which decision theory to use. In this post, I'll only be talking about metaethics, as I'm not sure I understand Wei's concerns about other aspects of metaphilosophy enough to comment on them.

I used to feel more strongly that it's highly important to "solve metaethics" before we get transformative AI. Then I talked to someone who argued against this by basically saying, "if we have an aligned AI that's at least as good as you at metaethics, we can just tell it to do all the thinking you're doing now." After thinking about it some more, I mostly convinced myself that they were right. In this post, I'll give a more refined version of the argument for why we don't need to "solve metaethics," then explain the strongest remaining argument against the "just solve intent alignment" strategy that I know of.

Why solving metaethics might be unnecessary

Here's a concern I believe Wei Dai and others would endorse:

"Even if we solve intent alignment, our superintelligent AI may end up settling on bad ethics."

(With the proposed solution being to solve metaethics before we build the AI.)

Let's try to break this statement down, starting with what we mean by "intent alignment."

What is intent alignment?

Let's define an intent-aligned AI as a black box with an input and an output channel. A single user will give the AI a single natural-language command. From then until the end of time, regardless of the level of intelligence it reaches, the AI will try to output bits that cause the command to be executed according to what the user "intended." This is a much stronger version of alignment than would be required for the diamond-maximization problem, but I think this description accurately encapsulates what many AI alignment researchers are working towards. We're assuming for simplicity that the AI gets only one "command," but this maintains generality - creating an AI that takes many commands from many people is equivalent to commanding the one-command AI, "from now on, obey commands from these people."

Now let's suppose we have an intent-aligned, superintelligent AI, and the user (Hassabis/Altman/Amodei/whoever) gives it the command: "maximize ethical goodness." I claim that, assuming the user already endorses some pretty common-sense democratic metaethics (say, CEV) and isn't knowingly lying to the world about it, this setup results in a pretty amazing world.

"But how can that be, if Hassabis/Altman/Amodei doesn't already have a solid metaethical system? Surely there are lots of missing pieces and caveats that they will later regret not adding to their command! What if their conscious mind believes in the principle that everyone should be treated equally, but some unconscious parts of their mind want to benefit their cronies to the exclusion of others? Or what if they later get corrupted by their absolute power and their definition of 'ethical goodness' shifts to 'whatever makes me happiest, disregarding anyone else'?"

At least according to my personal definition of "intend" (which I think is a very common-sense definition), what the user intended by "maximize ethical goodness" already accounts for all of these concerns. As a rule of thumb, you can imagine asking the user a question about what they intended by their command immediately after they gave it. Their internal monologue probably reflects what they "intended" quite accurately:

  • "Hey Demis Hassabis, could 'ethical goodness' include important parts of metaethics that you aren't currently aware of?'" "Yeah, sure."
  • "Sam Altman, does the definition of 'ethical goodness' include special terms for your friends and family?" "Nope."
  • "Dario Amodei, if you ever got corrupted by power, would the definition of 'ethical goodness' change?" "Uh, no."

As far as I can tell, this counters any argument that the intent-aligned AI would aim to do something obviously bad, if we allow the assumption that the user had good, reasonable intentions when they gave the command. Any clearly wrong choice the AI might make would also be clearly wrong to the user, so the AI shouldn't make it. If you accept this argument, it shows that intent alignment lets you very flexibly and robustly point an AI towards achieving a great future for humanity, if that's what you want.

What if the AI still falls far short of the user's intent?

Again, here's the statement I'm arguing against:

"Even if we solve intent alignment, our superintelligent AI may end up settling on bad ethics."

So even if you accept my arguments above about the power of intent alignment, you might object that even an intent-aligned superintelligent AI might accidentally settle on ethics that the user didn't "intend."

I won't say this is impossible, but this seems like a very weird situation. I think if I copy-pasted the previous section of this post into GPT-4, it would understand quite accurately what I meant by "the user's intention," and if you asked it a series of questions about what kind of world most humans "intend" to live in, it would settle on something very non-apocalyptic. I wouldn't trust it to get every detail right, but I think it would be a much better world - no more cancer, no more poverty, no more depression, implemented in a way we actually want rather than some devious monkey's paw scenario.

If you grant me that, you'd then have to argue that an intent-aligned superintelligence, which would have the ability to actively try to improve its philosophy skills, would be worse at understanding our conception of ethics than GPT-4. This seems unlikely to me.

What are "bad ethics?"

Presumably, "bad ethics" here could be restated as "ethics that the person who made the statement above would 'intend' a superintelligent AI to avoid, if they could give such an AI a command right now." As in the thought experiment in the section about intent alignment, I think I could ask someone like Wei Dai questions about what he intends by "bad" in phrases like "the AI transition may go badly," and his answers would all make sense. His notion of "bad" already factors in things like being impartial between other people's ideas of badness, being open to aspects of badness that he hadn't considered, and so on.

If you accept everything I've argued so far, I think we've reduced the original statement from

"Even if we solve intent alignment, our superintelligent AI may end up settling on bad ethics."

to

"My ethics (today) might disagree with the ethics of the 'user' who will command the AI (at the time of the command)."

This is a very reasonable concern! Now, I'll briefly go over my own thinking about how this problem could manifest.

How intent alignment can still go wrong

The AI's user is selfish, malicious, or shortsighted

This one is pretty basic and has been covered elsewhere - the person who gets the AI supports a totalitarian regime, wants to maximize their personal profit, endorses a specific all-encompassing ideology, or just hates humans. These are all certainly possible, and we should think hard about who the heck to entrust with the AI's first command if we get that choice. Figuring this out falls squarely under AI governance.

The AI's user has slightly different metaethics than me

If the AI aligns to what its user "intends" according to the definition I gave above, there is no one I'd rather it be aligned to than me. You should also want the AI to align to you more than anyone else, even if you're the most unselfish person in the world, because an AI aligned to you will implement your particular version of unselfish metaethics more accurately than an AI aligned to someone else.

On the other hand, among people whose metaethics are pretty democratic and unselfish, I don't expect there to be a huge difference in outcome. Maybe in one person's interpretation of CEV, anyone who wants to can explore the limits of strange and alien transhuman mindspaces, but another person's interpretation ends up with some guardrails that keep everyone's mind vaguely human-shaped, and the preferences of the few people who want the first thing don't weigh enough to take the guardrails down. Maybe one of these worlds is sadder than the other, but either way, I think we end up with a pretty great outcome compared to the status quo.

By the time the AI gets its "command," people will all be pretty crazy

I could imagine weaker AIs causing enough havoc in the world to prevent humanity from reaching its potential, even if we eventually get an intent-aligned superintelligence. Wei Dai has mentioned this possibility, e.g. in this comment.

As a silly example, imagine that Coca-Cola were to launch a global ad campaign powered by AI so persuasive that any human put in charge of an intent-aligned AI would command it to fill the universe with Coca-Cola. Although this is obviously a terrible idea by our standards, the intent-aligned AI would follow the order if the user really meant it.

If there were really an AI powerful enough to convince everyone of something that crazy, it would obviously be superintelligent at social engineering, but you could imagine it being subhuman enough at long-term planning or other skills that it wouldn't just take over the world. This story also depends on the weak AI that makes everyone crazy being very obviously misaligned. I imagine there are better stories than this one, where the AI that makes everyone crazy is weaker or tries to do something closer to what its handlers actually want.

Conclusion

I presented a concern that is hopefully faithful to the views of Wei Dai and others:

"Even if we solve intent alignment, our superintelligent AI may end up settling on bad ethics."

Then I argued that only a weaker-sounding version of the original statement is true and expounded on the ways it might still be concerning:

"My ethics (today) might disagree with the ethics of the 'user' who will command the AI (at the time of the command)."

Recall that the original statement came packaged with the further assertion that the problem can be fixed if we solve metaethics in time. I'm not sure whether metaethics (or other kinds of metaphilosophy) would still help address the concern in the weakened statement, or whether it would only apply to the parts of the statement I tried to remove.

If the arguments in this post hold up, this would be good news in the sense that we hopefully wouldn't need to solve the additional problem of metaethics in order to get a good post-AI future, although unfortunately that doesn't make the intent alignment problem any easier. The concerns discussed in this post also help emphasize the importance of aiming for full-fledged intent alignment, rather than some weaker form of alignment like the kind that lets you maximize diamonds. I think this is what most alignment researchers are already aiming for.

Although it's somewhat tangential to this post, I'd really like to see more concrete stories for how the "AI makes humanity go crazy" scenario might play out, and ideas for how we could prevent any of the crucial steps in those stories from happening. I'd be especially interested to see a story where advances in metaethics are helpful. My Coca-Cola example isn't the most realistic, but I get the feeling there are plenty of scenarios that are more plausible.
