Potential alignment targets for a sovereign superintelligent AI

Paul Colognese

[ Question ]

3rd Oct 2023

I'd like to compile a list of potential alignment targets for a sovereign superintelligent AI.

By an alignment target, I mean something like what goals/values/utility function we might want to instill in a sovereign superintelligent AI (assuming we've solved the alignment problem).

Here are some alignment targets I've come across:

Alignment to a human user (or group of human users).
Ambitious value learning.
Coherent extrapolated volition.

Examples, reviews, critiques, and comparisons of alignment targets are welcome.

4 Answers

Steven Byrnes

Oct 03, 2023

If there’s a powerful AI not under the close control of a human, then I currently think that the least bad realistic option to shoot for is: the AI is motivated to set up some kind of “long reflection” or atomic communitarian thing, or whatever—something where humans, not the AI directly, would be making the decisions about how the future will go. In other words, the AI would be motivated to set up a process / system (or a process / system to create a process / system…) and then cede power to that process / system (or at least settle into a role as police rather than decision-maker). Hopefully the process / system would be sufficiently good that it would be stable and prevent war and oppression and be compatible with moral progress and so on.

Like, if I were given extraordinary power (say, an army of millions of super-speed clones of myself), I would hope to eventually wind up in a place like that, instead of directly trying to figure out what the future should be, a prospect which terrifies me.

This is pretty vague. I imagine that lots of devils are in the details.

Tamsin Leake

Oct 03, 2023

The QACI target sort-of aims to be an implementation of CEV. There's also PreDCA and UAT listed on my old list of (formal) alignment targets.

Nathan Helm-Burger

Oct 04, 2023

Corrigibility. Namely, the desire to be corrected if wrong, to be turned off if the operators want to turn it off, and otherwise to enact the operators' desires. Caution and obedience. https://www.lesswrong.com/posts/ZxHfuCyfAiHAy9Mds/desiderata-for-an-ai

Chris Lakin

Oct 04, 2023

«Boundaries»/membranes.

E.g.: «Boundaries» for formalizing an MVP morality

Also: see the recap in Formalizing «Boundaries» with Markov blankets + Criticism of this approach

Note also that there are (at least) two ways to do this, which I need to write a post about (or let me know if you want to review my draft). One way is like "be a Nanny AI and protect the «boundaries» of humans"; the other is like "mind your own business and you will automatically not cause any problems for anyone else". The former is more like Davidad's approach (at least as of earlier this year); the latter is more like Mark Miller's thoughts on AI safety and security.
