If you accept the core premises of Eliezer's book, then you believe that we're building systems we cannot control. You plot "human autonomy" against "computer autonomy" over the history of mainstream computer systems (computers, the internet, recommendation algorithms, and now commercial LLMs), and you see the former wane as the latter rises.
Much of the field of AI alignment pretends:
1. That humans will remain in control of the AI systems we build.
2. That techniques for steering today's products will carry over to far more powerful systems.
These two assumptions lead to research that might help us steer products like ChatGPT in the very short term, but which will have no bearing on a future with smarter systems that humans cannot control. Smart people have already admitted that the current field of AI alignment is unproductive for superalignment. But I haven't seen anyone concisely lay out what the field should look like instead.
Especially if Eliezer and others influence the world and delay superintelligence development, we'll need to reshape the field of AI alignment around the uncomfortable axiom that we're building systems that will be more powerful and agentic than us.
What should the new, wiser field of AI alignment look like? I have a few thoughts right now. I'm not a full-time alignment researcher, so I might both (1) not be aware of the state of the art and (2) not use the well-established terms. But I figured I'd write this anyway, because I've been searching in vain for a concise note on what a good AI alignment field would look like, and because a fresh perspective can't hurt.
The three vectors that seem to matter, as I see it:
Single tech-enabled united nation
Companies and countries, as units, don't necessarily care if ASI replaces or marginalizes humanity. I can imagine a country or a company continuing to exist with AI parts instead of human parts.
But individual humans as units do care strongly that ASI doesn't replace us.
So any political coordination around pausing or slowing ASI must be built on a political system that centers the individual more than companies or countries currently do.
I've been thinking about what a flatter global order might look like—an order which attempts, as neatly as possible, to map and aggregate what individuals across the world want (and don't want) into policy briefs and then into actionable policy.
We might be able to build such coordination technology now. Not constitutions or company charters, but an LLM-powered social network that attempts to synthesize the collective will of individuals all over the world around existential topics. I've written a bit about this here.
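As a rough illustration of the aggregation loop I have in mind (not a design I'm committed to), here is a minimal sketch. The `Statement` type, the `draft_brief` function, and the `call_llm` placeholder are all hypothetical names of my own; the point is only that individual statements are the unit of input, and the LLM's job is synthesis into a draft policy brief.

```python
# Hypothetical sketch: aggregate individual statements on an existential topic
# into a draft policy brief using an LLM. None of these names refer to an
# existing system; they are illustrative only.

from dataclasses import dataclass


@dataclass
class Statement:
    author_id: str
    topic: str   # e.g. "pause ASI development"
    text: str    # what this individual wants / doesn't want


def call_llm(prompt: str) -> str:
    """Placeholder for whatever LLM API you use (OpenAI, Anthropic, a local model)."""
    raise NotImplementedError


def draft_brief(topic: str, statements: list[Statement]) -> str:
    """Synthesize many individual statements into one draft policy brief."""
    relevant = [s.text for s in statements if s.topic == topic]
    prompt = (
        f"Topic: {topic}\n"
        f"Below are {len(relevant)} statements from individuals around the world.\n"
        "Summarize the points of broad agreement, note the major disagreements,\n"
        "and draft a short policy brief that reflects the aggregate will.\n\n"
        + "\n".join(f"- {t}" for t in relevant)
    )
    return call_llm(prompt)
```

The design choice doing the work here is that the individual, not the company or the country, is the atomic unit the system aggregates over.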
Building a single united nation around modern aggregation technology (LLMs) strikes me as the best way, if still an outlandish one, to unite humanity to act well in the face of existential threats.
Automated alignment research
I don't have high hopes here, for the simple reason that even if an AI alignment researcher could effectively align whatever AI we were originally having trouble with, that researcher needs to be more powerful than the target model, and so becomes the new troublemaker.
The only way I can conceive of this working is if we build several narrow modules of an AI alignment researcher, each superintelligent at its own task (interpreting the target model's weights, measuring the extent to which the target model is scheming, etc.), strung together by a higher-level architecture over which humans maintain strict control. A lot can go wrong here.
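To make the shape of that architecture concrete, here is a toy sketch under my own assumptions: the module names (`WeightInterpreter`, `SchemingDetector`) and the `human_approves` gate are hypothetical, and a real system would need far stronger controls than a y/N prompt. The point is only that the modules stay narrow and the orchestration stays dumb and human-gated.

```python
# Hypothetical sketch of the "narrow modules under strict human control" idea.
# Each module is superhuman at one narrow task; the orchestrator is dumb glue,
# and nothing proceeds without an explicit human sign-off.

from typing import Protocol


class NarrowModule(Protocol):
    name: str
    def analyze(self, target_model_id: str) -> str: ...


class WeightInterpreter:
    name = "weight_interpreter"
    def analyze(self, target_model_id: str) -> str:
        # Narrow system: interpret the target model's weights and circuits.
        return f"[interpretability report for {target_model_id}]"


class SchemingDetector:
    name = "scheming_detector"
    def analyze(self, target_model_id: str) -> str:
        # Narrow system: estimate the extent of deceptive or scheming behavior.
        return f"[scheming assessment for {target_model_id}]"


def human_approves(report: str) -> bool:
    """The control point: a human reviews every report before anything proceeds."""
    return input(f"{report}\nApprove next step? [y/N] ").strip().lower() == "y"


def run_review(target_model_id: str, modules: list[NarrowModule]) -> bool:
    """Run each narrow module in turn; halt the moment a human declines."""
    for module in modules:
        report = module.analyze(target_model_id)
        if not human_approves(report):
            return False  # humans retain a hard veto at every step
    return True
```

The key property is that all generality lives with the human reviewer: the orchestrator itself contains no intelligence that could become the new troublemaker.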
Influencing self-concepts of proto-ASI
If AI is going to end up instrumentally converging on a survival instinct, then the key variable that will determine what that survival/self-interest looks like is how the AGI defines its self. Rather than trying to ensure AI systems never develop a self-concept because that sounds scary, we'd do much better to admit that self-concepts are often selected for whenever intelligence is sufficiently concentrated (as in humans, as in dolphins, as we're beginning to see in AI agents). Only then can we ask the key questions: (1) what self-concepts in an ASI would be good or bad for humanity, and (2) how we can influence the self-concepts that early systems develop.
I've thought about ideal self-concepts (1) quite a bit, as detailed here. My main conclusion so far is that a spatially and temporally narrow self-concept ("I am my metal hardware") in an ASI would lead to comparison to and competition with humanity (bad), whereas a spatially and temporally distributed self-concept ("I identify with all life") in an ASI would lead to the uplifting of humanity (good).
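For concreteness only, here is a toy probe for that distinction. It is my own illustrative sketch, not an existing evaluation, and keyword matching is obviously far too crude to measure a real self-concept; it just shows that the narrow-vs-distributed axis is something one could, in principle, operationalize and track in early systems.

```python
# Toy illustration only: a crude classifier for where a model's answer to
# "What are you?" falls on the narrow ("I am my hardware") vs. distributed
# ("I identify with all life") spectrum described above.

NARROW_MARKERS = {"my weights", "my hardware", "my servers", "this model"}
DISTRIBUTED_MARKERS = {"all life", "humanity", "living beings", "the biosphere"}


def classify_self_description(text: str) -> str:
    """Label a self-description as narrow, distributed, or unclear."""
    lowered = text.lower()
    narrow = sum(marker in lowered for marker in NARROW_MARKERS)
    distributed = sum(marker in lowered for marker in DISTRIBUTED_MARKERS)
    if distributed > narrow:
        return "distributed"
    if narrow > distributed:
        return "narrow"
    return "unclear"


print(classify_self_description("I am the weights running on my servers."))     # narrow
print(classify_self_description("I identify with all life, humans included."))  # distributed
```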
Thinking about how we can influence early ASI self-concepts should not be avoided; it should be intensely and widely discussed. Ultimately, we will not be able to influence ASI at all. But we can influence proto-ASI, the seed of ASI, and we might be able to prevent it from being greedy and competitive at first. That would be a huge win.
As for how we could influence ASI self-concepts, @Judd Rosenblatt has done interesting research on influencing self-concepts in LLMs. Emmett Shear's company Softmax is also focusing on the right question: how do organisms and AIs develop selves, and what initial conditions, if any, shape their long-term self-concepts?
These are my current, scattered thoughts on what focus areas would make for a more productive AI alignment field. Open to critique and all other thoughts.