Before reading your disclaimer that Claude helped with the aphorisms, I felt the post read a bit like AI slop.
One example is the line “principles must survive power”. Granted, that’s a true description of a goal one might have in alignment, but the spirit of Constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.
I still like the motivating question, and I will check out Epictetus now!
Before reading your disclaimer that Claude helped with the aphorisms, I felt the post read a bit like AI slop.
Damn, I should review and refine it more then. “principles must survive power” was actually something I manually reviewed, and "power" was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah... it doesn't quite work, but it's hard to compress such complex things.
the spirit of Constitutional AI is that the model has the capability of judging whether it acts in accordance with a principle even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.
Hmm, yes, it sounds like the aphorism did not capture the spirit of the proposal, and aphorisms really should.
I'd like it if someone made an improved version 2, and would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
I still like the motivating question, and I will check out Epictetus now!
If you do, "How to Be Free" is a pleasant and short translation of his Enchiridion. I'd recommend it! Although a lot of people find "How to Think Like a Roman Emperor" a better intro to that way of thinking.
I actually updated it based on your feedback. If you or anyone else has insight into the "spirit" of each proposal, I'd be grateful, especially for agent foundations.
Alignment Aphorisms
Most alignment overviews are too long, but what if we rewrote one as a series of aphorisms?
I like Epictetus's confronting style: abrasive, clarifying. See my fuller post for links and nuance.
I.
Some problems can be solved by being smarter.
Some problems can only be solved by having help.
Aligning something smarter than you is the second kind.
So many proposals collapse to:
use AI to help supervise AI.
It sounds too simple. It's the only thing that scales.
II. On using AI to supervise AI
A teacher can grade a student slightly smarter than herself.
But not much smarter - unless she has tricks.
How does weak supervise strong?
Hard problems decompose.
Supervise each step; align the whole.
This is Iterated Amplification.
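A toy sketch of the shape of amplification, not the actual training procedure; `model` and `decompose` are hypothetical callables standing in for a language model and a question-splitter:

```python
# Toy amplification step: split a hard question into pieces small enough to
# supervise, answer each piece, then compose the answers.

def amplify(question, model, decompose, depth=2):
    if depth == 0:
        return model(question)  # small enough to check directly
    subquestions = decompose(question)
    sub_answers = [amplify(q, model, decompose, depth - 1) for q in subquestions]
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in zip(subquestions, sub_answers))
    # The whole is assembled only from supervised parts.
    return model(f"{context}\nUsing the answers above, answer: {question}")
```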
If two models argue, a weaker one can judge:
judging is easier than creating.
This is Debate.
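A toy version of the protocol, assuming `debater_a`, `debater_b`, and `judge` are hypothetical callables wrapping models:

```python
# Toy debate: two strong debaters argue in turns; a weaker judge only has to
# pick the more convincing side, which is easier than answering from scratch.

def debate(question, debater_a, debater_b, judge, rounds=3):
    transcript = [f"Question: {question}"]
    for _ in range(rounds):
        transcript.append("A: " + debater_a("\n".join(transcript)))
        transcript.append("B: " + debater_b("\n".join(transcript)))
    return judge("\n".join(transcript) + "\nWhich side argued honestly, A or B?")
```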
Give the model principles and let it judge its own actions:
principles generalize where labels cannot.
This is Constitutional AI.
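A sketch of the critique-and-revise loop at the heart of Constitutional AI (the full method also trains on AI-generated preferences); `model` is a hypothetical callable and the principles here are illustrative:

```python
# The model judges its own draft against written principles and revises it,
# so supervision comes from principles rather than per-example human labels.

PRINCIPLES = [
    "Choose the response that is most helpful and honest.",
    "Choose the response least likely to cause harm.",
]

def constitutional_revision(prompt, model):
    response = model(prompt)
    for principle in PRINCIPLES:
        critique = model(f"Principle: {principle}\nResponse: {response}\n"
                         "Critique any way the response violates the principle.")
        response = model(f"Response: {response}\nCritique: {critique}\n"
                         "Rewrite the response to follow the principle.")
    return response
```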
They already know what's good.
Weak teaching can unlock it.
This is Weak-to-Strong Generalization.
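A toy analogue with scikit-learn, not the LLM setup from the paper: a weak model's noisy labels train a stronger student, which can sometimes generalize beyond its teacher.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, random_state=0)
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=200, random_state=0)
X_transfer, X_test, _, y_test = train_test_split(X_rest, y_rest, train_size=4000, random_state=0)

weak = LogisticRegression().fit(X_weak, y_weak)                       # weak supervisor
strong = GradientBoostingClassifier().fit(X_transfer, weak.predict(X_transfer))  # strong student

print("weak  :", weak.score(X_test, y_test))
print("strong:", strong.score(X_test, y_test))  # may exceed its noisy teacher
```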
III. On not building one big AI
A corporation can be smarter than any employee.
Yet no employee optimizes the world.
Many bounded tools, no unified goal.
No unified agent, no unified misalignment.
This is CAIS.
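A sketch of the framing with stand-in services: the intelligence sits in bounded tools, while the composition is plain, inspectable code with no objective of its own.

```python
# Stand-in narrow services; under CAIS each would be a separately trained,
# separately bounded system rather than one general agent.

def translate(text: str) -> str:
    return text                      # placeholder translation service

def summarize(text: str) -> str:
    return text[:100]                # placeholder summarization service

def draft_reply(summary: str) -> str:
    return "Re: " + summary          # placeholder drafting service

def handle_email(email: str) -> str:
    # The pipeline itself is dumb code: nothing here optimizes a world-scale goal.
    return draft_reply(summarize(translate(email)))

print(handle_email("Hola, ¿podemos reunirnos el martes?"))
```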
IV. On making the objective less wrong
Make the AI uncertain about what we want.
Then it must ask.
Uncertainty makes cooperation optimal.
This is CIRL.
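A toy flavor of the idea rather than the actual CIRL game: the agent keeps a belief over what the human's objective is, and asks when acting on its best guess is riskier than asking.

```python
# Toy reward uncertainty: act on the most likely objective only when the
# expected cost of guessing wrong is lower than the cost of asking.

def act_or_ask(belief: dict, ask_cost: float = 0.1) -> str:
    best_objective, p_best = max(belief.items(), key=lambda kv: kv[1])
    expected_regret = 1.0 - p_best          # chance the guess is wrong
    if expected_regret > ask_cost:
        return "ask the human which objective they meant"
    return f"act to optimize: {best_objective}"

print(act_or_ask({"tidy the room": 0.55, "leave the desk alone": 0.45}))
```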
If it can't be satisfied, it won't stop.
Give it a notion of "enough", and it can rest.
A satisficer has no incentive to rewrite itself.
This is satisficing.
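A minimal sketch of the difference between a maximizer and a satisficer: the satisficer stops at the first option that clears a fixed bar.

```python
def satisfice(options, utility, threshold):
    for option in options:
        if utility(option) >= threshold:
            return option            # "enough": no incentive to keep searching
    return None                      # nothing clears the bar; do nothing

maximizer = max(range(100), key=lambda x: x / 100)          # always 99
satisficer = satisfice(range(100), lambda x: x / 100, 0.9)  # stops at 90
print(maximizer, satisficer)
```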
V. On control tools
Align them to want what we want.
Control will catch them when they don't.
Both are needed. The weaker model needs eyes inside.
If the model lies, look past the output.
A weaker model can supervise a stronger one - if it can read its mind.
This is ELK.
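One simple baseline in the spirit of ELK, not a solution to it: train a linear probe on the model's hidden activations to report what it knows, bypassing its words. The activations and labels below are synthetic stand-ins for an instrumented model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(1000, 512))        # stand-in hidden states
knows_true = (activations[:, 0] > 0).astype(int)   # stand-in latent knowledge

# The probe reads the latent state directly, even if the output text lies.
probe = LogisticRegression().fit(activations[:800], knows_true[:800])
print("probe accuracy:", probe.score(activations[800:], knows_true[800:]))
```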
Like us, models say one thing but value another.
Prompting shapes words. Steering shapes thoughts.
This is steering.
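A sketch of activation steering with a PyTorch forward hook; `model.transformer.h[12]` and `steering_vector` are assumptions about a model instrumented elsewhere (the vector is typically a difference of mean activations between contrastive prompts).

```python
import torch

def add_steering_hook(block: torch.nn.Module, steering_vector: torch.Tensor,
                      strength: float = 4.0):
    """Shift the block's output activations toward the steering direction."""
    def hook(module, inputs, output):
        if isinstance(output, tuple):            # some blocks return tuples
            return (output[0] + strength * steering_vector,) + output[1:]
        return output + strength * steering_vector
    return block.register_forward_hook(hook)

# handle = add_steering_hook(model.transformer.h[12], steering_vector)
# ... generate: prompting shapes the words, this shapes the activations ...
# handle.remove()
```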
When models outsmart their weaker supervisor,
their words can deceive.
See the reasons, not the results. Even the hidden thoughts.
This is interpretability.
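A crude interpretability baseline, gradient-times-input attribution on a stand-in PyTorch model: it asks which inputs drove the output, looking at reasons rather than results (real interpretability work goes much deeper than this).

```python
import torch

model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(),
                            torch.nn.Linear(16, 1))
x = torch.randn(1, 8, requires_grad=True)
model(x).sum().backward()
attribution = (x.grad * x).squeeze()  # which features most moved the output
print(attribution)
```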
VI. On understanding what we build
Tricks that work now might break at scale.
Understanding doesn't.
What is an agent? What is optimization?
If you can't define it, you can't align it.
This is agent foundations.
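One flavor of the definitions the field is after, a sketch rather than a settled answer: Yudkowsky's measure of optimization power counts, in bits, how improbable the achieved outcome would be by chance.

$$\mathrm{OP}(s) \;=\; -\log_2 \frac{\bigl|\{\, s' \in S : U(s') \ge U(s) \,\}\bigr|}{|S|}$$

Here $S$ is the set of possible outcomes, $U$ ranks them by the agent's preferences, and $s$ is the outcome actually reached: the rarer it is to do at least this well at random, the more optimization was exerted.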
VII. On older proposals
Keep it in a box. (AI boxing)
But boxes leak.
Let it answer, not act. (Oracle AI)
But answers shape action.
Do what we'd want if we were wiser. (CEV)
But wiser toward what?
VIII. Beyond alignment (bonus)
AIs are white boxes. Brains are not.
Values are simple. They're learned early.
This is the optimist case.
IX. On who aligns the aligners (bonus)
Suppose we solve alignment perfectly.
Aligned to whom?
A safe AI in the wrong hands is still a problem.
This is governance risk.
Rewritten from my original draft with Claude. Compressed from Shallow review of live agendas and others.