Before reading your disclaimer that Claude helped with the aphorisms, I felt like the post felt a bit like AI slop.
One example is the line “principles must survive power”. Granted that’s a true description of a goal one might have in alignment, but the spirit of constitutional ai is that the model has the capability of judging whether it acts in accordance with a principle even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.
I still like the motivating question, and I will check out Epictetus now!
> Before reading your disclaimer that Claude helped with the aphorisms, I felt like the post felt a bit like AI slop.
Damn, I should review and refine it more then. “principles must survive power” was actually something I manually reviewed, and "power" was meant to aphoristically reflect that the constitutional principles must scale with capabilities. Yeah... it doesn't quite work, but it's hard to compress such complex things.
> the spirit of constitutional ai is that the model has the capability of judging whether it acts in accordance with a principle even if it can’t always act correctly. So I’m not sure what that line is saying, and it felt a bit meaningless.
Hmm, yes, it sounds like it did not capture the spirit of it, and aphorisms really should.
I'd like it if someone made an improved version 2, and I would personally benefit from reading it, so feel free to make a new version or propose a better aphorism.
> I still like the motivating question, and I will check out Epictetus now!
If you do, "How to be free" is a pleasant and short translation of his Enchiridion; I'd recommend it! Although a lot of people find "How to think like a Roman Emperor" a better introduction to that way of thinking.
Most alignment overviews are too long, but what if we rewrote one as a series of aphorisms?
I like Epictetus's confronting style: abrasive, clarifying. See my fuller post for links and nuance.
I. On the problem
Some problems can be solved by being smarter.
Some problems can only be solved by having help.
Aligning something smarter than you is the second kind.
So many proposals collapse to:
use AI to help supervise AI.
It sounds simple. It's the only thing that works.
II. On using AI to supervise AI
If a weak model helps supervise a strong one,
and that one supervises a stronger one still:
this is Iterated Amplification.
The chain must hold.
If two models argue and we watch:
this is Debate.
Checking is easier than creating.
If we give the model principles and ask it to judge itself:
this is Constitutional AI.
Principles must survive power.
If we build AI to do alignment research:
this is Superalignment.
Train it. Validate it. Stress test it.
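(If the Constitutional AI stanza is too compressed: here is a minimal sketch of its critique-and-revise loop in Python. The `generate` function stands in for any language-model call, and the principles are illustrative placeholders, not an actual constitution.)

```python
# Minimal sketch of a constitutional critique-and-revise loop.
# `generate` is whatever function calls your language model;
# the principles below are illustrative, not a real constitution.
from typing import Callable

PRINCIPLES = [
    "Choose the response that is most helpful and honest.",
    "Avoid responses that assist with harmful activity.",
]

def constitutional_revision(prompt: str, generate: Callable[[str], str]) -> str:
    """Draft a response, then have the model critique and rewrite it
    against each principle in turn."""
    response = generate(prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique this response against the principle: {principle}"
        )
        response = generate(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # In Constitutional AI proper, the revised responses become training
    # data; this loop only shows the self-judging step.
    return response
```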
III. On not building one big AI
A corporation can be smarter than any employee.
Yet no employee wants to take over the world.
Many narrow tools, none with the full picture—
this is CAIS.
IV. On making the objective less wrong
Make the AI uncertain about what we want.
Then it must ask.
This is CIRL.
If it can't be satisfied, it won't stop.
Give it a notion of "enough", and it can rest.
This is satisficing.
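(A toy calculation, for those who want one: a made-up value-of-information example in Python showing why reward uncertainty makes asking worthwhile. The numbers and the tea/coffee hypotheses are invented; CIRL proper is a full two-player game, not this one-step choice.)

```python
# Toy illustration of the CIRL intuition: the agent is unsure which
# reward function the human has, and asking is itself an option.
# All numbers here are made up for illustration.

belief = {"wants_tea": 0.55, "wants_coffee": 0.45}   # agent's uncertainty
reward = {                                           # reward per hypothesis
    "make_tea":    {"wants_tea": 1.0, "wants_coffee": -1.0},
    "make_coffee": {"wants_tea": -1.0, "wants_coffee": 1.0},
}
COST_OF_ASKING = 0.1

def expected(action: str) -> float:
    return sum(belief[h] * reward[action][h] for h in belief)

best_now = max(reward, key=expected)   # best guess if forced to act now
value_now = expected(best_now)         # 0.55*1 + 0.45*(-1) = 0.1

# Asking reveals the true hypothesis, after which the agent acts optimally.
value_after_asking = sum(
    belief[h] * max(reward[a][h] for a in reward) for h in belief
) - COST_OF_ASKING                     # 1.0 - 0.1 = 0.9

print("ask first" if value_after_asking > value_now else f"act: {best_now}")
```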
V. On control tools
If the model lies, look past the output.
Find what it knows, not what it says.
This is ELK.
If prompting fails, steer the internals.
This is activation steering.
Output-based evaluation breaks
when models are smarter than evaluators.
Look for the thought, not the performance.
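(For the "steer the internals" line: a minimal sketch of activation steering on a toy PyTorch model, adding a fixed vector to one layer's activations at inference time. The model, layer, and vector are stand-ins; in practice the vector is usually derived from contrasting prompts and added to a transformer's residual stream.)

```python
# Toy activation steering: shift one layer's activations along a fixed
# direction via a forward hook. Everything here is a stand-in for the
# real setting (a transformer's residual stream, a contrast-derived vector).
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
steering_vector = 0.5 * torch.randn(16)  # would come from contrasting activations

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + steering_vector

handle = model[1].register_forward_hook(steer)
x = torch.randn(1, 8)
steered = model(x)
handle.remove()
unsteered = model(x)
print(steered - unsteered)  # nonzero: the intervention changed the output
```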
VI. On understanding what we build
You can't align what you don't understand.
This is agent foundations.
What is an agent?
What is optimization?
Confused concepts make confused safety arguments.
VII. On older proposals
Keep it in a box. (AI boxing)
But boxes leak.
Let it answer, not act. (Oracle AI)
But answers shape action.
Do what we'd want if we were wiser. (CEV)
But how do we extrapolate?
VIII. Beyond alignment (bonus)
Maybe the problem is easier than it looks.
It still takes correction.
This is the optimist case.
IX. On who aligns the aligners (bonus)
Suppose we solve alignment perfectly.
Aligned to whom?
A safe AI in the wrong hands is still a problem.
This is governance risk.
Rewritten from my original draft with Claude.