by [anonymous] · 19th Apr 2023

How the frame of training stories improves my thinking: 

[Below are notes I wrote a few months ago after reading the training stories post by Evan Hubinger, reflecting on how it helped me think more clearly about alignment. I list some ways it informs how I evaluate and prioritise alignment proposals and other ways of reducing x-risk; these were not obvious to some people I talked with, so I want to share them here.]

In short, the training stories post says that when thinking about the alignment of an AI system, you should have a training story in mind: specify the training goal, then the training rationale (how training will actually end up producing a model with that goal). You then evaluate the story on four criteria:

  1. Training goal desirability: how good would it be if the trained model actually satisfied the goal?
  2. Training goal competitiveness: how capable would an AI with that goal be in a world with other, competing AIs?
  3. Training rationale alignment: how robustly will the training process produce an agent that satisfies the desired goal?
  4. Training rationale competitiveness: how expensive would it be to carry out that training rationale?

What this frame allows you to do in practice:

If you want to evaluate an alignment proposal X, you can ask:

  1. Tell me the best training story that requires X as a step and ends up with a good AI.
    1. (… -> B -> … -> X -> … -> C) [training rationale] -> a form of good AI [training goal]
      1. (…) indicates steps we roughly understand how to take.
      2. Capitalised letters represent problems that still need to be solved for the training story to work.
  2. Then ask:
    1. Is there a version of the training story that does not require X?
      1. E.g. (… -> B -> … -> … -> C) -> a form of good AI
    2. Within that training story, why is step X the most important one to work on?
      1. Is it because the other steps are not yet feasible, or because X is the most likely to break, so you want to test it first?
      2. Which step would be the most important to work on?
    3. Evaluate the training story from above on the four criteria (a toy sketch of this bookkeeping follows below).
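To make that bookkeeping concrete, here is a minimal toy sketch (mine, not from the training stories post): a training story represented as a chain of steps plus scores on the four criteria. The `TrainingStory` class, the `overall` and `requires` helpers, the example stories, and all numbers are hypothetical and purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class TrainingStory:
    name: str
    steps: list              # e.g. ["...", "B", "...", "X", "...", "C"]; capital letters = open problems
    goal: str                 # the training goal, e.g. "a form of good AI"
    # Scores in [0, 1] on the four criteria of the training stories frame.
    goal_desirability: float
    goal_competitiveness: float
    rationale_alignment: float
    rationale_competitiveness: float

    def overall(self) -> float:
        # Crude aggregate: treat a story as only as strong as its weakest criterion.
        return min(self.goal_desirability, self.goal_competitiveness,
                   self.rationale_alignment, self.rationale_competitiveness)

    def requires(self, step: str) -> bool:
        return step in self.steps

# Entirely made-up example stories: one that needs X, one that routes around it.
stories = [
    TrainingStory("story-with-X", ["...", "B", "...", "X", "...", "C"], "a form of good AI",
                  0.8, 0.6, 0.5, 0.7),
    TrainingStory("story-without-X", ["...", "B", "...", "...", "C"], "a form of good AI",
                  0.8, 0.6, 0.3, 0.7),
]

# Question 2.1 above: is there a training story that does not require X, and how good is it?
alternatives = [(s.name, s.overall()) for s in stories if not s.requires("X")]
print(alternatives)  # [('story-without-X', 0.3)]
```

The point is not the numbers but the shape of the comparison: X only looks worth working on if the best stories that need it beat the best stories that do not.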

 

If you want to come up with new research directions, you can:

  1. Create many training stories that involve steps that you don't know how to take.
    1. E.g. (… -> K -> Q -> C -> … -> N) -> a form of aligned AI
  2. Evaluate these different training stories based on:
    1. how hard you expect it to be to solve all the missing steps (K, Q, C, N),
    2. training goal desirability,
    3. training goal competitiveness,
    4. training rationale alignment, and
    5. training rationale competitiveness.
  3. Working on solving K is more promising if (see the toy sketch after this list):
    1. It is included in many training stories.
    2. It is included in the highest-ranked training stories.
    3. It also has high information value:
      1. E.g. you will figure out whether K is impossible or very hard, and hence whether you should deprioritise this and other training stories.
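Continuing the toy sketch from above (still entirely hypothetical, and meant to be run after it), the first two criteria in the list can be written as a simple aggregation: each open problem earns credit from every training story that includes it, weighted by how highly that story ranks. The information-value criterion is not modelled and would need a separate estimate.

```python
from collections import defaultdict

def step_priority(stories, open_problems):
    """Rank open problems by how many training stories need them, weighted by story quality."""
    scores = defaultdict(float)
    for story in stories:
        for step in story.steps:
            if step in open_problems:
                # A step earns credit from every story that includes it, weighted by
                # that story's overall score, so steps in many strong stories come out on top.
                scores[step] += story.overall()
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Reuses the made-up `stories` list from the earlier sketch; K, Q and N do not appear
# in those stories, so only B, C and X get non-zero scores here.
print(step_priority(stories, open_problems={"B", "C", "K", "Q", "N", "X"}))
```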

 

Threat modelling improves estimates of training goal desirability and training rationale alignment:

  1. E.g. if you showed that the very-quick-self-improvement view is correct, you would have changed the desirability of a bunch of training goals that are insufficient in a quick-self-improvement world.
  2. You can evaluate threat-modelling work by how much it changes the promisingness of different training stories.

Governance work is partly about making a lack of training goal competitiveness and training rationale competitiveness less costly:

  1. If you have a large lead, then spending more money on compute for the aligned AI (i.e. accepting lower training rationale competitiveness) is more likely to work in practice.
  2. If labs coordinate, it is fine if your training goal competitiveness is not that high.

Why all of this has helped me (and might also help you): 

  1. I think this helps me ask better questions about others’ research directions. 
  2. Writing this down, I noticed I don't know of much work on making a lack of training goal competitiveness less costly.
  3. It made me realise that I don't know many concrete training stories, and that coming up with them is even harder than I expected.

 

Potential limitations of the frame (I don't find them super convincing):

  1. AI alignment is pre-paradigmatic, and the training stories frame seems to ignore that fact. Doing robustly useful things may be better (but what is robustly useful? Others suggest interpretability or agent foundations).
    1. Maybe training stories are themselves a useful path to figuring out what is robustly useful.
    2. But often one may want to remain at a higher level. E.g. for work on abstractions, one can argue that abstractions are a bottleneck for interpretability and that they tell us whether the natural abstraction hypothesis (NAH) is true; the assumption here is that interpretability is needed for many of the most promising training stories.