Epistemic status: The further you scroll, the more important the points are.

I continue my deconfusion from my alignment timeline post. Goal: figure out which sub-problem to work on first in AI alignment.

Quick Notation

1 = steering

2 = goal

1.1k no steering found

1.2k poor steering found

1.3g steering lets us do a pivotal act

1.4g steering + existing goal inputs

1.5g steering not needed, alignment by default

2.1k no goal found, even with steering

2.2k poor goal found, combined with steering

2.2g goal is good/enough

The Logic

Let's say that my effort single-handedly changes the relative balance of research between 1 and 2. So we ignore scenarios where my work doesn't do anything. (By the same intuition, we ignore 1.5g, since that doesn't require effort.)

If I research 1 (first) and X happens, what happens?

  • I find bad ideas I think are good. --> I overconfidently promote them, wasting time at best and causing 1.2k at worst.

  • I find good ideas. --> 1 is solved --> 1.3g, 1.4g, or 2 must happen.

    • 1.3g: I, or a group I trust, must be able to do a pivotal act before 1's solution leaks/is independently re-discovered.

    • 1.4g: Good ending is easy to achieve by quickly coding an implementation of 1's solution.

    • 2: I, or a group I trust, must be able to solve 2 before 1's solution leaks/is independently re-discovered.

If I research 2 (first) and X happens, what happens?

  • I find bad ideas I think are good. --> I overconfidently promote them, wasting time at best and causing 2.2k at worst.

  • I find good ideas. --> 2 is solved --> 1 must also be solved, for 2.2g to happen.

    • 1: I, or a group I trust, must be able to solve 1.
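The two branches above can also be written down as a small data structure. This is only a minimal sketch for illustration: the names `Outcome`, `FOLLOW_UPS`, `WORST_CASE`, and `count_races` are hypothetical, and the `race` flag is an assumed encoding that marks only the outcomes where the text explicitly says something must happen before 1's solution leaks or is independently re-discovered.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    label: str        # notation from this post, e.g. "1.3g"
    requirement: str  # what must go right for this outcome
    race: bool        # assumed flag: must it beat a leak/re-discovery of 1's solution?


# What "bad ideas I think are good" can cause at worst, per problem researched first.
WORST_CASE = {"1": "1.2k", "2": "2.2k"}

# What must follow after finding *good* ideas, per problem researched first.
FOLLOW_UPS = {
    "1": [
        Outcome("1.3g", "a trusted group does a pivotal act", race=True),
        Outcome("1.4g", "quickly code an implementation of 1's solution", race=False),
        Outcome("2", "a trusted group solves 2", race=True),
    ],
    "2": [
        Outcome("1", "a trusted group solves 1", race=False),
    ],
}


def count_races(first: str) -> int:
    """Count follow-up outcomes that involve racing against a leak."""
    return sum(outcome.race for outcome in FOLLOW_UPS[first])


if __name__ == "__main__":
    for first in ("1", "2"):
        print(f"research {first} first: worst case {WORST_CASE[first]}, "
              f"races against a leak: {count_races(first)}")
```

Under this encoding, researching 1 first leaves two follow-ups that depend on winning a race, while researching 2 first leaves none. The encoding is a judgment call (for example, 1.4g says "quickly" but is not marked as a race), so treat the counts as a reading aid rather than a result.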

The Bottom Line

I, personally, right now, should have the key research focus of "solve problem 2".

If I get interesting ideas about problem 1, I should write them down privately and... well, I'm not sure, but probably not publish them quickly and openly.

This will be the case until/unless something happens that would have made me change my above logic.

Some things that have not happened, but which I would expect to change my mind on the above points:

  • Problem 1 or problem 2 gets obviously-solved. --> Jump to the other respective branch of the logic tree.

  • A global AI pause actually occurs, in a way that actually constrains Anthropic, OpenAI, DeepMind, and Meta, such that AGI timelines can be pushed further out. --> Tentatively prioritize working on problem 1 more, due to its higher "inherent" difficulty than problem 2.

  • Cyborgism succeeds so well that it becomes possible to augment the research abilities of AI alignment researchers. --> Drop everything, go get augmented, reevaluate the alignment situation (including the existence of the augmentations!) with my newfound brainpower.

  • Some "new" Fundamental Fact comes to my attention, that makes me redraw the game tree itself or have different Alignment Timeline beliefs. --> I redraw the tree and try again.

  • I get feedback that my alignment work is unhelpful or actively counterproductive. --> I redraw the tree (if it's something minor), or I stop doing technical alignment research (if it's something serious and not-easily-fixable).

More footnotes about my Alignment Timeline specifically:

  • The failure mode of a typical "capabilities-frontier" lab (OpenAI, Anthropic, DeepMind) is probably either 1.1k, 1.2k, or 2.1k.

  • As far as I know, Orthogonal is the only group devoting serious effort to problem 2. Therefore, my near-term focus (besides upskilling and getting a grant) is to assist their work on problem 2.

  • Orthogonal's failure mode is probably 2.2k. In that scenario, we/they develop a seemingly-good formal-goal, give it to a powerful/seed AI on purpose, turn it on, and then the goal turns out to be lethal.

  • The components of my Timeline are orthogonal to many seemingly-"field-dividing" cruxes, including "scaling or algorithms?", "ML or math?", and "does future AI look more like ML or something else?". I have somewhat-confident answers to these questions, and so do other people, but the weird part is that I think others' answers are sometimes wrong, whereas they would think mine are either wrong or (at a first pass) mutually-exclusive.

    For example, I'm clearly going for theoretical-leaning work like MIRI and especially Orthogonal, and I also think future superhuman AI will be extremely ML-based. Many people think "ML is the paradigm AND formal alignment is unhelpful", or "ML is the wrong paradigm AND formal alignment is essential".

    I may write more about this, if it seems worth it and I have the time and energy.


NOTE: I used "goal", "goals", and "values" interchangeably in some writings such as this, and this was a mistake. A more consistent frame would be "steering vs target-selection" (especially as per the Rocket Alignment analogy).