How to specify an alignment target

This is great! I am puzzled as to how this got so few upvotes. I just added a big upvote after getting back to reading it in full.

I think consideration of alignment targets has fallen out of favor as people have focused more on understanding current AI and technical approaches to directing it - or completely different activities for those who think we shouldn't be trying to align LLM-based AGI at all. But I think it's still important work that must be done before someone launches a "real" (autonomous, learning, and competent) AGI.

I agree that people mean different things by alignment targets. I also think it's quite helpful to have a summary of how they're defined and implemented for training current systems.

This will be my new canonical reference for what is meant by "alignment target" in the context of network-based AI.

I have only one major hesitation or caveat. I am wary of even using the term "alignment" for current LLMs, because they do not strongly pursue goals in a consequentialist way nor do they evolve over time as they learn and think, as I expect future really dangerous AI. will do. LLM AGI will have memory, and memory changes alignment is my clearest statement to date of why. (I've also used the term The alignment stability problem for this gap in what current prosaic alignment work addresses. However using "alignment" and "alignment target" for both current and future systems seems inevitable and more-or-less correct; I just use and think about the two uses of the terms with caution.

Your proposal of an alignment target sounds a good bit like my Instruction-following and like Max Harms' Corrigibility as Singular Target, which I highly recommend if you want to pursue that direction.

^{^}

This is why I advocate defining What Success Looks Like fairly early on in any research agenda into risks from superhuman AI. I hope to reach this step in my own agenda reasonably soon!

^{^}

By this I mean the pre-training data of the base model, not what they refer to in the paper as ‘preference model pretraining’, which I included earlier as part (b) of the trainable encoding.

^{^}

Although, the alignment process is a little different.

^{^}

Assuming the alignment problem is solved, anyway.

^{^}

This will become more obvious later when we talk about dynamics.

^{^}

I want to put on the record that I believe moral philosophy is neither solvable nor the right system to use to address important value questions. Morality is primarily a practical discipline, not a theoretical one.

^{^}

I have my suspicions that, in order for the AI to reliably do what you mean rather than merely what you say, this actually cashes out as functionally the same as aligning the AI to your values. More specifically, when you ask an AI to do something, what you really mean is ‘do this thing, but in the context of my broader values’. The AI can only deal with ambiguity, edge cases, and novelty by referring to a set of values. And you can’t micromanage if it is trying to solve problems you don’t understand.

^{^}

If you’ve just read that and thought ‘well that’s vague and not particularly useful’, you’ll maybe see why I designated the audience for the high-level idea step as the author themselves.

LESSWRONG
LW

LESSWRONG
LW

14

How to specify an alignment target

14

14

Moving beyond ideas

Target-setting processes

Keeping up with the times

What next?