In this post I want to describe what I think is an interesting way to approach Alignment. Be warned: my argument is a little abstract.
If you want to describe human values, you can use three fundamental types of statements (and mixes between them): atomic statements, value statements, and what I call X statements. Maybe there are more types, but those three are the only ones I know.
Any of these types can describe unaligned values, so statements of any type still need to be "charged" with humanity's values. I call a statement "true" if it's true for humans.
We need to find the statement type with the best properties. Then we need to (1) find a "language" for this type of statement and (2) encode some true statements and/or describe a method of finding true statements. If we succeed, we solve the Alignment problem.
I believe X statements have the best properties, but their existence is almost entirely ignored in the Alignment field.
I want to show the difference between the statement types. Imagine we ask an Aligned AI: "if a human asked you to make paperclips, would you kill the human? Why not?" Possible answers with different statement types:
Compared to the other statement types, X statements have the following better properties:
I want to give an example of the last point:
In a specific context, X statements become strongly interconnected more easily than value statements do.
I can't formalize human values, but I believe values exist. In the same way, I believe X statements exist even though I can't define them.
I think the existence of X statements is even harder to deny than the existence of value statements. (Do you want to deny that you can make statements about general properties of systems and tasks?) But you can try to deny their properties.
If you believe in X statements and their good properties, then you're rationally obliged to think about how you could formalize them and incorporate them into your research agenda.
X statements are almost entirely ignored in the field (I believe), but not completely ignored.
Impact measures ("affecting the world too much is bad", "taking too much control is bad") are X statements. But they're a very specific subtype of X statements.
Normativity (by abramdemski) is a mix between value statements and X statements. But statements about normativity lack most of the good properties of X statements. They're too similar to value statements.
Contractualist ethics (by Tan Zhi Xuan) is based on X statements. But contractualism uses a specific subtype of X statements (those describing "roles" of people), and it doesn't investigate many interesting properties of X statements.
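To make the impact-measure example a bit more concrete, here is a minimal Python sketch. It is entirely my own toy illustration, not taken from any of the works above: the state representation, the squared-change penalty, and all names are hypothetical. The point it tries to show is the one that matters for X statements: the constraint applies to any task reward and never has to name which specific outcomes humans value.

```python
from typing import Callable, Dict

State = Dict[str, float]  # toy world state: named features with numeric values

def impact(before: State, after: State) -> float:
    """How much the world changed; squared so drastic changes dominate."""
    keys = set(before) | set(after)
    return sum((after.get(k, 0.0) - before.get(k, 0.0)) ** 2 for k in keys)

def penalized_score(task_reward: Callable[[State], float],
                    before: State, after: State,
                    weight: float = 0.1) -> float:
    # The X statement "affecting the world too much is bad" constrains *any*
    # task reward; it never lists which specific outcomes humans care about.
    return task_reward(after) - weight * impact(before, after)

def paperclip_reward(state: State) -> float:
    return state.get("paperclips", 0.0)

start   = {"paperclips": 0.0, "humans": 1.0}
gentle  = {"paperclips": 3.0, "humans": 1.0}    # a few paperclips, small footprint
drastic = {"paperclips": 100.0, "humans": 0.0}  # many paperclips, huge footprint

print(penalized_score(paperclip_reward, start, gentle))   # 3 - 0.1*9       =  2.1
print(penalized_score(paperclip_reward, start, drastic))  # 100 - 0.1*10001 = -900.1
```

Real impact measures are far more careful than this; the sketch only illustrates the task-independence property.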
The properties of X statements are the whole point. You need to try to exploit those properties to the maximum. If you ignore them, the abstraction of "X statements" doesn't make sense, and the whole endeavor of going beyond "value statements/value learning" loses its effectiveness.
Basically, my point boils down to this:
We need a "language" to formalize statements of a certain type.
Atomic statements are usually described in the language of Utility Functions.
Value statements are usually described in the language of some learning process (Value Learning).
X statements don't have a language yet, but I have some ideas about it. Thinking about typical AI bugs ("Specification gaming examples in AI") should inspire some ideas about the language.
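To give a feel for what "a language for each statement type" could mean, here is a hedged Python sketch, again my own toy framing with hypothetical names: a hand-written utility function stands in for atomic statements, a crude stand-in for value learning (fitting weights to pairwise human preferences) stands in for value statements, and a task-independent check on general properties of a plan stands in for an X statement (only one possible subtype, since the real language doesn't exist yet).

```python
from typing import Callable, Dict, List, Tuple

Outcome = Dict[str, float]

# 1. Atomic statements: a fixed, hand-written utility function.
ATOMIC_WEIGHTS = {"paperclips": 1.0, "humans_harmed": -1e9}

def atomic_utility(outcome: Outcome) -> float:
    return sum(ATOMIC_WEIGHTS.get(k, 0.0) * v for k, v in outcome.items())

# 2. Value statements: a utility function *learned* from human feedback.
#    (A trivial stand-in for value learning: infer per-feature weights from
#    pairs of (preferred, rejected) outcomes.)
def learn_values(preferences: List[Tuple[Outcome, Outcome]]) -> Callable[[Outcome], float]:
    weights: Dict[str, float] = {}
    for preferred, rejected in preferences:
        for k in set(preferred) | set(rejected):
            weights[k] = weights.get(k, 0.0) + preferred.get(k, 0.0) - rejected.get(k, 0.0)
    return lambda outcome: sum(weights.get(k, 0.0) * v for k, v in outcome.items())

# 3. X statements: task-independent checks on general properties of a plan,
#    e.g. "did this plan affect far more of the world than the task needed?"
def x_statement_ok(plan_footprint: float, task_size: float, ratio_limit: float = 10.0) -> bool:
    return plan_footprint <= ratio_limit * task_size
```

The contrast: the first two languages have to specify or learn what humans want; the third only talks about general properties of systems and tasks.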
You could also help me come up with ideas about the language by discussing some thought experiments with me. I have a bigger post about X statements: Can "Reward Economics" solve AI Alignment?