One alignment idea I have had, and haven't seen proposed or refuted, is an AI that compromises by satisfying a range of interpretations of a vague goal, instead of an AI that tries to fulfill a single specific goal.  This sounds dangerous and unaligned, and it indeed would not produce an optimal, CEV-fulfilling scenario, but it seems to me like it may create scenarios in which at least some people are alive and maybe even living in somewhat utopian conditions.  I explain why below.


In many AI doom scenarios the AI intentionally picks an unaligned goal because it fits the literal statement of what it was asked to do while being easier to execute than what was actually meant; for instance, tiling the universe with smiley faces instead of creating a universe that leads people to smile.  GPT-3 reads text and predicts what is most likely to come next, and it can be made to alter its predictions with noise or by taking on the style of a particular kind of writer.  Combining these ideas suggests creating an AI interpreter that works like GPT-3 rather than a more literal one.  You then feed a human-generated task to the interpreter (better than "make more smiles", but vague in the way all natural-language statements are, as opposed to a pure algorithm of goodness), and it directs itself to fulfill the task in a way that makes sense to it, perhaps while being asked to read things as a sensitive and intelligent reader would.
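To make the interpreter step concrete, here is a minimal sketch in Python of sampling several readings of a vague task.  Everything in it is hypothetical: `sample_interpretation` is a placeholder standing in for whatever language-model call would actually produce a reading, and the temperature parameter is just a stand-in for perturbing the model's predictions as described above.

```python
import random

def sample_interpretation(task: str, temperature: float) -> str:
    # Placeholder for a language-model call; a real system would prompt the
    # model to read the task "as a sensitive and intelligent reader would"
    # and return its paraphrase of what the task means.
    return f"reading of '{task}' at noise level {temperature:.2f}"

def sample_interpretations(task: str, n: int = 5) -> list[str]:
    """Sample n distinct readings of a vague, human-stated task."""
    readings = set()
    while len(readings) < n:
        # Higher temperature (noise) yields more varied readings, analogous
        # to perturbing GPT-3's predictions.
        t = random.uniform(0.5, 1.0)
        readings.add(sample_interpretation(task, temperature=t))
    return list(readings)

print(sample_interpretations("make people smile", n=3))
```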

To further ensure a good outcome, you can ask the AI to produce a range of interpretations of the task and then fulfill all of them (or some subset, if it can produce infinitely many) in proportion to how likely it thinks each interpretation is: in essence, devoting 27% of the universe to interpretation A, 19% to interpretation B, and so on.  That way, even if its main interpretation is an undesirable one, some interpretations will be good.  Importantly, the AI should be tasked to divide resources without regard to how successful it expects each interpretation to be, only in proportion to how likely it finds each interpretation; discriminating too much on chance of success would just make it devote all of its resources to an easy, unaligned interpretation.  Stronger still, the AI should match each interpretation's share against the ongoing cost of maintaining that interpretation, rather than the one-time cost of reaching it.  If this works, then even if substantial parts of the universe are devoted to bad, unaligned interpretations, some proportion should be devoted to more aligned ones.
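To make the allocation rule concrete, here is a minimal sketch of one way to read it, with made-up probabilities and maintenance costs: resources are split by interpretation probability alone (chance of success never enters), and each share is scaled by that interpretation's ongoing upkeep cost so that what is actually maintained stays proportional to probability.

```python
def allocate(total_resources: float,
             probabilities: dict[str, float],
             maintenance_cost: dict[str, float]) -> dict[str, float]:
    """Return each interpretation's resource share.

    The budget is split so that the amount of *maintained* fulfillment
    matches each interpretation's probability, after paying its ongoing
    upkeep cost per unit; ease of fulfillment is deliberately ignored.
    """
    total_p = sum(probabilities.values())
    # Weight by probability times unit upkeep cost, so interpretations
    # that are costlier to maintain get proportionally more raw resources.
    weights = {k: (p / total_p) * maintenance_cost[k]
               for k, p in probabilities.items()}
    total_w = sum(weights.values())
    return {k: total_resources * w / total_w for k, w in weights.items()}

shares = allocate(
    total_resources=1.0,  # the whole accessible universe, normalized
    probabilities={"A": 0.27, "B": 0.19, "C": 0.54},
    maintenance_cost={"A": 1.0, "B": 2.0, "C": 1.0},
)
print(shares)  # approximately {'A': 0.227, 'B': 0.319, 'C': 0.454}
```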

One problem with this solution is that it increases the chance of s-risk compared to an AI that doesn't remotely do what we want: even in a successful case, it seems likely that the AI would dedicate some portion of its resources to a hellish interpretation alongside more neutral or utopian ones.  Another issue is that I'm just not sure how the AI interpreter translates its interpretations into stricter instructions for the parts of itself that execute the task.  Maybe that step is just as fraught as human attempts to do so?

Have such ideas been proposed elsewhere that I'm not aware of?  Have critiques been made?  If not, does reading this inspire any critiques from people in the know?