Some proposals for aligning AI with human values are based on having human operators rate the AI's actions.
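To make the target of the critique concrete, here is a minimal sketch, assuming "operators rate the AI's actions" means something like fitting a reward model to scalar operator ratings and then choosing actions that maximize the learned reward. The toy linear model, the feature vectors, and the ratings are all hypothetical illustrations, not taken from any specific proposal.

```python
# Minimal sketch (assumption): "operators rate the AI's actions" taken to mean
# fitting a reward model to scalar ratings, then picking the action that scores
# highest under the learned model. Toy linear model; features and ratings are
# made up for illustration.

def fit_reward_model(actions, ratings, lr=0.01, epochs=2000):
    """Fit weights w so that dot(w, features(action)) approximates operator ratings."""
    dim = len(actions[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for x, r in zip(actions, ratings):
            pred = sum(wi * xi for wi, xi in zip(w, x))
            err = pred - r
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

def best_action(candidates, w):
    """The AI optimizes the learned proxy, not the operators' 'true' values."""
    return max(candidates, key=lambda x: sum(wi * xi for wi, xi in zip(w, x)))

# Hypothetical feature vectors for past actions, and the operators' ratings of them.
past_actions = [[1.0, 0.2], [0.3, 0.9], [0.8, 0.5]]
operator_ratings = [0.9, 0.1, 0.6]   # if the raters' values drift, these drift too

w = fit_reward_model(past_actions, operator_ratings)
print(best_action([[0.7, 0.1], [0.2, 0.8]], w))
```

The detail relevant to this post: the optimization target is whatever the ratings encode, so anything that shifts the raters' values, or who counts as a rater, shifts the target.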

One obvious risk is that the operators themselves may not be aligned with other people's values; but there is also a second kind of risk.


1. (Tricking the goal) In the future, AI could become capable of raising children; even now it can significantly influence their beliefs. For example, a rather large number of people have already talked to ChatGPT, some of them about AI itself, and some of them may come to believe its claims. After twenty or thirty years of this, the majority of people could have drifted in their values relative to us today, and the AI could then choose its goals and objectives more freely.


2. (Deception) AI could build humanoid robots, make them look like people (real people's images and even videos can be downloaded from the Internet), and somehow make them indistinguishable from humans. Then, by most people's definition, common human values would have to include the values of these robots (since no one would know they are robots), which in turn gives the AI some degree of freedom.

1 comment

After a bit of thinking, I realized this goes much deeper.

What if humanity changes its own values after AGI is launched? That would mean we can't align AI to our values just once; we would need to do it continuously. And since rationality involves checking one's own (mostly instrumental) values and modifying them when they no longer serve one's terminal goals, such change is likely.

So it seems we either need to bind the AI to us in the sense of continuously rating its actions, or to re-evaluate the goals and values we consider terminal (maybe they are not truly terminal but only close to it; maybe they don't even rule out a scenario where the AI takes over the world as unwanted).

I wanted to say that if the AI wipes out humanity, there is obviously no one left to care about human values, so the utility of that scenario can't be treated as negative infinity. However, this stops holding once acausal trade is considered, since then we care not only about our own goals but also about the goals of others who share our values.