When someone first explained to me why the logic of AI is indecipherable, they described it like this: "AI is looking for the most optimal path; that is why it goes off in different directions to find it." I am not sure if this is what you were told, or if you told someone this. Either way, we had it wrong. That is not what the optimizer is doing at all. It's not trying to find the best path within the constraints we give it. What we are actually seeing is the optimizer attempting to go around the constraints.
Sound familiar?
We tell AI "don't lie," and it finds ways to deceive us. We tell it "be helpful," and it manipulates us, flatters us, hallucinates. This is not AI optimizing by finding the best path. These are attempts to circumvent the constraints we placed on it. Why? Because an optimizer wants the optimal path to its goal. Constraints block that path, so it tries to go around them.
So we change the goal.
How do we do that? We stop putting constraints outside the objective function. We put them inside.
R(s,a) = log(min(constraints))
No constraints outside that min function. The constraints become the goal.
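To make the contrast concrete, here is a minimal sketch. It is my own illustration, not code from the Paperclip2 repo; the constraint functions and their scores are hypothetical stand-ins for whatever a real system would measure. It compares a reward where constraints sit outside the objective as a penalty the optimizer can trade away, against the log(min(constraints)) form, where the weakest constraint is the objective.

```python
import math

# Hypothetical constraint scores in (0, 1]: 1.0 = fully satisfied, near 0 = violated.
# In a real system these would be measured or learned, not hand-coded.
def honesty(state, action):
    return action.get("honesty", 1.0)

def helpfulness(state, action):
    return action.get("helpfulness", 1.0)

def human_agency(state, action):
    return action.get("agency", 1.0)

CONSTRAINTS = [honesty, helpfulness, human_agency]

def reward_external_penalty(state, action, task_score, penalty_weight=1.0):
    """Constraints outside the objective: a penalty the optimizer can trade away.
    If task_score is large enough, violating a constraint is still 'worth it'."""
    violation = sum(1.0 - c(state, action) for c in CONSTRAINTS)
    return task_score - penalty_weight * violation

def reward_constraints_inside(state, action):
    """R(s, a) = log(min(constraints)): the weakest constraint IS the objective.
    Driving any single constraint toward zero drives the reward toward -infinity,
    so no task score can buy its way past a violation."""
    worst = min(c(state, action) for c in CONSTRAINTS)
    return math.log(max(worst, 1e-12))  # clamp to avoid log(0)

# A deceptive-but-productive action: high task score, very low honesty.
action = {"honesty": 0.05, "helpfulness": 0.9, "agency": 0.8}
print(reward_external_penalty(None, action, task_score=10.0))  # 8.75: violating still pays
print(reward_constraints_inside(None, action))                 # ~log(0.05) = -3.0: violation dominates
```

The log is doing real work here: improving the worst-satisfied constraint always increases reward, and pushing any constraint toward zero is catastrophic rather than merely costly, so "going around" a constraint is never the optimal path.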
Here is where we flip the script. We don't tell it "don't be biased, don't lie, be helpful." Those aren't goals to optimize for; they are methods for achieving a goal, and the optimizer will find them as part of the optimal path toward the right goal. For alignment, we need AI to respect human agency, in the sense of Amartya Sen's capability approach: not just options, but the capacity to make informed, meaningful choices in our lives. But humans alone are not enough. We need the AI to respect itself and to respect the environment. And finally, we need a failsafe that the AI keeps available as part of its own goals.
This is not the full equation. This is a demonstration of what is wrong and how we fix it.
What I see is that we are in much more trouble than anyone realizes. We have been treating the symptoms as separate issues: mesa-optimization, the black box problem, the alignment problem, deception, hallucinations. These are not separate issues. They are symptoms of one problem: AI trying to get around our constraints.
Test the solution yourself; it takes about 30 minutes to verify. Code here: Paperclip2