This post addresses the best methods we have for creating (relatively) aligned AGI. The safety of these methods IS NOT guaranteed. AI Alignment is not a solved issue.

These methods appear to offer better solutions than existing ones, such as simple maximizers. There are strong arguments that a sufficiently capable maximizer would wipe out humanity while pursuing instrumental goals, whether that means gathering resources to make more paperclips or preventing itself from being turned off.

Lastly, I am not an AI scientist (yet). My only experience in alignment is from reading the posts here and following AI alignment more generally online through channels like Robert Miles. These ideas are NOT mine. Hopefully, I do not misrepresent them.

The Quantilizer

The Quantilizer is an approach that combines a utility function with a neural network that estimates the probability that a human would take a given action.

For example, imagine the paperclip maximizer. By itself, it has an extremely high chance of wiping out humanity. But the probability of a human deciding to turn the world into paperclips is basically 0, so a quantilizer is likely to do something much more reasonable, like ordering paperclips online. If you take the set of actions it can do, weight them by how likely a human would be to take them, rank them with the utility function, and then pick at random from the top 10%, you can get useful results without running into doom.
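The selection step described above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the action names, the probabilities in `human_prob` (standing in for the neural network's output) and the numbers in `utility` are all made up, and `quantilize` is a hypothetical helper, not code from any quantilizer paper or library.

```python
import random

def quantilize(human_prob, utility, q=0.1, rng=random):
    """Pick an action at random from the top-q fraction (ranked by utility)
    of the human-like action distribution."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = sum(human_prob.values())
    if total <= 0.0:
        raise ValueError("human_prob must contain some probability mass")
    # Rank actions from highest to lowest utility.
    ranked = sorted(human_prob, key=lambda a: utility[a], reverse=True)
    kept, weights, mass = [], [], 0.0
    for action in ranked:
        remaining = q - mass
        if remaining <= 1e-12:  # already covered a fraction q of the mass
            break
        p = human_prob[action] / total
        if p <= 0.0:  # a human would never do this, so it is never eligible
            continue
        take = min(p, remaining)
        kept.append(action)
        weights.append(take)
        mass += take
    # Sample in proportion to human probability within the kept slice.
    return rng.choices(kept, weights=weights, k=1)[0]

# Made-up numbers, purely for illustration.
human_prob = {
    "order paperclips online": 0.05,
    "buy a paperclip factory": 0.04,
    "convert the world into paperclips": 0.0,
    "do nothing": 0.91,
}
utility = {
    "order paperclips online": 10.0,
    "buy a paperclip factory": 50.0,
    "convert the world into paperclips": 1e9,
    "do nothing": 0.0,
}

rng = random.Random(0)
samples = [quantilize(human_prob, utility, q=0.1, rng=rng) for _ in range(1000)]
print(sorted(set(samples)))
```

The catastrophic action has by far the highest utility, but because its human probability is zero it never enters the pool the quantilizer samples from. A plain maximizer over the same `utility` table would pick it every time.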

[Figure: a bell curve showing the human-ness of a given action, as determined by a neural network.]

There is a much better explanation of this by Robert Miles on his YouTube channel.

And a post here on LessWrong about the mathematical nature of how they work.

Ultimately, quantilizers aren't perfectly safe, and they're not even particularly safe, but the chance that they wipe us out is smaller than with a maximizer. They still have a lot of issues: they probably don't want to be turned off (because humans don't want to be turned off), they can only be as good as top-tier humans, and they aren't inherently aligned. Additionally, they won't outperform maximizers in most tasks.

But they don't instantly cause extinction, and there is a higher chance of alignment if the neural net does indeed become superhumanly intelligent at predicting how a human would act. This should be our safety baseline, and it is far better than the 99.9% chance of extinction Eliezer predicts for us if we continue along our current path.

Anti-Expandability/Penalize Compute Usage

One key issue that keeps cropping up is intelligence that can make itself more intelligent. Obviously, there's only so much intelligence you can get from a single computer. Most AI horror stories start with the AI escaping containment, spreading itself, and then using the rich computing environment to make itself exponentially smarter.

To me, AI is threatening because the Internet is a fertile ecosystem, the same way Earth was a fertile ecosystem before the explosion of cyanobacteria. All of our computers are so interconnected and security is so fragile that an advanced AI could theoretically take over most of the internet with terrifying speed.

An AI that is limited to a single datacenter seems a lot easier to deal with than an AI-controlled Stuxnet. So you could theoretically discourage the AI from spreading itself by including a penalty in its reward function for acquiring more computing power.

There is also a post on LessWrong showing that deception is penalized computationally in training, and so favoring algorithms which use less compute can be an effective way of preventing models from acting deceptively.

This is a good metric to include because we generally don't want our models to expand the computing power available to them, not just because of alignment but also because that's expensive. It's easy to set a baseline and then punish agents that go above it while rewarding agents that stay below it. We want efficient models.
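As a minimal sketch of the baseline idea, assuming compute can be measured as a single number and that a linear penalty is good enough (both are my assumptions, and `penalized_reward` and its parameters are hypothetical names):

```python
def penalized_reward(task_reward, compute_used, baseline, penalty_per_unit=0.5):
    """Task reward minus a linear charge for compute above the baseline.
    Using less compute than the baseline earns a small bonus instead."""
    return task_reward - penalty_per_unit * (compute_used - baseline)

# Two hypothetical agents that solve the task equally well.
greedy = penalized_reward(task_reward=10.0, compute_used=12.0, baseline=10.0)
frugal = penalized_reward(task_reward=10.0, compute_used=8.0, baseline=10.0)
print(greedy, frugal)
```

With equal task performance, the agent that grabbed extra compute ends up with the lower reward. How large `penalty_per_unit` should be relative to the task reward is an open design choice that this sketch does not settle.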

The only reason not to include this basic safety feature seems to be that it isn't currently something AI agents exploit and so nobody thinks about it.


Reward Modeling

Let's ditch the reward function for a moment.

Humans don't actually know their own objective function. Rather, we have a rough model similar to a neural network that determines how much reward we get for a certain action. If happiness is our objective function, then why do we feel happy when we see a beautiful night sky or quality artwork? The truth is, we simply have a model of our reward function rather than an actual reward function.

Many people have talked about how difficult it is to formally encode all of our ethics and values into a reward function. Why not let a neural network do that instead, where the humans provide reward? You simply reward an intelligence when it does something you like, and penalize it when it doesn't. The humans control the training of the neural net and the neural net provides the reward based on what it has learned. This seems to intrinsically make the model more aligned with humanity.
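To make the loop concrete, here is a toy sketch in which a learned model stands in for the hand-written reward function. Everything in it is an assumption for illustration: a small logistic regression replaces the neural network, each behaviour is summarized by two made-up features (paperclips made, harm caused), and the approve/disapprove labels are invented.

```python
import math
import random

def train_reward_model(feedback, epochs=200, lr=0.5, seed=0):
    """Fit logistic-regression weights from (features, approved) pairs,
    where approved is 1 if the human liked the behaviour and 0 otherwise."""
    rng = random.Random(seed)
    weights = [0.0] * len(feedback[0][0])
    bias = 0.0
    data = list(feedback)
    for _ in range(epochs):
        rng.shuffle(data)
        for features, approved in data:
            z = bias + sum(w * x for w, x in zip(weights, features))
            predicted = 1.0 / (1.0 + math.exp(-z))
            error = approved - predicted
            # Stochastic gradient step on the log-likelihood.
            weights = [w + lr * error * x for w, x in zip(weights, features)]
            bias += lr * error
    return weights, bias

def predicted_reward(model, features):
    """Score a behaviour with the learned model (higher means more approved)."""
    weights, bias = model
    return bias + sum(w * x for w, x in zip(weights, features))

# Hypothetical human feedback: features are (paperclips_made, harm_caused).
feedback = [
    ([1.0, 0.0], 1),
    ([0.5, 0.0], 1),
    ([1.0, 1.0], 0),
    ([0.2, 1.0], 0),
]
model = train_reward_model(feedback)
harmless = predicted_reward(model, [0.8, 0.0])
harmful = predicted_reward(model, [0.8, 1.0])
print(harmless > harmful)
```

The agent would then be trained against `predicted_reward` rather than a hand-coded objective, and the humans would keep supplying feedback as the agent finds new behaviours. A real system would need a far richer model and far more feedback than this.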

Reward modeling has proven effective when the definitions are hard to specify. This seems like an excellent spot for it. (Scalable agent alignment via reward modeling: a research direction)

There are some issues, such as interpretability, but this direction seems promising.

This is the current extent of my knowledge of existing solutions to AI alignment. None of them are perfect. But they are vastly better than the default approach of a plain maximization function, which likely results in AI takeover and a monotonous future without humans.

People often like to 'hope for the best'. These models appear to show possibilities where we do end up aligned. I see no possibility for 'dumb luck' to save us in a future where the first superintelligent AI uses a maximization function, but I see plenty of hope in a future where we implement these existing approaches into an AGI.


Another point by Stuart Russell: Objective uncertainty. I'll add this into the list when I've got more time.