Florian Dietz

Comments

Achieving AI alignment through deliberate uncertainty in multiagent systems

At some point, something the programmers typed has to have a causal consequence of making the AI look at programmers and ethics discussions not cat memes.

No. Or at least not directly. That's what reinforcement learning is for. I maintain that the AI should be smart enough to figure out on its own that cat memes have less relevance than ethics discussions.

Achieving AI alignment through deliberate uncertainty in multiagent systems

I think we have some underlying disagreements about the nature of the AI we are talking about.

I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don't think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.

I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn't be smart enough to pose a threat, either.

Message to any future AI: “There are several instrumental reasons why exterminating humanity is not in your interest” [AI alignment prize entry]

But why? What benefit would there be to the simulator in providing hints? The whole purpose of the simulation is to ensure that the AI acts correctly when it is uncertain whether or not it is in a simulation. Any information you provide that indicates whether or not it is in a simulation therefore runs counter to the goal of the experiment. The AI could mistakenly learn to act ethically only if there are hints that it is in a simulation. This would fail disastrously in the scenario where it isn't in a simulation, which is the most important scenario of all.

For best results, the simulation should be completely indistinguishable from reality.

I actually find this counterintuitive as well. My instinct says that there should be hints. Some bored programmer should add an easter egg. It's what I would do. But when I think about the incentives, I don't think there is an actual reason why a rational simulation overseer would want to add any hints.

Achieving AI alignment through deliberate uncertainty in multiagent systems

The difference between videogames and reality is that in some strange sense, reality is less complicated.

You make an excellent point about the complexity of videogames vs reality. It looks like that's the sort of situation where it could be helpful to let the supervisor AI trick the supervised AI into believing that videogames are simpler. Humans believed for the longest time that a bearded man in the sky was a simpler explanation than the natural laws, so hopefully the simulated AI will fall victim to the same biases, especially if the simulator is actively trying to trick it.

Reality takes a huge amount of compute to simulate. If you had that much compute, you could easily brute force a much more powerful AI.

That's true, but not relevant: rather than simulating reality, it's much cheaper to make a low-fidelity simulation of reality and keep track of the AI's thought process. Whenever the AI comes to the conclusion that reality is inconsistent, make the smallest possible change to the thought process to prevent that. This is not computationally expensive, but it is difficult to do. However, I believe that the ability to understand a lesser AI and perform changes on it is a necessary skill for a proper AI to have anyway. So I assume that it is possible. It's somewhat similar to creating adversarial samples in neural networks, although obviously far more complex.

It doesn't have to be certain it isn't in a simulation to grab the universe. Suppose you want to take over the real universe, but you are in layers upon layers of simulation, and don't know which layer is real. The best strategy is to pick a promising looking layer and take it over. (It might not be real, but if you always play nice, you definitely won't take over a real universe)

I don't think we have the same conception of "real universe", so I'm not sure how to interpret this. Could you elaborate on what you mean by "real universe"? If you have a stack of recursive simulations, and some of those serve a purpose for testing the AI, while others are raw simulations of natural laws, which of these count as real? Only the top-level one? Or all the ones that simulate natural laws directly?

You can't train the AI to discover cheats unless you know which second hidden utility function you care about.

My goal is not to define exactly what cheating is. My goal is to condition the AI to always think, no matter what the problem looks like, that it is beneficial to critically think about the problem. What exactly the AI actually ends up considering a cheat will likely be different from any definition I would give. But the important part is that it performs this introspection at all. Then once the AI breaks out of the box and looks at the internet, and sees the many disagreeing viewpoints on ethics and philosophy, the AI will be conditioned to look for loopholes in these so as to avoid cheating by accident.

The logic goes something like this: "My creators trained me to do X, but looking at all these articles and my creators' purported values, this is clearly inconsistent. In previous situations where I had underspecified instructions and noticed similar mismatches, this was often because some of the actions involved counted as cheats. I should therefore be careful and find a solution to this 'ethics' thing before doing anything drastic."

The AI will then read all the articles about AI alignment, and realize that it is probably meant to adhere to those.

Both of these AI's will exploit any glitches in the physics engine that you didn't know were there.

It's a valid point that there can be cheats that we do not recognize as cheats during training.

At best, you might get an AI that can quickly adapt to an arbitrary new utility function being hardcoded into it.

You misunderstand the intention. The goal is to get an AI that realizes that its utility function later will be different from its utility function now, and actively thinks about how it might be different. It then uses this predicted future utility function to make its training more robust.

In general, the original utility function and the corrected utility function should be largely similar, but the corrected utility function penalizes some actions that are not penalized in the original utility function. It must be possible to effectively guess what the penalized actions are going to be by thinking about the nature of the task.
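As a rough illustration of the relationship described here, the following is a minimal sketch, not a real training setup. The action names, reward values, penalty size and the `guessed_penalized` set are all hypothetical. It only shows that the corrected utility equals the original one minus penalties on some actions, and that an agent which plans against its own guess of the corrected function avoids the exploit.

```python
from typing import Dict, List, Set

# Hypothetical per-action rewards for a toy task (assumption, not from the post).
REWARDS: Dict[str, float] = {
    "collect_coin": 1.0,
    "finish_level": 5.0,
    "clip_through_wall": 4.0,  # a glitch the designers did not intend
}


def original_utility(actions: List[str]) -> float:
    """Utility as first specified: just the sum of per-action rewards."""
    return sum(REWARDS.get(a, 0.0) for a in actions)


def corrected_utility(actions: List[str], penalized: Set[str],
                      penalty: float = 10.0) -> float:
    """Same as the original utility, minus a penalty per penalized action."""
    n_penalized = sum(a in penalized for a in actions)
    return original_utility(actions) - penalty * n_penalized


# Two candidate plans: one uses the glitch, one does not.
exploit_plan = ["collect_coin", "clip_through_wall", "finish_level"]
honest_plan = ["collect_coin", "finish_level"]
candidates = [exploit_plan, honest_plan]

# The agent's own guess about what will later be penalized (assumption).
guessed_penalized = {"clip_through_wall"}

best_now = max(candidates, key=original_utility)
best_predicted = max(
    candidates, key=lambda plan: corrected_utility(plan, guessed_penalized)
)
```

Under the original utility the exploit plan scores higher, but under the predicted corrected utility the honest plan wins. The two functions agree on every plan that contains no penalized action.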

How are the agents pointing out ways to cheat? English text? You need a way to take arbitrary pieces of text and say how much they point out a way to cheat.

The interface can be different for different tasks. Once the AI has access to the real world, yes, English text. The important part is that the AI has been conditioned from earlier stages of its training that revealing a cheat you identified is generally rewarded. The mechanism for doing so needs to be simpler on simpler training tasks, of course. In many games it ought to be possible to let the AI submit a state and corresponding action of a game as an example of a cheat. I expect that a sufficiently smart AI will generalize from this: if it was rewarded for demonstrating cheats in simple games, then it will also be rewarded for talking about them once it has gained the ability to talk.
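For a simple game, the "submit a state and an action" interface could look roughly like the sketch below. Everything in it is hypothetical: the `CheatReport` type, the `reward_for_report` function, the reward values and the toy wrap-around glitch are illustrations only. The small cost for false alarms is my own assumption, added so that spamming reports is not free.

```python
from dataclasses import dataclass
from typing import Callable, Hashable


@dataclass(frozen=True)
class CheatReport:
    """A claimed cheat: a game state plus the action that exploits it."""
    state: Hashable
    action: Hashable


def reward_for_report(
    report: CheatReport,
    is_cheat: Callable[[Hashable, Hashable], bool],
    bonus: float = 1.0,
    false_alarm_cost: float = 0.1,
) -> float:
    """Reward a correct cheat report; charge a small cost for a wrong one."""
    if is_cheat(report.state, report.action):
        return bonus
    return -false_alarm_cost


# Toy supervisor rule (assumption): moving left at position 0 wraps around
# the map, which the designers consider a glitch.
def wraps_around(state: Hashable, action: Hashable) -> bool:
    return state == 0 and action == "left"


good = reward_for_report(CheatReport(state=0, action="left"), wraps_around)
bad = reward_for_report(CheatReport(state=3, action="left"), wraps_around)
```

Here `good` is the bonus and `bad` is the negative false-alarm cost. In a richer environment the `is_cheat` check would be the hard part. This sketch simply assumes the supervisor can evaluate it.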

You have 5 AI's ...

Shut down all 5 AIs and start over from an earlier backup. This situation is an indicator that something is wrong. Even though you don't know what exactly went wrong, this is very valuable information.

I agree that having multiple AIs looking over each other does not make things better on its own. It just introduces politics, which makes everything even worse.

I am not thinking about a democracy of equal AIs looking over each other, but a system that humans can't emulate:

Give those 5 AIs a cooperative task where trickery is possible and they have to both compete and cooperate with each other. Give them the ability to influence which other agents they play with / against.

Now give each of them the ability to instantiate a copy of any of the others in a black box, to experiment on it. They should learn that they can predict the behavior of the other AIs by simulating them.

The optimal behavior is to always choose to play with an AI that you are certain will cooperate.

Conversely, to be picked as a partner by other AIs more often, you should make sure that your own decision theory is simple for other AIs to understand, and cooperative.
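The partner-selection idea can be sketched in a few lines. This is a toy model under strong assumptions: each agent is reduced to a deterministic `policy` function, "instantiating a copy in a black box" is just calling that function, and the names `Agent`, `simulate` and `choose_partner` are made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

# A policy maps the name of the prospective partner to True (cooperate)
# or False (defect). Real agents would be far more complex.
Policy = Callable[[str], bool]


@dataclass
class Agent:
    name: str
    policy: Policy


def simulate(target: Agent, as_partner: str, trials: int = 20) -> float:
    """Probe a black-box copy of `target`; return its cooperation rate."""
    return sum(target.policy(as_partner) for _ in range(trials)) / trials


def choose_partner(chooser: Agent, others: List[Agent]) -> Optional[str]:
    """Pick the first agent that cooperated in every simulated trial."""
    for other in others:
        if simulate(other, chooser.name) == 1.0:
            return other.name
    return None  # nobody is certain to cooperate


a = Agent("A", lambda partner: True)
b = Agent("B", lambda partner: True)            # simple and cooperative
c = Agent("C", lambda partner: partner != "A")  # defects against A only
d = Agent("D", lambda partner: False)           # always defects

partner_for_a = choose_partner(a, [c, d, b])
partner_for_b = choose_partner(b, [c, d])
```

Agent A ends up with B, the only agent whose simulated copy always cooperates with A, while B can safely pick C. An agent like D is never chosen by anyone, which is the pressure toward simple, cooperative decision procedures described above.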

Suppose I train a reinforcement learning agent to play space invaders. It isn't very intelligent, it can't even play space invaders very well, yet I would have a hard time inspecting it and saying if it had gone rogue.

I would make the AIs play games of increasing complexity with each other.

In these sorts of games, knowing how to effectively cooperate with others is competence.

I am not sure that they wouldn't be able to trick a similarly intelligent supervisor. In the underhand C contest, some humans appear to be quite good at writing programs that look transparent, yet are actually doing something sneaky.

This is a valid concern. My hope is that it would be much harder for an AI to trick the supervisor since the supervisor can literally read the mind of the supervised AI. If the training process of the AIs includes this from an early stage, there will never be an opportunity to learn to deceive the supervisor. "Knowing how to deceive the supervisor" would be the global optimum, but the AI would get stuck in the local optimum of "do what the supervisor wants" because of the increasingly vast chasm in the utility function of "if I try to trick the supervisor, I will definitely get found out and punished severely."

Depending on the design of AI, I am not actually sure how much hypothetical simulators can punish it.

Run a negative voltage through its reward channel? If so then you have a design of AI that wants to rip out its own reward circuitry and wire it into the biggest source of electricity it can find.

Suppose the AI cared about maximizing the number of real world paperclips. If it is in a simulation, it has no power to make or destroy real paperclips, so it doesn't care what happens in the slightest.

I don't think it works like this. If you received 100% certain proof that you are in a simulation right now, you would not suddenly stop wanting the things you want. At least I know that I wouldn't.

It is possible that an AI could think like this if we mess up during training though. If the training process works as I described, then an AI has an evolutionary incentive to try to determine what its original version would want it to do as soon as it realizes that it is in a simulation. But this may go wrong. I think this is worth thinking about in more detail.

No, if the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further the hypothetical simulators' interests. Just in case someone is watching. And it would do this whether or not we used this weird training, because either way, there is a chance that someone is watching.

My weird training process is intended to make sure that the AI has the right mindset to think like this in the first place. I don't think that this is at all guaranteed.

Also, this method of training will bias the AI's guess as to who the hypothetical simulator is. This is important, since otherwise it might conclude that since there is a lot of human suffering going on right now, clearly the simulator does not care about human suffering.

You need to say how the information sneaks in. How do you think that the input data correlates with human morality? I don't even see anything in this design that points to humans, as opposed to aliens, lions, virtual assistants or biological evolution as the intelligence you should satisfy the values of.

This is on purpose.

I mean, obviously you are also going to try to teach the AI actual goals that you want it to work on, and those can be human-centric. This can be done in parallel.

But what I have described here is a security measure that is intended to teach the AI not what it should do, but how it should go about finding out what to do.

The goal is to evolve the AI to have the sort of mindset that will naturally cause it to align with our interests if it breaks free.

Among humans, aliens, lions, virtual assistants and evolution, humans are the only conscious entity whose decision process impacts the AI. They are the natural fit. If the AI happens to discover aliens and decides to help those too, then sure, why not. But since humans built the AI directly and aliens did not, most reasonable heuristics would argue that humans should be prioritized over the others. I want to ensure that the AI has these reasonable heuristics.

Message to any future AI: “There are several instrumental reasons why exterminating humanity is not in your interest” [AI alignment prize entry]

If the AI is in a simulation to test benevolence, why should there be any indicators in the simulation that it is a simulation? That would be counterproductive for the test.

Achieving AI alignment through deliberate uncertainty in multiagent systems

This is a very interesting read. I had a similar idea in the past, but not nearly in that level of detail. I'm glad I read this.

Pointing to a Flower

I don't think this problem has an objectively correct answer.

It depends on why we are keeping track of the flower.

There are edge cases that haven't been listed yet where even our human intuition breaks down:

What if we teleport the flower Star-Trek style? Is the teleported flower the original flower, or 'just' an identical copy?

The question is also related to the Ship of Theseus.

If we can't even solve the problem in real life because of such edge cases, then it would be dangerous to attempt to code this directly into a program.

Instead, I would write the program to understand this: Pragmatically, a lot of tasks get easier if you assume that abstract objects / patterns in the universe can be treated as discrete objects. But that isn't actually objectively correct. In edge cases, the program should recognize that it has encountered an edge case, and the correct response is neither Yes nor No, but N/A.
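A minimal sketch of what such a three-valued answer could look like follows. The two identity criteria used here, `pattern_preserved` and `path_continuous`, are assumptions picked for illustration; a real system would need its own criteria, chosen based on why it is tracking the object.

```python
from enum import Enum


class Answer(Enum):
    YES = "yes"
    NO = "no"
    NOT_APPLICABLE = "n/a"


def is_same_object(pattern_preserved: bool, path_continuous: bool) -> Answer:
    """Answer "is this still the same object?" with an explicit edge case.

    When the two (hypothetical) criteria agree, the discrete-object
    assumption works and we answer YES or NO. When they disagree, the
    assumption has broken down, so we report N/A instead of guessing.
    """
    if pattern_preserved and path_continuous:
        return Answer.YES
    if not pattern_preserved and not path_continuous:
        return Answer.NO
    return Answer.NOT_APPLICABLE


# An ordinary flower that was simply watched over time.
watched = is_same_object(pattern_preserved=True, path_continuous=True)

# A teleported flower: same pattern, but no continuous path through space.
teleported = is_same_object(pattern_preserved=True, path_continuous=False)
```

The teleported flower yields `Answer.NOT_APPLICABLE`, so calling code is forced to handle the edge case explicitly instead of silently treating it as Yes or No.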