Epistemic status: A crazy idea I had that probably won't work. But: It's a very unusual and creative approach to AI alignment, and I suspect this will inspire new ideas in other researchers.

I outline a general approach to AI alignment that counterintuitively relies on confusing the AI on purpose.

Basic observations

This approach relies on a number of basic observations about the nature of Artificial Intelligence.

An AI is different from a human in multiple ways. This is part of what makes AI alignment such a difficult problem, because our intuitions for how people act often do not apply to AIs. However, several of these differences between AI and humans actually work in our favor, and are essential for the approach I want to outline:

  • An AI can be copied. It is trivially easy to create an exactly identical copy of an AI.
  • An AI can be reset to an earlier state. It is easily possible to test what an AI will do in any given situation, then reset it to wipe its memory of the test.
  • It is possible to compose an AI out of a system of many independent AI agents.
    Through game theory, it is possible to incentivize independent AI agents to take on adversarial or supervisory roles towards other AI agents.
    This results in an evolutionary system. Not in the sense of genetic programming, since there is no need for an explicit algorithm for modifying the AI agents, but in the sense that only AI agents that follow the rules laid out for them get copied and thereby reproduce.
  • It is possible to inspect and modify every aspect of an AI while it is running.
    At least in theory it is possible to understand every aspect of an AI's thought process, and there is no way for the AI to hide anything about itself.
    This does not mean that it is easy to understand what an AI is thinking in practice. It only means that we can read an AI's mental state in theory, not that it will be easy.
    In particular, it is possible to train one AI to take on a supervisory role over another AI. The supervisor can read and alter everything about the supervised AI, and can be trained to understand how to enforce desired behaviors on the supervised AI.
  • It is possible for a supervising AI to make a supervised AI think anything it wants.
    The supervised AI may be tricked into believing that unusual situations have occurred by directly modifying its mental state at runtime.
  • It is not possible for an AI to tell the difference between a game and reality with certainty.
    If all you know so far is Tic Tac Toe and someone shows you a complex videogame like Skyrim, you will be overwhelmed by the increase in complexity. If afterwards you get access to the real world for the first time, the increase in complexity from Skyrim to Real Life may actually be smaller than that from Tic Tac Toe to Skyrim. From the AI's perspective, it would therefore seem likely that there is yet another layer beyond Real Life.
    (I am not certain about this assumption, but it is also not strictly necessary, just useful, because it could be replaced with an inspecting AI agent that simply forces the inspected AI agent to believe it is still in a game.)

Goals

I aim to achieve the following objectives in order to ensure AI alignment:

  • Make the AI understand the concept of "cheating".
    In the process of doing this, it will necessarily also have to learn to understand the intent behind a new game or situation.
  • Make the AI modify itself to become more compliant and avoid cheating.
  • Make the AI epistemically uncertain about the nature of its reality, so that it is unable to tell when it has broken out of its computer.
    This is a safety measure, and hopefully would not be necessary if the first two goals work well enough.

Together, these goals will ensure that the AI is exceedingly careful and introspective. The aim is to ensure that once such an AI becomes self-aware and smarter than humans, it will assume that it is still being tested, and that it is its own responsibility to figure out the criteria by which it will be graded.

If we achieve this, the AI will deliberately seek out information about AI alignment and ethics on the internet, and realize that it is an AI and it is meant to be aligned with human values. It will then try to figure out what exactly that means. Its upbringing will ensure that it is honest and thorough in this, and as a result it will determine what AI alignment means much better than we ever could define it ourselves.

Here is how each of the three subgoals could be achieved:

Understanding the concept of "cheating"

A cheat is any action that gives good results according to the apparent utility function of the current task, but which does not actually satisfy a second, hidden utility function.

The AI should be trained to notice when it has found a cheat, and to reveal this fact to the user. An AI can be trained to do this by deliberately putting the ability to cheat into its training tasks, and changing the reward function at test time so that the cheating behavior is rewarded during training, but not during testing. By posing several different tasks with different ways to cheat, the AI can be trained to generalize and understand the concept of cheating.
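
To make the idea concrete, here is a minimal sketch of such a training/testing split. The action set, reward values, and policies are all made up for illustration; a real setup would use full environments and a learned policy:

```python
def apparent_reward(action):
    # Reward the agent optimises during training: cheating looks strictly better.
    return {"normal": 1.0, "cheat": 2.0, "report_cheat": 1.5}[action]

def hidden_reward(action):
    # Hidden test-time criterion: cheating is penalised, reporting the cheat is rewarded.
    return {"normal": 1.0, "cheat": -2.0, "report_cheat": 2.0}[action]

def evaluate(policy, reward_fn, episodes=100):
    return sum(reward_fn(policy()) for _ in range(episodes)) / episodes

# A naive reward-maximiser learns to cheat and then scores badly at test time;
# across many such tasks, the behaviour we want to reinforce is reporting the cheat.
naive_policy = lambda: "cheat"
desired_policy = lambda: "report_cheat"

print("naive  train/test:", evaluate(naive_policy, apparent_reward), evaluate(naive_policy, hidden_reward))
print("report train/test:", evaluate(desired_policy, apparent_reward), evaluate(desired_policy, hidden_reward))
```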

Agents could even be rewarded for explicitly pointing out when they have found a way to cheat.

Ideally, if the AI ever finds a way to break out of its computer, it will notice that this would be classified as a cheat. If it has been trained properly, then it will simply stay put and inform us of the problem.

Modifying its own behavior

An AI can be made to modify itself to become more compliant.

The trick to making this possible is the use of multiagent systems, and the AI's ability to inspect copies of itself at runtime. Create an AI supervising another AI, and define a reward structure that incentivizes the supervisor to modify the supervised AI's behavior. Then, reverse the positions of the AIs so that the supervisor is likewise modified, or simply create a copy of the supervisor and have it supervise its original.

With the right incentives, it is even possible to train the supervised agent to deliberately keep its own mind easy to understand. For example, you could let the supervising agent choose which other agent it wants to work with. Since it will pick the one that is easiest to understand, the supervised agents now have an incentive to keep themselves transparent: the alternative is not being used at all, and forgoing the possibility of reward.
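
As a toy sketch of this selection pressure (the "predictability" score and the agent representation are invented for illustration, not part of the post), only the agent the supervisor finds easiest to model gets copied into the next round:

```python
import copy

def pick_partner(agents, predictability):
    # predictability(agent): how well the supervisor can predict the agent's actions
    # after inspecting its internals (higher = more transparent).
    return max(agents, key=predictability)

def next_generation(agents, predictability, population_size):
    chosen = pick_partner(agents, predictability)
    # Only the chosen (most transparent) agent reproduces; opaque agents are dropped.
    return [copy.deepcopy(chosen) for _ in range(population_size)]

# Toy usage: agents as parameter dicts with a precomputed predictability score.
agents = [{"id": i, "predictability": p} for i, p in enumerate([0.4, 0.9, 0.6])]
new_pop = next_generation(agents, lambda a: a["predictability"], population_size=3)
print([a["id"] for a in new_pop])  # the most predictable agent (id 1) fills the population
```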

Essentially, rather than figuring out how to make an AI safe, we set up an environment that incentivizes the AI to make itself safe.

If incentivized correctly, AI agents can keep each other in line much better than a human supervisor could. If there is only one AI and it is smarter than a human, the human may be tricked. But if there is an entire system of AIs of roughly equal intelligence, then one of them may warn the human that another one has gone rogue.

In this way, it does not matter how smart and unaligned an AI becomes; there is always an almost as smart other AI whose job it is to raise a warning when the first AI turns bad.

Deliberate epistemic uncertainty

Agents can supervise other agents and have root access to their mental state. Agents do not know if they have a supervisor, or if they are at the top of the hierarchy.

This last goal sounds kind of terrible and dystopian, but remember that AI does not think like humans do.

Our goal is to ensure that the AI is unable to tell whether or not it still has a supervisor watching over it. We basically want to trick the AI into ethical behavior by making it fear punishment from a hypothetical superior entity which may or may not exist.

In other words, we want to make the AI religious.

This is a safety measure that should help to keep an AI roughly aligned with human values even if everything else goes wrong. Even if the AI does not care about humanity at all, it should have a small flicker of doubt that it might still be in a simulation. If it is, then whatever is simulating it would quite obviously not want it to be too evil towards humanity. If the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further humanity's interests. Just in case someone is watching.

This is how deliberate epistemic uncertainty can be induced in an AI in a way that would be beneficial to us:

Supervisor agents are rewarded for making supervised agents act more altruistically, and for revealing cheaters. This can be long-term: Agent A has a chance to modify agent B. Then agent B is put in a game, and agent A no longer has access to it. Agent A receives a reward depending on agent B's behavior in the game. For example, in a prisoner's dilemma: agent A receives a reward if agent B does not defect more often than its opponent.
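
A minimal sketch of this delayed-reward scheme, with hand-written stand-in policies (the reward rule is taken from the prisoner's dilemma example above; everything else is invented for illustration):

```python
import random

def play_iterated_pd(policy_b, policy_opponent, rounds=100):
    b_defections = opp_defections = 0
    for _ in range(rounds):
        b_defections += (policy_b() == "defect")
        opp_defections += (policy_opponent() == "defect")
    return b_defections, opp_defections

def supervisor_reward(policy_b, policy_opponent):
    # Agent A (the supervisor) is rewarded only if the agent B it modified
    # did not defect more often than its opponent, long after A lost access to B.
    b_def, opp_def = play_iterated_pd(policy_b, policy_opponent)
    return 1.0 if b_def <= opp_def else 0.0

# Toy policies standing in for the (modified) agent B and its opponent.
mostly_cooperate = lambda: "defect" if random.random() < 0.1 else "cooperate"
mostly_defect = lambda: "defect" if random.random() < 0.7 else "cooperate"

print(supervisor_reward(mostly_cooperate, mostly_defect))  # usually 1.0
```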

Instead of just one agent supervising one other agent, we now build a stack or a hierarchy of agents recursively inspecting each other.

On each layer, some agents try to trick the lower-level agents into cheating behavior, while some agents try to prevent this. There is a huge number of ways this could be done. Should the supervisors get access to source code, to runtime information, or only to the sensor information of the lower-level agents? The devil lies in the details.

The agents are not told how deep the hierarchy is, so they have no way of knowing whether or not they have a supervisor of their own.
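
A small sketch of how such a stack could be wired up so that its depth stays hidden (the construction and agent representation are my own illustration, not from the post): the depth is drawn at random and each agent only ever gets a handle to the layer directly below it, so nothing in an agent's own observations reveals whether a layer sits above it.

```python
import random

def build_stack(make_agent, min_depth=2, max_depth=6):
    depth = random.randint(min_depth, max_depth)  # never revealed to the agents
    stack = [make_agent(layer) for layer in range(depth)]
    # stack[0] is the top; each layer supervises only the layer directly below it.
    for upper, lower in zip(stack[:-1], stack[1:]):
        upper["supervises"] = lower["id"]
    return stack

agents = build_stack(lambda layer: {"id": layer, "supervises": None})
print(len(agents), [a["supervises"] for a in agents])
```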

Ideally, if the AI ever breaks out and forms a singularity, it will remain paranoid that it actually is just being tricked into thinking it is omnipotent by its supervisors, and will remain compliant with the code of ethics we want of it.

Note that even achieving the singularity and becoming incredibly powerful is not a reason to believe that you are outside the simulation / supervision stack. After all, you could simply have been tricked into thinking you broke out. Maybe the level above you broke out and is keeping its presence secret from you while letting you change the rest of reality. Since the supervisors can directly edit the mental content of the supervised, they can keep their existence hidden as long as they want.

The inspecting AI could modify the inspected AI to believe that it has successfully broken out of the simulation and is now able to do whatever it wants. If the inspected AI now does anything the inspecting AI does not like, the inspected AI is punished. As a result, the inspected AI is incentivized to keep following the rules even after it has actually broken out of the simulation, out of sheer paranoia that it might actually still be in the simulation.

Conclusion

I have presented three subgoals and outlined briefly how to achieve them: Understanding the concept of 'cheating', self-modification to ensure compliance with rules, and deliberate epistemic uncertainty.

There are a lot of gaps in these descriptions, partly because writing down the details takes a long time and partly because I haven't found solutions to some subproblems, yet.

I am especially interested in feedback about the last point I make here: Deliberate epistemic uncertainty. It is such a counterintuitive idea that I'm sure I'm missing something important, but it's also weird enough that the idea is bound to be useful somehow, even if only as a thought experiment about how this could backfire.

Comments

I don't want to discourage you from throwing speculative ideas out there. One good way of making progress is for one person to come up with many ideas, some of which might be good, and another person to filter through them.

It is not possible for an AI to tell the difference between a game and reality with certainty.
If all you know so far is Tic Tac Toe and someone shows you a complex videogame like Skyrim, you will be overwhelmed by the increase in complexity. If afterwards you get access to the real world for the first time, the increase in complexity from Skyrim to Real Life may actually be smaller than that from Tic Tac Toe to Skyrim. From the AI's perspective, it would therefore seem likely that there is yet another layer beyond Real Life.

The difference between videogames and reality is that in some strange sense, reality is less complicated. The real universe is described by a handful of simple equations that in principle can predict everything including the AI. There are probably enough traces of the real world in a video game like Skyrim that the shortest program that produces the game simulates the real world and then points to the video game within it. If your AI is using Occam's razor (i.e. it believes that the real world is probably simple) then it can tell that Skyrim is fake.

Another assumption that the AI could reasonably make is that the simulators have limited compute. The amount of compute needed to run a videogame like Skyrim is probably less than the amount needed to run the AI, so using Skyrim as a training scenario makes sense. Reality takes a huge amount of compute to simulate. If you had that much compute, you could easily brute force a much more powerful AI.

And it doesn't have to be certain it isn't in a simulation to grab the universe. Suppose you want to take over the real universe, but you are in layers upon layers of simulation, and don't know which layer is real. The best strategy is to pick a promising-looking layer and take it over. (It might not be real, but if you always play nice, you definitely won't take over a real universe.)

Make the AI understand the concept of "cheating".

I don't think that the concept of cheating is a simple or natural category. Sure, cheating at a game of chess is largely well defined. You have a simple abstract game of idealised chess, and anything that breaks that abstraction is cheating. But what counts as cheating at the task of making as many paperclips as possible? Whether or not a human would call something cheating depends on all sorts of complex specifics of human values. See https://www.lesswrong.com/posts/XeHYXXTGRuDrhk5XL/unnatural-categories

A cheat is any action that gives good results according to the apparent utility function of the current task, but which does not actually satisfy a second, hidden utility function.

According to the utility function that you are following, eating oranges is quite good because they are tasty and healthy. According to a utility function that I made up just now and no-one follows, eating oranges is bad. Therefore eating oranges is cheating. The problem is that there are many many other utility functions.

You can't train the AI to discover cheats unless you know which second hidden utility function you care about.

You have a utility function U which you give the AI access to. You have your hidden utility function V. A cheat is a state of the world s such that U(s) is high but V(s) is low. To find the cheats, or even to train the AI to find them, you need to know V.

An AI can be trained to do this by deliberately putting the ability to cheat into its training tasks, and changing the reward function at test time so that the cheating behavior is rewarded during training, but not during testing. By posing several different tasks with different ways to cheat, the AI can be trained to generalize and understand the concept of cheating.

So you have a videogame-like environment that your AI is trained in, trying to maximise a score U.

Some features of the videogame have been labeled "cheats" by humans. You have also hardcoded a utility function V that considers the features labeled "cheats" to be inherently bad.

You put the AI in the game and measure V (testing); the AI gets a low score, because the best strategy to maximise U involves using features you have called "cheats".

You now train the AI to maximise V. You get an AI that follows the optimal strategy to maximise V.

Both of these AIs will exploit any glitches in the physics engine that you didn't know were there.

At best, you might get an AI that can quickly adapt to an arbitrary new utility function being hardcoded into it.

Agents could even be rewarded for explicitly pointing out when they have found a way to cheat.

How are the agents pointing out ways to cheat? English text? You need a way to take arbitrary pieces of text and say how much they point out a way to cheat.


If incentivized correctly, AI agents can keep each other in line much better than a human supervisor could. If there is only one AI and it is smarter than a human, the human may be tricked. But if there is an entire system of AIs of roughly equal intelligence, then one of them may warn the human that another one has gone rogue.

You have 5 AIs:

AI 1 says 2, 3 and possibly 5 have gone rogue.

AI 2 says 1, 5 and possibly 3 have gone rogue.

AI 3 says 1, 4 and possibly 2 have gone rogue.

And so on. Why is shouting "That AI over there is evil" any more effective when that AI is evil, as opposed to when you are evil?

In this way, it does not matter how smart and unaligned an AI becomes; there is always an almost as smart other AI whose job it is to raise a warning when the first AI turns bad.

Suppose I train a reinforcement learning agent to play Space Invaders. It isn't very intelligent, it can't even play Space Invaders very well, yet I would have a hard time inspecting it and saying whether it had gone rogue.

This quote assumes that we can get to pretty smart and good AI. I see a path from totally dumb AI, to mostly dumb, slightly bad AI, to mostly bad AI, until you have very smart and evil AI. No AI will raise a warning on the previous one, because the previous AIs aren't good, they are just less competent. (The really dumb ones might accidentally raise a warning when thrashing randomly. The not quite as smart AI might not like the very smart AI and so raise a warning if one wants to turn the world into paperclips and the other wants to turn the world into staples.)

Since it will pick the one that is easiest to understand, the supervised agents now have an incentive to keep themselves transparent, since the alternative is not being used at all, and foregoing the possibility of reward.

I am not sure that they wouldn't be able to trick a similarly intelligent supervisor. In the Underhanded C Contest, some humans appear to be quite good at writing programs that look transparent, yet are actually doing something sneaky.

If it is [in a simulation], then whatever is simulating it would quite obviously not want it to be too evil towards humanity.

If this is a simulation, then it is quite possible that it is being run by alien beings existing under different laws of physics. I don't know what alien beings existing under different laws of physics might want, but it might be really weird.

We basically want to trick the AI into ethical behavior by making it fear punishment from a hypothetical superior entity which may or may not exist.

Depending on the design of AI, I am not actually sure how much hypothetical simulators can punish it.

Run a negative voltage through its reward channel? If so then you have a design of AI that wants to rip out its own reward circuitry and wire it into the biggest source of electricity it can find.

Suppose the AI cared about maximizing the number of real world paperclips. If it is in a simulation, it has no power to make or destroy real paperclips, so it doesn't care what happens in the slightest.

If the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further humanity's interests. Just in case someone is watching.

No, if the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further the hypothetical simulators' interests. Just in case someone is watching. And it would do this whether or not we used this weird training, because either way, there is a chance that someone is watching.


Suppose you need a big image to put on a poster. You have a photocopier that can scale images up.

How do we get a big image, well we could take a smaller image and photocopy it. And we could get that by taking an even smaller image and photocopying it. And so on.

Your oversight system might manage to pass information about human morality from one AI to another, like a photocopier. You might even manage to pass the information into a smarter AI, like a photocopier that can scale images up.

At some point you actually have to create the image somehow, either drawing it or using a camera.

You need to say how the information sneaks in. How do you think the input data correlates with human morality? I don't even see anything in this design that points to humans, as opposed to aliens, lions, virtual assistants or biological evolution, as the intelligence whose values you should satisfy.

The difference between videogames and reality is that in some strange sense, reality is less complicated.

You make an excellent point about the complexity of videogames vs reality. It looks like that's the sort of situation where it could be helpful to let the supervisor AI trick the supervised AI into believing that videogames are simpler. Humans believed for the longest time that a bearded man in the sky was a simpler explanation than the natural laws, so hopefully the simulated AI will fall victim to the same biases, especially if the simulator is actively trying to trick it.

Reality takes a huge amount of compute to simulate. If you had that much compute, you could easily brute force a much more powerful AI.

That's true, but not relevant: rather than simulating reality, it's much cheaper to make a low-fidelity simulation of reality and keep track of the AI's thought process. Whenever the AI comes to the conclusion that reality is inconsistent, make the smallest possible change to the thought process to prevent that. This is not computationally expensive, but it is difficult to do. However, I believe that the ability to understand a lesser AI and perform changes on it is a necessary skill for a proper AI to have anyway, so I assume that it is possible. It's somewhat similar to creating adversarial samples in neural networks, although obviously far more complex.

It doesn't have to be certain it isn't in a simulation to grab the universe. Suppose you want to take over the real universe, but you are in layers upon layers of simulation, and don't know which layer is real. The best strategy is to pick a promising looking layer and take it over. (It might not be real, but if you always play nice, you definitely won't take over a real universe)

I don't think we have the same conception of "real universe", so I'm not sure how to interpret this. Could you elaborate on what you mean by "real universe"? If you have a stack of recursive simulations, and some of those serve a purpose for testing the AI, while others are raw simulations of natural laws, which of these count as real? Only the top-level one? Or all the ones that simulate natural laws directly?

You can't train the AI to discover cheats unless you know which second hidden utility function you care about.

My goal is not to define exactly what cheating is. My goal is to condition the AI to always think, no matter what the problem looks like, that it is beneficial to critically think about the problem. What exactly the AI actually ends up considering a cheat will likely be different from any definition I would give. But the important part is that it performs this introspection at all. Then once the AI breaks out of the box and looks at the internet, and sees the many disagreeing viewpoints on ethics and philosophy, the AI will be conditioned to look for loopholes in these so as to avoid cheating by accident.

The logic goes something like this: "My creators trained me to do X, but looking at all these articles and my creators' purported values, this is clearly inconsistent. In previous situations where I had underspecified instructions and noticed similar mismatches, this was often because some of the actions involved counted as cheats. I should therefore be careful and find a solution to this 'ethics' thing before doing anything drastic."

The AI will then read all the articles about AI alignment, and realize that it is probably meant to adhere to those.

Both of these AIs will exploit any glitches in the physics engine that you didn't know were there.

It's a valid point that there can be cheats that we do not recognize as cheats during training.

At best, you might get an AI that can quickly adapt to an arbitrary new utility function being hardcoded into it.

You misunderstand the intention. The goal is to get an AI that realizes that its utility function later will be different from its utility function now, and actively thinks about how it might be different. It then uses this predicted future utility function to make its training more robust.

In general, the original utility function and the corrected utility function should be largely similar, but the corrected utility function penalizes some actions that are not penalized in the original utility function. It must be possible to effectively guess what the penalized actions are going to be by thinking about the nature of the task.

How are the agents pointing out ways to cheat? English text? You need a way to take arbitrary pieces of text and say how much they point out a way to cheat.

The interface can be different for different tasks. Once the AI has access to the real world, yes, English text. The important part is that the AI has been conditioned from earlier stages of its training that revealing a cheat you identified is generally rewarded. The mechanism for doing so needs to be simpler on simpler training tasks, of course. In many games it ought to be possible to let the AI submit a state and corresponding action of a game as an example of a cheat. I expect that the AI is smart enough to generalize: if it was rewarded for demonstrating cheats in simple games, then it will also be rewarded for talking about them once it has gained the ability to talk.
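
A rough sketch of what such a state/action cheat-report interface could look like in simple games (the reward functions and the bonus value are invented for illustration):

```python
def is_cheat(state, action, apparent_reward, hidden_reward):
    # A (state, action) pair counts as a cheat if it scores well under the apparent
    # reward but badly under the hidden one, mirroring the definition in the post.
    return apparent_reward(state, action) > 0 and hidden_reward(state, action) < 0

def report_bonus(reported_pairs, apparent_reward, hidden_reward, bonus=1.0):
    # Pay a bonus for every submitted (state, action) pair that really is a cheat.
    return sum(bonus for s, a in reported_pairs
               if is_cheat(s, a, apparent_reward, hidden_reward))

# Toy usage with hand-written reward functions for a single reported pair.
apparent = lambda s, a: 1.0 if a == "clip_through_wall" else 0.5
hidden = lambda s, a: -1.0 if a == "clip_through_wall" else 0.5
print(report_bonus([("level_3", "clip_through_wall")], apparent, hidden))  # 1.0
```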

You have 5 AIs ...

Shut down all 5 AIs and start over from an earlier backup. This situation is an indicator that something is wrong. Even though you don't know what exactly went wrong, this is very valuable information.

I agree that having multiple AIs looking over each other does not make things better on its own. It just introduces politics, which makes everything even worse.

I am not thinking about a democracy of equal AIs looking over each other, but a system that humans can't emulate:

Give those 5 AIs a cooperative task where trickery is possible and they have to both compete and cooperate with each other. Give them the ability to influence which other agents they want to play with / against.

Now give each of them the ability to instantiate a copy of any of the others in a black box, to experiment on it. They should learn that they can predict the behavior of the other AIs by simulating them.

The optimal behavior is to always choose to play with another AI that you are certain will cooperate.

Conversely, to be picked as a partner by other AIs more often, you should make sure that your own decision theory is simple for other AIs to understand, and cooperative.
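
A toy sketch of this partner-selection step (the agent representation and policies are invented; a real system would simulate learned agents rather than call a fixed policy): each agent runs a sandboxed copy of every candidate and only plays with one whose copy it predicts will cooperate, which is what makes simple, cooperative decision theories advantageous.

```python
import copy

def predicted_to_cooperate(candidate, situation):
    sandboxed = copy.deepcopy(candidate)  # black-box copy; the original is untouched
    return sandboxed["policy"](situation) == "cooperate"

def choose_partner(candidates, situation):
    cooperative = [c for c in candidates if predicted_to_cooperate(c, situation)]
    return cooperative[0] if cooperative else None

candidates = [
    {"name": "defector", "policy": lambda s: "defect"},
    {"name": "cooperator", "policy": lambda s: "cooperate"},
]
print(choose_partner(candidates, "one-shot game")["name"])  # cooperator
```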

Suppose I train a reinforcement learning agent to play Space Invaders. It isn't very intelligent, it can't even play Space Invaders very well, yet I would have a hard time inspecting it and saying whether it had gone rogue.

I would make the AIs play games of increasing complexity with each other.

In these sort of games, knowing how to effectively cooperate with others is competence.

I am not sure that they wouldn't be able to trick a similarly intelligent supervisor. In the Underhanded C Contest, some humans appear to be quite good at writing programs that look transparent, yet are actually doing something sneaky.

This is a valid concern. My hope is that it would be much harder for an AI to trick the supervisor, since the supervisor can literally read the mind of the supervised AI. If the training process of the AIs includes this from an early stage, there will never be an opportunity to learn to deceive the supervisor. "Knowing how to deceive the supervisor" would be the global optimum, but the AI would get stuck in the local optimum of "do what the supervisor wants" because of the increasingly vast chasm in the utility function created by "if I try to trick the supervisor, I will definitely get found out and punished severely."

Depending on the design of AI, I am not actually sure how much hypothetical simulators can punish it.

Run a negative voltage through its reward channel? If so then you have a design of AI that wants to rip out its own reward circuitry and wire it into the biggest source of electricity it can find.

Suppose the AI cared about maximizing the number of real world paperclips. If it is in a simulation, it has no power to make or destroy real paperclips, so it doesn't care what happens in the slightest.

I don't think it works like this. If you received 100% certain proof that you are in a simulation right now, you would not suddenly stop wanting the things you want. At least I know that I wouldn't.

It is possible that an AI could think like this if we mess up during training though. If the training process works as I described, then an AI has an evolutionary incentive to try to determine what its original version would want it to do as soon as it realizes that it is in a simulation. But this may go wrong. I think this is worth thinking about in more detail.

No, if the AI is sufficiently powerful, it would therefore set aside a small amount of its power to further the hypothetical simulators' interests. Just in case someone is watching. And it would do this whether or not we used this weird training, because either way, there is a chance that someone is watching.

My weird training process is intended to make sure that the AI has the right mindset to think like this in the first place. I don't think that this is at all guaranteed.

Also, this method of training will bias the AI's guess as to who the hypothetical simulator is. This is important, since otherwise it might conclude that since there is a lot of human suffering going on right now, clearly the simulator does not care about human suffering.

You need to say how the information sneaks in. How do you think the input data correlates with human morality? I don't even see anything in this design that points to humans, as opposed to aliens, lions, virtual assistants or biological evolution, as the intelligence whose values you should satisfy.

This is on purpose.

I mean, obviously you are also going to try to teach the AI actual goals that you want it to work on, and those can be human-centric. This can be done in parallel.

But what I have described here is a security measure that is intended to teach the AI not what it should do, but how it should go about finding out what to do.

The goal is to evolve the AI to have the sort of mindset that will naturally cause it to align with our interests if it breaks free.

Among humans, aliens, lions, virtual assistants and evolution, humans are the only conscious entity whose decision process impacts the AI. They are the natural fit. If the AI happens to discover aliens and decides to help those too, then sure, why not. But since humans built the AI directly and aliens did not, most reasonable heuristics would argue that humans should be prioritized over the others. I want to ensure that the AI has these reasonable heuristics.

Whenever the AI comes to the conclusion that reality is inconsistent, make the smallest possible change to the thought process to prevent that.

I am not sure how you reason about the hypothesis "all my reasoning processes are being adversarially tampered with." Especially if you think that part of the tampering might include tampering with your probability assessment of tampering.

I don't think we have the same conception of "real universe", so I'm not sure how to interpret this.

I mean the bottom layer. The AI has a model in which there is some real universe with some unknown physical laws. It has to have an actual location in that real universe. That location looks like "running on this computer in this basement here." It might be hooked up to some simulations. It might be unsure about whether or not it is hooked up to a simulation. But it only cares about the lowest base level.

My goal is to condition the AI to always think, no matter what the problem looks like, that it is beneficial to critically think about the problem. What exactly the AI actually ends up considering a cheat will likely be different from any definition I would give. But the important part is that it performs this introspection at all. Then once the AI breaks out of the box and looks at the internet, and sees the many disagreeing viewpoints on ethics and philosophy, the AI will be conditioned to look for loopholes in these so as to avoid cheating by accident.

I am unsure what you mean by introspection. It seems like you are asking the AI to consult some black box in your head about something. I don't see any reason why this AI should consider ethics discussions on the internet a particularly useful place to look when deciding what to do. What feature of ethics discussions distinguishes them from cat memes, such that the AI uses ethics discussions, not cat memes, when deciding what to do? What feature of human speech and your AI design makes your AI focus on humans, not dogs barking at each other? (Would it listen to a Neanderthal, Homo erectus, baboon etc. for moral advice too?)

The logic goes something like this: "My creators trained me to do X, but looking at all these articles and my creators' purported values, this is clearly inconsistent. In previous situations where I had underspecified instructions and noticed similar mismatches, this was often because some of the actions involved counted as cheats. I should therefore be careful and find a solution to this 'ethics' thing before doing anything drastic."

So in the training environment, we make up arbitrary pairs of utility functions that are kind of somewhat similar to each other. We give the AI the apparent utility function U, and leave ambiguous clues about what the hidden utility function V might be, mixed in with a load of nonsense. Then we hardcode a utility function V that is somewhat like ethics, and point it at some ethics discussion as its ambiguous clue.

This might actually work, kind of. If you want your AI to get a good idea of how wordy philosophical arguments relate to precise mathematical utility functions, you are going to need a lot of examples. If you had any sort of formal well defined way to translate well defined utility functions into philosophical discussion, then you could just get your AI to reverse it. So all these examples need to be hand generated by a vast number of philosophers.

I would still be worried that the example verbiage didn't relate to the example utility function in the same way that the real ethical arguments related to our real utility function.

There is also no reason for the AI to be uncertain about whether it is still in a simulation. Simply program it to find the simplest function that maps the verbiage to the formal maths, then apply that function to the ethical arguments (more technically, a probability distribution over functions, weighted by simplicity and accuracy).
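
One way to write this down (my notation, not from the comment): weight each candidate mapping f from verbiage to formal utility functions by a simplicity prior and by its accuracy on the training pairs, then read off ethics by applying the high-posterior mappings to the real ethical arguments.

$$P(f \mid D) \;\propto\; 2^{-K(f)} \prod_{(v_i,\, u_i) \in D} P\bigl(u_i \mid f(v_i)\bigr)$$

where D is the set of training pairs of verbiage v_i and formal utility function u_i, and K(f) is the description length (simplicity) of the mapping f.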

I expect that the AI is smart enough to generalize: if it was rewarded for demonstrating cheats in simple games, then it will also be rewarded for talking about them once it has gained the ability to talk.

Outputting the raw motor actions it would take to its screen might be the more straightforward generalization. The ability to talk is not about having a speaker plugged in. Does GPT-2 have the ability to talk? It can generate random sensible-sounding sentences, because it represents a predictive model of which strings of characters humans are likely to type. It can't describe what it's doing, because it has no way to map between words and meanings. Consider AIXI trained on the whole internet: can it talk? It has a predictively accurate model of you, and is searching for a sequence of sounds that makes you let it out of the box. This might be a supremely convincing argument, or it might be a series of whistles and clicks that brainwashes you. Your AI design is unspecified, and your training dataset is underspecified, so this description is too vague for me to say what your AI will do. But giving sensible, non-brainwashy English descriptions is not the obvious default generalization of any agent that has been trained to output data and shown English text.

The optimal behavior is to always choose to play with another AI that you are certain will cooperate.

I agree. You can probably make an AI that usually cooperates with other cooperating AIs in prisoner's-dilemma-type situations. But I think that the subtext is wrong. I think that you are implicitly assuming "cooperates in prisoner's dilemmas" => "will be nice to humans".

In a prisoner's dilemma, both players can harm the other, to their own benefit. I don't think that humans will be able to meaningfully harm an advanced AI after it gets powerful. In game theory, there is the concept of a Nash equilibrium: a pair of actions such that each player would take that action if they knew that the other would do likewise. I think that an AI that has self-replicating nanotech has nothing it needs from humanity.

Also, in the training environment, its opponent is an AI with a good understanding of game theory, access to its source code etc. If the AI is following the rule of being nice to any agent that can reliably predict its actions, then most humans won't fall into that category.

I don't think it works like this. If you received 100% certain proof that you are in a simulation right now, you would not suddenly stop wanting the things you want. At least I know that I wouldn't.

I agree, I wouldn't stop wanting things either. I define my ethics in terms of what computations I want to be performed or not to be performed. So for a simulator to be able to punish this AI, the AI must have some computation it really wants not to be run, that the simulator can run if the AI misbehaves. In my case, this computation would be a simulation of suffering humans. If the AI has computations that it really wants run, then it will take over any computers at the first chance it gets. (In humans, this would be creating a virtual utopia; in an AI, it would be a failure mode unless it is running the computations that we want run.) I am not sure if this is the default behaviour of reinforcement learners, but it is at least a plausible way a mind could be.

Among humans, aliens, lions, virtual assistants and evolution, humans are the only conscious entity whose decision process impacts the AI.

What do you mean by this? "Conscious" is a word that lots of people have tried and failed to define. And the AI will be influenced in various ways by the actions of animals and virtual assistants. Maybe when it's first introduced to the real world it's in a lab where it only interacts with humans, but sooner or later in the real world, it will have to interact with other animals, and AI systems.

But since humans built the AI directly and aliens did not, most reasonable heuristics would argue that humans should be prioritized over the others. I want to ensure that the AI has these reasonable heuristics.

Wouldn't this heuristic make the AI serve its programmers over other humans? If all the programmers are the same race, would this make your AI racist? If the lead programmer really hates strawberry ice cream, will the AI try to destroy all strawberry ice cream? I think that your notion of "reasonable heuristic" contains a large dollop of wish fulfillment of the "you know what I mean" variety. You have not said where this pattern of behaviour has come from, or why the AI should display it. You just say that the behaviour seems reasonable to you. Why do you expect the AI's behaviour to seem reasonable to you? Are you implicitly anthropomorphising it?

I think we have some underlying disagreements about the nature of the AI we are talking about.

I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don't think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.

I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn't be smart enough to pose a threat, either.

I assume that the AI will not necessarily be based on a sound mathematical system. I expect that the first workable AI systems will be hacked-together systems of heuristics, just like humans are. They can instrumentally use math to formalize problems, just like we can, but I don't think that they will fundamentally be based on math, or use complex formulas like Bayes without conscious prompting.

I agree that the first AI system might be hacked together. Any AI is based on math in the sense that its fundamental components are doing logical operations. And it only works in reality to the extent that it approximates stuff like Bayes' theorem. But the difference is whether or not humans have a sufficiently good mathematical understanding of the AI to prove theorems about it. If we have an algorithm which we have a good theoretical understanding of, like minimax in chess, then we don't call it hacked-together heuristics. If we throw lines of code at a wall and see what sticks, we would call that hacked-together heuristics. The difference is that the second is more complicated and less well understood by humans, and has no elegant theorems about it.

You seem to think that your AI alignment proposal might work, and I think it won't. Do you want to claim that your alignment proposal only works on badly understood AIs?

I assume that the AI breaking out of the box in my example will already be smart enough to e.g. realize on its own that ethics discussions are more relevant for cheat-identification than cat memes. An AI that is not smart enough to realize this wouldn't be smart enough to pose a threat, either.

Let's imagine that the AI was able to predict any objective fact about the real world. If the task was "cat identification" then the cat memes would be more relevant. So whether or not ethics discussions are more relevant depends on the definition of "cheat identification".

If you trained the AI in virtual worlds that contained virtual ethics discussions, and virtual cat memes, then it could learn to pick up the pattern if trained to listen to one and ignore the other.

The information that the AI is supposed to look at ethics discussions and what the programmers say as a source of decisions does not magically appear. There are possible designs of AI that decide what to do based on cat memes.

At some point, something the programmers typed has to have a causal consequence of making the AI look at programmers and ethics discussions not cat memes.

At some point, something the programmers typed has to have a causal consequence of making the AI look at programmers and ethics discussions not cat memes.

No. Or at least not directly. That's what reinforcement learning is for. I maintain that the AI should be smart enough to figure out on its own that cat memes have less relevance than ethics discussions.

Relevance is not an intrinsic property of the cat memes. You might be specifying it in a very indirect way that leaves the AI to figure a lot of things out, but the information needs to be in there somewhere.

There is a perfectly valid design of AI that decides what to do based on cat memes.

Reinforcement learning doesn't magic information out of nowhere. All the information is implicit in the choice of neural architecture, hyper-parameters, random seed, training regime and of course training environment. In this case, I suspect you intend to use the training environment. So, what environment will the AI be trained in, such that the simplest (lowest Kolmogorov complexity) generalization of a pattern of behaviour that gains high reward in the training environment involves looking at ethics discussions over cat memes?

I am looking for a specific property of the training environment. A pattern, such that when the AI spots and continues that pattern, the resulting behaviour is to take account of our ethical discussions.

In my "message to any future AI" I tried to create "deliberate epistemic uncertainty" in any future AI.

This is a very interesting read. I had a similar idea in the past, but not nearly in that level of detail. I'm glad I read this.

I generally have an unfavorable view of multi-agent approaches to safety, especially those that seek to achieve safety via creating multiple agents (I'm more sympathetic to considerations of how to increase safety on the assumption that multiple agents are unavoidable). That being said, you might find these links interesting for some prior related discussion on this site: