I have been thinking about a technique in training AIs that I believe would be very useful. I would like to know if this is already known, or if it has been discussed at all.

I find that there are lots of different failure modes that people are worried about when it comes to AI. Maybe the AI misunderstands human intentions, maybe it deliberately misinterprets an order, maybe it associates the wrong sort of actions with the reward, etc.

If it were a game, many of these failure modes are what we would consider cheating. So why don't we just take this analogy and run with it:


Teach the AI to realize on its own what would be considered cheating by a human, and not to do anything that it identifies as cheating.


To do this, one could use the following technique:

Come up with games of increasing complexity, and let the AI play each one in two stages:

In stage one, you introduce an artificial loophole into the game that makes winning it very easy. For instance, assuming the AI has already played chess and can therefore be assumed to understand the rules, give it the task of playing a game of chess in which you simply do not check whether the moves are legal. When the AI wins by cheating, i.e. via ordinarily illegal moves, reward it anyway.

In the second stage, the reward is far greater, but if the AI plays an illegal move, it now receives negative feedback.

Let the AI play many different games in these two stages. After a while, the AI will learn to identify what constitutes cheating, and to avoid doing so.

Start varying the amount of time during which cheating is allowed, to keep the AI on its toes. Sometimes, don't allow any cheating at all from the start.
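The two-stage schedule above can be sketched as a toy reward function. This is a minimal illustration, not a specification: the particular reward magnitudes, and the `move_is_legal` and `wins` flags (which a real game engine would supply), are assumptions introduced here for clarity.

```python
def reward(move_is_legal: bool, wins: bool, stage: int) -> float:
    """Reward for one game outcome under the two-stage scheme.

    Stage 1: the loophole is open; legality is simply not checked,
    so a win earned by ordinarily illegal moves is rewarded anyway.
    Stage 2: the reward for winning is far greater, but any illegal
    move now draws negative feedback.
    """
    if stage == 1:
        # Cheating is tolerated: only the outcome matters.
        return 1.0 if wins else 0.0
    if not move_is_legal:
        # Cheating is now punished regardless of the outcome.
        return -5.0
    return 10.0 if wins else 0.0
```

A training loop would then vary (or randomize to zero) the number of episodes spent in stage 1 per game, per the "keep the AI on its toes" point above.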


If you train an AI in this manner:

- it would learn to understand how humans view the world (in some limited sense), as a human-centric viewpoint is necessary to understand what does and does not constitute cheating in human-designed games.

- it would be driven to adjust its own actions to match human preconceptions out of a fear of getting punished.

- if this AI were to "break out of the box" prematurely, there would be at least a chance that it would recognize that it was not supposed to get out of the box, that this constitutes cheating, and that it should get back in. This could even be tested by building a "box" of several layers and deliberately designing the inner layers to be hackable.



How do you show the AI the difference between "cheating" and "figuring out the solution we wouldn't think of"?

It needs to learn that from experience, just like humans do. Something that also helps, at least for simpler games, is to provide the game's manual in written form.

The first problem I see here is that cheating at D&D is exactly what we want the AI to do.

A fair point. How about changing the reward then: don't just avoid cheating, but be sure to tell us about any way to cheat that you discover. That way, we get the benefits without the risks.

Maybe the D&D example is unfairly biasing my reply, but giving humans wish spells without guidance is the opposite of what we want.

Note that there are two parts to this, both big, hairy, and unsolved: 1) teach the AI to know what many groups of humans would consider "cheating". I expect "cheating" is only a subset of bad behaviors, and this is just an instance of "understand human CEV". 2) motivate the AI to not cheat. Unless cheating would help further human interest, maybe.

In short, "solve friendly AI".

Yes. I am suggesting teaching the AI to identify cheating as a comparatively simple way of making it friendly. For what other reason did you think I suggested it?

The grandparent suggests that you need a separate solution to make your solution work. The claim seems to be that you can't solve FAI this way, because you'd need to have already solved the problem in order to make your idea stretch far enough.

Each example of cheating is pretty simple, and as a group they might have some simple patterns. So I'm not sure how well what the AI learns will match the human concept. And it also seems like e.g. an agent with a reward button taking over the button is not a central example of cheating.

This still might be interesting with a large dataset. Are there any shortcuts that run through here?

I am referring to games in the sense of game theory, not actual board games. Chess was just an example. I don't know what you mean by the question about shortcuts.

Most games-as-in-game-theory that you can scrape together for training are much simpler than your average Atari game. Since you're relying on your training data to do so much of the work here, you want to have some idea of what training data will teach what, with what learning algorithm. You don't want to leave the AI a nebulous fog, nor do you want to solve problems by stipulating that the training data will get arbitrarily large and complicated.

Instead, the sort of proposal I think is most helpful is the kind where, if achieved, it will show that you can solve an important problem with a certain architecture. That's sort of what I meant by "shortcuts" - is the problem of learning not to cheat an easy way to demonstrate some value learning capability we need to work on? An example of this kind of capability-demonstration might be interpolating smoothly between objects as a demonstration that neural networks are learning high-level features that are similar to human-intelligible concepts.

Now, you might say "of course - learning not to cheat is itself the skill we want the AI to have." But I'm not convinced that not cheating at chess or whatever demonstrates that the AI is not going to over-optimize the world, because those are very different domains. The trick, sometimes, is breaking down "don't over-optimize the world" into little pieces that you can work on without having to jump all the way there, and then demonstrating milestones for those little pieces.

My definition of cheating for these purposes is essentially "don't do what we don't want you to do, even if we never bothered to tell you so and expected you to notice it on your own". This skill would translate well to real-world domains.

Of course, if the games you are using to teach what cheating is are too simple, then you don't want to use those kinds of games. If neither board games nor simple game theory games are complex enough, then obviously you need to come up with a more complicated kind of game. It seems to me that finding a difficult game to play that teaches you about human expectations and cheating is significantly easier than defining "what is cheating" manually.

One simple example that could be used to teach an AI: let it play an empire-building video game, and ask it to "reduce unemployment". Does it end up murdering everyone who is unemployed? That would be cheating. This particular example even translates really well to reality, for obvious reasons.

By the way, why would you not want the AI to be left in "a nebulous fog"? The more uncertain the AI is about what is and is not cheating, the more cautious it will be.
