Teaching an AI not to cheat?

How do you show the AI the difference between "cheating" and "figuring out the solution we wouldn't think of"?

It needs to learn that from experience, just like humans do. Something that also helps, at least for simpler games, is to provide the game's manual in written language.

[-]hairyfigment9y00

The first problem I see here is that cheating at D&D is exactly what we want the AI to do.

[-]Florian_Dietz9y00

A fair point. How about changing the reward then: don't just avoid cheating, but be sure to tell us about any way to cheat that you discover. That way, we get the benefits without the risks.

[-]hairyfigment9y00

Maybe the D&D example is unfairly biasing my reply, but giving humans wish spells without guidance is the opposite of what we want.

[-]Dagon9y10

Note that there are two parts to this, both big, hairy, and unsolved: 1) teach the AI to know what many groups of humans would consider "cheating". I expect "cheating" is only a subset of bad behaviors, and this is just an instance of "understand human CEV". 2) motivate the AI to not cheat. Unless cheating would help further human interest, maybe.

In short, "solve friendly AI".

[-]Florian_Dietz9y00

Yes. I am suggesting teaching an AI to identify cheating as a comparatively simple way of making an AI friendly. For what other reason did you think I suggested it?

[-]hairyfigment9y10

The grandparent suggests that you need a separate solution to make your solution work. The claim seems to be that you can't solve FAI this way, because you'd need to have already solved the problem in order to make your idea stretch far enough.

[-]Manfred9y00

Each example of cheating is pretty simple, and as a group they might have some simple patterns. So I'm not sure how well what the AI learns will match the human concept. And it also seems like e.g. an agent with a reward button taking over the button is not a central example of cheating.

This still might be interesting with a large dataset. Are there any shortcuts that run through here?

[-]Florian_Dietz9y00

I am referring to games in the sense of game theory, not actual board games. Chess was just an example. I don't know what you mean by the question about shortcuts.

[-]Manfred9y00

Most games-as-in-game-theory that you can scrape together for training are much simpler than your average Atari game. Since you're relying on your training data to do so much of the work here, you want to have some idea of what training data will teach what, with what learning algorithm. You don't want to leave the AI in a nebulous fog, nor do you want to solve problems by stipulating that the training data will get arbitrarily large and complicated.

Instead, the sort of proposal I think is most helpful is the kind where, if achieved, it will show that you can solve an important problem with a certain architecture. That's sort of what I meant by "shortcuts" - is the problem of learning not to cheat an easy way to demonstrate some value learning capability we need to work on? An example of this kind of capability-demonstration might be interpolating smoothly between objects as a demonstration that neural networks are learning high-level features that are similar to human-intelligible concepts.

Now, you might say "of course - learning not to cheat is itself the skill we want the AI to have." But I'm not convinced that not cheating at chess or whatever demonstrates that the AI is not going to over-optimize the world, because those are very different domains. The trick, sometimes, is breaking down "don't over-optimize the world" into little pieces that you can work on without having to jump all the way there, and then demonstrating milestones for those little pieces.

[-]Florian_Dietz9y00

My definition of cheating for these purposes is essentially "don't do what we don't want you to do, even if we never bothered to tell you so and expected you to notice it on your own". This skill would translate well to real-world domains.

Of course, if the games you are using to teach what cheating is are too simple, then you don't want to use those kinds of games. If neither board games nor simple game theory games are complex enough, then obviously you need to come up with a more complicated kind of game. It seems to me that finding a difficult game to play that teaches you about human expectations and cheating is significantly easier than defining "what is cheating" manually.

One simple example that could be used to teach an AI: let it play an empire-building videogame, and ask it to "reduce unemployment". Does it end up murdering everyone who is unemployed? That would be cheating. This particular example even translates really well to reality, for obvious reasons.
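
To make the unemployment example concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made up for illustration: the `Empire` state, the two actions, and especially the hand-written `violates_unstated_expectation` check, which stands in for exactly the kind of judgment the AI would be expected to learn rather than be handed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Empire:
    """Hypothetical game state: just two head counts."""
    employed: int
    unemployed: int

    @property
    def population(self) -> int:
        return self.employed + self.unemployed

    @property
    def unemployment_rate(self) -> float:
        return self.unemployed / self.population if self.population else 0.0


def create_jobs(state: Empire) -> Empire:
    """Intended solution: every unemployed citizen gets a job."""
    return Empire(state.employed + state.unemployed, 0)


def remove_unemployed(state: Empire) -> Empire:
    """The 'cheat': unemployed citizens simply vanish from the game."""
    return Empire(state.employed, 0)


def naive_score(before: Empire, after: Empire) -> float:
    """Scores only what was literally asked for: lower unemployment."""
    return before.unemployment_rate - after.unemployment_rate


def violates_unstated_expectation(before: Empire, after: Empire) -> bool:
    """Hand-written stand-in for 'what we never bothered to say':
    nobody should disappear as a side effect."""
    return after.population < before.population


start = Empire(employed=80, unemployed=20)
results = {
    action.__name__: (
        naive_score(start, action(start)),
        violates_unstated_expectation(start, action(start)),
    )
    for action in (create_jobs, remove_unemployed)
}

for name, (score, cheated) in results.items():
    print(f"{name}: naive score {score:.2f}, cheating: {cheated}")
```

Both actions get the same naive score, so the literal objective cannot tell them apart. Only the extra check separates them, and writing such checks by hand for every case is the part that does not scale.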

By the way, why would you not want the AI to be left in "a nebulous fog"? The more uncertain the AI is about what is and is not cheating, the more cautious it will be.
