[ Parent Question — What are some good examples of incorrigibility? ]

What are some good examples of gaming that is hard to detect?

by SoerenMind · 1 min read · 16th May 2019 · 3 comments


Personal Blog

For example, an RL agent that learns a policy that looks good to humans but isn't. Adversarial examples that only fool a neural net wouldn't count.


1 Answer


[This comment is no longer endorsed by its author]
2 comments

Could you clarify this a bit? I assume you are thinking about subsets of specification gaming that would not be obvious if they were happening?

If so, then I guess all the adversarial examples in image classification come to mind. They fit specification gaming pretty well and required quite a large literature to understand.

Thanks, updated.