Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by Jsevillamol.

From OpenAI Five's blogpost:

We’re still fixing bugs. The chart shows a training run of the code that defeated amateur players, compared to a version where we simply fixed a number of bugs, such as rare crashes during training, or a bug which resulted in a large negative reward for reaching level 25. It turns out it’s possible to beat good humans while still hiding serious bugs!

A common line of thought is that goals are very brittle: small misspecifications get amplified by optimization.

Yet OpenAI Five managed to wrangle good performance out of a seriously buggy reward function.

Hardly conclusive, but it would be interesting to see more examples of this. One could also run deliberate experiments to see how much a reward function can be distorted before behaviour breaks.
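A minimal version of that experiment can be run in a toy setting. The sketch below is nothing like OpenAI's actual setup: the chain environment, the tabular Q-learning hyperparameters, and the per-state noise model standing in for "bugs" are all invented for illustration. The idea is just to train against increasingly distorted copies of a reward function and check when the greedy policy stops reaching the goal.

```python
import random

def train_q(reward_fn, episodes=2000, n=8, alpha=0.5, gamma=0.95, eps=0.2, seed=0):
    """Tabular Q-learning on a 1-D chain: start at state 0, goal at n-1.
    Actions: 0 = left, 1 = right. reward_fn(state) gives the (possibly
    distorted) reward for entering a state."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n)]
    for _ in range(episodes):
        s = 0
        for _ in range(4 * n):
            # Epsilon-greedy action selection; ties break toward "right".
            a = rng.randrange(2) if rng.random() < eps else (0 if Q[s][0] > Q[s][1] else 1)
            s2 = max(0, s - 1) if a == 0 else min(n - 1, s + 1)
            r = reward_fn(s2)
            done = (s2 == n - 1)
            target = r if done else r + gamma * max(Q[s2])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
            if done:
                break
    return Q

def greedy_steps(Q, n=8, limit=100):
    """Steps the greedy policy takes to reach the goal (limit means failure)."""
    s, steps = 0, 0
    while s != n - 1 and steps < limit:
        a = 0 if Q[s][0] > Q[s][1] else 1
        s = max(0, s - 1) if a == 0 else min(n - 1, s + 1)
        steps += 1
    return steps

def true_reward(s):
    return 1.0 if s == 7 else 0.0

def distorted(scale, seed=1):
    """True reward plus fixed per-state noise of the given scale --
    a crude stand-in for a 'buggy' reward function."""
    rng = random.Random(seed)
    noise = [rng.uniform(-scale, scale) for _ in range(8)]
    return lambda s: true_reward(s) + noise[s]

for scale in (0.0, 0.05, 0.5):
    Q = train_q(distorted(scale))
    print(f"noise scale {scale}: greedy policy needs {greedy_steps(Q)} steps")
```

Sweeping the noise scale upward gives a crude "distortion budget": the point where the step count jumps to the failure limit is where the misspecification finally breaks the learned behaviour.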

Some other factors which might be relevant:

  • The agents play as a team.**
  • I think there were a lot of shaped rewards (including a setup that encouraged the heroes to act independently early in the game and to team up more later, as they become more powerful).*
  • Also, their rewards might be linked; I'm not sure whether an agent's reward goes down when a teammate is killed.

*It's been a while since I read through all their Dota material, so there could be other factors, but I do remember that one motivation was the sheer size of the game: reward shaping helped the agents do reasonable things faster, giving them something like the pieces to assemble into a strategy.
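If I recall correctly, OpenAI's write-ups describe annealing a "team spirit" parameter that blends each hero's own reward with the team average: near zero early in training (mostly selfish credit assignment) and near one later (rewards effectively pooled). A minimal sketch of that blending, with the function name and example numbers my own:

```python
def blend_team_rewards(individual_rewards, team_spirit):
    """Blend each agent's own reward with the team mean.
    team_spirit = 0 -> fully selfish, 1 -> fully shared."""
    mean_r = sum(individual_rewards) / len(individual_rewards)
    return [(1 - team_spirit) * r + team_spirit * mean_r
            for r in individual_rewards]

# Early in training: credit mostly stays with the hero who earned it.
print(blend_team_rewards([1.0, 0.0, 0.0], team_spirit=0.2))
# Late in training: rewards are nearly pooled across the team.
print(blend_team_rewards([1.0, 0.0, 0.0], team_spirit=0.9))
```

Under this scheme the rewards are literally linked at high team spirit, which would bear on the question above of whether one agent's reward drops when a teammate does badly.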

** I don't think they communicate through a dedicated channel such as chat. Whether they use the in-game alert pings (the circles, exclamation marks, and sounds) would also be relevant.