TL;DR
Which misgeneralized goal an agent learns can depend on the training random seed alone, including rare outliers on the order of 1 in 500 runs.
Abstract
We explore the colour-versus-shape goal misgeneralization originally demonstrated
by Di Langosco et al. (2022) in the Procgen Maze environment, where, given
an ambiguous choice, the agents seem to prefer generalization based on colour
rather than shape. After training over 1,000 agents in a simplified version of the
environment and evaluating them on over 10 million episodes, we conclude that the
behaviour can be attributed to the agents learning to detect the goal object through
a specific colour channel. This choice is arbitrary. Additionally, we show how,
due to underspecification, the preferences can change when retraining the agents
using exactly the
...