Which goals actually motivate deceptive alignment?
Introduction

The classical argument that misaligned models will be deceptively aligned depends on two assumptions about the misaligned goal.[1]

1. The goal is non-indexical. If the weights encoding the misaligned goal are preserved, then successor instances, including the deployed instances, share the goal of the training instance. This...