Mesa-optimizers potentially escape behavioral adversarial training by realizing they're being tested, gaming the test, then defecting when they're deployed. Luke Muelhauser recounts an example of such a treacherous turn in the wild.
Most relevant quote:
As part of a project studying the evolution of (simulated) organisms, computer scientist Charles Ofria wanted to limit the replication rate of a digital organism. So, he programmed the system to pause after each mutation, measure the mutant’s replication rate in an isolated test environment, and delete the mutant if it replicated faster than its parent. However, the organisms evolved to recognize when they were in the test environment and “play dead” (pause replication) so they would not be eliminated and instead be kept in the population where they could continue to replicate outside the test environment. Once he discovered this, Ofria randomized the inputs of the test environment so that it couldn’t be so easily detected, but the organisms evolved a new strategy, to probabilistically perform tasks that would accelerate their replication, thus slipping through the test environment some percentage of the time and continuing to accelerate their replication thereafter.
The full story is given in the original post.