Thanks to Tomas Gavenciak, Clem Von Stengel, and Mihaly Barasz for many conversations about the Waluigi Effect.
Four thoughts, to be exact. After recapping relevant parts of the Waluigi Effect, I suggest that it's preferable to frame a waluigi as performing a human-intuitive opposite rather than an exact opposite, as this gives concerned model owners a clearer picture of what to expect when debugging. I then suggest that the proportion of an LLM's prior over the space of simulacra taken up by waluigis and the competence of waluigis at their undesirable property are distinct measures, and consider whether we should expect waluigi competence to increase with optimization for the desirable property. Following that...