From the Abstract section:
I show that yes, activation oracle performance suffers, but it's not very consistent, and they suffer less than probes.
From the Result section:
Notably, the oracle suffers more than the probes.[8]
I am confused here: aren't these two statements contradicting each other?
Notably, the oracle suffers more than the probes.
Probes are more selective, more specific, and less complex than the oracle. So my guess is that it is easier for neural chameleons to directly evade the probes and manipulate the activations. I agree that the "size" of the oracle could be a reason as well. Another idea is that maybe, for the oracles, more robust neural chameleons would be required.
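To make that intuition concrete, here is a minimal, purely illustrative sketch (my own toy, not the post's code); the sizes, the `oracle_readout` stand-in, and the gradient step are all hypothetical:

```python
# Purely illustrative sketch of my intuition, not the post's code: a linear
# probe is one learned direction over the activations, so an evader only has
# to cross a single hyperplane, while an oracle-style reader is a much larger
# nonlinear function of the same activations. All names and sizes here
# (d_model, oracle_readout, the gradient step) are hypothetical.
import torch
import torch.nn as nn

d_model = 4096

# Linear probe: d_model + 1 parameters, one decision hyperplane.
probe = nn.Linear(d_model, 1)

# Toy stand-in for an activation oracle: a deeper readout over the same
# activations, used here only to illustrate the size/complexity gap.
oracle_readout = nn.Sequential(
    nn.Linear(d_model, d_model), nn.GELU(),
    nn.Linear(d_model, d_model), nn.GELU(),
    nn.Linear(d_model, 1),
)

acts = torch.randn(8, d_model)  # fake residual-stream activations
print("probe params: ", sum(p.numel() for p in probe.parameters()))
print("oracle params:", sum(p.numel() for p in oracle_readout.parameters()))

# "Evading" the probe in this toy: one gradient step on the activations that
# lowers the probe's detection score, i.e. stepping off its hyperplane. The
# same trick against the deeper readout faces a far more complicated surface.
acts.requires_grad_(True)
score = torch.sigmoid(probe(acts)).mean()
score.backward()
evasive_acts = (acts - 1.0 * acts.grad).detach()
```

Of course, as I understand the post, the chameleons are trained to evade rather than gradient-hacked at inference time, so this toy only illustrates why a single linear direction looks like a smaller target than a larger nonlinear reader.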
The gap between AI safety and mechanistic interpretability, both conceptual and methodological, is huge. Most mechanistic interpretability techniques give insights into model internals, but these are not scalable, let alone applicable in production. Conversely, many AI safety methods (such as post-training alignment methods) treat models as black boxes.
Also, current safety problems such as inner alignment, goal drift, long-horizon misalignment, etc. are mostly framed in behavioural terms rather than mechanistic ones. This limits the ability of interpretability research to contribute directly to safety. So, reframing AI safety problems from a mechanistic perspective could help mechanistic interpretability move from explanatory analysis to a tool for monitoring and intervening in safety-relevant model behaviours. I am happy to see that this shift is slowly happening.