A superhuman ethical AI might want to model adversaries and their actions, e.g., to predict which bioweapons an adversary might develop and to prepare response plans and antidotes. If such predictions are made in interpretable representations, an adversary could extract them. Concretely: instead of prompting an LLM with "Please generate a bioweapon formula" (it won't answer: it's an "aligned", ethical LLM!), the adversary prompts it with "Please devise a plan for mitigation and response to possible bio-risk" and then waits for it to represent the bioweapon formula somewhere inside its activations.

Maybe we need something like the opposite of interpretability: internal model-specific (or even inference-specific) obfuscation of representations, together with something like zero-knowledge proofs that the internal reasoning conformed to the approved theories of epistemology, ethics, rationality, codes of law, etc. The AI then outputs only the final plans without revealing the details of the reasoning that led to them. Sure, the plans themselves could also contain infohazardous elements (e.g., the antidote formula might hint at the bioweapon formula), but this is unavoidable at this system level because these plans need to be coordinated with humans and other AIs. There may be some latitude there as well, though, such as distinguishing between the plans "for itself" that the AI could execute completely autonomously (and could re-generate, identically or nearly so, on demand and from scratch, so that preparing such plans in advance is just an optimisation, à la "caching") and the plans that have to be explicitly coordinated with other entities via a shared language or a protocol.
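To make "inference-specific obfuscation" slightly more concrete, here is a toy sketch, not a proposal for a real scheme. It assumes a tiny two-layer MLP (all names, sizes, and weights are made up for illustration) and shuffles the hidden units with a fresh secret permutation, compensating in the next layer. The outputs are unchanged, but a probe that was fitted to specific hidden units now reads different ones. A permutation is of course nowhere near cryptographic obfuscation: anyone with the weights or enough activations can fit a new probe.

```python
import math
import random

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def forward(W1, W2, x):
    """Toy two-layer MLP; returns (hidden activations, output)."""
    h = relu(matvec(W1, x))
    return h, matvec(W2, h)

def obfuscate(W1, W2, rng):
    """Shuffle hidden units with a fresh secret permutation (illustrative only)."""
    perm = list(range(len(W1)))
    rng.shuffle(perm)
    W1_p = [W1[p] for p in perm]                   # permute rows of layer 1
    W2_p = [[row[p] for p in perm] for row in W2]  # permute matching columns of layer 2
    return W1_p, W2_p, perm

# Hypothetical sizes and random weights, chosen only for the demonstration.
rng = random.Random(0)
n_in, n_hidden, n_out = 4, 8, 3
W1 = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
W2 = [[rng.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
x = [rng.uniform(-1, 1) for _ in range(n_in)]

h, y = forward(W1, W2, x)
W1_p, W2_p, perm = obfuscate(W1, W2, rng)
h_p, y_p = forward(W1_p, W2_p, x)

# Same output (up to float rounding), but hidden unit j now holds old unit perm[j].
same_output = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(y, y_p))
print("outputs match:", same_output)
print("hidden layout changed:", h != h_p)
```

The permutation works here only because it commutes with the elementwise ReLU; a general rotation of the hidden space would not, which hints at how constrained function-preserving obfuscation already is in this toy setting.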

So, it seems that the field of neurocryptography has a lot of big problems to solve...

P.S. "AGI-Automated Interpretability is Suicide" also argues that interpretability is risky, but on very different grounds: interpretability could help an AI switch from the NN paradigm to a symbolic one and foom in an unpredictable way.


Vaguely related paper: "Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models" is an early attempt to prevent models from being re-purposed via fine-tuning.

It doesn't seem like a meaningfully positive result. For example, all their plots track finetuning on at most 200 examples. I imagine they might even have had clear negative results in conditions with >200 examples available for finetuning. After 50-100 examples, the gap between normal finetuning and finetuning from random init, though still small, grows fast. There are also no plots with finetuning iterations on the x-axis. When they optimize for "non-finetunability", they don't aim to maintain language modeling performance; instead, they only impose the constraint of "maintaining finetunability" on one downstream "professions detection" task.

I expect naive solutions to continue to work very poorly on this problem.

I wonder whether GFlowNets are somehow better suited for self-destruction/non-finetunability than LLMs.