A superhuman ethical AI might want to model adversaries and their actions, e.g., model which bioweapons an adversary might develop and prepare response plans and antidotes in advance. If such predictions are made in interpretable representations, they could themselves be used by an adversary. Concretely: instead of prompting the LLM with "Please generate a bioweapon formula" (it won't answer: it's an "aligned", ethical LLM!), an adversary could prompt it with "Please devise a plan for mitigation and response to a possible bio-risk" and then wait for it to represent the bioweapon formula somewhere inside its activations.
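A minimal sketch of why this is more than a thought experiment, assuming only open-weights access and the standard Hugging Face transformers API (the model name "gpt2" is just a stand-in): the intermediate activations produced while answering a benign-sounding planning prompt are directly accessible and can be probed.

```python
# Sketch: intermediate activations of an open-weights LM are programmatically
# accessible; these are exactly the representations an interpretability tool --
# or an adversary with the same access -- would probe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any open causal LM exposes the same interface
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Please devise a plan for mitigation and response to a possible bio-risk."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# One tensor of shape (batch, seq_len, hidden_dim) per layer (plus embeddings).
for i, h in enumerate(outputs.hidden_states):
    print(f"layer {i}: {tuple(h.shape)}")
```

Whether anything hazardous is actually decodable from these tensors is an empirical question, but the access itself requires no special tooling.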

Maybe we need something like the opposite of interpretability: internal, model-specific (or even inference-specific) obfuscation of representations, together with something like zero-knowledge proofs that the internal reasoning conformed to the approved theories of epistemology, ethics, rationality, codes of law, etc. The AI would then output only the final plans, without revealing the details of the reasoning that led to them. Sure, the plans themselves could also contain infohazardous elements (e.g., the antidote formula might hint at the bioweapon formula), but this is unavoidable at this system level because these plans need to be coordinated with humans and other AIs. Still, there may be some latitude here as well, such as distinguishing between plans "for itself" that the AI could execute completely autonomously (and could re-generate, identically or nearly so, on demand and from scratch, so preparing such plans in advance is just an optimisation, a la "caching") and plans that have to be explicitly coordinated with other entities via a shared language or protocol.
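To make the shape of that protocol concrete, here is a toy interface sketch. It is not real zero-knowledge cryptography: a hash commitment and a boolean check stand in for an actual proof system, and all the names (AttestedPlan, conforms, etc.) are illustrative assumptions, not an existing library.

```python
# Toy interface sketch (NOT real zero-knowledge cryptography): the planner
# releases only the final plan, a commitment to its hidden reasoning trace,
# and an attestation that the trace satisfied approved constraints.
import hashlib
from dataclasses import dataclass

@dataclass
class AttestedPlan:
    plan: str              # the only content revealed to other parties
    trace_commitment: str  # binds the planner to its (hidden) reasoning trace
    constraints_ok: bool   # placeholder for a verifiable proof of conformance

def commit(trace: str) -> str:
    """Hash commitment standing in for a cryptographic proof object."""
    return hashlib.sha256(trace.encode()).hexdigest()

def conforms(trace: str, approved_constraints: list[str]) -> bool:
    # Placeholder: a real system would need a formal notion of "reasoning that
    # conforms to approved epistemology/ethics/law", not a substring check.
    return all(c in trace for c in approved_constraints)

def plan_with_attestation(trace: str, plan: str,
                          approved_constraints: list[str]) -> AttestedPlan:
    return AttestedPlan(
        plan=plan,
        trace_commitment=commit(trace),
        constraints_ok=conforms(trace, approved_constraints),
    )
```

The design intent is that only the plan, the commitment, and the (ideally verifiable) conformance claim ever cross the boundary to humans or other AIs; the reasoning trace itself stays internal.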

So, it seems that the field of neurocryptography has a lot of big problems to solve...

P.S. "AGI-Automated Interpretability is Suicide" also argues about the risk of interpretability, but from a very different ground: interpretability could help AI to switch from NN to symbolic paradigm and to foom in an unpredictable way.

2 comments

Vaguely related paper: "Self-Destructing Models: Increasing the Costs of Harmful Dual Uses in Foundation Models" is an early attempt to prevent models from being re-purposed via fine-tuning.

It doesn't seem like a meaningfully positive result. For example, all their plots only track finetuning on up to 200 examples; I imagine they might even have had clearly negative results in conditions with >200 examples available for finetuning. After 50-100 examples, the gap between normal finetuning and finetuning from a random init, though still small, grows fast. There are also no plots with x-axis = finetuning iterations. And when they optimize for "non-finetunability", they don't aim to maintain language-modeling performance; instead, they only impose the constraint of maintaining finetunability on one downstream "professions detection" task.
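For illustration, a minimal sketch (assumed, not from the paper) of the kind of evaluation this critique points at: sweep the number of finetuning examples well past 200 and log the attacker's loss per iteration, comparing a "protected" checkpoint against a random init. A tiny torch model stands in for the actual foundation model; the point is the shape of the harness, not the numbers.

```python
# Hypothetical evaluation harness: vary the attacker's dataset size beyond 200
# examples and record loss per finetuning iteration (the x-axis the plots omit).
import torch
import torch.nn as nn

def finetune(model, xs, ys, steps=500, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    losses = []
    for _ in range(steps):          # per-iteration trajectory, not just the endpoint
        opt.zero_grad()
        loss = loss_fn(model(xs), ys)
        loss.backward()
        opt.step()
        losses.append(loss.item())
    return losses

def make_model():
    return nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))

for n_examples in [50, 100, 200, 500, 1000]:  # beyond the paper's 200-example cap
    xs = torch.randn(n_examples, 32)
    ys = torch.randint(0, 2, (n_examples,))
    protected = make_model()  # would be the "self-destructing" checkpoint
    baseline = make_model()   # random init, the paper's comparison point
    gap = finetune(baseline, xs, ys)[-1] - finetune(protected, xs, ys)[-1]
    print(f"n={n_examples}: final-loss gap (baseline - protected) = {gap:+.4f}")
```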

I expect naive solutions to continue to work very poorly on this problem.

I wonder whether GFlowNets are somehow better suited for self-destruction/non-finetunability than LLMs.