Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a special post for quick takes by VojtaKovarik. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

New to LessWrong?

5 comments, sorted by Click to highlight new comments since: Today at 1:29 AM

[% of loss explained] isn't a good interpretability metric [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.

Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actually using X will still kill you.

For [% of loss explained] to be a useful metric [edit: robust for detecting misalignment], it would need to explain most of the loss on inputs that actually matter. And since we fundamentally can't tell which ones those are, the metric will only be useful (for detecting misaligned superintelligences) if we can explain 100% of loss on all possible inputs.

The main use of % loss recovered isn't to directly tell us when a misaligned superintelligence will kill you. In interpretability we hope to use explanations to understand the internals of a model, so the circuit we find will have a "can I take over the world" node. In MAD we do not aim to understand the internals, but the whole point of MAD is to detect when the model has new behavior not explained by explanations and flag this as potentially dangerous.

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.

For algorithmic tasks where humans just know an algorithm which performs well, I think you need to use something like causal scrubbing which checks the correspondence.

A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.

Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system."

I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.

Agreed.

That said, if you train an AI on some IID training dataset and then explain 99.9% of loss validated as fully corresponding (via something like causal scrubbing), then you probably understand almost all the interesting stuff that SGD put into the model.

You still might die because you didn't understand the key 0.1% or because some stuff was put into the model other than via SGD (e.g. gradient hacking or someone put in a backdoor).

Typical stories of deceptive alignment imply that to explain 99.9% of loss with a truely human understandable explanation, you'd probably have to explain the key AI machinery to a sufficient extent that you can understand if the AI is deceptively aligned (as the AI is probably doing reasoning about this on a reasonably large fraction of inputs).