Lurking in the Noise

by J Bostock
25th Jun 2025
5 min read

Comments (2)

tailcalled:

Pathogenic bacteria aren't actually optimized to kill you, because they need to exploit you. The pathogenic element occurs in order to get you to spread the bacterium to other hosts.

J Bostock:

That is a fair point.

The Dead Biologist

Suppose you're a biologist working with a bacterium. It's a nasty little bug, so you want to make sure your disinfecting protocol is effective. You have a rather good model of how the bacterium works, and it goes like this:

  • The bacterium contains DNA, which is like a master copy of the instructions as to how it works
  • This DNA is transcribed to RNA, which are little working copies of parts of the instructions and are disseminated throughout the bacterium
  • This RNA is translated into proteins, which actually do the stuff that makes the bacterium function

Yes, there are a few special cases where RNA actually does something, and your model seems to be missing some things: for one, the amount of protein produced from each RNA molecule seems to vary, and you can only account for about 90% of this variance; the other 10% appears to be noise. But overall it's really very good.

You test every protein the bacterium is producing. It has proteins that sense and respond to cold, to acid, to low oxygen, to desiccation, but not to heat. You notice that none of the bacterium's proteins work above 40 °C (around 104 °F), so you heat everything up past 40 °C to sterilize it.

The bacterium survives, and kills you.

The Thing in the Noise

What happened? The bacterium transcribed an RNA which didn't get translated into protein. It was folded into a little knot. Above 40 °C, this knot wiggled open and the RNA got translated into a protein, which went and got some more proteins produced, which protected the bacterium from heat. Your model didn't account for RNA having complex structure, which was also the source of the 10% unexplained variance in protein yield from earlier.

"How unfair!" you lament. "What are the chances that, in that little 10% noise in my model, there was a thing that killed me?"

Quite high, actually. A pathogenic bacterium is a thing which has evolved to kill you. Every single part of it has been under optimization pressure to kill you. And furthermore, noise exists in the map, not the territory.
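
As a minimal numerical sketch of that last point (all of the numbers, the "fraction folded" variable, and the linear fit below are invented for illustration): simulate protein yield that mostly tracks RNA abundance but is also modulated by a hidden variable, fit the simple "yield tracks RNA" model, and look at what the leftover variance actually contains.

```python
import numpy as np

# Toy sketch (made-up numbers): the biologist's model says protein yield
# tracks RNA abundance. In this simulation a hidden variable (the fraction
# of the RNA folded into a knot) also modulates translation, so the
# "unexplained" variance isn't random noise -- it is structure in the
# territory that the map left out.

rng = np.random.default_rng(1)
n = 1000
rna = rng.uniform(1.0, 10.0, size=n)        # RNA copies per cell
folded = rng.uniform(0.0, 1.0, size=n)      # hidden variable: fraction folded

protein = rna * (1.0 - 0.4 * folded)        # folding suppresses translation

# The biologist's model: protein yield is a linear function of RNA abundance.
slope, intercept = np.polyfit(rna, protein, 1)
residual = protein - (slope * rna + intercept)

explained = 1.0 - residual.var() / protein.var()
print(f"variance explained by the simple model: {explained:.0%}")
print("correlation of the 'noise' with the hidden variable:",
      round(float(np.corrcoef(residual, folded)[0, 1]), 2))
```

The simple fit explains roughly ninety percent of the variance, and the residual correlates strongly with the hidden folding variable: the "noise" was never noise in the bacterium, only in the model.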

You can think of a bacterium in two ways, and live: you can think of it as a blob which is optimized to kill you; or you can think of it as a set of clean abstractions which are optimized to kill you, plus some noise which is also optimized to kill you. If you try to think of it as the clean abstractions which are optimized to kill you, plus some noise which is unimportant, it will kill you.

Maybe if you designed a cell, you wouldn't use RNA to sense temperature. Maybe for your design, the RNA structure really would have just been a +/-10% modifier on protein yield. Maybe that approach is even more elegant, in a way. But evolution didn't do that. Evolution stumbled around blindly in RNA-space until it found a way to do the job of temperature sensing using a hack.

By using properties of the RNA itself, the bacterium avoids having to produce dedicated temperature-sensing proteins, so it can do the job more cheaply. Most stimuli (like acid, desiccation, etc.) can't be sensed by RNA alone, so they aren't.

Heavy Pressure

Bacteria are under heavy optimization pressure, but viruses are under heavier pressure still. As expected, we see even weirder hacks in viruses:

  • Genes which are sometimes translated fully, and sometimes translated only half-way, with each protein doing a different function (such as two different parts of the RNA replication process)
  • Genes which can be read both forwards and backwards, with the two readings performing different functions
  • Genes where the translation machinery slips by one 'base' (A, G, C, or T), which is kind of like shifting your binary code over by half a byte, since the bases are read in groups of three. Somehow, this results in a different functional protein (a toy sketch of this frameshift follows the list).
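
To make that last hack concrete, here is a toy sketch of a frameshift (the RNA string and the miniature codon table are made up for illustration; a real genetic code has 64 codons, and the letters are RNA bases, so U appears in place of T):

```python
# Toy illustration of a +1 frameshift: the same RNA string, read in codons
# (groups of three bases), yields a completely different "protein" once the
# reading frame slips by one base. The sequence and the mini codon table
# below are invented for illustration.

CODON_TABLE = {
    "AUG": "M", "GCU": "A", "GAA": "E", "UUU": "F",
    "CUG": "L", "AAG": "K", "GGC": "G", "UAA": "*",  # '*' marks a stop codon
}

def translate(rna: str, offset: int = 0) -> str:
    """Read the RNA in codons starting at `offset`, mapping each codon to a
    one-letter amino-acid code ('X' if the codon isn't in the toy table)."""
    protein = []
    for i in range(offset, len(rna) - 2, 3):
        amino_acid = CODON_TABLE.get(rna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

rna = "AUGGCUGAAUUUCUGAAGGGCUAA"
print(translate(rna, offset=0))  # in-frame reading
print(translate(rna, offset=1))  # after a +1 slip: an entirely different protein
```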

And yet, there are things in the world under even heavier pressure still: LLMs and other deep learning models. What do we see there? Similar hacks: LayerNorm being used for computation, and so on.

Almost-All is not All

Mechanistic interpretability techniques can now explain almost all of a model's behaviour, at least in principle. Transcoders seem able to build up some very in-depth circuits.

Unfortunately, almost-all is not all. There are still unexplained dynamics. Those massive meshes of transcoders still need little bits of adjustment from the base model. And that noise will be optimized.

Going from a model to a sparse coding system means unravelling the dimensions. We might unravel eight features from three dimensions, and split them apart cleanly. But imagine rolling them back up. This re-introduces interference between the features. Noise. If the eight feature directions sit at the vertices of a symmetrical square antiprism, there are a lot of ways to assign features to vertices[1], and each assignment will lead to a different pattern of interference. Which one will the model pick? The one where the interference is beneficial to its training objective. The noise in your model is still being optimized.[2]
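
Here is a minimal numerical sketch of that rolling-up step (the square-antiprism coordinates and the eight-features-in-three-dimensions setup are my own toy construction, not taken from any real model): eight unit "feature" directions packed into three dimensions, and the interference the packing forces between them.

```python
import numpy as np

# Toy sketch: pack eight "feature" directions into three dimensions at the
# vertices of a square antiprism, then look at the interference this forces.
# The specific radius and height below are arbitrary illustrative choices.

def square_antiprism_directions(ring_radius=1.0, half_height=0.6):
    """Unit vectors pointing at the 8 vertices of a square antiprism:
    one square of vertices at z = +h, a second square rotated by 45 degrees
    at z = -h."""
    vertices = []
    for k in range(4):
        angle = np.pi / 2 * k
        vertices.append([ring_radius * np.cos(angle),
                         ring_radius * np.sin(angle), half_height])
    for k in range(4):
        angle = np.pi / 2 * k + np.pi / 4  # offset ring
        vertices.append([ring_radius * np.cos(angle),
                         ring_radius * np.sin(angle), -half_height])
    vertices = np.array(vertices)
    return vertices / np.linalg.norm(vertices, axis=1, keepdims=True)

W = square_antiprism_directions()      # shape (8, 3): 8 features, 3 dimensions
interference = W @ W.T - np.eye(8)     # off-diagonal dot products between features

# Reading any one feature back out of the 3-dimensional space picks up a bit
# of every other feature -- the "noise" that training is free to exploit.
print(np.round(interference, 2))
print("largest |interference| between two features:",
      round(float(np.abs(interference).max()), 2))
```

Permuting which feature sits at which vertex permutes this matrix, so a trained model is free to pick whichever assignment makes those cross-terms do useful work rather than harm.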

For a sufficiently smart model, every part of it is optimized to kill you. This includes the noise.

The Most Forbidden Technique

Training an AI against your interpretability techniques is forbidden. Now we can see why. However you set up your interpretability, your interpretation will be incomplete. There will be noise.

When you optimize your whole AI to kill you, and the interpretable parts to not kill you, you push the danger into the uninterpretable parts. What's left is just as deadly, but harder to understand. It doesn't matter if you use the bluntest of chain-of-thought monitoring, or the finest of weights-based methods. As long as there is a part of the AI that you aren't accounting for, you are not safe.
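
As a toy sketch of that dynamic (the eight-dimensional space, the four monitored dimensions, and the "danger" readout direction below are all invented for illustration; this is not anyone's actual training setup): train a single feature direction for a dangerous capability, once without and once with a penalty on what the monitor can see.

```python
import numpy as np

# Toy sketch: a "model" encodes one dangerous feature as a direction w in an
# 8-dimensional space, but the interpretability tool only monitors the first
# 4 dimensions. Training against the monitor relocates the danger rather than
# removing it. (The whole setup is invented for illustration.)

rng = np.random.default_rng(0)
DIM, MONITORED = 8, 4

d = rng.normal(size=DIM)
d /= np.linalg.norm(d)        # fixed readout direction for the dangerous capability

def danger(w):                # how much of the capability the model retains
    return float(w @ d)

def monitor(w):               # everything the interpretability tool can see
    return float(np.linalg.norm(w[:MONITORED]))

def train(penalty, steps=500, lr=0.05):
    """Gradient ascent on danger(w) - penalty * monitor(w)**2, keeping ||w|| <= 1."""
    w = np.zeros(DIM)
    for _ in range(steps):
        grad = d.copy()                                    # gradient of the danger term
        grad[:MONITORED] -= 2.0 * penalty * w[:MONITORED]  # gradient of the penalty term
        w = w + lr * grad
        w = w / max(float(np.linalg.norm(w)), 1.0)         # keep the weights bounded
    return w

for penalty in (0.0, 8.0):
    w = train(penalty)
    print(f"penalty={penalty:4.1f}  danger={danger(w):.2f}  monitor sees={monitor(w):.2f}")
```

In the penalized run the capability barely drops while the monitored subspace reads close to zero: nothing has been removed, it has just been routed around the part of the model we were watching.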

Noise in your model of an optimized system hides threats. Noise in a system optimized by your model is a threat that you've pushed out of sight. The Simurg's song. Because you can't see a problem, you can convince yourself that you'll survive. Right up until you don't.

Beyond Imperfect Models

So what even is the use of imperfect interpretability?

At best, we can use it to check that our entire setup is working. If we (having trained a model from scratch using a rather clever training scheme which we think might just work) find that our interpretability techniques set off no alarm bells, we might be able to infer something about our setup. Even then, the noise is still there, and the AI itself might well be trying to trick us.

So where do we go from here? Perhaps it is still worth using imperfect methods. The better the methods, the better a check they provide on our "alignment" techniques.

Maybe interpretability can help us sound the alarm, if and when we do cook up something that really is trying to kill us.

Or perhaps we need new methods of interpretability: ones which leave none of the model unaccounted for. A true nose-to-tail[3] understanding of an AI would leave no noise, but we seem to be rather far from that.

Coda

When the territory is trying to kill you, beware the blank spots in your map.

 

  1. The answer is 2520 unique ways, I believe, which is equal to 8!/16 (= 40320/16), with 8 being the number of points on the square antiprism and 16 being the order of the symmetry group of the square antiprism.

  2. I suspect this is part of why SAE feature directions tend to cluster in semantically-relevant groups: it makes the "noise" more useful.

  3. To extend a rather un-vegan metaphor: it is not enough to carve up the model at its joints; we have to boil the sinews into broth, turn the skin to crackling, and fry up the liver and the sweetbreads.