Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

As a writing exercise, I'm writing an AI Alignment Hot Take Advent Calendar - one new hot take, written every day for 25 days. Or until I run out of hot takes, which seems likely.

Any approach to building safe transformative AI, or even just auditing possibly-safe TAI, which relies on reverse-engineering neural networks into fine-grained pseudocode based on mechanistic understanding should keep its ambitions very modest.

This hot take is to some extent against ambitious "microscope AI," and to some extent against a more underlying set of intuitions about the form and purpose of interpretability research. (A somewhat related excellent background post is Neel's list of theories of impact for interpretability.) So I should start by explaining what those things are and why they might be appealing.

Webster's Dictionary defines microscope AI as "training systems to do complex tasks, then interpreting how they do it and doing it ourselves." Prima facie, this would help with transformative AI. Suppose we're building some AI that's going to have a lot of power over the world, but we're not sure if it's trustworthy - what if some of its cognition is about how to do things we don't want it to be doing? If we can do microscope AI, we can understand how our first AI is so clever, and build a second AI that's just as clever and that we're sure isn't doing things it shouldn't, like running a search for how best to deceive us.

Microscope-powered auditing is easier - if it's hard to assemble the second AI that does good things and not bad things, how about just checking that the first AI is trustworthy? To check an AI's trustworthiness in this microscope-AI-like framing of the issues, we might want to understand how its cognitive processes work in fine-grained detail, and check that none of those processes are doing bad stuff.

When I say I'm against this, I don't mean auditing is impossible. I mean that it's not going to happen by having humans understand how the AI works in fine-grained detail.

As an analogy, you can figure out how curve detectors work in InceptionV1. Not just in the sense that "oh yeah, that neuron is totally a curve detector," but in terms of how the whole thing works. It's yet more difficult to figure out that other neurons are not curve detectors - typically at this point we fall back on data-based methods like ablating those neurons and then trying to get the network to recognize rainbows, rather than first-principles arguments. But we can more or less figure out that InceptionV1 has an intermediate state where it detects curves, by an understandable algorithm and for human-understandable reasons.

If we wanted to figure out how InceptionV1 tells dogs from cats, we might hope to gradually hack away at the edges - use what we know to expand the circle of knowledge a little more, and then repeat. Use the example of curve detectors to figure out spike detectors. Use spike detectors to figure out fur-texture detectors, and curve detectors to figure out nose-shape detectors. Then we can learn how fur texture and nose shape play into deciding on dog vs. cat. At each step we can use data to test our understanding, but the basic goal is to be able to write down the flow of information between features in a human-comprehensible way. It's not just about giving neurons English-language labels, it's about giving them sensible algorithms where those labels play the expected role.

The biggest problem with this plan is that neural networks leak. Many things are connected to many other things, weakly, in ways that are important for their success. I recently was at a talk that showed how the vast majority of attention heads in a transformer have lots of weak connections - and so the talk was about zooming in on the ones that have strong connections because they're the ones we have the best shot at interpreting.

Another way of putting this is that neural networks are much bigger than your human-understandable model of them, and they're not completely wasting that size. They're full of detail, and the detail crosses multiple scales - there are big obvious information flows that do a lot of the work, and then built on top of those are smaller perturbations that do a little more work, and on top of those perturbations there are even more, even smaller perturbations.
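A toy calculation can make the "detail at multiple scales" picture concrete. The sketch below is not a model of any real network: it assumes a single linear unit with one strong connection and a thousand weak ones, fed independent unit-variance inputs, and every number in it (weight sizes, counts, the 0.5 threshold) is made up for illustration. It asks how much of the output variance a "strong connections only" explanation would capture.

```python
import random


def make_weights(n_weak=1000, strong=1.0, weak_scale=0.02, seed=0):
    """One strong connection plus many weak ones (all values are hypothetical)."""
    rng = random.Random(seed)
    return [strong] + [rng.gauss(0.0, weak_scale) for _ in range(n_weak)]


def variance_explained(weights, keep_threshold):
    """Share of output variance carried by connections with |w| >= keep_threshold.

    Assumes independent, unit-variance inputs, so Var(output) = sum of w**2.
    """
    total = sum(w * w for w in weights)
    kept = sum(w * w for w in weights if abs(w) >= keep_threshold)
    return kept / total


weights = make_weights()
share = variance_explained(weights, keep_threshold=0.5)
print(f"strong connection alone explains {share:.0%} of output variance")
print(f"remaining {1 - share:.0%} is spread over {len(weights) - 1} weak connections")
```

Under these assumptions, each weak connection is individually negligible, yet together they carry a sizable share of what the unit computes. A story built only on the strong connection misses that share.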

I expect our mechanism-level understanding of TAI to be woefully centered on the low-hanging fruit. Sure, we can probably gesture at big obvious stuff it's doing, at least in cases where it's easy to spotlight the behavior we want in a test dataset, but we're not going to run down every single crenelation of the computational landscape and put in the effort to understand it. And what this means for auditing is that we can detect some bad stuff, and we can detect the biggest most obvious bad stuff the best, but there's a microcosmos of computation going on in the small perturbations that we can't be expected to understand, and we can't make any promises about what it's doing.

In terms of InceptionV1 and cats vs. dogs, what this would look like is that we could assign some plausible labels to several high-level features, and make plausible circuits connecting them, but when we try to test this understanding by e.g. swapping our AI's perception of dogs and cats, we'd find that the intervention is brittle, and even when we get to tune some parameters to make our intervention work like we want on a small dataset, there's some information that leaks through that we didn't account for.

What to do in light of this?

  • Keep low expectations for humans deeply understanding what's going on inside a NN. Yes, it's nice to figure things out. Yes, it's an intellectual outrage that we're building all these cool toys and they're basically black boxes. But we shouldn't saddle humans with the burden of having to figure out the whole AI to make sure it's safe.
  • Before we even start a training run, we should try to have actually good abstract arguments about alignment properties of the AI. Interpretability work is easier if you're just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.
  • Some of those abstract arguments should be about whether we can gain confidence in the AI design by testing it on toy problems - it sure would be convenient if we could do this, so we should try to make it possible.
  • There are plenty of interpretability schemes that don't live up to the high standard of generating mechanistic understanding. Such work can still be useful - for example, we might try to use human feedback to modify a NN in a way that's semantically meaningful even if we don't precisely understand what all the weight changes are doing. We need to get our confidence in such schemes from yet more abstract arguments, which should also hopefully help filter out the large number of non-useful interpretability schemes.

Phew. Tomorrow I've gotta find a shorter hot take. I'm open to suggestions.

4 comments

Before we even start a training run, we should try to have *actually good* abstract arguments about alignment properties of the AI. Interpretability work is easier if you're just trying to check details relevant to those arguments, rather than trying to figure out the whole AI.

Thanks for the post! I particularly appreciated this point.

Broadly, I agree with this. We are never going to have a full mechanistic understanding of literally every circuit in a TAI model in time for it to be alignment relevant (we may have fully reverse-engineered some much smaller 'model organisms' by this time though). Nor are individual humans ever going to understand all the details of exactly how such models function (even small models).

However, the arguments for mechanistic interpretability in my view are as follows:

1.) Model capacities probably follow some kind of Pareto principle -- 20% of the circuits do 80% of the work. If we can figure out these circuits in a TAI model then we stand a good chance of catching many alignment-relevant behaviours such as deception, which necessarily require large-scale coordination across the network.

2.) Understanding lots of individual circuits and networks provides a crucial source of empirical bits about network behaviour and alignment at a mechanistic level which we can't get just by theorycrafting about alignment all day. To have a reasonable shot at actually solving alignment we need direct contact with reality and interpretability is one of the main ways to get such contact.

3.) If we can figure out general methods for gaining mechanistic understanding of NN circuits, then we can design automated tools for performing interpretability which substantially reduces the burden on humans. For instance, we might be able to make tools that can rapidly identify the computational substrate of behaviour X, or all parts of the network which might be deceptive, or things like this. This then massively narrows down the search space that humans have to look at to check for safety.

Yeah, I think these are good points. However, I think that #1 is actually misleading. If we measure "work" in loss or in bits, then yes absolutely we can probably figure out the components that reduce loss the most. But lots of very important cognition goes into getting the last 0.01 bits of loss in LLMs, which can have big impacts on the capabilities of the model and the semantics of the outputs. I'm pessimistic on human-understanding based approaches to auditing such low-loss-high-complexity capabilities.
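A small worked calculation shows how a 0.01-bit average gap could hide a large difference. It is an illustration with hypothetical numbers, not a measurement from any model. It assumes two models that agree on almost every token and differ only on a rare slice of "hard" tokens, and it uses the fact that when loss is measured in bits (negative log base 2 of the assigned probability), a gap of g bits on a token corresponds to a factor of 2**g in probability.

```python
def average_gap_bits(rare_fraction, gap_on_rare_bits):
    """Average per-token loss gap when two models differ only on a rare slice of tokens."""
    return rare_fraction * gap_on_rare_bits


def probability_ratio(gap_bits):
    """A loss gap of g bits on a token means a factor of 2**g in assigned probability."""
    return 2.0 ** gap_bits


rare_fraction = 0.001  # hypothetical: 1 token in 1000 is "hard"
gap_on_rare = 10.0     # hypothetical: one model is 10 bits better on those tokens

print("average loss gap (bits):", average_gap_bits(rare_fraction, gap_on_rare))
print("probability ratio on hard tokens:", probability_ratio(gap_on_rare))
```

With these made-up numbers the average gap is about 0.01 bits, while on the rare tokens one model assigns roughly a thousand times more probability to the right answer. Ranking components by how much loss they reduce on average would tend to overlook whatever produces that difference.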

Webster's Dictionary defines microscope AI as "training systems to do complex tasks, then interpreting how they do it and doing it ourselves."

Best as I can tell, this is a confabulation - Webster's Dictionary does not provide that definition.