Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

ARC’s current plan for solving ELK (and maybe also deceptive alignment) involves three major challenges:

  1. Formalizing probabilistic heuristic argument as an operationalization of “explanation”
  2. Finding sufficiently specific explanations for important model behaviors
  3. Checking whether particular instances of a behavior are “because of” a particular explanation

All three of these steps are very difficult, but I have some intuition about why steps #1 and #3 should be possible and I expect we’ll see significant progress over the next six months. Unfortunately, there’s no simple intuitive story for why step #2 should be tractable, so it’s a natural candidate for the main technical risk.

In this post I’ll try to explain why I’m excited about this plan, and why I think that solving steps #1 and #3 would be a big deal, even if step #2 turns out to be extremely challenging.

I’ll argue:

  • Finding explanations is a relatively unambitious interpretability goal. If it is intractable then that’s an important obstacle to interpretability in general.
  • If we formally define “explanations,” then finding them is a well-posed search problem and there is a plausible argument for tractability.
  • If that tractability argument fails then it may indicate a deeper problem for alignment.
  • This plan can still add significant value even if we aren’t able to solve step #2 for arbitrary models.

I. Finding explanations is a relatively unambitious interpretability goal

Our approach requires finding explanations for key model behaviors like “the model often predicts that a smiling human face will appear on camera.” These explanations need to be sufficiently specific that they distinguish (the model actually thinks that a human face is in front of the camera and is predicting how light reflects off of it) from (the model thinks that someone will tamper with the camera so that it shows a picture of a human face).

Our notion of “explanation” is informal, but I expect that most possible approaches to interpretability would yield the kind of explanation we want (if they succeeded at all). As a result, understanding when finding explanations is intractable may also help us understand when interpretability is intractable.

As a simple caricature, suppose that we identify a neuron representing the model’s beliefs about whether there is a person in front of the camera. We then verify experimentally that (i) when this neuron is on it leads to human faces appearing on camera, and (ii) this neuron tends to fire under the conditions where we’d expect a human to be in front of the camera.

I think that finding this neuron is the hard part of explaining the face-generating-behavior. And if this neuron actually captures the model’s beliefs about humans, then it will distinguish (human in front of camera) from (sensors tampered with). So if we can find this neuron, then I think we can find a sufficiently specific explanation of the face-generating-behavior.
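
To make the caricature concrete, here is a toy Python sketch of the two experimental checks. Everything in it is hypothetical: `human_neuron`, `face_on_camera`, and the two-bit "world" are stand-ins invented for illustration, not a real model or ARC's method. It only shows the shape of the argument: an intervention check for (i), a correlation check for (ii), and the observation that the neuron separates two cases that produce the same output.

```python
from itertools import product

def human_neuron(human_present, tampered):
    # Hypothetical internal feature: tracks the human and ignores tampering.
    return 1.0 if human_present else 0.0

def face_on_camera(human_present, tampered, neuron_override=None):
    # Toy "model": a face is shown if the neuron fires or the camera is tampered with.
    if neuron_override is None:
        neuron = human_neuron(human_present, tampered)
    else:
        neuron = neuron_override
    return neuron > 0.5 or tampered

# All four combinations of (human_present, tampered).
worlds = list(product([False, True], repeat=2))

# (i) Intervention: forcing the neuron on always yields a face, and forcing
# it off removes the face whenever there is no tampering.
forcing_on_yields_face = all(
    face_on_camera(h, t, neuron_override=1.0) for h, t in worlds
)
forcing_off_removes_face = all(
    not face_on_camera(h, False, neuron_override=0.0) for h in (False, True)
)

# (ii) Correlation: the neuron fires exactly when a human is present.
neuron_tracks_human = all((human_neuron(h, t) > 0.5) == h for h, t in worlds)

# The output is identical in the honest and tampered cases, but the neuron differs.
honest, tamper = (True, False), (False, True)
same_output = face_on_camera(*honest) == face_on_camera(*tamper)
different_neuron = human_neuron(*honest) != human_neuron(*tamper)
```

In this toy the "explanation" is trivial to find because the feature is handed to us; the hard part described above is locating such a feature inside a real model.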

In reality I don’t expect there to be a “human neuron” that leads to such a simple explanation, but I think the story is the same no matter how complex the representation is. If beliefs about humans are encoded in a direction then both tasks require finding the direction; if they are a nonlinear function of activations then both tasks require understanding that nonlinearity; and so on.

The flipside of the same claim is that ARC’s plan effectively requires interpretability progress. From that perspective, the main way ARC’s research can help is by identifying a possible goal for interpretability. By making a goal precise we may have a better chance of automating it (by applying gradient descent and search, as discussed in section II), and even if we can’t automate it, a clearer sense of the goal could guide experimental or theoretical work on interpretability. But it doesn’t obviate the need for solving some of the same core problems people are working on in mechanistic interpretability.

I say that this is a relatively unambitious goal for interpretability because I think interpretability researchers are often trying to accomplish many other goals. For example, they are often looking for explanations that are small or human-comprehensible. I think “find a human-comprehensible explanation” is likely to be a significantly higher bar than “find any explanation at all.” As an even more extreme example, I think you would have to solve interpretability in a qualitatively different sense in order to “just retarget the search.”

Of course our goal could also end up being more ambitious than traditional goals in interpretability. In particular, it’s not clear that an intuitively valid “explanation” will actually be a formally valid heuristic argument in the sense required by our approach. It seems tough to evaluate that claim precisely without having a better formalization of heuristic argument. But the basic intuition about computational difficulty, as well as the kinds of counterexamples and obstructions I’m thinking about, seem to apply similarly to both kinds of explanation.

Overall, I’m currently tentatively optimistic that (i) likely forms of mechanistic interpretability would suffice for ARC’s plans, and (ii) obstructions to ARC’s plans are likely to translate to analogous obstructions for mechanistic interpretability.

II. Searching for explanations is a well-posed and plausibly tractable search problem

If we have a formal definition of explanation and verifier for explanations, then actually finding explanations is a search problem with an easy-to-compute objective. That doesn’t mean the problem is easy, but it does open up many new angles of attack.

A very simple hope might be that explanations are smaller (i.e. involve fewer parameters) than the model they are trying to explain. For example, given a model like GPT-3 with 175B parameters, and a simple definition of a behavior like “The words ‘happy’ and ‘smile’ are correlated,” we might hope that we can specify a probabilistic heuristic argument for the behavior using at most 175B parameters.

This is much too large for a human to understand, but it’s small enough that we could imagine searching for the explanation in parallel with searching for the model:

  • If we were using a random or exhaustive search, then this implies that finding the explanation would take no longer than finding the model.
  • If we were using a local search, where each iteration involves randomly searching for perturbations to a model that improve the loss, then we would need to make the same argument stepwise — that if you have a good enough argument at step N, and want to find an argument at step N+1, the size of the argument perturbation is no larger than the size of the model perturbation.
  • It is more complicated to analyze something like gradient descent, but if we can handle local search then I think it’s very plausible we can handle gradient descent.

Unfortunately, the claim that “explanations are smaller than models” isn’t quite plausible. For example, consider the simple game of life case. Although the game of life is described by very simple rules, explanations for regularities can involve calculating the properties of complicated sets of cells. The complexity of explanations can be unboundedly larger than the complexity of the underlying physics — the game of life can be expressed in perhaps 200 bits, while a certain correlation might only be explained in terms of the behavior of a particular pattern of 250 cells.

However, in this case there’s a different way that we can find an explanation. Consider the case of gliders as an explanation for A-B patterns. Gliders can only create a large correlation because the model is big enough that gliders often emerge at random. So if you spend the same amount of compute searching for explanations as the model spends simulating random cells, then you can find gliders-as-explanation just as quickly as gliders emerge from the random soup. So although the description complexity of gliders is larger than the description complexity of the game of life itself, such that we can’t hope to find gliders by gradient descent in parallel with learning the model, we can still hope to find them by doing a search which is computationally cheaper than a forward pass of the model.
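
As a concrete illustration of the objects in this example, here is a minimal Python sketch of the game of life and a glider. The names `step`, `run`, `translate`, and `GLIDER` are made up for this sketch, and it says nothing about A-B patterns or about how a heuristic argument would actually be formalized. It only checks the basic fact that a glider-based explanation would lean on: the update rule fits in a few lines, while the glider is a specific five-cell pattern that reappears one cell down and one cell to the right every four generations.

```python
from collections import Counter

def step(live):
    """Advance a set of live (row, col) cells by one generation."""
    # For every cell adjacent to a live cell, count its live neighbors.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    # A cell is alive next generation if it has exactly 3 live neighbors,
    # or if it is currently alive and has exactly 2.
    return {
        cell for cell, n in counts.items()
        if n == 3 or (n == 2 and cell in live)
    }

def run(live, generations):
    """Apply `step` repeatedly."""
    for _ in range(generations):
        live = step(live)
    return live

def translate(live, dr, dc):
    """Shift every live cell by (dr, dc)."""
    return {(r + dr, c + dc) for (r, c) in live}

# The standard glider on an unbounded grid (rows increase downward):
#   . O .
#   . . O
#   O O O
GLIDER = {(0, 1), (1, 2), (2, 0), (2, 1), (2, 2)}

after_four = run(GLIDER, 4)
glider_moves_diagonally = after_four == translate(GLIDER, 1, 1)
```

The rule in `step` is much shorter than any account of what gliders do over many generations, which is the gap between the description complexity of the physics and that of the explanation.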

This discussion elides many complexities, but at a high level I consider the following plausible:

  • If we succeed at formalizing what we mean by explanation, then finding explanations for model behaviors becomes a well-posed search problem.
  • The complexity of finding explanations is bounded by the complexity of finding and running the model itself.
  • So we can efficiently learn explanations in parallel with learning the model-to-be-explained.

Obviously the key conjecture here is the bound on the complexity of finding explanations, and all I’ve really said is that the conjecture looks plausible to me so far — we haven’t yet found clear counterexamples.

III. If this search problem is intractable it may be a much deeper problem for alignment

The feasibility of searching for explanations is closely related to an even more fundamental requirement for alignment.

Consider the distinction between good actions, which the model predicts will keep humans safe, and bad actions, which the model predicts would tamper with sensors in order to make humans appear safe. If keeping humans safe continues to get harder (e.g. as adversarial AI systems become increasingly sophisticated) then we eventually expect bad actions to be more common than good actions. Thus any attempt to select good actions based on a powerful search against predicted consequences needs to be able to distinguish good and bad actions.

Our hope is that if we have an AI which is able to make detailed predictions about the consequences of good and bad actions (including e.g. the dynamics of sensor tampering), then it can also tell the difference between them. In past work I’ve mostly glossed over this assumption because it seems so uncontroversial.

But ultimately this conjecture is very similar to the conjecture from the last section:

  • [Tractability of explanation] We can efficiently find explanations that are specific enough to distinguish good and bad actions
  • [Tractability of discrimination] We can efficiently find a discriminator between good and bad actions.

If tractability of discrimination fails, then we have an even deeper problem than ELK: even if we had perfect labels for arbitrarily complex situations, we still couldn’t learn a reporter that tells you whether the humans are actually safe! It would no longer be correct to describe the problem as “eliciting” the knowledge; the problem is that there is a deep sense in which the model doesn’t even “know” that it’s tampering with the sensors.

(Note: given our approach based on anomaly detection, I’m inclined to generalize both of these conjectures to the case of arbitrary distinctions between “clearly different” mechanisms for a behavior, rather than considering any special features of the particular distinction between good and bad actions. Though if it turns out that these conjectures are false, then we will start looking for additional structure in the good vs bad distinction rather than trying to solve mechanistic anomaly detection in full generality.)

Right now I feel like we have no strong argument that either of these conjectures holds in the worst case, nor do we have compelling counterexamples to either.

So my current focus is on deeply understanding and arguing for tractability of discrimination. If this conjecture is false we have bigger problems, and if we understand why it is true then my intuition is that a very similar argument will more likely than not suggest that explanation is also tractable. See this comment for some discussion of cases where it wasn’t a priori obvious that either explanation or discrimination would be easy (although in each case I ultimately believe it is).

IV. I’m excited about ARC’s plan even if we can’t solve every step for arbitrary models

I’m interested in searching for decisive solutions to alignment, by which I roughly mean: articulating a set of robust assumptions about ML, proving that under those assumptions our solution will have some desirable properties, and convincingly arguing that these desirable properties completely defuse existing reasons to be concerned that AI may deliberately disempower humanity.

I think decisive solutions are plausibly possible and have a big expected impact; I also think that focusing on them is a healthy research approach that will help us iterate more efficiently and do better work. If a decisive solution were clearly impossible, then I think ARC should change how it does and thinks about research, and that would be a major push to pivot to more empirical work.

But despite that, I think that decisive alignment solutions still represent a minority of ARC’s total expected impact, and so it’s worthwhile to talk about how ARC’s plan can help even if we don’t get to that ultimate goal.

If step #2 fails but the rest of our plan works (a big if!), then we could still get a bunch of nice consolation prizes:

  • Algorithms for finding explanations in practice (even if they don’t work in the worst case) and insight that can help guide interpretability research (even if interpretability is impossible in the worst case).
  • Solutions to a whole bunch of other problems in AI safety. Mechanistic interpretability is a big problem, and explaining behavior is a big subset of mechanistic interpretability, but solving “the rest” of the problem still seems like a big deal.
  • A precise goal for interpretability research and a way to measure whether interpretability is succeeding at that goal. This could let us figure out whether we are solving interpretability well enough to be OK in practice, which is useful even if we know there are possible situations where we wouldn’t be OK.
  • Concrete cases in which finding explanations appears to be intractable, or clearer arguments for why finding explanations should be hard. These can help point to the hard core for interpretability research.

One reason you could be skeptical about any of these advantages is if ARC’s research is just hiding the whole hard part of alignment inside the subproblem of “finding explanations.” If we’re just playing a shell game to move around the main difficulty, then we shouldn’t expect anything good to happen from solving the other “problems.”

I think the most robust counterargument (but far from the only counterargument) is that if we succeed at formalizing and using explanations then finding explanations becomes a well-posed search problem. The traditional conception of AI alignment focuses on serious philosophical and conceptual difficulties that get in the way of us even defining what we want. So a reduction to a well-posed problem seems like it addresses some part of the fundamental difficulty, even if it turns out that the search problem is intractable.


ARC plans to spend a significant fraction of our effort looking for algorithms that can automatically explain model behavior (or looking for arguments that it is impossible in general). That activity is likely to be more like 30% of our research than 70% of our research, despite the elevated technical risk.

A major motivation is that it’s way easier to talk about “can you find explanations?” with a better definition of what you mean by “explanation.” Hopefully this post helps explain the remainder of the motivation, and why we think it’s not a disaster that we are spending a lot of time working on steps #1 and #3 without knowing whether step #2 will work.

The main path forward I see on tractability of explanation is to find an argument or counterexample for tractability of discrimination. After that I expect we’ll be in a much better position to assess tractability of explanation.

My current approach to tractability of discrimination is to both (i) search for potential cases where discrimination is hard, and (ii) try to figure out whether we can automatically do discrimination in existing examples, e.g. whether we can mechanically turn a test for probable primes into a probabilistic test for primes.
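
As a reference point for the primes example, here is a short Python sketch contrasting two standard tests. The pairing is an assumption made for illustration: the post does not say which tests it has in mind, and this is not a proposal for how to do the conversion mechanically. `fermat_probable_prime` is a Fermat test to a single base, which every Carmichael number such as 561 = 3 × 11 × 17 passes for all bases coprime to it. `strong_probable_prime` is one round of Miller-Rabin, and `is_probably_prime` repeats it with random bases, which rejects any odd composite with probability at least 3/4 per round.

```python
import random
from math import gcd

def fermat_probable_prime(n, base):
    """True if n passes the Fermat test: base^(n-1) == 1 (mod n)."""
    return pow(base, n - 1, n) == 1

def strong_probable_prime(n, base):
    """One Miller-Rabin round for odd n > 2 and 1 < base < n - 1."""
    # Write n - 1 = d * 2^s with d odd.
    d, s = n - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    x = pow(base, d, n)
    if x == 1 or x == n - 1:
        return True
    # Square up to s - 1 times, looking for -1 (mod n).
    for _ in range(s - 1):
        x = pow(x, 2, n)
        if x == n - 1:
            return True
    return False

def is_probably_prime(n, rounds=20, seed=0):
    """Probabilistic primality test built from repeated Miller-Rabin rounds."""
    if n < 2:
        return False
    if n in (2, 3):
        return True
    if n % 2 == 0:
        return False
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return all(
        strong_probable_prime(n, rng.randrange(2, n - 1))
        for _ in range(rounds)
    )

# 561 passes the Fermat test to every base coprime to it...
carmichael_fools_fermat = all(
    fermat_probable_prime(561, a) for a in range(2, 561) if gcd(a, 561) == 1
)
# ...but already fails a single Miller-Rabin round to base 2.
base_two_is_witness = not strong_probable_prime(561, 2)
```

The open question in the text is whether a step like the one from the first function to the last can be found automatically rather than by human insight.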


You are talking about a "verifier for explanations". I don't know how an explanation could be verified under constructivist epistemology and pragmatist meta-epistemology.

I've recently thought about the relationship between GFlowNets and constructivism. Here're some excerpts, unedited, but hopefully could be helpful in some way to someone.

It’s interesting that GFlowNets suggest constructing a trajectory, i.e., an explanation, rather than sampling it via Markov Chain Monte Carlo (MCMC) methods (e.g. Monte-Carlo Tree Search), as suggested in the current ActInf-based AI architectures (Fountas et al. 2020). This can either be an explanation for oneself, justifying the most proximate action to take (as in Active Inference: an agent creates a plan, takes a first step (action) from it, and then creates a new plan), or an explanation for others, explaining some actions that have already been made (P_B(\tau|x) in GFlowNets).

So, it seems that GFlowNets fit Deutsch’s account of creative explanations rather well. Deutsch (together with Pearl) contrasts constructed, subjective, individualised explanations with empiricism, Bayesianism (in the sense of associational reasoning, “rung one of the causality ladder”, per Pearl) and “counting averages and population stats”, which seem to correspond to MCMC methods. Bengio also explains why GFlowNets are statistically superior to variational (Bayesian) inference methods (though I don’t understand what he means by “mode-following”, “mean-following”, and “high variance gradients”):

The typical variational inference objective (the ELBO or reverse-KL) leads to mode-following (focussing on one mode) and the forward KL leads to mean-following (overly conservative, sampling too broadly) and annoying variance when implemented with importance sampling. Instead the off-policy GFlowNet objectives (e.g., with a tempered version of P_F as training policy) seem to strike a different balance and tend to recover more of the modes without the down-side of the forward-KL variational inference variants (mean-following and high variance gradients).

With regard to this idea of “construction of explanations”, it’s tempting to try to find some parallels with Core Constructive Ontology, Baez and Stay’s ideas about system construction, and Deutsch and Marletto’s Constructor Theory, and, together with GFlowNets, to call these trends in epistemology, ontology, and philosophy of language “the constructive turn”, by analogy with the pragmatic turn in philosophy.

GFlowNets don't assume a static causal graph, but stochastically construct one from Bayesian posterior (given the past evidence) in the space of all possible causal graphs.

GFlowNets seem to relate more to language generation and the contents of consciousness: humans do seem to generate these “randomly”. Humans come up with “stochastic” justifications for past events and actions when these are not stabilised in their heads in reference frames.

Another thing that Bengio suggests, “hypergraph sampling”, doesn’t feel neurobiologically plausible (or not?), but it’s not clear whether Bengio suggests it as a neurobiological explanation or as an architecture for intelligence (in which case the fact that humans’ causal graphs are simple graphs, rather than hypergraphs, is our inductive prior).

Extending Bengio’s idea of sampling from the Bayesian posterior over the space of causal graphs in linguistic explanations, constructing explanatory theories in service of constructing an action trajectory to achieve a certain goal (the pragmatic stance) from the Bayesian posterior over the space of all possible theories (not only causal graphs, but also any other sorts of theories, from formal theories written as closed-form equations, attached to executable diagrams, to as unreliable “theories” as induction rules, heuristics, and intuitions) corresponds to epistemological pluralism, which, John Krakauer thinks, is also how the brain actually works: “There is pluralism in how the nervous system sees the control problem and the representations it uses, and that is the ontological truth of pluralism, and there is a mapping onto epistemological pluralism. This may well be the reason why we have psychology and neuroscience, and we have psychiatry and neurology.”

The selection of the explanatory stance (the perspective, the level of emergence) would be one of the core steps in constructing an explanatory theory for a pragmatic purpose. For instance, we can call either a psychiatrist or a neurologist if we have a goal of diagnosing and then curing the illness in the patient. So, there couldn’t be one “right” perspective on any object. Instead, any intelligent agent always chooses the perspective most suitable (that is, minimising the expected free energy) for reaching a particular goal. There is also a normative imperative to improve the quality of these choices (which can amount to training in selecting from a set of coherent theories which already exist as well as attempts to create and criticise new coherent explanatory theories).

To clarify: by a "verifier for explanations" I mostly mean something like a heuristic estimator as introduced in Formalizing the Presumption of Independence (or else something even further from formality that would fill a similar role).

I think that adding new types of systems and agents to the universe changes the optimal "applied ethics" in the situation (I wrote about this here, in the "PS.", the last paragraph), so the only hope is for the discriminator to be 1) a general intelligence; 2) using a scale-free, naturalistic theory of ethics as a theoretical discipline for evaluating any applied ethics theories in any situations and contexts.

Also, hopefully, the "least wrong" scale-free ethics is "aligned" with humans, in the sense that it "saves" us from oblivion. For example, a version of theoretical scale-free ethics could just favour increasing the amount of consciousness in the universe while destroying as little existing consciousness as possible. Let's say something like IIT is right. So the best plan that AGI should come up with is to engineer some mind upload scheme for humans and integrate our consciousnesses together, to form a single planetary-scale mega-consciousness (which must be more ethically valuable precisely because of the integration; by the same token, a brain is more ethically valuable than all neurons when they are isolated).