Thanks to Erik Jenner for helpful comments and discussion

(Epistemic Status: Tentative take on how to think about heuristic estimation and surprise accounting in the context of activation modeling and causal scrubbing. Should not be taken as authoritative/accurate representation of how ARC thinks about heuristic estimation)

I'm pretty excited about ARC's heuristic arguments agenda. The general idea that "formal(ish) arguments" about properties of a neural network could help improve safety has intuitive appeal.

ARC has done a good job recently communicating the high-level intuitions, goals, and subproblems of the heuristic arguments agenda. Their most recent post "A bird's eye view of ARC's agenda" lays out their general vision of how heuristic arguments would help with alignment, contextualizes their research output to date, and enumerates open subproblems.

However, I still find it difficult to synthesize their various lines of work. I find myself asking questions like: Is activation modeling actually just cumulant propagation? How would we do surprise accounting on activation models? Is causal scrubbing a heuristic estimation method? How could singular learning theory contribute? What is "capacity allocation" anyway?

This post aims to partially resolve (or at least clarify) these confusions.  I 

  • define three notions of surprise (description length, information gain, and computational cost)
  • map activation modeling and causal scrubbing to the heuristic estimation formalism, and define methods for surprise accounting on each
  • compare (variants of) activation modeling and causal scrubbing according to computational cost
  • describe three theories of impact for work on heuristic estimation

In the appendix, I motivate the three notions of surprise in the context of boolean circuits, explore choosing a graph basis for causal scrubbing, connect "negative components" in mech-interp to cherry-picking in heuristic estimation, and gesture towards more "weight-based" notions of heuristic estimation.

Background

Heuristic Estimation

Using the notation from here, a heuristic estimator is an algorithm $G$ that takes as input a mathematical expression $Y$ and arguments $\pi_1, \ldots, \pi_m$, and outputs an estimate $G(Y \mid \pi_1, \ldots, \pi_m)$ of $Y$ given the arguments. The heuristic estimator makes some presumption(s) of independence, and arguments provide "correlational" information that updates away from the naive estimate $G(Y)$.

Surprise Accounting

ARC suggests surprise accounting as an objective for searching for heuristic arguments. Given an argument space $\Pi$, find an argument $\pi \in \Pi$ that minimizes:

the surprise of the argument + the surprise of the estimate (given the argument)

The surprise of the estimate should capture something like "the information required to encode the target, given the target estimate". Ideally, the estimator outputs a distribution over the target, and we can compute sample cross entropy or KL divergence. However, important details remain unclear (how to handle point estimates of outputs, metrics which do not track information, and aggregating surprise over continuous vs. discrete, finitely supported distributions, to name a few).

The surprise of the argument is more complicated. ARC's post on surprise accounting and related works admit multiple plausible notions: description length, information gain, and computational cost. 

Mechanistic Anomaly Detection

Mechanistic anomaly detection is one of the central target applications of heuristic arguments, offering (part of) a possible solution to both eliciting latent knowledge and deceptive alignment. Again using previously introduced notation, let $f$ be a neural network trained on a dataset $D$ of inputs $x$ using the loss function/metric $\mathcal{L}$. Assume the network achieves low loss, i.e. $\mathbb{E}_{x \sim D}[\mathcal{L}(f(x))]$ is small. We determine whether a new input $x^*$ is an anomaly by

  1. Finding a set of arguments $\pi$ such that the heuristic estimate of the expected loss, $G(\mathbb{E}_{x \sim D}[\mathcal{L}(f(x))] \mid \pi)$, is low.
  2. Using the same set of arguments, computing the heuristic estimate of the loss on the new input, $G(\mathcal{L}(f(x^*)) \mid \pi)$. If the heuristic estimate is high, flag the new input as an anomaly.

Note that mechanistic anomaly detection requires the heuristic estimator to operate both on the expected loss and on the loss of individual instances, which, as we'll see, is a non-trivial property.
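As a concrete (toy) sketch of this two-step procedure, assuming a hypothetical `estimate` function standing in for a per-instance heuristic estimator that shares arguments with the expected-loss estimator:

```python
import numpy as np

def detect_anomaly(estimate, train_inputs, new_input, threshold):
    """Flag new_input as anomalous if its per-instance heuristic loss
    estimate is much higher than on the trusted training inputs."""
    # Step 1 (done offline in practice): arguments were already chosen so
    # that the average estimated loss on the training distribution is low.
    baseline = np.mean([estimate(x) for x in train_inputs])
    # Step 2: estimate the loss on the new input with the same arguments.
    return estimate(new_input) > baseline + threshold

# Toy stand-in estimator: low estimated loss near 0, high otherwise.
toy_estimate = lambda x: float(abs(x) > 3)
print(detect_anomaly(toy_estimate, [0.1, -0.5, 0.3], 5.0, threshold=0.5))  # True
```

The interesting work, of course, is hidden inside `estimate`; the rest of the post is about how such an estimator could be constructed.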

Example: Covariance Propagation on Arithmetic Circuits

(Note: We assume as background the first four sections of Appendix D in Formalizing the Presumption of Independence)

We start with the example of sparse-covariance propagation, a heuristic estimation algorithm for estimating the expected output of arithmetic circuits from ARC's first paper on heuristic arguments. We use sparse-covariance propagation to establish the four components of heuristic estimation for mechanistic anomaly detection: 

  • naive estimation with presumption(s) of independence
  • an argument space
  • surprise accounting
  • estimation on individual instances 

We introduce methods for surprise accounting on covariance propagation not explicitly described in prior work, and sketch out how "Wick propagation" might extend covariance propagation to individual instances, though note that both of these contributions are novel and could be confused.

Setup

Suppose we have an arithmetic circuit $C$, with standard normal inputs $x_1, \ldots, x_n \sim \mathcal{N}(0, 1)$, arithmetic nodes representing either addition or multiplication, and wire values $z_1, \ldots, z_m$. We want to estimate the expected value of the output $\mathbb{E}[C(x)]$.

Naive Estimator with Presumption of Independence 

In mean propagation, we assume all wires in the circuit are independent. For addition nodes, this makes no difference ($\mathbb{E}[x + y] = \mathbb{E}[x] + \mathbb{E}[y]$). For multiplication nodes though, independence implies $\mathbb{E}[xy] = \mathbb{E}[x]\,\mathbb{E}[y]$, since the covariance term is 0. We use this presumption of independence to propagate means to the output of the circuit:

Mean Propagation (figure from Formalizing the Presumption of Independence)

Mean propagation will act as our "naive" estimator: given no addition arguments, we assume all nodes are independent, and propagate means to compute the expected output of a circuit. 
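A minimal sketch of mean propagation, assuming a hypothetical list-of-tuples circuit representation (the names `propagate_means`, `z1`, etc. are illustrative, not from the original paper):

```python
# Mean propagation: presume all wires independent, so E[x*y] = E[x]E[y]
# and E[x+y] = E[x] + E[y]; propagate means from inputs to output.
def propagate_means(circuit, input_means):
    means = dict(input_means)  # wire name -> propagated mean
    for out, (op, a, b) in circuit:
        if op == "+":
            means[out] = means[a] + means[b]
        else:  # "*": the presumption of independence drops the covariance
            means[out] = means[a] * means[b]
    return means

# Toy circuit with standard normal inputs: z1 = x1 + x2, z2 = x1 * x1
circuit = [("z1", ("+", "x1", "x2")),
           ("z2", ("*", "x1", "x1")),
           ("out", ("*", "z1", "z2"))]
means = propagate_means(circuit, {"x1": 0.0, "x2": 0.0})
print(means["out"])  # 0.0 -- note the naive estimate misses E[x1*x1] = 1
```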

While estimating the mean of the output is sufficient for estimation purposes, for surprise accounting we'll want to produce distributions over each wire in the circuit. To do so, we propagate the variance (not covariance) of nodes with the following rules:

For $z = x + y$: $\operatorname{Var}(z) = \operatorname{Var}(x) + \operatorname{Var}(y)$

For $z = x \cdot y$: $\operatorname{Var}(z) = \operatorname{Var}(x)\operatorname{Var}(y) + \operatorname{Var}(x)\,\mathbb{E}[y]^2 + \operatorname{Var}(y)\,\mathbb{E}[x]^2$

Performing variance propagation on the above example, we have

Variance (not covariance) propagation

Assuming wires are distributed according to a joint Gaussian, propagating the variance yields the joint distribution over wires $\mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, with $\mu$ and $\sigma^2$ the (vector-valued) propagated mean and variance.

Argument Space

To improve our estimate of the expected output, we selectively violate the presumption of independence, tracking particular covariances between nodes and incorporating these covariances into our propagated mean and (co)variance computations. Concretely, let $\mathcal{A}$ be the space of boolean matrices over each pair of wires in the circuit[1]. Then, for a given argument $a \in \mathcal{A}$, we include the covariance $\operatorname{Cov}(z_i, z_j)$ if $a_{ij} = 1$. Propagating means and covariances as before yields a distribution $\mathcal{N}(\mu_a, \Sigma_a)$. Again note that the propagated output mean is the only quantity that matters for estimating the output, but, as we'll see in the next section, we use the entire joint distribution to compute the surprise of $a$.
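To see why tracking a single covariance can matter, here is a toy sketch (with a hypothetical `product_mean` helper) of the multiplication update with and without the covariance term:

```python
import numpy as np

# Multiplication update with an argument bit: include Cov(x, y) in
# E[x*y] = E[x]E[y] + Cov(x, y) only when the argument tracks this pair.
def product_mean(mu_x, mu_y, cov_xy, track):
    return mu_x * mu_y + (cov_xy if track else 0.0)

# Wire x is standard normal and multiplied by itself: true E[x*x] = 1.
naive = product_mean(0.0, 0.0, cov_xy=1.0, track=False)         # 0.0
with_argument = product_mean(0.0, 0.0, cov_xy=1.0, track=True)  # 1.0

# Monte-Carlo check of the ground truth:
samples = np.random.default_rng(0).standard_normal(100_000)
print(naive, with_argument, (samples * samples).mean())
```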

Surprise Accounting 

Recall that for surprise accounting, we want to compute 

the surprise of the argument + the surprise of the estimate (given the argument)

In information theory, surprise is defined as the negative log probability of an event: $S(\text{event}) = -\log p(\text{event})$.

Applying the laws of probability, we can easily reconstruct the surprise objective as minimizing the joint surprise of the target $y$ and the arguments $\pi$: $-\log p(y, \pi) = -\log p(y \mid \pi) - \log p(\pi)$.

However, given that we typically do not have access to a direct prior on the arguments, we instead appeal to broader notions of surprise and information. To reduce ambiguity, we denote surprise with $S$:

Surprise of the Estimate

Assuming we have a distribution $p_a$ over the output and can sample ground truth targets $y_1, \ldots, y_N$, in some sense the most natural measure of the surprise of the estimate is the empirical cross entropy loss: $S(\text{estimate} \mid a) = -\frac{1}{N}\sum_{i=1}^{N} \log p_a(y_i)$

However, this definition is unsatisfying for multiple reasons. First, it does not specify how many samples to use. Second, it is inconsistent with ARC's original surprise example, where the surprise of the estimate is summed over the entire (finite) input distribution; it is unclear how to extend a similar procedure to the continuous (uncountably infinite) case. Third, it is unclear how to compute the surprise of a point estimate, rather than a distribution.
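A sketch of this empirical cross-entropy computation for a univariate Gaussian estimate (the `estimate_surprise` helper and the sample count are illustrative assumptions, including the unresolved choice of N):

```python
import numpy as np

# Surprise of the estimate: empirical cross-entropy of the estimated
# output distribution N(mu, sigma^2) against Monte-Carlo samples of the
# true output distribution.
def estimate_surprise(mu, sigma, samples):
    logpdf = -0.5 * np.log(2 * np.pi * sigma**2) - (samples - mu) ** 2 / (2 * sigma**2)
    return -np.mean(logpdf)

rng = np.random.default_rng(0)
true_samples = rng.standard_normal(10_000)        # true output ~ N(0, 1)
good = estimate_surprise(0.0, 1.0, true_samples)  # near the entropy of N(0,1)
bad = estimate_surprise(5.0, 1.0, true_samples)   # badly mis-centered estimate
print(good < bad)  # True: the mis-centered estimate is more surprising
```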

Description Length

To compute the surprise (information content) of the argument $a$, we might appeal to description length: that is, computing the number of bits required to encode $a$, relative to the empty explanation. For binary arguments (like in covariance propagation), this corresponds to $\|a\|_0$ (the number of tracked covariances)[2].

Information Gain

Note, though, that our arguments $a$ "point out" information about latent variables (in the case of covariance propagation, wire values $z$) that in some sense was not available to the naive estimator. Thus we might measure the surprise of $a$ as the information gained about the latents relative to the naive distribution. Formally, we estimate the surprise of the arguments as the expected log-ratio of the argument distribution $p_a$ and the naive distribution $p_0$, again estimated with samples of $x$: $S(a) = \mathbb{E}_x\!\left[\log \frac{p_a(z(x))}{p_0(z(x))}\right]$

Information gain has some intuitive appeal, and indeed a method along these lines is proposed at the end of this report. Further, in our appendix Surprise Accounting on Boolean Circuits, we show that information gain and description length are equivalent in the boolean circuit example from the original surprise accounting post. However, after many failed attempts to formalize the relationship between the information/description length of $a$ and expected information gain[3], I'm skeptical (and very confused) about whether information gain is a reasonable surprise metric. That caveat aside, much of the remainder of this post builds off information gain as a surprise metric, so we will for now assume its validity.
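Caveats aside, a sketch of the information-gain computation for a one-dimensional Gaussian wire (helper names are illustrative; note that when the samples come from the argument distribution itself, this reduces to a Monte-Carlo KL estimate):

```python
import numpy as np

def gaussian_logpdf(z, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (z - mu) ** 2 / var)

# Information gain of an argument: expected log-ratio of the
# argument-induced density p_a to the naive density p_0 on the latents,
# estimated with samples of the true wire values.
def information_gain(z_samples, mu_a, var_a, mu_0, var_0):
    return np.mean(gaussian_logpdf(z_samples, mu_a, var_a)
                   - gaussian_logpdf(z_samples, mu_0, var_0))

rng = np.random.default_rng(0)
z = rng.normal(1.0, 1.0, 10_000)  # true wire distribution N(1, 1)
# The argument-induced model matches the truth; the naive model misses the mean.
gain = information_gain(z, 1.0, 1.0, 0.0, 1.0)
print(gain)  # close to KL(N(1,1) || N(0,1)) = 0.5
```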

Computational Cost

Thus far we have ignored the information content of the estimator itself, only measuring the surprise of the estimate and the arguments. But the estimator itself can contain information too, and beyond our presumptions of independence, there's no principled way of separating the estimator from the arguments to the estimator.

For example, in sparse covariance propagation, our estimator starts with access to the full circuit and the distribution of the inputs, and assumes each node is independent. But the amount of surprise is relative to our definition of the estimator. We could instead parameterize the argument space as a single boolean: whether to run mean propagation or full covariance propagation. The storage cost of the equivalent argument on the original estimator would be one bit per pair of wires, compared to 1 bit for the new estimator. Similarly, we could presume the independence of all but one pair of nodes, such that the relative information gain only tracks the difference caused by tracking the covariance of this one pair.

However, we can potentially resolve these concerns by considering the computational cost of running the estimator with a given set of arguments. In sparse covariance propagation, every tracked covariance adds terms to summations, so computational cost increases monotonically with the number of tracked covariances. This metric is independent of the construction of the argument space, and also includes the "fixed" cost of the default estimator, making it useful for comparisons across different estimators.

We also note that FLOPs are used as a metric for argument length in Compact Proofs of Model Performance via Mechanistic Interpretability, and as a general budget in Estimating the Probabilities of Rare Outputs in Language Models, indicating that ARC and collaborators think computational cost is closely related to the surprise of arguments.

Picking a Metric

In practice, we will typically fix the estimator $G$ and search for arguments $a$ that minimize surprise. On this view, the description length and information gain definitions of surprise will be used to optimize arguments, while the computational cost definition will be useful for comparing different estimation methods (we perform such a comparison in Inter-Method Surprise Accounting). Ultimately, I think there's hope of unifying these three notions, pulling on the connections between algorithmic and statistical information theory.

Estimation on Instances

(Note - this section is inferred from a note on Wick Propagation at the end of the Max of k Heuristic Estimator report and Redwood's work on Generalized Wick Decomposition, and could be wrong/confused - I'm eager to get feedback)

Supposing we find an argument $a^*$ that minimizes (some notion of) surprise, how would we produce an estimate on a single instance $x$?

When estimating the distribution, we applied update rules presuming variables were independent (taking covariances to be 0), unless our argument specified that the covariance should be tracked. To apply the same arguments to instances, we need to define similar update rules, with corresponding terms tracking the analogue of covariance.

For example, consider a node $z = xy$. To compute the expectation $\mathbb{E}[z]$, we decompose the expectation of a product as the sum of the product of expectations and the covariance. We apply our argument as a "weight" in this decomposition, yielding: $\mathbb{E}[z] = \mathbb{E}[x]\,\mathbb{E}[y] + a_{xy}\operatorname{Cov}(x, y)$

To produce an analogous decomposition for the instance value $z = xy$ (rather than its expectation), we appeal to the Wick decomposition:

$xy = \mathbb{E}[x]\,\mathbb{E}[y] + a_{xy}\operatorname{Cov}(x, y) + \mathbb{E}[y]\,\mathrm{Wick}(x) + \mathbb{E}[x]\,\mathrm{Wick}(y) + \mathrm{Wick}(x, y)$

with the covariance weight defined according to our argument and $\mathrm{Wick}(\cdot)$ the Wick product. Note that the Wick product is recursively defined, with the Wick product of a node expanding into Wick products of its inputs. We call this procedure Wick Propagation (I think?).

I still don't have great intuitions about Wick products and Wick propagation, and as stated at the beginning of this section, the details here could be off. The important point is that to perform mechanistic anomaly detection, our estimator must be able to operate on individual instances, which can require non-trivial extensions to the original definition of the estimator.

Recap: Ingredients of Heuristic Estimators 

We have now defined all the elements for applying heuristic estimation to mechanistic anomaly detection. Given a quantity to estimate, we construct a naive estimator based on a presumption of independence. Next, we define an argument space that selectively "violates" the presumption of independence to improve our estimate. To select arguments, we define a surprise metric, capturing the surprise of the arguments and the surprise of the estimate given the arguments. Using the surprise metric (and Monte-Carlo samples from the ground truth distribution) we search for arguments that minimize the total surprise. To apply our arguments to mechanistic anomaly detection, we define a variant of our estimator that takes in the same arguments but estimates the target quantity on individual instances. 

Equipped with the basic building blocks of heuristic estimation, we can now explore and compare various approaches to heuristic estimation on deep neural networks. 

Activation Modeling 

Activation modeling is an approach to heuristic estimation on neural networks that aims to produce a probability distribution on model activations while faithfully explaining some target expression (e.g. expected output). In what follows, we first briefly recap Layer-By-Layer activation modeling and connect it to heuristic estimation with a proposed method for surprise accounting. Next, we analyze problems with layer-by-layer activation modeling, making connections to mechanistic interpretability and various subproblems of heuristic arguments in general. 

Heuristic Estimation with Layer-by-Layer Activation Modeling

(Note: We strongly recommend reading at least the A Possible Approach for Estimating Tail Risks section of Estimating Tail Risks in Neural Networks). 

Layer-by-Layer activation modeling is essentially covariance propagation (when taking each distribution to be a multivariate Gaussian) applied to neural networks. Roughly following the original notation, we have a neural network which can be expressed as a composition of $L$ functions $f = f_L \circ \cdots \circ f_1$ (where $\mathcal{A}_i$ denotes the domain of $f_i$ and we let $\mathcal{A}_{L+1}$ denote the output domain). In general, to produce an output distribution $p_{L+1}$, we successively fit distributions $p_{i+1}$ over $\mathcal{A}_{i+1}$ to the distribution produced by applying the next layer transformation $f_i$ to the prior distribution $p_i$.

Note that in its "strongest" form, activation modeling only requires a pre-specified distribution on the input, and fits each layer in roughly constant time. In later sections we'll relax both properties, leading to a distinction/spectrum between "analytic" and "empirical" activation modeling.

Below, we analyze the multivariate gaussian layer-by-layer activation modeling presented in the original post, defining a naive estimator, argument space, surprise metric, and estimator on instances. 

Setup

Taking each distribution $p_i$ to be multivariate Gaussian $\mathcal{N}(\mu_i, \Sigma_i)$, we fit the next layer by taking $\mu_{i+1}, \Sigma_{i+1}$ to be the (sample) mean and covariance of $f_i$ applied to the previous layer's distribution, which, by the moment matching theorem, minimizes the KL divergence to the pushed-forward distribution[4].
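A sample-based sketch of this moment-matching step (the post's analytic version would compute these moments in closed form; `fit_next_layer` and the ReLU toy layer are illustrative assumptions):

```python
import numpy as np

# One moment-matching step: push samples from the current Gaussian through
# the next layer f, then fit the next Gaussian with the sample mean and
# covariance (among Gaussians, matching moments minimizes the forward KL
# to the pushed-forward distribution).
def fit_next_layer(f, mu, cov, n_samples=50_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(mu, cov, size=n_samples)
    out = f(z)
    return out.mean(axis=0), np.cov(out, rowvar=False)

# Toy "layer": an elementwise ReLU on 2-d activations.
relu = lambda z: np.maximum(z, 0.0)
mu_next, cov_next = fit_next_layer(relu, np.zeros(2), np.eye(2))
print(mu_next)  # each coordinate near 1/sqrt(2*pi) ~= 0.399
```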

Naive Estimator with Presumption of Independence

We presume independence between each pair of activations by setting all covariance terms to 0. We express the naive distribution at each layer as $\mathcal{N}(\mu_i, \operatorname{diag}(\sigma_i^2))$, with $\mu_i$ and $\sigma_i^2$ estimated from the (masked) distribution of the previous layer.

Argument Space

Arguments are boolean masks over each layer's covariance matrix (with the naive argument initialized as the identity to allow for variance propagation). We denote the distribution at a given layer induced by an argument $a$ as $p_i^a$.

Surprise Accounting

As before, defining the surprise of the estimate given an argument is straightforward, but defining the surprise of the argument is still contingent on the chosen notion of surprise. 

For computing the surprise of the estimate, we compute the cross entropy of the output distribution relative to Monte-Carlo samples of the true output distribution:

For computing the surprise of the argument, we can again appeal to  description length or information gain:

Beyond both these definitions, we should also consider the computational cost of computing the estimate. In layer-by-layer activation modeling, cost scales roughly linearly in $L$ (the number of layers), treating each propagation as constant. But as we explore more sampling-based approaches (or "empirical" activation modeling), we'll need to introduce the number of samples $N$ into our estimate.

Estimation on Instances

To run the activation model on a single instance $x$, we could (I think) use Wick Propagation (but given my uncertainty on how to define Wick propagation in general, I won't try to extend the definition here). Alternatively, instead of using the estimate of the loss, we could leverage the distribution over activations induced by our arguments, taking the log probability of the instance's activations as an anomaly score[5]. Note that for an arbitrarily parameterized distribution this approach exactly recovers the (standard) anomaly detection method of fitting multivariate Gaussians to activations. What makes the approach "heuristic arguments flavored" is the analytic propagation of selected parts of distributions.
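A sketch of this alternative anomaly score, computing the negative log probability of an instance's activations under a fitted Gaussian (all names and the 2-d toy parameters are illustrative):

```python
import numpy as np

# Anomaly score for a single instance: negative log probability of its
# activations under the Gaussian activation model N(mu, cov).
def anomaly_score(activation, mu, cov):
    d = activation - mu
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (len(mu) * np.log(2 * np.pi) + logdet
                  + d @ np.linalg.inv(cov) @ d)

mu, cov = np.zeros(2), np.eye(2)
typical = anomaly_score(np.array([0.1, -0.2]), mu, cov)
anomalous = anomaly_score(np.array([6.0, -5.0]), mu, cov)
print(typical < anomalous)  # True: off-distribution activations score higher
```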

In the following section, we explore a broader space of approaches which diverge from this kind of heuristic-arguments platonic ideal, but in doing so address potential problems with layer-by-layer analytic activation modeling.

Empirical Activation Modeling: Beyond Analytic Propagation

A central concern with layer-by-layer activation modeling is that it forgets too much information. Restating the example given in ARC's post, suppose we have the following boolean network: 

with $f$ a pseudo-random function which can't be captured by the activation model. The output of the function is always 1, but we will model it as a uniformly random bit. The basic problem here is that we have no way to track that the same function $f$ is being applied by multiple layers, causing the approximation errors to be highly correlated.

Joint Activation Modeling

ARC's suggested remedy is to abandon layer-by-layer modeling in favor of a "joint" activation model, which e.g. could capture