Thanks to Erik Jenner for helpful comments and discussion
(Epistemic Status: Tentative take on how to think about heuristic estimation and surprise accounting in the context of activation modeling and causal scrubbing. Should not be taken as authoritative/accurate representation of how ARC thinks about heuristic estimation)
I'm pretty excited about ARC's heuristic arguments agenda. The general idea that "formal(ish) arguments" about properties of a neural network could help improve safety has intuitive appeal.
ARC has recently done a good job communicating the high-level intuitions, goals, and subproblems of the heuristic arguments agenda. Their most recent post "A bird's eye view of ARC's agenda" lays out their general vision of how heuristic arguments would help with alignment, contextualizes their research output to date, and enumerates open subproblems.
However, I still find it difficult to synthesize their various lines of work. I find myself asking questions like: Is activation modeling actually just cumulant propagation? How would we do surprise accounting on activation models? Is causal scrubbing a heuristic estimation method? How could singular learning theory contribute? And what is "capacity allocation" anyway?
This post aims to partially resolve (or at least clarify) these confusions. I:
- define three notions of surprise (description length, information gain, and computational cost),
- map activation modeling and causal scrubbing to the heuristic estimation formalism, and define methods for surprise accounting on each,
- compare (variants of) activation modeling and causal scrubbing according to computational cost, and
- describe three theories of impact for work on heuristic estimation.
In the appendix, I motivate the three notions of surprise in the context of boolean circuits, explore choosing a graph basis for causal scrubbing, connect "negative components" in mech interp to cherry-picking in heuristic estimation, and gesture towards more "weight-based" notions of heuristic estimation.
Background
Heuristic Estimation
Using the notation from here, a heuristic estimator is an algorithm G that takes as input a mathematical expression Y and arguments π1...πm, and outputs an estimate of Y given arguments π1...πm: G(Y|π1...πm). The heuristic estimator makes some presumption(s) of independence, and arguments provide "correlational" information that update away from the naive estimate.
Surprise Accounting
ARC suggests surprise accounting as an objective for searching for heuristic arguments. Given an argument space Π, find an argument π that minimizes:
the surprise of the argument + the surprise of the estimate (given the argument)
The surprise of the estimate should capture something like "the information required to encode the target, given the target estimate". Ideally, the estimator outputs a distribution over the target, and we can compute a sample cross entropy or KL divergence. However, important details remain unclear (how to handle point estimates of outputs, metrics that do not track information, and aggregating surprise over continuous versus discrete, finitely supported distributions, to name a few).
The surprise of the argument is more complicated. ARC's post on surprise accounting and related works admit multiple plausible notions: description length, information gain, and computational cost.
Mechanistic Anomaly Detection
Mechanistic anomaly detection is one of the central target applications of heuristic arguments, offering (part of) a possible solution to both eliciting latent knowledge and deceptive alignment. Again using previously introduced notation, let M be a neural network trained on a dataset D of inputs x using the loss function/metric L(x,M(x)). Assume the network achieves low loss, i.e. Ex∼D[L(x,M(x))] is small. We determine whether a new input x∗ is an anomaly by:
1. Finding a set of arguments π=π1,...,πm such that the heuristic estimate of the expected loss G(Ex∼D[L(x,M(x))]|π) is low.
2. Using the same set of arguments, computing the heuristic estimate of the loss on the new input x∗: G(L(x∗,M(x∗))|π). If the heuristic estimate is high, flag the new input as an anomaly.
Note that mechanistic anomaly detection requires the heuristic estimator to operate both on the expected loss and the loss of individual instances, which as we'll see is a non-trivial property.
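The two-step procedure can be sketched as follows. This is a toy stand-in, not ARC's actual estimator: `estimate_loss`, the candidate arguments, and the threshold are all hypothetical placeholders chosen for illustration.

```python
# Toy sketch of the mechanistic anomaly detection decision rule.

def find_arguments(train_inputs, estimate_loss, candidate_args):
    # Step 1: pick arguments under which the estimated expected loss is low
    def expected_loss(args):
        return sum(estimate_loss(x, args) for x in train_inputs) / len(train_inputs)
    return min(candidate_args, key=expected_loss)

def is_anomaly(x_star, args, estimate_loss, threshold):
    # Step 2: flag x* if the argued loss estimate on the instance is high
    return estimate_loss(x_star, args) > threshold

# Toy "estimator": loss is low near the argued value, high far from it
estimate_loss = lambda x, args: abs(x - args)
args = find_arguments([0.9, 1.0, 1.1], estimate_loss, candidate_args=[0.0, 1.0])
flag_typical = is_anomaly(1.05, args, estimate_loss, threshold=0.5)
flag_weird = is_anomaly(9.0, args, estimate_loss, threshold=0.5)
```

The key property this illustrates: the same arguments are reused for the instance-level estimate, so an input whose loss the argued estimator cannot explain gets flagged.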
Example: Covariance Propagation on Arithmetic Circuits
(Note: We assume as background the first four sections of Appendix D in Formalizing the Presumption of Independence)
We start with the example of sparse-covariance propagation, a heuristic estimation algorithm for estimating the expected output of arithmetic circuits from ARC's first paper on heuristic arguments. We use sparse-covariance propagation to establish the four components of heuristic estimation for mechanistic anomaly detection:
1. naive estimation with presumption(s) of independence
2. an argument space
3. surprise accounting
4. estimation on individual instances
We introduce methods for surprise accounting on covariance propagation not explicitly described in prior work, and sketch out how "wick propagation" might extend covariance propagation to individual instances. Note that both of these contributions are novel and may be mistaken.
Setup
Suppose we have an arithmetic circuit C with standard normal inputs z1,...,zn, arithmetic nodes x1,...,xm representing either addition or multiplication (with C=xm), and wire values xk(z1,...,zn). We want to estimate the expected value of the output EN(0,1)[C].
Naive Estimator with Presumption of Independence
In mean propagation, we assume all wires in the circuit are independent. For addition nodes, this makes no difference (E[xi+xj]=E[xi]+E[xj]). For multiplication nodes though, independence implies E[xixj]=E[xi]E[xj], since the covariance term is 0. We use this presumption of independence to propagate means to the output of the circuit:
Mean Propagation (figure from Formalizing the Presumption of Independence)
Mean propagation will act as our "naive" estimator: given no additional arguments, we assume all nodes are independent, and propagate means to compute the expected output of the circuit.
While estimating the mean of the output is sufficient for estimation purposes, for surprise accounting we'll want to produce distributions over each wire in the circuit. To do so, we propagate the variance (not covariance) of nodes with the following rules:
For $x_k = x_i + x_j$: $\sigma_k^2 = \sigma_i^2 + \sigma_j^2$
For $x_k = x_i x_j$: $\sigma_k^2 = \sigma_i^2\sigma_j^2 + \sigma_i^2\mu_j^2 + \sigma_j^2\mu_i^2$
Performing variance propagation on the above example, we have
Variance (not covariance) propagation
Assuming wires are distributed according to a joint gaussian, propagating the variance yields the joint distribution over wires q(x):=N(μ,I⋅σ2), with μ and σ2 the (vector-valued) propagated mean and variance.
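To make the propagation rules concrete, here is a minimal Python sketch on a small made-up circuit (the circuit and variable names are illustrative, not from the paper):

```python
# Mean and variance propagation under the presumption of independence.

def prop_add(mu_i, var_i, mu_j, var_j):
    # E[xi + xj] = E[xi] + E[xj]; variances add for (presumed) independent wires
    return mu_i + mu_j, var_i + var_j

def prop_mul(mu_i, var_i, mu_j, var_j):
    # Presuming independence: E[xi * xj] = E[xi] E[xj]
    mu = mu_i * mu_j
    var = var_i * var_j + var_i * mu_j ** 2 + var_j * mu_i ** 2
    return mu, var

# Toy circuit: x3 = z1 * z2, x4 = z1 + x3, with z1, z2 ~ N(0, 1)
mu3, var3 = prop_mul(0.0, 1.0, 0.0, 1.0)
mu4, var4 = prop_add(0.0, 1.0, mu3, var3)
```

Note that the naive estimator ignores that z1 feeds into both summands of x4; in this particular circuit that happens not to bias the mean, but in general (e.g. x = z * z) it does.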
Argument Space
To improve our estimate of the expected output, we selectively violate the presumption of independence, tracking particular covariances between nodes and incorporating these covariances into our propagated mean and (co)variance computations. Concretely, let Π=Bm×m be the space of boolean matrices over each pair of wires in the circuit[1]. Then, for a given argument π∈Π, we include the covariance σij if πij=1. Propagating means and covariances as before yields a distribution pπ(x)=N(μπ,Σπ). Again note that μπm is the only quantity that matters for estimating the output, but, as we'll see in the next section, we use the entire joint distribution pπ(x) to compute the surprise of π.
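As a tiny illustration of how an argument changes the estimate, consider the classic case where a wire is multiplied by a copy of itself (a sketch; the helper function and names are mine):

```python
# E[xa * xb] = E[xa] E[xb] + cov(xa, xb); the covariance term is included
# only when the argument pi flags the pair (a, b).

def mul_mean(mu_a, mu_b, cov_ab, pi_ab):
    return mu_a * mu_b + (cov_ab if pi_ab else 0.0)

# x2 = z * z with z ~ N(0, 1): the true expectation is E[z^2] = 1
naive = mul_mean(0.0, 0.0, cov_ab=1.0, pi_ab=0)   # presumes independence
argued = mul_mean(0.0, 0.0, cov_ab=1.0, pi_ab=1)  # tracks cov(z, z) = 1
```

The naive estimate is 0, while the single-bit argument "track this covariance" corrects it to the true value of 1.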
Surprise Accounting
Recall that for surprise accounting, we want to compute
the surprise of the argument + the surprise of the estimate (given the argument)
In information theory, surprise is defined as the negative log probability of an event:
$I(x) = -\log p(x)$
Applying the laws of probability, we can easily reconstruct the surprise objective as minimizing the joint surprise of the target and the arguments:
$I(Y, \pi) = I(Y \mid \pi) + I(\pi)$
However, given that we typically do not have access to a direct prior on the arguments, we instead appeal to broader notions of surprise and information. To reduce ambiguity, we denote surprise with S:
$S(Y, \pi) = S(Y \mid \pi) + S(\pi)$
Surprise of the Estimate
Assuming we have a distribution over the output pπ(y) and can sample ground truth targets, in some sense the most natural measure of the surprise of the estimate is the empirical cross entropy loss:
$S(Y \mid \pi) \approx \frac{1}{N} \sum_i -\log p_\pi(y_i)$
However, this definition is unsatisfying for multiple reasons. First, it does not specify how many samples to use. Second, it is inconsistent with ARC's original surprise example, where the surprise of the estimate is summed over the entire (finite) input distribution; it is unclear how to extend a similar procedure to the continuous (uncountably infinite) case. Third, it is unclear how to compute the surprise of a point estimate, rather than a distribution.
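A sketch of the empirical cross-entropy surprise for a gaussian output estimate (the sample count and parameters here are arbitrary choices, not prescribed by the formalism):

```python
import math
import random

def gauss_neg_logpdf(y, mu, var):
    # -log N(y; mu, var)
    return 0.5 * (math.log(2 * math.pi * var) + (y - mu) ** 2 / var)

def surprise_of_estimate(samples, mu, var):
    # S(Y | pi) ~= (1/N) sum_i -log p_pi(y_i)
    return sum(gauss_neg_logpdf(y, mu, var) for y in samples) / len(samples)

random.seed(0)
ys = [random.gauss(0.0, 1.0) for _ in range(10_000)]   # ground-truth samples
good = surprise_of_estimate(ys, mu=0.0, var=1.0)   # near the gaussian entropy
bad = surprise_of_estimate(ys, mu=5.0, var=1.0)    # mis-centered, more surprise
```

A well-calibrated estimate incurs roughly the entropy of the true distribution; a bad estimate pays extra surprise.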
Description Length
To compute the surprise (information content) of the argument S(π), we might appeal to description length: that is, the number of bits required to encode π, relative to the empty explanation ∅. For binary arguments (like in covariance propagation), this corresponds to |π| (the number of tracked covariances)[2].
Information Gain
Note, though, that our arguments π "point out" information about latent variables (in the case of covariance propagation, the wire values (x1,...,xm)) that in some sense was not available to the naive estimator. Thus we might measure the surprise of π as the information gained about the latents x from π relative to ∅. Formally, we estimate the surprise of the argument as the expected log ratio of the argument distribution to the naive distribution, again estimated with samples of x:
$S(\pi) = \mathbb{E}_x\left[\log \frac{p_\pi(x)}{q(x)}\right] \approx \frac{1}{N} \sum_i \log \frac{p_\pi(x_i)}{q(x_i)}$
Information gain has some intuitive appeal, and indeed a method along these lines is proposed at the end of this report. Further, in our appendix Surprise Accounting on Boolean Circuits, we show that information gain and description length are equivalent in the boolean circuit example from the original surprise accounting post. However, after many failed attempts to formalize the relationship between the information/description length of π and expected information gain[3], I'm skeptical (and very confused) about whether information gain is a reasonable surprise metric. That caveat aside, much of the remainder of this post builds off information gain as a surprise metric, so we will for now assume its validity.
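As a sanity check on the formula, here is a sketch estimating the information gain when pπ tracks a single correlation ρ between two wires while q presumes independence. The value of ρ and the sample count are made up; the Monte-Carlo estimate should approach KL(pπ‖q) = −(1/2) log(1−ρ²):

```python
import math
import random

RHO = 0.8  # hypothetical tracked correlation between two wires

def log_q(x1, x2):
    # naive model: independent standard normals
    return -0.5 * (x1 ** 2 + x2 ** 2) - math.log(2 * math.pi)

def log_p(x1, x2):
    # argued model: bivariate normal with correlation RHO
    det = 1.0 - RHO ** 2
    quad = (x1 ** 2 - 2 * RHO * x1 * x2 + x2 ** 2) / det
    return -0.5 * quad - math.log(2 * math.pi * math.sqrt(det))

random.seed(0)
n = 20_000
total = 0.0
for _ in range(n):
    # sample (x1, x2) from the "true" correlated distribution
    x1 = random.gauss(0.0, 1.0)
    x2 = RHO * x1 + math.sqrt(1.0 - RHO ** 2) * random.gauss(0.0, 1.0)
    total += log_p(x1, x2) - log_q(x1, x2)
info_gain = total / n
```

Tracking a stronger correlation buys a better estimate but costs more "surprise of the argument" under this metric.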
Computational Cost
Thus far we have ignored the information content of G itself, only measuring the surprise of the estimate and the arguments. But the estimator itself can contain information too, and beyond our presumptions of independence, there's not a principled way of separating the estimator and the arguments to the estimator.
For example, in sparse covariance propagation, our estimator starts with access to the full circuit and the distribution of the inputs, and assumes each node is independent. But the amount of surprise is relative to our definition of the estimator. We could instead parameterize the argument space as a single boolean: whether to run mean propagation or covariance propagation. The storage cost of the equivalent argument on the original estimator would be m×m, compared to 1 for the new estimator. Similarly, we could presume the independence of all but one pair of nodes, such that the relative information gain only tracked the difference caused by tracking the covariance of this one pair.
However, we can potentially resolve these concerns by considering the computational cost of running the estimator with a given set of arguments. In sparse covariance propagation, every tracked covariance adds terms to summations, so computational cost increases monotonically with the number of tracked covariances. This metric is independent of the construction of the argument space, and also includes the "fixed" cost of the default estimator, making it useful for comparisons across different estimators.
We also note that FLOPs are used as a metric for argument length in Compact Proofs of Model Performance via Mechanistic Interpretability, and as a general budget in Estimating the Probabilities of Rare Outputs in Language Models, indicating that ARC and collaborators think computational cost is closely related to the surprise of arguments.
Picking a Metric
In practice, we will typically fix the estimator G, and search for arguments π that minimize surprise. On this view, the description length and information gain definition of surprise will be used to optimize arguments, while the computational cost definition will be useful for comparing different estimation methods (we perform such a comparison in Inter-Method Surprise Accounting). Ultimately, I think there's hope of unifying these three notions, pulling on the connections between algorithmic and statistical information theory.
Estimation on Instances
(Note - this section is inferred from a note on Wick Propagation at the end of the Max of k Heuristic Estimator report and Redwood's work on Generalized Wick Decomposition, and could be wrong/confused - I'm eager to get feedback)
Supposing we find π∗ that minimizes (some notion of) surprise, how would we produce an estimate G(C(x∗)|π∗) on a single instance x∗?
When estimating the distribution, we applied update rules presuming variables were independent (taking covariances to be 0), unless our argument specified that the covariance should be tracked. To apply the same arguments to instances, we need to define similar update rules, with corresponding terms tracking the analogue of covariance.
For example, consider a node xc=xaxb. To compute the expectation μc, we decompose the expectation of a product as the sum of the product of expectations and the covariance. We apply our argument as a "weight" in this decomposition, yielding:
$\mu_c = \mu_a \mu_b + \pi_{ab} \sigma_{ab}$
To produce an analogous decomposition for xc, we appeal to the wick decomposition:
$x_c = \mu_c + \mu_b\,\epsilon(x_a) + \mu_a\,\epsilon(x_b) + \pi_{ab}\,\epsilon(x_a, x_b)$
with μc defined according to our argument and ϵ the wick product. Note that ϵ(xa,xb) is recursively defined, such that if $\pi_{ab} = 1$, then the propagated value recovers $x_c = x_a x_b$ exactly. We call this procedure Wick Propagation (I think?).
I still don't have great intuitions about wick products and wick propagation, and as stated at the beginning of this section, the details here could be off. The important point is that to perform mechanistic anomaly detection, our estimator must be able to operate on individual instances, which can require non-trivial extensions to the original definition of the estimator.
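Under my (possibly off) reading, a single product node could be handled as below. Here eps_ab is the gaussian wick product :xa xb:, and all names are mine; treat this as a sketch under those assumptions:

```python
# Sketch of wick propagation on one product node x_c = x_a * x_b.
# If pi_ab = 1 the instance value is recovered exactly; if pi_ab = 0
# we fall back to the independence approximation.

def wick_mul(xa, xb, mu_a, mu_b, cov_ab, pi_ab):
    mu_c = mu_a * mu_b + pi_ab * cov_ab   # argued mean
    eps_a = xa - mu_a                     # epsilon(x_a)
    eps_b = xb - mu_b                     # epsilon(x_b)
    # second-order wick product :x_a x_b: (zero-mean for jointly gaussian wires)
    eps_ab = xa * xb - mu_b * xa - mu_a * xb + mu_a * mu_b - cov_ab
    return mu_c + mu_b * eps_a + mu_a * eps_b + pi_ab * eps_ab

exact = wick_mul(2.0, 3.0, mu_a=0.5, mu_b=0.5, cov_ab=1.0, pi_ab=1)
approx = wick_mul(2.0, 3.0, mu_a=0.5, mu_b=0.5, cov_ab=1.0, pi_ab=0)
```

With the pair tracked, the terms telescope and the node's true instance value (2.0 × 3.0 = 6.0) is recovered; without it, we get the first-order independence approximation.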
Recap: Ingredients of Heuristic Estimators
We have now defined all the elements for applying heuristic estimation to mechanistic anomaly detection. Given a quantity to estimate, we construct a naive estimator based on a presumption of independence. Next, we define an argument space that selectively "violates" the presumption of independence to improve our estimate. To select arguments, we define a surprise metric, capturing the surprise of the arguments and the surprise of the estimate given the arguments. Using the surprise metric (and Monte-Carlo samples from the ground truth distribution) we search for arguments that minimize the total surprise. To apply our arguments to mechanistic anomaly detection, we define a variant of our estimator that takes in the same arguments but estimates the target quantity on individual instances.
Equipped with the basic building blocks of heuristic estimation, we can now explore and compare various approaches to heuristic estimation on deep neural networks.
Activation Modeling
Activation modeling is an approach to heuristic estimation on neural networks that aims to produce a probability distribution on model activations while faithfully explaining some target expression (e.g. expected output). In what follows, we first briefly recap Layer-By-Layer activation modeling and connect it to heuristic estimation with a proposed method for surprise accounting. Next, we analyze problems with layer-by-layer activation modeling, making connections to mechanistic interpretability and various subproblems of heuristic arguments in general.
Heuristic Estimation with Layer-by-Layer Activation Modeling
Layer-by-layer activation modeling is essentially covariance propagation (when taking each distribution to be a multivariate gaussian) applied to neural networks. Roughly following the original notation, we have a neural network M:X0→Xn which can be expressed as a composition of n functions f0,f1,...,fn−1 (where Xi denotes the domain of fi and we let Y:=Xn denote the output domain). In general, to produce an output distribution Pn, we successively fit each distribution Pi+1 to the distribution produced by applying the layer transformation fi to the previous distribution Pi.
Note that in its "strongest" form, activation modeling only requires a pre-specified distribution on the input, and fits each layer in roughly constant time. In later sections we'll relax both properties, leading to a distinction/spectrum between "analytic" and "empirical" activation modeling.
Below, we analyze the multivariate gaussian layer-by-layer activation modeling presented in the original post, defining a naive estimator, argument space, surprise metric, and estimator on instances.
Setup
Taking each distribution Pi to be a multivariate gaussian N(μi,Σi), we fit the next layer by taking μi+1,Σi+1 to be the (sample) mean and covariance of fi(N(μi,Σi)), which, by the moment matching theorem, minimizes KL(fi(N(μi,Σi))||N(μi+1,Σi+1))[4].
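A sketch of one such fitting step via samples, in one dimension for simplicity (the layer f and sample count are stand-ins; in the "analytic" regime the moments would be computed in closed form rather than sampled):

```python
import math
import random

def fit_next_layer(mu, var, f, n_samples=50_000, seed=0):
    # Push samples of N(mu, var) through the layer f, then moment-match:
    # the sample mean and variance minimize the forward KL to a gaussian.
    rng = random.Random(seed)
    ys = [f(rng.gauss(mu, math.sqrt(var))) for _ in range(n_samples)]
    mean = sum(ys) / n_samples
    var_out = sum((y - mean) ** 2 for y in ys) / n_samples
    return mean, var_out

# Stand-in layer: tanh nonlinearity applied to a standard normal activation
mu1, var1 = fit_next_layer(0.0, 1.0, math.tanh)
```

Iterating this step layer by layer produces the output distribution Pn, at the cost of discarding whatever structure the gaussian cannot represent.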
Naive Estimator with Presumption of Independence
We presume independence between each pair of activations by setting all covariance terms to 0. We express the naive distribution at each layer i as Qi(x):=N(μi,I⋅σ2i), with μi and σ2i estimated from the (masked) distribution of the previous layer.
Argument Space
Arguments are boolean masks over each covariance matrix: $\Pi := \{\mathbb{B}^{\dim(X_i) \times \dim(X_i)}\}_{i=0...n}$ (with the naive argument initialized as the identity to allow for variance propagation). We denote the distribution at a given layer induced by an argument as $P_i^\pi(x)$.
Surprise Accounting
As before, defining the surprise of the estimate given an argument is straightforward, but defining the surprise of the argument is still contingent on the chosen notion of surprise.
For computing the surprise of the estimate, we compute the cross entropy of the output distribution relative to Monte-Carlo samples y1,...,yN of the true output distribution:
$S(\mathbb{E}_{N(\mu_0, \Sigma_0)}[M] \mid \pi) = \frac{1}{N} \sum_{j=1}^{N} -\log P_n^\pi(y_j)$
For computing the surprise of the argument, we can again appeal to description length or information gain:
$S_{DL}(\pi) = \sum_{i=0}^{n} |\pi_i| - \dim(X_i)$
$S_{IG}(\pi) = \sum_{i=0}^{n} \mathbb{E}_{x \sim P_i(X)}\left[\log P_i^\pi(x) - \log Q_i(x)\right]$
Beyond both these definitions, we should also consider the computational cost of computing the estimate. In layer-by-layer activation modeling, cost scales roughly as O(n) (with n the number of layers), treating each propagation as constant. But as we explore more sampling-based approaches (or "empirical" activation modeling), we'll need to introduce the number of samples N into our estimate.
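For instance, the description-length term is just a count over the masks. A sketch with made-up 2×2 per-layer masks:

```python
# S_DL(pi) = sum_i |pi_i| - dim(X_i): count tracked entries per layer,
# discounting the diagonal that the naive (identity) argument already includes.

def s_dl(masks):
    return sum(sum(map(sum, mask)) - len(mask) for mask in masks)

masks = [
    [[1, 1],
     [1, 1]],   # layer 0: also tracks the off-diagonal covariance pair
    [[1, 0],
     [0, 1]],   # layer 1: naive identity mask, contributes 0
]
cost = s_dl(masks)
```

Here the argument pays 2 bits: the two off-diagonal entries tracked at layer 0.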
Estimation on Instances
To run the activation model on a single instance x∗, we could (I think) use Wick Propagation (but given my uncertainty about how to define wick propagation in general, I won't try to extend the definition here). Alternatively, instead of using the estimate of M(x∗), we could leverage the distribution Pπ(x) induced by our arguments, taking the negative log probability of the activations $\sum_{i=0}^{n} -\log P_i^\pi(x_i^*)$ as an anomaly score[5]. Note that for an arbitrarily parameterized distribution Pπ this approach exactly recovers the (standard) anomaly detection method of fitting multivariate gaussians to activations. What makes the approach "heuristic arguments flavored" is the analytic propagation of selected parts of distributions.
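The activation log-probability anomaly score can be sketched as follows (diagonal per-layer gaussians for simplicity; the model parameters and activations are hypothetical):

```python
import math

def neg_logpdf(a, mu, var):
    # -log N(a; mu, var)
    return 0.5 * (math.log(2 * math.pi * var) + (a - mu) ** 2 / var)

def anomaly_score(activations, model):
    # sum over layers of -log P_i^pi(x_i*), with diagonal per-layer gaussians
    return sum(
        neg_logpdf(a, mu, var)
        for layer, (mus, vars_) in zip(activations, model)
        for a, mu, var in zip(layer, mus, vars_)
    )

# (means, variances) per layer, as fit by the activation model
model = [([0.0, 0.0], [1.0, 1.0]), ([1.0], [0.5])]
typical = anomaly_score([[0.1, -0.2], [0.9]], model)
weird = anomaly_score([[4.0, -5.0], [-3.0]], model)
```

An instance whose activations are improbable under the argued distribution gets a high score, and can be flagged against a threshold.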
In the following section, we explore a broader space of approaches which diverge from this kind of heuristic-argument platonic ideal, but in doing so address potential problems with layer-by-layer analytic activation modeling.
A central concern with layer-by-layer activation modeling is that it forgets too much information. Restating the example given in ARC's post, suppose we have the following boolean network:
$f_1: (x) \to (x, h(x))$
$f_2: (x, y) \to (1 \text{ iff } h(x) == y)$
with h(x) a pseudo-random function which can't be captured by the activation model. The output of the network is always 1, but we will model it as Bern(0.5). The basic problem here is that we have no way to track that h(x) is applied in multiple layers, causing the approximation errors to be highly correlated.
Joint Activation Modeling
ARC's suggested remedy is to abandon layer-by-layer modeling in favor of a "joint" activation model, which e.g. could capture h
Thanks to Erik Jenner for helpful comments and discussion
(Epistemic Status: Tentative take on how to think about heuristic estimation and surprise accounting in the context of activation modeling and causal scrubbing. Should not be taken as authoritative/accurate representation of how ARC thinks about heuristic estimation)
I'm pretty excited about ARC's heuristic arguments agenda. The general idea that "formal(ish) arguments" about properties of a neural network could help improve safety has intuitive appeal.
ARC has done a good job recently communicating about high level intuitions, goals, and subproblems of the heuristic arguments agenda. Their most recent post "A birds eye of ARC's agenda" lays out their general vision of how heuristic arguments would help with alignment, contextualizes their research output to date, and enumerates open subproblems.
However, I still can find it difficult to synthesize their various lines of work. I find myself asking questions like: Is activation modeling actually just cumulant propagation? How would we do surprise accounting on activation models? Is causal scrubbing a heuristic estimation method? How could singular learning theory contribute? What is "capacity allocation" anyway?
This post aims to partially resolve (or at least clarify) these confusions. I
In the appendix, I motivate the three notions of surprise in the context of the boolean circuits, explore choosing a graph basis for causal scrubbing, connect "negative components" in mech-interp to cherry-picking in heuristic estimation, and gesture towards more "weight-based" notions of heuristic estimation
Background
Heuristic Estimation
Using the notation from here, a heuristic estimator is an algorithm G that takes as input a mathematical expression Y and arguments π1...πm, and outputs an estimate of Y given arguments π1...πm: G(Y|π1...πm). The heuristic estimator makes some presumption(s) of independence, and arguments provide "correlational" information that update away from the naive estimate.
Surprise Accounting
ARC suggests surprise accounting as a an objective for searching for heuristic arguments. Given an argument space Π, find an argument π that minimizes:
the surprise of the argument + the surprise of the estimate (given the argument)
The surprise of the estimate should capture something like "the information required to encode the target, given the target estimate". Ideally, the estimator outputs a distribution over the target, and we can compute sample cross entropy or KL divergence. However, important details remain unclear (how to handle point estimates of outputs, metrics which do not track information, aggregating surprise over continuous vs discrete, finitely supported distributions to name a few)
The surprise of the argument is more complicated. ARC's post on surprise accounting and related works admit multiple plausible notions: description length, information gain, and computational cost.
Mechanistic Anomaly Detection
Mechanistic anomaly detection is one of the central target applications of heuristic arguments, offering (part of) a possible solution to both eliciting latent knowledge and deceptive alignment. Again using previously introduced notation, let M be a neural network trained on a dataset D of inputs x using the loss function/metric L(x,M(x)). Assume the network achieves low loss, i.e. Ex∼D[L(x,M(x))] is small. We determine whether a new input x∗ is an anomaly by
Note that mechanistic anomaly detection requires the heuristic estimator to operate both on the expected loss and the loss of individual instances, which as we'll see is a non-trivial property.
Example: Covariance Propagation on Arithmetic Circuits
(Note: We assume as background the first four sections of Appendix D in Formalizing the Presumption of Independence)
We start with the example of sparse-covariance propagation, a heuristic estimation algorithm for estimating the expected output of arithmetic circuits from ARC's first paper on heuristic arguments. We use sparse-covariance propagation to establish the four components of heuristic estimation for mechanistic anomaly detection:
We introduce methods for surprise accounting on covariance propagation not explicitly described in prior work, and sketch out how "wick propagation" might work for extending covariance propagation to individual instances, though note that both these contributions are novel and could be confused.
Setup
Suppose we have an arithmetic circuit C, with standard normal inputs z1,...,zn, arithmetic nodes x1...xm representing either addition and multiplication (with C=xm), and wire values xk(z1...zk). We want to estimate the expected value of the output EN(0,1)[C].
Naive Estimator with Presumption of Independence
In mean propagation, we assume all wires in the circuit are independent. For addition nodes, this makes no difference (E[xi+xj]=E[xi]+E[xj]). For multiplication nodes though, independence implies E[xixj]=E[xi]E[xj], since the covariance term is 0. We use this presumption of independence to propagate means to the output of the circuit:
Mean propagation will act as our "naive" estimator: given no addition arguments, we assume all nodes are independent, and propagate means to compute the expected output of a circuit.
While estimating the mean of the output is sufficient for estimation purposes, for surprise accounting, we'll want to to produce distributions over each wire in the circuit. To do so, we propagate the variance (not covariance) of nodes with the following rules:
For xk=xi+xj: σ2k=σ2i+σ2j
For xk=xixj : σ2k=σ2iσ2j+σ2iμ2j+σ2jμ2i
Performing variance propagation on the above example, we have
Assuming wires are distributed according to a joint gaussian, propagating the variance yields the joint distribution distribution over wires q(x):=N(μ,I⋅σ2), with μ and σ2 the (vector valued) propagated mean and variance.
Argument Space
To improve our estimate of the expected output, we selectively violate the presumption of independence, tracking particular covariances between nodes and incorporating these covariances into our propagated mean and (co)variance computations. Concretely, let Π=Bm×m be the space of boolean matrices over each pair of wires in the circuit[1]. Then, for a given argument π∈Π, we include the covariance σij if πij=1. Propagating means and covariances as before yields a distribution pπ(x)=N(μπ,Σπ). Again note that μπm is the only quantity that matters for estimating the output, but, as we'll see in the next section, we use the entire joint distribution pπ(x) to compute the surprise of π.
Surprise Accounting
Recall that for surprise accounting, we want to compute
the surprise of the argument + the surprise of the estimate (given the argument)
In information theory, surprise is defined as the negative log probability of an event:
I(x)=−logp(x)
Applying the laws of probability, we can easily reconstruct the surprise objective as minimizing the joint surprise of the target and the arguments:
I(Y,π)=I(Y|π)+I(π)
However, given we typically do not have access to a direct prior on the arguments, we instead appeal to broader notions of surprise and information. To reduce abiguity, we denote surprise with S:
S(Y,π)=S(Y|π)+S(π)
Surprise of the Estimate
Assuming we have a distribution over the output pπ(y) and can sample ground truth targets, in some sense the most natural measure of the surprise of the estimate is the empirical cross entropy loss:
S(Y|π)≈1N∑i−logpπ(yi)
However, this definition is unsatisfying for multiple reasons. First, it does not specific how many samples. Second, it is inconsistent with ARC's original surprise example, where surprise of the estimate is summed over the entire (finite) input distribution - it is unclear how to extend a similar procedure to the (continuous) (uncountably) infinite case. Third, it is unclear how to compute surprise of the a point estimate, rather than a distribution.
Description Length
To compute the surprise (information content) of the argument S(π) , we might appeal to description length. That is, computing the number of bits required to encode π, relative to the empty explanation ∅. For binary arguments (like in covariance propagation), this corresponds to |π| (the number of tracked covariances) [2].
Information Gain
Note though, that our arguments π "point out" information about latent variables (in the case of covariance propagation, wire values (x1,...,xm)), that in some sense was not available to the naive estimator. Thus we might measure the surprise of π as the relative information gained about the latents x from π relative to ∅. Formally, we estimate the surprise of the arguments as the expected information ratio of the argument distribution and the naive distribution, again estimating the surprise with samples of x:
S(π)=Ex[logpπ(x)q(x)]≈1N∑ilogpπ(xi)q(xi)
Information gain has some intuitive appeal, and indeed a method along these lines is proposed at the end of this report. Further, in our appendix Surprise Accounting on Boolean Circuits, we show that information gain and description length are equivalent in the boolean circuit example from the original surprise accounting post. However, after many failed attempts to formalize the relationship between the information/description length of π and expected information gain[3], I'm skeptical (and very confused) about whether information gain is a reasonable surprise metric. That caveat aside, much the remainder of this post builds off information gain as a surprise metric, so we will for now assume its validity.
Computational Cost
Thus far we have ignored the information content of G itself, only measuring the surprise of the estimate and the arguments. But the estimator itself can contain information too, and beyond our presumptions of independence, there's not a principled way of separating the estimator and the arguments to the estimator.
For example, in sparse covariance propagation, our estimator starts with access to the full circuit and the distribution of the inputs, and assumes each node is independent. But the amount of surprise is relative to our definition of the estimator. We could instead parameterize the argument space as a single boolean: whether to run mean propagation or covariance propagation. The storage cost of the equivalent argument on the original estimator would be m×m, compared to 1 for the new estimator. Similarly, we could presume the independence of all but one pair of nodes, such that the relative information gain only tracked the difference caused by tracking the covariance of this one pair.
However, we can potentially resolve these concerns by considering the computational cost of running the estimator with a given set of arguments. In sparse covariance propagation, every tracked covariance adds terms to summations, thus computational cost monotonically increases the number of tracked covariances. This metric is independent of the construction of the argument space, and also includes the "fixed" cost of the default estimator, making it useful for making comparisons across different estimators.
We also note that FLOPS are uses as a metric for argument length in Compact Proofs of Model Performance via Mechanistic Interpretability , and as a general budget in Estimating the Probabilities of Rare Outputs in Language Models, indicating that ARC and collaborators think computation cost is closely related to surprise of arguments.
Picking a Metric
In practice, we will typically fix the estimator G, and search for arguments π that minimize surprise. On this view, the description length and information gain definition of surprise will be used to optimize arguments, while the computational cost definition will be useful for comparing different estimation methods (we perform such a comparison in Inter-Method Surprise Accounting). Ultimately, I think there's hope of unifying these three notions, pulling on the connections between algorithmic and statistical information theory.
Estimation on Instances
(Note - this section is inferred from a note on Wick Propagation at the end of the Max of k Heuristic Estimator report and Redwood's work on Generalized Wick Decomposition, and could be wrong/confused - I'm eager to get feedback)
Supposing we find π∗ that minimizes (some notion of) surprise, how would we produce an estimate G(C(x∗)|π∗) on a single instance x∗?
When estimating the distribution, we applied update rules presuming variables were independent (taking covariance to be 0), unless our argument specified that the covariance should be tracked. To apply the same arguments to instances, we need to define similar update rules, with corresponding terms tracking the analogue of covariance.
For example, consider a node xc=xaxb. To compute the expectation μc, we decompose the expectation of a product as the sum of the product of expectations and the covariance. We apply our argument as a "weight" in this decomposition, yielding:
μc=μaμb+πabσab
To produce an analogous decomposition for xc, we appeal to wick decomposition:
xc=μc+μaϵ(xb)+μbϵ(xa)+πabϵ(xa,xb)
with μc defined according to our argument and ϵ the wick product. Note that ϵ(xa,xb) is recursively defined, such that if πab=1, then xc=xaxb exactly on each instance. We call this procedure Wick Propagation (I think?).
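One possible concrete reading of this update rule (which, per the caveat above, could be off) for a single product node: take ϵ(xa)=xa−μa, ϵ(xb)=xb−μb, and ϵ(xa,xb)=ϵ(xa)ϵ(xb)−σab, with the argument πab weighting the covariance/Wick terms:

```python
import numpy as np

def wick_propagate_product(x_a, x_b, mu_a, mu_b, sigma_ab, pi_ab):
    """Estimate x_c = x_a * x_b on instances. pi_ab weights the covariance
    term in the mean and the second-order Wick term in the decomposition.
    (A sketch of one reading of Wick Propagation, not a definitive version.)"""
    mu_c = mu_a * mu_b + pi_ab * sigma_ab    # argument-weighted mean, as above
    eps_a, eps_b = x_a - mu_a, x_b - mu_b    # first-order Wick products
    eps_ab = eps_a * eps_b - sigma_ab        # second-order Wick product eps(x_a, x_b)
    return mu_c + mu_a * eps_b + mu_b * eps_a + pi_ab * eps_ab

# With pi_ab = 1, the estimate recovers x_a * x_b exactly on every instance:
rng = np.random.default_rng(0)
x_a, x_b = rng.normal(size=100), rng.normal(size=100)
exact = wick_propagate_product(x_a, x_b, x_a.mean(), x_b.mean(),
                               np.cov(x_a, x_b)[0, 1], pi_ab=1.0)
assert np.allclose(exact, x_a * x_b)
```

With πab=0, the estimate keeps only the terms linear in the deviations, matching the presumption of independence.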
I still don't have great intuitions about wick products and wick propagation, and as stated at the beginning of this section, the details here could be off. The important point is that to perform mechanistic anomaly detection, our estimator must be able to operate on individual instances, which can require non-trivial extensions to the original definition of the estimator.
Recap: Ingredients of Heuristic Estimators
We have now defined all the elements for applying heuristic estimation to mechanistic anomaly detection. Given a quantity to estimate, we construct a naive estimator based on a presumption of independence. Next, we define an argument space that selectively "violates" the presumption of independence to improve our estimate. To select arguments, we define a surprise metric, capturing the surprise of the arguments and the surprise of the estimate given the arguments. Using the surprise metric (and Monte-Carlo samples from the ground truth distribution) we search for arguments that minimize the total surprise. To apply our arguments to mechanistic anomaly detection, we define a variant of our estimator that takes in the same arguments but estimates the target quantity on individual instances.
Equipped with the basic building blocks of heuristic estimation, we can now explore and compare various approaches to heuristic estimation on deep neural networks.
Activation Modeling
Activation modeling is an approach to heuristic estimation on neural networks that aims to produce a probability distribution on model activations while faithfully explaining some target expression (e.g. expected output). In what follows, we first briefly recap Layer-By-Layer activation modeling and connect it to heuristic estimation with a proposed method for surprise accounting. Next, we analyze problems with layer-by-layer activation modeling, making connections to mechanistic interpretability and various subproblems of heuristic arguments in general.
Heuristic Estimation with Layer-by-Layer Activation Modeling
(Note: We strongly recommend reading at least the A Possible Approach for Estimating Tail Risks section of Estimating Tail Risks in Neural Networks).
Layer-by-Layer activation modeling is essentially covariance propagation (taking each distribution to be a multivariate gaussian) applied to neural networks. Roughly following the original notation, we have a neural network M:X0→Xn which can be expressed as a composition of n functions f0,f1,...,fn−1 (where Xi denotes the domain of fi and we let Y:=Xn denote the output domain). In general, to produce an output distribution Pn, we successively fit each distribution Pi+1 to the distribution produced by applying the next layer transformation fi to the prior distribution Pi.
Note that the "strongest" form, activation modeling only requires a pre-specified distribution on the input, and fits each layer in roughly constant time. In later sections we'll relax both properties, leading to a distinction/spectrum between "analytic" and "empirical" activation modeling.
Below, we analyze the multivariate gaussian layer-by-layer activation modeling presented in the original post, defining a naive estimator, argument space, surprise metric, and estimator on instances.
Setup
Taking each distribution Pi to be multivariate gaussian N(μi,Σi), we fit the next layer i+1 by taking μi+1,Σi+1 to be the (sample) mean and covariance, which, by the moment matching theorem, minimizes KL(fi(N(μi,Σi))||N(μi+1,Σi+1))[4].
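A minimal sketch of this fitting step, using Monte-Carlo samples to estimate the pushforward's moments (the batch-vectorized `f_i` and sample count are assumptions for illustration):

```python
import numpy as np

def fit_next_layer(f_i, mu_i, Sigma_i, num_samples=10_000, seed=0):
    """Moment-match f_i(N(mu_i, Sigma_i)) with a gaussian: the (sample) mean
    and covariance of the pushforward minimize the KL divergence from the
    pushforward to the fitted gaussian (moment matching theorem)."""
    rng = np.random.default_rng(seed)
    xs = rng.multivariate_normal(mu_i, Sigma_i, size=num_samples)
    ys = f_i(xs)                            # f_i assumed to act on a batch of rows
    return ys.mean(axis=0), np.cov(ys, rowvar=False)

# Example: propagate a standard gaussian through a ReLU layer.
relu = lambda x: np.maximum(x, 0.0)
mu_1, Sigma_1 = fit_next_layer(relu, np.zeros(2), np.eye(2))
```

Each fitted (μi+1, Σi+1) then serves as the input distribution for the next layer, which is where the "forgetting" problem discussed later comes from: only the first two moments survive each step.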
Naive Estimator with Presumption of Independence
We presume independence between each pair of activations by setting all covariance terms to 0. We express the naive distribution at each layer i as Qi(x):=N(μi,I⋅σi), with μi and σi estimated from the (masked) distribution of the previous layer.
Argument Space
Arguments are boolean masks over each covariance matrix: Π:={Bdim(Xi)×dim(Xi)}i=0...n(with the naive argument initialized as the identity to allow for variance propagation). We denote the distribution at a given layer induced by an argument as Piπ(x)
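A small sketch of how such an argument acts on a layer's covariance matrix (names here are illustrative):

```python
import numpy as np

def apply_argument(Sigma_i, pi_i):
    """Keep only the covariance entries selected by the boolean mask pi_i;
    everything else reverts to the presumption of independence (zero)."""
    return np.where(pi_i, Sigma_i, 0.0)

Sigma = np.array([[1.0, 0.5],
                  [0.5, 2.0]])
naive_mask = np.eye(2, dtype=bool)             # identity: variance-only propagation
assert apply_argument(Sigma, naive_mask)[0, 1] == 0.0
full_mask = np.ones((2, 2), dtype=bool)        # also track the off-diagonal covariance
assert apply_argument(Sigma, full_mask)[0, 1] == 0.5
```

The naive argument (identity masks at every layer) recovers the naive estimator Qi above, and flipping entries on selectively violates independence.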
Surprise Accounting
As before, defining the surprise of the estimate given an argument is straightforward, but defining the surprise of the argument is still contingent on the chosen notion of surprise.
For computing the surprise of the estimate, we compute the cross entropy of the output distribution relative to Monte-Carlo samples y1,...,yN of the true output distribution:
S(EN(μ0,Σ0)[M]|π)=1N∑Nj=1−logPnπ(yj)
For computing the surprise of the argument, we can again appeal to description length or information gain:
SDL(π)=∑ni=0|πi|−dim(Xi)
SIG(π)=∑ni=0Ex∼Pi(X)[logPiπ(x)−logQi(x)]
Beyond both these definitions, we should also consider the computational cost of computing the estimate. In layer-by-layer activation modeling, cost scales roughly as O(n) (with n the number of layers), treating each propagation as constant. But as we explore more sampling-based approaches (or "empirical" activation modeling), we'll need to introduce the number of samples N into our cost accounting.
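Putting the two terms together, a sketch of the total surprise computation (gaussian log-density written out with numpy; the description-length term counts tracked off-diagonal entries per layer, as in SDL above):

```python
import numpy as np

def gaussian_logpdf(y, mu, Sigma):
    """Log-density of N(mu, Sigma) at y."""
    d = len(mu)
    diff = y - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def total_surprise(ys, mu_out, Sigma_out, masks):
    """S(estimate | pi): cross-entropy of the masked output gaussian on
    Monte-Carlo samples ys from the true output distribution.
    S_DL(pi): tracked entries minus the (free) diagonal, summed over layers."""
    s_estimate = -np.mean([gaussian_logpdf(y, mu_out, Sigma_out) for y in ys])
    s_argument = sum(int(pi.sum()) - pi.shape[0] for pi in masks)
    return s_estimate + s_argument
```

Searching over masks then trades off a better-fitting output distribution (lower cross-entropy) against a longer argument.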
Estimation on Instances
To run the activation model on a single instance x∗, we could (I think) use Wick Propagation (but given my uncertainty on how to define wick propagation in general, I won't try to extend the definition here). Alternatively, instead of using the estimate of M(x∗), we could leverage the distribution Pπ(x) induced by our arguments, taking the log probability of the activations ∑ni=0−logPiπ(x∗i) as an anomaly score[5]. Note that for an arbitrarily parameterized distribution Pπ this approach exactly recovers the (standard) anomaly detection method of fitting multivariate gaussians to activations. What makes the approach "heuristic arguments flavored" is the analytic propagation of selected parts of distributions.
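A sketch of the second option, scoring an instance by the negative log-probability of its activations under the per-layer gaussians induced by the argument (a hypothetical implementation, not Wick propagation):

```python
import numpy as np

def anomaly_score(activations, layer_params):
    """Sum of -log P_i^pi(x*_i) over layers, where each P_i^pi is the gaussian
    N(mu_i, Sigma_i) induced by the argument. Higher score = more anomalous."""
    score = 0.0
    for x_i, (mu_i, Sigma_i) in zip(activations, layer_params):
        d = len(mu_i)
        diff = x_i - mu_i
        _, logdet = np.linalg.slogdet(Sigma_i)
        score += 0.5 * (d * np.log(2 * np.pi) + logdet
                        + diff @ np.linalg.solve(Sigma_i, diff))
    return score
```

As noted above, with arbitrary (μi, Σi) this is just gaussian density estimation on activations; the heuristic-arguments flavor comes from how the parameters were propagated.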
In the following section, we explore a broader space of approaches which diverge from this platonic ideal of a heuristic argument, but in doing so address potential problems with layer-by-layer analytic activation modeling.
Empirical Activation Modeling: Beyond Analytic Propagation
A central concern with layer-by-layer activation modeling is that it forgets too much information. Restating the example given in ARC's post, suppose we have the following boolean network:
f1:(x)→(x,h(x))
f2:(x,y)→(1 iff h(x)==y)
with h(x) a pseudo-random function which can't be captured by the activation model. The output of the network is always 1, but we will model it as Bern(0.5). The basic problem here is that we have no way to track that h(x) is being applied in multiple layers, causing the approximation errors to be highly correlated.
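The failure can be made concrete with a toy implementation (the lookup table standing in for a pseudo-random h is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
TABLE = rng.integers(0, 2, size=256)   # fixed random table: a stand-in for a
                                       # pseudo-random h the model can't capture
h = lambda x: int(TABLE[x % 256])

f1 = lambda x: (x, h(x))               # first layer computes (x, h(x))
f2 = lambda x, y: int(h(x) == y)       # second layer checks h(x) == y

# The true output is always 1, since f2 receives y = h(x):
assert all(f2(*f1(x)) == 1 for x in range(256))

# A layer-by-layer model that forgets h and treats the second coordinate as an
# independent Bern(p) bit instead predicts a match probability of only
# p^2 + (1-p)^2, which is about 0.5:
p = TABLE.mean()
independent_estimate = p**2 + (1 - p)**2
assert 0.5 <= independent_estimate < 0.6
```

The approximation error at each layer is small in isolation; the problem is that both layers make the *same* error about h, and the layer-by-layer model has no way to track that correlation.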
Joint Activation Modeling
ARC's suggested remedy is to abandon layer-by-layer modeling in favor of a "joint" activation model, which e.g. could capture h