In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I've seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications.
The purpose of this post is to:
Also: we're hiring! If the research direction described in this post excites you, you can apply to ARC!
Consider the following simple scheme that attempts to align an AI model , which maps inputs to outputs :
I think that this is a fine starting point for an alignment plan, but not a complete plan in and of itself. It suffers from at least two issues:
(There are other technical issues as well,[3] but these are the ones that seem hardest to surmount.)
We believe that ARC's technical research agenda is capable of addressing both of these issues. However, issue #1 is mostly out of scope for this post (though I'll very briefly describe our planned approach in this footnote[4]). The purpose of this post is to explain in detail how we hope to address issue #2.
ARC's goal is to be able to estimate far better than one can just by drawing random samples from . We believe that this can be done by understanding the structure of , , and .
Here's a simple example that's meant to illustrate this point. Suppose that, by understanding the internals of , we are able to notice that is the conjunction of three predicates -- in other words, outputs 1 if and only if all of are true of 's output.
And suppose, furthermore, that we understand the structure of and well enough to understand that are independent events for .[5]
Using this structural understanding, we can estimate the probability that outputs 1 as the product of the probabilities of the three predicates. If each is true with probability one-in-a-million (\(10^{-6}\)), then we can estimate that the probability is about \(10^{-18}\). Obtaining this estimate by sampling would have required roughly \(10^{18}\) samples. By contrast, our structural understanding lets us estimate this probability with roughly \(10^{6}\) samples, and we can potentially do even better than that if we have a structural understanding of the predicates themselves. Thus, our structural understanding lets us estimate the expected value far more efficiently than we could with sampling.
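To make this concrete, here is a minimal numerical sketch of the comparison. The numbers are illustrative (each predicate is true with probability \(10^{-3}\) rather than one-in-a-million, so that the script runs quickly), and the code is a toy, not ARC's method:

```python
import numpy as np

# Toy comparison: a conjunction of three independent rare predicates.
# The structural estimate multiplies per-predicate estimates; black-box sampling
# of the conjunction itself would almost never see a positive example.
rng = np.random.default_rng(0)
p_true = 1e-3                 # hypothetical per-predicate probability (scaled down from 1e-6)
n_per_predicate = 200_000     # enough samples to estimate each predicate individually

# Structural estimate: estimate each predicate's probability separately, then multiply.
p_hats = [(rng.random(n_per_predicate) < p_true).mean() for _ in range(3)]
structural_estimate = float(np.prod(p_hats))
print(f"structural estimate: {structural_estimate:.2e}   true value: {p_true**3:.2e}")

# Black-box sampling of the conjunction with the same total budget: the conjunction
# is true only ~1e-9 of the time, so this almost surely returns 0.
conjunction_hits = (rng.random((3 * n_per_predicate, 3)) < p_true).all(axis=1)
print(f"naive sampling estimate with the same budget: {conjunction_hits.mean():.2e}")
```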
This example is simplistic, of course: in practice, we will need to understand structure that is far more sophisticated than "the output is a conjunction of three independent predicates." But the example illustrates the point that having a detailed mechanistic understanding of a neural net lets us estimate properties of its outputs far better than black-box methods alone.
When we speak of "understanding the structure" of , , and , we are not referring to human understanding. While a conjunctive structure like the one above can be understood by a human, we believe that in general, neural nets will be composed of mathematical structures that are far too complex to allow for a full human understanding.
Instead, we are imagining that an explanation of the structure of a neural net is written in some kind of formal language. The explanation could be as large as the neural net itself, and may be as incomprehensible to a human as the neural net. Thus, our goal is not to have a human look at the structure and estimate the expectation of . Instead, the goal is to invent an algorithm that takes as input the explanation and estimates the expectation of based on that explanation.
This contrasts with almost all other research on neural network interpretability, which aims for a partial, human understanding of neural nets. Our research is instead aimed at a full, algorithmic understanding.
In the next section, I will elaborate on what this means.
In this section, I will more formally describe what we hope to accomplish by gaining structural understanding of neural networks. While above I talked about outperforming sampling, in the fully general case we can only hope to match the performance of sampling. In other words, we expect that the performance of our algorithms in the practical setting of trained neural nets will substantially exceed the worst-case bounds that we will be able to state and prove. See below for further discussion of this point.
I will start with a first-pass attempt at stating the matching sampling principle (MSP). As we will discuss, it does not quite make sense; however, it gets across the key intuition.
In order to state the MSP, we will define a few pieces of notation:
With this notation in place, we will make our first attempt to state the MSP:
Let's parse these three requirements:
We do not have a formal definition of "mechanistic." But, loosely speaking, we mean that estimates the expected output of deductively, based on the structure of . This contrasts with sampling-based algorithms for estimating the expected output of , which operate based on inductive reasoning. Mechanistically estimating the expected output involves finding the reason for the expected output being what it is; meanwhile, sampling-based algorithms merely infer the existence of a reason without learning anything about the reason.
To illustrate this difference, suppose that the explanation given to is a simple heuristic argument (such as mean propagation -- see §D.2 here), which suggests that but is otherwise uninformative about the structure of . Suppose further that computes on a hundred inputs , and it finds that on every one of those hundred inputs. Then should return : that's because it knows that on the hundred inputs that it checks, but it has not seen any structural evidence that would suggest that 's behavior on those hundred inputs has any bearing on how behaves on the inputs that it has not checked. By contrast, a sampling-based estimator that checks the same hundred inputs would return , implicitly assuming that those inputs are representative.
(If indeed always returns , then we believe that there exists a short explanation of this fact; but cannot output unless it is given this explanation.)
In some of our previous work, we discussed covariance propagation: successively modeling each layer of as a multivariate normal distribution.[8] Covariance propagation (and related methods, like mean propagation and cumulant propagation) is mechanistic, because it deduces an estimate based on the structure of .[9] More generally, deduction-projection estimators -- estimators that successively model each layer of by finding the best-fit model from some parameterized class -- are mechanistic.
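As a rough illustration of what a deduction-projection estimator looks like in code, here is a minimal sketch of mean/covariance propagation through a small random MLP. The architecture, the Gaussian input distribution, and the simplification of dropping post-ReLU correlations are illustrative assumptions, not ARC's actual algorithm:

```python
import numpy as np
from scipy.stats import norm

def propagate_linear(mu, Sigma, W, b):
    """Exact mean and covariance of W @ x + b when x has mean mu and covariance Sigma."""
    return W @ mu + b, W @ Sigma @ W.T

def relu_gaussian_moments(mu, var):
    """Mean and variance of ReLU(z) for z ~ N(mu, var), applied coordinatewise."""
    sigma = np.sqrt(var)
    a = mu / sigma
    mean = mu * norm.cdf(a) + sigma * norm.pdf(a)
    second_moment = (mu**2 + var) * norm.cdf(a) + mu * sigma * norm.pdf(a)
    return mean, second_moment - mean**2

rng = np.random.default_rng(0)          # randomness only constructs the example network
d = 64
mu, Sigma = np.zeros(d), np.eye(d)      # input distribution: N(0, I)
for _ in range(3):                      # three hidden ReLU layers
    W = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d))
    mu, Sigma = propagate_linear(mu, Sigma, W, np.zeros(d))
    mu, var = relu_gaussian_moments(mu, np.diag(Sigma))
    Sigma = np.diag(var)                # simplification: forget correlations after the ReLU
w_out = rng.normal(0.0, 1.0 / np.sqrt(d), d)
print("deduced estimate of the expected output:", float(w_out @ mu))
```

Note that the estimate itself is computed deductively, layer by layer, without sampling any inputs.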
A simple, though not entirely correct, heuristic for whether an estimation algorithm is deductive is whether it avoids any random or pseudorandom sampling. This heuristic should work for the purposes of engaging with this post.
(See much more on mechanistic estimation in our earlier paper, "Formalizing the presumption of independence",[10] as well as in former ARC intern Gabe Wu's senior thesis on deduction-projection estimators.)
There are multiple reasons for this; in a previous blog post, we discussed how mechanistic estimates can help us detect mechanistic anomalies. But for the purposes of this post, the reason is pretty straightforward: in cases where has a lot of structure, we think that can substantially outperform sampling, if given an explanation that explains that structure (as motivated above).
Thus, loosely speaking, our hope is that if we find a that both (a) is mechanistic and (b) performs at least as well as sampling for all , then it will substantially outperform sampling for parameters with a lot of structure, such as trained neural nets.
Sampling is a really powerful tool, because randomly drawn samples are representative (with high probability), and so a sampling-based estimate can't be off by too much (with high probability). In light of this, why do we think that a mechanistic estimation algorithm can compete with sampling?
Suppose that is a boolean circuit. Suppose, further, that a naive heuristic argument (like mean propagation) suggests that 's average output is , but that in fact its average output is roughly (far enough from that this discrepancy could not have happened by chance). A sampling-based algorithm can pick up on this discrepancy given about 10,000 samples; but what can a mechanistic algorithm do?
Well, given that the discrepancy could not have happened by chance, there must be structure that explains the discrepancy. For illustration, let's consider two types of structure.
First, maybe the discrepancy is caused by different gates reusing the same inputs, thereby inducing nontrivial correlations between different parts of the circuit.[11] In that case, should be able to point out this structure, causing to understand the discrepancy (even without running any inputs through ).
Second, maybe only the first 10 input bits matter to the output of the circuit (perhaps ignores the last 90 input bits entirely, or perhaps they end up not affecting the output for complicated structural reasons). And then -- just by chance -- it so happens that outputs 1 on only 49% of the 1024 possible 10-bit inputs. In this case, points out that depends only on the first 10 input bits; it does not point out that outputs 1 on 49% of them, because that's part of the unexplainable randomness of .[12] Instead, must determine this fact by using its allotted time to check the value of on those 1024 inputs.
(What if is large enough that doesn't have the necessary runtime to check on all 1024 inputs? In that case, it should check however many it can and estimate the rest as being 50/50. This will still outperform sampling!)
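Here is a toy version of this second example. The particular circuit (a random truth table on the first 10 bits of a 100-bit input) and the runtime budget are stand-ins chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_relevant = 10
table = rng.random(1 << n_relevant) < 0.49        # hypothetical behavior on the 10 relevant bits

def circuit(x):                                   # x: 100 input bits; only the first 10 matter
    idx = int("".join(str(b) for b in x[:n_relevant]), 2)
    return int(table[idx])

budget = 512                                      # number of circuit evaluations we can afford

# Mechanistic estimator: told that only the first 10 bits matter, it enumerates as many
# of the 1024 relevant bit-patterns as the budget allows and treats the rest as 50/50.
checked = [int(table[i]) for i in range(budget)]
mech = (sum(checked) + 0.5 * ((1 << n_relevant) - budget)) / (1 << n_relevant)

# Sampling estimator: draws `budget` random 100-bit inputs and averages the outputs.
samples = [circuit(rng.integers(0, 2, 100)) for _ in range(budget)]
print(f"mechanistic: {mech:.3f}   sampling: {np.mean(samples):.3f}   true: {table.mean():.3f}")
```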
More generally, the intuition is that knowing the structure of gives the knowledge it needs to do no worse than random sampling. If still does worse than random sampling after reading , that can only be because did not provide a full structural explanation of .[13]
Given the above intuition that understanding structure can outperform sampling, why are we only aiming to match the performance of sampling?
Consider the above example, where the average output of depends on 1024 effectively random computations, and suppose that : enough time for to compute the output of on 512 of the 1024 inputs. In that case, we expect both and sampling to have squared error on the order of : 's expected squared error will be somewhat lower, but not dramatically so.
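As a rough check of this claim, model the 512 unchecked outputs as independent fair coins. The mechanistic estimator knows 512 of the 1024 outputs exactly and guesses 1/2 for the rest, while the sampling estimator averages 512 random draws:

\[
\underbrace{\frac{512 \cdot \tfrac{1}{4}}{1024^2}}_{\text{mechanistic}} = \frac{1}{8192} \approx 1.2 \times 10^{-4},
\qquad
\underbrace{\frac{\tfrac{1}{4}}{512}}_{\text{sampling}} = \frac{1}{2048} \approx 4.9 \times 10^{-4}.
\]

The two expected squared errors differ only by a factor of 4 (a factor of 2 in the error itself): somewhat lower, but not dramatically so.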
In general, we expect that there will often be a range of -values for which the best mechanistic estimate is only slightly better than sampling-based estimation.[14] Thus, for some parameters and tolerance parameters , we only expect to be able to match (or perhaps slightly outperform) sampling, not to strongly outperform sampling.
However, as discussed above, we expect that if our mechanistic estimator matches the performance of sampling in all cases, then it will substantially outperform sampling in structured cases such as trained neural nets, at least for non-tiny values of . We expect that we can leverage this to help with the sort of adversarial training process described in the introduction.
As mentioned earlier, our first attempt at stating the MSP doesn't quite make sense. The idea of the MSP is for to describe the structure of . However, in order to satisfy the MSP statement above, can just write down the value of . Then, can output that value.
To fix this issue, we observe that if understands the structure of , then it ought to be able to answer all sorts of questions about at least as well as sampling -- not just its expected value -- so long as those questions are not adversarially selected. To formalize this idea, we will modify the type signature of to take two inputs (here, stands for "context"), and require that be able to accurately estimate for a random choice of .[15] This change gives us an MSP statement that we are willing to stand behind.
Here is ARC's mainline "matching sampling principle" (MSP):
(Just as before, this statement isn't fully formal, because of the informal "mechanistic" qualifier. But in practice, we have strong enough opinions about what counts as "mechanistic" that this statement is formal enough to guide our research.)
An interesting special case of the MSP is when encodes a universal Turing machine. See the appendix for discussion.
While the above MSP statement is the most theoretically clean one, on its face the statement is not very useful. That's because it says nothing about being able to find the explanation ; what use is it to merely know that an adequate explanation exists, if we can't find it?
This leads us to the following alternative statement, which we've been calling the "train and explain" formulation of the MSP:
In this statement, it is useful to think of as being the learning algorithm used to find (e.g. SGD) and as being the random bits used during training (the random initialization and random choices of training data used at each training step). Then, is the algorithm used to find the mechanistic explanation of : intuitively, it works "in parallel" with , observing the training process and "building up" the explanation in a way that mirrors the way that "builds up" structure in by iteratively modifying to get lower and lower loss.
Note that our mainline MSP statement is the special case of the "train and explain" formulation where and are both computationally unbounded (so that can select the worst parameters and can select the most helpful advice ).
In general, we think that for any computational constraints placed on (e.g. on time or memory), there is a corresponding with the same computational constraints that can find an adequate explanation . If we are correct, then that potentially gives us a strategy for efficiently computing properties of trained neural networks (such as catastrophe probability), while paying a relatively small alignment tax. (If finding takes as much time as finding , that's an alignment tax of 100%: a small price to pay for avoiding catastrophe.)
It turns out that the MSP can be stated without reference to a tolerance parameter, by subsuming the number of samples into the architecture instead. See this appendix for details.
Over the course of 2025, ARC has made progress on the MSP in a few different directions. Concretely:
Our results are not yet ready for publication, but we hope to get them ready in the coming months. In this section, I will briefly summarize these results and discuss the most interesting directions for future work.
The first problem we tackled in our "matching sampling" framework was mechanistically estimating the volume of the intersection of random half-spaces. Although it's a somewhat toy problem, it wasn't trivial to solve, and we learned a lot from solving it.
Find an algorithm that takes as input unit vectors and a tolerance parameter , and mechanistically estimates the probability that a randomly chosen unit vector has a nonnegative dot product with all of , such that:
Note that this is the MSP in the particular case where is the empty string; ; and returns 1 if for all . (The distribution of and each vector in is uniform over the unit sphere, rather than uniform over bit strings.)
Since is empty, there is no advice in this setting. Despite that, we think that our solution to this MSP instance has helped us progress toward solving the MSP in more generality.
(Alternatively, this problem can be viewed as an instance of the "train and explain" version of the MSP, where , is empty, and the training algorithm simply returns random vectors. In this setting, the explaining algorithm does not have time to do any interesting computation, so might as well be empty.)
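For concreteness, here is the naive sampling baseline that the mechanistic algorithm has to compete with (this is the baseline, not our algorithm; the dimension and number of vectors below are arbitrary):

```python
import numpy as np

def sampled_intersection_probability(V, n_samples=200_000, seed=0):
    """Estimate P(x . v_i >= 0 for all i) for x uniform on the unit sphere, by sampling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, V.shape[1]))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # uniform direction on the sphere
    return float(np.mean(np.all(x @ V.T >= 0, axis=1)))

rng = np.random.default_rng(1)
V = rng.normal(size=(5, 20))
V /= np.linalg.norm(V, axis=1, keepdims=True)          # 5 random unit vectors in R^20
print(sampled_intersection_probability(V))             # roughly 2**-5 when the vectors are near-orthogonal
```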
In a sentence, our solution is to build up a polynomial approximation of by considering one vector at a time.
To elaborate on this, for , let be the function that outputs 1 if , and 0 otherwise. Let us define (so ). We will:
To be more precise, "low-degree polynomial" here means degree , where is the largest integer such that (it turns out that that's the precision we need in order to compete with sampling). And "the best low-degree polynomial approximation" means the best approximation in terms of squared error, for drawn from the unit sphere.
In order for to be efficient enough, it needs to be able to compute this polynomial approximation in time . Getting the dependence on down to turns out to be pretty tricky.[18] However, we were able to find a suitably efficient algorithm by working in the Hermite basis of polynomials instead of the standard monomial basis.
We have also generalized this approach to apply to a broader class of problems than just "intersection of randomly-chosen half-spaces." Roughly speaking, we can apply our methods to estimate the expected product of symmetric random functions, for a certain representation-theoretic notion of symmetry. Concrete problems that we have solved with this approach include:
We are aiming to publish the details of our algorithm and this generalization in the coming months.
While we have solved this particular MSP instance, there are some related settings for which we do not have a solution:
In the last couple of months, we have been tackling a more sophisticated MSP instance: random MLPs. We now believe that we have an algorithm and a proof of its correctness and efficiency, though we are still verifying details. We also have an empirical demonstration that our algorithm is competitive with sampling.
Consider the following MLP architecture: the input size is (assume that is very large[19]); there are hidden layers ( is some fixed constant), each of width ; the output is a scalar; and every hidden layer has some activation function (e.g. ReLU).
Find an algorithm that takes as input the weights of an MLP with the above architecture and a tolerance parameter , and mechanistically estimates the expected output of the MLP on inputs drawn from , such that:
Note that this is the MSP in the particular case where is the empty string; ; and returns the output of the MLP with weights on input . (The distribution of each component of and is Gaussian.)
(Similarly to the case of random half spaces, this problem can also be viewed as an instance of the "train and explain" version of the MSP, where , is empty, the training algorithm simply returns random weights, and might as well be empty.)
In a sentence, our solution to this problem is cumulant propagation, a mechanistic estimation algorithm that we introduced in Appendix D of Formalizing the Presumption of Independence.
Cumulants are a type of summary statistic of a probability distribution. Loosely speaking, the cumulant operator takes a list of random variables and tells you something like their "multi-way correlation." For example, is the mean of ; is the variance; is the covariance of and .
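For readers who want something concrete to play with, here is a small sketch of these low-order cumulants, estimated from samples (for orders one through three, the joint cumulant coincides with the central mixed moment; this stops being true at order four):

```python
import numpy as np

def cumulant(*xs):
    """Joint cumulant of up to three random variables, estimated from sample arrays."""
    if len(xs) == 1:
        return float(xs[0].mean())                     # kappa(X) = mean
    centered = [x - x.mean() for x in xs]
    return float(np.prod(centered, axis=0).mean())     # kappa(X,Y) = Cov(X,Y), etc.

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
y = 2.0 * x + rng.normal(size=200_000)
print(cumulant(x), cumulant(x, x), cumulant(x, y), cumulant(x, x, x))
# approximately: 0 (mean), 1 (variance), 2 (covariance), 0 (third cumulant of a Gaussian)
```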
Cumulant propagation is a method that lets us make guesses about the cumulants of layer of a neural net based on a partial list of cumulants of layer . (The more complete the list of cumulants, the more accurate the guesses become.) To a first approximation, then, our algorithm is to:
(This description leaves out many details, but gets across the main idea.)
We would be really interested in finding a way to mechanistically estimate the average output of random recurrent neural networks (RNNs). We believe that this will be much more difficult than the MLP setting, because of the weight sharing. (We think that interesting structure can arise in random RNNs in a way that's far more improbable for random MLPs; this is related to the fact that RNNs are Turing-complete.) We think it's possible that finding an algorithm that solves the "random RNNs" instance of the MSP would constitute major progress toward finding an algorithm that solves the MSP in full generality (see the appendix for more discussion).
A shorter-term project might be to adapt our solution to other architectures. Can we solve the problem in the case of narrow MLPs (as opposed to the infinite-width limit)? What about random CNNs? Random transformers?
In parallel with tackling random MLPs, we have also been investigating two-layer MLPs where the hidden layer is very wide, and where the second layer of weights is trained. This is our first serious foray into trained and/or worst-case instances -- and while we haven't fully solved it, we have made substantial progress.
Consider the following MLP architecture: the input size is ; there is one hidden layer of size , where is very large; the output is a scalar; and there is an activation function at the hidden layer and the output layer.
Problem 1: Find an algorithm that takes as input the weights of an MLP with the above architecture ( contains the first-layer weights; contains the second-layer weights), an explanation , and a tolerance parameter , and mechanistically estimates the expected output of the MLP on inputs drawn from , such that:
Note that this is the MSP in the particular case where ; ; and returns the output of the MLP with weights on input .
Problem 2: Now, suppose that is trained via SGD to make the MLP match some target function. Extend the solution to Problem 1 by finding a linear-time algorithm that takes as input the full transcript of SGD and outputs .
This is the "train and explain" version of the MSP in the particular case where and the training algorithm is SGD with squared loss on an arbitrary target function.
Unlike in the case of random MLPs, we do not expect cumulant propagation to work. That's because, for worst-case , the largest cumulants will not necessarily be the low-order ones; thus, dropping the high-order ones might not produce a good approximation. So what can we do instead?
Consider the function that maps the input to the final pre-activation (i.e. the output, but before the final activation function is applied). If we could find the cumulants of (i.e. the mean, variance, etc. of the final pre-activation on random inputs), then we would be able to find the mean of the output of the MLP. So how can we estimate these cumulants?
The function can be well-approximated by a high-degree, -variable polynomial in the inputs. And as it turns out, there is a neat way to express the cumulants of a multivariate polynomial as an infinite sum in terms of the polynomial's coefficients in the Hermite basis. In particular, for each , the degree- coefficients can be written down in a -dimensional -tensor. (Having run out of Roman and Greek letters, we decided to call this tensor ש.[23]) Then the -th cumulant is the sum of all tensor contractions across tensor networks consisting of copies of ש. This leaves us with two problems:
Our solution to the first problem is to receive the -tensors as advice. It turns out that, so long as (i.e. in the infinite-width limit of the hidden layer), we have enough room in to write down all of the -tensors we need. (And for the "train and explain" version of the problem, we believe that we can learn the -tensors in parallel with SGD.)
We are currently working on the second problem (summing up the tensor networks), and we have made substantial partial progress. If the hidden layer width is truly huge compared to the input size, then there is enough time to approximate the sum by brute force. If the hidden layer is large but not huge, then a more efficient algorithm is necessary. We are working on finding efficient ways to contract arbitrary tensor networks and being able to notice when a tensor network can only contribute negligibly to the sum (so that we can drop it from the sum).[24]
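To illustrate the pruning idea mentioned in the footnote, here is a toy sketch: contract a small tensor network with einsum, but skip the network if some tensor's operator norm is tiny. The choice of matricization and threshold here is illustrative, not the algorithm we are developing:

```python
import numpy as np

def operator_norm(T):
    # Spectral norm of one matricization of T (one possible notion of operator norm).
    return np.linalg.norm(T.reshape(T.shape[0], -1), ord=2)

def contract_or_skip(tensors, einsum_spec, threshold=1e-8):
    # Heuristic cutoff: if any tensor in the network has a tiny operator norm,
    # treat the whole contraction as negligible and skip the work.
    if min(operator_norm(T) for T in tensors) < threshold:
        return 0.0
    return float(np.einsum(einsum_spec, *tensors))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4, 4)), rng.normal(size=(4, 4, 4))
print(contract_or_skip([A, B], "ijk,ijk->"))           # contracted in full
print(contract_or_skip([A, 1e-12 * B], "ijk,ijk->"))   # pruned as negligible: 0.0
```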
Once we have an on-paper solution, we will be interested to see how well the solution works in practice. There are some reasons to believe that it would be slower than sampling (the algorithm is likely to be quite complex), but also some reasons to believe that it would be faster than sampling (on paper, it would match the performance of sampling under worst-case conditions; in typical conditions, it might outperform sampling).
Additionally, even though we are excited about our progress on this question, "two-layer MLPs with a very large hidden layer, where only the second layer is trained" is ultimately a fairly narrow setting. There are many directions in which we could try to generalize our methods: deeper MLPs; hidden layers that are of similar size to the input layer; training both layers; other distributions of input data; and so on.
I consider the MSP to be a significant step forward for ARC. Previously, we were interested in producing mechanistic estimates of mathematical quantities, but had no particular benchmark by which to judge our progress or deem our methods "good enough." Now, we are holding ourselves to a standard that is philosophically justified (we believe that it ought to be possible for mechanistic estimates to compete with sampling), concrete (we can check whether our methods compete with sampling using empirical tests or formal proofs), and tied to a useful application (estimating properties of neural nets, such as catastrophe probability).
Formulating the MSP has allowed us to ask more concrete questions (e.g. "How can we construct a mechanistic algorithm that competes with sampling for estimating the average output of trained two-layer MLPs?"). We have solved some of these questions, made progress on others, and are continuing to make progress.
We plan to continue attacking the MSP from a number of directions:
If you're interested in working with us on any of these directions, you can apply here!
In our various MSP statements, we do not ask for to be able to "verify" that the explanation is "accurate" (i.e. correctly describes the structure of , instead of making false claims). Is that fine, or should we require to be verifiable by ?
At least in the "train and explain" version of the MSP, we do not believe that advice needs to be verifiable. This is for two reasons:
This last point seems a little bit at odds with my earlier assertion that only makes claims about the structure of , not the randomness. A more refined version of this assertion would be: should not assert any randomness in that happened by accident; but if has weird randomness due to some fact about the training process, then should reflect that fact.
What about our mainline MSP statement, where there is no explaining algorithm to track optimization done during training? If "comes out of nowhere," are we comfortable with asserting facts about without a possibility of verification?
In my opinion, it's fine for to be unverifiable, for essentially the same reason. If it's fine for to claim that was selected to be as adversarial as possible via a brute force search (in the "train and explain" version of the MSP), then it seems fine for to claim that was selected to be as adversarial as possible by an omniscient oracle, if that's how was selected.
For example, imagine that we can model as a random oracle -- a completely different random function for each -- and the particular that's chosen happens to be the one whose average output is furthest from 50/50. Then it seems fine for to assert that is a random oracle whose average output just so happens to be many standard deviations away from 50/50.
There might be natural versions of the MSP that require advice to be verifiable. However, such statements would require giving more time to run. Concretely, in the mainline MSP statement, we would ask for to run in time , where the extra factor of mitigates the selection pressure put into choosing . In the case where is a random oracle, this is exactly the amount of compute that needs to compete with sampling, if can convey that is a random oracle, but cannot assert anything about being a particularly unlucky random oracle. I like this version less, in part because we don't expect that paying the extra factor will be feasible in practice.
One interesting case of the MSP is when the architecture is a universal Turing machine . In other words, interprets as the encoding of a Turing machine, and runs on the input -- except that we will say that is forced to halt after one million steps (so that we don't need to worry about runtime). Applying the MSP to this special case gives the following assertion:
In other words, there is a single, universal that is able to mechanistically estimate the average output of any (time-bounded) Turing machine, if it is given advice that explains the Turing machine's structure.
Note also that a solution to this special case would yield a solution to the full MSP: suppose that we had an estimator for universal Turing machines, and consider some other architecture . Then , where is a Turing machine whose size is plus some constant that only depends on . Consider the estimator that, on input , writes down the such that and returns . If some causes to output accurate estimates for , then will also cause to output accurate estimates for . Thus, this estimator solves our mainline MSP for .
The MSP statement can also be used to obtain a claim about mechanistically estimating random Turing machines, but without advice. Concretely, we will let be the empty string, and will instead say that interprets as the encoding of a Turing machine, and runs on the input . (As before, we force to halt after a million steps.) Applying the MSP to this special case gives the following assertion:
This is an interesting and arguably bold statement: it says that as gets more time to run, it is able to get a more and more accurate mechanistic estimate of the average output of the Turing machine . This is intuitive enough for Turing machines with no interesting structure (as is the case for most random Turing machines). However, in order to satisfy the accuracy guarantee above, must converge to the right answer for all Turing machines (even if the convergence is slower for Turing machines with more sophisticated structure). Such a would probably involve a systematic search for structure: loosely speaking, since it isn't given an explanation, it must find the explanation on its own.
Modulo a caveat (see below), it is possible to modify the MSP statement to get rid of the tolerance parameter . Concretely, suppose that the following statement -- which specializes our mainline MSP statement to the case of -- is true:
We claim that our mainline MSP statement almost follows from this -less version. To see this, consider an arbitrary architecture , and fix a positive integer . We will define a modified architecture that has the same space of parameters as . Concretely, works as follows: it takes as input a list of inputs to , runs on all of them, and outputs the average value of for . We claim that if some estimator solves the above MSP for all regardless of the particular value of , then also solves the mainline MSP for .
To see this, suppose that we have an estimator that solves the above MSP for , in the case of . This means that for any , there is an explanation such that :
Note that and . Note also that runs in time . This means that that also solves the mainline MSP for the architecture if .
Now, if there is a single that solves the MSP for regardless of , then will solve the mainline MSP for for all .
The fact that we need a uniform regardless of means that we don't quite have a full reduction; however, the above -less MSP statement is another interesting variant of MSP that is almost the same. We decided to make our mainline MSP statement contain a tolerance parameter in order to make the connection to the idea of matching sampling more intuitive.
In this section, I will outline one possible approach to solving the MSP. For the sake of concreteness, I will consider the case where is a universal Turing machine (which was discussed above). As a reminder, this means that interprets as the encoding of a Turing machine, and then runs on the input .
We will say that a Turing machine is efficiently compressible if there is a significantly shorter Turing machine that, on any input , constructs in time and then runs on input . (We call an efficient compression of .) One possible approach to solving the MSP looks something like this:
The hope for Step 1 is that an estimation approach that works in the average case (over random parameters) will work for all parameters that are not efficiently compressible.
The intuition underlying this hope is that structure implies efficient compression. In other words, if has structure that would make a mechanistic estimator mis-estimate its average output, then understanding that structure would allow us to represent more compactly (and in a way such that can be recovered quickly from the representation).
What about the "train and explain" version of the MSP? In order to adapt this approach to that setting, we also need to be able to learn the efficient compression in parallel with learning itself. If the training process has enough time to find a with special structure, then is there an "explaining process" that would have enough time to find the corresponding compression? That is unclear to me, but I think this direction is promising enough to be worth exploring.
If this approach is viable, then solving the MSP in the average case for some Turing-complete architecture (such as RNNs) would be a major step forward.
We could instead imagine that outputs a probability of catastrophe, but we will keep the range of to for simplicity of exposition.
Running might take much longer than running , which is why we can't just run on every input during deployment.
For example, the parameterization of the distribution needs to be quite flexible, so as to allow distributions that are computationally intractable to sample from. For example, if acts catastrophically when it encounters a factorization of RSA-2048, we want to be able to train that behavior out of even if we can't factorize RSA-2048. (See here for more discussion.)
In brief, we hope to address issue #1 via mechanistic anomaly detection. A little more concretely, our plan is to:
This leaves many details unexplained, but that's the basic concept.
Or rather, independent up to small random variation that is unpredictable just from understanding the structure of and .
In practice, we will be interested in the behavior of neural nets on structured (rather than uniformly random) inputs. However, note that it is possible to create structured inputs out of random inputs via a generative model. For example, if we are interested in the behavior of a classification model on pictures of animals, we could let consist of two parts: first, a generative model that creates an image of an animal from random noise, and second, a classifier that takes the animal image as input.
Why do we require to be short? The basic reason is that, as we discuss later, we will be interested in learning in parallel with learning the parameters , and so we will want to be able to do a backward pass through as quickly as doing a backward pass through . We also have some amount of philosophical justification for believing an explanation the size of is sufficient. Essentially, we think that any object's structure can be described compactly, because if the amount of (non-redundant) structure in an object is much larger than the size of the object itself, that would constitute an "outrageous coincidence".
This works by taking our (Gaussian) model of layer and then modeling layer by finding the normal distribution that would minimize the KL divergence from the pushforward of our model of layer .
Covariance propagation does not require an explanation . However, some modifications of covariance propagation could require advice. For example, if is too large to allow for to compute all of the covariances, then could advise to only keep track of some particular covariances. Or, could tell about some important third-order correlations to keep track of.
Note that we use the word "heuristic" in place of "mechanistic" in that paper. I think that the word "mechanistic" conveys our goal slightly better.
As a very simple example, consider the circuit . Mean propagation mis-estimates this circuit's average output because it fails to notice the correlation induced by the input that is shared between the two conjunctive clauses.
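To spell out the arithmetic with a concrete circuit of this shape (this particular instance is illustrative):

```python
from itertools import product

# (x AND y) OR (x AND z): the input x is shared between the two conjunctive clauses.
true_avg = sum(((x and y) or (x and z)) for x, y, z in product([0, 1], repeat=3)) / 8

# Mean propagation treats the two clauses as independent events of probability 1/4 each.
p_clause = 0.5 * 0.5
mean_prop_estimate = 1 - (1 - p_clause) ** 2
print(true_avg, mean_prop_estimate)   # 0.375 vs 0.4375
```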
Though, see the appendix on advice verifiability for some nuance on this point.
Eliezer Yudkowsky's Worse Than Random makes a similar point:
As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid - so that a randomized algorithm seems wiser - you should be able to use the same knowledge to produce a superior derandomized algorithm.
Roughly speaking, this is the range where is a substantial fraction of the number of times that one needs to run to fully estimate its unstructured randomness.
We cannot require to be accurate for all . For example, suppose that interprets as a Turing machine and runs on the Turing machine . Requiring to be accurate for all would mean expecting to be able to mechanistically estimate the output of a worst-case Turing machine, without any structural advice at all. (After all, cannot depend on .) This is too much to ask for.
Note that this problem is equivalent to estimating the probability that a one-layer ReLU network outputs all zeros on a random input. Concretely, if the network is , then the output is all zeros if and only if for every row of .
Our algorithm's runtime is , which is technically too slow in the case where . However, we are most interested in the regime where .
The naive approach is to treat this as a linear regression problem, where the covariance (inner product) between two polynomials and is defined as the expectation of for drawn from the unit sphere. However, doing this involves multiplying a matrix by a -vector, so the dependence of this algorithm on looks like : not fast enough.
We believe that our proof of correctness and efficiency works in the limit as , where the MLP depth is constant and .
Mean zero; standard deviation is chosen so that all activations have the same variance.
One complication to this picture is that, although we've defined cumulant propagation for sums and products of random variables, it's not clear what it means to apply cumulant propagation to an activation function like ReLU: given the cumulants of , how does one estimate the cumulants of ? Our strategy is to find a polynomial approximation to the ReLU function (see the next paragraph of this footnote for details). Once we've done that, we can apply cumulant propagation as we've already defined it for sums and products.
What is the appropriate notion of polynomial approximation? It turns out that we can take the polynomial that minimizes mean squared error if is assumed to be normally distributed with mean equal to our estimate of and covariance equal to our estimate of . This is equivalent to taking the first several terms of the Hermite expansion of ReLU (appropriately centered and scaled).
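Here is a small sketch of that Hermite expansion for a standard normal input (mean 0 and variance 1, chosen for illustration), using Gauss-Hermite quadrature to compute the coefficients:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def relu_hermite_coeffs(mu, sigma, degree=6, quad_points=80):
    """Coefficients c_k with ReLU(mu + sigma*z) ~= sum_k c_k He_k(z), He_k = probabilists' Hermite."""
    z, w = He.hermegauss(quad_points)          # nodes/weights for the weight exp(-z^2 / 2)
    w = w / np.sqrt(2 * np.pi)                 # renormalize to the standard normal density
    f = np.maximum(mu + sigma * z, 0.0)
    return np.array([np.sum(w * f * He.hermeval(z, [0.0] * k + [1.0])) / math.factorial(k)
                     for k in range(degree + 1)])

c = relu_hermite_coeffs(mu=0.0, sigma=1.0)
zs = np.linspace(-2, 2, 5)
print(np.round(He.hermeval(zs, c), 3))         # low-degree polynomial approximation of ReLU
print(np.maximum(zs, 0.0))                     # exact ReLU values, for comparison
```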
Actually, it is more important to keep track of cumulants in which the same activation appears multiple times, so we need to keep track of some cumulants of order higher than that involve repeated indices.
That's the Hebrew letter shin.
Concretely, a tensor network can only contribute substantially to the sum if every tensor in the network has a large operator norm. Thus, if in the process of contracting the tensor network, we find a tensor that has a small operator norm, we can cut off the computation and move onto the next tensor network.