In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I've seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it comes to understanding neural network outputs. Compared to our previous goals, this goal has the advantage of being more concrete and more directly tied to useful applications.
The purpose of this post is to:
Also: we're hiring! If the research direction described in this post excites you, you can apply to ARC!
Consider the following simple scheme that attempts to align an AI model , which maps inputs to outputs :
I think that this is a fine starting point for an alignment plan, but not a complete plan in and of itself. It suffers from at least two issues:
(There are other technical issues as well,[3] but these are the ones that seem hardest to surmount.)
We believe that ARC's technical research agenda is capable of addressing both of these issues. However, issue #1 is mostly out of scope for this post (though I'll very briefly describe our planned approach in this footnote[4]). The purpose of this post is to explain in detail how we hope to address issue #2.
ARC's goal is to be able to estimate far better than one can just by drawing random samples from . We believe that this can be done by understanding the structure of , , and .
Here's a simple example that's meant to illustrate this point. Suppose that, by understanding the internals of , we are able to notice that is the conjunction of three predicates -- in other words, outputs 1 if and only if all of are true of 's output.
And suppose, furthermore, that we understand the structure of and well enough to understand that are independent events for .[5]
Using this structural understanding, we can estimate the probability that outputs 1 as the product of the probabilities of the three predicates. If each is true with probability one-in-a-million (\(10^{-6}\)), then we can estimate that the probability is about \(10^{-18}\). Obtaining this estimate by sampling would have required roughly \(10^{18}\) samples. By contrast, our structural understanding lets us estimate this probability with roughly \(10^{6}\) samples, and we can potentially do even better than that if we have a structural understanding of the predicates themselves. Thus, our structural understanding lets us estimate the expected value far more efficiently than we could with sampling.
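To make this concrete, here is a minimal numerical sketch of the comparison. The numbers are illustrative (each predicate is true with probability \(10^{-3}\) rather than one-in-a-million, so that the script runs quickly), and the code is a toy, not ARC's method:

```python
import numpy as np

# Toy comparison: a conjunction of three independent rare predicates.
# The structural estimate multiplies per-predicate estimates; black-box sampling
# of the conjunction itself would almost never see a positive example.
rng = np.random.default_rng(0)
p_true = 1e-3                 # hypothetical per-predicate probability (scaled down from 1e-6)
n_per_predicate = 200_000     # enough samples to estimate each predicate individually

# Structural estimate: estimate each predicate's probability separately, then multiply.
p_hats = [(rng.random(n_per_predicate) < p_true).mean() for _ in range(3)]
structural_estimate = float(np.prod(p_hats))
print(f"structural estimate: {structural_estimate:.2e}   true value: {p_true**3:.2e}")

# Black-box sampling of the conjunction with the same total budget: the conjunction
# is true only ~1e-9 of the time, so this almost surely returns 0.
conjunction_hits = (rng.random((3 * n_per_predicate, 3)) < p_true).all(axis=1)
print(f"naive sampling estimate with the same budget: {conjunction_hits.mean():.2e}")
```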
This example is simplistic, of course: in practice, we will need to understand structure that is far more sophisticated than "the output is a conjunction of three independent predicates." But the example illustrates the point that having a detailed mechanistic understanding of a neural net lets us estimate properties of its outputs far better than black-box methods alone.
When we speak of "understanding the structure" of , , and , we are not referring to human understanding. While a conjunctive structure like the one above can be understood by a human, we believe that in general, neural nets will be composed of mathematical structures that are far too complex to allow for a full human understanding.
Instead, we are imagining that an explanation of the structure of a neural net is written in some kind of formal language. The explanation could be as large as the neural net itself, and may be as incomprehensible to a human as the neural net. Thus, our goal is not to have a human look at the structure and estimate the expectation of . Instead, the goal is to invent an algorithm that takes as input the explanation and estimates the expectation of based on that explanation.
This contrasts with almost all other research on neural network interpretability, which aims for a partial, human understanding of neural nets. Our research is instead aimed at a full, algorithmic understanding.
In the next section, I will elaborate on what this means.
In this section, I will more formally describe what we hope to accomplish by gaining structural understanding of neural networks. While above I talked about outperforming sampling, in the fully general case we can only hope to match the performance of sampling. In other words, we expect that the performance of our algorithms in the practical setting of trained neural nets will substantially exceed the worst-case bounds that we will be able to state and prove. See below for further discussion of this point.
I will start with a first-pass attempt at stating the matching sampling principle (MSP). As we will discuss, it does not quite make sense; however, it gets across the key intuition.
In order to state the MSP, we will define a few pieces of notation:
With this notation in place, we will make our first attempt to state the MSP:
Let's parse these three requirements:
We do not have a formal definition of "mechanistic." But, loosely speaking, we mean that estimates the expected output of deductively, based on the structure of . This contrasts with sampling-based algorithms for estimating the expected output of , which operate based on inductive reasoning. Mechanistically estimating the expected output involves finding the reason for the expected output being what it is; meanwhile, sampling-based algorithms merely infer the existence of a reason without learning anything about the reason.
To illustrate this difference, suppose that the explanation given to is a simple heuristic argument (such as mean propagation -- see §D.2 here), which suggests that but is otherwise uninformative about the structure of . Suppose further that computes on a hundred inputs , and it finds that on every one of those hundred inputs. Then should return : that's because it knows that on the hundred inputs that it checks, but it has not seen any structural evidence that would suggest that 's behavior on those hundred inputs has any bearing on how behaves on the inputs that it has not checked. By contrast, a sampling-based estimator that checks the same hundred inputs would return , implicitly assuming that those inputs are representative.
(If indeed always returns , then we believe that there exists a short explanation of this fact; but cannot output unless it is given this explanation.)
In some of our previous work, we discussed covariance propagation: successively modeling each layer of as a multivariate normal distribution.[8] Covariance propagation (and related methods, like mean propagation and cumulant propagation) is mechanistic, because it deduces an estimate based on the structure of .[9] More generally, deduction-projection estimators -- estimators that successively model each layer of by finding the best-fit model from some parameterized class -- are mechanistic.
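As a rough illustration of what a deduction-projection estimator looks like in code, here is a minimal sketch of mean/covariance propagation through a small random MLP. The architecture, the Gaussian input distribution, and the simplification of dropping post-ReLU correlations are illustrative assumptions, not ARC's actual algorithm:

```python
import numpy as np
from scipy.stats import norm

def propagate_linear(mu, Sigma, W, b):
    """Exact mean and covariance of W @ x + b when x has mean mu and covariance Sigma."""
    return W @ mu + b, W @ Sigma @ W.T

def relu_gaussian_moments(mu, var):
    """Mean and variance of ReLU(z) for z ~ N(mu, var), applied coordinatewise."""
    sigma = np.sqrt(var)
    a = mu / sigma
    mean = mu * norm.cdf(a) + sigma * norm.pdf(a)
    second_moment = (mu**2 + var) * norm.cdf(a) + mu * sigma * norm.pdf(a)
    return mean, second_moment - mean**2

rng = np.random.default_rng(0)          # randomness only constructs the example network
d = 64
mu, Sigma = np.zeros(d), np.eye(d)      # input distribution: N(0, I)
for _ in range(3):                      # three hidden ReLU layers
    W = rng.normal(0.0, 1.0 / np.sqrt(d), (d, d))
    mu, Sigma = propagate_linear(mu, Sigma, W, np.zeros(d))
    mu, var = relu_gaussian_moments(mu, np.diag(Sigma))
    Sigma = np.diag(var)                # simplification: forget correlations after the ReLU
w_out = rng.normal(0.0, 1.0 / np.sqrt(d), d)
print("deduced estimate of the expected output:", float(w_out @ mu))
```

Note that the estimate itself is computed deductively, layer by layer, without sampling any inputs.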
A simple, though not entirely correct, heuristic for whether an estimation algorithm is deductive is whether it avoids any random or pseudorandom sampling. This heuristic should work for the purposes of engaging with this post.
(See much more on mechanistic estimation in our earlier paper, "Formalizing the presumption of independence",[10] as well as in former ARC intern Gabe Wu's senior thesis on deduction-projection estimators.)
There are multiple reasons for this; in a previous blog post, we discussed how mechanistic estimates can help us detect mechanistic anomalies. But for the purposes of this post, the reason is pretty straightforward: in cases where has a lot of structure, we think that can substantially outperform sampling, if given an explanation that explains that structure (as motivated above).
Thus, loosely speaking, our hope is that if we find a that both (a) is mechanistic and (b) performs at least as well as sampling for all , then it will substantially outperform sampling for parameters with a lot of structure, such as trained neural nets.
Sampling is a really powerful tool, because randomly drawn samples are representative (with high probability), and so a sampling-based estimate can't be off by too much (with high probability). In light of this, why do we think that a mechanistic estimation algorithm can compete with sampling?
Suppose that is a boolean circuit. Suppose, further, that a naive heuristic argument (like mean propagation) suggests that 's average output is , but that in fact its average output is roughly (far enough from that this discrepancy could not have happened by chance). A sampling-based algorithm can pick up on this discrepancy given about 10,000 samples; but what can a mechanistic algorithm do?
Well, given that the discrepancy could not have happened by chance, there must be structure that explains the discrepancy. For illustration, let's consider two types of structure.
First, maybe the discrepancy is caused by different gates reusing the same inputs, thereby inducing nontrivial correlations between different parts of the circuit.[11] In that case, should be able to point out this structure, causing to understand the discrepancy (even without running any inputs through ).
Second, maybe only the first 10 input bits matter to the output of the circuit (perhaps ignores the last 90 input bits entirely, or perhaps they end up not affecting the output for complicated structural reasons). And then -- just by chance -- it so happens that outputs 1 on only 49% of the 1024 possible 10-bit inputs. In this case, points out that depends only on the first 10 input bits; it does not point out that outputs 1 on 49% of them, because that's part of the unexplainable randomness of .[12] Instead, must determine this fact by using its allotted time to check the value of on those 1024 inputs.
(What if is large enough that doesn't have the necessary runtime to check on all 1024 inputs? In that case, it should check however many it can and estimate the rest as being 50/50. This will still outperform sampling!)
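Here is a toy version of this second example. The particular circuit (a random truth table on the first 10 bits of a 100-bit input) and the runtime budget are stand-ins chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_relevant = 10
table = rng.random(1 << n_relevant) < 0.49        # hypothetical behavior on the 10 relevant bits

def circuit(x):                                   # x: 100 input bits; only the first 10 matter
    idx = int("".join(str(b) for b in x[:n_relevant]), 2)
    return int(table[idx])

budget = 512                                      # number of circuit evaluations we can afford

# Mechanistic estimator: told that only the first 10 bits matter, it enumerates as many
# of the 1024 relevant bit-patterns as the budget allows and treats the rest as 50/50.
checked = [int(table[i]) for i in range(budget)]
mech = (sum(checked) + 0.5 * ((1 << n_relevant) - budget)) / (1 << n_relevant)

# Sampling estimator: draws `budget` random 100-bit inputs and averages the outputs.
samples = [circuit(rng.integers(0, 2, 100)) for _ in range(budget)]
print(f"mechanistic: {mech:.3f}   sampling: {np.mean(samples):.3f}   true: {table.mean():.3f}")
```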
More generally, the intuition is that knowing the structure of gives the knowledge it needs to do no worse than random sampling. If still does worse than random sampling after reading , that can only be because did not provide a full structural explanation of .[13]
Given the above intuition that understanding structure can outperform sampling, why are we only aiming to match the performance of sampling?
Consider the above example, where the average output of depends on 1024 effectively random computations, and suppose that : enough time for to compute the output of on 512 of the 1024 inputs. In that case, we expect both and sampling to have squared error on the order of : 's expected squared error will be somewhat lower, but not dramatically so.
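As a rough check of this claim, model the 512 unchecked outputs as independent fair coins. The mechanistic estimator knows 512 of the 1024 outputs exactly and guesses 1/2 for the rest, while the sampling estimator averages 512 random draws:

\[
\underbrace{\frac{512 \cdot \tfrac{1}{4}}{1024^2}}_{\text{mechanistic}} = \frac{1}{8192} \approx 1.2 \times 10^{-4},
\qquad
\underbrace{\frac{\tfrac{1}{4}}{512}}_{\text{sampling}} = \frac{1}{2048} \approx 4.9 \times 10^{-4}.
\]

The two expected squared errors differ only by a factor of 4 (a factor of 2 in the error itself): somewhat lower, but not dramatically so.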
In general, we expect that there will often be a range of -values for which the best mechanistic estimate is only slightly better than sampling-based estimation.[14] Thus, for some parameters and tolerance parameters , we only expect to be able to match (or perhaps slightly outperform) sampling, not to strongly outperform sampling.
However, as discussed above, we expect that if our mechanistic estimator matches the performance of sampling in all cases, then it will substantially outperform sampling in structured cases such as trained neural nets, at least for non-tiny values of . We expect that we can leverage this to help with the sort of adversarial training process described in the introduction.
As mentioned earlier, our first attempt at stating the MSP doesn't quite make sense. The idea of the MSP is for to describe the structure of . However, in order to satisfy the MSP statement above, can just write down the value of . Then, can output that value.
To fix this issue, we observe that if understands the structure of , then it ought to be able to answer all sorts of questions about at least as well as sampling -- not just its expected value -- so long as those questions are not adversarially selected. To formalize this idea, we will modify the type signature of to take two inputs (here, stands for "context"), and require that be able to accurately estimate for a random choice of .[15] This change gives us an MSP statement that we are willing to stand behind.
Here is ARC's mainline "matching sampling principle" (MSP):
(Just as before, this statement isn't fully formal, because of the informal "mechanistic" qualifier. But in practice, we have strong enough opinions about what counts as "mechanistic" that this statement is formal enough to guide our research.)
An interesting special case of the MSP is when encodes a universal Turing machine. See the appendix for discussion.
While the above MSP statement is the most theoretically clean one, on its face the statement is not very useful. That's because it says nothing about being able to find the explanation ; what use is it to merely know that an adequate explanation exists, if we can't find it?
This leads us to the following alternative statement, which we've been calling the "train and explain" formulation of the MSP:
In this statement, it is useful to think of as being the learning algorithm used to find (e.g. SGD) and as being the random bits used during training (the random initialization and random choices of training data used at each training step). Then, is the algorithm used to find the mechanistic explanation of : intuitively, it works "in parallel" with , observing the training process and "building up" the explanation in a way that mirrors the way that "builds up" structure in by iteratively modifying to get lower and lower loss.
Note that our mainline MSP statement is the special case of the "train and explain" formulation where and are both computationally unbounded (so that can select the worst parameters and can select the most helpful advice ).
In general, we think that for any computational constraints placed on (e.g. on time or memory), there is a corresponding with the same computational constraints that can find an adequate explanation . If we are correct, then that potentially gives us a strategy for efficiently computing properties of trained neural networks (such as catastrophe probability), while paying a relatively small alignment tax. (If finding takes as much time as finding , that's an alignment tax of 100%: a small price to pay for avoiding catastrophe.)
It turns out that the MSP can be stated without reference to a tolerance parameter, by subsuming the number of samples into the architecture instead. See this appendix for details.
Over the course of 2025, ARC has made progress on the MSP in a few different directions. Concretely:
Our results are not yet ready for publication, but we hope to get them ready in the coming months. In this section, I will briefly summarize these results and discuss the most interesting directions for future work.
The first problem we tackled in our "matching sampling" framework was mechanistically estimating the volume of the intersection of random half-spaces. Although it's a somewhat toy problem, it wasn't trivial to solve, and we learned a lot from solving it.
Find an algorithm that takes as input unit vectors and a tolerance parameter , and mechanistically estimates the probability that a randomly chosen unit vector has a nonnegative dot product with all of , such that:
Note that this is the MSP in the particular case where is the empty string; ; and returns 1 if for all . (The distribution of and each vector in is uniform over the unit sphere, rather than uniform over bit strings.)
Since is empty, there is no advice in this setting. Despite that, we think that our solution to this MSP instance has helped us progress toward solving the MSP in more generality.
(Alternatively, this problem can be viewed as an instance of the "train and explain" version of the MSP, where , is empty, and the training algorithm simply returns random vectors. In this setting, the explaining algorithm does not have time to do any interesting computation, so might as well be empty.)
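For concreteness, here is the naive sampling baseline that the mechanistic algorithm has to compete with (this is the baseline, not our algorithm; the dimension and number of vectors below are arbitrary):

```python
import numpy as np

def sampled_intersection_probability(V, n_samples=200_000, seed=0):
    """Estimate P(x . v_i >= 0 for all i) for x uniform on the unit sphere, by sampling."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n_samples, V.shape[1]))
    x /= np.linalg.norm(x, axis=1, keepdims=True)      # uniform direction on the sphere
    return float(np.mean(np.all(x @ V.T >= 0, axis=1)))

rng = np.random.default_rng(1)
V = rng.normal(size=(5, 20))
V /= np.linalg.norm(V, axis=1, keepdims=True)          # 5 random unit vectors in R^20
print(sampled_intersection_probability(V))             # roughly 2**-5 when the vectors are near-orthogonal
```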
In a sentence, our solution is to build up a polynomial approximation of by considering one vector at a time.
To elaborate on this, for , let be the function that outputs 1 if , and 0 otherwise. Let us define (so ). We will:
To be more precise, "low-degree polynomial" here means degree , where is the largest integer such that (it turns out that that's the precision we need in order to compete with sampling). And "the best low-degree polynomial approximation" means the best approximation in terms of squared error, for drawn from the unit sphere.
In order for to be efficient enough, it needs to be able to compute this polynomial approximation in time . Getting the dependence on down to turns out to be pretty tricky.[18] However, we were able to find a suitably efficient algorithm by working in the Hermite basis of polynomials instead of the standard monomial basis.
We have also generalized this approach to apply to a broader class of problems than just "intersection of randomly-chosen half-spaces." Roughly speaking, we can apply our methods to estimate the expected product of symmetric random functions, for a certain representation-theoretic notion of symmetry. Concrete problems that we have solved with this approach include:
We are aiming to publish the details of our algorithm and this generalization in the coming months.
While we have solved this particular MSP instance, there are some related settings for which we do not have a solution:
In the last couple of months, we have been tackling a more sophisticated MSP instance: random MLPs. We now believe that we have an algorithm and a proof of its correctness and efficiency, though we are still verifying details. We also have an empirical demonstration that our algorithm is competitive with sampling.
Consider the following MLP architecture: the input size is (assume that is very large[19]); there are hidden layers ( is some fixed constant), each of width ; the output is a scalar; and every hidden layer has some activation function (e.g. ReLU).
Find an algorithm that takes as input the weights of an MLP with the above architecture and a tolerance parameter , and mechanistically estimates the expected output of the MLP on inputs drawn from , such that:
Note that this is the MSP in the particular case where is the empty string; ; and returns the output of the MLP with weights on input . (The distribution of each component of and is Gaussian.)
(Similarly to the case of random half spaces, this problem can also be viewed as an instance of the "train and explain" version of the MSP, where , is empty, the training algorithm simply returns random weights, and might as well be empty.)
In a sentence, our solution to this problem is cumulant propagation, a mechanistic estimation algorithm that we introduced in Appendix D of Formalizing the Presumption of Independence.
Cumulants are a type of summary statistic of a probability distribution. Loosely speaking, the cumulant operator takes a list of random variables and tells you something like their "multi-way correlation." For example, is the mean of ; is the variance; is the covariance of and .
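For readers who want something concrete to play with, here is a small sketch of these low-order cumulants, estimated from samples (for orders one through three, the joint cumulant coincides with the central mixed moment; this stops being true at order four):

```python
import numpy as np

def cumulant(*xs):
    """Joint cumulant of up to three random variables, estimated from sample arrays."""
    if len(xs) == 1:
        return float(xs[0].mean())                     # kappa(X) = mean
    centered = [x - x.mean() for x in xs]
    return float(np.prod(centered, axis=0).mean())     # kappa(X,Y) = Cov(X,Y), etc.

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)
y = 2.0 * x + rng.normal(size=200_000)
print(cumulant(x), cumulant(x, x), cumulant(x, y), cumulant(x, x, x))
# approximately: 0 (mean), 1 (variance), 2 (covariance), 0 (third cumulant of a Gaussian)
```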
Cumulant propagation is a method that lets us make guesses about the cumulants of layer of a neural net based on a partial list of cumulants of layer . (The more complete the list of cumulants, the more accurate the guesses become.) To a first approximation, then, our algorithm is to:
(This description leaves out many details, but gets across the main idea.)
We would be really interested in finding a way to mechanistically estimate the average output of random recurrent neural networks (RNNs). We believe that this will be much more difficult than the MLP setting, because of the weight sharing. (We think that interesting structure can arise in random RNNs in a way that's far more improbable for random MLPs; this is related to the fact that RNNs are Turing-complete.) We think it's possible that finding an algorithm that solves the "random RNNs" instance of the MSP would constitute major progress toward finding an algorithm that solves the MSP in full generality (see the appendix for more discussion).
A shorter-term project might be to adapt our solution to other architectures. Can we solve the problem in the case of narrow MLPs (as opposed to the infinite-width limit)? What about random CNNs? Random transformers?
In parallel with tackling random MLPs, we have also been investigating two-layer MLPs where the hidden layer is very wide, and where the second layer of weights is trained. This is our first serious foray into trained and/or worst-case instances -- and while we haven't fully solved it, we have made substantial progress.
Consider the following MLP architecture: the input size is ; there is one hidden layer of size , where is very large; the output is a scalar; and there is an activation function at the hidden layer and the output layer.
Problem 1: Find an algorithm that takes as input the weights of an MLP with the above architecture ( contains the first-layer weights; contains the second-layer weights), an explanation , and a tolerance parameter , and mechanistically estimates the expected output of the MLP on inputs drawn from , such that:
Note that this is the MSP in the particular case where ; ; and returns the output of the MLP with weights on input .
Problem 2: Now, suppose that is trained via SGD to make the MLP match some target function. Extend the solution to Problem 1 by finding a linear-time algorithm that takes as input the full transcript of SGD and outputs .
This is the "train and explain" version of the MSP in the particular case where and the training algorithm is SGD with squared loss on an arbitrary target function.
Unlike in the case of random MLPs, we do not expect cumulant propagation to work. That's because, for worst-case , the largest cumulants will not necessarily be the low-order ones; thus, dropping the high-order ones might not produce a good approximation. So what can we do instead?
Consider the function that maps the input to the final pre-activation (i.e. the output, but before the final activation function is applied). If we could find the cumulants of (i.e. the mean, variance, etc. of the final pre-activation on random inputs), then we would be able to find the mean of the output of the MLP. So how can we estimate these cumulants?
The function can be well-approximated by a high-degree, -variable polynomial in the inputs. And as it turns out, there is a neat way to express the cumulants of a multivariate polynomial as an infinite sum in terms of the polynomial's coefficients in the Hermite basis. In particular, for each , the degree- coefficients can be written down in a -dimensional -tensor. (Having run out of Roman and Greek letters, we decided to call this tensor ש.[23]) Then the -th cumulant is the sum of all tensor contractions across tensor networks consisting of copies of ש. This leaves us with two problems:
Our solution to the first problem is to receive the -tensors as advice. It turns out that, so long as (i.e. in the infinite-width limit of the hidden layer), we have enough room in to write down all of the -tensors we need. (And for the "train and explain" version of the problem, we believe that we can learn the -tensors in parallel with SGD.)
We are currently working on the second problem (summing up the tensor networks), and we have made substantial partial progress. If the hidden layer width is truly huge compared to the input size, then there is enough time to approximate the sum by brute force. If the hidden layer is large but not huge, then a more efficient algorithm is necessary. We are working on finding efficient ways to contract arbitrary tensor networks and being able to notice when a tensor network can only contribute negligibly to the sum (so that we can drop it from the sum).[24]
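To illustrate the pruning idea mentioned in the footnote, here is a toy sketch: contract a small tensor network with einsum, but skip the network if some tensor's operator norm is tiny. The choice of matricization and threshold here is illustrative, not the algorithm we are developing:

```python
import numpy as np

def operator_norm(T):
    # Spectral norm of one matricization of T (one possible notion of operator norm).
    return np.linalg.norm(T.reshape(T.shape[0], -1), ord=2)

def contract_or_skip(tensors, einsum_spec, threshold=1e-8):
    # Heuristic cutoff: if any tensor in the network has a tiny operator norm,
    # treat the whole contraction as negligible and skip the work.
    if min(operator_norm(T) for T in tensors) < threshold:
        return 0.0
    return float(np.einsum(einsum_spec, *tensors))

rng = np.random.default_rng(0)
A, B = rng.normal(size=(4, 4, 4)), rng.normal(size=(4, 4, 4))
print(contract_or_skip([A, B], "ijk,ijk->"))           # contracted in full
print(contract_or_skip([A, 1e-12 * B], "ijk,ijk->"))   # pruned as negligible: 0.0
```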
Once we have an on-paper solution, we will be interested to see how well the solution works in practice. There are some reasons to believe that it would be slower than sampling (the algorithm is likely to be quite complex), but also some reasons to believe that it would be faster than sampling (on paper, it would match the performance of sampling under worst-case conditions; in typical conditions, it might outperform sampling).
Additionally, even though we are excited about our progress on this question, "two-layer MLPs with a very large hidden layer, where only the second layer is trained" is ultimately a fairly narrow setting. There are many directions in which we could try to generalize our methods: deeper MLPs; hidden layers that are of similar size to the input layer; training both layers; other distributions of input data; and so on.
I consider the MSP to be a significant step forward for ARC. Previously, we were interested in producing mechanistic estimates of mathematical quantities, but had no particular benchmark by which to judge our progress or deem our methods "good enough." Now, we are holding ourselves to a standard that is philosophically justified (we believe that it ought to be possible for mechanistic estimates to compete with sampling), concrete (we can check whether our methods compete with sampling using empirical tests or formal proofs), and tied to a useful application (estimating properties of neural nets, such as catastrophe probability).
Formulating the MSP has allowed us to ask more concrete questions (e.g. "How can we construct a mechanistic algorithm that competes with sampling for estimating the average output of trained two-layer MLPs?"). We have solved some of these questions, made progress on others, and are continuing to make progress.
We plan to continue attacking the MSP from a number of directions:
If you're interested in working with us on any of these directions, you can apply here!
In our various MSP statements, we do not ask for to be able to "verify" that the explanation is "accurate" (i.e. correctly describes the structure of , instead of making false claims). Is that fine, or should we require to be verifiable by ?
At least in the "train and explain" version of the MSP, we do not believe that advice needs to be verifiable. This is for two reasons:
This last point seems a little bit at odds with my earlier assertion that only makes claims about the structure of , not the randomness. A more refined version of this assertion would be: should not assert any randomness in that happened by accident; but if has weird randomness due to some fact about the training process, then should reflect that fact.
What about our mainline MSP statement, where there is no explaining algorithm to track optimization done during training? If "comes out of nowhere," are we comfortable with asserting facts about without a possibility of verification?
In my opinion, it's fine for to be unverifiable, for essentially the same reason. If it's fine for to claim that was selected to be as adversarial as possible via a brute force search (in the "train and explain" version of the MSP), then it seems fine for to claim that was selected to be as adversarial as possible by an omniscient oracle, if that's how was selected.
For example, imagine that we can model as a random oracle -- a completely different random function for each -- and the particular that's chosen happens to be the one whose average output is furthest from 50/50. Then it seems fine for to assert that is a random oracle whose average output just so happens to be many standard deviations away from 50/50.
There might be natural versions of the MSP that require advice to be verifiable. However, such statements would require giving more time to run. Concretely, in the mainline MSP statement, we would ask for to run in time , where the extra factor of mitigates the selection pressure put into choosing . In the case where is a random oracle, this is exactly the amount of compute that needs to compete with sampling, if can convey that is a random oracle, but cannot assert anything about being a particularly unlucky random oracle. I like this version less, in part because we don't expect that paying the extra factor will be feasible in practice.
One interesting case of the MSP is when the architecture is a universal Turing machine . In other words, interprets as the encoding of a Turing machine, and runs on the input -- except that we will say that is forced to halt after one million steps (so that we don't need to worry about runtime). Applying the MSP to this special case gives the following assertion:
In other words, there is a single, universal that is able to mechanistically estimate the average output of any (time-bounded) Turing machine, if it is given advice that explains the Turing machine's structure.
Note also that a solution to this special case would yield a solution to the full MSP: suppose that we had an estimator for universal Turing machines, and consider some other architecture . Then , where is a Turing machine whose size is plus some constant that only depends on . Consider the estimator that, on input , writes down the such that and returns . If some causes to output accurate estimates for , then will also cause to output accurate estimates for . Thus, this estimator solves our mainline MSP for .
The MSP statement can also be used to obtain a claim about mechanistically estimating random Turing machines, but without advice. Concretely, we will let be the empty string, and will instead say that interprets as the encoding of a Turing machine, and runs on the input . (As before, we force to halt after a million steps.) Applying the MSP to this special case gives the following assertion:
This is an interesting and arguably bold statement: it says that as gets more time to run, it is able to get a more and more accurate mechanistic estimate of the average output of the Turing machine . This is intuitive enough for Turing machines with no interesting structure (as is the case for most random Turing machines). However, in order to satisfy the accuracy guarantee above, must converge to the right answer for all Turing machines (even if the convergence is slower for Turing machines with more sophisticated structure). Such a would probably involve a systematic search for structure: loosely speaking, since it isn't given an explanation, it must find the explanation on its own.
Modulo a caveat (see below), it is possible to modify the MSP statement to get rid of the tolerance parameter . Concretely, suppose that the following statement -- which specializes our mainline MSP statement to the case of -- is true:
We claim that our mainline MSP statement almost follows from this -less version. To see this, consider an arbitrary architecture , and fix a positive integer . We will define a modified architecture that has the same space of parameters as . Concretely, works as follows: it takes as input a list of inputs to , runs on all of them, and outputs the average value of for . We claim that if some estimator solves the above MSP for all regardless of the particular value of , then also solves the mainline MSP for .
To see this, suppose that we have an estimator that solves the above MSP for , in the case of . This means that for any , there is an explanation such that :
Note that and . Note also that runs in time . This means that that also solves the mainline MSP for the architecture if .
Now, if there is a single that solves the MSP for regardless of , then will solve the mainline MSP for for all .
The fact that we need a uniform regardless of means that we don't quite have a full reduction; however, the above -less MSP statement is another interesting variant of MSP that is almost the same. We decided to make our mainline MSP statement contain a tolerance parameter in order to make the connection to the idea of matching sampling more intuitive.
In this section, I will outline one possible approach to solving the MSP. For the sake of concreteness, I will consider the case where is a universal Turing machine (which was discussed above). As a reminder, this means that interprets as the encoding of a Turing machine, and then runs on the input .
We will say that a Turing machine is efficiently compressible if there is a significantly shorter Turing machine that, on any input , constructs in time and then runs on input . (We call an efficient compression of .) One possible approach to solving the MSP looks something like this:
The hope for Step 1 is that an estimation approach that works in the average case (over random parameters) will work for all parameters that are not efficiently compressible.
The intuition underlying this hope is that structure implies efficient compression. In other words, if has structure that would make a mechanistic estimator mis-estimate its average output, then understanding that structure would allow us to represent more compactly (and in a way such that can be recovered quickly from the representation).
What about the "train and explain" version of the MSP? In order to adapt this approach to that setting, we also need to be able to learn the efficient compression in parallel with learning itself. If the training process has enough time to find a with special structure, then is there an "explaining process" that would have enough time to find the corresponding compression? That is unclear to me, but I think this direction is promising enough to be worth exploring.
If this approach is viable, then solving the MSP in the average case for some Turing-complete architecture (such as RNNs) would be a major step forward.
We could instead imagine that outputs a probability of catastrophe, but we will keep the range of to for simplicity of exposition.
Running might take much longer than running , which is why we can't just run on every input during deployment.
For example, the parameterization of the distribution needs to be quite flexible, so as to allow distributions that are computationally intractable to sample from. For example, if acts catastrophically when it encounters a factorization of RSA-2048, we want to be able to train that behavior out of even if we can't factorize RSA-2048. (See here for more discussion.)
In brief, we hope to address issue #1 via mechanistic anomaly detection. A little more concretely, our plan is to:
This leaves many details unexplained, but that's the basic concept.
Or rather, independent up to small random variation that is unpredictable just from understanding the structure of and .
In practice, we will be interested in the behavior of neural nets on structured (rather than uniformly random) inputs. However, note that it is possible to create structured inputs out of random inputs via a generative model. For example, if we are interested in the behavior of a classification model on pictures of animals, we could let consist of two parts: first, a generative model that creates an image of an animal from random noise, and second, a classifier that takes the animal image as input.
Why do we require to be short? The basic reason is that, as we discuss later, we will be interested in learning in parallel with learning the parameters , and so we will want to be able to do a backward pass through as quickly as doing a backward pass through . We also have some amount of philosophical justification for believing an explanation the size of is sufficient. Essentially, we think that any object's structure can be described compactly, because if the amount of (non-redundant) structure in an object is much larger than the size of the object itself, that would constitute an "outrageous coincidence".
This works by taking our (Gaussian) model of layer and then modeling layer by finding the normal distribution that would minimize the KL divergence from the pushforward of our model of layer .
Covariance propagation does not require an explanation . However, some modifications of covariance propagation could require advice. For example, if is too large to allow for to compute all of the covariances, then could advise to only keep track of some particular covariances. Or, could tell about some important third-order correlations to keep track of.
Note that we use the word "heuristic" in place of "mechanistic" in that paper. I think that the word "mechanistic" conveys our goal slightly better.
As a very simple example, consider the circuit . Mean propagation mis-estimates this circuit's average output because it fails to notice the correlation induced by the input that is shared between the two conjunctive clauses.
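To spell out the arithmetic with a concrete circuit of this shape (this particular instance is illustrative):

```python
from itertools import product

# (x AND y) OR (x AND z): the input x is shared between the two conjunctive clauses.
true_avg = sum(((x and y) or (x and z)) for x, y, z in product([0, 1], repeat=3)) / 8

# Mean propagation treats the two clauses as independent events of probability 1/4 each.
p_clause = 0.5 * 0.5
mean_prop_estimate = 1 - (1 - p_clause) ** 2
print(true_avg, mean_prop_estimate)   # 0.375 vs 0.4375
```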
Though, see the appendix on advice verifiability for some nuance on this point.
Eliezer Yudkowsky's Worse Than Random makes a similar point:
As a general principle, on any problem for which you know that a particular unrandomized algorithm is unusually stupid - so that a randomized algorithm seems wiser - you should be able to use the same knowledge to produce a superior derandomized algorithm.
Roughly speaking, this is the range where is a substantial fraction of the number of times that one needs to run to fully estimate its unstructured randomness.
We cannot require to be accurate for all . For example, suppose that interprets as a Turing machine and runs on the Turing machine . Requiring to be accurate for all would mean expecting to be able to mechanistically estimate the output of a worst-case Turing machine, without any structural advice at all. (After all, cannot depend on .) This is too much to ask for.
Note that this problem is equivalent to estimating the probability that a one-layer ReLU network outputs all zeros on a random input. Concretely, if the network is , then the output is all zeros if and only if for every row of .
Our algorithm's runtime is , which is technically too slow in the case where . However, we are most interested in the regime where .
The naive approach is to treat this as a linear regression problem, where the covariance (inner product) between two polynomials and is defined as the expectation of for drawn from the unit sphere. However, doing this involves multiplying a matrix by a -vector, so the dependence of this algorithm on looks like : not fast enough.
We believe that our proof of correctness and efficiency works in the limit as , where the MLP depth is constant and .
Mean zero; standard deviation is chosen so that all activations have the same variance.
One complication to this picture is that, although we've defined cumulant propagation for sums and products of random variables, it's not clear what it means to apply cumulant propagation to an activation function like ReLU: given the cumulants of , how does one estimate the cumulants of ? Our strategy is to find a polynomial approximation to the ReLU function (see the next paragraph of this footnote for details). Once we've done that, we can apply cumulant propagation as we've already defined it for sums and products.
What is the appropriate notion of polynomial approximation? It turns out that we can take the polynomial that minimizes mean squared error if is assumed to be normally distributed with mean equal to our estimate of and covariance equal to our estimate of . This is equivalent to taking the first several terms of the Hermite expansion of ReLU (appropriately centered and scaled).
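Here is a small sketch of that Hermite expansion for a standard normal input (mean 0 and variance 1, chosen for illustration), using Gauss-Hermite quadrature to compute the coefficients:

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def relu_hermite_coeffs(mu, sigma, degree=6, quad_points=80):
    """Coefficients c_k with ReLU(mu + sigma*z) ~= sum_k c_k He_k(z), He_k = probabilists' Hermite."""
    z, w = He.hermegauss(quad_points)          # nodes/weights for the weight exp(-z^2 / 2)
    w = w / np.sqrt(2 * np.pi)                 # renormalize to the standard normal density
    f = np.maximum(mu + sigma * z, 0.0)
    return np.array([np.sum(w * f * He.hermeval(z, [0.0] * k + [1.0])) / math.factorial(k)
                     for k in range(degree + 1)])

c = relu_hermite_coeffs(mu=0.0, sigma=1.0)
zs = np.linspace(-2, 2, 5)
print(np.round(He.hermeval(zs, c), 3))         # low-degree polynomial approximation of ReLU
print(np.maximum(zs, 0.0))                     # exact ReLU values, for comparison
```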
Actually, it is more important to keep track of cumulants in which the same activation appears multiple times, so we need to keep track of some cumulants of order higher than that involve repeated indices.
That's the Hebrew letter shin.
Concretely, a tensor network can only contribute substantially to the sum if every tensor in the network has a large operator norm. Thus, if in the process of contracting the tensor network, we find a tensor that has a small operator norm, we can cut off the computation and move onto the next tensor network.