Announcing the ARC White-Box Estimation Challenge

Jacob_Hilton; paulfchristiano; Wilson Wu

ARC has teamed up with AIcrowd to launch the ARC White-Box Estimation Challenge, a contest to improve upon our estimation algorithms for random MLPs. The warm-up round begins this week, and later rounds will have a total prize pool of at least $100,000.

We are very grateful to Sharada Mohanty, Sneha Nanavati, Dipam Chakraborty and everyone else at AIcrowd for working with us to host this contest, as well as to Paul Rosu for testing the contest and to Harshita Khera for operational support.

Introduction to the Challenge

Our challenge follows the same setup as our recent paper on wide random MLPs: we consider MLPs with weights , defined by

where the activation function is , applied coordinatewise.

To begin with, we are fixing the width and the number of hidden layers , but we expect to change this setup in future rounds.^[1]

Contestants must design an algorithm that takes in a set of weights and produces an estimate for the expected output

Algorithms will be evaluated on MLPs with randomly-sampled Gaussian weights. The goal is to achieve as low mean squared error as possible, subject to certain computational constraints.
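As a point of comparison, the black-box baseline for this task can be sketched in a few lines: sample inputs, run the network, and average the outputs. The sketch below is illustrative only. It assumes a ReLU activation, standard Gaussian inputs, square weight matrices and He-style weight scaling, and the helper names (`sample_gaussian_weights`, `sampling_estimate`, `mean_squared_error`) are hypothetical; the challenge website gives the actual setup.

```python
import numpy as np


def relu(x):
    # Assumed activation, applied coordinatewise.
    return np.maximum(x, 0.0)


def sample_gaussian_weights(width, n_layers, rng):
    # Assumption: i.i.d. N(0, 2 / width) entries (He-style scaling).
    # The contest may use a different scaling.
    scale = np.sqrt(2.0 / width)
    return [rng.normal(0.0, scale, size=(width, width)) for _ in range(n_layers)]


def forward(weights, x):
    # x has shape (batch, width); no final linear layer is applied.
    h = x
    for w in weights:
        h = relu(h @ w.T)
    return h


def sampling_estimate(weights, n_samples, rng):
    # Black-box Monte Carlo estimate of the expected output,
    # assuming standard Gaussian inputs.
    width = weights[0].shape[1]
    x = rng.standard_normal((n_samples, width))
    return forward(weights, x).mean(axis=0)


def mean_squared_error(estimate, reference):
    # Mean over output coordinates of the squared difference.
    diff = np.asarray(estimate, dtype=float) - np.asarray(reference, dtype=float)
    return float(np.mean(diff ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    weights = sample_gaussian_weights(width=64, n_layers=3, rng=rng)
    cheap = sampling_estimate(weights, 1_000, rng)
    # A much larger sample stands in for the true expectation here.
    reference = sampling_estimate(weights, 50_000, rng)
    print(mean_squared_error(cheap, reference))
```

A white-box method would replace `sampling_estimate` with something that reads the weights directly rather than drawing inputs; the error of the sampling baseline shrinks only in proportion to one over the number of samples, which is what such a method would need to beat at equal compute.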

We have devised a FLOP-counting scheme with AIcrowd to minimize any advantage from using heavily optimized numerical kernels, allowing participants to focus on higher-level algorithm design instead. This scheme may still have a few rough edges remaining, but we hope to round these out over the course of the warm-up round.

For further details, please see the challenge website.

Why run this contest?

In the long run, we would like to answer questions about highly intelligent AI systems such as, "Are there unusual situations in which the system would undermine human control?". Running the system on a huge number of different inputs may not be a reliable way to answer such questions, since a highly intelligent system may not fall for our "honey pots". This is why we are interested in white-box approaches that leverage our access to the model's internals. Ultimately, of course, we should use whichever methods perform the best.

Unfortunately, designing highly performant white-box estimation methods for trained networks is challenging, even for tiny models. ARC's bet is that we can build up to this challenge by first producing performant white-box estimation methods for randomly-initialized networks, and then figuring out how those methods can be adapted with each step of training. However, even this first step remains incomplete. In our recent paper, we produced white-box methods that outperform black-box methods for MLPs with large width, but they break down as the depth grows, and we are very confident that our methods can be significantly improved.

By running this contest, we hope to spur others to discover such improvements. Even though "white-box" is in the name of the challenge, contestants are permitted to use any methods they choose, whether white-box or black-box: as stated above, we ultimately want the best-performing algorithm. However, we strongly expect the best possible algorithms for this problem to be "mechanistic" (i.e., to avoid black-box sampling entirely), mirroring our existing results in the large width setting.

Use of LLMs

We encourage contestants to use LLMs to whatever extent helps them improve their submissions the most. In later rounds, there will be two kinds of prize: one for the best-performing submission, and one for our favorite algorithmic contribution described in a technical report. Especially for the latter kind of prize, contestants may benefit from having a good understanding of any LLM-written code themselves, but the rules of the contest do not require this.

In fact, exploring LLM usage is another motivation for holding the contest. Thanks to how our research has developed, it now looks possible to make progress on some of our core problems by hill-climbing on well-defined metrics, which is exciting to us.^[2] At the same time, the ability of LLMs to make considerable progress on such problems is improving rapidly, and we want to position ourselves to take full advantage of this. We are not sure whether we will be able to draw generalizable insights from strong submissions that are primarily LLM-written, but we think putting LLMs to work on the problem is a worthwhile experiment nonetheless.

As a word of caution, our FLOP-counting utility can definitely be gamed in ways that would unambiguously count as hacking once pointed out, such as by modifying constants or counts held in memory. Contestants are responsible for ensuring that their submissions do not hack our FLOP-counting utility, regardless of whether or how they choose to use LLMs.

To all contestants: good luck!

^[1] The contest setup is actually slightly different in that it omits the final linear layer, but this makes essentially no difference.
^[2] We have previously offered prizes for solutions to problems, but they have either been more pedagogical or less central to our agenda.

It looks to me like the pre-grading "smoke test" has a FLOP cap that is well below 6.8e10, perhaps at 1e10? For me, submissions at 0.1x the cap are getting through but larger FLOP counts are failing.

(Update: it's now fixed, thanks!)

FYI I found the grading criteria confusing:

In problem-setup.md you say:

Scoring is budget-adjusted final-layer MSE: adjusted_final_layer_score = final_layer_mse × max(0.1, effective_compute / flop_budget). If the budget is exceeded (analytical FLOPs or effective_compute = F_m + λ·R_m), predictions are zeroed and the multiplier is forced to 1.0 (no compute discount).

But in scoring-model.md you say

Your score is pure MSE — the closer to zero, the better.
...
primary_score is what the leaderboard ranks on.

These are quite different - e.g. the former means I should care about generally reducing FLOP count while the latter implies I should only care about staying below the 6.8e10 FLOP limit

Your score is the budget-adjusted MSE. You can see how it looks on the leaderboard, where the rank is by "adjusted score" which is 0.1-1x the "final layer MSE."

scoring-model.md is wrong or at least confusing, I'll pass on the feedback. (I think maybe that's because that document is just describing the "scoring model," which is the input into the leaderboard score, but regardless it's a misleading way to describe what's going on.)
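For concreteness, the formula quoted from problem-setup.md can be written out as a small function. This is only a sketch of one reading of that quoted text, not the official scorer: the function signature and the `zero_prediction_mse` argument (the MSE the submission would get if every prediction were zero) are hypothetical names introduced for illustration.

```python
def adjusted_final_layer_score(
    final_layer_mse,
    effective_compute,
    flop_budget,
    zero_prediction_mse,
    analytical_flops=None,
):
    """Budget-adjusted MSE, following the rule quoted from problem-setup.md.

    Assumption: when the budget is exceeded, the score is the MSE of
    all-zero predictions with a multiplier of 1.0 (no compute discount).
    """
    over_budget = effective_compute > flop_budget
    if analytical_flops is not None:
        over_budget = over_budget or analytical_flops > flop_budget
    if over_budget:
        return zero_prediction_mse * 1.0
    # The discount is capped: using under 10% of the budget earns no extra credit.
    multiplier = max(0.1, effective_compute / flop_budget)
    return final_layer_mse * multiplier


if __name__ == "__main__":
    budget = 6.8e10
    print(adjusted_final_layer_score(1.0, 3.4e10, budget, zero_prediction_mse=5.0))
    print(adjusted_final_layer_score(1.0, 6.8e8, budget, zero_prediction_mse=5.0))
    print(adjusted_final_layer_score(1.0, 7.0e10, budget, zero_prediction_mse=5.0))
```

Under this reading, halving compute halves the score until the submission reaches 10% of the budget, after which only a lower MSE helps.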

Regarding the leaderboard, I'm also not quite clear on what the effective difference between Warm-up and Phase 1 is at this point. June 18th is slated as "leaderboard opens", but unless it's materially different from the existing leaderboard I'm not sure what Phase 1 actually signifies.

I think the only difference between the phases is how locked in the rules are. During the warm-up round we may make big changes while we uncover bugs. In phase 1 we may make smaller changes. In phase 2 we won't make changes so that people know how they are being graded.

(AIcrowd has run a lot of competitions and ARC has run none, so we just handed it off to them and are letting them make these kinds of decisions.)

This is fun! I submitted an entry to try it out, but I would be much more motivated to try to solve this problem with trained MLPs, where different techniques should be applicable. Have you considered adding that as a secondary leaderboard?

ARC's bet is that we can build up to this challenge by first producing performant white-box estimation methods for randomly-initialized networks, and then figuring out how those methods can be adapted with each step of training.

What knowledge is this bet based on?

I definitely agree that trained models are the real goal.

One view we hold is that all the complexities and messiness of randomly initialized models are still present in trained models---just combined with a lot of richer structure, different sets of features that are relevant, and so on. Moreover, we think that current algorithms for randomly initialized networks are not yet good enough that you should expect it to be possible to analyze trained models in a convincing way, except perhaps in the extremely overtrained regime where the model converges to a highly simplified algorithm.

We think there probably do exist adequate algorithms for analyzing random networks, and the reason we haven't found them is that so little research has addressed the question head on. And if that's not possible, we want to learn that sooner and give up on the project.

In my opinion you can see this issue a lot in mechanistic interpretability research. Often at some point an investigation into a particular behavior needs to throw up its hands and say "and then a bunch of messy and complicated stuff happens" and you are left wondering whether you understand the phenomenon or whether the important action is in all the messy details. (Except in the super overtrained setting.)

I think that interpretability research would be more promising if we had a really solid understanding of "random messy stuff," such that you could say "yes we understand the randomly initialized model" and then focus on whether you are able to keep handling new structure as it gets introduced by training. Rather than starting off confused and hoping that you can catch some glimpses of what training is doing despite the background confusion.

(Of course in this comment I'm using a much more limited notion of "understanding" than interpretability researchers typically have in mind---I just want algorithms that are able to leverage a mechanistic analysis of the weights to answer concrete questions about the model. We hope that's a productive substitution that can drive much faster progress and help us find scalable algorithms, and we think that this more limited sense would still be sufficient to solve alignment.)

Thanks! Trained MLPs are somewhat trickier to design a contest around, since once you have a well-defined distribution over trained models, people can spend a lot of compute offline empirically fitting predictors for the expected output as a function of the model parameters, and it would be much harder to outperform these with mechanistic approaches. But it's good to know this may be more motivating, and we may consider it.

What knowledge is this bet based on?

The existence of such an algorithm is a special case of a broader conjecture that we have been calling the "matching sampling principle" (specifically the train-and-explain version, as described here). Our evidence for this conjecture is mainly based on examples where it appears to hold, a lack of compelling counterexamples, and more abstract philosophical reasoning. The specific research bet is a judgment call based on the plausibility of the overall approach, tractability, value of information, etc. (Sorry to be vague, it would take a lot of work to give a very detailed answer to your question, and we are hoping to say more about all of this in the not-too-distant future.)

people can spend a lot of compute offline empirically fitting predictors for the expected output as a function of the model parameters

In fact, I implemented a transformer that extracts strings memorized by a given ReLU MLP (strings that couldn't be found just by looking at the weights) faster than sampling but, for now, slower than GCG. The goal was to see whether it is computationally hard to extract them when they're naturally learned with SGD, rather than to develop a mechanistic method.

So I agree that such a competition could probably be gamed this way, unless the problem is to estimate the output with high numerical precision (where transformers would fail), as opposed to "the proportion of rare problematic inputs". Also, solving that competition with a mechanistic estimator would imply solving this one too.

Coming here to +1 Guillame's interest in trained MLPs (which were the directional target of my independent research prior to the pit stop for untrained). Looking forward to the update on your research bet as it sounds aligned with the theory work that grounded my submission efforts.
