Jacob_Hilton

Mechanistic estimation for expectations of random products

We have developed some relatively general methods for mechanistic estimation competitive with sampling by studying problems that are expressible as expectations of random products. This includes several different estimation problems, such as random halfspace intersections, random #3-SAT and random permanents. In this post, we will give a high-level introduction to...

May 1550

Mechanistic estimation for wide random MLPs

This post covers joint work with Wilson Wu, George Robinson, Mike Winer, Victor Lecomte and Paul Christiano. Thanks to Geoffrey Irving and Jess Riedel for comments on the post. In ARC's latest paper, we study the following problem: given a randomly initialized multilayer perceptron (MLP), produce an estimate for the...

May 7 · 84

AlgZoo: uninterpreted models with fewer than 1,500 parameters

This post covers work done by several researchers at, visitors to and collaborators of ARC, including Zihao Chen, George Robinson, David Matolcsi, Jacob Stavrianos, Jiawei Li and Michael Sklar. Thanks to Aryan Bhatt, Gabriel Wu, Jiawei Li, Lee Sharkey, Victor Lecomte and Zihao Chen for comments. In the wake of...

Jan 26179

ARC progress update: Competing with sampling

by Eric Neyman, Victor Lecomte, Wilson Wu, Mikewins, Jacob_Hilton, and George Robinson

In 2025, the Alignment Research Center (ARC) has been making conceptual and theoretical progress at the fastest pace that I (Eric) have seen since I first interned in 2022. Most of this progress has come about because of a re-orientation around a more specific goal: outperforming random sampling when it...

Nov 18, 2025 · 131

Jacob_Hilton's Shortform

May 1, 2025 · 6

A bird's eye view of ARC's research

This post includes a "flattened version" of an interactive diagram that cannot be displayed on this site. I recommend reading the original version of the post with the interactive diagram, which can be found here. Over the last few months, ARC has released a number of pieces of research. While...

Oct 23, 2024 · 121

Backdoors as an analogy for deceptive alignment

ARC has released a paper on Backdoor defense, learnability and obfuscation in which we study a formal notion of backdoors in ML models. Part of our motivation for this is an analogy between backdoors and deceptive alignment, the possibility that an AI system would intentionally behave well in training in...

Sep 6, 2024 · 104

Jacob_Hilton

AlgZoo: uninterpreted models with fewer than 1,500 parameters

Mechanistic estimation for wide random MLPs

Formal verification, heuristic explanations and surprise accounting

A bird's eye view of ARC's research

