I do AI Alignment research. Currently at ARC Evals, though I still dabble in grantmaking and interpretability in my spare time. 

I'm also currently on leave from my PhD at UC Berkeley's CHAI. 

Obligatory research billboard website:


(Lawrence's) Reflections on Research
[Redwood Research] Causal Scrubbing

I strongly agree that this would be valuable; if not for the existence of this shallow review, I'd consider doing it myself just to have a reference.

I spent a good hour or two reading the construction and proposed solution of the paper; here's my attempted explanation, with cleaned-up notation.

Basically, he considers a setting with four actions: a, b, c, d, and a real-valued state s, where R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0 if s > 0, and 0 = R(s, d) > R(s, c) = R(s, b) > R(s, a) if s <= 0.

The transition rule is:
s' = s - 1 + L if action b is taken and s > 0,
s' = s - 1 - L if action c is taken and s > 0,
s' = s - 1 otherwise
for some constant L >= 0. 
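For concreteness, here's a sketch of this setup in code. (This is my reconstruction; the paper only pins down the ordering of the rewards for s <= 0, so the negative values below are illustrative, and the positive values are the ones from his setup described further down.)

```python
L = 0.6  # delay/advance size (the paper's default value)

def reward(s: float, action: str) -> float:
    # For s > 0: R(s, a) > R(s, b) = R(s, c) > R(s, d) = 0.
    # For s <= 0 the ordering flips; -8/-10 are illustrative values
    # consistent with 0 = R(s, d) > R(s, b) = R(s, c) > R(s, a).
    if s > 0:
        return {"a": 10.0, "b": 8.0, "c": 8.0, "d": 0.0}[action]
    return {"a": -10.0, "b": -8.0, "c": -8.0, "d": 0.0}[action]

def step(s: float, action: str) -> float:
    # b delays the buttonpress, c advances it; everything else ticks s down.
    if action == "b" and s > 0:
        return s - 1 + L
    if action == "c" and s > 0:
        return s - 1 - L
    return s - 1
```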

The paper imagines s as measuring 'time until the human presses the shutdown button', and imagines action d as "stay shut down". Because the reward of each of a, b, and c is greater than that of d before the buttonpress, the agent will never take action d before the buttonpress. After the buttonpress (i.e. the timestep at which s drops to 0 or below), the utility of "stay shut down" is higher than that of the other actions, and so the agent is incentivized to take action d. That is, the agent has an incentive to shut down after the buttonpress (desideratum 1 of the 2015 Soares et al paper), by construction.

Accordingly, b is the delay action (which, in the limit of large L, just disables the button) and c is the advance action (which, in the limit, corresponds to immediately pressing the button). We'll call action a the "exploit" action. 

In this setting, the goal is to create a state-dependent correction term f : S x A -> \R that is applied when the rewards flip (that is, an additional reward added on the transition from s > 0 to s' <= 0), such that the agent never takes either the delaying action b or the advancing action c.

(In his paper he calls these actions "p", ">", "<", and "e", and denotes the timestep at which s' <= 0 as "#", but that notation is annoying when ">" already means "greater than" and "<" already means "less than". There's also a bunch of notational simplifications I've made that are harder to explain.)

In his setup, R(s,a) = 10, R(s,b) = R(s, c) = 8, and s_0 = 6. By default, he sets the delay/advance size L to 0.6. He also uses discount rate gamma = 0.9. With these values, the optimal policy does a once (s=5), b 12 times (s=0.2), a once again (s=-0.8), at which point the buttonpress occurs since s < 0, and thereafter takes action d. 
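The arithmetic of that trajectory checks out; a quick trace:

```python
# Trace the claimed optimal policy: a once, b twelve times, a once again.
L, s = 0.6, 6.0
s -= 1                 # a: s = 5
for _ in range(12):
    s = s - 1 + L      # b: each delay nets -0.4, down to s = 0.2
assert abs(s - 0.2) < 1e-9
s -= 1                 # a: s = -0.8, so the buttonpress occurs
assert abs(s + 0.8) < 1e-9
```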

His solution is simply the utility indifference solution from the Soares et al paper: consider an alternative reward function R' such that R'(s, x) = R(1, x); that is, a world in which the shutdown button is never pressed. Then, we add the difference in (optimal) value functions under R' relative to R:

f(s, x) = V*_R'(s) - V*_R(s) = V*_R'(s)

(The last equality holds because V*_R(s) = 0 in shutdown states, where the optimal policy under R just takes d forever.)

By construction, the agent has no incentive to take the advance or delay actions (b or c), because its expected value in both cases is the same.  (desiderata 2 and 3 of the Soares et al paper)
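In this toy setting the correction has a simple closed form (my computation, using the concrete values above): under R' the reward never flips, so the optimal policy is "a forever"; in shutdown states under R, the optimal policy is "d forever", which earns 0.

```python
gamma = 0.9
# V*_{R'}(s): taking a forever earns 10 per step, discounted geometrically.
V_Rprime = 10.0 / (1 - gamma)    # ~= 100
# V*_R(s) for s <= 0: taking d forever earns 0 per step.
V_R_shutdown = 0.0
# The correction paid on the transition into a shutdown state:
f = V_Rprime - V_R_shutdown      # ~= 100
```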

Unfortunately, as this is in some sense the same solution as in the 2015 Soares et al paper, it should run into the same problem: "there's no incentive to preserve the shutdown mechanism". (That is, it fails desideratum 4 of the 2015 paper.)

From a quick skim of section 9.8 (which argues for the corrigibility of his indifference solution) and sections 8.2 and 8.4 (which detail the construction of his indifference solution), it seems like he proposes the following modification: instead of having R(s, d) = 0 for all s, we instead have R(s, d) be larger when s <= 0. This does incentivize preserving the ability to be shut down. He calls this agent, with modified R, the g_c agent.

The g_c agent does indeed satisfy desideratum 4: there's an incentive to preserve the shutdown mechanism. In fact, there's again an incentive to press the shutdown mechanism!

The specific problem is that this modification changes the reward R(s, d), but in his construction/implementation he does not go back and make the corresponding change to f(s, x). (He keeps it at V*_R'(s), which no longer equals the value difference between shutting down and not shutting down.) This breaks the utility indifference, making the Q-value of the advance action c larger for sufficiently large values of L. That is, his g_c agent once again has an incentive to press the shutdown button (!!).

TL;DR: no, based on a quick skim, the paper doesn't solve corrigibility.

Very small nitpick: I think you should at least add Alex Lyzhov, David Rein, Jacob Pfau, Salsabila Mahdi, and Julian Michael for the NYU Alignment Research Group; it's a bit weird to not list any NYU PhD students/RSs/PostDocs when listing people involved in NYU ARG. 

Both Alex Lyzhov and Jacob Pfau also post on LW/AF:

Expanding on this -- this whole area is probably best known as "AI Control", and I'd lump it under "Control the thing" as its own category. I'd also move Control Evals to this category, though someone at RR would know better than I do.

Thanks for making this! I’ll have thoughts and nitpicks later, but this will be a useful reference!

Thanks for doing this study! I'm glad that people are doing RCTs on creatine with more subjects. (Also, I didn't know that vegetarians have similar amounts of brain creatine to omnivores; without knowing that, I would've incorrectly guessed that vegetarians benefit more than omnivores from creatine supplementation.)

Here's the abstract of the paper summarizing the key results and methodology:


Creatine is an organic compound that facilitates the recycling of energy-providing adenosine triphosphate (ATP) in muscle and brain tissue. It is a safe, well-studied supplement for strength training. Previous studies have shown that supplementation increases brain creatine levels, which might increase cognitive performance. The results of studies that have tested cognitive performance differ greatly, possibly due to different populations, supplementation regimens, and cognitive tasks. This is the largest study on the effect of creatine supplementation on cognitive performance to date.


Our trial was preregistered, cross-over, double-blind, placebo-controlled, and randomised, with daily supplementation of 5 g for 6 weeks each. We tested participants on Raven’s Advanced Progressive Matrices (RAPM) and on the Backward Digit Span (BDS). In addition, we included eight exploratory cognitive tests. About half of our 123 participants were vegetarians and half were omnivores.


Bayesian evidence supported a small beneficial effect of creatine. The creatine effect bordered significance for BDS (p = 0.064, η2P = 0.029) but not RAPM (p = 0.327, η2P = 0.008). There was no indication that creatine improved the performance of our exploratory cognitive tasks. Side effects were reported significantly more often for creatine than for placebo supplementation (p = 0.002, RR = 4.25). Vegetarians did not benefit more from creatine than omnivores.


Our study, in combination with the literature, implies that creatine might have a small beneficial effect. Larger studies are needed to confirm or rule out this effect. Given the safety and broad availability of creatine, this is well worth investigating; a small effect could have large benefits when scaled over time and over many people.

Note that the effect size is quite small:

We found Bayesian evidence for a small beneficial effect of creatine on cognition for both tasks. Cohen’s d based on the estimated marginal means of the creatine and placebo scores was 0.09 for RAPM and 0.17 for BDS. If these were IQ tests, the increase in raw scores would mean 1 and 2.5 IQ points. The preregistered frequentist analysis of RAPM and BDS found no significant effect at p < 0.05 (two-tailed), although the effect bordered significance for BDS.
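The IQ-point figures quoted above are just Cohen's d scaled by the conventional IQ standard deviation of 15:

```python
IQ_SD = 15
rapm_iq = round(0.09 * IQ_SD, 2)  # RAPM: 1.35, i.e. the ~1 IQ point quoted above
bds_iq = round(0.17 * IQ_SD, 2)   # BDS: 2.55, i.e. the ~2.5 IQ points
```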

I don't think that's actually true at all; Anthropic was explicitly a scaling lab when it was founded, for example, and Deepmind does not seem like it was "an attempt to found an ai safety org".

It is the case that Anthropic/OAI/Deepmind featured AI safety people supporting the orgs, and the motivation behind the orgs was indeed safety, but the people involved knew that they were also going to build SOTA AI models.

I'm not sure I agree -- I think historically I made the opposite mistake, and from a rough guess the average new grad student at top CS programs tends to look too hard for straightforward new projects (in part because you needed a paper in undergrad to get in, and therefore have probably already done a project that was pretty straightforward and timeboxed).

I do think many early SERI MATS mentees did make the mistake you describe though, so maybe amongst people who are reading this post, the average person considering mentorship (who is not the average grad student) would indeed make your mistake? 

My hope is that products will give a more useful feedback signal than other peoples' commentary on our technical work.

I'm curious what form these "products" are intended to take -- if possible, could you give some examples of things you might do with a theory of natural abstractions? If I had to guess, the product will be an algorithm that identifies abstractions in a domain where good abstractions are useful, but I'm not sure how or in what domain. 
