james.lucassen

So let's call "reasoning models" like o1 what they really are: the first true AI agents.

I think the distinction between systems that perform a single forward pass and then stop, and systems that run an OODA loop (tool use), is starker than the difference between "reasoning" and "chat" models, and I'd prefer to use "agent" for that distinction.
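
For concreteness, here's a minimal sketch of the distinction I mean - a single-call "chat" usage versus a tool-use loop - with hypothetical stub functions standing in for the model and tools (this isn't any real API):

```python
# Minimal sketch only: call_model and run_tool are hypothetical stand-ins,
# not any real model or tool API.

def call_model(prompt: str) -> dict:
    """Pretend single sampled completion from a model."""
    return {"action": "finish", "content": f"answer to: {prompt}"}

def run_tool(name: str, args: str) -> str:
    """Pretend tool execution (search, code interpreter, etc.)."""
    return f"result of {name}({args})"

def chat_once(prompt: str) -> str:
    # "Chat"-style use: one model call, then stop.
    return call_model(prompt)["content"]

def agent_loop(prompt: str, max_steps: int = 10) -> str:
    # "Agent"-style use: an observe-orient-decide-act loop that keeps calling
    # tools and feeding results back in until the model decides to finish.
    context = prompt
    for _ in range(max_steps):
        step = call_model(context)
        if step["action"] == "finish":
            return step["content"]
        context += "\n" + run_tool(step["action"], step.get("args", ""))
    return context  # ran out of steps
```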

I do think that "reasoning" is a bit of a market-y name for this category of system though. "chat" vs "base" is a great choice of words, and "chat" is basically just a description of the RL objective those models were trained with.

If I were the terminology czar, I'd call o1 a "task" model.

I agree that I wouldn't want to lean on the sweet-spot-by-default version of this, and I agree that the example is less strong than I thought it was. I still think there might be safety gains to be had from blocking higher-level reflection if you can do it without damaging lower-level reflection. I don't think that requires a task where the AI doesn't try and fail and re-evaluate - it just requires that the re-evaluation never climbs above a certain level in the stack.

There's such a thing as being pathologically persistent, and such a thing as being pathologically flaky. It doesn't seem too hard to train a model that will be pathologically persistent in some domains while remaining functional in others. A lot of my current uncertainty is bound up in how robust these boundaries are going to have to be.

I want to flag this as an assumption that isn't obvious. If this were true for the problems we care about, we could solve them by employing a lot of humans.

humans provides a pretty strong intuitive counterexample

Yup, not obvious. I do in fact think a lot more humans would be helpful. But I also agree that my mental picture of a "transformative human-level research assistant" relies heavily on serial speedup, and I can't immediately picture a version that feels similarly transformative without speedup. Maybe evhub or Ethan Perez or one of the folks running a thousand research threads at once would disagree.

such plans are fairly easy and don't often raise flags that indicate potential failure

Hmm. This is a good point, and I agree that it significantly weakens the analogy.

I was originally going to counter-argue and claim something like "sure, total failure forces you to step back far, but it doesn't mean you have to step back literally all the way". Then I tried to back that up with an example, such as "when I was doing alignment research, I encountered total failure that forced me to abandon large chunks of my planning stack, but this never caused me to 'spill upward' to questioning whether or not I should be doing alignment research at all". But uh, then I realized that isn't actually true :/

We want particularly difficult work out of an AI.

On consideration, yup this obviously matters. The thing that causes you to step back from a goal is that goal being a bad way to accomplish its supergoal, aka "too difficult". Can't believe I missed this, thanks for pointing it out.

I don't think this changes the picture too much, besides increasing my estimate of how much optimization we'll have to do to catch and prevent value-reflection. But a lot of muddy half-ideas came out of this that I'm interested in chewing on.

Maybe I'm just reading my own frames into your words, but this feels quite similar to the rough model of human-level LLMs I've had in the back of my mind for a while now.

You think that an intelligence that doesn't-reflect-very-much is reasonably simple. Given this, we can train chain-of-thought type algorithms to avoid reflection using examples of not-reflecting-even-when-obvious-and-useful. With some effort on this, reflection could be crushed at some small-ish capability penalty but with massive benefits for safety.

In particular, this reads to me like the "unstable alignment" paradigm I wrote about a while ago.

You have an agent which is consequentialist enough to be useful, but not so consequentialist that it'll do things like spontaneously notice conflicts in the set of corrigible behaviors you've asked it to adhere to and undertake drastic value reflection to resolve those conflicts. You might hope to hit this sweet spot by default, because humans are in a similar sort of sweet spot. It's possible to get humans to do things they massively regret upon reflection as long as their day-to-day work can be done without attending to obvious clues (e.g. the guy who's an accountant for the Nazis for 40 years and doesn't think about the Holocaust; he just thinks about accounting). Or you might try and steer towards this sweet spot by developing ways to block reflection in cases where it's dangerous without interfering with it in cases where it's essential for capabilities.

I'm not sure exactly what mesa is saying here, but insofar as "implicitly tracking the fact that takeoff speeds are a feature of reality and not something people can choose" means "intending to communicate from a position of uncertainty about takeoff speeds", I think he has me right.

I do think mesa is familiar enough with how I talk that the fact he found this unclear suggests it was my mistake. Good to know for future.

Ah, didn't mean to attribute the takeoff speed crux to you, that's my own opinion.

I'm not sure what's best in fast takeoff worlds. My message is mainly just that getting weak AGI to solve alignment for you doesn't work in a fast takeoff.

"AGI winter" and "overseeing alignment work done by AI" do both strike me as scenarios where agent foundations work is more useful than in the scenario I thought you were picturing. I think #1 still has a problem, but #2 is probably the argument for agent foundations work I currently find most persuasive.

In the moratorium case we suddenly get much more time than we thought we had, which enables longer-payback-time plans. Seems like we should hold off on working on those longer-payback-time plans until we know we have that time, not while it still seems likely that the decisive period is soon.

Having more human agent foundations expertise to better oversee agent foundations work done by AI seems good. How good it is depends on a few things. How much of the work that needs to be done is conceptual breakthroughs (tall) vs schlep with existing concepts (wide)? How quickly does our ability to oversee fall off for concepts more advanced than what we've developed so far? These seem to me like the main ones, and like very hard questions to get certainty on - I think that uncertainty makes me hesitant to bet on this value prop, but again, it's the one I think is best.

I'm on board with communicating the premises of the path to impact of your research when you can. I think more people doing that would've saved me a lot of confusion. I think your particular phrasing is a bit unfair to the slow takeoff camp but clearly you didn't mean it to read neutrally, which is a choice you're allowed to make.

I wouldn't describe my intention in this comment as communicating a justification of alignment work based on slow takeoff? I'm currently very uncertain about takeoff speeds and my work personally is in the weird limbo of not being premised on either fast or slow scenarios.

Nice post, glad you wrote up your thinking here.

I'm a bit skeptical of the "these are options that pay off if alignment is harder than my median" story. The way I currently see things going is:

  • a slow takeoff makes alignment MUCH, MUCH easier [edit: if we get one - I'm uncertain, and I think the correct position given the current state of evidence is uncertainty]
  • as a result, all prominent approaches lean very hard on slow takeoff
  • there is uncertainty about takeoff speed, but folks have mostly given up on reducing this uncertainty

I suspect that even if we have a bunch of good agent foundations research getting done, the result is that we just blast ahead with methods that are many times easier because they lean on slow takeoff - and if takeoff is slow we're probably fine; if it's fast, we die.

Ways that could not happen:

  • Work of the form "here are ways we could notice we are in a fast takeoff world before actually getting slammed" produces evidence compelling enough to cause a pause, or to cause leading labs to discard plans that rely on slow takeoff
  • agent foundations research aiming to do alignment in faster takeoff worlds finds a method so good it works better than current slow-takeoff-tailored methods even in the slow takeoff case, and labs pivot to this method

Both strike me as pretty unlikely. TBC, this doesn't mean those types of work are bad - I'm saying low probability, not necessarily low margins.
