The Mechanism Design for AI Safety (MDAIS) reading group, announced here, is currently in its eighth of twelve weeks. I'm very excited by the quality of discussions we've had so far, and for the potential of future work from members of this group. If you're interested in working at the intersection of mechanism design and AI safety, please send me a message so that I can keep you in mind for future opportunities.

Edit: we have completed this initial list and are now meeting on a monthly basis. You can sign up to attend the meetings here.

A number of people have reached out to ask me for the reading list we're using. Until now, I've had to tell them that it was still being developed, but at long last it has been finalized. This post communicates the list publicly for anyone curious about what we've been discussing, or who would like to follow along themselves. It goes week by week, listing the papers covered, the topics of discussion, and any notes I have. After the first two weeks, the order of the papers covered is largely inconsequential.

Reading List

Week 1


  1. The Principal-Agent Alignment Problem in Artificial Intelligence by Dylan Hadfield-Menell
  2. Incomplete Contracting and AI Alignment by Dylan Hadfield-Menell and Gillian Hadfield

Discussion: Introductions, formalization of the alignment problem, inverse reinforcement learning and cooperative inverse reinforcement learning

Notes: The Principal-Agent Alignment Problem in Artificial Intelligence is extremely long, essentially multiple papers concatenated, so discussing it in the first week gave people more prep time to read it. Incomplete Contracting and AI Alignment is much shorter and less formal, but it did not add much; in hindsight, I would not have included it.

Week 2

Paper: Risks from Learned Optimization in Advanced Machine Learning Systems by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant

Discussion: Inner vs. outer alignment, what applications mechanism design has for each "step" in alignment

Week 3

Paper: Decision Scoring Rules (Extended Version) by Caspar Oesterheld and Vincent Conitzer 

Discussion: Oracle AI, making predictions safely

Week 4

Paper: Discovering Agents by Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, and Tom Everitt

Discussion: Defining agents, using causal influence diagrams in AI safety

Week 5


  1. Model-Free Opponent Shaping by Chris Lu, Timon Willi, Christian Schroeder de Witt, and Jakob Foerster
  2. The Good Shepherd: An Oracle Agent for Mechanism Design by Jan Balaguer, Raphael Koster, Christopher Summerfield, and Andrea Tacchetti

Discussion: Mechanism design affecting learning, how deception might arise

Notes: Almost everything in The Good Shepherd was also covered in Model-Free Opponent Shaping, so in hindsight including it as well was redundant.

Week 6

Paper: Fully General Online Imitation Learning by Michael Cohen, Marcus Hutter, and Neel Nanda

Discussion: Advantages, disadvantages, and extensions for the mechanism proposed in the paper

Week 7


  1. Corrigibility by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong
  2. The Off-Switch Game by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell

Discussion: Formalizing issues with corrigibility, approaches to instill corrigibility

Week 8

Paper: Investment Incentives in Truthful Approximation Mechanisms by Mohammad Akbarpour, Scott Kominers, Kevin Li, Shengwu Li, and Paul Milgrom

Discussion: Implementing mechanisms with AI, issues with approximation

Week 9

Paper: Cooperation, Conflict, and Transformative Artificial Intelligence - A Research Agenda by Jesse Clifton

Discussion: Various topics from the agenda with a focus on S-risks and bargaining

Week 10

Paper: Getting Dynamic Implementation to Work (excluding sections 3 and 4) by Yi-Chun Chen, Richard Holden, Takashi Kunimoto, Yifei Sun, and Tom Wilkening

Discussion: Ensemble models, AI monitoring AI

Notes: Sections 3 and 4 of the paper were excluded as they focus on experimental results with humans, which are of minimal relevance.

Week 11


  1. Learning to Communicate with Deep Multi-Agent Reinforcement Learning by Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson
  2. Emergent Covert Signaling in Adversarial Reference Games by Dhara Yu, Jesse Mu, and Noah Goodman

Discussion: Detecting communication, intercepting communication

Week 12

Paper: Functional Decision Theory: A New Theory of Instrumental Rationality by Eliezer Yudkowsky and Nate Soares 

Discussion: Functional decision theory, mechanism design for superrational agents and functional decision theorists

Next Steps

Once we have finished going through this reading list, I would like to move to a more infrequent and irregular schedule. Meetings would be to discuss new developments in the space, the research produced by reading group members, or topics missed during the first twelve weeks. I expect this ongoing reading group would expand beyond the initial members and be open to anyone interested.

If there is sufficient interest, another iteration going through the above reading list can be run, although likely with several updates.

Finally, we plan to collaborate on an agenda laying out promising research directions at the intersection of mechanism design and AI safety. Ideally, we will have interested members transition to a working group where we can collaborate on research to address the challenge of ensuring AI is a positive development for humanity.

Edit: We have completed the initial readings and are now meeting once a month for further readings. You can sign up to be notified here.

Ongoing Readings

Meeting 13

Paper: Safe Pareto Improvements for Delegated Game Playing by Caspar Oesterheld and Vincent Conitzer

Meeting 14


  1. Quantilizers: A Safer Alternative to Maximizers for Limited Optimization by Jessica Taylor
  2. Safety Considerations for Online Generative Modeling by Sam Marks

Meeting 16

Paper: A Robust Bayesian Truth Serum for Small Populations by Jens Witkowski and David C. Parkes

Meeting 17 (Upcoming, early April)

Paper: Misspecification in Inverse Reinforcement Learning by Joar Skalse and Alessandro Abate

Meeting 18 (Upcoming, end of April)

Paper: Hidden Incentives for Auto-Induced Distributional Shift by David Krueger, Tegan Maharaj, and Jan Leike



Interesting, thank you! Has the group gotten around to discussing something like "lessons from contract theory or corporate governance for factored cognition-style proposals" at all?

Not yet! We're now meeting on a monthly schedule, and there has only been one meeting since completing the list here. I'll look into finding a relevant paper on the subject, but if you have any recommendations please let me know.

General question: mechanism design generally must cope with unknown agent beliefs/values and thus use truth incentivizing mechanisms, etc.

Has there been much/any work on using ZK proof techniques to prove some of these properties to make coordination easier? (i.e., a proof that agent X is the result of some program P using at least some amount of compute: training on dataset D using optimizer O and utility function U, etc.)
