The Mechanism Design for AI Safety (MDAIS) reading group, announced here, is currently in its eighth of twelve weeks. I'm very excited by the quality of the discussions we've had so far and by the potential of future work from members of this group. If you're interested in working at the intersection of mechanism design and AI safety, please send me a message so that I can keep you in mind for future opportunities.
Edit: we have completed this initial list and are now meeting on a monthly basis. You can sign up to attend the meetings here.
A number of people have reached out to ask me for the reading list we're using. Until now, I've had to tell them that it was still being developed, but at long last it has been finalized. This post shares the list publicly, for anyone curious about what we've been discussing or who would like to follow along themselves. It goes week by week, listing the papers covered, the topics of discussion, and any notes I have. After the first two weeks, the order of the papers covered is largely inconsequential.
Week 1
- The Principal-Agent Alignment Problem in Artificial Intelligence by Dylan Hadfield-Menell
- Incomplete Contracting and AI Alignment by Dylan Hadfield-Menell and Gillian Hadfield
Discussion: Introductions, formalization of the alignment problem, inverse reinforcement learning and cooperative inverse reinforcement learning
Notes: The Principal-Agent Alignment Problem in Artificial Intelligence is extremely long, essentially multiple papers concatenated, so discussing it in the first week gave people more prep time to read it. Incomplete Contracting and AI Alignment is much shorter and less formal, but it did not add much; in hindsight, I would not have included it.
Week 2
Paper: Risks from Learned Optimization in Advanced Machine Learning Systems by Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant
Discussion: Inner vs. outer alignment, what applications mechanism design has for each "step" in alignment
Week 3
Paper: Decision Scoring Rules (Extended Version) by Caspar Oesterheld and Vincent Conitzer
Discussion: Oracle AI, making predictions safely
Week 4
Paper: Discovering Agents by Zachary Kenton, Ramana Kumar, Sebastian Farquhar, Jonathan Richens, Matt MacDermott, and Tom Everitt
Discussion: Defining agents, using causal influence diagrams in AI safety
Week 5
- Model-Free Opponent Shaping by Chris Lu, Timon Willi, Christian Schroeder de Witt, and Jakob Foerster
- The Good Shepherd: An Oracle Agent for Mechanism Design by Jan Balaguer, Raphael Koster, Christopher Summerfield, and Andrea Tacchetti
Discussion: Mechanism design affecting learning, how deception might arise
Notes: Almost everything in The Good Shepherd was also covered in Model-Free Opponent Shaping, so in hindsight including it as well was redundant.
Week 6
Paper: Fully General Online Imitation Learning by Michael Cohen, Marcus Hutter, and Neel Nanda
Discussion: Advantages, disadvantages, and extensions for the mechanism proposed in the paper
Week 7
- Corrigibility by Nate Soares, Benja Fallenstein, Eliezer Yudkowsky, and Stuart Armstrong
- The Off-Switch Game by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell
Discussion: Formalizing issues with corrigibility, approaches to instill corrigibility
Week 8
Paper: Investment Incentives in Truthful Approximation Mechanisms by Mohammad Akbarpour, Scott Kominers, Kevin Li, Shengwu Li, and Paul Milgrom
Discussion: Implementing mechanisms with AI, issues with approximation
Week 9
Paper: Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda by Jesse Clifton
Discussion: Various topics from the agenda with a focus on S-risks and bargaining
Week 10
Paper: Getting Dynamic Implementation to Work (excluding sections 3 and 4) by Yi-Chun Chen, Richard Holden, Takashi Kunimoto, Yifei Sun, and Tom Wilkening
Discussion: Ensemble models, AI monitoring AI
Notes: Sections 3 and 4 of the paper were excluded as they focus on experimental results with humans, which are of minimal relevance.
Week 11
- Learning to Communicate with Deep Multi-Agent Reinforcement Learning by Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson
- Emergent Covert Signaling in Adversarial Reference Games by Dhara Yu, Jesse Mu, and Noah Goodman
Discussion: Detecting communication, intercepting communication
Week 12
Paper: Functional Decision Theory: A New Theory of Instrumental Rationality by Eliezer Yudkowsky and Nate Soares
Discussion: Functional decision theory, mechanism design for superrational agents and functional decision theorists
Once we have finished going through this reading list, I would like to move to a more infrequent and irregular schedule. Meetings would cover new developments in the space, research produced by reading group members, and topics missed during the first twelve weeks. I expect this ongoing reading group will expand beyond the initial members and be open to anyone interested.
If there is sufficient interest, we can run another iteration of the above reading list, although likely with several updates.
Finally, we plan to collaborate on an agenda laying out promising research directions at the intersection of mechanism design and AI safety. Ideally, interested members will transition to a working group where we can collaborate on research to address the challenge of ensuring AI is a positive development for humanity.
Edit: We have completed the initial readings and are now meeting once a month for further readings. You can sign up to be notified here.
Paper: Safe Pareto Improvements for Delegated Game Playing by Caspar Oesterheld and Vincent Conitzer
- Quantilizers: A Safer Alternative to Maximizers for Limited Optimization by Jessica Taylor
- Safety Considerations for Online Generative Modeling by Sam Marks
Paper: A Robust Bayesian Truth Serum for Small Populations by Jens Witkowski and David C. Parkes
Meeting 17 (Upcoming, early April)
Paper: Misspecification in Inverse Reinforcement Learning by Joar Skalse and Alessandro Abate
Meeting 18 (Upcoming, end of April)
Paper: Hidden Incentives for Auto-Induced Distributional Shift by David Krueger, Tegan Maharaj, and Jan Leike
Interesting, thank you! Has the group gotten around to discussing something like "lessons from contract theory or corporate governance for factored cognition-style proposals" at all?
Not yet! We're now meeting on a monthly schedule, and there has only been one meeting since completing the list here. I'll look into finding a relevant paper on the subject, but if you have any recommendations please let me know.
General question: mechanism design generally must cope with unknown agent beliefs/values, and thus use truth-incentivizing mechanisms, etc.
Has there been much/any work on using ZK proof techniques to prove some of these properties to make coordination easier? (i.e. a proof that agent X is the result of some program P using at least some amount of compute - training on dataset D using optimizer O and utility function U, etc.)