Content and Takeaways from SERI MATS Training Program with John Wentworth

RohanS

Introduction

I participated in the training program for John Wentworth’s stream of SERI MATS, and overall, it was a very good experience! This post is meant to convey the content of the workshops, so that others can carry out the exercises on their own if they would like, along with my thoughts and takeaways from them. Most of these are in the day-by-day breakdown below, but here is a high-level summary of my experience:

- The workshops gave me a sense of some skills that John thinks are useful for doing AI alignment research. (My guess is that John may emphasize some of these skills more than other alignment researchers would, but many of them are uncontroversial.) The exercises helped me acquire and practice some of those skills, but they also gave me ideas about how to further develop them going forward.
- Some of the prompts for the exercises were very effective at getting me to generate interesting ideas. I wasn’t able to do all the exercises as thoroughly as I would have liked, so I may want to revisit the ones that seem most valuable. These include (but are not limited to):
  - Learning and distilling a math topic slightly beyond your grasp (Week 3)
  - Turning an intuitive concept into a formal mathematical object (Week 4, Day 1)
  - Translating an intuitive high-level claim into a formal mathematical statement, then proving it (Week 4, Day 4)
  - Making good plans to contribute to alignment (Week 6)
- I learned that at this stage, I prefer theory and writing to experiment.

Day-by-Day Breakdown

Participants in the training program met 4 days each week on Gathertown, usually for about 1 hour per workshop (but sometimes up to about 2), for 6 weeks. Below is a day-by-day breakdown of what the workshops involved, with comments about my experiences with them. Naturally, not everyone’s experiences were the same as mine, so take what I say with a grain of salt. Many of the exercises can be carried out individually or in small groups, so some readers of this post may want to try them out! I’m happy to try to provide more details if any of the workshop descriptions are confusing.

(Italicized text describes the content of the workshop, usually the instructions John gave us at the start; regular text describes my thoughts and takeaways.)

Week 1 - Intro Week

Week 1, Day 1 - Intros and Alignment Disagreements

Find a partner, introduce yourselves to each other, and find something you disagree about related to alignment. After discussing for ~10 minutes, rotate partners and repeat (several times). You can’t use the same disagreement again.
This requires you to have alignment inside views that you can think of off the top of your head. I noticed I mostly don’t have these, and it might be worth developing them. One way to do this could be to figure out what experienced alignment researchers disagree about, then explore the topics in enough detail to form your own views on them. I’ll also probably want to revisit Concrete Advice for Forming Inside Views on AI Safety.

Week 1, Day 2 - Alignment Game Tree

Form a group of 2 or 3. Create a shared Google doc. One person (the proposer) lists as many alignment proposals as they can, and the other(s) (the pessimists) try to break the proposals in sub-bullets. The proposer can try to patch the holes in deeper sub-bullets, the pessimist(s) can add further critiques, etc. Aim for breadth over depth.
To do this quickly requires you to know alignment proposals, and their strengths and weaknesses, off the top of your head. If you don’t already have this knowledge, I think carrying out this activity while consulting various descriptions of alignment proposals online is a good way to develop familiarity with different proposals.
A bunch of alignment game tree docs:
- Alignment Game Tree - Rohan, Ashwin, Anish
- Alignment Game Tree (Abhay, Jay, Jack)
- Game Tree of Alignment
- Alignment Game Tree- Aditya, Victor, Valerio, and David McSharry
- W1.2 Game Tree

Week 1, Day 3 - Hypercomputer Exercise

Suppose you have a hypercomputer on a USB stick. You send it a program in your favorite programming language, it sends you back the output of that program within microseconds, no matter how many steps the program runs for. (Note that I/O is still normal speed.)

Assuming access to this magical hypercomputer, write a program which… [pick one]

- Takes in a neural net (and, optionally, its training data) and detects any mesaoptimization and the objective of the mesaoptimization.
- Takes in a neural net (and, optionally, its training data), identifies net-internal structures which represent human-intuitive concepts, then returns (some representation of) the net-internal structures and the corresponding human concepts.
- Takes in a low-level specification of a bacterium, and backs out any implied world-model or goals.
- Takes in some data, builds a world model, and then can communicate with a copy of itself (trained on potentially different data from the same environment) to coordinate on a choice of one object in the environment which is not directly visible to both of their sensors.
- Safely performs a pivotal act.
- Takes in some instructions in English, figures out what the user means (asking a reasonable number of clarifying questions if needed), and then does what the user means.

Notes on the exercise:

- All of these should be robust, i.e. they should predictably work well even far outside the training distribution.
- I recommend picking one which seems middlingly confusing to you, not most or least confusing.
- The main point of the exercise is to notice any subproblems/barriers we run into.
- It’s fine to start with pseudocode, and leave some functions with just a comment explaining what the function is supposed to do and saying [TODO: figure out how to do this]. In that case, it’s high value to think through what the inputs/outputs are for that function, even if you’re not sure how the function itself works.
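
As an illustration of that last note, here is a minimal sketch of what such a skeleton could look like for the mesaoptimization option. Every name and type in it is hypothetical (nothing here comes from the workshop). The point is only that the stubs pin down inputs and outputs while the bodies stay as TODOs.

```python
from typing import Any, Callable, List, Optional, Tuple

# Placeholder types: all hypothetical, chosen only to make the signatures readable.
NeuralNet = Any    # e.g. weights plus architecture
Dataset = Any      # optional training data
Subcircuit = Any   # some representation of part of the net
Objective = Callable[[Any], float]  # scores outcomes; higher means more preferred


def find_search_processes(net: NeuralNet, data: Optional[Dataset]) -> List[Subcircuit]:
    """Return parts of the net that appear to run an internal search.

    [TODO: figure out how to do this]
    """
    raise NotImplementedError


def infer_objective(net: NeuralNet, process: Subcircuit) -> Objective:
    """Return the objective that the given search process appears to optimize.

    [TODO: figure out how to do this]
    """
    raise NotImplementedError


def detect_mesaoptimization(
    net: NeuralNet, data: Optional[Dataset] = None
) -> List[Tuple[Subcircuit, Objective]]:
    """Pair every detected search process with its inferred objective."""
    return [(p, infer_objective(net, p)) for p in find_search_processes(net, data)]
```

Even with empty bodies, writing down the return type of `find_search_processes` forces a decision about what "a part of the net" is, which is the kind of subproblem the exercise is meant to surface.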

I thought this was an ok exercise at my current stage. I didn’t really have a good sense of what a hypercomputer could help with, so it wasn’t a great generator of new ideas for me. I felt like I couldn’t come up with any concrete ideas for how to start writing these programs, so I couldn’t even run into bottlenecks that arise from realistic computer constraints. If I had been able to do that, I can imagine the exercise would have been useful for getting past those bottlenecks and finding other interesting barriers. I’m not sure how to get to that stage though.

Week 1, Day 4 - Ball and Cup Exercise

John had a ramp (specifically, a curved Hot Wheels track) on a table and a metal ball that he would drop at the top of the ramp. He had a line on the ground below the table, and we know the ball will land somewhere on the line. Form a group of 2-3 people, and try to figure out how far horizontally from the end of the ramp to put the cup to get the ball in the cup. You can ask for measurements as needed. When you have a guess, tell John and he’ll test it! The goal is to get the ball in the cup in as few tries as possible, ideally one.
This was fun, and led to a couple of interesting takeaways. The takeaways I had (which at least partially spoil the exercise, so don’t look if you want to try it!) are here.

Week 2 - Experiment Week

Week 2, Day 1 - Basin Broadness Discussion and Experiment Prep

Before attending:
- Discuss Vivek’s basin broadness posts
  - Information Loss --> Basin flatness
  - Hessian and Basin volume
- Train a basic MNIST classifier
  - I did this in PyTorch on Colab
During the workshop on this day, the floor was open for us to ask questions that would help us better understand the posts and/or carry out the exercise scheduled for the next day (see below).

Week 2, Day 2 - Main Experiment

Compute eigenvalues and eigenvectors of the Hessian of the loss, or the SVD of the behavioral gradient matrix (d(output for all inputs)/dθ), for the MNIST classifier you trained. (Behavioral gradients are described in more detail in the Information Loss --> Basin flatness post.)
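
To make the two objects concrete, here is a toy sketch in pure Python. It is not the MNIST setup and not code from the workshop: the two-parameter model, the data, and all function names are assumptions made for illustration. The model depends only on a + b, so one parameter direction cannot change any output. For a model that is linear in its parameters with mean-squared-error loss, the Hessian equals (2/N) GᵀG, where G is the behavioral gradient matrix, so the two spectra can be compared directly.

```python
import math

# Toy data and model: hypothetical, chosen so one parameter direction is redundant.
XS = [1.0, 2.0, 3.0]
YS = [2.0, 4.0, 6.0]


def model(theta, x):
    a, b = theta
    return (a + b) * x  # outputs depend only on a + b


def loss(theta):
    """Mean squared error over the toy dataset."""
    return sum((model(theta, x) - y) ** 2 for x, y in zip(XS, YS)) / len(XS)


def hessian(f, theta, h=1e-3):
    """Central finite-difference Hessian of f at theta."""
    n = len(theta)

    def shifted(i, di, j, dj):
        t = list(theta)
        t[i] += di
        t[j] += dj
        return f(t)

    return [
        [
            (shifted(i, h, j, h) - shifted(i, h, j, -h)
             - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
            for j in range(n)
        ]
        for i in range(n)
    ]


def behavioral_gradients(theta, h=1e-6):
    """Matrix G with one row per input and one column per parameter."""
    rows = []
    for x in XS:
        row = []
        for i in range(len(theta)):
            up, down = list(theta), list(theta)
            up[i] += h
            down[i] -= h
            row.append((model(up, x) - model(down, x)) / (2 * h))
        rows.append(row)
    return rows


def gram(G):
    """Return G^T G; its eigenvalues are the squared singular values of G."""
    n = len(G[0])
    return [[sum(r[i] * r[j] for r in G) for j in range(n)] for i in range(n)]


def sym2x2_eigenvalues(M):
    """Eigenvalues (ascending) of a symmetric 2x2 matrix, in closed form."""
    p, q, r = M[0][0], M[0][1], M[1][1]
    mean = (p + r) / 2
    dev = math.hypot((p - r) / 2, q)
    return mean - dev, mean + dev


theta = [1.5, 0.5]  # a + b = 2 fits the toy data exactly
hess_eigs = sym2x2_eigenvalues(hessian(loss, theta))
gram_eigs = sym2x2_eigenvalues(gram(behavioral_gradients(theta)))
print("Hessian eigenvalues:", hess_eigs)
print("(2/N) * eigenvalues of G^T G:", tuple(2 / len(XS) * e for e in gram_eigs))
```

Both printed pairs should show one eigenvalue near zero, corresponding to the flat direction (1, -1), and one near 56/3 ≈ 18.67. For a real classifier the parameter count is far too large for closed-form 2x2 formulas, so a numerical linear algebra library would be needed instead.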
This was a pretty difficult exercise (for me, and for some others), and I wasn’t able to get very far. This served as a notice to me that this may be a kind of thing I should be able to do if I want to do good alignment research, and I should seriously consider trying to improve my ability to carry out and interpret experiments like this.

Week 2, Day 3 - You Are Not Measuring What You Think You Are Measuring

Read You Are Not Measuring What You Think You Are Measuring
Look at the abstracts of papers that describe experiments (e.g. on ML arXiv), and explain how they (might) fail to measure what they think they’re measuring.
- The point is not to find flaws in existing papers, but to practice the skill of predicting and recognizing ways in which experiments can be misleading or have unexpected results.

Week 2, Day 4 - Interpreting Results

Figure out what we were measuring two days ago by doing lots of experiments and graphing lots of things.
Same as above: This was a pretty difficult exercise (for me, and for some others), and I wasn’t able to get very far. This served as a notice to me that this may be a kind of thing I should be able to do if I want to do good alignment research, and I should seriously consider trying to improve my ability to carry out and interpret experiments like this.

Week 3 - Writing Week

Week 3, Day 1 - Prototypical examples

Read a technical thing (e.g. paper abstract, Wikipedia article, etc.). After each sentence, pause and identify key terms. Identify a prototypical example of those terms, as a way to track what the text is saying in a more concrete setting.
- We started with Brouwer’s fixed-point theorem
  - What are some prototypical examples of continuous functions from a closed disk to itself? Identify a fixed point for each of these functions.
  - Repeat for each of the generalized forms of the theorem.
- Then we moved to ML arXiv paper abstracts
When reading complicated technical things, keeping a prototypical example in mind can help things make more sense. When writing, try to install a good prototypical example in the reader’s mind!
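
As a concrete illustration for the Brouwer exercise, here are two prototypical continuous maps from the closed unit disk to itself, with their fixed points checked numerically. The specific maps and function names are illustrative choices, not the ones used in the workshop. Note that iterating a map only finds a fixed point here because the second map is a contraction; Brouwer’s theorem guarantees that a fixed point exists but gives no procedure for finding it.

```python
import math


def rotate(p, angle=math.pi / 3):
    """Rotate the disk about its center; the center is the fixed point."""
    x, y = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y)


def shrink_and_shift(p):
    """Halve the disk and shift it right by 0.25; the image stays inside the unit disk."""
    x, y = p
    return (x / 2 + 0.25, y / 2)


def iterate(f, p, steps=60):
    """Apply f repeatedly; converges to the fixed point when f is a contraction."""
    for _ in range(steps):
        p = f(p)
    return p


center = (0.0, 0.0)
print(rotate(center))  # the center does not move
print(iterate(shrink_and_shift, (0.0, 1.0)))  # approaches (0.5, 0.0)
```

Solving x = x/2 + 0.25 and y = y/2 by hand gives the fixed point (0.5, 0) for the second map, which is a quick check on what the iteration reports.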
Pick a math concept that you don’t fully understand, but which you are close to being able to understand. (“You should know what all the individual words mean, but not grasp the whole idea already” was John’s guidance.) For the rest of the week, work on understanding that topic and write a distillation for it. By Wednesday’s meeting, have a first draft ready.
I thought this was a very good prompt. I felt excited about picking a math topic, trying to understand it better, and writing about it. I hadn’t previously actively tried to think of things in the category of “math topics slightly beyond my grasp that I’d like to learn and write about,” and a few exciting options came to mind pretty quickly.

Week 3, Day 2 - Distillation Workshop

Work on understanding and writing about the topic you chose. (I think that’s all this day was, but I might be forgetting something.)

Week 3, Day 3 - Reader’s Mental Picture

Share your draft with a partner. They will tell you what picture (perhaps what prototypical example) they have in mind as they read your draft. Compare this to what you wanted them to have in mind.
Good exercise. Useful for identifying things I wasn’t explaining properly.

Week 3, Day 4 - Hook Workshop

Something about how to write a good hook at the start of a post.
I didn’t attend; this was Thanksgiving Day.
Finish your distillation and post it online before next week!
I didn’t actually do this. I was looking at transformers (the neural net architecture), and I focused more on the learning part without getting to enough of the writing. I’ll try the writing part on my own soon. I think the essential exercise of this week is one I’m excited to do, potentially several times, on my own.

Week 4 - Theory Week

Week 4, Day 1 - Boundaries Exercises

Before the meeting:

Read «Boundaries», Part 1: a key missing concept from utility theory by Andrew Critch

Task 1: In a group of 2-3, come up with a mathematical formulation for what a boundary is (we may have focused on the boundaries of agents in particular).

- Should include as few assumptions as possible
- Should be formal and precise
- Should satisfy the properties desired from the concept
  - In the assigned post above, Critch outlines some reasons why boundaries would be a useful concept in utility theory. The mathematical formalism should be useful for those purposes.
I thought this was a very useful and fun exercise! It is a thing I had not thought to try to do before, but it felt quite tractable to come up with interesting ideas.
After ~10 minutes, discuss what each group came up with for ~5 minutes.

Task 2: Pick one of the groups’ definitions for everyone to work with now. Make that definition more precise, or find flat out mistakes.

Task 3: Come up with a mathematical notion of power / control.

Power / control was a part of the notion of boundaries that we came up with, so this was diving deeper into trying to understand the hidden details.

One potential use of the concept of boundaries is something like, “If boundaries can be formalized, maybe we could make safe AI by having it respect boundaries, and we can set those boundaries to prevent the AI from doing harm to things we care about.” “Boxed AI” is AI that only tries to interact with the world through specific I/O channels, and doesn’t try to acquire the ability to interact via other channels. What is the relationship between Boxed AI and the definition of boundaries you came up with? What does your definition of boundaries tell you about how to develop boxed AI?

This was also an interesting exercise.

What will go wrong if you rely heavily on this definition of boundaries?

None of us thought that we had a True Name for boundaries, so this was a useful thing to think through.

Week 4, Day 2 - Framing & Active Ducking Exercises

Find a partner. Go through the concepts in this doc, taking turns as the example-generator and the “duck.” Spend 10 minutes on each concept, in which the example-generator comes up with as many examples of the concept as they can that they had not previously thought of as examples of the concept, and the duck gives small helpful comments (such as “Can you make that example more concrete?”). (Coming up with examples probably first requires coming up with a working definition of each concept.)
- I thought this was a pretty good exercise, but some of the concepts lent themselves to this exercise much better than others.

Week 4, Day 3 - Prototypical Example Degrees of Freedom & Type Signatures

For mathematical formalisms of different concepts (like boundaries), consider:
- Degrees of freedom
  - According to the formalism, what things could change without affecting the properties of the thing? Does this make sense with the actual concept?
- Type signatures
  - What data structure would you use to represent it?
  - What inputs and outputs can it take?
  - What constraints does this place on examples of the concept?

Week 4, Day 4 - Conjecture Workshop

Come up with a conjecture using high-level concepts (e.g. some statement about boundaries). Then try to prove that conjecture using mathematical formalisms of the high-level concepts.
I want to try this activity again. It felt really exciting, but I couldn’t quickly come up with a conjecture that felt suitable for the exercise, so I just did it with a statement that didn’t really make sense. I think coming up with a conjecture that is amenable to this exercise is an interesting challenge in its own right.

Week 5 - Big Picture & Strategy Week

Week 5, Day 1 - Alignment Game Tree II: Problem Tree + slackness/tautness

Before the workshop: Take a look at anything which sounds interesting to you in List of Lethalities (especially sections B.2 - B.3), and any of the Why Not Just... posts. You don't need to read all of these; look for whatever most directly challenges the strategies you find most interesting or promising. The point is to have a little more background on barriers other people have talked about before doing the next round of the Game Tree.
Similar to the earlier Alignment Game Tree exercise (Week 1, Day 2).
Which failure modes kept popping up across various alignment proposals? Those are taut constraints, which are core problems that (John thinks) we need to face head-on.

Week 5, Day 2 - Existing Evidence Exercises

Running experiments and gathering data sometimes requires a lot of resources, so we want to make use of the abundant existing evidence in the world when possible. Pick an alignment strategy of interest to you, and identify the closest real-world analogy you can. Figure out what that analogy can tell you about the alignment strategy.
- E.g. AI Safety via Debate ⇐⇒ Jury trials. The fact that jury trials are far-from-perfect at reaching true conclusions is a pretty bad sign for debate. (At least, John thinks so. I’m inclined to agree, although I want to hear or construct a steelman argument for debate before dismissing it completely.)
This felt like a promising exercise, but coming up with good analogies wasn’t easy for some alignment proposals, and it was difficult to get much insight without good analogies.

Week 5, Day 3 - X-o-scope

The development of new tools and measurement devices has previously been critical to the development of scientific fields. For example, microscopes were essential for biology, telescopes for astronomy, etc.
Brainstorm tools for alignment - be as sci-fi as you want
- What could you measure to make alignment easier?
After brainstorming, pick one idea and make it much more concrete. What exactly are the tool’s inputs and outputs? How might you build it?
I wasn’t able to come up with much that I found interesting with this prompt. I think I would have benefited from seeing more examples from the history of science of measuring tools catalyzing rapid progress in a field, or more ideas for tools that would be valuable to other fields in the present day, before trying to generate ideas for alignment tools on my own.

Week 5, Day 4 - Nate’s Giant Text File Technique

Open up a blank Google doc. Write everything you know about alignment (or alternatively, agency).
Great exercise! I plan to repeat this in the future. I only spent about an hour on it in the workshop; I think having more time to work with would be better.

Week 6 - Wrap-up

Week 6, Day 1 - Hamming Questions

Basically follow this post: Research Hamming Questions
- Spend 5 mins answering the questions in each section in a Google doc.
Some sections’ questions felt more effective at eliciting useful thoughts from me than others, but overall I thought this was very useful.

Week 6, Day 2 - Groups and Plans

Form a new group (for accountability / working on a project) and make a plan.
- Use your answers to the Hamming questions from yesterday to guide your plans!
Useful, but I felt unable to make great plans due to uncertainty about the balance between upskilling and direct work I should be aiming for right now.

Week 6, Day 3 - Idea Feedback

Optional extra reading for this week: Principles for Alignment/Agency Projects. They did something similar to this week's "idea feedback" session last MATS program, and John later wrote up this post on themes which came up a lot during that session.
Go up on stage (in Gathertown) and get John’s feedback on your plans from yesterday.
Fun and interesting, both for my own plans and those of others. I think most of the value came from finding random things John says interesting, which can also be gained from just reading lots of the things he’s written on LessWrong.
Would have probably been more valuable if I’d had a more concrete plan to talk about.

Week 6, Day 4 - Ask Me Anything

We asked John questions and got his answers.
Fun, interesting, and useful, for my own questions and those of others.
Before this workshop, I read John’s The Plan - 2022 Update and tried to really understand all the explicit and implicit beliefs about alignment that are contained within it. I also explored some related posts and comments of his as they became relevant. This process left me with several questions, which I asked, and the answers I got were interesting and useful. I think this approach to coming up with questions was a good one.

johnswentworth

A few relevant comments for anybody trying some of the workshops...

The applied linear algebra lecture series covers some material directly relevant to the parts people found difficult in Experiment Week. Lectures 2 and 3 are particularly relevant. (Unfortunately I didn't record those lectures in time for this MATS cohort, but their confusions did inform the lecture content.)

As Rohan noticed, a lot of the exercises probably needed more time/attention than I gave for people to figure things out. Also, different workshops will connect for different people, mostly depending on what background skills/knowledge people already have and how much they've already thought about alignment. Unfortunately, if you just read through the list, there are exercises which you will probably not expect to be very relevant to you but which would in fact be high-value if you did them, so it's hard to avoid just trying them all and seeing what works.

Other than additional time/attention and some expected variation in the extent to which different workshops connect for different people, I think the conjecture workshop was the only workshop where I qualitatively messed up the implementation. I'd previously run conjecture workshops with differently-selected people, and it turns out the things they needed were very different from the things this cohort needed. In particular, I should have put much more emphasis on the fact that a conjecture usually needs two sets of properties - one set of properties is assumed, and then the other properties are derived from those. Lots of people in this cohort ended up coming up with an operationalization of some intuitive concept, but never got around to conjecturing what properties were implied by that operationalization; they didn't have an actual claim.

In addition to the workshops, two of the weeks had optional bonus exercises, which I expect are high-expected-value but didn't really fit in the schedule. Experiment week:

Optional Bonus Exercise for this week: go through the code for both your MNIST classifier, and the Hessian/behavioral gradient eigenstuff calculation. First, without running the code, say what the shape is of each variable (i.e. scalar, vector of length 40k, 100 by 1000 matrix, etc), then run the code and check the actual shapes match what you expected. Second, again without running the code, do a Fermi estimate of the runtime of each part of the code, then run the code and check how close your Fermi estimates were. (For the Fermi estimates, assuming you're not running on a GPU, a reasonable estimate for your CPU's speed is 1-10 billion operations per second, and you should aim to get your estimate within a factor of 10.)
If you had trouble following what was going on in any of the coding during today's exercise, then the bonus exercise will probably help; shapes and runtime Fermi estimates are things which I usually track in my head when writing numerical code. It's also a relatively fun exercise, since the feedback loop is very tight.
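
A minimal sketch of this kind of shape and runtime bookkeeping, using only arithmetic. The layer sizes (784, 32, 10) describe a hypothetical small MNIST classifier, not anyone's actual model, and the n^3 cost for a dense eigendecomposition ignores constant factors, which is in the spirit of a Fermi estimate.

```python
# Hypothetical fully connected MNIST classifier: 784 -> 32 -> 10.
layer_sizes = [784, 32, 10]

# Each layer has a weight matrix of shape (n_in, n_out) plus a bias vector of length n_out.
n_params = sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# The Hessian of the loss has one row and one column per parameter.
hessian_shape = (n_params, n_params)
hessian_bytes = n_params ** 2 * 4  # float32 entries

# Dense symmetric eigendecomposition costs on the order of n^3 operations.
eig_ops = n_params ** 3

for ops_per_second in (1e9, 1e10):  # CPU speed range suggested in the exercise
    seconds = eig_ops / ops_per_second
    print(f"{ops_per_second:.0e} ops/s -> ~{seconds:.0f} s for the eigendecomposition")

print("parameters:", n_params)
print("Hessian shape:", hessian_shape, f"(~{hessian_bytes / 1e9:.1f} GB as float32)")
```

With these assumed sizes the estimate comes out at tens of minutes to a few hours for the full eigendecomposition, which is the sort of number worth knowing before running the code.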

... and writing week:

Optional Bonus Exercise for this coming week: look up either Shannon's paper introducing information theory, Turing's paper on morphogenesis, or any of Einstein's four annus mirabilis papers. Read through it, paying attention mainly to writing style/techniques. These were all highly influential papers on complicated technical topics which nonetheless had a lot of reach. How did the author make things understandable? How does the style differ from e.g. a typical paper today? What takeaways could you incorporate into your own writing, to write more like Shannon/Turing/Einstein?

JakubK

I tried the Shannon/Turing/Einstein writing style exercise in the Distillation for Alignment Practicum and didn't find it very useful. The Einstein paper I read seemed reasonably good at communicating its ideas, but I didn't find many useful techniques besides obvious things like "describe one idea per paragraph" and "define the symbols in your equations."

I bet there are some better papers for learning communication techniques? Maybe from What is the best scientific paper you have read? or Any fun, easy to read scientific papers you’d suggest? or Lists of important publications in science. (The first link has a lot of Shannon/Turing/Einstein fans, so maybe I'm crazy.)

Another possibility I'm considering is that scientific papers are fundamentally worse at communicating ideas than other media like textbooks, videos, or more casual writing.

Kay Kozaronek

Thanks for putting this together. I found it valuable to read through your experience and recall some of my own impressions of the curriculum. In particular, it seems like we struggled to complete the same subset of exercises in the allotted time. Hopefully, this will be incorporated in future runs of the workshop.
