Consider trying Vivek Hebbar's alignment exercises

Orpheus16

Vivek Hebbar recently developed a list of alignment problems. I think more people should try them. I'm impressed with how well they (a) get people to focus on core problems, (b) encourage people to come up with their own ideas, (c) encourage people to notice and articulate confusions, and (d) accomplish a-c while also providing a fair amount of structure and guidance.

You can see the problems in this Google Doc or pasted below.

Note that Vivek is also a mentor for SERI-MATS. These exercises are also the questions that people need to answer to apply to work with him and Nate Soares. Applications are due today.

Problems:

These problems are basically research questions — we expect them to be difficult, and good responses will be valuable as research in their own right. MIRI will award prizes (likely on the order of $5000 for excellent submissions).

Instructions: We recommend that you focus on 1 or 2 of the hard questions and leave the other hard questions blank. It is mandatory to attempt either #1a-c or #2.

Note on word counts: These are guidelines for how long we think a typical good response will be, but feel free to write more. If you have lots of ideas, it’s great to write them all, and don’t bother trying to shorten them.

Hard / time-consuming questions (contest problems):

Problem 1

Pick an alignment proposal[footnote 1] and specific task for the AI.[footnote 2]
1. First explain, in as much concrete detail as possible, what the training process looks like. Then go through Eliezer’s doom list. Pick 2 or 3 of those arguments which seem most important or interesting in the context of the proposal.
2. (~250 words per argument) For each of those arguments:
  1. What do they concretely mean about the proposal?
  2. Does the argument seem valid?
    1. If so, spell out in as much detail as possible what will go wrong when the training process is carried out.
  3. What flaws and loopholes do you see in the doom argument? What kinds of setups make the argument invalid?
3. (~200 words) For one argument only, suggest a modification to the proposal, or a research direction, to get around the doom argument.
  1. Do you think your proposed modification/direction will work?
    1. If so, explain why.
    2. If not, do you think your modification is in a space where continued brainstorming and iteration will lead to a correct solution, or is there a fundamental obstacle which stops all things in that space from working? What is this obstacle?
4. (Optional; ~200 words) Overall, how promising or doomed does the alignment proposal seem to you (where ‘promising’ includes proposals which fail as currently written, but seem possibly fixable)?
  1. If promising, summarize how it circumvents the usual reasons for doom.
  2. If not promising, what is the most fatal and unfixable issue?
    1. If there are multiple fatal issues, is there a deeper generator for all of them?
    2. What is the broadest class of alignment proposals which is completely ruled out by the issues you found?

Footnote 1: Examples of alignment proposals you could consider. We recommend you pick either a proposal you're familiar with, or something from 11 Proposals so you don't waste time parsing difficult writeups.

Safety via debate
Safety via market-making
Retargeting the search (Note: Not sure if it’s concrete/developed enough as written)
ELK (Note that ELK is a problem statement and not a proposal -- you’d have to specify an alignment proposal which uses a solution to ELK as a component. There’s a sketch of this in the ELK appendix. We don’t know if this is a good idea to attempt for this question.)
Relaxed adversarial training
Recursive reward modeling
Externalized reasoning oversight
Anything from 11 proposals (or pure HCH + IDA, which is just the “imitative amplification” part of proposal #2)
Any other proposal you think of

Footnote 2: Some rough examples of “tasks” are “drive a car”, “solve alignment”, “run a widget factory”, and “protect a diamond vault”. Try to be more specific than this if possible, for instance, “the AI has cameras and motion sensors, and must protect a diamond in a room from human robbers, drones, and other attacks, by controlling various actuators and weapons in the vault”.

Problem 2

2. Suppose you have both of the below relaxations:

You have infinite compute (a computer which can run programs arbitrarily fast, and with an arbitrary amount of memory)
Substitute “human values” with “maximize the amount of diamond in the universe”

Solve alignment given these relaxations; that is, describe a scheme involving this infinite computer that results in an AGI which will maximize the amount of diamond in our universe. (Note: This question is largely to test your ability to think concretely. Ideally, you should give detailed pseudocode + maximally concrete descriptions of any physical setups involved; verbal descriptions often hide the true difficulty of specifying a component.)

Write down every difficulty you notice, and don’t gloss over anything. Do you expect your solution to work? Suppose you run it and it fails; how did it fail?

Whether or not you found any promising scheme, still submit all the difficulties you noticed, what ideas you thought of, and why the failed ideas don’t work. Err on the side of including all your thoughts and don't worry about making your solution look polished.

ETA (10/18/22): If you use deep learning in your solution, please be very specific about the setup (architecture, loss function, regularization if any). Also explain why you expect it to generalize the way you claim it will (this may involve arguments about the mechanistic structure of the learned model and the specific inductive bias of your setup).

Problem 3

3. Note: This involves a lot of reading (on the order of 10 to 100 pages). It’s largely to identify good technical distillers, but also gives us some signal for researchers.

Distill down one of the following posts into a ≤100 word summary of the main alignment idea and a ≤100 word summary of what properties of reality are required for that idea to work, followed by summaries of the key points:

The ELK report (Eliciting Latent Knowledge)
Acceptability Verification (research agenda)
John’s agenda (see these four posts; you can distill any subset into a single distillation, and draw on his other work — do whatever makes sense)
Shard theory
How likely is deceptive alignment?
Training stories
Anything else in this rough category

Problem 4

4. Read AGI ruin and/or Paul’s response and/or the DeepMind alignment team’s response. Pick a specific point that you have thoughts on (maybe a disagreement, maybe something else) and write an analysis of it.

Thomas Kwa:

Could you say more about why you think these exercises are particularly valuable? I'm on Vivek's team and I helped a bit with these exercises, so I'm naturally a fan, but I don't think most people can decide to do these rather than Ngo's exercises or other SERI MATS app questions without more information on why they're good.
