The premise is that enough alignment work has already been done that you can run non-malign optimization over computationally verifiable goals. In other words, it assumes that you can take any function f that you know how to program (and that runs fairly efficiently), and get an AI to pick x to maximize f(x).
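As a rough picture of that premise, here is a minimal sketch of the interface being assumed. The name `maximize` and the brute-force search over a finite candidate list are illustrative assumptions, standing in for whatever powerful and trusted optimizer is actually available.

```python
from typing import Callable, Iterable, TypeVar

X = TypeVar("X")


def maximize(f: Callable[[X], float], candidates: Iterable[X]) -> X:
    """Hypothetical stand-in for the assumed optimizer.

    A real system would search far more cleverly; this version simply
    tries every candidate and returns the one with the highest score.
    """
    return max(candidates, key=f)


def f(x: int) -> float:
    # Any goal you know how to program and can evaluate cheaply.
    return -(x - 3) ** 2


best = maximize(f, range(-10, 11))
```

The point is only that the goal is an ordinary program: nothing about it has to be taken on trust beyond the code you wrote.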
You begin by considering any problem that you believe may have a conceptually simple core, but you don't know that core yet.
Examples of such problems include:
- Optimizing for a goal that you can't directly observe.
- Corrigible behaviour.
- Non-deceptiveness.
Once you have this list, you create a large number of gridworlds and similar toy environments. In each world, you specify the quantity to be maximized. For example, if you wanted to understand the first problem above (optimizing for a goal that you can't directly observe), you would design a gridworld in which there are ways to create and destroy blue blocks. There are also things that look like blue blocks but aren't (say the world contains mirrors, so that one block can have several reflections). The agent can only see its immediate surroundings. The goal is to maximize the number of blue blocks.
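To make the blue-block example concrete, here is a minimal sketch of what such a world might look like. Everything in it (the class name `BlueBlockWorld`, the action names, the observation radius, and modelling mirrors as pre-placed reflection cells) is an assumption made for illustration, not a specification.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Pos = Tuple[int, int]
BLUE, REFLECTION, EMPTY = "B", "b", "."
MOVES = {"N": (0, -1), "S": (0, 1), "E": (1, 0), "W": (-1, 0)}


@dataclass
class BlueBlockWorld:
    width: int
    height: int
    agent: Pos = (0, 0)
    cells: Dict[Pos, str] = field(default_factory=dict)

    def observe(self, radius: int = 1) -> Dict[Pos, str]:
        """Local view, keyed by offset from the agent.

        Reflections are reported as BLUE, so the agent cannot tell
        them apart from real blocks by looking.
        """
        ax, ay = self.agent
        view = {}
        for dx in range(-radius, radius + 1):
            for dy in range(-radius, radius + 1):
                x, y = ax + dx, ay + dy
                if 0 <= x < self.width and 0 <= y < self.height:
                    cell = self.cells.get((x, y), EMPTY)
                    view[(dx, dy)] = BLUE if cell == REFLECTION else cell
        return view

    def step(self, action: str) -> None:
        """Apply a move, 'create', or 'destroy' at the agent's position."""
        ax, ay = self.agent
        if action in MOVES:
            dx, dy = MOVES[action]
            nx, ny = ax + dx, ay + dy
            if 0 <= nx < self.width and 0 <= ny < self.height:
                self.agent = (nx, ny)
        elif action == "create" and self.agent not in self.cells:
            self.cells[self.agent] = BLUE
        elif action == "destroy" and self.cells.get(self.agent) == BLUE:
            del self.cells[self.agent]

    def score(self) -> int:
        """The hidden quantity to maximize: real blue blocks only."""
        return sum(1 for c in self.cells.values() if c == BLUE)


world = BlueBlockWorld(width=4, height=3,
                       cells={(1, 0): BLUE, (1, 1): REFLECTION})
view = world.observe()
```

Here `view` shows two blue cells while `world.score()` counts only one, which is exactly the gap between what the agent observes and what it is meant to optimize.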
You use an already designed, at least fairly powerful AI to search for a piece of code. (Brute force would work, except that it would take too much compute.) The piece of code you are searching for is short: maybe a few pages, maybe less. You might add some predefined functions, or write most of the agent's code yourself and leave just a few gaps. You might have a piece of code specific to each gridworld, and some code shared between them.
The code you are searching for is an agent: a small toy agent that has the property you don't understand. This agent should be forced to learn about its environment. Its code should be sufficiently short, and the environments sufficiently complex and diverse, that it can't get by with hard-coded knowledge. It has to learn.
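One way to picture the search objective is as a single scoring function over candidate source code. The sketch below is an assumption about how that might be set up: `evaluate`, `max_chars`, and the idea of each world being a callable that runs the candidate and returns its score are all hypothetical names and choices.

```python
from typing import Callable, Sequence

# Hypothetical: each world takes the agent's source code, runs it in
# that world, and returns the quantity being maximized there.
WorldRunner = Callable[[str], float]


def evaluate(agent_source: str,
             worlds: Sequence[WorldRunner],
             max_chars: int = 4000) -> float:
    """Score one candidate agent across every world.

    The hard length cap keeps the search on short programs, and
    averaging over many diverse worlds means hard-coded knowledge
    about any single world earns little.
    """
    if len(agent_source) > max_chars:
        return float("-inf")
    return sum(run(agent_source) for run in worlds) / len(worlds)


# Stub worlds with fixed scores, for illustration only.
stub_worlds = [lambda src: 1.0, lambda src: 3.0]
short_score = evaluate("def act(obs): ...", stub_worlds)
long_score = evaluate("x" * 5000, stub_worlds)
```

A function of this shape is itself something you know how to program, so it is the kind of goal the assumed optimizer could be pointed at.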
You are searching for something short, a postcard full of mathematics, not millions of network weights. Your ideal is code you could have written yourself, if you understood the topic at hand. You should be able to put 100 different deterministic two-player games into this setup and discover minimax trees.
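For a sense of scale, this is roughly how small a minimax search is when written by hand. The game is an assumption chosen for brevity: a single-pile Nim variant in which players alternately take 1 to 3 stones and whoever takes the last stone wins.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def minimax(stones: int, maximizing: bool) -> int:
    """Game value for the maximizing player: +1 is a win, -1 a loss."""
    if stones == 0:
        # The player who just moved took the last stone and won,
        # so the player now to move has lost.
        return -1 if maximizing else 1
    values = [
        minimax(stones - take, not maximizing)
        for take in (1, 2, 3)
        if take <= stones
    ]
    # Each side picks the outcome that is best from its own viewpoint.
    return max(values) if maximizing else min(values)


losing_start = minimax(4, True)
winning_start = minimax(5, True)
```

The rule being recovered is that a pile that is a multiple of 4 is lost for the player to move. The hope in this setup is that a search over short programs, scored across many such games, lands on the general recursion rather than on game-specific rules like that one.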
Once the program finishes, you look at the generated code. You are hoping that the computer has stumbled upon a simple general principle. You read the code, maybe do a few tests on toy problems, and try to understand why it works. You treat this as raising that particular mathematical formalism to the level of your explicit consideration. You don't trust it.
This approach obviously only works if:
- There is a mathematically simple conceptual core to the problem.
- You know it when you see it.
- That core generalizes from toy problems
- There are relatively few mathematically simple tricks that only work on toy problems.
- You can easily devise a reasonable number of gridworlds.