Epistemic status: Came up with what I was thinking of as a notation. Realised that there was a potentially useful conceptual tool behind it.
Consider this diagram.
Here the green circle on the left represents humans, with V being human values. The W is the external world. The C is within the computer. The W' is a world model. And the V' is the human values in the world model.
Here the computer has been programmed to form a model of the environment, for example using Solomonoff induction or GPT-3.
Anything in C is stored on the hard drive. Anything in W' is mirroring a section of the real world and is part of an arbitrary Turing machine or incomprehensible neural network.
The black bars are bridges, ways that something can get from one domain to another. (Yes, this is basically a graph structure.)
The leftmost bridge is between the human mind and the world. It is made of eyes and muscles.
The next bridge is between the world and the computer. It is made of keyboards and screens.
The next bridge is the interface between the programmed and the learned: the piece of code that forms one-hot vectors from a string of text, and the piece of code that turns the final activations into a choice of next word. The final bridge is the mirror of the first bridge within the simulation: a virtual representation of keyboards and screens, hidden in the weights of a network.
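To make the programmed/learned interface concrete, here is a minimal sketch of its two halves. The four-word vocabulary and the function names are made up for illustration; a real system like GPT-3 uses a learned tokenizer and typically samples from the output distribution rather than always taking the maximum.

```python
# Hypothetical toy vocabulary, purely for illustration.
VOCAB = ["the", "cat", "sat", "mat"]

def one_hot(token):
    """Programmed -> learned: turn one token into a one-hot vector."""
    index = VOCAB.index(token)
    return [1.0 if i == index else 0.0 for i in range(len(VOCAB))]

def encode(text):
    """Turn a whitespace-separated string into a list of one-hot vectors."""
    return [one_hot(token) for token in text.split()]

def choose_next_word(final_activations):
    """Learned -> programmed: pick the word with the highest activation."""
    best = max(range(len(VOCAB)), key=lambda i: final_activations[i])
    return VOCAB[best]

print(encode("the cat"))
print(choose_next_word([0.1, 0.2, 3.0, -1.0]))
```

Everything that happens between `encode` and `choose_next_word` is on the learned side of the bridge; these two functions are the only hand-written code that touches it.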
The dotted arrow represents the path that human values take to get into V'.
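Since this is basically a graph structure, the regions and bridges can be written down as one. The following is a rough sketch, not a formalism: the region labels follow the diagram, the bridge descriptions are paraphrased from the text above, and `find_path` is a hypothetical helper that recovers the route human values take from V to V'.

```python
from collections import deque

# Each bridge joins two regions; the value describes what it is made of.
BRIDGES = {
    ("V", "W"): "eyes and muscles",
    ("W", "C"): "keyboards and screens",
    ("C", "W'"): "encoding and decoding code",
    ("W'", "V'"): "virtual keyboards and screens in the weights",
}

def neighbours(region):
    """Yield every region reachable from `region` over a single bridge."""
    for a, b in BRIDGES:
        if a == region:
            yield b
        elif b == region:
            yield a

def find_path(start, goal):
    """Breadth-first search for the shortest chain of regions joined by bridges."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in neighbours(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no chain of bridges connects the two regions

print(find_path("V", "V'"))
```

In this simple chain there is only one route, so the search is overkill, but it would still work if extra bridges (for example from interpretability tools) were added to `BRIDGES`.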
We want human values to get out into the world. The easiest way to do that is the solid line. It shows a human acting on the environment directly.
In this setup, the data goes into the computer, and straight back out. It could represent the computer trivially echoing the input. It is easy to add numbers, sort them, loop, take a maximum of a list, etc. in the computer section C. So diagram 2 could also represent a calculator. Or any other program not involving AI.
Here is a diagram that shows results flowing all the way from V' back to the world. It represents something like GPT-3 or other human imitation based alignment.
And finally, HCH, IDA can be represented by
This diagram shows such a setup, with values flowing down the dotted lines, and actions flowing down the solid lines. The represents the combination process.
Now to categorize alignment approaches. There are the approaches like IDA, possibly debate etc. that try to build an aligned AI out of these available components.
There are interpretability approaches trying to build extra bridges from the red to the blue or green. (Amongst other potential places a bridge could go.)
One approach could be to investigate what other colours of circles can be made, and their bridges.
There may be approaches to AI that can't be productively considered in terms of these components and interfaces. For linear regression, the parameters are sufficiently simple that everything represented is a very simplistic model of reality indeed.
A robot that uses hard-coded A* search doesn't really fit this structure. Nor does a hard-coded minimax chess algorithm.
For AIXI and similar algorithms, you can do the "consider all possible X to maximize Y", so long as X and Y are in the red compute circle.
I think this technique is a useful conceptual tool because several of the badly considered AI ideas I have seen have a stage where some piece of information magically jumps from one region to another. I don't know of any serious alignment researchers making this mistake. I have seen criticisms that can be described as "how does this piece of info get between these regions". So presumably this thought pattern is already being implicitly used by some people. I hope that making the structure explicit helps improve this cognitive step.
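The "how does this piece of info get between these regions" check can be sketched mechanically. This is only an illustration under the assumption that a proposal can be summarised as a sequence of regions its information passes through; `missing_bridges` is a hypothetical helper, and the region labels follow the first diagram.

```python
# Bridges from the first diagram, stored as unordered pairs of regions.
KNOWN_BRIDGES = {
    frozenset(("V", "W")),
    frozenset(("W", "C")),
    frozenset(("C", "W'")),
    frozenset(("W'", "V'")),
}

def missing_bridges(flow, bridges=KNOWN_BRIDGES):
    """Return every hop in `flow` that is not backed by a known bridge."""
    hops = zip(flow, flow[1:])
    return [(a, b) for a, b in hops if frozenset((a, b)) not in bridges]

# Human imitation: results flow from V' all the way back out to the world.
print(missing_bridges(["V'", "W'", "C", "W"]))

# A proposal where human values simply appear inside the program.
print(missing_bridges(["V", "C"]))
```

An empty list means every step crosses an actual bridge. A non-empty list names the jumps that still need a mechanism, which is exactly the question the criticisms above are asking.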
With thanks to Miranda and Evan R Murphy for proofreading.
My current belief on this is that the greatest difficulty is going to be finding the "human values" in the AI's model of the world. Any AI smart enough to deceive humans will have a predictive model of humans which almost trivially must contain something that looks like "human values". The biggest problems I see are:
1: "Human values" may not form a tight abstracted cluster in a model of the world at all. This isn't so much a conceptual issue, as in theory we could just draw a more complex boundary around them, but it makes it practically more difficult.
2: It's currently impossible to see what the hell is going on inside most large ML systems. Interpretability work might be able to allow us to find the right subsection of a model.
3: Any pointer we build to the human values in a model also needs to be stable to the model updating. If that knowledge gets moved around as parameters change, the computational tool/mathematical object which points to them needs to be able to keep track of that. This could include sudden shifts, slow movement, breaking up of models into smaller separate models.
(I haven't defined knowledge. I'm not very confused about what it means to say "knowledge of X is in a particular location in the model", but I don't have space here to write it all up.)