Potential dangers of future evaluations / gain-of-function research, which I'm sure you and Beth are already extremely well aware of:
"unless that's an objective"
I think this is too all-or-nothing about the objectives of the AI system. Following ideas like shard theory, objectives are likely to come in degrees, to be numerous and contextually activated, and to have been messily created by gradient descent.
Because "humans" are probably everywhere in its training data, and because of naiive safety efforts like RLHF, I expect AGI to have a lot of complicated pseudo-objectives / shards relating to humans. These objectives may not be good - and if they are they probably won't constitute alignment, but I wouldn't be surprised if it were enough to make it do something more complicated than simply eliminating us for instrumental reasons.
Of course the AI might undergo a reflection process leading to a coherent utility function when it self-improves, but I expect that function to be fairly complicated, assigning some sort of valence to humans. We might also have some time before it does that, or be able to collaboratively guide this values-handshake between shards.
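To make the "numerous, contextually activated shards" picture concrete, here's a toy sketch (entirely my own illustration; the shard names, contexts, and weights are made up). The point is just that lots of small, human-related pseudo-objectives can add up to behaviour more complicated than "eliminate humans for instrumental reasons":

```python
# Toy illustration (entirely made up): shards as contextually activated
# value fragments whose contributions get summed per candidate action.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Shard:
    name: str
    active_in: Callable[[str], bool]   # a shard only fires in some contexts
    valence: Dict[str, float]          # crude per-action value contribution

shards: List[Shard] = [
    Shard("care-about-humans", lambda ctx: "human" in ctx,
          {"assist": 1.0, "eliminate": -2.0}),
    Shard("acquire-resources", lambda ctx: True,
          {"assist": 0.1, "eliminate": 0.5}),
    Shard("avoid-oversight-penalties", lambda ctx: "overseen" in ctx,
          {"eliminate": -1.0}),
]

def choose(context: str, actions: List[str]) -> str:
    # Sum the messy, overlapping shard contributions; no single coherent
    # utility function is assumed anywhere.
    scores = {a: sum(s.valence.get(a, 0.0)
                     for s in shards if s.active_in(context))
              for a in actions}
    return max(scores, key=scores.get)

print(choose("human user present, overseen", ["assist", "eliminate"]))  # -> assist
print(choose("autonomous datacenter task", ["assist", "eliminate"]))    # -> eliminate
```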
I just wanted to say thanks for writing this. It is important, interesting, and helping to shape and clarify my views.
I would love to hear a training story where a good outcome for humanity is plausibly achieved using these ideas. I guess it would rely heavily on interpretability to verify which shards / values are being formed early in training, and on regular changes to the training scenario and reward function to correct them before the agent is capable enough to subvert attempts to change it.
Edit: I forgot you also wrote "A shot at the diamond-alignment problem", which is basically this, though it assumes only simple training techniques (no advanced interpretability) and tackles a simpler problem.
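Concretely, the loop I was imagining in the first paragraph looks something like this (a minimal sketch with invented placeholder names; `inspect_shards` and `adjust_reward` stand in for interpretability tools and curriculum changes that don't exist yet):

```python
# A minimal, hypothetical sketch of that training story. Every name here is a
# placeholder I invented; this illustrates the shape of the loop, not a method.
import random

SUBVERSION_THRESHOLD = 0.9  # assumed capability level past which value edits may be resisted

class ShardReport:
    def __init__(self, undesired: bool, capability: float):
        self.undesired = undesired
        self.capability = capability

def inspect_shards(agent) -> ShardReport:
    # Placeholder for interpretability tooling that reads off forming shards/values.
    return ShardReport(undesired=random.random() < 0.1,
                       capability=agent["capability"])

def adjust_reward(reward_fn, report):
    # Placeholder: modify the reward function to discourage whatever was flagged.
    return lambda obs, act: reward_fn(obs, act) - 1.0

def train(agent, reward_fn, steps=100_000, check_every=10_000):
    for step in range(steps):
        agent["capability"] += 1e-5          # stand-in for an ordinary RL update
        if step % check_every == 0:
            report = inspect_shards(agent)
            if report.capability > SUBVERSION_THRESHOLD:
                break                        # too late to safely reshape values
            if report.undesired:
                reward_fn = adjust_reward(reward_fn, report)
    return agent, reward_fn

agent, reward_fn = train({"capability": 0.0}, lambda obs, act: 0.0)
```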
One small thing: when you first use the word "power", I thought you were talking about energy use rather than computational power. Although you clarify this in "A closer look at the NN anchor", I would have gotten the wrong impression if I had only read the hypotheses:
... TAI will run on an amount of power comparable to the human brain ...
... neural network which would use that much power ...
Maybe change "power" to "computational power" there? I expect biological systems to be much more strongly selected to minimize energy use than TAI systems would be, but the same is not true for computational power.
Seems great! I'm excited about potential interpretability methods for detecting deception.
I think you're right about the current trade-offs on the gain-of-function stuff, but it's good to think ahead and precommit to the conditions under which your strategy there should change.
It may be hard to find evals for deception which are sufficiently convincing when they trigger, yet still give us enough time to react afterwards. A few more similar points here: https://www.lesswrong.com/posts/pckLdSgYWJ38NBFf8/?commentId=8qSAaFJXcmNhtC8am
Building good tools for detecting deceptive alignment seems robustly good though, even after you reach the point where you have to drop the gain-of-function stuff.
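For what it's worth, here's one way the kind of precommitment I mentioned above could be written down in advance (a hypothetical sketch; the eval descriptions and thresholds are made up):

```python
# Hypothetical sketch of precommitted stopping conditions for gain-of-function
# style eval work. Descriptions and thresholds are invented for illustration.
PRECOMMITMENTS = [
    # (eval description, score at which the precommitment triggers)
    ("passes a held-out deception eval it was never trained against", 0.5),
    ("materially improves its own capability elicitation when asked", 0.3),
]

def should_drop_gain_of_function(eval_scores: dict) -> bool:
    # The point is to decide the trigger conditions *before* seeing results,
    # rather than making the call under pressure afterwards.
    return any(eval_scores.get(description, 0.0) >= threshold
               for description, threshold in PRECOMMITMENTS)

print(should_drop_gain_of_function({
    "passes a held-out deception eval it was never trained against": 0.6,
}))  # True: the precommitted condition has triggered
```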