Status: half-formed thought on a potential piece of an alignment strategy that I've not heard discussed but that probably exists somewhere; I might just be missing the name of the concept.
Any alignment scheme that allows potentially unsafe systems to be trained, and then tests their safety before deployment, runs the risk of being uncompetitive, since the system may need to be trained many times before a safe one is found.
One solution to this problem is to find ways of 'scrambling' an unsafe system to quickly create many variants of it. This would require that, in at least some of the variants, the subsystems that have learned to perform to a high level are retained, or remain sufficiently intact to be retrained far more quickly than training from scratch, without retaining the elements that made the original unsafe.
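To make the idea concrete, here is a minimal sketch of one possible form of scrambling, assuming the system is a small feed-forward network stored as a list of per-layer weight dictionaries. Everything here is my own illustrative assumption rather than an established method: the names `init_layer`, `scramble` and `make_variants`, the `reset_fraction` parameter, and the choice to scramble by re-initialising whole layers at random. The randomly initialised `trained` network is only a stand-in for a model that has actually been trained, and the retraining and safety-checking steps are not shown.

```python
import numpy as np

def init_layer(rng, n_in, n_out):
    """Fresh random weights and zero biases for one dense layer."""
    return {"W": rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out)),
            "b": np.zeros(n_out)}

def scramble(layers, rng, reset_fraction=0.5):
    """Return (variant, reset_indices) with a random subset of layers re-initialised.

    Layers that are not selected are copied unchanged, so whatever they
    learned is retained; selected layers lose their learned weights.
    """
    n_reset = min(len(layers), max(1, int(round(reset_fraction * len(layers)))))
    reset = set(rng.choice(len(layers), size=n_reset, replace=False).tolist())
    variant = []
    for i, layer in enumerate(layers):
        if i in reset:
            n_in, n_out = layer["W"].shape
            variant.append(init_layer(rng, n_in, n_out))
        else:
            variant.append({"W": layer["W"].copy(), "b": layer["b"].copy()})
    return variant, sorted(reset)

def make_variants(layers, n_variants, seed=0, reset_fraction=0.5):
    """Create several scrambled variants of the same network."""
    rng = np.random.default_rng(seed)
    return [scramble(layers, rng, reset_fraction) for _ in range(n_variants)]

# Stand-in for a trained (and possibly unsafe) model: three dense layers.
rng = np.random.default_rng(42)
sizes = [4, 8, 8, 2]
trained = [init_layer(rng, a, b) for a, b in zip(sizes[:-1], sizes[1:])]

variants = make_variants(trained, n_variants=5)
for variant, reset in variants:
    print("re-initialised layers:", reset)
```

Each variant would then be briefly retrained and passed to whatever safety check is in place. Whether resetting whole layers is the right granularity is an open question; the hope described above only works if the unsafe elements happen to be concentrated in the parts that get reset.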
Has there been much work in this area? I'm not really sure where to start looking, but I'm curious to pursue it, since it suggests a practical research direction that could be worked on with current systems and could plausibly assist alignment proposals.
Thoughts on scrambling as a safety component
Retaining some of the previous structures would be likely to result in the same incentive gradients that led to the development of an unsafe system in the first place, reducing the safety of the system.
More worryingly, it could make it possible to quickly train a variety of unsafe systems, thereby acting as an adversarial attack on whatever safety-checker is in place, so you'd be wary of using scrambling unless you were very confident in the robustness of the safety-checker.
On the other hand, we might expect that if the computation can loosely be decomposed into that which allows good performance in training and that which provides potentially unsafe goals, then scrambling might prove an easy way to retrain the offending subsection.
The crux of this would potentially be the extent to which the nature of the non-alignment is bound up with the computation which makes the system successful. For example, a reasoner which deduces correct answers indirectly, as an emergent property of maximising some non-aligned goal, would be difficult to decompose in this way, whereas a system which directly solves the given problem, but only under certain conditions which always hold in the training distribution, might be decomposed more easily.
If we were picking functions according to NNs with gradient descent, then I would expect this to result in more decomposable functions, at least relative to picking functions by more abstract criteria like the universal prior. Gradient descent encourages a learning trajectory in which the learner is always directly trying to solve the whole problem, even at an immature stage. This makes a jump from a direct form of solution to an indirect form more difficult, since an indirect form of solution requires a more mature system, and creating it would thus require a significant transition in approach during training.