My two biggest objections to that kind of plan:
1) It feels like passing the buck, which is a known antipattern in thinking about AI.
2) With a "soft" self-improving entity, like a team of people and AIs, most invariants you can define will also be "soft" and prone to drift over many iterations.
That's why I'd prefer a more object-level solution to alignment, if we can have it. But maybe we can't have it.
1) It feels like passing the buck, which is a known antipattern in thinking about AI.
Not sure what you mean by this or by "more object-level solution to alignment". Please explain more?
2) With a “soft” self-improving entity, like a team of people and AIs, most invariants you can define will also be “soft” and prone to drift over many iterations.
Yeah I agree with this part. I think defining an invariant that is both "good enough" and achievable/provable will be very hard or maybe just impossible.
Not sure what you mean by this or by "more object-level solution to alignment". Please explain more?
The proposed setup can be seen as a self-improving AI, but a pretty opaque one. To explain why it makes a particular decision, we must appeal to anthropomorphism, like "our team of researchers wouldn't do such a stupid thing". That seems prone to wishful thinking. I would prefer to launch an AI for which at least some decisions have non-anthropomorphic explanations.
How is this stating anything more than "the whole is safe if all the parts are safe"? Like saying a mathematical proof is valid if all the steps are valid, this is almost useless if you don't know which individual steps are valid or safe.
The idea of "mathematical proof" is useful if someone has never thought of the concept before. For more specifics you need to look at individual proposed proofs. Similarly, people have proposed specific approaches for how to develop a safe AI, which we can look at if we want to know "which individual steps are valid or safe" but having a more general concept seems useful if you hadn't thought of that before. (I did state that this "may be trivial or obvious for a lot of people", and also talked about what I personally got out of thinking this way in the paragraph just below the box.)
Maybe one of the problems with the idea of "alignment" is that it is named with a noun, so we describe it as a thing that could actually exist, while in fact it is only a high-level description of a hypothetical relation between two complex systems. In that case, it is not a "liquid" and can't be "distilled". I will illustrate this with the following example:
Imagine that I can safely ride a bike at 20 km/h, and that after some training I could extend my safe speed by 1 km/h, so it is reasonable to conclude that I could distill "safe riding" to 21 km/h. Repeating this process, I could reach higher and higher speeds. However, it is also obvious that I will have a fatal crash somewhere between 100 and 200 km/h. The reason is that at higher speeds the probability of an accident grows exponentially. The "accidents" are the real thing, not "safety", which is only a high-level description of riding habits.
Conclusion: Accidents can be avoided by not riding a bike or by limiting the bike's speed, but safety can't be stretched without limit. Thus AI development should be oriented not toward "safety" or "alignment" but toward disaster avoidance.
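To make the arithmetic of the bike example concrete, here is a rough sketch of how small, individually safe-looking increments can compound. Everything numeric in it is an assumption for illustration: the exponential form of `accident_probability`, the `base` and `growth` constants, and the 1 km/h step are hypothetical, chosen only so that the crash lands between 100 and 200 km/h as in the example.

```python
import math


def accident_probability(speed_kmh, base=1e-6, growth=0.08):
    """Hypothetical chance of a fatal accident while training at this speed.

    Assumed to grow exponentially with speed and capped at 1.0.
    """
    return min(1.0, base * math.exp(growth * speed_kmh))


def cumulative_survival(start_kmh, end_kmh, step_kmh=1):
    """Probability of surviving every 1-step extension from start to end."""
    survival = 1.0
    speed = start_kmh
    while speed < end_kmh:
        speed += step_kmh
        # Each extension must be survived independently of the earlier ones.
        survival *= 1.0 - accident_probability(speed)
    return survival


if __name__ == "__main__":
    for target in (21, 50, 100, 150, 200):
        print(f"up to {target} km/h: survival = {cumulative_survival(20, target):.6f}")
```

Under these assumed numbers, the first step from 20 to 21 km/h looks almost perfectly safe, yet the chance of surviving the whole sequence falls to zero before 200 km/h. No single step reveals that; only the per-step accident rate does, which is the quantity the comment argues we should track.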
It's a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).
This may be trivial or obvious for a lot of people, but it doesn't seem like anyone has bothered to write it down (or I haven't looked hard enough). It started out as a generalization of Paul Christiano's IDA, but also covers things like safe recursive self-improvement.
The reason I started thinking in this direction is that Paul's approach seemed very hard to knock down: any time a flaw or difficulty is pointed out, or someone expresses skepticism about some technique it uses or about the overall safety invariant, there's always a list of other techniques or invariants that could be substituted in for that part (sometimes in my own brain, as I tried to criticize some part of it). Eventually I realized this shouldn't be surprising, because IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices. (Again, this may already be obvious to others including Paul, and in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)
If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?