Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

This may be trivial or obvious for a lot of people, but it doesn't seem like anyone has bothered to write it down (or I haven't looked hard enough). It started out as a generalization of Paul Christiano's IDA, but also covers things like safe recursive self-improvement.

Start with a team of one or more humans (researchers, programmers, trainers, and/or overseers), with access to zero or more AIs (initially as assistants). In each round, the human/AI team develops a new AI and adds it to the team, and this repeats until maturity in AI technology is achieved. Safety/alignment is ensured by having some set of safety/alignment properties on the team that is inductively maintained by the development process.
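
To make the inductive structure explicit, here is a minimal sketch in Python. Every name in it (`Team`, `satisfies_safety_invariant`, `is_mature`) is a hypothetical placeholder, and the stand-in function bodies are assumptions rather than real procedures; the hard part, of course, is defining and verifying the invariant.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Team:
    humans: List[str]                              # researchers, programmers, trainers, overseers
    ais: List[str] = field(default_factory=list)   # AI assistants developed so far


def develop_new_ai(team: Team) -> str:
    """Stand-in for the actual R&D process carried out by the human/AI team."""
    return f"AI-{len(team.ais) + 1}"


def satisfies_safety_invariant(team: Team, candidate: str) -> bool:
    """Check that adding the candidate preserves the chosen safety/alignment properties.

    Stand-in: always passes here; in reality this is the hard part.
    """
    return True


def is_mature(team: Team) -> bool:
    """Stand-in for 'maturity in AI technology is achieved': stop after three rounds."""
    return len(team.ais) >= 3


def inductive_development(team: Team) -> Team:
    # Base case: the initial all-human team is assumed to satisfy the invariant.
    while not is_mature(team):
        candidate = develop_new_ai(team)
        # Inductive step: only add the new AI if the invariant is maintained.
        if not satisfies_safety_invariant(team, candidate):
            raise RuntimeError("Invariant would be violated; halt development.")
        team.ais.append(candidate)
    return team


print(inductive_development(Team(humans=["researcher", "overseer"])))
```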

The reason I started thinking in this direction is that Paul's approach seemed very hard to knock down: any time a flaw or difficulty is pointed out, or someone expresses skepticism about some technique it uses or about the overall safety invariant, there's always a list of other techniques or invariants that could be substituted in for that part (sometimes in my own brain as I tried to criticize some part of it). Eventually I realized this shouldn't be surprising, because IDA is an instance of this more general model of safety-oriented AI development, so there are bound to be many points near it in the space of possible safety-oriented AI development practices. (Again, this may already be obvious to others, including Paul, and in their minds IDA is perhaps already a cluster of possible development practices consisting of the most promising safety techniques and invariants, rather than a single point.)

If this model turns out not to have been written down before, perhaps it should be assigned a name, like Iterated Safety-Invariant AI-Assisted AI Development, or something pithier?

cousin_it

My two biggest objections to that kind of plan:

1) It feels like passing the buck, which is a known antipattern in thinking about AI.

2) With a "soft" self-improving entity, like a team of people and AIs, most invariants you can define will also be "soft" and prone to drift over many iterations.

That's why I'd prefer a more object-level solution to alignment, if we can have it. But maybe we can't have it.

Wei Dai

1) It feels like passing the buck, which is a known antipattern in thinking about AI.

Not sure what you mean by this or by "more object-level solution to alignment". Please explain more?

2) With a “soft” self-improving entity, like a team of people and AIs, most invariants you can define will also be “soft” and prone to drift over many iterations.

Yeah I agree with this part. I think defining an invariant that is both "good enough" and achievable/provable will be very hard or maybe just impossible.

cousin_it
Not sure what you mean by this or by "more object-level solution to alignment". Please explain more?

The proposed setup can be seen as a self-improving AI, but a pretty opaque one. To explain why it makes a particular decision, we must appeal to anthropomorphism, like "our team of researchers wouldn't do such a stupid thing". That seems prone to wishful thinking. I would prefer to launch an AI for which at least some decisions have non-anthropomorphic explanations.

How is this stating anything more than "the whole is safe if all the parts are safe"? Like saying a mathematical proof is valid if all its steps are valid, this is almost useless if you don't know which individual steps are valid or safe.

The idea of "mathematical proof" is useful if someone has never thought of the concept before; for more specifics you need to look at individual proposed proofs. Similarly, people have proposed specific approaches for developing a safe AI, which we can look at if we want to know "which individual steps are valid or safe", but having the more general concept seems useful if you hadn't thought of it before. (I did state that this "may be trivial or obvious for a lot of people", and also talked about what I personally got out of thinking this way in the paragraph just below the box.)

Shorter name candidates:

Inductively Aligned AI Development

Inductively Aligned AI Assistants

Maybe one of the problems with the idea of "alignment" is that it is named as a noun, so we describe it as a thing that could actually exist, when in fact it is only a high-level description of some hypothetical relation between two complex systems. In that case, it is not a "liquid" and can't be "distilled". I will illustrate this consideration with the following example:

Imagine that I can safely ride a bike at 20 km/h, and after some training I can extend my safe speed by 1 km/h, so it seems reasonable to conclude that I could distill "safe driving" up to 21 km/h. Repeating this process, I could reach higher and higher speeds. However, it is also obvious that I will have a fatal crash somewhere between 100 and 200 km/h, because at higher speeds the probability of an accident grows exponentially. Accidents are the real thing; "safety" is only a high-level description of driving habits.
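
As a rough numerical sketch of this analogy (the exponential model and the specific numbers below are illustrative assumptions, nothing more):

```python
# Assumed model: negligible per-ride accident risk at 20 km/h,
# doubling with every additional 10 km/h, capped at certainty.
def accident_probability(speed_kmh: float) -> float:
    return min(1.0, 1e-4 * 2 ** ((speed_kmh - 20) / 10))


# Each 1 km/h increment looks individually safe, yet cumulative survival
# collapses somewhere between 100 and 200 km/h.
survival = 1.0
for speed in range(20, 201):
    survival *= 1 - accident_probability(speed)
    if speed % 40 == 0:
        print(f"{speed:3d} km/h: accident risk {accident_probability(speed):.3f}, "
              f"cumulative survival {survival:.3f}")
```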

Conclusion: accidents can be avoided by not riding a bike or by limiting the bike's speed, but safety can't be stretched indefinitely. Thus AI development should not be "safety" or "alignment" oriented, but disaster-avoidance oriented.

It's a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like the manipulation or moral hazard risks of the humans having access to some set of AI systems).
