I had an idea for short-term, non-superhuman AI safety that I recently wrote up and have now posted on arXiv. This post introduces the idea and requests feedback from a more safety-oriented group than those I would otherwise present it to.
In short, the paper tries to adapt a paradigm that Mobileye has presented for autonomous vehicle safety to a much more general setting. The paradigm is to have a "safety envelope" that is dictated by an algorithm separate from the policy algorithm for driving, setting speed and distance limits for the vehicle based on the positions of the vehicles around it.
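To make the separation concrete, here is a minimal sketch of an envelope that caps whatever speed the driving policy requests. It is not Mobileye's actual formulation: it assumes a stationary obstacle ahead, and the function names, the reaction time, and the braking deceleration are all illustrative assumptions.

```python
import math

def max_safe_speed(gap_m, reaction_s=1.0, brake_mps2=6.0):
    """Largest speed (m/s) from which the car can still stop within gap_m.

    Simplified stopping-distance model (an assumption, not Mobileye's rule):
        v * reaction_s + v**2 / (2 * brake_mps2) = gap_m
    solved for the non-negative root v.
    """
    if gap_m <= 0:
        return 0.0
    bt = brake_mps2 * reaction_s
    return -bt + math.sqrt(bt * bt + 2.0 * brake_mps2 * gap_m)

def apply_envelope(policy_speed, gap_m):
    """Clamp the policy's requested speed to the envelope.

    The policy never sees how the limit was computed; it only has its
    output capped.
    """
    return max(0.0, min(policy_speed, max_safe_speed(gap_m)))
```

The point of the design is that `apply_envelope` sits outside the policy: the policy can be any learned system, while the limit comes from a small, auditable model.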
For self-driving cars, this works well because there is a physics-based model of the system that can be used to find an algorithmic envelope. In arbitrary other systems, it works less well, because we don't have good fundamental models for what safe behavior means. For example, in financial markets there are "circuit breakers" that function as an opportunity for the system to take a break when something unexpected happens. The values for the circuit breakers are set via a simple heuristic that doesn't relate to the dynamics of the system in question. I propose taking a middle path: dynamically learning a safety envelope.
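As a rough illustration of the difference between a fixed circuit breaker and a learned envelope, the sketch below starts from a static band and switches to a band estimated from recent observations once enough data is available. The class name, the mean plus or minus k standard deviations rule, and every parameter are assumptions chosen for brevity, not the method in the paper.

```python
from collections import deque
import statistics

class LearnedEnvelope:
    """Band of allowed values learned from a rolling window (illustrative)."""

    def __init__(self, fallback, window=100, k=4.0, min_samples=10):
        self.fallback = fallback            # static (low, high) heuristic band
        self.history = deque(maxlen=window) # recent observations
        self.k = k                          # band width in standard deviations
        self.min_samples = min_samples

    def bounds(self):
        # Until enough data exists, behave like a fixed circuit breaker.
        if len(self.history) < self.min_samples:
            return self.fallback
        data = list(self.history)
        mean = statistics.fmean(data)
        std = statistics.pstdev(data)
        return mean - self.k * std, mean + self.k * std

    def allows(self, value):
        low, high = self.bounds()
        return low <= value <= high

    def observe(self, value):
        # Called by the safety system, never by the policy.
        self.history.append(value)
```

One design question this leaves open is which observations should be fed to `observe`: if values from an unfolding failure are included, the band widens to accommodate them.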
In building separate models for safety and for policy, I think the system can address a different problem being discussed in military and other AI contexts, which is that "Human-in-the-Loop" control is impossible for typical ML systems, since it slows reaction time down to the level of human reactions. The proposed safety-envelope learning system can be meaningfully controlled by humans, because the envelope can adapt on a slower timescale than the policy system that makes the lower-level decisions.
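The two timescales can be sketched as follows: the envelope is enforced on every fast policy step, but a change to the envelope only takes effect after a human approves it. The class and method names are hypothetical, and a single symmetric limit stands in for whatever the real envelope would be.

```python
class ReviewedEnvelope:
    """Bound on actions that only changes with human approval (illustrative)."""

    def __init__(self, limit):
        self.limit = limit      # active bound, used on the fast path
        self.proposed = None    # pending bound awaiting human review

    def clamp(self, action):
        # Fast path: runs on every policy step with no human involved.
        return max(-self.limit, min(action, self.limit))

    def propose(self, new_limit):
        # Slow path: the envelope learner suggests a change but cannot
        # activate it on its own.
        self.proposed = new_limit

    def review(self, approve):
        # Human decision, made at human speed; clears the pending proposal.
        if approve and self.proposed is not None:
            self.limit = self.proposed
        self.proposed = None
```

Under these assumptions the human is in the loop for the envelope, not for each individual action, so the policy keeps its machine-speed reaction time.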
1) How do we build heuristic safety envelopes in practice?
This depends on the system in question. I would be very interested in identifying domains where this class of solution could be implemented, either in toy models, or in full systems.
2) Why is this better than a system that optimizes for safety?
Balancing optimization for goals against optimization for safety within a single system can lead to perverse effects. If the system optimizing for safety is segregated, and the policy engine is not given access to it, this should not occur.
This also allows the safety system to be built and monitored by a regulator, instead of by the owners of the system. In the case of Mobileye's proposed system, a self-driving car could have the parameters of the safety envelope dictated by traffic authorities, instead of needing to rely on the car manufacturers to implement systems that drive safely as determined by those manufacturers.
3) Are there any obvious shortcomings to this approach?
Yes. This does not scale to human-level or superhuman general intelligence, because a system aware of the constraints can attempt to design policies for avoiding them. It is primarily intended to serve as a stop-gap measure to marginally improve the safety of near-term machine learning systems.
I have been writing a long draft about "active AI boxing" for a while, but this idea is new for me.
The paper is now live on arXiv: https://arxiv.org/abs/1811.09246