
This paper introduces an interpreter for learning algorithms, aimed at clarifying what is happening inside them.

The interpreter, Local Interpretable Model-agnostic Explanations (LIME), gives the human user some idea of the important factors going into the learning algorithm's decision, such as which parts of the input (words in a text, regions of an image, features of a data point) pushed the classifier towards its prediction.
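To make that concrete, here is a minimal sketch of LIME's local-surrogate idea (not the paper's own code), assuming a hypothetical black-box `predict_proba` function over tabular inputs: perturb the instance, weight the perturbed samples by their proximity to it, and fit a weighted linear model whose coefficients stand in for the important factors.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_explain(predict_proba, x, num_samples=1000, kernel_width=0.75):
    """Rough sketch of a LIME-style explanation for a single instance x.

    predict_proba: hypothetical black-box function mapping an array of
                   inputs to class probabilities for one class.
    x:             the 1-D instance being explained.
    """
    rng = np.random.default_rng(0)
    # Perturb the instance by adding Gaussian noise around it.
    samples = x + rng.normal(scale=0.5, size=(num_samples, x.size))
    # Query the black-box model on the perturbed points.
    targets = predict_proba(samples)
    # Weight samples by proximity to x (exponential kernel on distance).
    distances = np.linalg.norm(samples - x, axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Fit a simple weighted linear surrogate model locally.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(samples, targets, sample_weight=weights)
    # The coefficients are the "important factors" for this decision.
    return surrogate.coef_
```

The full method also maps inputs into an interpretable representation and restricts the surrogate to a handful of features, but the locally weighted fit above is the core of it.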

We can use the first Oracle design here for a similar purpose: giving clarity about what drives the outcome. The "Counterfactually Unread Agent" can already be used to see which values of a specific random variable will maximise a certain utility function. We could also search across a whole slew of random variables, to see which one is most important in maximising the given utility function; a sketch of that search is given below.
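As an illustrative sketch only (the Oracle designs are not code), assume a hypothetical `oracle_expected_utility(variable, value)` query that returns the counterfactually-evaluated expected utility when the given random variable is set to the given value. The search over variables could then look like this:

```python
def best_value(oracle_expected_utility, variable, candidate_values):
    """Find the value of one random variable that maximises expected utility."""
    return max(candidate_values,
               key=lambda v: oracle_expected_utility(variable, v))

def rank_variables(oracle_expected_utility, variables_to_values):
    """Rank random variables by the utility their best setting achieves.

    variables_to_values: dict mapping each variable to the candidate values
                         we let the Oracle consider (a hypothetical interface,
                         not something specified in the original designs).
    """
    scores = {}
    for variable, candidate_values in variables_to_values.items():
        v = best_value(oracle_expected_utility, variable, candidate_values)
        scores[variable] = oracle_expected_utility(variable, v)
    # The most "important" variable is the one whose best setting
    # yields the highest expected utility.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```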
