In my view, the key purpose of interpretability is to translate model behavior into a representation that is readily understood by humans. This representation may include first-order information (e.g., the feature attribution techniques that are common now), but should also include higher-order side-effects induced by the model as it is deployed in an environment. This second-order information will be critical for reasoning about unintended emergent properties that may arise, as well as for bounding their likelihood under formal guarantees.
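To make the "first-order information" concrete, here is a minimal sketch of one common feature attribution method, gradient × input, on a toy PyTorch model. The model and names here are illustrative assumptions, not a specific method endorsed above; it just shows the kind of per-feature signal such techniques surface.

```python
# Minimal gradient-x-input feature attribution sketch (illustrative only).
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Hypothetical toy model standing in for the model being interpreted."""
    def __init__(self, n_features: int = 4):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x)

model = ToyNet()
x = torch.randn(1, 4, requires_grad=True)

# Forward pass, then backpropagate to get d(output)/d(input).
output = model(x).sum()
output.backward()

# Gradient x input: a first-order attribution of the output to each feature.
attribution = (x.grad * x).detach()
print(attribution)
```

Note that this captures only the local, first-order picture; the higher-order deployment side-effects discussed above are exactly what a snippet like this cannot see.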
If you view alignment a...