Luke · 2y

In my view, the key purpose of interpretability is to translate model behavior into a representation that is readily understood by humans. This representation may include first-order information (e.g., the feature-attribution techniques that are common now), but it should also include higher-order side effects induced by the model as it is deployed in an environment. This higher-order information will be critical for reasoning about unintended emergent properties that may arise, as well as for bounding their likelihood with formal guarantees.

If you view alignment as a controls problem (e.g., [1]), interpretability gives us a mechanism for assessing (and forecasting) the measured output of a system. This step is necessary for taking appropriate corrective action that reduces the measured error; in this sense, interpretability is the inverse of the alignment problem. This notion of interpretability captures many of the points mentioned in the list above, especially #1, #2, #3, #7, #8, and #9.
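
To make the analogy concrete, here is a minimal sketch of that feedback loop, with interpretability playing the role of the sensor. All names (`measure_behavior`, `apply_correction`, `feedback_loop`) are hypothetical placeholders for illustration, not any existing tool or API:

```python
# Sketch of the control-theory framing: interpretability acts as the sensor
# in a feedback loop, turning model behavior into a measurable signal that
# can be compared against the intended behavior. Names are illustrative only.

def measure_behavior(model_state: float) -> float:
    """Stand-in for an interpretability tool: maps internal model state
    to a human-legible summary of its behavior (assumed perfect here)."""
    return model_state


def apply_correction(model_state: float, error: float, gain: float = 0.5) -> float:
    """Corrective action (e.g., training pressure) proportional to the
    measured misalignment, as in a simple proportional controller."""
    return model_state + gain * error


def feedback_loop(initial_state: float, intended_behavior: float, steps: int = 20) -> float:
    state = initial_state
    for _ in range(steps):
        measured = measure_behavior(state)    # interpretability: assess the output
        error = intended_behavior - measured  # misalignment relative to intent
        state = apply_correction(state, error)  # corrective action reduces error
    return state


if __name__ == "__main__":
    final = feedback_loop(initial_state=0.0, intended_behavior=1.0)
    print(f"behavior after correction loop: {final:.4f}")  # converges toward 1.0
```

The point of the sketch is only that corrective action is impossible without the measurement step: if `measure_behavior` is noisy, incomplete, or misses higher-order effects, the loop drives the system toward the wrong target.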

[1] https://en.wikipedia.org/wiki/Control_theory#/media/File:Feedback_loop_with_descriptions.svg