Representational Tethers: Tying AI Latents To Human Ones
This post is part of my hypothesis subspace sequence, a living collection of proposals I'm exploring at Refine. It is preceded by ideological inference engines, and followed by an interlude. Thanks to Adam Shimi, Alexander Oldenziel, Tamsin Leake, and Ze Shen for useful feedback.

TL;DR: Representational tethers describe ways of connecting the internal representations employed by ML models to the internal representations employed by humans. This tethering has two related short-term goals: (1) making artificial conceptual frameworks more compatible with human ones (i.e. the tension in the tether metaphor), and (2) facilitating direct translation between representations expressed in the two (i.e. the physical link in the tether metaphor). In the long term, these two mutually reinforcing goals (1) facilitate human oversight by rendering ML models more cognitively ergonomic, and (2) enable control over how exotic the internal representations employed by ML models are allowed to be.

Intro

The previous two proposals in the sequence describe means of deriving human preferences procedurally. Oversight leagues focus on the adversarial agent-evaluator dynamics as the process driving towards the target. Ideological inference engines focus on the inference algorithm as the meat of the target-approaching procedure. A shortcoming of this procedural family is that even though you thankfully don't have to plug in the final goal beforehand (i.e. the resulting evaluator or knowledge base), you still have to plug in the right procedure for getting there. You're forced to put your faith in a self-contained preference-deriving procedure rather than in an initial target.

In contrast, the present proposal tackles the problem from a different angle. It describes a way of actively conditioning the conceptual framework employed by the ML model to be compatible with human ones, as an attempt to get the ML model to form accurate conceptions of human values. If this sounds loosely related to half a dozen other proposals,
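To make the two short-term goals in the TL;DR a bit more concrete, here is a minimal sketch of one way a tether could be realized as an auxiliary training signal, assuming we have access to human-derived concept embeddings (e.g. distilled from annotations) for the same inputs the model sees. This is not the proposal's actual implementation; every name here (`TetheredEncoder`, `tether_loss`, the dimensions) is an illustrative assumption.

```python
# Sketch of a "representational tether" as an auxiliary loss, assuming
# paired (input, human concept embedding) data. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

INPUT_DIM = 128   # assumed input feature width
MODEL_DIM = 512   # assumed width of the ML model's latent space
HUMAN_DIM = 64    # assumed width of the human-derived embedding space

class TetheredEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(INPUT_DIM, MODEL_DIM), nn.ReLU())
        # The "physical link" of the tether: a learned translation map
        # from model latents into the human representation space.
        self.translator = nn.Linear(MODEL_DIM, HUMAN_DIM)

    def forward(self, x):
        latent = self.encoder(x)
        return latent, self.translator(latent)

def tether_loss(translated, human_embedding):
    # The "tension" of the tether: pull translated latents toward the
    # human-derived embeddings (cosine distance is one choice of many).
    return 1 - F.cosine_similarity(translated, human_embedding, dim=-1).mean()

# Usage: the tether term would be weighted against the ordinary task loss,
# e.g. total_loss = task_loss + lam * tether_loss(...).
model = TetheredEncoder()
x = torch.randn(8, INPUT_DIM)            # a batch of inputs
human_emb = torch.randn(8, HUMAN_DIM)    # stand-in human concept embeddings
latent, translated = model(x)
loss = tether_loss(translated, human_emb)
loss.backward()
```

In this framing, the translation map supplies goal (2), direct translation between the two kinds of representations, while the loss term supplies goal (1), pressure toward compatibility; the weighting on the tether term would then control how exotic the model's latents are allowed to remain.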