Epistemic Status: Pretty confident in the central conclusions, and very confident in the supporting claims from meta-logic. Any low-confidence conclusions are flagged as such. NB: I give an intentionally revisionary reading of what ELK is (or should be) about. Accordingly, I assume familiarity with the ELK report. Summary here.
Executive Summary
Eliciting Latent Knowledge (ELK) collapses into either the automation of science or the automation of mechanistic interpretability. I promote the latter.
Abstract
After reframing ELK from the perspective of a logician, I highlight the problem of cheap model-theoretic truth: by default, reporters will simply learn (or search for) interpretations of the predictor's net that make the teacher's answers "true" in the model-theoretic sense, whether or not they are True (correspond with reality)! This is a problem even if we manage to avoid human simulators and are guaranteed an honest translator.
The problem boils down to finding a way to force the base optimizer (e.g. gradient descent) to pay attention to the structure of the predictor's net, instead of simply treating it like putty. I argue that trying to get the base optimizer to care about the True state of affairs in the vault is not a solution to this problem, but instead the expression of a completely different problem – something like automating science. Arguably, this is not the problem we should be focused on, especially if we're just trying to solve intent alignment. Instead I tentatively propose the following solution: train the reporter on mechanistic interpretability experts, in the hope that it internalizes and generalizes their techniques. I expand this proposal by suggesting we interpret in parallel with training, availing ourselves of the history of a predictor's net in order to identify and track the birth of each term in its ontology. The overarching hope here is that if we manage to fully interpret the predictor at an earlier stage in its development