Things I'm confused about:
How could we verify the mechanism by which the model outputs ‘true’ representations of its processing?
Re ‘translation mechanism’: How could a model use language to describe its processing if it includes novel concepts, mechanisms, or objects for which there are no existing examples in human-written text? Can a model fully know what it is doing?
Supposing an AI was capable of at least explaining around or gesturing towards this processing in a meaningful way, would humans be able to interpret these explanations well enough for the descriptions to be useful?
Could a model checked for myopia be deceptive in its presentation of being myopic? How do you actually test this?
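One naive behavioral probe (a toy sketch with invented policies, not a real evaluation, and it would not catch a model that only behaves myopically while under test): check whether a policy's choices are invariant to changes in rewards beyond its claimed horizon. A genuinely myopic policy should ignore them.

```python
# Toy probe: a policy claiming to be myopic (horizon 1) should pick the
# same action whether or not far-future rewards change. All policies and
# reward setups here are invented for illustration.

def greedy_policy(reward_seqs):
    # Myopic: picks the action with the best immediate (step-0) reward.
    return max(range(len(reward_seqs)), key=lambda a: reward_seqs[a][0])

def farsighted_policy(reward_seqs):
    # Non-myopic: picks the action with the best total reward.
    return max(range(len(reward_seqs)), key=lambda a: sum(reward_seqs[a]))

def looks_myopic(policy):
    base = [[1.0, 0.0], [0.5, 0.0]]          # action 0 best now and overall
    perturbed = [[1.0, 0.0], [0.5, 100.0]]   # only the future reward changed
    return policy(base) == policy(perturbed)

print(looks_myopic(greedy_policy))      # unaffected by the future reward
print(looks_myopic(farsighted_policy))  # swayed by the future reward
```

The probe only shows behavioral consistency on the cases tried, which is exactly the problem raised above: a deceptive model could pass it while planning over longer horizons internally.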
Things I'm sort of speculating about:
Could you train a model so that it must provide a textual explanation for a policy update that is validated before the update is actually applied, so that essentially every part of its policy is pre-screened and it can only learn to take interpretable actions? Writing this out, though, I see how it devolves into providing explanations that sound great but don’t represent what is actually happening.
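The gating idea above can be sketched as a toy loop. Everything here (the single-weight policy, the keyword-based validator, the templated "explanation") is invented for illustration; in particular, a real version would need model-generated explanations, and the failure mode noted above is precisely that the validator can be satisfied by explanations that sound good without being faithful.

```python
# Toy sketch of a pre-screened policy update: the update is applied only
# if its accompanying explanation passes validation. Invented setup.

class Policy:
    def __init__(self):
        self.weights = [0.0]

    def propose_update(self, reward):
        # Toy "update": nudge the single weight toward the reward signal.
        return [w + 0.1 * reward for w in self.weights]

    def explain(self, proposed):
        # Stand-in for a model-generated explanation of the change.
        delta = proposed[0] - self.weights[0]
        return f"increase weight by {delta:.2f} to follow reward"

def validate(explanation):
    # Stand-in screen (human or automated). Trivially gameable: it only
    # checks that the explanation mentions the reward signal.
    return "reward" in explanation

def gated_step(policy, reward):
    proposed = policy.propose_update(reward)
    if validate(policy.explain(proposed)):
        policy.weights = proposed   # update accepted
        return True
    return False                    # update rejected; policy unchanged

p = Policy()
accepted = gated_step(p, reward=1.0)
print(accepted, p.weights)
```

The `validate` stub makes the worry concrete: any optimization pressure flows toward explanations that pass the screen, not toward explanations that are true.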
Could it be fruitful to focus some effort on building quality human-centric-ish world models (maybe even just starting with something like topography) into AI models to improve interpretability, i.e. to provide a base we are at least familiar with, off of which they can build, as opposed to having no insight into their world representation?