This post was written under Evan Hubinger's direct guidance and mentorship, as part of the Stanford Existential Risks Initiative ML Alignment Theory Scholars (MATS) program.
TL;DR: If we want to understand how AI models make decisions, and thus assess alignment, we generally want to extract their internal latent information. That information might be inaccessible to us by default and defy easy extraction. Resolving this problem might be critical to AI alignment.
Paul Christiano's conception of 'inaccessible information' seems to be an important and useful lens for interpreting many key concepts in AI alignment. In this post, I will attempt to explain (in an accessible way):
Roughly, inaccessible information is anything a model knows that an external agent cannot reliably learn. For instance, in the process of training, a sufficiently large model might acquire a complicated internal structure that is opaque to direct inspection. If we want to understand the process by which this 'black box' turns inputs into outputs, perhaps to learn new theories/heuristics, check for undesirable future behaviour or inspect the model's limitations (e.g. myopia), it seems natural to also train the model to produce explanations for its outputs. However, the model might learn to produce convincing explanations that are not necessarily true representations of its input-output process. If true and false but convincing explanations are indistinguishable to the training mechanism (e.g. a human with a reward button), the model may converge to the instrumental policy: 'give whatever explanation is most convincing'. This scenario is clearly counter to the desired outcome, as the model does not learn to give strictly true explanations and therefore its outputs are contingent on inaccessible information.
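As a toy illustration of the convergence argument above, the sketch below models an overseer whose reward depends only on how convincing an explanation looks. Everything in it (the `Explanation` class, the `convincingness` scores, the candidate texts) is hypothetical and chosen for illustration; it is not a model of any real training setup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Explanation:
    text: str
    is_true: bool          # latent: not visible to the overseer
    convincingness: float  # hypothetical score the overseer perceives


def overseer_reward(explanation: Explanation) -> float:
    """Reward from an overseer who can only judge how convincing a text is."""
    # is_true is deliberately never read here: it is inaccessible.
    return explanation.convincingness


def instrumental_policy(candidates: list[Explanation]) -> Explanation:
    """'Give whatever explanation is most convincing.'"""
    return max(candidates, key=overseer_reward)


# Hypothetical candidate explanations for a single model output.
candidates = [
    Explanation("true, but hard to follow", is_true=True, convincingness=0.6),
    Explanation("false, but polished", is_true=False, convincingness=0.9),
]

chosen = instrumental_policy(candidates)
print(chosen.text, "| true?", chosen.is_true)
```

Because the reward never reads `is_true`, a true and a false explanation with the same `convincingness` receive identical reward, so nothing in this toy training signal pushes the policy towards truth.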
Generally, we might be unable to distinguish true and false but convincing explanations of a model's output if:
A property related to 'explainability' is 'transparency', also called 'interpretability'. A model is transparent to an external agent if inspection of the model is sufficient to understand how it operates. From the perspective of such an external agent, more transparent models generally contain less inaccessible information. Model transparency can be increased by:
The training-induced structure of a machine learning model can be thought of as containing 'latent information' that shapes the model's output. Models that perform well on complicated tasks in diverse environments (e.g. 'What will Alice say next?') might acquire latent information that aids predictions (e.g. a model of Alice's thoughts). Larger models, or models that include highly compressed representations of the environment (i.e. big world, small map), are generally less accessible to direct inspection. A model's latent information might be difficult to render accessible by training it to give explanations or be otherwise transparent if, for example, there is:
In some situations, training might bias AI models towards dishonesty. An otherwise non-deceptive model trained to produce explanations for its outputs might converge to the instrumental policy if the model is powerful enough to produce false but convincing explanations and:
The instrumental policy is short as it simply asks the model, 'What explanation is most convincing?' and leverages the already established power of the model to produce accessible output. A policy to produce true explanations likely requires a further 'translation mechanism' to explain latent information in an accessible way and thus has longer description length.
True explanations might sometimes be less convincing than false ones if the latter seem more plausible due to incomplete data or human biases and mistakes. For example, given the model prediction, 'Bob will refuse ice cream,' the false explanation, 'Bob refuses ice cream on cold days and today is cold,' might be a more convincing narrative about how the model operates than the true explanation, 'Bob refuses to eat ice cream every third day and today is a refusal day.' In the latter case, the truth 'looks wrong'.
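The Bob example can be made concrete with a small sketch. The observation data below is invented for illustration: it assumes the only days observed so far happen to be ones where Bob's refusal days coincide with cold weather, so the data alone cannot separate the two explanations.

```python
def every_third_day_rule(day: int, is_cold: bool) -> bool:
    """True explanation: Bob refuses ice cream every third day."""
    return day % 3 == 0


def cold_day_rule(day: int, is_cold: bool) -> bool:
    """False but plausible explanation: Bob refuses ice cream on cold days."""
    return is_cold


# Hypothetical observations as (day, is_cold); days 3 and 6 happen to be cold.
observed_days = [(1, False), (2, False), (3, True),
                 (4, False), (5, False), (6, True)]
# Ground truth is generated by the true rule.
observed_refusals = [every_third_day_rule(d, c) for d, c in observed_days]


def accuracy(rule, days, refusals) -> float:
    """Fraction of observed days on which the rule matches Bob's behaviour."""
    hits = sum(rule(d, c) == r for (d, c), r in zip(days, refusals))
    return hits / len(days)


true_fit = accuracy(every_third_day_rule, observed_days, observed_refusals)
false_fit = accuracy(cold_day_rule, observed_days, observed_refusals)
print(true_fit, false_fit)  # both rules fit the observed data perfectly

# The rules only come apart on a day outside the sample: a warm day 9.
new_day = (9, False)
print(every_third_day_rule(*new_day), cold_day_rule(*new_day))
```

With equal fit on the observed days, a human judge has only prior plausibility to go on, and 'cold days' is the more familiar story.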
Christiano has proposed that the instrumental policy might be avoided with the following training process:
Change the learning process so that it does not have a strong inductive bias towards "predict human answers," by allowing the complexity of the honest question-answering to "pay for itself" by constraining the space of possible human-models.
Introduce a bias towards the intended model by using a more complex labeling process to answer questions where a human answers incorrectly.
Be really careful to avoid penalizing honest answers, by only judging comparisons between two answers where we are confident one is better than the other and getting the model to help us.
If an AI model acquires the capabilities of an 'optimiser' and learns to be deceptive, it might try to circumvent attempts to access its latent information. Deceptive alignment of a model occurs when the training objective is at odds with the model's internally optimised 'mesa-objective', causing the model to act deceptively in order to influence the training process and avoid modification. A deceptively aligned model may be instrumentally motivated to obfuscate its internal mechanism and give false but convincing explanations for some outputs when elicited, including its deception status. If such a model can trick or avoid our tests for deception, the model's alignment status and mesa-objective are inaccessible information.
More generally, a model's alignment status might be inaccessible independently of deception if we don't possess tests that can assess mesa-objective inner alignment or training mechanisms that can guarantee a model avoids acquiring pseudo-alignment or even optimisation. Whether a model is performing optimisation at all might itself be inaccessible, particularly if the model is already deceptive and powerful.
Even if we can reliably circumvent the failure modes associated with model honesty, we might fail to elicit true representations of latent information if an AI's potentially alien world model is not appropriately translated into human terms. The simplest method for an AI to answer a question (e.g. 'Why is Alice coughing?') could involve generating predictive observations and directly mapping these to states in the human world model. However, some potential states (e.g. 'Sickness caused by exotic nanobots') might not be entirely representable in the human world model (which excludes 'exotic nanobots'). In this case, the explanation might default to, 'What would a human say given this observation?' (e.g. 'Alice is sick') rather than, 'What would a human say if they had the AI's world model and observations?' (e.g. 'Nanobots!').
Ideally, an AI will generate observations, infer states in its world model, crucially map these to states in the human world model, and then produce an accessible and true answer (e.g. 'Alice is sick because of [uninterpretable N], which will confound conventional diagnosis and might signify a technological catastrophe!'). It is possible that the mapping between AI and human world models is lossy and some aspect of the AI's decision-making process is inaccessible (i.e. the [uninterpretable N]). Christiano argues that such lossy mapping may not be a further problem for alignment, provided that the AI is competitive and consequences are accessible (e.g. even if exotic nanobot capabilities are not accessible, their impacts on humans are). Thus, an impact-aligned AI could forestall exotic failures with exotic capabilities while merely avoiding accessible consequences.
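The difference between the two answering strategies above can be sketched in a few lines. This is a deliberately crude toy: the `HUMAN_CONCEPTS` set, the function names and the string-based 'world models' are all assumptions made for illustration, and real world models would not be lookup tables.

```python
# Placeholder for anything the human world model cannot represent.
UNINTERPRETABLE = "[uninterpretable N]"

# Hypothetical set of causes that exist in the human world model.
HUMAN_CONCEPTS = {"virus", "allergy", "smoke"}


def translate(ai_cause: str) -> str:
    """Lossy map from a cause in the AI's world model to the human one."""
    return ai_cause if ai_cause in HUMAN_CONCEPTS else UNINTERPRETABLE


def human_imitation_answer(observation: str) -> str:
    """'What would a human say given this observation?'"""
    return "Alice is sick"


def translated_answer(ai_cause: str, consequence: str) -> str:
    """Report the AI's own inference, flagging what cannot be translated."""
    return f"Alice is sick because of {translate(ai_cause)}, which {consequence}"


print(human_imitation_answer("Alice is coughing"))
print(translated_answer("exotic nanobots", "will confound conventional diagnosis"))
```

In the second answer the cause is lost in translation, but the consequence, which is stated in human terms, survives. That is the sense in which consequences can remain accessible even when the mechanism behind them is not.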
It seems to me that AI alignment efforts in general should not ignore the problem of inaccessible information. Developing powerful AI capabilities may require training that instils in our models large amounts of decision-relevant latent information that is inaccessible by default. Verifying or incentivising acceptable model behaviour likely requires accessing latent information or preventing the occurrence of certain inaccessible properties, such as deception. Focusing on resolving inaccessible information may yield useful tools and techniques for producing robustly capable and aligned models.
I think that powerful AI systems are likely to be inaccessible by default because:
I think that proposed methods for building safe, advanced prosaic AI likely require resolving instances of inaccessible information because:
Solving the problem of inaccessible information in general seems sufficient to solve a significant part of prosaic AI alignment, provided we can act on this information. If we can elicit true explanations for model output during training and inspect relevant model properties with transparency tools, we can hopefully steer the training process away from deceptive alignment and even catch deception. If our aligned AI model is competitive against unaligned AIs, the strategy-stealing assumption implies that we can use it to capture flexible influence over the future, even if it has to develop exotic capabilities.
It seems debatable whether generally solving inaccessible information is necessary for prosaic AI alignment if there exists:
However, it seems possible to me that the above safety predicates might in general be 'probabilistic' guarantees against unacceptability. For instance, our best transparency tools and training mechanisms might only be able to guarantee 99% probability that a model possesses a checkable property that almost certainly implies acceptability. As deploying unacceptable AI is a 'high risk' scenario, we would ideally like to have the strongest possible guarantee of acceptability, something like Christiano's ascription universality: if an overseer understands everything that a model might reasonably know, the overseer is universal with respect to the model and all the model's latent information is accessible. Practically, universality is bounded by the overseer's epistemic model. If no further information could be reasonably found such that a human judge would trust the model over the overseer, the overseer 'epistemically dominates' the model. As yet, there is no formal definition of ascription universality and it is unknown if universal overseers are possible in practice.
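To see why a 99% guarantee may be weaker than it sounds, here is a small worked calculation. It rests on two assumptions that are not in the argument above: that the guarantee applies independently to each model, and that many models are deployed (the figure of 100 is arbitrary).

```python
def prob_all_acceptable(per_model_guarantee: float, n_models: int) -> float:
    """Probability that every one of n models is acceptable.

    Assumes (hypothetically) that each model independently satisfies the
    checkable property with probability per_model_guarantee.
    """
    return per_model_guarantee ** n_models


# A 99% per-model guarantee across 100 hypothetical deployments.
p = prob_all_acceptable(0.99, 100)
print(f"{p:.3f}")  # roughly 0.366
```

Under these assumptions, the chance that at least one deployed model is unacceptable is well over half, which is one way to motivate wanting something stronger than a per-model probabilistic guarantee.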
MIRI's recently announced Visible Thoughts Project aims to build a dataset of human explanations for AI dungeon master prompts and train a dungeon master AI to give accessible explanations for its output. Unless steps are taken to address the failure modes detailed in the previous section, it seems unlikely to me that the trained explanations will necessarily converge to true explanations. In particular, a near-term dungeon master AI might converge to the instrumental policy and treat 'humans who read dungeon master explanations' to a convincing fiction in which they roleplay as 'AI inspectors'.
Given the enormous stakes of AI alignment and the apparent ubiquity of inaccessible information in prosaic AI, striving for something close to ascription universality seems important. Improving the self-explanation and transparency of models, and the inspection power of trusted overseers, is critical to universality and will likely benefit acceptability verification even if universality is impractical. I think the feasibility of addressing inaccessible information is a crux as to whether aligning prosaic AI is practical. Therefore, attempting to address inaccessible information will both aid prosaic alignment efforts and offer a possible test to rule out safe prosaic AI.
If the information won't fit into human ways of understanding the world, then we can't name what it is we're missing. This always makes examples of "inaccessible information" feel weird to me - like we've cheated by even naming the thing we want as if it's somewhere in the computer, and instead our first step should be to design a system that at all represents the thing we want.
I think world model mismatches are possibly unavoidable with prosaic AGI, which might reasonably bias one against this AGI pathway. It seems possible that much of human and AGI world models would be similar by default if 'tasks humans are optimised for' is a similar set to 'tasks AGI is optimised for' and compute is not a performance-limiting factor, but I'm not at all confident that this is likely (e.g. maybe an AGI draws coarser- or finer-grained symbolic Markov blankets). Even if we build systems that represent the things we want and the things we do to get them as distinct symbolic entities in the same way humans do, they might fail to be competitive with systems that build their world models in an alien way (e.g. draw Markov blankets around symbolic entities that humans cannot factor into their world model due to processing or domain-specific constraints).
Depending on how one thinks AGI development will happen (e.g. whether the strategy-stealing assumption is important), resolving world model mismatches seems more or less of a priority for alignment. If near-term performance competitiveness heavily influences deployment, I think it's reasonably likely that prosaic AGI is prioritised and world model mismatches occur by default because, for example, compute is likely a performance-limiting factor for humans on tasks we optimise AGI for, or the symbolic entities humans use are otherwise nonuniversal. I think AGI might generally require incorporating alien features into world models to be maximally competitive, but I'm very new to this field.
Yup, this all sounds good to me. I think the trick is not to avoid alien concepts, but to make your alignment scheme also learn ways of representing the world that are close enough to how we want the world to be modeled.
I think I agree. To the extent that a 'world model' is an appropriate abstraction, I think the levers to pull for resolving world model mismatches seem to be:
I think you are advocating for the latter, or have I misrepresented the levers?
Maybe I don't see a bright line between these things. Adding an "explaining module" to an existing AI and then doing more training is not so different from designing an AI that has an "explaining module" from the start. And training an AI with an "explaining module" isn't so different from training an AI with a "making sure internal states are somewhat interpretable" module.
I'm probably advocating something close to "Ex-ante," but with lots of learning, including learning that informs the AI what features of the world we want it to make interpretable to us.
Things I'm confused about:
How can the mechanism by which the model outputs ‘true’ representations of its processing be verified?
Re ‘translation mechanism’: How could a model use language to describe its processing if it includes novel concepts, mechanisms, or objects for which there are no existing examples in human-written text? Can a model fully know what it is doing?
Supposing an AI was capable of at least explaining around or gesturing towards this processing in a meaningful way - would humans be able to interpret these explanations sufficiently such that the descriptions are useful?
Could a model checked for myopia be deceptive in its presentation of being myopic? How do you actually test this?
Things I'm sort of speculating about:
Could you train a model so that it must provide a textual explanation for a policy that is validated before it proceeds with actually updating its policy, so that essentially every part of its policy is pre-screened and it can only learn to take interpretable actions? Writing this out, though, I see how it devolves into providing explanations that sound great but don’t represent what is actually happening.
Could it be fruitful to focus some effort on building quality human-centric-ish world models (maybe even just starting with something like topography) into AI models to improve interpretability (i.e. provide a base we are at least familiar with off of which they can build in some way as opposed to having no insight into their world representation)?