Suppose Alice and Bob are two Bayesian agents in the same environment. They both basically understand how their environment works, so they generally agree on predictions about any specific directly-observable thing in the world - e.g. whenever they try to operationalize a bet, they find that their odds are roughly the same. However, their two world models might have totally different internal structure, different “latent” structures which Alice and Bob model as generating the observable world around them. As a simple toy example: maybe Alice models a bunch of numbers as having been generated by independent rolls of the same biased die, and Bob models the same numbers using some big complicated neural net.

Now suppose Alice goes poking around inside of her world model, and somewhere in there she finds a latent variable $\Lambda_A$ with two properties (the Natural Latent properties):

• $\Lambda_A$ approximately mediates between two different observable parts of the world, $X_1$ and $X_2$
• $\Lambda_A$ can be estimated to reasonable precision from either one of the two parts
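
As a rough sketch, the two properties can be written in symbols. The notation here is an assumption made for illustration: $\Lambda_A$ is Alice's latent, $X_1$ and $X_2$ are the two observable parts, and the size of the approximation error is left unspecified. This is one plausible formalization, not necessarily the exact one the authors use.

```latex
% Mediation: given the latent, the two observable parts are
% approximately independent of each other.
P[X_1, X_2 \mid \Lambda_A] \approx P[X_1 \mid \Lambda_A] \, P[X_2 \mid \Lambda_A]

% Redundancy: either part alone tells us approximately as much
% about the latent as both parts together.
P[\Lambda_A \mid X_1, X_2] \approx P[\Lambda_A \mid X_1] \approx P[\Lambda_A \mid X_2]
```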

In the die/net case, the die’s bias ($\Lambda_A$) approximately mediates between e.g. the first 100 numbers ($X_1$) and the next 100 numbers ($X_2$), so the first condition is satisfied. The die’s bias can be estimated to reasonable precision from either the first 100 numbers or the second 100 numbers, so the second condition is also satisfied.
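
The die example can be simulated in a few lines. This is a minimal sketch: the specific bias values, the seed, and the helper names (`roll_biased_die`, `estimate_bias`, `max_abs_diff`) are all hypothetical choices made for illustration. The rolls are drawn independently given the bias, so mediation holds by construction. The code then checks how well the bias can be estimated from each block of 100 rolls on its own.

```python
import random
from collections import Counter

FACES = [1, 2, 3, 4, 5, 6]

def roll_biased_die(bias, n, rng):
    """Draw n independent rolls from a die whose face probabilities are `bias`."""
    return rng.choices(FACES, weights=bias, k=n)

def estimate_bias(rolls):
    """Estimate the face probabilities by the empirical frequencies."""
    counts = Counter(rolls)
    return [counts[face] / len(rolls) for face in FACES]

def max_abs_diff(p, q):
    """Largest per-face gap between two probability vectors."""
    return max(abs(a - b) for a, b in zip(p, q))

rng = random.Random(0)                        # fixed seed, arbitrary choice
true_bias = [0.4, 0.2, 0.1, 0.1, 0.1, 0.1]    # hypothetical bias (the latent)

rolls = roll_biased_die(true_bias, 200, rng)
x1, x2 = rolls[:100], rolls[100:]             # first 100 and next 100 numbers

est1 = estimate_bias(x1)                      # estimate from the first part only
est2 = estimate_bias(x2)                      # estimate from the second part only

print("error from first 100: ", max_abs_diff(est1, true_bias))
print("error from second 100:", max_abs_diff(est2, true_bias))
print("gap between estimates:", max_abs_diff(est1, est2))
```

With 100 rolls per block, each estimate is typically within several percentage points of the true bias. Larger blocks would tighten that further.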

This allows Alice to say some interesting things about the internals of Bob’s model.

First: if there is any latent variable (or set of latent variables, or function of latent variables) $\Lambda_B$ which mediates between $X_1$ and $X_2$ in Bob’s model, then Bob’s $\Lambda_B$ encodes Alice’s $\Lambda_A$ (and potentially other stuff too). In the die/net case: during training, the net converges to approximately match whatever predictions Alice makes (by assumption), but the internals are a mess. An interpretability researcher pokes around in there, and finds some activation vectors which approximately mediate between $X_1$ and $X_2$. Then Alice knows that those activation vectors must approximately encode the bias $\Lambda_A$. (The activation vectors could also encode additional information, but at a bare minimum they must encode the bias.)

Second: if there is any latent variable (or set of latent variables, or function of latent variables) $\Lambda_B$ which can be estimated to reasonable precision from just $X_1$, and can also be estimated to reasonable precision from just $X_2$, then Alice’s $\Lambda_A$ encodes Bob’s $\Lambda_B$ (and potentially other stuff too). Returning to our running example: suppose our interpretability researcher finds that the activations along certain directions can be precisely estimated from just $X_1$, and the activations along those same directions can be precisely estimated from just $X_2$. Then Alice knows that the bias $\Lambda_A$ must give approximately-all the information which those activations give. (The bias could contain more information - e.g. maybe the activations in question only encode the rate at which a 1 or 2 is rolled, whereas the bias gives the rate at which each face is rolled.)
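
The parenthetical about "more information" can be made concrete with a tiny sketch. The function name `rate_of_one_or_two` and the two bias vectors are hypothetical, chosen only to illustrate the asymmetry. The coarse summary is a function of the full bias, so the bias encodes it. Two different biases can share the same summary, so the summary does not encode the bias.

```python
import math

def rate_of_one_or_two(bias):
    """Coarse summary of a die's bias: the probability of rolling a 1 or a 2."""
    return bias[0] + bias[1]

# Two hypothetical dice that differ in how often each face comes up...
bias_a = [0.5, 0.25, 0.0625, 0.0625, 0.0625, 0.0625]
bias_b = [0.25, 0.5, 0.0625, 0.0625, 0.0625, 0.0625]

# ...but agree on the coarse summary.
same_summary = math.isclose(rate_of_one_or_two(bias_a), rate_of_one_or_two(bias_b))
same_bias = bias_a == bias_b

print(same_summary, same_bias)  # prints: True False
```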

Third, putting those two together: if there is any latent variable (or set of latent variables, or function of latent variables) $\Lambda_B$ which approximately mediates between $X_1$ and $X_2$ in Bob’s model, and can be estimated to reasonable precision from either one of $X_1$ or $X_2$, then Alice’s $\Lambda_A$ and Bob’s $\Lambda_B$ must be approximately isomorphic - i.e. each encodes the other. So if an interpretability researcher finds that activations along some directions both mediate between $X_1$ and $X_2$ and can be estimated to reasonable precision from either of $X_1$ or $X_2$, then those activations are approximately isomorphic to what Alice calls “the bias of the die”.
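
The three claims can be summarized schematically. This is a loose sketch rather than a precise theorem statement. It assumes $\Lambda_A$ is Alice's natural latent over $X_1, X_2$ and $\Lambda_B$ is a latent in Bob's model. The functions $f$ and $g$ are hypothetical names standing in for "encodes", which here is read as "approximately determines".

```latex
% First: a mediator in Bob's model encodes Alice's natural latent.
\Lambda_B \text{ mediates between } X_1 \text{ and } X_2
  \;\implies\; \Lambda_A \approx f(\Lambda_B)

% Second: a latent of Bob's that is estimable from either part alone
% is encoded by Alice's natural latent.
\Lambda_B \text{ estimable from } X_1 \text{ alone and from } X_2 \text{ alone}
  \;\implies\; \Lambda_B \approx g(\Lambda_A)

% Third: if both hold, each latent encodes the other
% (approximate isomorphism).
\text{both conditions on } \Lambda_B
  \;\implies\; \Lambda_A \approx f(\Lambda_B) \ \text{ and } \ \Lambda_B \approx g(\Lambda_A)
```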

So What Could We Do With That?

We’ll give a couple relatively-legible examples of the possibilities natural latents potentially unlock, though these definitely aren’t the only applications.

Interpretability

There’s a “conceptually easy” version of interpretability, where you try to reduce a net to some simpler equivalent circuit. That’s pretty common to attempt, but unfortunately it’s not where most of the value is for e.g. AI alignment. The value is in a “conceptually harder” version of interpretability: map some stuff inside the net to some stuff outside the net which the net internals represent. In particular, we’d like to map some net internals to human-interpretable externals, i.e. stuff in the environment which humans represent in their internal world-models.

So: there’s some internal (i.e. latent) structures within the net, and some internal (i.e. latent) structures in a human’s mind, and the hard (and high-value) version of interpretability is about faithfully mapping between those two.

… and in general, that’s a pretty tough problem. Even if humans and nets converge on pretty similar distributions over “observables” (e.g. a foundation model generates text very similar to the distribution of text which humans generate, or an image model has a very similar idea to humans of what real-world images look like) the human and the net can still have wildly different internals. Indeed, their internal ontologies could in-principle be totally different; there might not be any faithful mapping between the two at all… though that’s not what I expect, and even today’s still-fairly-primitive interpretability techniques sure seem to suggest that nets’ internal ontologies are not totally alien.

What the natural latent conditions give us is a tool to bridge the gap between internals of a net’s model and internals of a human’s model. It lets us say anything at all about the latent structure internal to one agent, based on some very simple and general conditions on the internal structure of another agent. And in particular, it lets us establish approximate isomorphism in at least some cases.

Value Learning and The Pointers Problem

The Pointers Problem says: whatever goals/values/utility humans have, the inputs to humans’ goals/values/utility are latent variables in the humans’ world models, as opposed to e.g. low-level states of the entire universe. Or, in plain English: I care about my friends and cool-looking trees and British pantomime theater[1], and people have cared about those sorts of things since long before we knew anything at all about the low-level quantum fields of which all those things consist.

That presents a challenge for any AI system looking to learn and follow human values: since human values are expressed in terms of humans’ own internal high-level concepts, a faithful learning of those values needs to also somehow learn the high-level concepts which humans represent. Such a system needs its own internal representation of the concepts humans would use, and in order to actually pursue humans’ goals/values/utility, the system needs to wire its own goals/values/utility into that same structure - not just e.g. track “quoted human concepts”.

Again, we see the same sort of problem that interpretability faces: a need to faithfully map between the internal latent structures of two minds. And natural latents apply in much the same way: they let us say anything at all about the latent structure internal to one agent, based on some very simple and general conditions on the internal structure of another agent. And in particular, they let us establish isomorphism in at least some cases.

Where We’re Currently Headed With This

The natural latent conditions themselves are still a fairly low-level tool. We currently expect that the sort of concepts humans use day-to-day (like cars and trees, rolling, orange-ness, and generally most of the things we have words for) can generally be grounded in terms of natural latents, but they’re relatively more complicated structures which involve a few different latents and different naturality conditions. For instance, there’s a diagram on our whiteboard right now which talks about two sets of latents which are each natural over different variables conditional on the other, in an attempt to establish naturality/isomorphism in a relatively-simple clustering model.

So we view natural latents as a foundational tool. The plan is to construct more expressive structures out of them, rich enough to capture the type signatures of the kinds of concepts humans use day-to-day, and then use the guarantees provided by natural latents to make similar isomorphism claims about those more-complicated structures. That would give a potential foundation for crossing the gap between the internal ontologies of two quite different minds.

1. Not necessarily in that order.

One thing I'd note is that AIs can learn from variables that humans can't learn much from, so I think part of what will make this useful for alignment per se is a model of what happens if one mind has learned from a superset of the variables that another mind has learned from.

This model does allow for that. :) We can use this model whenever our two agents agree predictively about some parts of the world X; it's totally fine if our two agents learned their models from different sources and/or make different predictions about other parts of the world.

As long as you only care about the latent variables that make $X_1$ and $X_2$ independent of each other, right? Asking because this feels isomorphic to classic issues relating to deception and wireheading unless one treads carefully. Though I'm not quite sure whether you intend for it to be applied in this way,