The setup here suggests an empirical (but conceptually tricky) research direction: take two different AIs trained on the same prediction task (e.g. predicting next tokens of webtext) and try to put their internal structures into correspondence in some way.
It's a bit unclear to me what the desiderata for this research should be. Ideally, I think we want something like a "mechanistic correspondence": roughly, a heuristic argument that the two models produce the same output distribution when given the same input.
Back when Redwood was working on model internals and interp, we were somewhat excited about trying to do something along these lines, probably using automated methods to find a correspondence that looks accurate under causal scrubbing.
(I haven't engaged much with this post overall, I just thought this connection might be interesting.)
(I might expand on this comment later but for now) I'll point out that there are some pretty large literatures out there which seem at least somewhat relevant to these questions, including on causal models, identifiability and contrastive learning, and on neuroAI - for some references and thoughts see e.g.:
https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default?commentId=8CngPZyjr5XydW4sC
And for some very recent potentially relevant work, using SAEs:
Towards Universality: Studying Mechanistic Similarity Across Language Model Architectures
Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models
I keep getting stuck on:
Suppose two Bayesian agents are presented with the same spreadsheet - IID samples of data in each row, a feature in each column. Each agent develops a generative model of the data distribution.
It is exceedingly rare that two Bayesian agents are presented with the same data. The more interesting case is when they are presented with different data, or perhaps with partially-overlapping data. Like let's say you've got three spreadsheets, A, B, and AB, and spreadsheets A and AB are concatenated and given to agent X, while spreadsheets B and AB are concatenated and given to agent Y. Obviously agent Y can infer whatever information about A that is present in AB, so the big question is how can X communicate unique information of A to Y, when Y hasn't even allocated the relevant latents to make use of that information yet, and X doesn't know what Y has learned from B and thus what is or isn't redundant?
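A minimal sketch of that data setup (purely illustrative; the shapes and distributions here are arbitrary):

```python
# Illustrative only: build the partially-overlapping "spreadsheets".
# Rows are IID samples, columns are shared features.
import numpy as np

rng = np.random.default_rng(0)
A  = rng.normal(size=(100, 4))   # rows only agent X ever sees
B  = rng.normal(size=(100, 4))   # rows only agent Y ever sees
AB = rng.normal(size=(100, 4))   # rows both agents see

data_for_X = np.concatenate([A, AB])   # agent X's spreadsheet
data_for_Y = np.concatenate([B, AB])   # agent Y's spreadsheet
```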
Haven't fully read the post, but I feel like that could be relaxed. Part of my intuition is that Aumann's theorem can be relaxed to the case where the agents start with different priors, and the conclusion is that their posteriors differ by no more than their priors.
The issue with Aumann's theorem is that if the agents have different data then they might have different structure for the latents they use and so they might lack the language to communicate the value of a particular latent.
Like let's say you want to explain John Wentworth's "Minimal Motivation of Natural Latents" post to a cat. You could show the cat the post, but even if it trusted you that the post was important, it doesn't know how to read or even that reading is a thing you could do with it. It also doesn't know anything about neural networks, superintelligences, or interpretability/alignment. This would make it hard to get the cat to pay attention to the post in any way that differs from how it treats any other internet post.
Plausibly a cat lacks the learning ability to ever understand this post (though I don't think anyone has seriously tried?). But even if you were trying to introduce a human to it, unless that human has a lot of relevant background knowledge, they're just not going to get it, even when shown the entire text, and it's going to be hard to explain the gist without a significant back-and-forth to establish the relevant concepts.
Sadly, the difference in their priors could still make a big difference for the natural latents, due to the tiny mixtures problem.
Currently our best way to handle this is to assume a universal prior. That still allows for a wide variety of different priors (i.e. any Turing machine), but the Solomonoff version of natural latents doesn't have the tiny mixtures problem. For Solomonoff natural latents, we do have the sort of result you're intuiting, where the divergence (in bits) between the two agents' priors just gets added to the error term on all the approximations.
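Schematically (just restating the claim above, not a precise theorem statement): if the natural latent conditions hold to within ε bits under agent 1's prior P1, then under agent 2's prior P2 the error term becomes roughly

$$\epsilon' \;\lesssim\; \epsilon + D_{KL}(P_1 \,\|\, P_2) \quad \text{(in bits).}$$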
Yup, totally agree with this. This particular model/result is definitely toy/oversimplified in that way.
Generally, the purpose of a simplified model is to highlight the essence of a phenomenon and the key constraints on it.
If the question we want to consider is just "why do there seem to be interpretable features across agents from humans to neural networks to bacteria", then I think your model is doing fine at highlighting the essence and constraints.
However, if the question we want to consider is more normative, about what methods we can build to develop higher-level interpretations of agents, or to understand which important things might be missed, then I think your model fails to highlight both the essence and the key constraints.
Yeah, I generally try to avoid normative questions. People tend to go funny in the head when they focus on what should be, rather than what is or what can be.
But there are positive versions of the questions you're asking which I think maintain the core pieces:
By focusing on what is, you get a lot of convex losses on your theories, which makes it very easy to converge. This is what prevents people from going funny in the head with that focus.
But the value of what is is long-tailed, so the vast majority of those constraints come from worthless instances of the things in the domain you are considering. Meanwhile, the niches that allow things to grow big are competitive and therefore heterogeneous, so that vast majority of constraints doesn't help you build the sorts of instances that are valuable. In fact, it might prevent you from doing so, if adaptation to a niche leads to breaking some of the constraints in some way.
One attractive compromise is to focus on the best of what there is.
My current model is that real-world Bayesian agents end up with a fractal of latents to address the complexity of the world, and that for communication/interpretability/alignment, you want to ignore the vast majority of these latents. Furthermore, most agents are gonna be like species of microbes that are so obscure and weak that we just want to entirely ignore them.
Natural latents don't seem to zoom in to what really matters, and instead would risk getting distracted by stuff like the above.
Good summary. I got that from your comment on our previous post, but it was less clear.
The main natural-latents-flavored answer to this would be: different latents are natural over different chunks of the world, and in particular some latents are natural over much bigger (in the volume-of-spacetime sense) parts of the world. So, for instance, the latent summarizing the common features of cats is distributed over all the world's cats, of which there are many in many places on the scale of Earth's surface. On the other hand, the latent summarizing the specifics of one particular cat's genome is distributed over all the cells of that particular cat, which means it's relevant to a much smaller chunk of spacetime than the common-features-of-cats latent. And since the one-cat's-genome latent is relevant to a much smaller chunk of spacetime, it's much less likely to be relevant to any particular agent or decision, unless the agent has strong information that it's going to be near that particular cat a lot.
So there's a general background prior that latents distributed over more spacetime are more likely to be relevant, and that general background prior can also be overridden by more agent-specific information, like e.g. nearby-ness or repeated encounters or whatever.
Or perhaps a better/more-human phrasing than my mouse comment is, the attributes that are in common between cats across the world are not the attributes that matter the most for cats. Cats are relatively bounded, so perhaps mostly their aggregate ecological impact is what matters.
Cats seem relatively epiphenomenal unless you're like, a mouse. So let's say you are a mouse. You need to avoid cats and find cheese without getting distracted by dust. In particular, you need to avoid the cat every time, not just on your 5th time.
Suppose two Bayesian agents are presented with the same spreadsheet - IID samples of data in each row, a feature in each column. Each agent develops a generative model of the data distribution. We'll assume the two converge to the same predictive distribution, but may have different generative models containing different latent variables. We'll also assume that the two agents develop their models independently, i.e. their models and latents don't have anything to do with each other informationally except via the data. Under what conditions can a latent variable in one agent's model be faithfully expressed in terms of the other agent's latents?
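As a toy illustration of this setup (not from the post; the specific model classes and numbers below are arbitrary choices), two agents might fit, say, a Gaussian mixture and a factor-analysis model to the same spreadsheet and end up with similar predictive fit despite very different latents:

```python
# Toy illustration (not the post's method): two "agents" fit different
# latent-variable models to the same IID spreadsheet, then we compare their
# held-out predictive fit. All model classes and numbers are arbitrary.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n_features = 5
mixing = rng.normal(size=(2, n_features))

def sample(n_rows):
    # One IID sample per row, one feature per column.
    latent = rng.normal(size=(n_rows, 2))
    return latent @ mixing + 0.1 * rng.normal(size=(n_rows, n_features))

data = sample(5000)

# Agent 1's generative model: latent = a discrete mixture component.
agent1 = GaussianMixture(n_components=8, random_state=0).fit(data)
# Agent 2's generative model: latent = continuous low-dimensional factors.
agent2 = FactorAnalysis(n_components=2, random_state=0).fit(data)

# Similar held-out average log-likelihoods mean the agents (approximately)
# agree on the predictive distribution while carrying very different latents.
held_out = sample(1000)
print("agent 1:", agent1.score(held_out))
print("agent 2:", agent2.score(held_out))
```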
Let’s put some math on that question.
The n “features” in the data are random variables X1,…,Xn. By assumption the two agents converge to the same predictive distribution (i.e. distribution of a data point), which we’ll call P[X1,…,Xn]. Agent j’s generative model Mj must account for all the interactions between the features, i.e. the features must be independent given the latent variables Λj in model Mj. So, bundling all the latents together into one, we get the high-level graphical structure:
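(in equation form; the sum becomes an integral for continuous latents)

$$P[X_1,\dots,X_n] \;=\; \sum_{\lambda} P[\Lambda^j = \lambda]\,\prod_{i=1}^n P[X_i \mid \Lambda^j = \lambda], \qquad j = 1, 2,$$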
which says that all features are independent given the latents, under each agent’s model.
Now for the question: under what conditions on agent 1’s latent(s) Λ1 can we guarantee that Λ1 is expressible in terms of Λ2, no matter what generative model agent 2 uses (so long as the agents agree on the predictive distribution P[X])? In particular, let’s require that Λ1 be a function of Λ2. (Note that we’ll weaken this later.) So, when is Λ1 guaranteed to be a function of Λ2, for any generative model M2 which agrees on the predictive distribution P[X]? Or, worded in terms of latents: when is Λ1 guaranteed to be a function of Λ2, for any latent(s) Λ2 which account for all interactions between features in the predictive distribution P[X]?
The Main Argument
Λ1 must be a function of Λ2 for any generative model M2 which agrees on the predictive distribution. So, here’s one graphical structure for a simple model M2 which agrees on the predictive distribution:
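(written out: take the latent to be all features except the i-th, and sample the remaining feature from its conditional)

$$\Lambda^2 := X_{\bar i} := (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n), \qquad X_i \sim P[X_i \mid X_{\bar i}].$$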
In English: we take Λ2 to be X¯i, i.e. all but the ith feature. Since the features are always independent given all but one of them (because any random variables are independent given all but one of them), X¯i is a valid choice of latent Λ2. And since Λ1 must be a function of Λ2 for any valid choice of Λ2, we conclude that Λ1 must be a function of X¯i. Graphically, this implies
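(in conditional-independence form)

$$\Lambda^1 \;\perp\; X_i \;\mid\; X_{\bar i}.$$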
By repeating the argument, we conclude that the same must apply for all i:
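(same conditional-independence form, now at every index)

$$\Lambda^1 \;\perp\; X_i \;\mid\; X_{\bar i} \qquad \text{for all } i.$$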
Now we’ve shown that, in order to guarantee that Λ1 is a function of Λ2 for any valid choice of Λ2, and for Λ1 to account for all interactions between the features in the first place, Λ1 must satisfy at least the conditions:
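(spelled out in the notation above)

$$\text{(mediation)}\;\; P[X_1,\dots,X_n \mid \Lambda^1] = \prod_i P[X_i \mid \Lambda^1], \qquad \text{(redundancy)}\;\; \Lambda^1 \perp X_i \mid X_{\bar i} \;\text{ for all } i,$$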
… which are exactly the (weak) natural latent conditions, i.e. Λ1 mediates between all Xi’s and all information about Λ1 is redundantly represented across the Xi’s. From the standard Fundamental Theorem of Natural Latents, we also know that the natural latent conditions are almost sufficient[1]: they don’t quite guarantee that Λ1 is a function of Λ2, but they guarantee that Λ1 is a stochastic function of Λ2, i.e. Λ1 can be computed from Λ2 plus some noise which is independent of everything else (and in particular the noise is independent of X).
… so if we go back up top and allow for Λ1 to be a stochastic function of Λ2, rather than just a function, then the natural latent conditions provide necessary and sufficient conditions for the guarantee which we want.
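Spelled out, "stochastic function" here means: there exist a function f and a noise variable E independent of (X, Λ2) such that

$$\Lambda^1 = f(\Lambda^2, E).$$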
Approximation
Since we’re basically just invoking the Fundamental Theorem of Natural Latents, we might as well check how the argument behaves under approximation.
The standard approximation results allow us to relax both the mediation and redundancy conditions. So, we can weaken the requirement that the latents exactly mediate between features under each model to allow for approximate mediation, and we can weaken the requirement that information about Λ1 be exactly redundantly represented to allow for approximately redundant representation. In both cases, we use the KL-divergences associated with the relevant graphs in the previous sections to quantify the approximation. The standard results then say that Λ1 is approximately a stochastic function of Λ2, i.e. Λ2 contains all the information about X relevant to Λ1 to within the approximation bound (measured in bits).
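Concretely, the KL-divergence associated with a graph is the divergence between the full joint distribution and its factorization over that graph; for example, for agent 1's latent, one way to write these error terms is

$$\epsilon_{\text{med}} \;=\; D_{KL}\Big(P[X,\Lambda^1] \,\Big\|\, P[\Lambda^1]\prod_i P[X_i\mid\Lambda^1]\Big), \qquad \epsilon_{\text{red},i} \;=\; D_{KL}\Big(P[X,\Lambda^1] \,\Big\|\, P[X]\,P[\Lambda^1\mid X_{\bar i}]\Big).$$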
The main remaining loophole is the tiny mixtures problem: arguably-small differences in the two agents' predictive distributions can sometimes allow large failures in the theorems. On the other hand, our two hypothetical agents could in principle resolve such differences via experiment, since they involve different predictions.
Why Is This Interesting?
This argument was one of our earliest motivators for natural latents. It’s still the main argument we have which singles out natural latents in particular - i.e. the conclusion says that the natural latent conditions are not only sufficient for the property we want, but necessary. Natural latents are the only way to achieve the guarantee we want, that our latent can be expressed in terms of any other latents which explain all interactions between features in the predictive distribution.
Note that, in invoking the Fundamental Theorem, we also implicitly put weight on the assumption that the two agents' latents have nothing to do with each other except via the data. That particular assumption can be circumvented or replaced in multiple ways - e.g. we could instead construct a new latent via resampling, or we could add an assumption that either Λ1 or Λ2 has low entropy given X.