String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?
My explanation would be: our feeble human minds can’t track too many simultaneous interacting causal dependencies. So if we want to (e.g.) explain intuitively why the freezing point of methanol is -98°C as opposed to -96°C, we know we can’t, and we don’t even try, we just say “sorry, there isn’t any intuitive explanation of that, it’s just what you get experimentally, and oh it’s also what you get in this molecular dynamics (MD) simulation, here’s the code”. We don’t bother to make a technical diagram of why it’s -98 not -96 because it would be a zillion arrows going every which way and no one would understand it, so there’s no point in drawing it in the first place.
The MD code, incidentally, is a different structure with different interacting entities (variables, arrays, etc.), and is the kind of thing we humans can intuitively understand, and (relatedly) it can be represented pretty well as a flow diagram with boxes and arrows. So physical chemistry textbooks will talk about the MD code but NOT talk about the subtle detailed aspects of interacting methanol molecules that distinguish a -98°C freezing point from -96.
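To make that contrast concrete, here’s a minimal sketch of the top-level structure of an MD code (purely illustrative: a toy force law and integrator, not any real methanol force field). Each function is one box in the flow diagram; the arrays passed between them are the arrows.

```python
import numpy as np

# Each function below is one "box" in the flow diagram; the arrays passed between
# them are the "arrows". The force law and integrator are toy placeholders.
def initialize(n, box=10.0):
    positions = np.random.rand(n, 3) * box          # random positions in the box
    velocities = np.random.randn(n, 3)              # thermal-ish initial velocities
    return positions, velocities

def compute_forces(positions):
    d = positions[:, None, :] - positions[None, :, :]
    r2 = (d ** 2).sum(-1) + np.eye(len(positions))  # avoid divide-by-zero self-terms
    return (d / r2[..., None] ** 2).sum(axis=1)     # crude pairwise repulsion

def integrate(positions, velocities, forces, dt=1e-3, box=10.0):
    velocities = velocities + forces * dt           # (a real code would use Verlet)
    positions = (positions + velocities * dt) % box # periodic boundary conditions
    return positions, velocities

positions, velocities = initialize(64)
for step in range(1000):
    forces = compute_forces(positions)
    positions, velocities = integrate(positions, velocities, forces)
```

That top-level loop is exactly the kind of structure a human can hold in their head, even though the emergent details it produces (like -98 vs -96) are not.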
What’s “The Plan”?
For several years now, around the end of the year, I (John) have written a post on our plan for AI alignment. That plan hasn’t changed too much over the past few years, so both this year’s post and last year’s are written as updates to The Plan - 2023 Version.
I’ll give a very quick outline here of what’s in the 2023 Plan post. If you have questions or want to argue about points, you should probably go to that post to get the full version.
So, how’s progress? What are you up to?
2023 and 2024 were mostly focused on Natural Latents - we’ll talk more shortly about that work and how it fits into the bigger picture. In 2025, we did continue to put out some work on natural latents, but our main focus has shifted.
Natural latents are a major foothold on understanding natural abstraction. One could reasonably argue that they’re the only rigorous foothold on the core problem to date, the first core mathematical piece of the future theory. We’ve used that foothold to pull ourselves up a bit, and can probably pull ourselves up a little further on it, but there’s more still to climb after that.
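(As a quick refresher, stated loosely rather than in the exact approximate form used in the natural latents posts: for two observables $X_1, X_2$, a latent $\Lambda$ is natural when it approximately satisfies both a mediation condition and a redundancy condition; the precise versions and the resulting uniqueness theorems live in those posts.)

$$\text{mediation: } X_1 \perp X_2 \mid \Lambda, \qquad \text{redundancy: } \Lambda \perp X_2 \mid X_1 \;\text{ and }\; \Lambda \perp X_1 \mid X_2.$$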
We need to figure out the next foothold.
That’s our main focus at this point. It’s wide open, very exploratory. We don’t know yet what that next foothold will look like. But we do have some sense of what problems remain, and what bottlenecks the next footholds need to address. That will be the focus of the rest of this post.
What are the next bottlenecks to understanding natural abstraction?
We see two main “prongs” to understanding natural abstraction: the territory-first prong, and the mind-first prong. These two have different bottlenecks, and would likely involve different next footholds. That said, progress on either prong makes the other much easier.
What’s the “territory-first prong”?
One canonical example of natural abstraction comes from the ideal gas (and gases pretty generally, but the ideal gas is the simplest).
We have a bunch of little molecules bouncing around in a box. The motion is chaotic: every time two molecules collide, any uncertainty in their velocities is amplified multiplicatively. So if an observer has any uncertainty in the initial conditions (which even a superintelligence would, for a real physical system), that uncertainty will grow exponentially over time, until all information is wiped out… except for conserved quantities, like e.g. the total energy of the molecules, the number of molecules, or the size of the box. So, after a short time, the best predictions our observer will be able to make about the gas will just be equivalent to using a Maxwell-Boltzmann distribution, conditioned on only the total energy (or equivalently temperature), number of particles, and volume. It doesn’t matter if the observer is a human or a superintelligence or an alien, and it doesn’t matter if they have a radically different internal mind-architecture than we do; it is a property of the physical gas that this handful of parameters (energy, particle count, volume) summarizes all the information which can actually be used to predict anything at all about the gas’ motion after a relatively short time passes.
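Here’s a quick numerical illustration of that claim, using a toy soft-potential gas rather than a real molecular force field, so treat it as a sketch: run two copies of the same gas whose initial velocities differ by one part in 10^8, and watch the difference between the trajectories blow up (roughly exponentially in practice) while the total energy of each copy stays put.

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, dt = 16, 10.0, 1e-3        # particles, box side, timestep (arbitrary units)
EPS = 0.25                       # softening, keeps the toy force law well-behaved

def forces(x):
    # Soft pairwise repulsion, U(r) = 1/(r^2 + EPS); crude, but chaotic is all we need.
    d = x[:, None, :] - x[None, :, :]
    r2 = (d ** 2).sum(-1) + EPS
    return (2 * d / r2[..., None] ** 2).sum(axis=1)

def energy(x, v):
    d = x[:, None, :] - x[None, :, :]
    r2 = (d ** 2).sum(-1) + EPS + np.eye(len(x)) * 1e12   # huge diagonal => self-terms ~ 0
    return 0.5 * (v ** 2).sum() + 0.5 * (1.0 / r2).sum()

def step(x, v):
    # Velocity-Verlet step with reflecting walls
    a = forces(x)
    x = x + v * dt + 0.5 * a * dt ** 2
    v = v + 0.5 * (a + forces(x)) * dt
    hit = (x < 0) | (x > L)
    v[hit] *= -1
    return np.clip(x, 0, L), v

x1 = rng.uniform(1, L - 1, (N, 2)); v1 = rng.normal(0, 1, (N, 2))
x2, v2 = x1.copy(), v1 + 1e-8 * rng.normal(size=(N, 2))   # observer's tiny uncertainty

for t in range(20001):
    x1, v1 = step(x1, v1)
    x2, v2 = step(x2, v2)
    if t % 5000 == 0:
        gap = np.linalg.norm(np.concatenate([x1 - x2, v1 - v2]))
        print(f"t = {t*dt:5.1f}   |difference| = {gap:.2e}   total energy = {energy(x1, v1):.3f}")
```

The exact numbers depend on the toy parameters, but the qualitative pattern is the point: the gap between the two copies grows by many orders of magnitude, while the conserved quantity barely moves.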
The key point about the gas example is that it doesn’t talk much about any particular mind. It’s a story about how a particular abstraction is natural (e.g. the energy of a gas), and that story mostly talks about properties of the physical system (e.g. chaotic dynamics wiping out all signal except the energy), and mostly does not talk about properties of a particular mind. Thus, “territory-first”.
More generally: the territory-first prong is about looking for properties of (broad classes of) physical systems, which make particular abstractions uniquely well-suited to those systems. Just like (energy, particle count, volume) is an abstraction well-suited to an ideal gas because all other info is quickly wiped out by chaos.
What’s the “mind-first prong”?
Here’s an entirely different way one might try to learn about natural abstraction.
Take a neural net, and go train it on some data from real-world physical systems (e.g. images or video, ideally). Then, do some interpretability to figure out how the net is representing those physical systems internally, what information is being passed around in what format, etc. Repeat for a few different net architectures and datasets, and look for convergence in what stuff the net represents and how.
(Is this just interpretability? Sort of. Interp is a broad label; most things called “interpretability” are not particularly relevant to the mind-first prong of natural abstraction, but progress on the mind-first prong would probably be considered interp research.)
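As a toy illustration of the “look for convergence” step (not anything we’ve actually run; the nets below are untrained stand-ins and the inputs are random): extract hidden activations from two different architectures on the same batch of inputs, then compare them with a representational-similarity metric like linear CKA. The real version would train both nets on real images or video and compare them on that data.

```python
import torch
import torch.nn as nn

def linear_cka(X, Y):
    # Linear CKA (Kornblith et al. 2019): similarity of two activation matrices
    # X (n, d1) and Y (n, d2) computed on the same n inputs.
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    return ((X.T @ Y).norm() ** 2 / ((X.T @ X).norm() * (Y.T @ Y).norm())).item()

# Two small, differently-shaped convnets standing in for independently trained nets.
net_a = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(4), nn.Flatten())
net_b = nn.Sequential(nn.Conv2d(3, 32, 5, padding=2), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(4), nn.Flatten())

x = torch.randn(256, 3, 32, 32)   # stand-in for a batch of real images/video frames
with torch.no_grad():
    print("CKA(net_a, net_b) =", linear_cka(net_a(x), net_b(x)))
```

High similarity across architectures, seeds, and datasets (on trained nets, on real data) would be the sort of convergence the mind-first prong is after; the sketch just shows the mechanics of the comparison.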
In particular, what we’d really like here is to figure out something about how patterns in the data end up represented inside the net, and then go look in the net to learn about natural abstractions out in the territory. Ideally, we could somehow nail down the “how the natural abstractions get represented in the net” part without knowing everything about what natural abstractions even are (i.e. what even is the thing being represented in the net), so that we could learn about their type signature by looking at nets.
More generally: the mind-first prong is about looking for convergent laws governing how patterns get “burned in” to trained/evolved systems like neural nets, and then using those laws to look inside nets trained on the real world, in order to back out facts about natural abstractions in the real world.
Note that anything one can figure out about real-world natural abstractions via looking inside nets (i.e. the mind-first prong) probably tells us a lot about the abstraction-relevant physical properties of physical systems (i.e. the territory-first prong), and vice versa.
So what has and hasn’t been figured out on the territory prong?
The territory prong has been our main focus for the past few years, and it was the main motivator for natural latents. Some key pieces which have already been nailed down to varying extents:
… but that doesn’t, by itself, give us everything we want to know from the territory prong.
Here are some likely next bottlenecks:
And what has and hasn’t been figured out on the mind prong?
The mind prong is much more wide open at this point; we understand it less than the territory prong.
What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.
We have some foundational pieces:
… but none of that directly hits the core of the problem.
If you want to get a rough sense of what a foothold on the core mind prong problem might look like, try Toward Statistical Mechanics of Interfaces Under Selection Pressure. That piece is not a solid, well-developed result; probably it’s not the right way to come at this. But it does touch on most of the relevant pieces; it gives a rough sense of the type of thing which we’re looking for.
Mostly, this is a wide open area which we’re working on pretty actively.