Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)
I think this is probably wrong. Vanilla SLT describes a toy case of how Bayesian learning on neural networks works. I think there is a big difference between Bayesian learning, which requires visiting every single point in the loss landscape and trying them all out on every data point, and local learning algorithms, such as evolution, stochastic gradient descent, AdamW, etc., which try to find a good solution using information from just a small number of local neighbourhoods in the loss landscape. Those local learning algorithms are the ones I'd expect to be used by real minds, because they're much more compute efficient.
I think this locality property matters a lot. It introduces additional, important constraints on what nets can feasibly learn. It's where path dependence in learning comes from. I think vanilla SLT was probably a good tutorial for us before delving into the more realistic and complicated local learning case, but there's still work to do to get us to an actually roughly accurate model of how nets learn things.
If a solution consists of internal pieces of machinery that need to be arranged exactly right to do anything useful at all, a local algorithm will need something like $e^{c\,n}$ update steps to learn it, where $n$ is the size of the whole solution.[1] In other words, it won't do better than a random walk that aimlessly wanders around the loss landscape until it runs into a point with low loss by sheer chance. But if a solution with internal pieces of machinery can instead be learned in small chunks that each individually decrease the loss a little bit, the leading term in the number of update steps required to find that solution scales exponentially with the size of the single biggest solution chunk, rather than with the size of the whole solution. So, if the biggest chunk had size $n_{\max}$, the total learning time will be around $e^{c\,n_{\max}}$.[2]
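To make that scaling concrete, here's a toy numerical sketch (my own illustration, with a made-up constant $c$ and the chunk sizes from the footnotes, not a measurement of anything):

```python
import math

# Toy illustration of the scaling above (made-up numbers, not a measurement).
c = 0.5  # hypothetical constant; see footnote [1], depends on the update algorithm

def steps_for_chunk(n: float) -> float:
    """Rough expected update steps to stumble onto a chunk of size n."""
    return math.exp(c * n)

# Learning a size-120 solution in one monolithic piece:
monolithic = steps_for_chunk(120)

# Learning it as chunks of size 50, 50, and 30 (the example from the footnotes):
chunked = steps_for_chunk(50) + steps_for_chunk(50) + steps_for_chunk(30)

print(f"monolithic: ~{monolithic:.1e} steps")  # ~e^60  ≈ 1e26 for c = 0.5
print(f"chunked:    ~{chunked:.1e} steps")     # ~2e^25 ≈ 1e11, dominated by the biggest chunk
```

The chunked total is dominated by the single biggest chunk, which is the leading-term claim above.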
For an example where the solution cannot be learned in chunks like this, see the subset parity learning problem, where SGD really does need a number of update steps exponential in the effective parameter count of the whole solution. For most practical purposes, that means it cannot learn the solution at all.
For a net to learn a big and complicated solution with high Local Learning Coefficient (LLC), it needs a learning story to find the solution's basin in the loss landscape in a feasible timeframe. It can't just rely on random walking, that takes too long. The expected total time it takes the net to get to a basin is, I think, determined mostly by the dimensionality of the mode connections from that basin to the rest of the landscape. Not just by the dimensionality of the basin itself, as would be the case for the sort of global, Bayesian learning modelled by vanilla SLT. The geometry of those connections is the core mathematical object that reflects the structure of the learning process and determines the learnability of a solution.[3] Learning a big solution chunk that increases the total LLC by a lot in one go means needing to find a very low-dimensional mode connection to traverse. This takes a long time, because the connection interface is very small compared to the size of the search space. To learn a smaller chunk that increases the total LLC by less, the net only needs to reach a higher-dimensional mode connection, which will have an exponentially larger interface that is thus exponentially quicker to find.[4]
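Here's a small Monte-Carlo sketch of the interface-size intuition. This is my own toy setup, not a claim about real loss landscapes: I stand in for a "mode connection" with an axis-aligned affine subspace of a unit cube and ask how often a uniformly random point lands within distance eps of it.

```python
import numpy as np

# Toy model: in an n-dimensional search space, the chance that a random point
# lands within eps of a d-dimensional affine subspace shrinks exponentially in
# the codimension n - d, so random search finds high-dimensional interfaces
# exponentially faster than low-dimensional ones.
rng = np.random.default_rng(0)
n, eps, samples = 6, 0.3, 500_000
points = rng.uniform(0, 1, size=(samples, n))

for d in range(n - 1, 0, -1):
    # Hypothetical interface: the subspace where the last (n - d) coordinates equal 0.5.
    dist = np.linalg.norm(points[:, d:] - 0.5, axis=1)
    hit_rate = (dist < eps).mean()
    expected_draws = np.inf if hit_rate == 0 else 1 / hit_rate
    print(f"interface dim {d} (codim {n - d}): hit rate {hit_rate:.2e}, "
          f"~{expected_draws:.0f} random draws to find it")
```

In this toy picture the hit rate falls off roughly like eps to the power of the codimension, so lower-dimensional connections take exponentially more draws to reach.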
I agree that vanilla SLT seems like a useful tool for developing the right mental picture of how nets learn things, but it is not itself that picture. The simplified Bayesian learning case is instructive for illuminating the connection between learning and loss landscape geometry in the most basic setting, but taken on its own it's still failing to capture a lot of the structure of learning in real minds.
Where $c$ is some constant which probably depends on the details of the update algorithm.
I'm not going to add "I think" and "I suspect" to every sentence in this comment, but you should imagine them being there. I haven't actually worked this out in math properly or tested it.
At least for a specific dataset and architecture. Modelling changes in the geometry of the loss landscape if we allow dataset and architecture to vary based on the mind's own decisions as it learns might be yet another complication we'll need to deal with in the future, once we start thinking about theories of learning for RL agents with enough freedom and intelligence to pick their learning curricula themselves.
To get the rough idea across I'm focusing here on the very basic case where the "chunks" are literal pieces of the final solution and each of them lowers the loss a little and increases the total LLC a little. In general, this doesn't have to be true though. For example, a solution D with effective parameter count 120 might be learned by first learning independent chunks A and B, each with effective parameter count 50, then learning a chunk C with effective parameter count 30 which connects the formerly independent A and B together into a single mechanistic whole to form solution D. The expected number of update steps in this learning story would be around $e^{50c} + e^{50c} + e^{30c}$.
String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?
My explanation would be: our feeble human minds can’t track too many simultaneous interacting causal dependencies. So if we want to (e.g.) explain intuitively why the freezing point of methanol is -98°C as opposed to -96°C, we know we can’t, and we don’t even try, we just say “sorry, there isn’t any intuitive explanation of that, it’s just what you get experimentally, and oh it’s also what you get in this molecular dynamics (MD) simulation, here’s the code”. We don’t bother to make a technical diagram of why it’s -98 not -96 because it would be a zillion arrows going every which way and no one would understand it, so there’s no point in drawing it in the first place.
The MD code, incidentally, is a different structure with different interacting entities (variables, arrays, etc.), and is the kind of thing we humans can intuitively understand, and (relatedly) it can be represented pretty well as a flow diagram with boxes and arrows. So physical chemistry textbooks will talk about the MD code but NOT talk about the subtle detailed aspects of interacting methanol molecules that distinguish a -98°C freezing point from -96.
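For concreteness, here is the kind of loop "the MD code" refers to, stripped down to a Lennard-Jones toy in Python (real methanol simulations use far more elaborate force fields, but the box-and-arrow structure is the same):

```python
import numpy as np

# Minimal Lennard-Jones MD loop (reduced units, velocity-Verlet integration).
rng = np.random.default_rng(0)
box, dt, steps = 6.0, 0.001, 1000
grid = np.linspace(0.0, box, 3, endpoint=False) + 1.0
pos = np.array([[x, y, z] for x in grid for y in grid for z in grid])  # 27 particles on a lattice
pos += rng.normal(0.0, 0.05, pos.shape)        # small jitter off the lattice
vel = rng.normal(0.0, 0.5, pos.shape)
n = len(pos)

def forces(pos):
    """Pairwise Lennard-Jones forces with minimum-image periodic boundaries."""
    f = np.zeros_like(pos)
    for i in range(n):
        d = pos[i] - pos                        # vectors from every particle j to particle i
        d -= box * np.round(d / box)            # minimum-image convention
        r2 = (d ** 2).sum(axis=1)
        r2[i] = np.inf                          # no self-interaction
        inv6 = 1.0 / r2 ** 3
        mag = 24.0 * (2.0 * inv6 ** 2 - inv6) / r2
        f[i] = (mag[:, None] * d).sum(axis=0)
    return f

f = forces(pos)
for _ in range(steps):                          # positions -> forces -> velocities -> positions
    vel += 0.5 * dt * f
    pos = (pos + dt * vel) % box
    f = forces(pos)
    vel += 0.5 * dt * f

print("kinetic energy per particle:", round(float((vel ** 2).sum()) * 0.5 / n, 3))
```

Each named array and function here is one of the "interacting entities" a textbook can draw a box around, even while the emergent freezing-point details stay intractable to intuition.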
Molecular dynamics was also the first counterexample I was thinking of.
So physical chemistry textbooks will talk about the MD code but NOT talk about the subtle detailed aspects of interacting methanol molecules that distinguish a -98°C freezing point from -96.
Using heuristics here gets easier though if you require less precision. I actually think that textbook could totally be written. Maybe not for why it is -98 rather than -96, but different heuristics and knowing the boiling points of other molecules should get you quite far (maybe why it is -98 rather than -108). I would absolutely read that textbook.
I'd be curious about how your timelines updated. Last year you wrote:
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), including possibly under the next generation of base model. My median guess is that they won’t and that the excitement about them is very overblown. But I’m not very confident in that guess.
If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.
To me it seems plausible that we're in some intermediate world where progress continues but we still have like 5 years to criticality.
Thanks for your yearly update!
On the plan:
- What is The Plan for AI alignment? Briefly: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Then, look through our AI’s internal concepts for a good alignment target, and Retarget the Search.
I think this won't work because many human-value-laden concepts aren't very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
The standard alignment by default story goes:
- Suppose the natural abstraction hypothesis[2] is basically correct, i.e. a wide variety of minds trained/evolved in the same environment converge to use basically-the-same internal concepts.
- … Then it’s pretty likely that neural nets end up with basically-human-like internal concepts corresponding to whatever stuff humans want.
- … So in principle, it shouldn’t take that many bits-of-optimization to get nets to optimize for whatever stuff humans want.
- … Therefore if we just kinda throw reward at nets in the obvious ways (e.g. finetuning/RLHF), and iterate on problems for a while, maybe that just works?
In the linked post, I gave that roughly a 10% chance of working. I expect the natural abstraction part to basically work, the problem is [...]
I think the natural abstraction part here does not work - not because natural abstractions aren't a thing - but because there's an exception for abstractions that are dependent on the particular mind architecture an agent has.
Concepts like "love", "humor", and probably "consciousness" may be natural for humans but probably less natural for AIs.
But also we cannot just wire up those concepts into the values of an AI and expect the AI's values to generalize correctly. The way our values generalize - how we will decide what to value as we grow smarter and do philosophical reflection - seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes' agenda), we'd need to point the AI to an indirect specification of what we value, aka CEV. And CEV doesn't seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process to it (even assuming we have a retargetable general purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
Generally, we aim to work on things which are robust bottlenecks to a broad space of plans. In particular, our research mostly focuses on natural abstraction, because that seems like the most robust bottleneck on which (not-otherwise-doomed) plans get stuck.
However, it does not look to me like you are making much progress relative to your stated beliefs of how close you are. Aka relative to (from your 2024 update where this statement sounded like it's based on 10ish year timelines):
Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track.
So here are some thoughts on how your progress looks to me, although I've not been following your research in detail anymore since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you're making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn't be allowed to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality... Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...
Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling of the post trying to explain too much at once - lumping together things as natural latents that seem very importantly different, and also in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace like "tree" (as opposed to a particular tree)), although I couldn't explain it well at the time. Later I studied a bit of formal language semantics and the distinction there is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that's still too abstract and still too top-down and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment - I didn't continue it after my 5 month language-and-orcas-exploration. But I do think concrete study of observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself it does not look to me like natural latents is much progress. But again I didn't follow your work in detail and if you have concrete plans or evidence of how it's going to be useful for pointing AIs then lmk.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you're just about to find some sort of definitive theory of concepts. there's just SO MUCH different stuff going on with concepts! wentworth+lorell's work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell's work (i'd probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there's SO MANY questions! there's a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! "what's the formula for good concepts?" should sound to us like "what's the formula for useful technologies?" or "what's the formula for a strong economy?". there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: "retarget the search to human values" sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind's values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable "safely"/"value-preservingly") they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it's plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
maybe the position is "humans aren't retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one". it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won't even remotely be a nice cleavage between values and understanding
a response: the issue is that i've chosen an extremely unnatural task. a counterresponse: it's also extremely unnatural to have one's valuing route through an alien species, which is what the proposal wants to do to the AI
that said, i think it's also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it's reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don't touch. in these cases, these edits would not affect the far future, at least not in the straightforward way
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general.
wentworth+lorell's work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell's work (i'd probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining!
Btw if you mean there are 10k contributions already that are on the level of John's contributions, I strongly disagree with this. I'm not sure whether John's math is significantly useful, and I don't think it's been that much progress relative to "almost on track to maybe solve alignment", but in terms of (alignment) philosophy John's work is pretty great compared to academic philosophy.
In terms of general alignment philosophy (not just work on concepts but also other insights), I'd probably put John's collective works in 3rd place after Eliezer Yudkowsky and Douglas Hofstadter. The latter is on the list mainly because of Surfaces and Essences, which I can recommend (@johnswentworth).
Aka I'd probably put John above people like Wittgenstein, although I admit I don't know that much about the works of philosophers like Wittgenstein. Could be that there are more insights in the collective works of Wittgenstein, but if I'd need to read through 20x the volume because he doesn't write clearly enough that's still a point against him. Even if a lot of John's insights have been said before somewhere, writing insights clearly provides a lot of value.
Although John's work on concepts plays a smaller role in what I think makes John a good alignment philosopher than it does in his alignment research. Partially I think John just has some great individual posts like science in a high dimensional world, you're not measuring what you think you're measuring, why agent foundations (coining the term "true names"), and probably a couple more less known older ones that I haven't processed fully yet. And generally his philosophy that you need to figure out the right ontology is good.
But also tbc, this is just alignment philosophy. In terms of alignment research, he's a bit further down my list, e.g. also behind Paul Christiano and Steven Byrnes.
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it's plausible that this line of inquiry is just about to find some sort of definitive theory of concepts. (i expect you will still have a meaningfully lower number. i could be convinced it's more like 1000 but i think it's very unlikely to be like 100.) i think wentworth is obviously much higher eg if you rank people on publicly displayed alignment understanding, very likely in the top 10
Unless we have an AI that shares our mind architecture (like in Steven Byrnes' agenda)
I think there's an important distinction here between (a) "including human value concepts" and (b) "being able to point at human value concepts". Systems sharing our mind architecture make (a) more likely but do not make (b) more likely, and I think (b) is required for good outcomes.
Thanks for the yearly update! I have some thoughts on why we care about string diagrams and commutative diagrams so much. (It's not even just "category theory".) I'll poke you later to talk about them in greater depth but for quick commentary:
For string diagrams it's something like "string diagrams are a minimal way to represent both timelike propagation of information and spacelike separation of causal influence". If you want to sketch out some causal graph, string diagrams are the natural best way to do that. From there you start caring about monoidal structure and you're off to the races.
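Here's a minimal toy sketch of what I mean, in Python (my own made-up `Box` / `then` / `beside` names, not any standard library): sequential composition is the timelike part, parallel composition is the spacelike part, and together they're already enough to wire up little circuit-style diagrams.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A "box" is a process from a tuple of input wires to a tuple of output wires.
# `then` is sequential (timelike) composition; `beside` is parallel (spacelike)
# composition of boxes acting on disjoint wires.

@dataclass
class Box:
    n_in: int
    n_out: int
    fn: Callable[[Tuple], Tuple]

    def then(self, other: "Box") -> "Box":        # wire my outputs into `other`
        assert self.n_out == other.n_in
        return Box(self.n_in, other.n_out, lambda xs: other.fn(self.fn(xs)))

    def beside(self, other: "Box") -> "Box":      # place side by side, no shared wires
        return Box(self.n_in + other.n_in, self.n_out + other.n_out,
                   lambda xs: self.fn(xs[:self.n_in]) + other.fn(xs[self.n_in:]))

# Example: a tiny "circuit" built from named boxes.
add  = Box(2, 1, lambda xs: (xs[0] + xs[1],))
neg  = Box(1, 1, lambda xs: (-xs[0],))
copy = Box(1, 2, lambda xs: (xs[0], xs[0]))

diagram = copy.then(neg.beside(neg)).then(add)    # x -> (-x, -x) -> -2x
print(diagram.fn((3,)))                           # (-6,)
```

The monoidal structure is roughly the claim that `then` and `beside` interact coherently, which is where the "off to the races" part starts.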
For commutative diagrams the story is different but related, though admittedly I understand what's going on with [commutative diagrams]+[sparse activations] way way less. I'd say it's something like "the existence of a satisfied commutative diagram puts strong constraints on other aspects of the neural net, like what form the latent space(s?) and the maps to and from them have to take, what they have to do, and what information has to get preserved or discarded".
For one last observation, a friend's been poking me about the sense that constraints and equipartition/environment are dual to each other, and that there's a correspondence (for bounded systems at least) between phase volume size-change and the sign of something like an informational analogue to thermodynamic temperature. (And also that your approach is importantly incomplete in currently only dealing in theory and not engineering, but for my part I think that that's priced in to how you talk about your plan.)
Bother me on Discord?
The natural latents machinery says a lot about what information needs to be passed around, but says a lot less about how to represent it. What representations are natural?
I like this question. The direction I'm currently thinking in is spaces and distributions within them.
What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.
If we consider that the inputs and outputs of nets contain distributions which are implied at training time, the net may be storing transformations that do not capture or represent any given aspect of the distribution it operates on, specifically in cases where details of the distribution are irrelevant to the operation it performs. That said, this makes me optimistic about unsupervised methods, e.g. autoencoders and sequence predictors.
Natural Latents: Latent Variables Stable Across Ontologies
I don’t quite understand this multi-layered construction — at the foundation of everything lies quantum physics with unitary evolution, in which, due to quantum Darwinism, only pointer states are preserved.
https://arxiv.org/html/2510.06867v1
Quantum systems achieve objectivity by redundantly encoding information about themselves into the surrounding environment, through a mechanism known as quantum Darwinism. When this happens, observers measure the environment and infer the system to be in one of its pointer states.
Examples of pointer states are the “alive” and “dead” cat states in Schrödinger’s experiment, since they are encoded into the surrounding environment, even before we open the box.
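To make the quoted mechanism concrete, here is a toy numerical sketch (my own illustrative construction, an idealized branch state rather than any real decoherence model): one "system" qubit whose pointer basis is {|0⟩, |1⟩} gets copied into four environment qubits, and then even a small fragment of the environment already carries a readable record of the pointer state.

```python
import numpy as np

# Toy "quantum Darwinism" sketch: system qubit + 4 environment qubits in the
# branch state (|0>|0000> + |1>|1111>) / sqrt(2), i.e. each environment qubit
# holds a copy of the system's pointer-basis value.
n_env = 4
n_qubits = 1 + n_env
psi = np.zeros(2 ** n_qubits)
psi[0] = 1 / np.sqrt(2)                    # |0>|0000>
psi[-1] = 1 / np.sqrt(2)                   # |1>|1111>

def reduced_density_matrix(psi, keep):
    """Density matrix of the qubits in `keep` (qubit 0 = system), tracing out the rest."""
    t = psi.reshape([2] * n_qubits)
    drop = [q for q in range(n_qubits) if q not in keep]
    t = np.moveaxis(t, keep + drop, list(range(n_qubits)))
    t = t.reshape(2 ** len(keep), -1)
    return t @ t.conj().T

# A single environment qubit has no coherence left in the pointer basis...
print(np.round(reduced_density_matrix(psi, [1]), 3))      # diag(0.5, 0.5)
# ...and is perfectly correlated with the system's pointer state:
print(np.round(reduced_density_matrix(psi, [0, 1]), 3))   # weight only on |00> and |11>
```

That redundancy across fragments is what lets many observers independently agree on the pointer state.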
Now we introduce Natural Latents, which surround the surrounding environment? Observing the observer who observes the cat?
Due to quantum Darwinism information is already redundantly stored in the surrounding environment, including human brains and AI databases, and therefore becomes objective. Why is the next layer needed?
In order to understand understanding we need to study things that understand, so observing observers is exactly the thing to do.
Quantum Darwinism reminds me of one part of the Copenhagen catechism, the idea that the quantum-to-classical transition (as we now call it) somehow revolves around "irreversible amplification" of microscopic to macroscopic properties. In quantum Darwinism, the claim instead is that properties become objectively present when multiple observers could agree on them. As https://arxiv.org/abs/1803.08936 points out on its first page, this is more like "inter-subjectivity" than objectivity, and there are also edge cases where the technical criterion simply fails. Like every other interpretation, quantum Darwinism has not resolved the ontological mysteries of quantum theory.
As for this Natural Latents research program, it seems to be studying the compressed representations of the world that brains and AIs form, and looking for what philosophers call "natural kinds", in the form of compressions and categorizations that a diverse variety of learning systems would naturally make.
The authors of the article express their personal viewpoint on the definition of objectivity.
The definition of what it means to be objective in-and-of-itself is up for debate (this definition can be thought of as inter-subjectivity rather than objectivity per se), but that debate is not the purpose of this Letter.
I can also agree that a specially prepared environment, for example one consisting of a wall of entangled qubits, does not ensure objectivity, since it simply continues the chain of superpositions: atom, Geiger counter, vial, cat, wall in the thought experiment. But our world is arranged such that this situation does not occur, at least without deliberate intervention by an experimenter.
I tried to imagine such a thought experiment — it is possible with a qubit, but not with a cat. In fact, this would mean creating a long-lived quantum memory, which I do not rule out. Does this negate objectivity?
I would like to note that a pointer state is the state of a pointer of a measuring device—this is where the name comes from. For example, in the case of Schrödinger’s cat, one can construct a device that indicates whether the cat is alive or dead, thereby ensuring objectivity even in the absence of a human observer.
Moreover, such devices can rely on different measurable signals: an electroencephalogram, a cardiogram, the cat’s heat production, the amount of CO₂ it exhales, and so on. A classical device that would display a superposition of the states ⟨alive⟩ + ⟨dead⟩ cannot be constructed; therefore, such a superposition is not a pointer state. Human sensory organs are themselves such devices, as is the environment surrounding the cat: EEG and ECG signals generate electromagnetic radiation in the environment, heat production raises its temperature, and CO₂ emission increases the ambient CO₂ concentration.
The mere existence of such “devices” already makes pointer states objective, because any number of observers can look at the pointers!
Can good and evil be pointer states? And if they can, then this would be an objective characteristic, understood in the same way by both humans and AI, and the alignment problem would already be solved!
If you only have unitary evolution, you end up with superpositions of the form
|system state 1⟩ |pointer state 1⟩ + |system state 2⟩ |pointer state 2⟩ + ... + small cross-terms
Are you proposing that we ignore all but one branch of this superposition?
My favorite point of view on the origins of Born’s rule is the following. The final state is a superposition, but we are all inside it.
And since these two states are orthogonal, |state 1⟩ does not see |state 2⟩, and vice versa; God only knows.
The works by Zurek (https://arxiv.org/pdf/1807.02092) and the more recent one (https://arxiv.org/html/2209.08621v6) shed more light on this.
Here one has to be very careful with the proof of such a multiverse picture, because, as usual, we replace the observed averaging of outcomes of experiments repeated in time in our world with the squared modulus of the (normalized) amplitude, interpreted as the probability of our world. This effectively means averaging over an ensemble of parallel worlds, whose number since the birth of the universe may be infinite.
The explanatory idea is there, but even in the 2025 paper it still looks underdeveloped. I don't understand this very well, so I can't give more details.
Can good and evil be pointer states? And if they can, then this would be an objective characteristic
This would appear to be just saying that if we can build a classical detector of good and evil, good and evil are objective in the classical sense.
What’s “The Plan”?
For several years now, around the end of the year, I (John) have written a post on our plan for AI alignment. That plan hasn’t changed too much over the past few years, so both this year’s post and last year’s are written as updates to The Plan - 2023 Version.
I’ll give a very quick outline here of what’s in the 2023 Plan post. If you have questions or want to argue about points, you should probably go to that post to get the full version.
So, how’s progress? What are you up to?
2023 and 2024 were mostly focused on Natural Latents - we’ll talk more shortly about that work and how it fits into the bigger picture. In 2025, we did continue to put out some work on natural latents, but our main focus has shifted.
Natural latents are a major foothold on understanding natural abstraction. One could reasonably argue that they’re the only rigorous foothold on the core problem to date, the first core mathematical piece of the future theory. We’ve used that foothold to pull ourselves up a bit, and can probably pull ourselves up a little further on it, but there’s more still to climb after that.
We need to figure out the next foothold.
That’s our main focus at this point. It’s wide open, very exploratory. We don’t know yet what that next foothold will look like. But we do have some sense of what problems remain, and what bottlenecks the next footholds need to address. That will be the focus of the rest of this post.
What are the next bottlenecks to understanding natural abstraction?
We see two main “prongs” to understanding natural abstraction: the territory-first prong, and the mind-first prong. These two have different bottlenecks, and would likely involve different next footholds. That said, progress on either prong makes the other much easier.
What’s the “territory-first prong”?
One canonical example of natural abstraction comes from the ideal gas (and gasses pretty generally, but ideal gas is the simplest).
We have a bunch of little molecules bouncing around in a box. The motion is chaotic: every time two molecules collide, any uncertainty in their velocity is amplified multiplicatively. So if an observer has any uncertainty in the initial conditions (which even a superintelligence would, for a real physical system), that uncertainty will grow exponentially over time, until all information is wiped out… except for conserved quantities, like e.g. the total energy of the molecules, the number of molecules, or the size of the box. So, after a short time, the best predictions our observer will be able to make about the gas will just be equivalent to using a Maxwell-Boltzmann distribution, conditioning on only the total energy (or equivalently temperature), number of particles, and volume. It doesn’t matter if the observer is a human or a superintelligence or an alien, it doesn’t matter if they have a radically different internal mind-architecture than we do; it is a property of the physical gas that those handful of parameters (energy, particle count, volume) summarize all the information which can actually be used to predict anything at all about the gas’ motion after a relatively-short time passes.
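Here's a toy numerical sketch of that information-wiping story (my own illustration, using a standard random pairwise energy-exchange model rather than a real gas simulation): two wildly different microstates with the same total energy become statistically indistinguishable after mixing, and the surviving predictions are fixed by the conserved energy alone.

```python
import numpy as np

# Toy model: N particles exchange energy through random pairwise "collisions"
# that conserve the pair's kinetic energy. Two very different initial
# microstates with the same total energy end up statistically
# indistinguishable; only (energy, particle count) survives.
rng = np.random.default_rng(0)
N, steps = 10_000, 200_000

def mix(v):
    v = v.copy()
    for _ in range(steps):
        i, j = rng.integers(0, N, size=2)
        if i == j:
            continue
        e = v[i] ** 2 + v[j] ** 2              # proportional to the pair's kinetic energy
        theta = rng.uniform(0, 2 * np.pi)      # redistribute it at a random angle
        v[i], v[j] = np.sqrt(e) * np.cos(theta), np.sqrt(e) * np.sin(theta)
    return v

v_a = rng.normal(0, 2.0, N)                    # microstate A: already thermal-looking
v_b = np.full(N, np.sqrt(np.mean(v_a ** 2)))   # microstate B: every speed identical,
                                               # but the same total energy as A
for label, v in [("A", mix(v_a)), ("B", mix(v_b))]:
    print(label, "mean v^2:", np.mean(v ** 2).round(3),
          "fraction with |v| > 4:", np.mean(np.abs(v) > 4).round(4))
# After mixing, both match the 1D Maxwell-Boltzmann (Gaussian) prediction
# determined by the conserved total energy alone.
```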
The key point about the gas example is that it doesn’t talk much about any particular mind. It’s a story about how a particular abstraction is natural (e.g. the energy of a gas), and that story mostly talks about properties of the physical system (e.g. chaotic dynamics wiping out all signal except the energy), and mostly does not talk about properties of a particular mind. Thus, “territory-first”.
More generally: the territory-first prong is about looking for properties of (broad classes of) physical systems, which make particular abstractions uniquely well-suited to those systems. Just like (energy, particle count, volume) is an abstraction well-suited to an ideal gas because all other info is quickly wiped out by chaos.
What’s the “mind-first prong”?
Here’s an entirely different way one might try to learn about natural abstraction.
Take a neural net, and go train it on some data from real-world physical systems (e.g. images or video, ideally). Then, do some interpretability to figure out how the net is representing those physical systems internally, what information is being passed around in what format, etc. Repeat for a few different net architectures and datasets, and look for convergence in what stuff the net represents and how.
(Is this just interpretability? Sort of. Interp is a broad label; most things called “interpretability” are not particularly relevant to the mind-first prong of natural abstraction, but progress on the mind-first prong would probably be considered interp research.)
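As a very rough sketch of what the simplest version of that loop could look like (toy scale, synthetic data, and my own choice of similarity measure, linear CKA from Kornblith et al. 2019, rather than anything this research program is committed to):

```python
import torch
import torch.nn as nn

# Train two architecturally different nets on the same data, then ask how
# similar their internal representations are. All names and the synthetic
# "environment" below are made up for illustration.
torch.manual_seed(0)

X = torch.randn(4096, 20)                       # inputs
latent = X[:, :3]                               # low-dimensional structure in the data
y = latent.norm(dim=1, keepdim=True) + torch.sin(latent[:, :1] * 3.0)

def make_net(widths):
    layers, d = [], 20
    for w in widths:
        layers += [nn.Linear(d, w), nn.ReLU()]
        d = w
    return nn.Sequential(*layers, nn.Linear(d, 1))

def train(net, epochs=200):
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X), y)
        loss.backward()
        opt.step()
    return net

def penultimate_activations(net):
    with torch.no_grad():
        return nn.Sequential(*list(net)[:-1])(X)   # everything except the final linear layer

def linear_cka(A, B):
    """Linear CKA between two (samples x features) activation matrices."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    hsic = (A.T @ B).norm() ** 2
    return (hsic / ((A.T @ A).norm() * (B.T @ B).norm())).item()

net1 = train(make_net([64, 64]))                   # deeper, narrower
net2 = train(make_net([256]))                      # shallower, wider
print("linear CKA between hidden representations:",
      round(linear_cka(penultimate_activations(net1), penultimate_activations(net2)), 3))
```

A real attempt at the mind-first prong would swap in image or video data, several genuinely different architectures and datasets, and much finer-grained interpretability than a single similarity score.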
In particular, what we’d really like here is to figure out something about how patterns in the data end up represented inside the net, and then go look in the net to learn about natural abstractions out in the territory. Ideally, we could somehow nail down the “how the natural abstractions get represented in the net” part without knowing everything about what natural abstractions even are (i.e. what even is the thing being represented in the net), so that we could learn about their type signature by looking at nets.
More generally: the mind-first prong is about looking for convergent laws governing how patterns get “burned in” to trained/evolved systems like neural nets, and then using those laws to look inside nets trained on the real world, in order to back out facts about natural abstractions in the real world.
Note that anything one can figure out about real-world natural abstractions via looking inside nets (i.e. the mind-first prong) probably tells us a lot about the abstraction-relevant physical properties of physical systems (i.e. the territory-first prong), and vice versa.
So what has and hasn’t been figured out on the territory prong?
The territory prong has been our main focus for the past few years, and it was the main motivator for natural latents. Some key pieces which have already been nailed down to varying extents:
… but that doesn’t, by itself, give us everything we want to know from the territory prong.
Here are some likely next bottlenecks:
And what has and hasn’t been figured out on the mind prong?
The mind prong is much more wide open at this point; we understand it less than the territory prong.
What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.
We have some foundational pieces:
… but none of that directly hits the core of the problem.
If you want to get a rough sense of what a foothold on the core mind prong problem might look like, try Toward Statistical Mechanics of Interfaces Under Selection Pressure. That piece is not a solid, well-developed result; probably it’s not the right way to come at this. But it does touch on most of the relevant pieces; it gives a rough sense of the type of thing which we’re looking for.
Mostly, this is a wide open area which we’re working on pretty actively.