Singular learning theory, or something like it, is probably a necessary foundational tool here. It doesn’t directly answer the core question about how environment structure gets represented in the net, but it does give us the right mental picture for thinking about things being “learned by the net” at all. (Though if you just want to understand the mental picture, this video is probably more helpful than reading a bunch of SLT.)
I think this is probably wrong. Vanilla SLT describes a toy case of how Bayesian learning on neural networks works. I think there is a big difference between Bayesian learning, which requires visiting every single point in the loss landscape and trying them all out on every data point, and local learning algorithms, such as evolution, stochastic gradient descent, AdamW, etc., which try to find a good solution using information from just a small number of local neighbourhoods in the loss landscape. Those local learning algorithms are the ones I'd expect to be used by real minds, because they're much more compute efficient.
I think this locality property matters a lot. It introduces additional, important constraints on what nets can feasibly learn. It's where path dependence in learning comes from. I think vanilla SLT was probably a good tutorial for us before delving into the more realistic and complicated local learning case, but there's still work to do to get us to an actually roughly accurate model of how nets learn things.
If a solution consists of internal pieces of machinery that need to be arranged exactly right to do anything useful at all, a local algorithm will need something like $e^{c\,n}$ update steps to learn it, where $n$ is the size of the whole solution.[1] In other words, it won't do better than a random walk that aimlessly wanders around the loss landscape until it runs into a point with low loss by sheer chance. But if a solution with internal pieces of machinery can instead be learned in small chunks that each individually decrease the loss a little bit, the leading term in the number of update steps required to find that solution scales exponentially with the size of the single biggest solution chunk, rather than with the size of the whole solution. So, if the biggest chunk had size $n_{\max}$, the total learning time will be around $e^{c\,n_{\max}}$.[2]
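To make that scaling concrete, here's a toy numerical sketch (my own illustration, with a made-up constant $c$ and the chunk sizes from the footnotes, not a measurement of anything):

```python
import math

# Toy illustration of the scaling above (made-up numbers, not a measurement).
c = 0.5  # hypothetical constant; see footnote [1], depends on the update algorithm

def steps_for_chunk(n: float) -> float:
    """Rough expected update steps to stumble onto a chunk of size n."""
    return math.exp(c * n)

# Learning a size-120 solution in one monolithic piece:
monolithic = steps_for_chunk(120)

# Learning it as chunks of size 50, 50, and 30 (the example from the footnotes):
chunked = steps_for_chunk(50) + steps_for_chunk(50) + steps_for_chunk(30)

print(f"monolithic: ~{monolithic:.1e} steps")  # ~e^60  ≈ 1e26 for c = 0.5
print(f"chunked:    ~{chunked:.1e} steps")     # ~2e^25 ≈ 1e11, dominated by the biggest chunk
```

The chunked total is dominated by the single biggest chunk, which is the leading-term claim above.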
For an example where the solution cannot be learned in chunks like this, see the subset parity learning problem, where SGD really does need a number of update steps exponential in the effective parameter count of the whole solution. For most practical purposes, that means it cannot learn the solution at all.
For a net to learn a big and complicated solution with high Local Learning Coefficient (LLC), it needs a learning story to find the solution's basin in the loss landscape in a feasible timeframe. It can't just rely on random walking, that takes too long. The expected total time it takes the net to get to a basin is, I think, determined mostly by the dimensionality of the mode connections from that basin to the rest of the landscape. Not just by the dimensionality of the basin itself, as would be the case for the sort of global, Bayesian learning modelled by vanilla SLT. The geometry of those connections is the core mathematical object that reflects the structure of the learning process and determines the learnability of a solution.[3] Learning a big solution chunk that increases the total LLC by a lot in one go means needing to find a very low-dimensional mode connection to traverse. This takes a long time, because the connection interface is very small compared to the size of the search space. To learn a smaller chunk that increases the total LLC by less, the net only needs to reach a higher-dimensional mode connection, which will have an exponentially larger interface that is thus exponentially quicker to find.[4]
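Here's a small Monte-Carlo sketch of the interface-size intuition. This is my own toy setup, not a claim about real loss landscapes: I stand in for a "mode connection" with an axis-aligned affine subspace of a unit cube and ask how often a uniformly random point lands within distance eps of it.

```python
import numpy as np

# Toy model: in an n-dimensional search space, the chance that a random point
# lands within eps of a d-dimensional affine subspace shrinks exponentially in
# the codimension n - d, so random search finds high-dimensional interfaces
# exponentially faster than low-dimensional ones.
rng = np.random.default_rng(0)
n, eps, samples = 6, 0.3, 500_000
points = rng.uniform(0, 1, size=(samples, n))

for d in range(n - 1, 0, -1):
    # Hypothetical interface: the subspace where the last (n - d) coordinates equal 0.5.
    dist = np.linalg.norm(points[:, d:] - 0.5, axis=1)
    hit_rate = (dist < eps).mean()
    expected_draws = np.inf if hit_rate == 0 else 1 / hit_rate
    print(f"interface dim {d} (codim {n - d}): hit rate {hit_rate:.2e}, "
          f"~{expected_draws:.0f} random draws to find it")
```

In this toy picture the hit rate falls off roughly like eps to the power of the codimension, so lower-dimensional connections take exponentially more draws to reach.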
I agree that vanilla SLT seems like a useful tool for developing the right mental picture of how nets learn things, but it is not itself that picture. The simplified Bayesian learning case is instructive for illuminating the connection between learning and loss landscape geometry in the most basic setting, but taken on its own it's still failing to capture a lot of the structure of learning in real minds.
Where $c$ is some constant which probably depends on the details of the update algorithm.
I'm not going to add "I think" and "I suspect" to every sentence in this comment, but you should imagine them being there. I haven't actually worked this out in math properly or tested it.
At least for a specific dataset and architecture. Modelling changes in the geometry of the loss landscape if we allow dataset and architecture to vary based on the mind's own decisions as it learns might be yet another complication we'll need to deal with in the future, once we start thinking about theories of learning for RL agents with enough freedom and intelligence to pick their learning curricula themselves.
To get the rough idea across I'm focusing here on the very basic case where the "chunks" are literal pieces of the final solution and each of them lowers the loss a little and increases the total LLC a little. In general, this doesn't have to be true though. For example, a solution D with effective parameter count 120 might be learned by first learning independent chunks A and B, each with effective parameter count 50, then learning a chunk C with effective parameter count 30 which connects the formerly independent A and B together into a single mechanistic whole to form solution D. The expected number of update steps in this learning story would be around $e^{50c} + e^{50c} + e^{30c}$.
String diagrams. Pretty much every technical diagram you’ve ever seen, from electronic circuits to dependency graphs to ???, is a string diagram. Why is this such a common format for high-level descriptions? If it’s fully general for high-level natural abstraction, why, and can we prove it? If not, what is?
My explanation would be: our feeble human minds can’t track too many simultaneous interacting causal dependencies. So if we want to (e.g.) explain intuitively why the freezing point of methanol is -98°C as opposed to -96°C, we know we can’t, and we don’t even try, we just say “sorry, there isn’t any intuitive explanation of that, it’s just what you get experimentally, and oh it’s also what you get in this molecular dynamics (MD) simulation, here’s the code”. We don’t bother to make a technical diagram of why it’s -98 not -96 because it would be a zillion arrows going every which way and no one would understand it, so there’s no point in drawing it in the first place.
The MD code, incidentally, is a different structure with different interacting entities (variables, arrays, etc.), and is the kind of thing we humans can intuitively understand, and (relatedly) it can be represented pretty well as a flow diagram with boxes and arrows. So physical chemistry textbooks will talk about the MD code but NOT talk about the subtle detailed aspects of interacting methanol molecules that distinguish a -98°C freezing point from -96.
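For concreteness, here is the kind of loop "the MD code" refers to, stripped down to a Lennard-Jones toy in Python (real methanol simulations use far more elaborate force fields, but the box-and-arrow structure is the same):

```python
import numpy as np

# Minimal Lennard-Jones MD loop (reduced units, velocity-Verlet integration).
rng = np.random.default_rng(0)
box, dt, steps = 6.0, 0.001, 1000
grid = np.linspace(0.0, box, 3, endpoint=False) + 1.0
pos = np.array([[x, y, z] for x in grid for y in grid for z in grid])  # 27 particles on a lattice
pos += rng.normal(0.0, 0.05, pos.shape)        # small jitter off the lattice
vel = rng.normal(0.0, 0.5, pos.shape)
n = len(pos)

def forces(pos):
    """Pairwise Lennard-Jones forces with minimum-image periodic boundaries."""
    f = np.zeros_like(pos)
    for i in range(n):
        d = pos[i] - pos                        # vectors from every particle j to particle i
        d -= box * np.round(d / box)            # minimum-image convention
        r2 = (d ** 2).sum(axis=1)
        r2[i] = np.inf                          # no self-interaction
        inv6 = 1.0 / r2 ** 3
        mag = 24.0 * (2.0 * inv6 ** 2 - inv6) / r2
        f[i] = (mag[:, None] * d).sum(axis=0)
    return f

f = forces(pos)
for _ in range(steps):                          # positions -> forces -> velocities -> positions
    vel += 0.5 * dt * f
    pos = (pos + dt * vel) % box
    f = forces(pos)
    vel += 0.5 * dt * f

print("kinetic energy per particle:", round(float((vel ** 2).sum()) * 0.5 / n, 3))
```

Each named array and function here is one of the "interacting entities" a textbook can draw a box around, even while the emergent freezing-point details stay intractable to intuition.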
Molecular dynamics was also the first counterexample I was thinking of.
So physical chemistry textbooks will talk about the MD code but NOT talk about the subtle detailed aspects of interacting methanol molecules that distinguish a -98°C freezing point from -96.
Using heuristics here gets easier though if you require less precision. I actually think that textbook could totally be written. Maybe not for why it is -98 rather than -96, but different heuristics and knowing the boiling points of other molecules should get you quite far (maybe why it is -98 rather than -108). I would absolutely read that textbook.
I'd be curious about how your timelines updated. Last year you wrote:
Over the past year, my timelines have become even more bimodal than they already were. The key question is whether o1/o3-style models achieve criticality (i.e. are able to autonomously self-improve in non-narrow ways), including possibly under the next generation of base model. My median guess is that they won’t and that the excitement about them is very overblown. But I’m not very confident in that guess.
If the excitement is overblown, then we’re most likely still about 1 transformers-level paradigm shift away from AGI capable of criticality, and timelines of ~10 years seem reasonable. Conditional on that world, I also think we’re likely to see another AI winter in the next year or so.
If the excitement is not overblown, then we’re probably looking at more like 2-3 years to criticality. In that case, any happy path probably requires outsourcing a lot of alignment research to AI, and then the main bottleneck is probably our own understanding of how to align much-smarter-than-human AGI.
To me it seems plausible that we're in some intermediate world where progress continues but we still have like 5 years to criticality.
Thanks for your yearly update!
On the plan:
- What is The Plan for AI alignment? Briefly: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Then, look through our AI’s internal concepts for a good alignment target, and Retarget the Search.
I think this won't work because many human-value-laden concepts aren't very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
The standard alignment by default story goes:
- Suppose the natural abstraction hypothesis[2] is basically correct, i.e. a wide variety of minds trained/evolved in the same environment converge to use basically-the-same internal concepts.
- … Then it’s pretty likely that neural nets end up with basically-human-like internal concepts corresponding to whatever stuff humans want.
- … So in principle, it shouldn’t take that many bits-of-optimization to get nets to optimize for whatever stuff humans want.
- … Therefore if we just kinda throw reward at nets in the obvious ways (e.g. finetuning/RLHF), and iterate on problems for a while, maybe that just works?
In the linked post, I gave that roughly a 10% chance of working. I expect the natural abstraction part to basically work, the problem is [...]
I think the natural abstraction part here does not work - not because natural abstractions aren't a thing - but because there's an exception for abstractions that are dependent on the particular mind architecture an agent has.
Concepts like "love", "humor", and probably "consciousness" may be natural for humans but probably less natural for AIs.
But also we cannot just wire up those concepts into the values of an AI and expect the AI's values to generalize correctly. The way our values generalize - how we will decide what to value as we grow smarter and do philosophical reflection - seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes' agenda), we'd need to point the AI to an indirect specification of what we value, aka CEV. And CEV doesn't seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process to it (even assuming we have a retargetable general purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
Generally, we aim to work on things which are robust bottlenecks to a broad space of plans. In particular, our research mostly focuses on natural abstraction, because that seems like the most robust bottleneck on which (not-otherwise-doomed) plans get stuck.
However, it does not look to me like you are making much progress relative to your stated beliefs of how close you are. Aka relative to (from your 2024 update where this statement sounded like it's based on 10ish year timelines):
Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track.
So here are some thoughts on how your progress looks to me, although I've not been following your research in detail anymore since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you're making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn't be allowed to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality... Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...
Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling of the post trying to explain too much at once - lumping together things as natural latents that seem very importantly different, and also in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace like "tree" (as opposed to a particular tree)), although I couldn't explain it well at the time. Later I studied a bit of formal language semantics and the distinction there is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that's still too abstract and still too top-down and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment - I didn't continue it after my 5 month language-and-orcas-exploration. But I do think concrete study of observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself it does not look to me like natural latents is much progress. But again I didn't follow your work in detail and if you have concrete plans or evidence of how it's going to be useful for pointing AIs then lmk.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you're just about to find some sort of definitive theory of concepts. there's just SO MUCH different stuff going on with concepts! wentworth+lorell's work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell's work (i'd probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there's SO MANY questions! there's a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! "what's the formula for good concepts?" should sound to us like "what's the formula for useful technologies?" or "what's the formula for a strong economy?". there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: "retarget the search to human values" sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind's values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable "safely"/"value-preservingly") they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it's plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
maybe the position is "humans aren't retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one". it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won't even remotely be a nice cleavage between values and understanding
a response: the issue is that i've chosen an extremely unnatural task. a counterresponse: it's also extremely unnatural to have one's valuing route through an alien species, which is what the proposal wants to do to the AI
that said, i think it's also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it's reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don't touch. in these cases, these edits would not affect the far future, at least not in the straightforward way
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general.
wentworth+lorell's work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell's work (i'd probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining!
Btw if you mean there are 10k contributions already that are on the level of John's contributions, I strongly disagree with this. I'm not sure whether John's math is significantly useful, and I don't think it's been that much progress relative to "almost on track to maybe solve alignment", but in terms of (alignment) philosophy John's work is pretty great compared to academic philosophy.
In terms of general alignment philosophy (not just work on concepts but also other insights), I'd probably put John's collective works in 3rd place after Eliezer Yudkowsky and Douglas Hofstadter. The latter is on the list mainly because of Surfaces and Essences, which I can recommend (@johnswentworth).
Aka I'd probably put John above people like Wittgenstein, although I admit I don't know that much about the works of philosophers like Wittgenstein. Could be that there are more insights in the collective works of Wittgenstein, but if I'd need to read through 20x the volume because he doesn't write clearly enough that's still a point against him. Even if a lot of John's insights have been said before somewhere, writing insights clearly provides a lot of value.
Although John's work on concepts plays a smaller role in what I think makes John a good alignment philosopher than it does in his alignment research. Partially I think John just has some great individual posts like science in a high dimensional world, you're not measuring what you think you're measuring, why agent foundations (coining the term "true names"), and probably a couple more less known older ones that I haven't processed fully yet. And generally his philosophy that you need to figure out the right ontology is good.
But also tbc, this is just alignment philosophy. In terms of alignment research, he's a bit further down my list, e.g. also behind Paul Christiano and Steven Byrnes.
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it's plausible that this line of inquiry is just about to find some sort of definitive theory of concepts. (i expect you will still have a meaningfully lower number. i could be convinced it's more like 1000 but i think it's very unlikely to be like 100.) i think wentworth is obviously much higher eg if you rank people on publicly displayed alignment understanding, very likely in the top 10
Unless we have an AI that shares our mind architecture (like in Steven Byrnes' agenda)
I think there's an important distinction here between (a) "including human value concepts" and (b) "being able to point at human value concepts". Systems sharing our mind architecture make (a) more likely but do not make (b) more likely, and I think (b) is required for good outcomes.
Thanks for the yearly update! I have some thoughts on why we care about string diagrams and commutative diagrams so much. (It's not even just "category theory".) I'll poke you later to talk about them in greater depth but for quick commentary:
For string diagrams it's something like "string diagrams are a minimal way to represent both timelike propagation of information and spacelike separation of causal influence". If you want to sketch out some causal graph, string diagrams are the natural best way to do that. From there you start caring about monoidal structure and you're off to the races.
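Here's a minimal toy sketch of what I mean, in Python (my own made-up `Box` / `then` / `beside` names, not any standard library): sequential composition is the timelike part, parallel composition is the spacelike part, and together they're already enough to wire up little circuit-style diagrams.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A "box" is a process from a tuple of input wires to a tuple of output wires.
# `then` is sequential (timelike) composition; `beside` is parallel (spacelike)
# composition of boxes acting on disjoint wires.

@dataclass
class Box:
    n_in: int
    n_out: int
    fn: Callable[[Tuple], Tuple]

    def then(self, other: "Box") -> "Box":        # wire my outputs into `other`
        assert self.n_out == other.n_in
        return Box(self.n_in, other.n_out, lambda xs: other.fn(self.fn(xs)))

    def beside(self, other: "Box") -> "Box":      # place side by side, no shared wires
        return Box(self.n_in + other.n_in, self.n_out + other.n_out,
                   lambda xs: self.fn(xs[:self.n_in]) + other.fn(xs[self.n_in:]))

# Example: a tiny "circuit" built from named boxes.
add  = Box(2, 1, lambda xs: (xs[0] + xs[1],))
neg  = Box(1, 1, lambda xs: (-xs[0],))
copy = Box(1, 2, lambda xs: (xs[0], xs[0]))

diagram = copy.then(neg.beside(neg)).then(add)    # x -> (-x, -x) -> -2x
print(diagram.fn((3,)))                           # (-6,)
```

The monoidal structure is roughly the claim that `then` and `beside` interact coherently, which is where the "off to the races" part starts.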
For commutative diagrams the story is different but related, though admittedly I understand what's going on with [commutative diagrams]+[sparse activations] way way less. I'd say it's something like "the existence of a satisfied commutative diagram puts strong constraints on other aspects of the neural net, like what form the latent space(s?) and the maps to and from them have to take, what they have to do, and what information has to get preserved or discarded".
For one last observation, a friend's been poking me about the sense that constraints and equipartition/environment are dual to each other, and that there's a correspondence (for bounded systems at least) between phase volume size-change and the sign of something like an informational analogue to thermodynamic temperature. (And also that your approach is importantly incomplete in currently only dealing in theory and not engineering, but for my part I think that that's priced in to how you talk about your plan.)
Bother me on Discord?
The natural latents machinery says a lot about what information needs to be passed around, but says a lot less about how to represent it. What representations are natural?
I like this question. The direction I'm currently thinking in is spaces and distributions within them.
What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.
If we consider that the inputs and outputs of nets contain distributions which are implied at training time, the net may be storing transformations that do not capture or represent any given aspect of the distribution it operates on, specifically in cases where details of the distribution are irrelevant to the operation it performs. That said, this makes me optimistic about unsupervised methods, e.g. autoencoders and sequence predictors.
Natural Latents: Latent Variables Stable Across Ontologies
I don’t quite understand this multi-layered construction — at the foundation of everything lies quantum physics with unitary evolution, in which, due to quantum Darwinism, only pointer states are preserved.
https://arxiv.org/html/2510.06867v1
Quantum systems achieve objectivity by redundantly encoding information about themselves into the surrounding environment, through a mechanism known as quantum Darwinism. When this happens, observers measure the environment and infer the system to be in one of its pointer states.
Examples of pointer states are the “alive” and “dead” cat states in Schrödinger’s experiment, since they are encoded into the surrounding environment, even before we open the box.
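To make the quoted mechanism concrete, here is a toy numerical sketch (my own illustrative construction, an idealized branch state rather than any real decoherence model): one "system" qubit whose pointer basis is {|0⟩, |1⟩} gets copied into four environment qubits, and then even a small fragment of the environment already carries a readable record of the pointer state.

```python
import numpy as np

# Toy "quantum Darwinism" sketch: system qubit + 4 environment qubits in the
# branch state (|0>|0000> + |1>|1111>) / sqrt(2), i.e. each environment qubit
# holds a copy of the system's pointer-basis value.
n_env = 4
n_qubits = 1 + n_env
psi = np.zeros(2 ** n_qubits)
psi[0] = 1 / np.sqrt(2)                    # |0>|0000>
psi[-1] = 1 / np.sqrt(2)                   # |1>|1111>

def reduced_density_matrix(psi, keep):
    """Density matrix of the qubits in `keep` (qubit 0 = system), tracing out the rest."""
    t = psi.reshape([2] * n_qubits)
    drop = [q for q in range(n_qubits) if q not in keep]
    t = np.moveaxis(t, keep + drop, list(range(n_qubits)))
    t = t.reshape(2 ** len(keep), -1)
    return t @ t.conj().T

# A single environment qubit has no coherence left in the pointer basis...
print(np.round(reduced_density_matrix(psi, [1]), 3))      # diag(0.5, 0.5)
# ...and is perfectly correlated with the system's pointer state:
print(np.round(reduced_density_matrix(psi, [0, 1]), 3))   # weight only on |00> and |11>
```

That redundancy across fragments is what lets many observers independently agree on the pointer state.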
Now we introduce Natural Latents, which surround the surrounding environment? Observing the observer who observes the cat?
Due to quantum Darwinism information is already redundantly stored in the surrounding environment, including human brains and AI databases, and therefore becomes objective. Why is the next layer needed?
In order to understand understanding we need to study things that understand, so observing observers is exactly the thing to do.
Quantum Darwinism reminds me of one part of the Copenhagen catechism, the idea that the quantum-to-classical transition (as we now call it) somehow revolves around "irreversible amplification" of microscopic to macroscopic properties. In quantum Darwinism, the claim instead is that properties become objectively present when multiple observers could agree on them. As https://arxiv.org/abs/1803.08936 points out on its first page, this is more like "inter-subjectivity" than objectivity, and there are also edge cases where the technical criterion simply fails. Like every other interpretation, quantum Darwinism has not resolved the ontological mysteries of quantum theory.
As for this Natural Latents research program, it seems to be studying the compressed representations of the world that brains and AIs form, and looking for what philosophers call "natural kinds", in the form of compressions and categorizations that a diverse variety of learning systems would naturally make.
The authors of the article express their personal viewpoint on the definition of objectivity.
The definition of what it means to be objective in-and-of-itself is up for debate (this definition can be thought of as inter-subjectivity rather than objectivity per se), but that debate is not the purpose of this Letter.
I can also agree that a specially prepared environment, for example one consisting of a wall of entangled qubits, does not ensure objectivity, since it simply continues the chain of superpositions: atom, Geiger counter, vial, cat, wall in the thought experiment. But our world is arranged such that this situation does not occur, at least without deliberate intervention by an experimenter.
I tried to imagine such a thought experiment — it is possible with a qubit, but not with a cat. In fact, this would mean creating a long-lived quantum memory, which I do not rule out. Does this negate objectivity?
I would like to note that a pointer state is the state of a pointer of a measuring device—this is where the name comes from. For example, in the case of Schrödinger’s cat, one can construct a device that indicates whether the cat is alive or dead, thereby ensuring objectivity even in the absence of a human observer.
Moreover, such devices can rely on different measurable signals: an electroencephalogram, a cardiogram, the cat’s heat production, the amount of CO₂ it exhales, and so on. A classical device that would display a superposition of the states ⟨alive⟩ + ⟨dead⟩ cannot be constructed; therefore, such a superposition is not a pointer state. Human sensory organs are themselves such devices, as is the environment surrounding the cat: EEG and ECG signals generate electromagnetic radiation in the environment, heat production raises its temperature, and CO₂ emission increases the ambient CO₂ concentration.
The mere existence of such “devices” already makes pointer states objective, because any number of observers can look at the pointers!
Can good and evil be pointer states? And if they can, then this would be an objective characteristic, understood in the same way by both humans and AI, and the alignment problem would already be solved!
If you only have unitary evolution, you end up with superpositions of the form
|system state 1⟩ |pointer state 1⟩ + |system state 2⟩ |pointer state 2⟩ + ... + small cross-terms
Are you proposing that we ignore all but one branch of this superposition?
My favorite point of view on the origins of Born’s rule is the following. The final state is a superposition, but we are all inside it.
And since these two states are orthogonal, |state 1⟩ does not see |state 2⟩, and vice versa; God only knows.
The works by Zurek (https://arxiv.org/pdf/1807.02092) and the more recent one (https://arxiv.org/html/2209.08621v6) shed more light on this.
Here one has to be very careful with the proof of such a multiverse picture, because, as usual, we replace the observed averaging of outcomes of experiments repeated in time in our world with the squared modulus of the (normalized) amplitude, interpreted as the probability of our world. This effectively means averaging over an ensemble of parallel worlds, whose number since the birth of the universe may be infinite.
The explanatory idea is there, but even in the 2025 paper it still looks underdeveloped. I don't understand this very well, so I can't give more details.
Can good and evil be pointer states? And if they can, then this would be an objective characteristic
This would appear to be just saying that if we can build a classical detector of good and evil, good and evil are objective in the classical sense.
What’s “The Plan”?
For several years now, around the end of the year, I (John) have written a post on our plan for AI alignment. That plan hasn’t changed too much over the past few years, so both this year’s post and last year’s are written as updates to The Plan - 2023 Version.
I’ll give a very quick outline here of what’s in the 2023 Plan post. If you have questions or want to argue about points, you should probably go to that post to get the full version.
So, how’s progress? What are you up to?
2023 and 2024 were mostly focused on Natural Latents - we’ll talk more shortly about that work and how it fits into the bigger picture. In 2025, we did continue to put out some work on natural latents, but our main focus has shifted.
Natural latents are a major foothold on understanding natural abstraction. One could reasonably argue that they’re the only rigorous foothold on the core problem to date, the first core mathematical piece of the future theory. We’ve used that foothold to pull ourselves up a bit, and can probably pull ourselves up a little further on it, but there’s more still to climb after that.
We need to figure out the next foothold.
That’s our main focus at this point. It’s wide open, very exploratory. We don’t know yet what that next foothold will look like. But we do have some sense of what problems remain, and what bottlenecks the next footholds need to address. That will be the focus of the rest of this post.
What are the next bottlenecks to understanding natural abstraction?
We see two main “prongs” to understanding natural abstraction: the territory-first prong, and the mind-first prong. These two have different bottlenecks, and would likely involve different next footholds. That said, progress on either prong makes the other much easier.
What’s the “territory-first prong”?
One canonical example of natural abstraction comes from the ideal gas (and gasses pretty generally, but ideal gas is the simplest).
We have a bunch of little molecules bouncing around in a box. The motion is chaotic: every time two molecules collide, any uncertainty in their velocity is amplified multiplicatively. So if an observer has any uncertainty in the initial conditions (which even a superintelligence would, for a real physical system), that uncertainty will grow exponentially over time, until all information is wiped out… except for conserved quantities, like e.g. the total energy of the molecules, the number of molecules, or the size of the box. So, after a short time, the best predictions our observer will be able to make about the gas will just be equivalent to using a Maxwell-Boltzmann distribution, conditioning on only the total energy (or equivalently temperature), number of particles, and volume. It doesn’t matter if the observer is a human or a superintelligence or an alien, it doesn’t matter if they have a radically different internal mind-architecture than we do; it is a property of the physical gas that those handful of parameters (energy, particle count, volume) summarize all the information which can actually be used to predict anything at all about the gas’ motion after a relatively-short time passes.
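Here's a toy numerical sketch of that information-wiping story (my own illustration, using a standard random pairwise energy-exchange model rather than a real gas simulation): two wildly different microstates with the same total energy become statistically indistinguishable after mixing, and the surviving predictions are fixed by the conserved energy alone.

```python
import numpy as np

# Toy model: N particles exchange energy through random pairwise "collisions"
# that conserve the pair's kinetic energy. Two very different initial
# microstates with the same total energy end up statistically
# indistinguishable; only (energy, particle count) survives.
rng = np.random.default_rng(0)
N, steps = 10_000, 200_000

def mix(v):
    v = v.copy()
    for _ in range(steps):
        i, j = rng.integers(0, N, size=2)
        if i == j:
            continue
        e = v[i] ** 2 + v[j] ** 2              # proportional to the pair's kinetic energy
        theta = rng.uniform(0, 2 * np.pi)      # redistribute it at a random angle
        v[i], v[j] = np.sqrt(e) * np.cos(theta), np.sqrt(e) * np.sin(theta)
    return v

v_a = rng.normal(0, 2.0, N)                    # microstate A: already thermal-looking
v_b = np.full(N, np.sqrt(np.mean(v_a ** 2)))   # microstate B: every speed identical,
                                               # but the same total energy as A
for label, v in [("A", mix(v_a)), ("B", mix(v_b))]:
    print(label, "mean v^2:", np.mean(v ** 2).round(3),
          "fraction with |v| > 4:", np.mean(np.abs(v) > 4).round(4))
# After mixing, both match the 1D Maxwell-Boltzmann (Gaussian) prediction
# determined by the conserved total energy alone.
```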
The key point about the gas example is that it doesn’t talk much about any particular mind. It’s a story about how a particular abstraction is natural (e.g. the energy of a gas), and that story mostly talks about properties of the physical system (e.g. chaotic dynamics wiping out all signal except the energy), and mostly does not talk about properties of a particular mind. Thus, “territory-first”.
More generally: the territory-first prong is about looking for properties of (broad classes of) physical systems, which make particular abstractions uniquely well-suited to those systems. Just like (energy, particle count, volume) is an abstraction well-suited to an ideal gas because all other info is quickly wiped out by chaos.
What’s the “mind-first prong”?
Here’s an entirely different way one might try to learn about natural abstraction.
Take a neural net, and go train it on some data from real-world physical systems (e.g. images or video, ideally). Then, do some interpretability to figure out how the net is representing those physical systems internally, what information is being passed around in what format, etc. Repeat for a few different net architectures and datasets, and look for convergence in what stuff the net represents and how.
(Is this just interpretability? Sort of. Interp is a broad label; most things called “interpretability” are not particularly relevant to the mind-first prong of natural abstraction, but progress on the mind-first prong would probably be considered interp research.)
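As a very rough sketch of what the simplest version of that loop could look like (toy scale, synthetic data, and my own choice of similarity measure, linear CKA from Kornblith et al. 2019, rather than anything this research program is committed to):

```python
import torch
import torch.nn as nn

# Train two architecturally different nets on the same data, then ask how
# similar their internal representations are. All names and the synthetic
# "environment" below are made up for illustration.
torch.manual_seed(0)

X = torch.randn(4096, 20)                       # inputs
latent = X[:, :3]                               # low-dimensional structure in the data
y = latent.norm(dim=1, keepdim=True) + torch.sin(latent[:, :1] * 3.0)

def make_net(widths):
    layers, d = [], 20
    for w in widths:
        layers += [nn.Linear(d, w), nn.ReLU()]
        d = w
    return nn.Sequential(*layers, nn.Linear(d, 1))

def train(net, epochs=200):
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(X), y)
        loss.backward()
        opt.step()
    return net

def penultimate_activations(net):
    with torch.no_grad():
        return nn.Sequential(*list(net)[:-1])(X)   # everything except the final linear layer

def linear_cka(A, B):
    """Linear CKA between two (samples x features) activation matrices."""
    A = A - A.mean(0)
    B = B - B.mean(0)
    hsic = (A.T @ B).norm() ** 2
    return (hsic / ((A.T @ A).norm() * (B.T @ B).norm())).item()

net1 = train(make_net([64, 64]))                   # deeper, narrower
net2 = train(make_net([256]))                      # shallower, wider
print("linear CKA between hidden representations:",
      round(linear_cka(penultimate_activations(net1), penultimate_activations(net2)), 3))
```

A real attempt at the mind-first prong would swap in image or video data, several genuinely different architectures and datasets, and much finer-grained interpretability than a single similarity score.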
In particular, what we’d really like here is to figure out something about how patterns in the data end up represented inside the net, and then go look in the net to learn about natural abstractions out in the territory. Ideally, we could somehow nail down the “how the natural abstractions get represented in the net” part without knowing everything about what natural abstractions even are (i.e. what even is the thing being represented in the net), so that we could learn about their type signature by looking at nets.
More generally: the mind-first prong is about looking for convergent laws governing how patterns get “burned in” to trained/evolved systems like neural nets, and then using those laws to look inside nets trained on the real world, in order to back out facts about natural abstractions in the real world.
Note that anything one can figure out about real-world natural abstractions via looking inside nets (i.e. the mind-first prong) probably tells us a lot about the abstraction-relevant physical properties of physical systems (i.e. the territory-first prong), and vice versa.
So what has and hasn’t been figured out on the territory prong?
The territory prong has been our main focus for the past few years, and it was the main motivator for natural latents. Some key pieces which have already been nailed down to varying extents:
… but that doesn’t, by itself, give us everything we want to know from the territory prong.
Here are some likely next bottlenecks:
And what has and hasn’t been figured out on the mind prong?
The mind prong is much more wide open at this point; we understand it less than the territory prong.
What we’d ideally like is to figure out how environment structure gets represented in the net, without needing to know what environment structure gets represented in the net (or even what structure is in the environment in the first place). That way, we can look inside trained nets to figure out what structure is in the environment.
We have some foundational pieces:
… but none of that directly hits the core of the problem.
If you want to get a rough sense of what a foothold on the core mind prong problem might look like, try Toward Statistical Mechanics of Interfaces Under Selection Pressure. That piece is not a solid, well-developed result; probably it’s not the right way to come at this. But it does touch on most of the relevant pieces; it gives a rough sense of the type of thing which we’re looking for.
Mostly, this is a wide open area which we’re working on pretty actively.