What would my 12-year-old self think of agent foundations?

This is a linkpost for https://namelessvirtue.com/2025/11/17/what-would-my-12-year-old-self-think-of-agent-foundations/

I knew I wanted to do science and math from a very early age. And I didn’t want to spend my life investigating just some particular phenomenon; I wanted to understand “everything”. You obviously can’t do that in a literal sense, so I focused on understanding things that were increasingly general. Generalizations are in some sense more “efficient” ways of understanding things.

In physics, the field that claims to seek the “theory of everything”, there are two obvious directions you can go with this. One is “up”, to astronomy and cosmology and the overall structure of the universe. The other is “down”, where you can look at the smallest particles. I was very into both of these, though it’s clear that the “down” direction is in some sense more fundamental. If you understand the laws behind the behavior of the smaller things, you can, in theory, use them to calculate what will happen to the bigger things. At 12 I knew that people had made huge progress toward finding the fundamental physical laws, and I was very excited to catch up on it.

But there also seem to be some other “directions” to generalize in if you want to efficiently understand everything. Mathematics is one of them, which is something like the “symbolic” direction. Philosophy is perhaps in the “conceptual” direction. And there’s also a direction that is something like the study of yourself, of how minds work. The study of what’s up with being the kind of thing that is inside the universe, observing and trying to understand it. This is a meta type of direction.

Over the years my preferences among these have shifted, but my 12-year-old self would not have been too surprised if I ended up going deep in any of these directions. Treatise on ontology? Awesome. Unified theory of the neocortex? Let’s go. But what’s agent foundations?

Well, I moved into agent foundations because I decided we needed to solve a problem, namely existential risk from AI. But the field mostly tries to help with that problem by figuring out what the heck is going on with the phenomenon of agents. Which is to say, by understanding it.

I think my younger self would be pretty confused, at least for a while, about why I’m into this, but I could probably explain it given enough time.

Above I mentioned the study of physics at the biggest scale and the smallest scale. There is obviously a lot going on in the middle, but from some perspective it feels mostly arbitrary. Like, there just happens to be water and flowers and binary star systems. All those things are interesting insofar as they are in the set of “everything”, but it doesn’t feel like understanding them has much generalization power.

I claim (to my 12-year-old self) that there is actually a generalized theory of things going on in the middle. That is, a generalized theory, not about specifically what’s going on in the middle, but about what it means that something could be said to be going on in the middle. (That sentence may have lost the reader. It also may have lost my 12-year-old self, but he’s now very excited to understand what I meant by it.)

For example, what exactly does it mean when we say that Newtonian mechanics is a good approximation of the true laws of physics? If you handed someone only the true laws of physics, how could they have figured out, in principle, that Newtonian mechanics was a good approximation of what they were holding? Are there other possible good approximations they could have figured out instead? Can we rigorously define the set of all possible good approximations, given some true physical laws?

This is relevant to agent foundations because agents have models of the world inside them. They use these models to successfully achieve their goals, so the models must be good approximations by that standard. If we want the agent to achieve our goals, then it probably needs a world model that is compatible with ours, at least in the parts that describe our goals.

We currently do not know how to formally state this, and I think that’s a barrier to being able to ensure it in practice.

Another word you could use for “world model” or “good approximation” is “theory”, so in some sense this part of my work is studying the theory of theories. Which, yeah, my 12-year-old self would be pretty thrilled about.

williawa (1mo)

I feel very much like this too. If we could find another person who feels like this I'd suggest putting it on the list of rationalist vices, because I think it's possible to stray too far in that direction. At least I've come to consider this inclination a flaw in myself. I rarely want to learn about anything that seems historically contingent, or not on the path to understanding the full picture. But IRL those are often the most important things to learn, or important things to look at to understand stuff. Like learning vim and Unix commands hahaha. Maybe someone will get so good at vim they see the underlying structure of the whole of reality reflected in that program, but I doubt it.
