I describe the philosophy of science that I follow in my research. I first describe the underlying reasoning on its own, then walk through how I apply it to my own assumptions about what is needed to make the AI transition go well. I use the term Agent Foundations a bit more broadly in this post, with a focus on the hierarchical agency side of things.
Epistemic Status: None of this was initially written with AI; I just randomly got a good flow and one-shotted it in one evening. It was quite nice, but when I asked a friend for feedback he said it was way too schizo in terms of structure, so I got Claude to help me restructure it. Roughly 10-15% is AI writing.
When you first get into AI Safety, a common feeling is confusion, followed by more confusion with a little bit of “why is this not a bigger issue?” on top. Then comes a succession of moments where you think you've figured out what the problem is, only to realise it's more interconnected and deeper than you thought, and you're confused again. This repeats for about 2 years until you suddenly feel that the problem makes more sense and you can see the interconnecting parts (at least that's how it was for me). Then you're back to being confused about what to do about it, and you stay that way whilst working on it.
There are a bunch of stacking factors behind the difficulty of finding solutions: it's not a traditional scientific field, and it has a bunch of thorny mathematical and philosophical problems built into it. At the same time there are governance and incentive issues, and some part of you feels like it is philosophy-complete (you would have to figure out all of philosophy to solve it).
Then you see the excellent research plans in AI safety (Wentworth's, Byrnes's, Kosoy's, Davidad's, and more) and you start to see that there are at least some ways of tackling it, yet they seem quite hard to follow through on. And besides, they depend on specific details that are hard to verify when you see them for the first time. Each has a set of intuitions informing its direction, but how do you know whether those intuitions are correct? (you spend 3 years confused until it clicks, duh)
How do you know whether they will provide insights that bring the field forward? How do you figure that out for yourself? What is a reasonable philosophy of science for problems that span fields and where you don’t know where the solution is coming from?
I will be describing the one that I'm currently a believer in. It is highly dependent on the fact that I'm stupid compared to the collective intelligence of the alignment field, let alone of science in general. It leans heavily on the skills of distillation and connection between different fields as a way to generate new ways of asking questions, since the problem space is a lot larger than the solution space. I believe this philosophy of science is especially pertinent for pre-paradigmatic fields, where the questions to ask and the connections to other fields aren't fully formed.
Modelling scientific progress
To understand why connection-making is high-leverage, we need to look at how progress happens. We'll begin at the most abstract level: macrohistory and civilization.
There are many ways to model the progress of civilization. You can look at it as a general increase in energy capture, as in Energy and Civilization, or as a process of ever-increasing sophistication of control structures, as in Seeing Like A State. Maybe you could see the decentralized cycles of it in Against The Grain, or look at secular cycles with Peter Turchin. But none of these explain the generation of ideas or networks of ideas. For that, we need cultural evolution.
Joseph Henrich's The Secret of Our Success makes a compelling case that the central engine of human progress is combinatorial: ideas are modular components, and progress happens when the right components find each other. The key variable isn't individual genius; it's how well ideas spread through a network of people.
Imagine any type of progress as a sort of genetic combination of modular things. You get a fitness landscape with different minima, and it can be quite hard to get to the deeper ones. The third minimum might be something like 3 genetic mutations away, creating a valley you can't actually cross:
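To make the valley concrete, here's a minimal toy version (the fitness numbers are invented for illustration): greedy single-mutation search never reaches the deep optimum, because every intermediate step looks worse.

```python
# Toy landscape over 3 binary "mutations": the deep optimum needs all three
# at once, and every single-mutation step towards it goes downhill first.
def fitness(genome):           # genome: tuple of three 0/1 "mutations"
    present = sum(genome)
    return {0: 1.0, 1: 0.5, 2: 0.5, 3: 3.0}[present]   # valley at 1 and 2

def greedy_step(genome):
    """Flip whichever single bit most improves fitness, if any does."""
    best = genome
    for i in range(3):
        candidate = tuple(bit ^ (j == i) for j, bit in enumerate(genome))
        if fitness(candidate) > fitness(best):
            best = candidate
    return best

start = (0, 0, 0)
print(greedy_step(start) == start)  # True: greedy search is stuck at the local optimum
```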
Here's a concrete example from the book. It's nutritious and yummy to eat ants, but not every society knows how to do it. You need to find 3 different sub-modules — the right stick, the right technique, the right processing — and combine them:
Now imagine a social network where these ideas are randomly instantiated with some probability and then spread between connected people. It might look something like this:
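As a stand-in, here's a toy simulation of that spread (ring network, all parameters invented): idea-modules seed at random, diffuse between neighbours, and we check how often someone ends up holding the full combination.

```python
import random

def p_combination_found(n_people, n_modules=3, p_seed=0.05,
                        n_rounds=20, n_trials=100):
    """Fraction of trials in which some person ends up holding all modules,
    after ideas seed randomly and then spread between network neighbours."""
    hits = 0
    for _ in range(n_trials):
        # Each person starts with each idea-module independently with p_seed.
        people = [{m for m in range(n_modules) if random.random() < p_seed}
                  for _ in range(n_people)]
        for _ in range(n_rounds):
            # Everyone copies one module from a random neighbour on a ring.
            for i in range(n_people):
                j = (i + random.choice((-1, 1))) % n_people
                if people[j]:
                    people[i].add(random.choice(sorted(people[j])))
        if any(len(mods) == n_modules for mods in people):
            hits += 1
    return hits / n_trials

for n in (30, 300, 3000):
    print(f"{n:5d} people -> combination found in {p_combination_found(n):.0%} of trials")
```

The exact numbers are meaningless; what matters is the direction of the trend as the network grows.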
Two things fall out of this model. First, the larger the network of people sharing ideas, the more likely you are to find the right combinations. The returns scale superlinearly: a society with 10,000 people is disproportionately better at finding combinations than a smaller one, and this is a factor in how Henrich explains the quicker development of Eurasia.
Second, we can model ourselves as part of this larger network. Imagine two types of animals: hedgehogs and butterflies.
Butterflies are great at propagating existing ideas that other people have had: high connectivity, lots of cross-pollination. Hedgehogs have lower connectivity but a higher probability of coming up with genuinely novel ideas (since they spend more time on their own). According to Henrich, in larger networks being a butterfly becomes relatively more important, because finding the right combinations matters more than generating new components from scratch.
Great, we solved philosophy of science, talk to as many people as you can about AI Safety and everything will be solved! … not.
Propagation of knowledge is only useful if it's actually useful knowledge, and the bar for making progress on deep issues in AI Safety is quite high. Maybe work on the smallest possible problems that give incremental improvements?
Yet even if we make good incremental progress, not all problems factorize well, and progress will be bottlenecked by the serial components (Amdahl's law type shit).
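For concreteness, here's the Amdahl's law arithmetic (the 20% serial fraction is just an illustrative number):

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Amdahl's law: total speedup when only the parallel part scales."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# With a 20% serial component, even unbounded parallelism caps out at 1/0.2 = 5x.
for n in (10, 100, 10_000):
    print(f"{n:6d} workers -> {amdahl_speedup(0.2, n):.2f}x speedup")
```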
So what is one to do? Simple really: just identify the bottlenecks and solve them! Find the cool hedgehogs who have the solutions to the problems and make them talk to each other by translating their languages!
I think a good strategy for someone who can understand hedgehogs is to talk to them and be their butterfly. Be a connector of deep cool ideas and combine them. If you can connect larger networks to the existing network of ideas then you're also doing a good thing as it increases connectivity.
Yet as alluded to, being a connector without verification mechanisms is how you get the worst parts of the internet. Too high connectivity without adequate filtering is actively dangerous. The hyperconnectivity of social media networks leads to fewer distinct communities and less clear boundaries between subgroups. This also leads to problems of unaccountability — no clear entry point for things and no clear defender of the epistemic commons. When everything is connected to everything and nobody has skin in the game, you get a system that's very good at spreading ideas and very bad at knowing which ones are true.
This is where Nassim Taleb enters the picture, loudly and profanely, (you hear the word cocksucker echo in the distance). His core insight about skin in the game is really an insight about epistemic filtering: if you don't have to live with the consequences of your ideas, you have no feedback loop. And ideas without feedback loops are just vibes. The entire edifice of science — experiments, predictions, replication — is a set of mechanisms for forcing ideas to face consequences. The question is whether those mechanisms are fast enough for the problem you're working on.
So the methodology needs both: systematic connection-making to find combinations that no single field would discover, AND ruthless verification to kill the ones that don't work. The question is what those verification mechanisms look like when you're working across fields where the standards of evidence are different.
Don't be a Twitter vibe connector; be a scientific connector with a good verification stack! But what does a good verification stack look like?
The verification stack
Okay, so we need verification to not spread slop. The obvious answer is "do science." But...
Science seems kind of slow given timelines, and sure, it can be good to have journals and the like to gatekeep stuff, but that whole place seems to have become a bit moral-mazy? Like, from an outside perspective it just seems too slow and too dependent on the tastes of whatever field you're trying to publish in?
So what does a faster verification process look like that still has teeth? I think you can model it as Bayesian evidence gathering at different resolutions. You're trying to update on whether your ideas are any good, and different types of evidence give you different amounts of bits. The trick is ordering them by cost — cheap bits first, expensive bits later.
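As a minimal sketch of what that ordering could look like (test names, costs, and likelihood ratios all invented):

```python
from math import log2

# Hypothetical tests for some idea: (name, cost in hours, likelihood ratio
# P(test passes | idea is good) / P(test passes | idea is bad)).
tests = [
    ("thought experiment",    1,   4.0),
    ("real-world analogue",   5,   8.0),
    ("toy simulation",       40,  20.0),
    ("build it for real",   400, 100.0),
]

# A passing test gives log2(likelihood ratio) bits of evidence;
# run the tests in order of bits per hour, cheapest bits first.
for name, cost, lr in sorted(tests, key=lambda t: log2(t[2]) / t[1], reverse=True):
    print(f"{name:20s} {log2(lr):5.2f} bits @ {cost:3d}h -> {log2(lr) / cost:.3f} bits/h")
```

The likelihood ratios are the part you actually have to estimate, and that's where the judgment lives.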
John Wentworth has good ideas about this in his MATS Models. One of the main things I took away was: do thought experiments first. They're almost free and they kill a surprising number of bad ideas. Before you run an expensive simulation, look at the real world — there's a bunch of bits there just waiting to be picked up. Can you explain your idea to someone in a different field and have it make sense? Can you find a concrete example where the pattern you're claiming to see actually shows up? Can you state what would falsify it? These are cheap tests that most bad ideas fail.
Then you simulate. You build something small and computational that forces your framework to actually produce predictions rather than just describe things after the fact. Then you try to build something in the real world and see what happens.
But at every stage of this you need to beware of premature optimisation, because you are probably not measuring what you think you're measuring. Knowing when you've actually verified something, versus when you've merely found confirming evidence, is hard. Which suggests you should spend real time exploring the problem space itself before trying to optimise anything: find different framings, restate the problem in different languages, look at it from unfamiliar angles. This is the Feynman move. He spent a bunch of time playing bongo drums and thinking about problems, a classic shape rotator. And Eliezer makes a related point in Einstein's Arrogance: you probably know where to go before you've rationalised it, because the solution doesn't arrive exactly when you have the right bits. It's really a question that lives in the problem space itself, not the solution space.
So maybe the choice of prior determines a lot, and if science is fundamentally about asking the right question, then the most productive thing you can do is learn to ask it in various ways. This is maybe the purpose of distillation — not just making existing work accessible, but reframing it so that you get multiple viewing angles on the same underlying structure. Each angle gives you a different set of bits, and once you have enough independent angles, you can start to actually tell whether you're seeing something real or just confirming what you already believed. The verification comes from the triangulation, not from any single perspective.
Composition without reduction
So if verification comes from triangulation — from having enough independent angles on the same structure — then the question becomes: how do you actually get those angles? The standard move is reductionism. You explain one field in terms of another. Markets are really just individual decision-making. Biology is really just chemistry. The problem is that this throws away exactly the information you need. If you reduce markets to individual decisions, you lose the emergent properties that make markets interesting in the first place. You need a way to connect fields that respects what each one actually knows.
It turns out that this is kind of the whole point of category theory, or at least the more mortal version called applied category theory (afaik). The core idea — and Sahil explains this well in his Live Theory post and it's something that Topos Institute reeks of — is finding the intersection rather than the reduction. You're not trying to say "field A is really just field B." You're trying to say "field A and field B are both doing something that has this structure in common, and that shared structure is where the interesting bits live."
If you want to combine insights from incompatible research networks (people who use completely different ontologies and don't read each other's papers) you should probably take inspiration from here. (Also, here are some examples of applied category theory explaining a bunch of cool shit in Machine Learning and Programming.)
There are, however, some ways you can do this wrongly, mainly generalisation for the sake of generalisation. If you're not solving a specific problem with it, your shoulder Taleb will start screaming and it will be impossible to do good work. This is a trap that some mathematicians obsessed with generalisation fall into, since the cool kids (Grothendieck among others) made a bunch of cool stuff happen with it. I think this might be the problem with category theory from an external perspective: it is mostly math wizards who do category theory, so it is mainly used for math wizardry.
As a mere math wizard apprentice I kind of don't wanna go there, so I imagine myself more as a category theory rogue: I take what I like, which means I take the diagrams and the ideas without most of the rigour and run with it. It's fun and works well as long as you enforce the rigour later, through either expert verification or simulation (which has a very nice bridge to functions and functional programming!).
A basic example of using applied category theory in a rogue way is the following: if one looks deeply at agent definitions in biology, they generally follow cybernetic definitions, and cybernetics is also a parent category of the utility structures in AI alignment. So if you wanna do some weird Michael Levin shit on AI systems, cybernetics seems like quite a nice lens to do it from! (Some initial observations come from phylogeny-and-taxonomy-of-agents style work; for now these claims are more at the “trust me bro” level though.)
In other words: by generalising the assumptions of different fields with “applied category theory”, both the principal-agent style utility function definitions in RL and the teleological network science of biology and morphology can be viewed through the shared lens of cybernetic control.
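To gesture at what that shared structure could look like in code, here's a toy sketch (all the names and numbers are mine, not an established formalism): one cybernetic control shape, instantiated once as an RL-flavoured utility agent and once as a Levin-flavoured cell.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # what the controller senses
A = TypeVar("A")  # what the controller can do

@dataclass
class CyberneticController(Generic[S, A]):
    """Shared shape: sense a state, compare it to a target, act to shrink the error."""
    target: S
    error: Callable[[S, S], float]
    policy: Callable[[float], A]

    def step(self, observed: S) -> A:
        return self.policy(self.error(self.target, observed))

# An RL-flavoured agent: error is distance from a goal value (a utility gap).
rl_agent = CyberneticController(
    target=1.0,
    error=lambda goal, state: goal - state,
    policy=lambda e: "explore" if abs(e) > 0.1 else "exploit",
)

# A Levin-flavoured cell: the same shape, with a homeostatic setpoint.
cell = CyberneticController(
    target=37.0,
    error=lambda setpoint, state: setpoint - state,
    policy=lambda e: "signal neighbours" if abs(e) > 0.5 else "rest",
)

print(rl_agent.step(0.3), "|", cell.step(36.2))  # explore | signal neighbours
```

The point isn't the code itself; it's that both instances type-check against the same interface, which is the (very) informal version of finding the shared parent category.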
Applying it
Okay, so what do I actually think? The methodology above doesn't tell you what to work on; it tells you how to work on hard interdisciplinary problems once you've picked a direction. Someone with different assumptions about the world would apply it completely differently. So here's where I show my hand: here are my assumptions, here's where they led, and here's what I'm currently trying to do about it. (i.e. show your work!)
If I look at alignment — the actual thing everyone is worried about — I think it's fundamentally a principal-agent problem. A principal wants an agent to do something, the agent has its own dynamics, and you need some mechanism to keep them roughly pointed in the same direction. Yet why do we think about it as individual principal-agent dynamics? When I look at the real world, most alignment schemes that work at scale are distributed. States align millions of people through governance structures. Economies align producers and consumers through market mechanisms. Biology aligns trillions of cells into coherent organisms through signalling networks. None of these are single-principal, single-agent setups. They're all collective.
Which makes it kind of weird that most technical AI safety research frames alignment as a problem between one human (or one human principal) and one AI system. If there's a solution to alignment that actually scales, then on priors it probably looks more like the solutions that already exist in the world, and those are all collective intelligence solutions of one kind or another.
(Yes I agree that the general idea of negotiation with an agent that is a lot smarter than you is basically impossible if it isn't value aligned. Yet that is dependent on us not having smart enough collective intelligence to just make it a smaller part of the entire system.)
I'm not the first person to notice this. Andrew Critch has been on exactly this journey, moving from single-agent alignment toward multi-agent and multi-principal problems. Allan Dafoe's research agenda for AI governance talks about ideal governance — what governance ought to look like, not just what it currently is. Tom Malone's work on superminds at MIT explores how groups of humans and machines can be made collectively intelligent. But when I look at what's actually happening in technical AI governance, it's almost entirely about compute tracking, export controls, hardware monitoring. Important stuff, but where are the deeper computational models? Where is the exploration of what ideal governance actually means in formal terms?
Most of the research seems to be about what is — describing current systems, current failures, current dynamics. Very little is about what ought to be. And "ought" is a hard thing to reason about formally, but it's not impossible if you have the right tools.
Following the methodology (look for fields that study related phenomena under different formalisms, find the intersections) led to some specific places. Karl Friston's work on active inference and its connection to AI alignment offers a principled account of how agents form beliefs and coordinate. Michael Levin's work on morphological computation shows how biological systems solve alignment across scales without central control: cells coordinate into organs, which coordinate into organisms, all through local signalling. Jan Kulveit, PIBBSS, and ACS have been doing serious work on hierarchical agency, i.e. how you define multi-scale agentic structures. We also have boundaries research, about where one agent ends and another begins, which turns out to be a prerequisite for talking about multi-agent alignment at all. Cooperative AI and computational social science talk a lot about how to create the conditions for cooperation at scale, and mechanism design is about the strategy-proofing of these dynamics.
The problem is that these fields don’t talk to each other enough.
So if I want to explore the space of what ought to be in terms of AI governance — not just describe what's currently happening — I need to somehow respect what each of these fields actually knows. Agent foundations because you can't talk about alignment without being precise about what agents are. Computational social science because governance mechanisms are the things that actually do the aligning at scale. Biology because living systems are the existence proof that alignment across scales works, and they solve it very differently from how engineers would design it. Cooperative AI because multi-agent interaction is the substrate all of this runs on.
That's four traditions with four different ontologies, four different mathematical languages, four different literatures. I also think that picking one and reducing the others to it doesn't work for exactly the reason we just talked about. Reducing biology to agent foundations loses what biology knows about growing alignment rather than engineering it. Reducing governance to cooperative AI loses what political theory knows about legitimacy and institutional design. You need composition, not reduction.
Or in other words, you'd need something like a Langlands program for collective intelligence. The Langlands program is maybe the greatest example of the applied category theory move in mathematics: taking number theory and harmonic analysis, which looked like completely separate fields, and finding the deep structural correspondences between them. Not reducing one to the other. Finding the dictionary.
So this is what I'm currently trying to do. Open research at a very specific intersection, trying to generate a compositional basis that lets you move between these fields while preserving what each one actually knows. On the theory side, working toward a mathematical unification — what are the shared structures between agent foundations, governance mechanisms, biological coordination, and cooperative AI? On the implementation side, building functional simulation infrastructure to verify that the compositions actually compute something rather than just looking pretty on paper.
It's hard and I'm not sure it'll work. But the methodology says: find the intersections, verify through triangulation, and if a single framework can simultaneously say something true about agents AND governance AND biology AND cooperation, that's a lot of bits from a lot of independent angles.
Conclusion
The argument is roughly: progress on hard interdisciplinary problems is combinatorial — it comes from connecting ideas across networks, not from isolated genius. But connection without verification is just noise, and most of what gets connected is garbage. So you need cheap ways to gather bits before expensive ones: thought experiments first, then simulation, then reality. And you need to spend serious time in the problem space before optimising, because premature optimisation against the wrong metric is worse than no optimisation at all.
When you do start connecting fields, the move is composition rather than reduction — respecting what each field actually knows rather than flattening everything into one formalism. Applied to AI alignment specifically, I think the problem is collective, the principal-agent problem is distributed, and the fields that study this contain a lot of bits that aren't being combined. Whether the specific attempt I'm making works is an open question, but given these assumptions it seems like the right thing to try.
Concretely, if you buy this:
Before optimizing, spend real time restating your problem in at least 2-3 different fields' languages. If you can't, you probably don't understand it well enough to be a field connector.
Order your verification by cost. Thought experiments and "does this match any real-world system" before simulation, simulation before building.
When you find a connection between fields, ask whether you're reducing or composing. If you can't say what each field knows that the other doesn't, you're reducing.
Find hedgehogs and learn their language well enough to translate, but have a falsification criterion before you spread the connection. Your value-add is the combination, not the components.
If any of this seems interesting, here's the actual work in various stages of development.
Thanks to Aaron Halpern, Chris Pang and Gunnar Zarnacke for feedback
Appendix: Current applications of this methodology
Some current applications of this methodology to specific problems, in various stages of development. Most of these will hopefully have a shorter blog post explaining them in the next 6 months or so but if you wanna check them out, please do so.
Public posts:
Overleaf PDFs:
When I write out ideas, I use LLMs to help me. I usually find the papers myself through previous exposure or Elicit but I write rough first drafts with Claude so beware! (And if you're allergic to LLM writing don't check it out.)