Agent Foundations · Empiricism · Guaranteed Safe AI · Practice & Philosophy of Science · AI · World Modeling
Agent foundations: not really math, not really science

by Alex_Altair
17th Aug 2025
6 min read
25 comments, sorted by top scoring
[anonymous] · 1mo

(warning: this comment is significantly longer than the post)

I want to begin by saying that I appreciate the existence of this post. Truly and honestly. I think it's important to praise those who at least try explaining difficult topics or hard-to-communicate research intuitions, even when those explanations are imperfect or don't fulfill their intended purpose. For it's only by incentivizing genuine attempts that we have a chance of obtaining better ones or dispelling our confusions. And, at the very least, this post represents a good vehicle for me to express my current disagreement with/disapproval of agent foundations research.[1]

Nevertheless, in the interest of honesty, I will say this post leaves me deeply unsatisfied. Kind of like... virtually every post ever made on LW that tries to explain agent foundations? At this point I don't think there's anything about the authors[2] that causes this, but rather the topic itself which doesn't lend itself nicely to this type of communication (but see below for an alternate perspective[3]).


Let's start with the "Empirics" section. Alex Altair writes:

From where I'm standing, it's hard to even think of how experiments would be relevant to what I'm doing. It feels like someone asking me why I haven't soldered up a prototype. That's just... not the kind of thing agent foundations is. I can imagine experiments that might sound like they're related to agent foundations, but they would just be checking a box on a bureaucratic form, and not actually generated by me trying to solve the problem.

It's... hard to see how experiments would be relevant to what you're doing? Really? The point of experiments is to ensure that the mathematical frameworks you are describing actually map onto something meaningful in reality, as opposed to being a nice, quaint, self-consistent set of mathematical symbols and ideas that nonetheless reside in their own separate magisterium without predicting anything important about how real life shakes out. As I have said before:

There's a famous Eliezer quote about how for every correct answer to a precisely-stated problem, there are a million times more wrong answers one could have given instead. I would build on that to say that for every powerfully predictive, but lossy and reductive mathematical model of a complex real-world system, there are a million times more similar-looking mathematical models that fail to capture the essence of the problem and ultimately don't generalize well at all. And it's only by grounding yourself to reality and hugging the query tight by engaging with real-world empirics that you can figure out if the approach you've chosen is in the former category as opposed to the latter.

Connor Leahy has written:

Humans are really, really bad at doing long chains of abstract reasoning without regular contact with reality, so in practice imo good philosophy has to have feedback loops with reality, otherwise you will get confused.

I have written:

Idk man, some days I'm half-tempted to believe that all non-prosaic alignment work is a bunch of "streetlighting." Yeah, it doesn't result in the kind of flashy papers full of concrete examples about current models that typically get associated with the term-in-scare-quotes. But it sure seems to cover itself in a veneer of respectability by giving a (to me) entirely unjustified appearance of rigor and mathematical precision and robustness to claims about what will happen in the real world based on nothing more than a bunch of vibing about toy models that assume away the burdensome real-world details serving as evidence of whether the approaches are even on the right track. A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first batch of really powerful AIs.

I fear there is a general failure mode here that people who are not professional mathematicians tend to fall into when they think about this stuff: people who have read Wigner's Unreasonable Effectiveness of Mathematics, and Hamming's follow-up to it, and Eliezer's Sequences vibing about the fundamentally mathematical nature of the universe, and whose main takeaway from them is that elegance and compact-descriptiveness in pure mathematics is some sort of strong predictor of real-world applicability. They see all these examples of concepts that were designed purely to satisfy the aesthetic curiosities of pure mathematicians, but afterwards became robustly applicable in concrete, empirical domains.

But there is a huge selection effect here. You only ever hear about the cool math stuff that becomes useful later on, because that's so interesting; you don't hear about stuff that's left in the dustbin of history. It's difficult for me to even put into words a precise explanation meant for non-mathematicians[4] of how this plays out. But suffice it to say the absolute vast majority of pure mathematics is not going to have practical applicability, ever. The vast majority of mathematical structures selected because they are compact and elegant in their description, or even because they arise "naturally"[5] in other domains mathematicians care about, are cute structures worth studying if you're a pure mathematician, but almost surely irrelevant for practical purposes.

Yes, the world is stunningly well-approximated by a relatively compact and elegant set of mathematical rules.[6] But there are infinitely more short and "nice" sets of rules which don't approximate it.[7] The fact that there is a solution out there doesn't mean other posited solutions which superficially resemble it are also correct, or even close to correct. There is no "theory of the second-best" here. And those rules were found through experiments and observations and rigorous analysis of data, not merely through pre-empirical daydreaming and Aristotelian "science."

Eliezer loves talking about the story of how Einstein knew his theory was correct based on its elegance, and how when he was asked by journalists what he'd do if Eddington falsified his theory, he would say “Then I would feel sorry for the good Lord. The theory is correct.” But that's one story! One. And it's cool and memorable and fun and you give it undue weight for how Deeply Awesome it feels. But it still generates the same selection effect I mentioned before. If you peer through the history of science, even prior to Einstein during the time of Galileo and Kepler, and especially after Einstein and further developments we've had in the field of physics, you'll see the story is not representative of the vast, vast majority of how science is done. That empirics reigns, and approaches that ignore it and try to nonetheless accomplish great and difficult science without binding themselves tight to feedback loops almost universally fail.

But also pay attention to the reference class it's in! Physics, I said above. Why physics? How do we know that's at all representative of the type of science we're interested in? If the choice of field didn't matter, meaning the histories of different fields of inquiry were fundamentally similar, then it wouldn't matter much which specific subfield we focus on. And yet it does!

Shankar Sivarajan has written:

The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."

The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is. A physicist's view. It is one I'm deeply sympathetic to, and if your definition of science is Rutherford's, you might be right, but a reasonable one that includes chemistry would have to include AI as well.

Richard Ngo has written:

Villiam: I have an intuition that the "realism about rationality" approach will lead to success, even if it will have to be dramatically revised on the way.

To explain, imagine that centuries ago there are two groups trying to find out how the planets move. Group A says: "Obviously, planets must move according to some simple mathematical rule. The simplest mathematical shape is a circle, therefore planets move in circles. All we have to do is find out the exact diameter of each circle." Group B says: "No, you guys underestimate the complexity of the real world. The planets, just like everything in nature, can only be approximated by a rule, but there are always exceptions and unpredictability. You will never find a simple mathematical model to describe the movement of the planets."

The people who finally find out how the planets move will be spiritual descendants of the group A. Even if on the way they will have to add epicycles, and then discard the idea of circles, which seems like total failure of the original group. The problem with the group B is that it has no energy to move forward.

The right moment to discard a simple model is when you have enough data to build a more complex model.

Richard Ngo: In this particular example, it's true that group A was more correct. This is because planetary physics can be formalised relatively easily, and also because it's a field where you can only observe and not experiment. But imagine the same conversation between sociologists who are trying to find out what makes people happy, or between venture capitalists trying to find out what makes startups succeed. In those cases, Group B can move forward using the sort of "energy" that biologists and inventors and entrepreneurs have, driven by an experimental and empirical mindset. Whereas Group A might spend a long time writing increasingly elegant equations which rely on unjustified simplifications.

Instinctively reasoning about intelligence using analogies from physics instead of the other domains I mentioned above is a very good example of rationality realism.

jamii has written:

Uncontrolled argues along similar lines - that the physics/chemistry model of science, where we get to generalize a compact universal theory from a number of small experiments, is simply not applicable to biology/psychology/sociology/economics and that policy-makers should instead rely more on widespread, continuous experiments in real environments to generate many localized partial theories.


Anyway, enough on that.[8] Let's move on to "What makes agent foundations different?" In the very first paragraph, Alex Altair writes:

One thing that makes agent foundations different from science is that we're trying to understand a phenomenon that hasn't occurred yet (but which we have extremely good reasons for believing will occur). I can't do experiments on powerful agents, because they don't exist.

The first sentence is false. Science routinely tries to understand phenomena it has good reason to believe exist but hasn't yet been able to pin down exactly and concretely. The paradigmatic and illustrative example of this is the search for room-temperature superconductors. This is primarily done by scientists, not by engineers.

But more to the point, the second sentence also reads as substantively dubious. You don't have all-powerful ASI to experiment on, but here's what you do have:

  • the sole example of somewhat-aligned generally-intelligent occasionally-agentic beings ever created, namely humans
  • somewhat agentic, somewhat intelligent, easy-to-query-and-interact-with AI models that you can (very cheaply!) run recurrent experiments on to test your theories

Does your theory of agency have nothing to say about either of them? Then why on Earth would you assume any partial results you obtain are anywhere close to reliable? Are you assuming a binary dichotomy between something that's "smart and agentic, so our theories apply" on one end, and "dumb and unagentic, so our theories don't apply" on the other end?[9] 

If that's so, and even humans fall into the latter, then I also don't see why your theories would have any applicability in the most safety-critical regime, i.e., when the first powerful models are created. Nate Soares has written:

By default, the first minds humanity makes will be a terrible spaghetti-code mess, with no clearly-factored-out "goal" that the surrounding cognition pursues in a unified way. The mind will be more like a pile of complex, messily interconnected kludges, whose ultimate behavior is sensitive to the particulars of how it reflects and irons out the tensions within itself over time.

If your theory of Agent Foundations has nothing to say about current AI, and nothing to say about current generally-intelligent humans,[10] does it have anything to say about the actual AGI we might create?

Alex Altair also writes:

So, I don't think that what we're lacking is data or information about the nature of agents -- we're lacking understanding of the information we already have.

Well, actually, you are lacking some data or information about something you care about when it comes to agency: namely, to what extent the models we're interested in aligning are actually well-modeled as agents. When I look at other humans around me (and at myself), I see beings that are well-approximated as agents in some ways, and poorly-approximated as agents in other ways. Figuring out which aspects of cognition fall on which side of this divide seems like an absolutely critical (maybe even the absolutely critical) question I'd want agent foundations to give me a reliable answer to. Are you not even trying to do that?

Let me give a concrete example. A while ago, spurred on by observing many instances of terrible arguments in favor of treating relevant-agents-as-utility-maximizers, I wrote a question post on "What do coherence arguments actually prove about agentic behavior?" Do you think you have a complete answer to this question? And do you think you don't even need any new data or information to answer it?
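As one concrete illustration of the sort of cheap experiment I have in mind (a sketch of mine, not anything from the linked question post): check whether a system's pairwise choices are even transitive, since transitivity is among the weakest properties the coherence arguments trade on. The `choose` function below is a hypothetical stand-in for however you would actually elicit choices, such as prompting an LLM to pick between two options; the rock-paper-scissors policy is only there so the check has something to find.

```python
# Minimal coherence check: elicit pairwise choices and look for preference cycles.
# `choose` is a placeholder for the system under test, not a real API.
from itertools import combinations, permutations

def choose(option_a, option_b):
    # Placeholder policy with a deliberate preference cycle.
    beats = {("rock", "scissors"), ("scissors", "paper"), ("paper", "rock")}
    return option_a if (option_a, option_b) in beats else option_b

def find_intransitive_triples(options):
    prefers = {}
    for a, b in combinations(options, 2):
        winner = choose(a, b)
        prefers[(a, b)] = winner == a
        prefers[(b, a)] = winner == b
    return [
        (a, b, c)
        for a, b, c in permutations(options, 3)
        if prefers[(a, b)] and prefers[(b, c)] and prefers[(c, a)]
    ]

print(find_intransitive_triples(["rock", "paper", "scissors"]))
# -> three rotations of the same preference cycle, i.e. a money-pumpable agent
```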


And finally, Alex Altair talks about how "It's kinda like computer science." And he writes:

One needs to have some kind of life experiences that point your mind toward the relevant concepts, like "computing machines" at all. But once someone has those, they don't necessarily need more information from experiments to figure out a bunch of computability theory.

One does need information from experiments to know that computability theory is at all useful/important in the real world. And also to know when it matters, and when it doesn't or is an incomplete description of what's going on.[11] Which is what I hope agent foundations is about. Something useful for AI safety. Something useful in practice. If it's a cute branch of essentially-math that doesn't necessarily concern itself with saving the world from AI doom, why should anyone give you any money or status in the AI safety community?

In any case, the analogy to computation also feels like the result of a kind of selection effect. Connor Leahy has written:

It is not clear to me that there even is an actual problem to solve here. Similar to e.g. consciousness, it's not clear to me that people who use the word "metaphilosophy" are actually pointing to anything coherent in the territory at all, or even if they are, that it is a unique thing. It seems plausible that there is no such thing as "correct" metaphilosophy, and humans are just making up random stuff based on our priors and environment and that's it and there is no "right way" to do philosophy, similar to how there are no "right preferences". I know the other view ofc and still worth engaging with in case there is something deep and universal to be found (the same way we found that there is actually deep equivalency and "correct" ways to think about e.g. computation).

Yes, we have found a deep mathematical way of thinking about computation. Something simple, compact, elegant. But saying agent foundations is like the study of computation... kind of hides the fact that the theory of computation might be sort of a one-off in terms of how nice it is to formalize properly? As I see it, and as Leahy writes above, we have examples of things that make sense intuitively and that we could formalize nicely (computation). And we also have examples of things that make sense intuitively and that we couldn't formalize nicely (everything else he talks about in that comment). Saying agent foundations is like computation puts the cart before the horse: it assumes agency falls into the category of nicely-formalizable things, and that category doesn't seem to me like a representative subset of the set of things-we-try-to-formalize.

  1. ^

    Disclaimer: as an outsider who is not working on AI safety in any way, shape, or form

  2. ^

    There are many of them, they use different writing styles and bring attention to different kinds of evidence or reasoning, etc.

  3. ^

    Spoiler alert: it's hard to communicate why agent foundations makes sense... because agent foundations doesn't make sense

  4. ^

    By which I mean, people literally not in math academia

  5. ^

    In some hard-to-define aesthetic sense

  6. ^

    But that's not even fully correct, frankly. Not to get into the nerdy weeds of this too much, but modern QFT, for instance, requires an inelegant and mathematically dubious cancellation of infinities to allow 

  7. ^

    And yet scientists thought they did, for a long time!

  8. ^

    For now

  9. ^

    It's difficult to believe you'd actually hold this view, since frankly it's really dumb, but I also would have had a difficult time believing you'd say you don't have any experiments to run... and yet you're saying it regardless!

  10. ^

    Since, again, you're not running any experiments to check your theories against them

  11. ^

    As an illustrative example, proving an algorithm can be computed in polynomial time is cool, but maybe the constants involved are so large you can't actually make it work in practice. If all complexity theory did was the former, without also telling me what the domain of applicability of its results is when it comes to what I actually care about, then I'd care about complexity theory a lot less

aysja · 1mo

Empirics reigns, and approaches that ignore it and try to nonetheless accomplish great and difficult science without binding themselves tight to feedback loops almost universally fail.

Many of our most foundational concepts have stemmed from first principles/philosophical/mathematical thinking! Examples here abound: Einstein’s thought experiments about simultaneity and relativity, Szilard’s proposed resolution to Maxwell’s demon, many of Galileo’s concepts (instantaneous velocity, relativity, the equivalence principle), Landauer’s limit, logic (e.g., Aristotle, Frege, Boole), information theory, Schrödinger’s prediction that the hereditary material was an aperiodic crystal, Turing machines, etc. So it seems odd, imo, to portray this track record as near-universal failure of the approach.

But there is a huge selection effect here. You only ever hear about the cool math stuff that becomes useful later on, because that's so interesting; you don't hear about stuff that's left in the dustbin of history.

I agree there are selection effects, although I think this is true of empirical work too: the vast majority of experiments are also left in the dustbin. Which certainly isn’t to say that empirical approaches are doomed by the outside view, or that science is doomed in general, just that using base rates to rule out whole approaches seems misguided to me. Not only because one ought to choose which approach makes sense based on the nature of the problem itself, but also because base rates alone don’t account for the value of the successes. And as far as I can tell, the concepts we’ve gained from this sort of philosophical and mathematical thinking (including but certainly not limited to those above) have accounted for a very large share of the total progress of science to date. Such that even if I restrict myself to the outside view, the expected value here still seems quite motivating to me.

[anonymous] · 1mo

Many of our most foundational concepts have stemmed from first principles/philosophical/mathematical thinking

Conflating "philosophy" and "mathematics" is another instance of the kind of sloppy thinking I'm warning against in my previous comment. 

The former[1] is necessary and useful, if only because making sense of what we observe requires us to sit down and peruse our models of the world and adjust and update them. And also because we get to generate "thought experiments" that give us more data with which to test our theories.[2]

The latter, as a basic categorical matter, is not the same as the former. "Mathematics" has a siren-like seductive quality for those who are mathematically inclined. It comes across, based not just on structure but also on vibes and atmosphere, as giving certainty and rigor and robustness. But that's all entirely unjustified until you know the mathematical model you are employing is actually useful for the problem at hand.

So it seems odd, imo, to portray this track record as near-universal failure of the approach.

Of what approach? 

Of the approach that "it's hard to even think of how experiments would be relevant to what I'm doing," as Alex Altair wrote above? The only reason all those theories you mentioned ultimately succeeded and were refined into something that closely approximates reality is that, after some initial, flawed versions of them were proposed, scientists looked very hard at experiments to verify them, iron out their flaws, and in some cases throw away completely mistaken approaches. Precisely the type of feedback loop that's necessary to do science.

This approach, that the post talks about, has indeed failed universally.

I agree there are selection effects, although I think this is true of empirical work too: the vast majority of experiments are also left in the dustbin.

Yes, the vast majority of theories and results are left in the dustbin after our predictions make contact and are contrasted with our observations. Precisely my point. That's the system working as intended.

Which certainly isn’t to say that empirical approaches are doomed by the outside view

... what? What does this have to do with anything that came before it? The fact that approaches are ruled out is a benefit, not a flaw, of empirics. It's a feature, not a bug. It's precisely what makes it work. Why would this ever say anything negative about empirical approaches?

By contrast, if "it's hard to even think of how experiments would be relevant to what I'm doing," you have precisely zero means of ever determining that your theories are inappropriate for the question at hand. For you can keep working on and living in the separate magisterium of mathematics, rigorously proving lemmas and theorems and results with the iron certainty of mathematical proof, all without binding yourself to what matters most.

Not only because one ought to choose which approach makes sense based on the nature of the problem itself

Taking this into account makes agent foundations look worse, not better. 

As I've written about before, the fundamental models and patterns of thought embedded in these frameworks were developed significantly prior to Deep Learning and LLM-type models taking over. "A bunch of models that seem both woefully underpowered for the Wicked Problems they must solve and also destined to underfit their target, for they (currently) all exist and supposedly apply independently of the particular architecture, algorithms, training data, scaffolding etc., that will result in the first batch of really powerful AIs," as I said in that comment. The bottom line was written down long before it was appropriate to do so.

but also because base rates alone don’t account for the value of the successes

And if I look at what agent foundations-type researchers are concluding on the basis of their purely theoretical mathematical vibing, I see precisely the types of misunderstandings, flaws, and abject nonsense that you'd expect when someone gets away with not having to match their theories up with empirical observations.[3]

Case in point: John Wentworth claiming he has "put together an agent model which resolved all of [his] own most pressing outstanding confusions about the type-signature of human values," when in fact many users here have explained in detail[4] why his hypotheses are entirely incompatible with reality.[5]

Such that even if I restrict myself to the outside view, the expected value here still seems quite motivating to me.

I don't think I ever claimed restricting to the outside view is the proper thing to do here. I do think I made specific arguments for why it shouldn't feel motivating.

  1. ^

    Which, mind you, we barely understand at a mechanistic/rigorous/"mathematical" level, if at all

  2. ^

    Which is what the vast majority of your examples are about

  3. ^

    And also the kinds of flaws that prevent whatever results are obtained from actually matching up with reality, even if the theorems themselves are mathematically correct

  4. ^

    See also this

  5. ^

    And has that stopped him? Of course not, nor do I expect any further discussion to. Because the conclusions he has reached, although they don't make sense in empirical reality, do make sense inside of the mathematical models he is creating for his Natural Abstractions work. This is reifying the model and elevating it over reality, an even worse epistemic flaw than conflating the two.

    The one time he confessed he had been working on "speedrun[ning] the theory-practice gap" and creating a target product with practical applicability, it failed. Two years prior, he had written "Note that “theory progressing faster than expected, practice slower” is a potential red flag for theory coming decoupled from reality, though in this case the difference from expectations is small enough that I’m not too worried. Yet." But he doesn't seem all that worried now, either.

Alex_Altair · 1mo

By contrast, if "it's hard to even think of how experiments would be relevant to what I'm doing," you have precisely zero means of ever determining that your theories are inappropriate for the question at hand.

Here, you've gotten too hyperbolic about what I said. When I say "experiments", I don't mean "any contact with reality". And when I said "what I'm doing", I didn't mean "anything I will ever do". Some people I talk to seem to think it's weird that I never run PyTorch, and that's the kind of thing where I can't think of how it would be relevant to what I'm currently doing.

When trying to formulate conjectures, I am constantly fretting about whether various assumptions match reality well enough. And when I do have a theory that is at the point where it's making strong claims, I will start to work out concrete ways to apply it.

But I don't even have one yet, so there's not really anything to check. I'm not sure how long people are expecting this to take, and this difference in expectation might be one of the implicit things driving the confusion. As many theorems as there are that end up in the dustbin, there is even more pre-theorem work that ends up there. I've been at this for three and change years, and I would not be surprised if it takes a few more years. But the entire point is to apply it, so I can certainly imagine conditions under which we end up finding out whether the theory applies to reality.

Alex_Altair · 1mo

Which is what I hope agent foundations is about. Something useful for AI safety. Something useful in practice. If it's a cute branch of essentially-math that doesn't necessarily concern itself with saving the world from AI doom, why should anyone give you any money or status in the AI safety community?

Separating this response out for visibility -- it is unequivocally, 100% my goal to reduce AI x-risk. The entire purpose of my research is to eventually apply it in practice.

[anonymous] · 1mo

I believe you, and I want to clarify that I did not (and do not) mean to imply otherwise. I also don't mean to imply you shouldn't get money or status; quite the opposite. 

It's just the post itself[1] that doesn't make the whole "agent foundations is actually for solving AI x-risk" thing click for me.

  1. ^

    And other posts on LW trying to explain this

Richard_Ngo · 2d

Note that I've changed my position dramatically over the last few years, and now basically endorse something very close to what I was calling "rationality realism" (though I'd need to spend some time rereading the post to figure out exactly how close my current position is).

In particular, I think that we should be treating sociology, ethics and various related domains much more like we treat physics.

I also endorse this quote from a comment above, except that I wouldn't call it "thinking studies" but maybe something more like "the study of intelligent agency" (and would add game theory as a central example):

there is a rich field of thinking-studies. it’s like philosophy, math, or engineering. it includes eg Chomsky's work on syntax, Turing’s work on computation, Gödel’s work on logic, Wittgenstein’s work on language, Darwin's work on evolution, Hegel’s work on development, Pascal’s work on probability, and very many more past things and very many more still mostly hard-to-imagine future things

Nate Showell · 1mo

For me, the OP brought to mind another kind of "not really math, not really science": string theory. My criticisms of agent foundations research are analogous to Sabine Hossenfelder's criticisms of string theory, in that string theory and agent foundations both screen themselves off from the possibility of experimental testing in their choice of subject matter: the Planck scale and very early universe for the former, and idealized superintelligent systems for the latter. For both, real-world counterparts (known elementary particles and fundamental forces; humans and existing AI systems) of the objects they study are primarily used as targets to which to overfit their theoretical models. They don't make testable predictions about current or near-future systems. Unlike with early computer science, agent foundations doesn't come with an expectation of being able to perform experiments in the future, or even to perform rigorous observational studies.

Alex_Altair · 1mo

Ah, I think this is a straightforward misconception of what agent foundations is. (Or at least, of what my version of agent foundations is.) I am not trying to forge a theory of idealized superintelligent systems. I am trying to forge a theory of "what the heck is up with agency at all??". I am attempting to forge a theory that can make testable predictions about current and near-future systems.

Nate Showell · 1mo

I was describing reasoning about idealized superintelligent systems as the method used in agent foundations research, rather than its goal. In the same way that string theory is trying to figure out "what is up with elementary particles at all," and tries to answer that question by doing not-really-math about extreme energy levels, agent foundations is trying to figure out "what is up with agency at all" by doing not-really-math about extreme intelligence levels.

 

If you've made enough progress in your research that it can make testable predictions about current or near-future systems, I'd like to see them. But the persistent failure of agent foundations research to come up with any such bridge between idealized models and real-world systems has made me doubtful that the former are relevant to the latter.

Cole Wyeth · 1mo

I predicted that LLM ICL would perform reasonably well at predicting the universal distribution without finetuning, and it apparently does:

https://www.alignmentforum.org/posts/xyYss3oCzovibHxAF/llm-in-context-learning-as-approximating-solomonoff

Would love to see a follow up experiment on this.

I haven’t looked into it yet but apparently Peter Bloem showed that pretraining on a Solomonoff-like task also improves performance on text prediction: https://arxiv.org/abs/2506.20057

Taken together, this seems like some empirical evidence for LLM ICL as approximating Solomonoff induction, which is a frame I’ve been using that is clearly motivated by a type of “agent foundations” or at least “learning foundations” intuition. Of course it’s very loose. I’m working on a better example.


(Incidentally, I would probably be considered to be in math academia)
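For readers who want a concrete picture of this kind of test, here is a rough, self-contained sketch (a toy of mine, not the setup from the linked post or paper): sample short random programs so that shorter programs are more likely, treat their outputs as crude draws from a universal-distribution-like prior, and score a predictor on next-symbol accuracy. The `predict_next` function is a placeholder for whatever you actually want to evaluate, e.g. an LLM queried in-context.

```python
# Rough sketch: generate sequences from short random programs (shorter = more
# likely, loosely echoing the 2^-length prior) and score a predictor on them.
# `predict_next` is a placeholder for the model under test.
import random

OPS = [
    lambda x, n: (x + 1) % 10,   # counter
    lambda x, n: (x * 2) % 10,   # doubling mod 10
    lambda x, n: (x + n) % 10,   # position-dependent step
    lambda x, n: x,              # constant
]

def sample_program(rng):
    length = 1 + min(rng.getrandbits(2), rng.getrandbits(2))  # skewed short
    return [rng.choice(OPS) for _ in range(length)]

def run_program(prog, seed, steps):
    x, out = seed, []
    for n in range(steps):
        for op in prog:
            x = op(x, n)
        out.append(x)
    return out

def predict_next(prefix):
    return prefix[-1] if prefix else 0  # naive baseline: repeat last symbol

rng = random.Random(0)
correct = total = 0
for _ in range(200):
    seq = run_program(sample_program(rng), rng.randrange(10), steps=12)
    for t in range(4, 11):
        correct += predict_next(seq[:t]) == seq[t]
        total += 1
print(f"next-symbol accuracy: {correct / total:.2f}")
```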

Alex_Altair · 1mo

...I also do not use "reasoning about idealized superintelligent systems as the method" of my agent foundations research. Certainly there are examples of this in agent foundations, but it is not the majority. It is not the majority of what Garrabrant or Demski or Ngo or Wentworth or Turner do, as far as I know.

It sounds to me like you're not really familiar with the breadth of agent foundations. Which is perfectly fine, because it's not a cohesive field yet, nor is the existing work easily understandable. But I think you should aim for your statements to be more calibrated.

Noosphere89 · 1mo

Notably, in the case of string theory, the fact that it predicts everything we currently observe, plus new forces at the Planck scale, currently puts it ahead of all other theories of physics: the alternatives either predict something we have reason to think we don't observe, or limit themselves to a subset of the predictions that other theories already make. So the fact that string theory can predict everything we observe, and also makes some (admittedly difficult to falsify) further predictions, is enough to make it a leading theory.

No comment on whether the same applies to agent foundations.

xpym · 1mo

in the case of string theory, the fact that it predicts

Hmm, my outsider impression is that there's in fact a myriad "string theories", all of them predicting everything we observe, but with no way to experimentally discern the correct one among them for the foreseeable future, which I have understood to be the main criticism. Is this broad-strokes picture fundamentally mistaken?

Mitchell_Porter · 1mo

There are a large number of "string vacua" which contain particles and interactions with the quantum numbers and symmetries we call the standard model, but (1) they typically contain a lot of other stuff that we haven't seen (2) the real test is whether the constants (e.g. masses and couplings) are the same as observed, and these are hard to calculate (but it's improving). 

gjm · 1mo

When Alan Turing figured out computability theory, he was not doing pure math for math's sake; he was trying to grok the nature of computation so that we could actually build better computers.

Are you sure this is true? I am not a Turing expert but my impression is that his practical work on building computing machinery was downstream of his theoretical work on the nature of computation, not vice versa.

dsj · 1mo

My memory from reading Andrew Hodges’ authoritative biography of Turing is that his theory was designed as a tool to solve the Entscheidungsproblem, which was a pure mathematical problem posed by Hilbert. It just happened to be a convenient formalism for others later on. GPT-5 agrees with me.

Alex_Altair · 1mo

I'm not very confident about this, but it's my current impression. Happy to have had it flagged!

Mateusz Bagiński · 1mo

Thanks for writing this up! I strong-upvoted because, as you say, these ideas are not well-communicated, and this post contributes an explanation that I expect to be clarifying to a significant subset of people confused about agent foundations.

Initially, I wasn't quite buying the claim that we don't need any experiments (or more generally, additional empirics) to understand agency and all we need is to "just crunch" math and philosophy. The image I had in mind was something like "This theorem proves something non-trivial — or even significantly surprising — about a class of agents that includes humans, and we are in a position to verify it experimentally, so we should do it, to ensure that we're not fooling ourselves.".

Then, this passage made it click for me and I saw the possibility that maybe we are in a position where armchairs, whiteboards, tons of paper, and Lean code are sufficient.

It's noteworthy that humanity did indeed deliberately invent the first Turing-complete programming languages before building Turing-complete computers, and we have also figured out a lot of the theory of quantum computing before building actual quantum computers.

When Alan Turing figured out computability theory, he was not doing pure math for math's sake; he was trying to grok the nature of computation so that we could actually build better computers. And he was not doing typical science, either. He obviously had considerable experience with computers, but I seriously doubt that, for example, work on his 1936 paper involved running into issues which were resolved by doing experiments. I would say agent foundations researchers have similarly considerable experience with agents.

(Another example/[proof of concept]/[existence proof of the reference class] is Einstein's Arrogance.)

However, the reference class that includes the theory of computation is one possible reference class that might include the theory of agents.[1] But for all (I think) we know, the reference class we are in might also be (or look more like) complex systems studies, where you can prove a bunch of neat things, but there's also a lot of behavior that is not computationally reducible and instead you need to observe, simulate, crunch the numbers. Moreover, noticing surprising real-world phenomena can serve as a guide to your attempts to explain the observed phenomena in ~mathematical terms (e.g., how West et al. explained (or re-derived) Kleiber's law from the properties of intra-organismal resource supply networks[2]).

I don't know what the theory will look like; to me, its shape remains an open a posteriori question.

  1. ^

    Or whatever theory we need in order to understand agents, since the theory that we need to understand agents need not itself be a theory of agents (but maybe something broader, like IDK adaptivity or powerful optimization processes, or maybe there's a new ontology that cuts across our intuitive notion of agency and kinda dissolves it for the purpose of joint-carving understanding).

  2. ^

    The explanation of their proof that I was able to understand is the one in this textbook.

Kaarel · 1mo

However, the reference class that includes the theory of computation is one possible reference class that might include the theory of agents.[1] But for all (I think) we know, the reference class we are in might also be (or look more like) complex systems studies, where you can prove a bunch of neat things, but there's also a lot of behavior that is not computationally reducible and instead you need to observe, simulate, crunch the numbers. Moreover, noticing surprising real-world phenomena can serve as a guide to your attempts to explain the observed phenomena in ~mathematical terms (e.g., how West et al. explained (or re-derived) Kleiber's law from the properties of intra-organismal resource supply networks[2]). I don't know what the theory will look like; to me, its shape remains an open a posteriori question.

along an axis somewhat different than the main focus here, i think the right picture is: there is a rich field of thinking-studies. it’s like philosophy, math, or engineering. it includes eg Chomsky's work on syntax, Turing’s work on computation, Gödel’s work on logic, Wittgenstein’s work on language, Darwin's work on evolution, Hegel’s work on development, Pascal’s work on probability, and very many more past things and very many more still mostly hard-to-imagine future things. given this, i think asking about the character of a “theory of agents” would already soft-assume a wrong answer. i discuss this here

i guess a vibe i'm trying to communicate is: we already have thinking-studies in front of us, and so we can look at it and get a sense of what it's like. of course, thinking-studies will develop in the future, but its development isn't going to look like some sort of mysterious new final theory/science being created (though there will be methodological development (like for example the development of set-theoretic foundations in mathematics, or like the adoption of statistics in medical science), and many new crazy branches will be developed (of various characters), and we will surely ≈resolve various particular questions in various ways (though various other questions call for infinite investigations))

the gears to ascension · 1mo

I still have an intuition that there could be interesting experiments that relate somehow to smooth cellular automata, but I've repeatedly tried to nail down why I think that and come up with confusing non-answers. I feel like basic science on cellular automata might make agent foundations concepts show up as shadows in the image produced by blind surveying of phenomena, and my intuition likes smooth because reality is smooth[citation needed] and so maybe reality-phenomena show up in the smooth ones more. If I was more productive I'd probably have done some throwaway experiments on that by now.

Daniel C · 1mo

With normal science, there's a phenomenon that we observe, and what we want is to figure out the underlying laws. With AI systems, it's more accurate to say that we know the underlying laws (such as the mathematics of computation, and the "initial conditions" of learning algorithms) and we're trying to figure out what phenomena will occur (e.g. what fraction of them will undergo instrumental convergence).

 

I’d say part of agent foundations is the reverse: We know what phenomena will probably occur (extreme optimization by powerful agent) and what phenomena we want to cause (alignment). And we’re trying to understand the underlying laws that could cause those phenomena (algorithms behind general intelligence that have not been invented yet) so that we can steer them towards the outcomes we want.

Donald Hobson · 1mo

One thing that's kind of in the powerful non-fooming corrigible AI bucket is a lot of good approximations to the higher complexity classes. 

There is a sense in which, if you had an incredibly fast 3-SAT algorithm, you could use it with a formal proof checker to prove arbitrary mathematical statements. You could use your fast 3-SAT solver plus a fluid dynamics simulator to design efficient aerofoils. There are a lot of interesting search and optimization and simulation things that you could do trivially if you had infinite compute.

There is a sense in which an empty Python terminal is already a corrigible AI. It does whatever you tell it to; you just have to tell it in Python. This feels like it's missing something. But when you try to say what is missing, the line between a neat programming-language feature and a corrigible AI seems somehow blurry.
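A rough, self-contained sketch of the reduction being gestured at here, with everything a toy stand-in: if a (magically fast) 3-SAT solver can decide questions of the form "does the proof checker accept some proof of length at most n that starts with these bits?", then an ordinary search-to-decision loop recovers an actual proof bit by bit. Below, the "checker" is a dummy and the brute-force "oracle" stands in for the hypothetical fast solver plus SAT encoding.

```python
# Toy search-to-decision reduction: a decision oracle for "an accepted proof
# with this prefix exists" is enough to reconstruct a full proof bit by bit.
from itertools import product

def checker_accepts(statement, proof_bits):
    # Dummy proof checker: a "proof" of `statement` is the 8-bit string whose
    # integer value equals statement mod 256.
    return len(proof_bits) == 8 and int("".join(map(str, proof_bits)), 2) == statement % 256

def provable_with_prefix(statement, prefix, max_len):
    # Brute-force stand-in for the hypothetical fast 3-SAT oracle.
    for n in range(len(prefix), max_len + 1):
        for tail in product((0, 1), repeat=n - len(prefix)):
            if checker_accepts(statement, prefix + list(tail)):
                return True
    return False

def find_proof(statement, max_len):
    if not provable_with_prefix(statement, [], max_len):
        return None
    prefix = []
    while not checker_accepts(statement, prefix):
        for bit in (0, 1):
            if provable_with_prefix(statement, prefix + [bit], max_len):
                prefix.append(bit)
                break
    return prefix

print(find_proof(statement=201, max_len=8))  # -> [1, 1, 0, 0, 1, 0, 0, 1]
```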

Jon Garcia · 1mo

So is Agent Foundations primarily about understanding the nature of agency so we can detect it and/or control it in artificial models, or does it also include the concept of equipping AI with the means of detecting and predictively modeling agency in other systems? Because I strongly suspect the latter will be crucial in solving the alignment problem.

The best definition I have at the moment sees agents as systems that actively maintain their internal state within a bounded range of viability in the face of environmental perturbations (which would apply to all living systems) and that can form internal representations of arbitrary goal states and use those representations to reinforce and adjust their behavior to achieve them. An AGI whose architecture is biased to recognize needs and goals in other systems, not just those matching human-specific heuristics, could be designed to adopt those predicted needs and goals as its own provisional objectives, steering the world toward its continually evolving best estimate of what other agentic systems want the world to be like. I think this would be safer, more robust, and more scalable than trying to define all human preferences up front.

These are just my thoughts. Take from them what you will.

Alex_Altair · 1mo

I am not personally working on "equipping AI with the means of detecting and predictively modeling agency in other systems", but I have heard other people talk about that cluster of ideas. I think it's in-scope for agent foundations.

Agent foundations: not really math, not really science

These ideas are not well-communicated, and I'm hoping readers can help me understand them better in the comments.

The classical model of the scientific process is that its purpose is to find a theory that explains an observed phenomenon. Once you have any model whose outputs match your observations, you have a valid candidate theory. Occam's razor says it should be simple. And if your theory can make correct predictions about observations that hadn't previously been made, then the theory is validated.
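A toy illustration of that loop (my example, not the post's): two candidate models both fit the existing observations, Occam's razor prefers the simpler one, and observations that hadn't previously been made are what vindicate that preference.

```python
# Toy "science loop": fit two candidate theories to old observations, then see
# which one predicts observations that hadn't been made yet.
import numpy as np

rng = np.random.default_rng(0)
x_old = np.linspace(0, 5, 20)
y_old = 2.0 * x_old + 1.0 + rng.normal(0, 0.1, x_old.size)   # the phenomenon

simple = np.polyfit(x_old, y_old, deg=1)     # candidate theory A
baroque = np.polyfit(x_old, y_old, deg=9)    # candidate theory B (also fits)

x_new = np.linspace(6, 8, 10)                # not-yet-made observations
y_new = 2.0 * x_new + 1.0 + rng.normal(0, 0.1, x_new.size)

for name, coeffs in [("degree 1", simple), ("degree 9", baroque)]:
    err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"{name}: squared error on new observations = {err:.3g}")
```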

The classical model of mathematics is that you start with axioms and inference rules, and you derive theorems. There is no requirement that the axioms or theorems need to reflect something in reality to be considered mathematically valid (although they almost always do). Mathematicians have intuitions about what theorems are true before they prove them, and they have opinions about what theorems are important or meaningful, based partly on aesthetics.
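For contrast, a minimal Lean 4 snippet in the spirit of that description: axioms and inference rules go in, a theorem comes out, and at no point is anything required to correspond to an observation.

```lean
-- Derived purely from the axioms and inference rules of the system;
-- no empirical input is involved anywhere.
example : 2 + 2 = 4 := rfl

theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```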

What I[1] am trying to do with agent foundations is not really either of these, and I think this is one reason why many people don't "get" agent foundations. We're trying to understand a phenomenon in the real world (agents), but our methods are almost exclusively mathematical (or arguably philosophical). The nature of the phenomenon is substrate-independent, and so we don't need to interact directly with "reality" to do our work. But we're also not totally sure which substrate-independent thing it is, so we're still working out what mathematical objects are the right ones to be working with.

I do think this makes it a harder type of research. I just also think it's the type of research we have to do to get a good future.

Empirics

This mismatch becomes especially salient when considering its relationship to empiricism. People sometimes ask (understandably!) agent foundations researchers what experiments they plan to do. And sometimes people imply that because the field is not doing experiments, it is probably detached from reality and not useful. I have found these interactions awkward and unsatisfying for both parties, I think because we don't have a shared concept for me to refer to, somewhere between science and math.

From where I'm standing, it's hard to even think of how experiments would be relevant to what I'm doing. It feels like someone asking me why I haven't soldered up a prototype. That's just... not the kind of thing agent foundations is. I can imagine experiments that might sound like they're related to agent foundations, but they would just be checking a box on a bureaucratic form, and not actually generated by me trying to solve the problem.

I spend my time reading math books, pacing around thinking really hard, and trying to formulate and prove theorems. I am regularly accessing my beliefs about how the ideas can eventually be applied to reality, to guide what math I'm thinking about, but at no point have I thought to myself "what I need now is to run an experiment". The closest thing I do is when I search for whether people have already written papers about the ideas I'm developing, or sanity-checking my thoughts by talking to other researchers.

What makes agent foundations different?

One thing that makes agent foundations different from science is that we're trying to understand a phenomenon that hasn't occurred yet (but which we have extremely good reasons for believing will occur). I can't do experiments on powerful agents, because they don't exist. And of course, the whole point here is that they're fatally dangerous by default, so bringing them into existence would not be worth the information gotten from such an "experiment". I also cannot usefully do experiments on existing AI models, because they're not displaying the phenomenon that I'm trying to understand.[2]

With normal science, there's a phenomenon that we observe, and what we want is to figure out the underlying laws. With AI systems, it's more accurate to say that we know the underlying laws (such as the mathematics of computation, and the "initial conditions" of learning algorithms) and we're trying to figure out what phenomena will occur (e.g. what fraction of them will undergo instrumental convergence).

So, I don't think that what we're lacking is data or information about the nature of agents -- we're lacking understanding of the information we already have. The reason I'm not thinking about experiments is because I don't feel any pull toward gaining more information of that type. I'm not confused in a way where looking at something in the territory will resolve my confusion. I believe the answers to my research questions are already contained within what we know, in the same way that the truth-value of conjectures is already contained within the logic, axioms, and definitions.

If we were trying to figure out chemistry and material science, then we absolutely would need tons of information, because our everyday observations are simply insufficient information to pin down the true theory of matter. There are tons of ways that the underlying laws of physics of stuff could be, and you can't simply figure it out by thinking about it.

But I don't think that's true for agents. I'm not saying that I think I could have been born in an armchair and then do nothing but think until one day I eventually understand agents. But I am saying that the decades of my life that I've already lived, combined with intensive interactions with other researchers, are sufficient real-world information for me to have about agents.

It's kinda like "computer science"

For some reason, the field that studies the mathematics of computation ended up being called computer science. This might be non-coincidentally related to what I'm trying to express about agent foundations. Computation is substrate-independent, so after we figured out the definition of computation which usefully captured the phenomenon we wanted to engineer, we no longer had to check with reality about it to make progress on important questions.
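As a small illustration of that substrate-independence (a toy of mine, not anything from the post): a Turing machine is just a transition table plus something willing to follow it, and the "substrate" below is an ordinary Python dictionary standing in for the tape. The particular machine increments a binary number.

```python
# A tiny Turing machine: (state, symbol) -> (new symbol, head move, new state).
# This one increments the binary number written on the tape.
from collections import defaultdict

DELTA = {
    ("seek_end", "0"): ("0", +1, "seek_end"),
    ("seek_end", "1"): ("1", +1, "seek_end"),
    ("seek_end", "_"): ("_", -1, "carry"),
    ("carry", "1"):    ("0", -1, "carry"),   # 1 + 1 = 0, carry continues left
    ("carry", "0"):    ("1", -1, "done"),
    ("carry", "_"):    ("1", -1, "done"),    # new leading 1
}

def run(tape_str, state="seek_end", halt="done"):
    tape = defaultdict(lambda: "_", enumerate(tape_str))
    head = 0
    while state != halt:
        symbol = tape[head]
        new_symbol, move, state = DELTA[(state, symbol)]
        tape[head] = new_symbol
        head += move
    return "".join(tape[i] for i in range(min(tape), max(tape) + 1)).strip("_")

print(run("1011"))  # 11 in binary -> "1100", i.e. 12
```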

I don't think that Archimedes could have figured out basically any results of computability theory. This is despite the fact that, in "theory", one could figure that all out by thinking. (He even had humans as examples of general-purpose computers.) But that's not really sufficient. One needs to have some kind of life experiences that point your mind toward the relevant concepts, like "computing machines" at all. But once someone has those, they don't necessarily need more information from experiments to figure out a bunch of computability theory. I think if people in Charles Babbage's era had decided that we needed to grok the nature of computation-in-general in order to save the world, then they could have done it, and done so without figuring out transistors or magnetic memory or whatever. It's noteworthy that humanity did indeed deliberately invent the first Turing-complete programming languages before building Turing-complete computers, and we have also figured out a lot of the theory of quantum computing before building actual quantum computers.

When Alan Turing figured out computability theory, he was not doing pure math for math's sake; he was trying to grok the nature of computation so that we could actually build better computers. And he was not doing typical science, either. He obviously had considerable experience with computers, but I seriously doubt that, for example, work on his 1936 paper involved running into issues which were resolved by doing experiments. I would say agent foundations researchers have similarly considerable experience with agents.

We need a lot of help

I'm pretty sure that there's a nature-carving concept of non-fooming powerful optimizers, and corrigible agents, and other things, and that if we figure them out, we can navigate the future more safely. And I'm pretty sure it doesn't make sense for me to do experiments to figure it out. Instead I have to learn or invent enough math to have the right concepts, and then prove theorems about them, and that will help enable us to build said safe optimizers.

Cosmologists were able to construct an unimaginably precise and deep theory of the origin of the universe, despite never being able to perform interventional experiments. Nuclear physicists were able to get the first nuclear reactions and detonations (mostly) right on the first try.

Maybe if we can get as many agent foundations researchers as there were nuclear physicists or cosmologists, we can collectively discover enough understanding about the nature of agents to navigate to a good future.

  1. ^

    I say "I" here only because I don't want to put words in the mouths of other agent foundations researchers. My sense is that what I'm saying here is true for the whole field, but other researchers should feel free to chime in.

  2. ^

    Other sub-fields of AI safety can usefully do experiments on existing models, because they're asking different questions (like "how can we interpret existing models?" and "in what ways are existing models dangerous?"). This research is much more like a standard science, and that's great! AI safety needs a million different people doing a million different jobs. I think agent foundations is one of those jobs.