This post is a Cunningham's law draft, less than 50% finished, in some parts mere notes. Consider a) waiting until this notice has disappeared to read a more coherent post, or b) criticizing it with a focus on what would be right, not just what is wrong. I haven't firmly made up my mind yet, so at this stage I'm particularly interested in fundamental criticisms of the goal and framing (but of course I also welcome minor corrections).
TL;DR: The goals of Agent Foundations seem so far-fetched that all the progress made in the field doesn't seem to have decreased our distance to them that much. One might conclude that the goal is unachievable, the question ill-posed. But even the available rejections of AF seem unprincipled: instead of proving that the task is impossible,[1] we simply fail, and do something else instead. Working toward a principled refutation of Agent Foundations might either indeed refute AF or point to unexplored directions, and both outcomes would be helpful information.
Prelude: Kant's refutation of all arguments for God's existence
Kant argued that there are three (and only three) related concepts of God, definable on three different levels:
Defined independently of any universe: God as set of all logical predicates.
Defined in connection to the existence of a universe, without assuming any of its properties: God as cause of the universe.
Defined in connection to our universe: God as cause of the apparent purposefulness of the universe.
He argued furthermore that none of the three definitions has enough "meat" to allow for a valid proof of God's existence, and concluded that everyone should stop wasting their time trying to prove it.
I'm not interested in the specifics of Kant's argument; only in the structure:
1. We are dealing with concepts that are to some extent[2] definable without reference to the specific facts of our universe
2. There are different levels of "abstraction from the facts of our universe" in which different definitions can be made
3. These different formulations potentially have connections to each other (e.g. one might be a sub-specification of another)
4. An attempt can be made at showing that all levels and all possible definitions have been listed
5. An attempt can be made at showing, for each level and definition, that there isn't enough for the kind of proof we are looking for
Computation can be defined at the level of what a Turing machine can or cannot produce, at the level of what belongs to different complexity classes, and at the level of what can be computed in our universe.
The concept of computation at the topmost level has enough content to allow e.g. for a proof that Turing-machine-computable and lambda-calculus-computable refer to the same thing.
At the second level, everything computable belongs to some complexity class. But you can still work on the first level and ignore this extra specification.
At the third level, the Cobham-Edmonds thesis claims that only problems of polynomial complexity are tractable in our universe. This is a fact about our universe (other universes with different properties are conceivable), and at the same time a very general fact that abstracts from the specifics of our universe: other universes that also fulfill this property while being very different from ours are conceivable too.
At the bottom-most level, you need a substrate for your computations, and this adds a lot of specifications to what computing means.
Across the different levels, computability is, in some sense, the same object, and in some sense three different objects.[3]
Locating the AF paradigm
Agent Foundations is in search of a paradigm. It's probably worth reflecting on what exactly a paradigm is. My initial approximation is a self-contained network of concepts and proofs. A second worthwhile question in the search for a paradigm is: at which level of abstraction does the paradigm live?[4]
It seems plausible to me that a solution for alignment can be found using several self-contained components, and that not all components are on the same level. But it seems very implausible that we can cobble concepts from different levels in a haphazard fashion. So it might be useful to try to create a taxonomy of levels, to locate current agendas within them,[5] and to see what we are missing.
The guiding question for locating the level at which a solution lives is: How different are the universes in which this alignment strategy would work? Would this strategy work in a universe with different physics but the same math?
Before we explore a tentative taxonomy of levels, let's try to list the concepts that we are trying to understand.
Concepts
There are two basic aspects to what we are trying to capture under the name of agency: an agent knows the universe and an agent acts in the universe. I will separate the two aspects and, for lack of better terms, speak of inductors and interveners.[6]
A natural question to ask is: Is every inductor an intervener, and vice versa?
The intervener's interventions can be conceptualized with the concepts of coherence,[7] values, pessimization (?), etc., which might be definable in different levels.
I don't know if there are some equivalent concepts for the inductor side.
We also have concepts like alignment and control, which seem to be definable on different levels.
Definitional levels
Dualistic Mathland
I will call a definition of a connection dualistic if it does not specify the things it connects.[8]
The introductory definition of a set as an unordered list of elements is dualistic: it allows for operations like union, intersection, etc., without having to specify what the elements are. ZFC, on the other hand, is not dualistic: all ZFC-sets are ultimately definable from the empty set, and a set which isn't thus definable isn't allowed.
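One way to picture this reading of dualistic is a generic function whose code never looks inside its elements. The sketch below is only an illustration, not part of the original argument; the function names `union` and `intersection` and the use of Python's `frozenset` are assumptions made for the example (and `frozenset` does hash its elements, which is an implementation detail rather than part of the definition).

```python
from typing import FrozenSet, TypeVar

T = TypeVar("T")  # stands for "whatever the elements are"; never inspected


def union(a: FrozenSet[T], b: FrozenSet[T]) -> FrozenSet[T]:
    """Elements that are in a or in b."""
    return a | b


def intersection(a: FrozenSet[T], b: FrozenSet[T]) -> FrozenSet[T]:
    """Elements that are in both a and b."""
    return a & b


# The same definitions apply to numbers and to strings alike.
numbers = union(frozenset({1, 2}), frozenset({2, 3}))
words = intersection(frozenset({"agent", "world"}), frozenset({"world"}))
```

The definitions of the operations are fixed before anything is said about what `T` is, which is the sense in which the introductory notion of a set leaves its elements unspecified.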
The concept of evolution belongs here, as does non-Many Worlds Quantum Mechanics.
More relevant to our purposes, Bayesian updating also belongs here, as do Bayesian Neural Networks.
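To make the claim about Bayesian updating concrete, here is a minimal sketch of the update rule written so that hypotheses are opaque labels. It is an illustration under my own assumptions, not a method from the post: the function name `bayes_update`, the coin example, and its numbers are all hypothetical.

```python
from typing import Callable, Dict, Hashable, TypeVar

H = TypeVar("H", bound=Hashable)  # hypothesis labels; their content is never inspected


def bayes_update(prior: Dict[H, float], likelihood: Callable[[H], float]) -> Dict[H, float]:
    """Return the posterior P(h | e) from a prior P(h) and likelihoods P(e | h)."""
    unnormalized = {h: p * likelihood(h) for h, p in prior.items()}
    evidence = sum(unnormalized.values())  # P(e), the normalizing constant
    if evidence == 0:
        raise ValueError("the observation has probability zero under the prior")
    return {h: w / evidence for h, w in unnormalized.items()}


# Hypothetical example: two opaque hypotheses about a coin, after observing heads.
prior = {"fair": 0.5, "biased": 0.5}
p_heads = {"fair": 0.5, "biased": 0.9}
posterior = bayes_update(prior, lambda h: p_heads[h])
```

Nothing in `bayes_update` says what a hypothesis or an observation is, nor where the updater itself sits relative to the world it reasons about; the rule only connects them.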
How does AF look inside Dualistic Mathland? Some results and questions:
Control and alignment seem to be understandable questions in this level (i.e. how to get an AIXI that does the things that we want?) but not answerable at this level.
The most abstract definitions of Multiple agency are by definition dualistic.
Jump from Dualistic to Computable
The questions that arise in Dualistic Mathland[10] don't have solutions, but in some sense this is irrelevant, because we know that these definitions are unsuitable for our universe: [11] the definitions in Dualistic Mathland are incompatible with things like recursion;[12] their problem isn't being "too abstract"; they are the wrong kind of abstract. This is good news: this is the kind of Kantian refutation that we want. We understand agency inside Dualistic Mathland well enough to know that we need to keep searching outside of it, in a similar way to how Gödel proved that understanding arithmetic well enough means leaving forever the naive idea of arithmetic without recursion problems.[13]
An optimistic scenario would be that something similar happens in the other levels.
Computable Mathland
(See upcoming post Nitpicking on Embeddedness)
Macro-Empirical
Dualistic and Computable Mathland are memorable names trying to facilitate discussion about things which are presumably known to everyone. Now I tentatively propose some levels about which it might be worthwhile to reflect. The first of them is what I'll call the macro-empirical level. By this I mean working at the level of very fundamental empirical descriptions of the universe, which are general enough that researching in them resembles Dualistic/Computable Mathland, but with the advantage of being automatically relevant for our universe.[14]
A similar thing seems to apply to the attempts to define agency from Friston's Free Energy Principle.[15]
Platonico-Empirical
Several results (manifold hypothesis, platonic space hypothesis, simplicity bias) point to the universe being a model of a mathematical object, in ways beyond the trivial way that underlies any instance of empirical science. I'll define the platonico-empirical level as the level at which one can attempt to instrumentalize these facts. Michael Levin is probably the best representative of what this could look like, with his references to non-metaphorical agency inside what he calls the Platonic space, with which we can attempt to interact.[16]
Human-Empirical
Research around concepts like CEV aims to reach very general conclusions, but starts from the empirical fact of how humans are. I'm unsure of whether this is regular empirical research coupled with speculation, or indeed an additional level in which, beyond CEV, other concepts could be tried.
Agent Foundations isn't a badge of honor; "not belonging to AF" might simply mean "being a well-posed question with a solution". Thus it is without value judgment or particular surprise that all agendas which are simply doing normal science are not part of Agent Foundations.
The only interesting point is clarifying that empirical and non-prosaic aren't synonymous. For instance, I think of Byrnes's agenda as trying to abstract the mechanism by which a human intervener isn't properly modeled by a utility function, but instead by what the human imagines their peers thinking of them. It might be that this concept is abstractable, and if so, it could guide research on different substrates (LLMs and some architecture that we haven't discovered yet, for instance). To that extent, it is doing something beyond looking at current LLMs and trying to understand them, i.e. it's non-prosaic alignment research. But Byrnes isn't trying to locate that particular mechanism, which must sit somewhere in Mathland, through exploration of Mathland. Byrnes is trying to locate it through exploration of our universe (more specifically, of human brains and their effects), i.e. doing brain science. And understandably so, since there is no reason to expect that mechanism to be salient in Mathland.
Solving Philosophy and/or Math
The previous levels are all more or less agnostic wrt "solving philosophy", i.e. one could for instance work on the Platonic Space without asking the question of what exactly that means.
But it is plausible that the confusion regarding this situation acts as a blocker in AF, so that working on clearing this confusion could itself be one way of working on AF.[17]
A major source of confusion is that humans are embedded, non-dualistic parts of the universe, and yet seem to have what Nagel called the "view from nowhere", able to find out truths that are also valid outside the universe.
The hierarchy between solving Math[18] and solving Philosophy seems unclear. Solving philosophy seems to gesture at something like formalizing the structure of the world of concepts. In some sense this is prior to math (because concepts is a much more general category than mathematical concepts), but the structure might be mathematical. It's possible that at the bottom we don't find a foundation, but two things that refer to each other.
There's no such thing [as computer science]. [...] At one end you have people who are really mathematicians. [...] In the middle you have people working on something like the natural history of computers-- studying the behavior of algorithms for routing data through networks, for example. And then at the other extreme you have the hackers, who are trying to write interesting software, and for whom computers are just a medium of expression, as concrete is for architects or paint for painters.
This question seems central to distinguishing the specific things that have been tried from the general strategies that have been tried and which could be fulfilled with other specific things. In particular, it seems to me there is no vocabulary to distinguish between the MIRI (and MIRI-inspired) research, and the paradigm such research points towards. This and this are great introductions to the second question, but they are busy rejecting criticism of the field rather than, as I intend here, creating the unified vocabulary to sketch a map of all the things that are and could be part of AF, even if those things contradict each other in the specifics.
Each concept of an agenda could in theory be on a different level, but if the agenda is coherent, we should expect to find the whole of it nested together, unless the agenda consists of several separable coherent sub-components.
a monograph untangling this coherence mess some more would be valuable. it could do the following things:
specifying a bunch of a priori different properties that could be called “coherence”
discussing which ones are equivalent, which ones are correlated, which ones seem pretty independent
giving good names to the notions or notion-clusters
discussing which kinds of coherence generically increase/decrease with capabilities, which ones probably increase/decrease with capabilities in practice, which ones can both increase or decrease with capabilities depending on the development/learning process, both around human level and later/eventually, in human-like minds and more generally[2]
discussing how this relates to AI x risk. like, which kinds of coherence should play a role in a case for AI x risk? what does that look like? or maybe the picture should make one optimistic about some approach to de-AGI-x-risk-ing? or about AGI in general?[3]
If you can't state a program that solves the problem in principle, you are in some sense confused about the nature of the cognitive work needed to solve the problem.
This is true if the problem has already been formulated at a level lower than the one that allows unbounded analysis, but not if the problem is formulated at that level itself, or is vague enough to be formulable at several levels. "Not being able to solve alignment (as defined in Dualistic Mathland) through unbounded analysis" has as little relevance as "not being able to write an algorithm that solves all possible variations of chess". The problem is, barely, well-defined enough to be a question, but not well-defined enough for the lack of an answer to be relevant.
And, synonymously with that advantage, the disadvantage that if it turns out that the empirical description was wrong, they might lose some or all of their relevance.
Cut up your Great Thingy into smaller independent ideas, and treat them as independent.
For instance a marxist would cut up Marx's Great Thingy into [several theories]. Then each of them should be assessed independently, and the truth or falsity of one should not halo on the others. If we can do that, we should be safe from the spiral, as each theory is too narrow to start a spiral on its own.
Same thing for every other Great Thingy out there.
Claim 6 seems particularly relevant to research, because it might point to a more general answer to the question of whether every inductor is an intervener.
One very speculative way in which this could work out: Kant sketched an argument of how every free will should act super-rationally towards other free wills. Unfortunately the concept of free will doesn't seem to be compatible with our deterministic universe. But what if we could convince an ASI that it is a free will in the Platonic Space, and we could do something like proving meta-ethical theorems that the ASI would be (legitimately) convinced it should obey? Relatedly, it is interesting to note that Kant's defense of free will is much closer to the Block Universe than to Newtonian mechanics.
Prelude: Kant's refutation of all arguments for God's existence
Kant argued that there are three (and only three) related concepts of God, definable on three different levels:
He argued furthermore that none of the three definitions has enough "meat" to allow for a valid proof of God's existence, and concluded that everyone should stop wasting their time trying to prove it.
I'm not interested in the specifics of Kant's argument; only in the structure:
1. We are dealing with concepts that are to some extent[2] definable without reference to the specific facts of our universe
2. There are different levels of "abstraction from the facts of our universe" in which different definitions can be made
3. These different formulations potentially have connections to each other (e.g. one might be a sub-specification of another)
4. An attempt can be made at showing that all levels and all possible definitions have been listed
5. An attempt can be made at showing, for each level and definition, that there isn't enough for the kind of proof we are looking for
Computability theory as comparison
Consider:
Computation can be defined at the level of what a Turing machine can or cannot produce, at the level of what belongs to different complexity classes, and at the level of what can be computed in our universe.
The concept of computation at the topmost level has enough content to allow e.g. for a proof that Turing-machine-computable and lambda-calculus-computable refer to the same thing.
At the second level, everything computable belongs to some complexity class. But you can still work on the first level and ignore this extra specification.
At the third level, the Cobham-Edmonds thesis claims that only problems of polynomial complexity are tractable in our universe. This is a fact about our universe (other universes with different properties are conceivable), and at the same time a very general fact that abstracts from the specifics of our universe: other universes that also fulfill this property while being very different from ours are conceivable too.
At the bottom-most level, you need a substrate for your computations, and this adds a lot of specifications to what computing means.
Across the different levels, computability is, in some sense, the same object, and in some sense three different objects.[3]
Locating the AF paradigm
Agent Foundations is in search of a paradigm. It's probably worth reflecting on what exactly a paradigm is. My initial approximation is a self-contained network of concepts and proofs. A second worthwhile question in the search for a paradigm is: at which level of abstraction does the paradigm live?[4]
It seems plausible to me that a solution for alignment can be found using several self-contained components, and that not all components are on the same level. But it seems very implausible that we can cobble concepts from different levels in a haphazard fashion. So it might be useful to try to create a taxonomy of levels, to locate current agendas within them,[5] and to see what we are missing.
The guiding question for locating the level at which a solution lives is: How different are the universes in which this alignment strategy would work? Would this strategy work in a universe with different physics but the same math?
Before we explore a tentative taxonomy of levels, let's try to list the concepts that we are trying to understand.
Concepts
There are two basic aspects to what we are trying to capture under the name of agency: an agent knows the universe and an agent acts in the universe. I will separate the two aspects and, for lack of better terms, speak of inductors and interveners.[6]
A natural question to ask is: Is every inductor an intervener, and vice versa?
The intervener's interventions can be conceptualized with the concepts of coherence,[7] values, pessimization (?), etc., which might be definable in different levels.
I don't know if there are some equivalent concepts for the inductor side.
We also have concepts like alignment and control, which seem to be definable on different levels.
Definitional levels
Dualistic Mathland
I will call a definition of a connection dualistic if it does not specify the things it connects.[8]
The introductory definition of a set as an unordered list of elements is dualistic: it allows for operations like union, intersection, etc., without having to specify what the elements are. ZFC, on the other hand, is not dualistic: all ZFC-sets are ultimately definable from the empty set, and a set which isn't thus definable isn't allowed.
The concept of evolution belongs here, as does non-Many Worlds Quantum Mechanics.
More relevant to our purposes, Bayesian updating also belongs here, as do Bayesian Neural Networks.
How does AF look inside Dualistic Mathland? Some results and questions:
Jump from Dualistic to Computable
The questions that arise in Dualistic Mathland[10] don't have solutions, but in some sense this is irrelevant, because we know that these definitions are unsuitable for our universe: [11] the definitions in Dualistic Mathland are incompatible with things like recursion;[12] their problem isn't being "too abstract"; they are the wrong kind of abstract. This is good news: this is the kind of Kantian refutation that we want. We understand agency inside Dualistic Mathland well enough to know that we need to keep searching outside of it, in a similar way to how Gödel proved that understanding arithmetic well enough means leaving forever the naive idea of arithmetic without recursion problems.[13]
An optimistic scenario would be that something similar happens in the other levels.
Computable Mathland
(See upcoming post Nitpicking on Embeddedness)
Macro-Empirical
Dualistic and Computable Mathland are memorable names trying to facilitate discussion about things which are presumably known to everyone. Now I tentatively propose some levels about which it might be worthwhile to reflect. The first of them is what I'll call the macro-empirical level. By this I mean working at the level of very fundamental empirical descriptions of the universe, which are general enough that researching in them resembles Dualistic/Computable Mathland, but with the advantage of being automatically relevant for our universe.[14]
The Natural Abstractions and Condensations agendas seem to me to fit here.
A similar thing seems to apply to the attempts to define agency from Friston's Free Energy Principle.[15]
Platonico-Empirical
Several results (manifold hypothesis, platonic space hypothesis, simplicity bias) point to the universe being a model of a mathematical object, in ways beyond the trivial way that underlies any instance of empirical science. I'll define the platonico-empirical level as the level at which one can attempt to instrumentalize these facts. Michael Levin is probably the best representative of what this could look like, with his references to non-metaphorical agency inside what he calls the Platonic space, with which we can attempt to interact.[16]
Human-Empirical
Research around concepts like CEV aims to reach very general conclusions, but starts from the empirical fact of how humans are. I'm unsure of whether this is regular empirical research coupled with speculation, or indeed an additional level in which, beyond CEV, other concepts could be tried.
Schelling Goodness also possibly belongs here.
Normal-Empirical=Scientific
Agent Foundations isn't a badge of honor; "not belonging to AF" might simply mean "being a well-posed question with a solution". Thus it is without value judgment or particular surprise that all agendas which are simply doing normal science are not part of Agent Foundations.
The only interesting point is clarifying that empirical and non-prosaic aren't synonymous. For instance, I think of Byrnes's agenda as trying to abstract the mechanism by which a human intervener isn't properly modeled by a utility function, but instead by what the human imagines their peers thinking of them. It might be that this concept is abstractable, and if so, it could guide research on different substrates (LLMs and some architecture that we haven't discovered yet, for instance). To that extent, it is doing something beyond looking at current LLMs and trying to understand them, i.e. it's non-prosaic alignment research. But Byrnes isn't trying to locate that particular mechanism, which must sit somewhere in Mathland, through exploration of Mathland. Byrnes is trying to locate it through exploration of our universe (more specifically, of human brains and their effects), i.e. doing brain science. And understandably so, since there is no reason to expect that mechanism to be salient in Mathland.
Solving Philosophy and/or Math
The previous levels are all more or less agnostic wrt "solving philosophy", i.e. one could for instance work on the Platonic Space without asking the question of what exactly that means.
But it is plausible that the confusion regarding this situation acts as a blocker in AF, so that working on clearing this confusion could itself be one way of working on AF.[17]
A major source of confusion is that humans are embedded, non-dualistic parts of the universe, and yet seem to have what Nagel called the "view from nowhere", able to find out truths that are also valid outside the universe.
The hierarchy between solving Math[18] and solving Philosophy seems unclear. Solving philosophy seems to gesture at something like formalizing the structure of the world of concepts. In some sense this is prior to math (because concepts is a much more general category than mathematical concepts), but the structure might be mathematical. It's possible that at the bottom we don't find a foundation, but two things that refer to each other.
Or at the very least having a very convincing argument of why it's very unlikely to be possible.
And the crux is precisely to what extent?
Hackers and Painters
Paul Graham
This question seems central to distinguishing the specific things that have been tried from the general strategies that have been tried and which could be fulfilled with other specific things. In particular, it seems to me there is no vocabulary to distinguish between the MIRI (and MIRI-inspired) research, and the paradigm such research points towards. This and this are great introductions to the second question, but they are busy rejecting criticism of the field rather than, as I intend here, creating the unified vocabulary to sketch a map of all the things that are and could be part of AF, even if those things contradict each other in the specifics.
Each concept of an agenda could in theory be on a different level, but if the agenda is coherent, we should expect to find the whole of it nested together, unless the agenda consists of several separable coherent sub-components.
From now on I will avoid using the misleading terms agent or agency.
I take the idea of separating the different definitional levels of coherence from this exchange:
Mateusz Bagiński:
Kaarel:
I claim this definition has the same spirit, but is more accurate, than the usual definitions. See upcoming post Nitpicking about Embeddedness.
I.e. is there something like Q-learning in Dualistic Mathland?
Like "how to align an AIXI inductor-intervener?"
With the possible exception of the ones regarding Multiple Agency, about which I am even more confused than about everything else here.
This is relevant for Bayesian updates, Decision Theory, and possibly other concepts.
Nitpicking on unbounded analysis, Yudkowsky writes:
This is true if the problem has already been formulated at a level lower than the one that allows unbounded analysis, but not if the problem is formulated at that level itself, or is vague enough to be formulable at several levels. "Not being able to solve alignment (as defined in Dualistic Mathland) through unbounded analysis" has as little relevance as "not being able to write an algorithm that solves all possible variations of chess". The problem is, barely, well-defined enough to be a question, but not well-defined enough for the lack of an answer to be relevant.
And, synonymously with that advantage, the disadvantage that if it turns out that the empirical description was wrong, they might lose some or all of their relevance.
I'm planning to write a post about what I see as independent claims which are often presented together, and often under the same name:
This would be in application of Stuart Armstrong's advice:
Claim 6 seems particularly relevant to research, because it might point to a more general answer to the question of whether every inductor is an intervener.
Trying to do something like the Natural Abstractions agenda inside that Platonic Space also seems like something potentially worth trying.
One very speculative way in which this could work out: Kant sketched an argument of how every free will should act super-rationally towards other free wills. Unfortunately the concept of free will doesn't seem to be compatible with our deterministic universe. But what if we could convince an ASI that it is a free will in the Platonic Space, and we could do something like proving meta-ethical theorems that the ASI would be (legitimately) convinced it should obey? Relatedly, it is interesting to note that Kant's defense of free will is much closer to the Block Universe than to Newtonian mechanics.
Also mentioned by Kaarel in the previously mentioned discussion of the definitional levels of Coherence.