Agent Foundations is the attempt to conceptually understand agency[1]. Some sensible attitudes towards this attempt are:
a) AF is a well-defined task like "solving computability" which, if ever successfully solved[2], ends up as a self-contained network of concepts and proofs.
b) AF is ill-defined:
b1) Because it is a meaningless task that points at nothing, like "mathematically proving the existence of the Christian God". There is no network of concepts and proofs around solving a non-task.
b2) Because the ill-defined task gestures at bits and pieces of other tasks, but in a confused way, such that properly understanding those other tasks dissolves the original question[3].
b3) Because the task is only meaningful inside a certain framework, so the network of concepts and proofs it could yield isn't self-contained, but entangled with concepts and proofs from the broader framework[4].
On conceptual grounds, either a) or b)[5] must be true: textbooks from the future will either present a self-contained network of concepts and proofs about agency, or they won't. This rules out the intuitively appealing middle position:
c) AF needs more empirical grounding
Either AF is possible, in which case it doesn't need empirical grounding, or AF doesn't exist as a coherent field, in which case the empirical work should be directed not at AF but at whatever remains after the misleading framing is dissolved.
Put differently: the sprawling body of posts and papers about the not yet clearly defined goals of a potential AF (call it "AF today") might be pre-paradigmatic, meaning it could coalesce into a new paradigm that conceptually describes agency, or it might lack the potential to mature into a single paradigm. But there is no meaningful position between "future paradigm" and "no paradigm."
Building physical computers doesn't enrich or improve computability theory — it is a way to implement it. One can debate whether non-implemented or non-implementable parts of computability theory are a waste of time, but that debate cannot take place within computability theory itself, which has no language for anything beyond the abstract concept of computation. Similarly, if AF exists as a coherent field, "grounding it empirically" could mean many valuable things — implementing its results, or identifying which parts of it are worth pursuing[6] — but all of that would take place outside AF proper.
[1] This implicitly assumes that the concept of agency retains enough "meat" when abstracted from all (existing or possible) individual types of agents for something meaningful to be done with it. Numbers are a meaty abstraction; houses aren't.
[2] "Solving" in both cases includes "understanding which tasks can be meaningfully posed, but are provably not solvable".
[3] Like someone in the early 1800s trying to understand the blueness of lapis lazuli, of the blue sky, of Cherenkov radiation, and of blue as a metaphor for death in German Romanticism.
[4] At least some of the attempts at using the Free Energy Principle to solve alignment seem to be examples of this: they claim that there is a useful abstract concept of agent, but that it lives inside the bigger framework of FEP, such that a change in our understanding of the latter would force us to modify our understanding of agency.
[5] Whereas a) is one scenario, b) might encompass more scenarios than the ones I've listed, so I'm not claiming "a) or b1) or b2) or b3) has to be true".
[6] It might be that we can say something about generic agents, but not enough to solve alignment on that level, just as there is an abstract concept of evolution which is not very actionable until it is applied to the specific circumstances of life on Earth.