Since the beginning of 2022, I've been doing independent AI alignment research in agent foundations. The stage and scope of research that I am doing is roughly equivalent to starting a PhD program.
It has been somewhat popular for people in alignment to write up research agendas over the last couple of years. I've been vaguely hesitant to post a research agenda for myself, because to me it sounds like a claim of something stronger than is true. Specifically it feels like a relatively strong claim that I intend to be doing these specific things for a long time, that I expect to get results, and maybe even that I'm part of some larger organization. It's not clear to me how other readers interpret this, but hopefully I can dispel that impression with sufficient disclaimers throughout.
The purposes of this post are to:
I've written in the semi-casual tone common to LessWrong posts, and mentioned many things throughout that may not make sense if you come from outside the rationalist community.
Like many people in AI alignment, I am not working in the standard academic track, so it's worthwhile to give more context. If you just want to know concretely what I work on, skip down to the section on research problems.
As implied above, I expect lots of things about my research to change continuously. I have no external commitments that would heavily incentivize me to complete any particular task, so I stop doing something as soon as I decide it's not valuable anymore. (There are obvious downsides to this as well, but I'm comfortable with my ability to navigate them.)
My current level of math ability is something like halfway through the first semester of grad school. My most natural inclination in researching would be, step 1: learn all math. (I would never get to step 2). To combat this, I have tried to work on problems first, and only start to study a field of math when I feel like it would have helped a few times already. I think this has been quite a successful heuristic to follow so far, and I think it's also made me more confident in my recent decision to start focusing a lot more of my time on doing the equivalent of taking grad classes. I continue to do regular iterations of trying to solve research problems (see this section for more on my iteration strategy).
Sometimes when I tell people what I'm doing, they ask "why aren't you working on [some other thing in AI alignment]?" Part of the answer is that theory is my comparative advantage, and part of it is my entire model of how the future will go. This isn't the right place to go into all of my relevant models, but I can say a few things.
I am severely concerned about existential risks from AI, but I don't think it's 99.9% likely or anything like that.
I am not aiming for anything like provably-safe AGI; I am aiming for using proofs as a method for understanding the behavior of systems.
If I am good at anything, it is understanding complex concepts and then communicating them to others.
I think a marginal person is better put toward slowing down AI capabilities and deployment than toward agent foundations, but not enough to counter my comparative advantage in theorizing.
I don't think that other people have the other things handled; in fact I don't think anyone has anything handled in AI safety. So I'm doing agent foundations and crossing my fingers that other people eventually get e.g. policy handled. But if I was doing policy, I would have my fingers crossed that other people eventually get agent foundations handled.
And as for how humanity might overall bring about a positive singularity, I don't really see a successful path where a singleton AI figures out our values and then maximizes the universe for them. Instead, I think we need to use (less) powerful AI as a tool, which we use to figure out how to upload/upgrade our own minds into becoming the superintelligent entity. Again, that doesn't mean I think we probably have that much time to spare, it's just the best chance I see overall.
Agent foundations is a term that people in AI alignment use to refer to the (thus far lack of) formal foundations on which we hope to build our understanding of powerful AI systems. In the same way that Newton's Principia allowed us to answer questions about the motion of physical objects, and that the Church-Turing thesis allowed us to understand the potential and limitations of computing, we are hoping to find a mathematical formulation of what it means for something to be an agent, so that we can answer questions about how agents may act, and how to build them safely.
In other scientific fields, activities like experimental work, building prototypes, and improving instrumentation are critical to making progress in understanding. The main thing that differentiates AI is that we may not get more than one chance to build something safe and aligned. That makes theoretical work unusually important.
It is important to the way I work that my moment-to-moment motivator be to understand things (see the Methods section) and I think this is also a proper way to form new theories. But a theory should also solve problems. So I keep at hand a collection of problems which I use to direct my theorizing. This is a loose collection which changes over time. The criteria for selecting these problems is that their solutions should be plausibly useful toward AI risk, and that they may plausibly be solved by the tools and methods discussed below.
AI alignment is often described as "pre-paradigmatic", and agent foundations is one of the ways we are trying to find a paradigm. Lacking a paradigm means that confusion is the norm, and that the basic objects of study are not even understood. Thus my main research activities, for a variety of concepts, are the following.
Research is a constant struggle with the explore/exploit tradeoff. Because I have no advisors or teammates, it is especially important for me to keep track of where I spend my time, and reflect on how things are going.
I've found that my specific activities tend to fall under one of five categories:
I naturally tend to make progress by focusing on one thing for long periods. But I also want to ensure that I keep exploring a variety of things. So I generally let myself stay on one thing on the order of a few days, and I do periodic reviews to make sure I endorse how that ends up averaging out. Currently I've spent the last couple weeks almost entirely focusing on bullet points 3 and 4, and I would feel some anxiety about that if I couldn't look back through my log and see that I've spent entire previous weeks doing nothing on 3 and 4 and instead focusing on 1 and 5.
Because the field is pre-paradigmatic, we work with a collection of phenomena that is confusing and poorly explained. We also have a collection of phenomena that are well explained under what might be called the previous paradigms, and those should remain explainable under whatever new frameworks we are proposing. Here are some examples for agency and optimization:
When I'm working on more specific problems I may have other lists. For example, I often consider whether my ideas apply equally throughout this taxonomy of dynamical systems.
Every researcher will have a set of mathematical tools that they are fond of (or prone to resorting to). I'm still underdeveloped, but here are my core tools so far.
I often find myself (productively) asking questions like: Is this concept or phenomenon I'm trying to understand inherently probabilistic? Or inherently subjective, in the Bayesian prior sense? What is the set of all possible things in this context, and what would be the reasonable maximum entropy prior to have over them?
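The last question can be made concrete in a few lines: with no information beyond the size of a finite hypothesis set, the maximum entropy prior is the uniform distribution. A small sketch (the hypothesis count and the skewed alternative are arbitrary, for illustration only):

```python
import math

def entropy(p):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

# Four mutually exclusive hypotheses and no other information:
# the maximum-entropy prior is uniform.
n = 4
uniform = [1 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]

print(entropy(uniform))  # log2(4) = 2.0 bits, the maximum possible
print(entropy(skewed))   # strictly less than 2.0
```

Any deviation from uniformity encodes information we claimed not to have, which is exactly why its entropy comes out lower.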
Agents will necessarily have an equivalent of epistemic rationality and instrumental rationality. Thus many of the lessons of the Bayesian worldview are relevant to understanding agents. The von Neumann–Morgenstern model of rationality is a core related framing.
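The von Neumann–Morgenstern theorem says that an agent whose preferences satisfy a few consistency axioms ranks lotteries by expected utility. A toy sketch, with made-up outcomes and utility values:

```python
# VNM picture: preferences over lotteries are represented by
# expected utility. The outcomes and utilities here are invented
# purely for illustration.

utility = {"apple": 1.0, "banana": 0.4, "nothing": 0.0}

def expected_utility(lottery):
    """lottery: dict mapping outcome -> probability."""
    return sum(p * utility[outcome] for outcome, p in lottery.items())

# A sure banana versus a 50/50 gamble between an apple and nothing.
sure_banana = {"banana": 1.0}
gamble = {"apple": 0.5, "nothing": 0.5}

# With these utilities the gamble is preferred (0.5 > 0.4).
print(expected_utility(gamble) > expected_utility(sure_banana))  # True
```

The interesting content of VNM is the converse direction: consistent preferences *imply* that some such utility function exists, not that agents literally carry one around.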
Computability theory is the result of humanity's attempt to understand what an effective procedure is. An effective procedure is, well, any clearly defined process that you can actually do. There is near-unanimous agreement that computability successfully captures that notion, and thus it can deliver enormous insight into questions about, well, actually doing things.
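For a concrete instance of an effective procedure, Euclid's algorithm for greatest common divisors is the standard example: a clearly defined process that a person could carry out by hand, and exactly the kind of thing computability theory formalizes.

```python
def gcd(a, b):
    """Euclid's algorithm: a canonical effective procedure.
    Each step is mechanical and the process provably terminates."""
    while b:
        a, b = b, a % b
    return a

print(gcd(252, 105))  # 21
```

Nothing in the procedure requires insight or judgment at any step, which is the informal property that the Church-Turing thesis claims is fully captured by Turing machines.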
Algorithmic information theory (topics such as Kolmogorov complexity and Solomonoff induction) is a relatively natural extension to computability where we say that, very loosely, because computability captures everything that you could figure out, it also captures the essence of information, and of predictability. The only world model that a person or agent could effectively make predictions from is a computable world model. The only thing that a person or agent could try to optimize for is a computable preference ordering.
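Kolmogorov complexity itself is uncomputable, but its flavor can be illustrated with a standard trick: using an ordinary compressor as a crude upper bound on description length. A hedged Python sketch:

```python
import random
import zlib

def compressed_size(data: bytes) -> int:
    """Length after zlib compression: a crude, computable stand-in
    for Kolmogorov complexity. A real compressor only ever gives an
    upper bound on the true description length."""
    return len(zlib.compress(data, 9))

patterned = b"ab" * 500  # a short description ("repeat 'ab' 500x") exists
random.seed(0)
noisy = bytes(random.randrange(256) for _ in range(1000))  # essentially incompressible

print(compressed_size(patterned) < compressed_size(noisy))  # True
```

Both strings have the same length, but the patterned one admits a far shorter description, which is the core intuition behind algorithmic information.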
AIXI is an example of what it could look like to formalize agency using algorithmic information theory. (For my purposes, it is not sufficient, but it is useful to have for comparison.)
Dynamical systems is the study of states changing over time. Arguably, it is the study of all possible universes. More practically, it helps us understand what properties are essential for modelling a system's behavior. I'm mostly interested in embedded agency (or optimization), which means that I need to be modelling the agent as being "inside" the system. Thus I need to work with a dynamical system whose state could contain a representation of an agent.
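To make "states changing over time" concrete, here is a minimal sketch of one of the simplest dynamical systems, the logistic map (the parameter value is an arbitrary choice from the chaotic regime). Its state contains nothing agent-shaped, but it shows the basic ingredients: a state space and an evolution rule.

```python
def logistic_step(x, r=3.9):
    """One application of the evolution rule x -> r * x * (1 - x)."""
    return r * x * (1 - x)

def trajectory(x0, steps, r=3.9):
    """The orbit of initial state x0 under repeated application."""
    xs = [x0]
    for _ in range(steps):
        xs.append(logistic_step(xs[-1], r))
    return xs

# At r = 3.9 the map is chaotic: two nearby initial states yield
# trajectories that soon look unrelated, even though the evolution
# rule is simple and deterministic.
a = trajectory(0.2, 30)
b = trajectory(0.2000001, 30)
print(a[-1], b[-1])
```

An embedded-agency version of this would need a much richer state space, one whose states can encode something recognizable as an agent, which is part of why cellular-automaton-like systems come up in this context.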
Unlike with probability and computability theory, I started out knowing virtually nothing about dynamical systems. Instead, I kept asking questions whose answers led me into this field. I'm still unclear on exactly which results from the study of dynamical systems will be load-bearing for me, but it has felt substantially clarifying to have this perspective available.
These fields are also fully interconnectable. One can study non-deterministic Turing machines. The Solomonoff prior tells us how to assign probabilities to computable outcomes. A Turing machine is an example of a dynamical system. The evolution rule of a given dynamical system can be proved Turing complete or not. We could consider a probability distribution over the initial states of a dynamical system. Markov processes could be considered probabilistic versions of dynamical systems.
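One of these connections is easy to make concrete: a finite Markov process induces a deterministic dynamical system on probability distributions, with the transition matrix as the evolution rule. A sketch with made-up transition probabilities:

```python
# A two-state Markov process viewed as a deterministic dynamical
# system on distributions: the state is a probability vector and
# the evolution rule is multiplication by the transition matrix.
# The transition probabilities are invented for illustration.

P = [[0.9, 0.1],   # P[i][j] = probability of moving from state i to j
     [0.5, 0.5]]

def step(dist):
    """One step of the induced dynamics on distributions."""
    return [sum(dist[i] * P[i][j] for i in range(2)) for j in range(2)]

dist = [1.0, 0.0]  # start surely in state 0
for _ in range(100):
    dist = step(dist)

# The dynamics converge to the stationary distribution of P.
print(dist)  # approximately [5/6, 1/6]
```

The individual sample paths are random, but the distribution-level dynamics are deterministic, which is one way the probabilistic and dynamical-systems views interlock.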
As a document outlining scope, it's worth including some words about what I am not doing.
I'm unclear on the role of experiments in agent foundations. Unlike physics, or ML interpretability, the context here is not that there are observable phenomena out there in the world that we can't explain. Instead, it's that we think there is a phenomenon that will be very powerful and dangerous, and we need to clarify exactly what's going on there. I do think that theories of agent foundations could make predictions.
I'm not averse to running experiments, but if I did so now it would not be an honest activity; I would be doing it because it's a virtue of science that I am supposed to check off, and not because I actually believed it would advance my research.
It's also unclear to me exactly what the extent of "experiments" is. If writing code and running it is an experiment, then is proving something about the behavior of Turing machines an experiment? In general, I look for sources of feedback on my ideas rather than experiments to run.
I basically don't believe in giving time estimates for this kind of research, because any such statement would feel like a lie. I only expect to get even incremental results on the order of years. That said, we can use the model of a PhD program as a reference class here. I can look back over the last year and tell that I've learned a ton of technical content, but haven't exactly produced notable results, which is reasonably consistent with what a PhD student would do. That may also be true over the next year. But by the end of year 3, I should really expect to have some kind of content that could coherently be written up into a paper, even if it's not yet catching anyone's eye. And by the end of six years, I certainly hope to have a substantial body of results, even if they turn out to be negative.
Most object-level questions about ML. It's critical that I use potential application to ML as a guiding constraint in my research. But beyond that, I don't think it makes sense for me to actually spin up any neural networks, or (at this stage) try to prove theorems about concrete transformer architectures. Certainly someone should be doing that, but there are far more people doing that than doing agent foundations.
Any stuff about the brain or psychology. As with ML it can be insightful to consider how or whether my ideas may apply to the human brain. But my goal is not to understand the most complicated existing optimizer; my goal is to understand what we even mean by optimization. I think it's unlikely that we can figure out useful stuff about the brain any time soon, for the same reasons that nutrition seems to be a virtually impossible science.
Functional decision theory and game theory. These are core topics in agent foundations. But they're comparatively well-trodden by MIRI. And, for whatever reason, they've just never really interested me (relatively speaking). They don't feel important to answering the questions that I'm compelled by.
Hopefully that gives you a good sense of what I've been doing over the last year, and what I plan to work on over the next few years. I never publish as much as I wish I could, but surely I will continue to release some content. (If you're interested in following, you can subscribe to a user's posts from their user page.)
I wish there was a better explainer for what we mean by agent foundations, and for why we think it's valuable. I'm unlikely to be the one to produce that, but I would consider it if I got more feedback that it would be useful.
I welcome any questions, feedback, or criticism of my plans!
Some in agent foundations work toward understanding what an ideal agent should do; others try to characterize what systems could be considered agents at all; still others try to understand how to design agents that are safe or limited in principle.
An analogy justifying work on agent foundations is provided by MIRI's Rocket Alignment Problem.
Of course, people are building AGIs anyway. This means that it's critical to have a second group of people who are working in parallel to understand as quickly as possible what is being built, to maintain the safety of those systems. And for those people, the activities of experimentation and instrument building are still essential.
And of course that is why I am focusing on AI alignment rather than, say, pure mathematics or a theory of everything.
There's also a natural "meta" category (such as writing this post). As a rationalist, it's constantly tempting to go meta on everything all the time, for example, spending all day trying to figure out how to do research faster. To combat this tendency, I generally don't credit that time as work accomplished, and so it doesn't get its own explicit category.
In a research capacity, as opposed to a social capacity. I do that too, but don't credit it as work accomplished.
At least in the sense meant in agent foundations. In other fields, it's common to use "agent" to mean something more like, one of many autonomous little thingies moving around in the world.
A question that comes to mind here is "what are your feedback loops?" You do mention this a bit here:
I've found that my specific activities tend to fall under one of five categories:

1. Try to make progress on original research problems like the ones listed above
2. Stay up to date on the progress in AI alignment
3. Increase my own technical knowledge and skills
4. Interact with the alignment community
5. Communicate my research results

I naturally tend to make progress by focusing on one thing for long periods. But I also want to ensure that I keep exploring a variety of things. So I generally let myself stay on one thing on the order of a few days, and I do periodic reviews to make sure I endorse how that ends up averaging out. Currently I've spent the last couple weeks almost entirely focusing on bullet points 3 and 4, and I would feel some anxiety about that if I couldn't look back through my log and see that I've spent entire previous weeks doing nothing on 3 and 4 and instead focusing on 1 and 5.
I have some intuitive sense that agent foundations is important, but also some intuitive worry of "man how would we know if we were making real progress?".
I do think in some sense it's reasonable/expected to take a few years to see results but a thing on my mind these days are "can we do better than 'a few years' in terms of feedback loops?"
Topics outside the scope of my research

Most object-level questions about ML. It's critical that I use potential application to ML as a guiding constraint in my research. But beyond that, I don't think it makes sense for me to actually spin up any neural networks, or (at this stage) try to prove theorems about concrete transformer architectures. Certainly someone should be doing that, but there are far more people doing that than doing agent foundations.
I'm confused about this part. My understanding is that you're trying to build a model of what actual AI agents will be like. But you don't think that learning more about what current neural networks are like, and how they're trained, is helpful for that goal? Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?
Yeah, this is why we need a better explainer for agent foundations. I won't do it justice in this comment but I'll try to say some helpful words. (Have you read the Rocket Alignment Problem?)
Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?
I can give an easy "no" to this question. I do not necessarily expect future AIs to work in a whole new paradigm.
My understanding is that you're trying to build a model of what actual AI agents will be like.
This doesn't really describe what I'm doing. I'm trying to help figure out what AIs we should build, so I'm hoping to affect what actual AI agents will be like.
But more of what I'm doing is trying to understand what the space of possible agents looks like at all. I can see how that could sound like someone saying, "it seems like we don't know how to build a safe bridge, so I'm going to start by trying to understand what the space of possible configurations of matter looks like at all" but I do think it's different than that.
Let me try putting it this way. The arguments that AI could be an existential risk were formed before neural networks were obviously useful for anything. So the inherent danger of AIs does not come from anything particular to current systems. These arguments use specific properties about the general nature of intelligence and agency. But they are ultimately intuitive arguments. The intuition is good enough for us to know that the arguments are correct, but not good enough to help us understand how to build safe AIs. I'm trying to find the formalization behind those intuitions, so that we can have any chance at building a safe thing. Once we get some formal results about how powerful AIs could be safe even in principle, then we can start thinking about how to build versions of existing systems that have those properties. (And yes, that's a really long feedback loop, so I try to recurringly check that my trains of ideas could still in principle apply to ML systems.)
An agent is some kind of entity in the world that does something like maximizing a utility function.
See "Why The Focus on Expected Utility Maximisers?" for an informal argument against this and Ringstrom (2022) for a formal construction of a mathematical environment where utility maximisation doesn't happen.
Do specific properties of our universe – locality, conservation of energy, non-decrease in entropy, ergodicity, etc – help us develop a richer theory of optimization and agents?
See this post, Fields et al. (2022), Boyd et al. (2022), and Vanchurin (2021) for some insights about the connections between physics and learning.
Optimization is an intuitively recognizable phenomenon where a system's state gets steered into a smaller region of the state space. (An AGI will be a powerful optimizer.) What exactly do we mean by that? Can we find a formal definition that satisfactorily addresses the key examples?
In your list of research problems, there seems to be a lot of focus on optimisation. I want to point out that it seems to me that the interest in this concept peaked around 2019-2020 ("Risks from Learned Optimization in Advanced Machine Learning Systems", Hubinger et al. 2019) and since that time the concept has been falling out of fashion, which suggests that researchers don't find it very useful or consequential. E.g., any LLM is an optimiser on many different levels: in-context learning, fine-tuning, pre-training, and even LLM-environment interactions are examples of processes which we can call optimisation, but there are not many useful conclusions that we can draw from this.
Can optimizing systems be crisply taxonomized? (Perhaps we could identify more or less dangerous types of optimizing systems.)
Optimisers (which, according to the footnote, is just any identifiable system) in general can be taxonomised (e.g., Friston et al. (2022) suggested one such taxonomisation: inert, active, conservative, and strange particles aka agents), but crisp differences lie only in the domain of very simple systems. Any intelligent system of any real interest will be of the "highest" crisp type according to any such crisp taxonomy (strange particle aka agent, in Friston et al.'s classification). Furthermore, any intelligent system of real interest will be complex (as I also pointed out in this comment), and thus its classification couldn't be crisp; it will necessarily be nebulous.
(AI) psychology is where these taxonomisations are considered, but this is not the domain of agent foundations. E.g., Perez et al. estimate all psychological qualities (self-awareness, deceptiveness, etc.) in degrees not because they are not smart enough to come up with crisp criteria for these qualities and behaviours, but because this is impossible on this level of complexity.
Moreover, any environmentally identifiable system, even a rock or a molecule, could be seen as an optimiser in the course of its existence. "Inert particles" in the language of Friston et al. (2022) are optimising (inferring) their state.
The relevant problem with cognitive science (for which "agent foundations" is just an idiosyncratic alias coined in the MIRI circles) is that intelligent systems are typically complex, which means they can be modelled in many different ways.
One (probable) consequence of this is that no "agent foundations" theory will lead to building a real HRAD AGI (rather than a toy or a "proof of concept"), because as long as the agent architecture is clean (engineered strictly according to some "agent foundations" theory) and GOFAI-ish, it won't be complex enough to really respond intelligently to the complexity of the environment (cf. the law of requisite variety).
Another very important consequence is that this probably reduces the effectiveness of theoretical work in the field, especially outside of major AGI labs. Basically, this is because cognitive science is "swampy": a lot of different theories seem plausible and at least partially "correct" (i.e., model the reality with nonzero fidelity), and at the same time, no single theory is completely correct, to the level of the theory of general relativity, for instance. Therefore, lead scientists at major AGI labs pick their "favourite" theories and it's probably just impossible to move their positions with theoretical arguments. (See this argument laid out in more detail in this draft, in which I argue that independent theoretical work in AI safety is ineffective.)