Since the beginning of 2022, I've been doing independent AI alignment research in agent foundations. The stage and scope of research that I am doing is roughly equivalent to starting a PhD program.

It has been somewhat popular for people in alignment to write up research agendas over the last couple of years. I've been vaguely hesitant to post a research agenda for myself, because to me it sounds like a claim of something stronger than is true. Specifically it feels like a relatively strong claim that I intend to be doing these specific things for a long time, that I expect to get results, and maybe even that I'm part of some larger organization. It's not clear to me how other readers interpret this, but hopefully I can dispel that impression with sufficient disclaimers throughout.

The purposes of this post are to:

  • produce for myself a clearer statement of what I am doing and why,
  • make myself more legible to any other interested parties, and
  • invite some criticism of my plans.

I've written in the semi-casual tone common to LessWrong posts, and mentioned many things throughout that may not make sense if you come from outside the rationalist community.

Context

Like many people in AI alignment, I am not working in the standard academic track, so it's worthwhile to give more context. If you just want to know concretely what I work on, skip down to the section on research problems.

All is flux

As implied above, I expect lots of things about my research to change continuously. I have no external commitments that would heavily incentivize me to complete any particular task, so I stop doing something as soon as I decide it's not valuable anymore. (There are obvious downsides to this as well, but I'm comfortable with my ability to navigate them.)

Still learning

My current level of math ability is something like halfway through the first semester of grad school. My most natural inclination in researching would be step 1: learn all math. (I would never get to step 2.) To combat this, I have tried to work on problems first, and only start to study a field of math when I feel like it would have helped a few times already. I think this has been quite a successful heuristic to follow so far, and I think it's also made me more confident in my recent decision to start focusing a lot more of my time on doing the equivalent of taking grad classes. I continue to do regular iterations of trying to solve research problems (see the section on iterating between types of tasks for more on my strategy).

My views on AI and the future

Sometimes when I tell people what I'm doing, they ask "why aren't you working on [some other thing in AI alignment]?" Part of the answer is that theory is my comparative advantage, and part of it is my entire model of how the future will go. This isn't the right place to go into all of my relevant models, but I can say a few things.

I am severely concerned about existential risks from AI, but I don't think it's 99.9% likely or anything like that.

I am not aiming for anything like provably-safe AGI; I am aiming for using proofs as a method for understanding the behavior of systems.

If I am good at anything, it is understanding complex concepts and then communicating them to others.

I think a marginal person is better put toward slowing down AI capabilities and deployment than toward agent foundations, but not enough to counter my comparative advantage in theorizing.

I don't think that other people have the other things handled; in fact I don't think anyone has anything handled in AI safety. So I'm doing agent foundations and crossing my fingers that other people eventually get e.g. policy handled. But if I was doing policy, I would have my fingers crossed that other people eventually get agent foundations handled.

And as for how humanity might overall bring about a positive singularity, I don't really see a successful path where a singleton AI figures out our values and then maximizes the universe for them. Instead, I think we need to use (less) powerful AI as a tool, which we use to figure out how to upload/upgrade our own minds into becoming the superintelligent entity. Again, that doesn't mean I think we probably have that much time to spare, it's just the best chance I see overall.

What I mean by "agent foundations"

Agent foundations is a term that people in AI alignment use to refer to the (thus far lack of) formal foundations on which we hope to build our understanding of powerful AI systems. In the same way that Newton's Principia allowed us to answer questions about the motion of physical objects, and that the Church-Turing thesis allowed us to understand the potential and limitations of computing, we are hoping to find a mathematical formulation of what it means for something to be an agent,[1] so that we can answer questions about how agents may act, and how to build them safely.[2]

In other scientific fields, activities like experimental work, building prototypes, and improving instrumentation are critical to making progress in understanding. The main thing that differentiates AI is that we may not get more than one chance to build something safe and aligned. That makes theoretical work unusually important.[3]

Research problems

It is important to the way I work that my moment-to-moment motivator be to understand things (see the Methods section) and I think this is also a proper way to form new theories. But a theory should also solve problems.[4] So I keep at hand a collection of problems which I use to direct my theorizing. This is a loose collection which changes over time. The criteria for selecting these problems are that their solutions should be plausibly useful toward AI risk, and that they may plausibly be solved by the tools and methods discussed below.

  • Optimization is an intuitively recognizable phenomenon where a system's state gets steered into a smaller region of the state space. (An AGI will be a powerful optimizer.) What exactly do we mean by that? Can we find a formal definition that satisfactorily addresses the key examples?
  • How can we objectively measure whether and how much optimization is occurring inside a system? Or what is being optimized for?
  • Can we look at a single state and effectively determine whether it will lead to optimization in the future?
  • Can optimizing systems be crisply taxonomized? (Perhaps we could identify more or less dangerous types of optimizing systems.)
  • An agent is some kind of entity in the world that does something like maximizing a utility function. It has something like a world model and something like a boundary with inputs and outputs. As with optimization, what exactly do we mean by that?
  • For an optimizing process to be successful in a wide range of worlds, it seems like it would have to be more like an agent. That is, it would have to build a model of which world it's in, and choose actions accordingly. How can we prove this?
  • Do specific properties of our universe – locality, conservation of energy, non-decrease in entropy, ergodicity, etc – help us develop a richer theory of optimization and agents?
  • How can we take an ML system and say whether it is (or contains) an optimizer or an agent? (This could help us do safety evaluations on real systems.)
  • Are there powerful optimizing systems whose "domain" of optimization is inherently limited? Optimization is in some sense the "potential" to change the future state. But, like with potential energy, it should be possible, at least in theory, to contain/constrain this potential within a limited domain. What, exactly, is that theory? (This could help us design systems whose effects are limited to our intended domain.)
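To make the first bullet a little more concrete, here is a minimal toy sketch, not a proposed definition. It assumes a finite state space of integers, a hypothetical update rule `step` that moves every state one unit toward a fixed target, and it counts how many bits the occupied region shrinks by. The names `step`, `evolve`, `optimization_bits`, and the constants `N` and `TARGET` are all illustrative assumptions.

```python
import math

N = 100       # toy state space: the integers 0..N-1 (assumption)
TARGET = 50   # hypothetical attractor that the dynamics steer toward


def step(state):
    """Move one unit toward TARGET; stay put once there."""
    if state < TARGET:
        return state + 1
    if state > TARGET:
        return state - 1
    return state


def evolve(states, steps):
    """Apply the update rule `steps` times to every state in the set."""
    for _ in range(steps):
        states = {step(s) for s in states}
    return states


def optimization_bits(before, after):
    """Bits by which the occupied region shrank: log2(|before| / |after|)."""
    return math.log2(len(before) / len(after))


initial = set(range(N))            # start maximally uncertain about the state
final = evolve(initial, steps=40)  # the region contracts around TARGET
bits = optimization_bits(initial, final)
print(f"{len(initial)} states -> {len(final)} states, {bits:.2f} bits")
```

A count-based measure like this treats every state as equally likely and ignores which region the system ends up in, so it is only one of several ways one might cash out "steered into a smaller region".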

Methods

AI alignment is often described as "pre-paradigmatic", and agent foundations is one of the ways we are trying to find a paradigm. Lacking a paradigm means that confusion is the norm, and that the basic objects of study are not even understood. Thus my main research activities, for a variety of concepts, are the following.

  • Deconfusing myself about it. I think it would be fair to say that this is the sole guiding principle behind my research choices. Everything else can be said to fall under this.
  • Deconfusing others. Being able to explain something to other people is a superpowered way to find gaps in your own understanding of something. And, as an outside-view heuristic, if you can't eventually convince other people that your understanding of the concept is valid, then perhaps you are still wrong, even if you don't feel confused.
  • Formally defining the concepts. It's one thing to be deconfused at the conceptual level. To make progress in agent foundations, we should be able to talk about things at the level of mathematics. (I'm usually trying to deconfuse by means of formalization, but not always.) The math should be generally agreed to intuitively capture the relevant aspects of the concepts.
  • Analyzing key examples. The framework should be able to clarify the status of key examples (some listed below). Often, I make progress by directly trying to understand the example, and later generalizing. Sometimes the generalizations drift away from the original example, and I have to reanalyze it.
  • Proving theorems and deriving other properties. The purpose of the formalization is to be able to deduce (and justify) further facts about the concepts. These results should translate into something meaningful, and they should be applicable to the real world, even if through a long chain of approximations.

Iterate between types of tasks

Research is a constant struggle with the explore/exploit tradeoff. Because I have no advisors or teammates, it is especially important for me to keep track of where I spend my time, and reflect on how things are going.

I've found that my specific activities tend to fall under one of five categories:[5]

  1. Try to make progress on original research problems like the ones listed above
  2. Stay up to date on the progress in AI alignment
  3. Increase my own technical knowledge and skills
  4. Interact with the alignment community[6]
  5. Communicate my research results

I naturally tend to make progress by focusing on one thing for long periods. But I also want to ensure that I keep exploring a variety of things. So I generally let myself stay on one thing on the order of a few days, and I do periodic reviews to make sure I endorse how that ends up averaging out. Currently I've spent the last couple weeks almost entirely focusing on bullet points 3 and 4, and I would feel some anxiety about that if I couldn't look back through my log and see that I've spent entire previous weeks doing nothing on 3 and 4 and instead focusing on 1 and 5.

Maintain a collection of examples

As a pre-paradigmatic field, we work with a collection of phenomena that are confusing and poorly explained. We also have a collection of phenomena that are well-explained under what might be called the previous paradigms, and those should remain explainable under whatever new frameworks we are proposing. Here are some examples for agency and optimization:

  • Numerical optimization
  • A ball rolling down a hill
  • A bottle cap keeping water in a bottle versus no bottle cap allowing water to spill out
  • An ideal gas at thermal equilibrium
  • Biological evolution
  • Humans
  • AIs, existing or hypothetical
  • Oracle AIs
  • The entire universe

When I'm working on more specific problems I may have other lists. For example, I often consider whether my ideas apply equally throughout this taxonomy of dynamical systems.

Tools

Every researcher will have a set of mathematical tools that they are fond of (or prone to resorting to). I'm still underdeveloped, but here are my core tools so far.

Probability theory & the Bayesian mindset

I often find myself (productively) asking questions like: Is this concept or phenomenon I'm trying to understand inherently probabilistic? Or inherently subjective, in the Bayesian prior sense? What is the set of all possible things in this context, and what would be the reasonable maximum entropy prior to have over them?

Agents will necessarily have an equivalent of epistemic rationality and instrumental rationality.[7] Thus many of the lessons of the Bayesian worldview are relevant to understanding agents. The von Neumann–Morgenstern model of rationality is a core related framing.
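As a small illustration of the maximum entropy question, here is a sketch under simple assumptions: a finite set of hypotheses (three hypothetical coin biases), with no constraints beyond normalization, so the maximum entropy prior is the uniform one. The helper names `entropy_bits` and `bayes_update` are made up for this example.

```python
import math


def entropy_bits(dist):
    """Shannon entropy, in bits, of a distribution given as {outcome: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def bayes_update(prior, likelihood):
    """Posterior over hypotheses after one observation.

    `likelihood(h)` is the probability of the observation under hypothesis h.
    """
    unnormalized = {h: prior[h] * likelihood(h) for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


# Hypotheses: the coin's probability of landing heads (illustrative values).
biases = [0.25, 0.5, 0.75]

# With no further constraints, the maximum entropy prior is uniform.
prior = {b: 1 / len(biases) for b in biases}

# Observe a single "heads": the likelihood under bias b is just b.
posterior = bayes_update(prior, lambda b: b)

print("prior entropy:    ", entropy_bits(prior))
print("posterior entropy:", entropy_bits(posterior))
```

In this particular case the observation shifts mass toward the heads-favoring hypothesis and the entropy drops, though a single observation is not guaranteed to lower entropy in general.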

Computability theory

Computability theory is the result of humanity's attempt to understand what an effective procedure is. An effective procedure is, well, any clearly defined process that you can actually do. There is near-unanimous agreement that computability successfully captures that notion, and thus it can deliver enormous insight into questions about, well, actually doing things.

Algorithmic information theory (topics such as Kolmogorov complexity and Solomonoff induction) is a relatively natural extension to computability where we say that, very loosely, because computability captures everything that you could figure out, it also captures the essence of information, and of predictability. The only world model that a person or agent could effectively make predictions from is a computable world model. The only thing that a person or agent could try to optimize for is a computable preference ordering.
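Kolmogorov complexity itself is uncomputable, but the length of a compressed encoding gives a crude upper bound on it (up to an additive constant for the decompressor). The sketch below uses Python's standard `zlib` as a stand-in compressor. It is an assumption-laden proxy, not the real quantity, and `compressed_length` is a name invented here.

```python
import random
import zlib


def compressed_length(data):
    """Length in bytes of the zlib-compressed data: a rough upper-bound
    proxy for Kolmogorov complexity, not the true (uncomputable) value."""
    return len(zlib.compress(data, 9))


# A highly regular string: a short program ("print 'ab' 500 times") generates it.
regular = b"ab" * 500

# Pseudo-random bytes from a fixed seed. (Strictly, these also have a short
# description, namely the seed plus the generator, but zlib cannot find it,
# which shows how loose this proxy can be.)
rng = random.Random(0)
noisy = bytes(rng.getrandbits(8) for _ in range(1000))

print("regular:", compressed_length(regular), "bytes")
print("noisy:  ", compressed_length(noisy), "bytes")
```

Both inputs are 1000 bytes long, but the regular one compresses to a small fraction of that while the noisy one barely compresses at all, which is the loose sense in which predictability and short descriptions go together.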

AIXI is an example of what it could look like to formalize agency using algorithmic information theory. (For my purposes, it is not sufficient, but it is useful to have for comparison.)

Dynamical systems

Dynamical systems is the study of states changing over time. Arguably, it is the study of all possible universes. More practically, it helps us understand what properties are essential for modelling a system's behavior. I'm mostly interested in embedded agency (or optimization), which means that I need to be modelling the agent as being "inside" the system. Thus I need to work with a dynamical system whose state could contain a representation of an agent.

Unlike probability and computability theory, I did not start out knowing virtually anything about dynamical systems. Instead, I kept asking questions whose answers led me into this field. I'm still unclear on exactly what results from the study of dynamical systems will be load bearing for me, but it has felt substantially clarifying to have this perspective available.

Interconnections

These fields are also fully interconnectable. One can study non-deterministic Turing machines. The Solomonoff prior tells us how to assign probabilities to computable outcomes. A Turing machine is an example of a dynamical system. The evolution rule of a given dynamical system can be proved Turing complete or not. We could consider a probability distribution over the initial states of a dynamical system. Markov processes could be considered probabilistic versions of dynamical systems.
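Two of these connections can be sketched in a few lines, assuming a toy four-state system. A deterministic dynamical system is treated as the special case of a Markov process in which each state puts all of its probability on one successor, and a distribution over initial states is pushed forward one time step. The functions `push_forward`, `deterministic`, and `noisy` are hypothetical names for this example.

```python
from collections import defaultdict

NUM_STATES = 4  # toy state space: 0, 1, 2, 3 arranged in a cycle (assumption)


def push_forward(dist, transition):
    """Advance a distribution over states by one time step.

    `transition(state)` returns {next_state: probability}.
    """
    new_dist = defaultdict(float)
    for state, p in dist.items():
        for next_state, q in transition(state).items():
            new_dist[next_state] += p * q
    return dict(new_dist)


def deterministic(state):
    """Deterministic evolution rule: always advance one step around the cycle."""
    return {(state + 1) % NUM_STATES: 1.0}


def noisy(state):
    """Markov version: advance with probability 0.9, stay with probability 0.1."""
    return {(state + 1) % NUM_STATES: 0.9, state: 0.1}


point = {0: 1.0}                                     # known initial state
uniform = {s: 1 / NUM_STATES for s in range(NUM_STATES)}  # uncertain initial state

print(push_forward(point, deterministic))  # all mass moves to state 1
print(push_forward(point, noisy))          # mass splits between states 1 and 0
print(push_forward(uniform, noisy))        # uniform stays (approximately) uniform
```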

Some things I am not doing

As a document outlining scope, it's worth including some words about what I am not doing.

Experiments

I'm unclear on the role of experiments in agent foundations. Unlike physics, or ML interpretability, the context here is not that there are observable phenomena out there in the world that we can't explain. Instead, it's that we think there is a phenomenon that will be very powerful and dangerous, and we need to clarify exactly what's going on there. I do think that theories of agent foundations could make predictions.

I'm not averse to running experiments, but if I did so now it would not be an honest activity; I would be doing it because it's a virtue of science that I am supposed to check off, and not because I actually believed it would advance my research.

It's also unclear to me exactly what the extent of "experiments" is. If writing code and running it is an experiment, then is proving something about the behavior of Turing machines an experiment? In general, I look for sources of feedback on my ideas rather than experiments to run.

Time estimates

I basically don't believe in giving time estimates for this kind of research, because any such statement would feel like a lie. I only expect to get even incremental results on the order of years. That said, we can use the model of a PhD program as a reference class here. I can look back over the last year and tell that I've learned a ton of technical content, but haven't exactly produced notable results, which is reasonably consistent with what a PhD student would do. That may also be true over the next year. But by the end of year 3, I should really expect to have some kind of content that could coherently be written up into a paper, even if it's not yet catching anyone's eye. And by the end of six years, I certainly hope to have a substantial body of results, even if they turn out to be negative.

Topics outside the scope of my research

Most object-level questions about ML. It's critical that I use potential application to ML as a guiding constraint in my research. But beyond that, I don't think it makes sense for me to actually spin up any neural networks, or (at this stage) try to prove theorems about concrete transformer architectures. Certainly someone should be doing that, but there are far more people doing that than doing agent foundations.

Any stuff about the brain or psychology. As with ML it can be insightful to consider how or whether my ideas may apply to the human brain. But my goal is not to understand the most complicated existing optimizer; my goal is to understand what we even mean by optimization. I think it's unlikely that we can figure out useful stuff about the brain any time soon, for the same reasons that nutrition seems to be a virtually impossible science.

Functional decision theory and game theory. These are core topics in agent foundations. But they're comparatively well-trodden by MIRI. And, for whatever reason, they've just never really interested me (relatively speaking). They don't feel important to answering the questions that I'm compelled by.

Conclusion

Hopefully that gives you a good sense of what I've been doing over the last year, and what I plan to work on over the next few years. I never publish as much as I wish I could, but surely I will continue to release some content. (If you're interested in following, you can subscribe to a user's posts from their user page.)

I wish there was a better explainer for what we mean by agent foundations, and for why we think it's valuable. I'm unlikely to be the one to produce that, but I would consider it if I got more feedback that it would be useful.

I welcome any questions, feedback, or criticism of my plans!

 

  1. ^

    Some in agent foundations work toward understanding what an ideal agent should do; others try to characterize what systems could be considered agents at all; still others try to understand how to design agents that are safe or limited in principle.

  2. ^

    An analogy justifying work on agent foundations is provided by MIRI's Rocket Alignment Problem.

  3. ^

    Of course, people are building AGIs anyway. This means that it's critical to have a second group of people who are working in parallel to understand as quickly as possible what is being built, to maintain the safety of those systems. And for those people, the activities of experimentation and instrument building are still essential.

  4. ^

    And of course that is why I am focusing on AI alignment rather than, say, pure mathematics or a theory of everything.

  5. ^

    There's also a natural "meta" category (such as writing this post). As a rationalist, it's constantly tempting to go meta on everything all the time, for example, spending all day trying to figure out how to do research faster. To combat this tendency, I generally don't credit that time as work accomplished, and so it doesn't get its own explicit category.

  6. ^

    In a research capacity, as opposed to a social capacity. I do that too, but don't credit it as work accomplished.

  7. ^

    At least in the sense meant in agent foundations. In other fields, it's common to use "agent" to mean something more like, one of many autonomous little thingies moving around in the world.

9 comments

A question that comes to mind here is "what are your feedback loops?" You do mention this a bit here:

I've found that my specific activities tend to fall under one of five categories:[5]

  1. Try to make progress on original research problems like the ones listed above
  2. Stay up to date on the progress in AI alignment
  3. Increase my own technical knowledge and skills
  4. Interact with the alignment community[6]
  5. Communicate my research results

I naturally tend to make progress by focusing on one thing for long periods. But I also want to ensure that I keep exploring a variety of things. So I generally let myself stay on one thing on the order of a few days, and I do periodic reviews to make sure I endorse how that ends up averaging out. Currently I've spent the last couple weeks almost entirely focusing on bullet points 3 and 4, and I would feel some anxiety about that if I couldn't look back through my log and see that I've spent entire previous weeks doing nothing on 3 and 4 and instead focusing on 1 and 5.

I have some intuitive sense that agent foundations is important, but also some intuitive worry of "man how would we know if we were making real progress?". 

I do think in some sense it's reasonable/expected to take a few years to see results but a thing on my mind these days is "can we do better than 'a few years' in terms of feedback loops?"

aysja

I worry that overemphasizing fast feedback loops ends up making projects more myopic than is good for novel research. Like, unfortunately for everyone, a bunch of what makes good research is good research taste, and no one really understands what that is or how to get it and so it's tricky to design feedback loops to make it better. Like, I think the sense of "there's something interesting here that no one else is seeing" is sort of inherently hard to get feedback on, at least from other people. Because if you could, or if it were easy to explain from the get-go, then probably other people would have already thought it. You can maybe get feedback from the world, but it often takes a while to turn that sense into something concrete. E.g., Einstein was already onto the idea that there was something strange about light and relativity from when he was 16, but he didn't have a shippable idea until about ten years later.

I don't think it always takes ten years, but deconfusion work is just... weird, e.g., Einstein was prone to bouts of "psychic tension," and would spend weeks in a "state of confusion." He wasn't part of academia, shared his thoughts with few people, and did very little experimentation. Which isn't to say that there was literally no feedback involved, there was, but it's a complicated story which I think unfortunately involves a bunch of feedback directly from his research taste (i.e., his taste was putting constraints like "unity," "logical," "physical meaning" on the hypothesis space). And this certainly isn't to say "let's not try to make it go faster," obviously, if one could make research faster that seems great. But I think it's a careful balancing act, and I worry that putting too much pressure on speed and legibility is going to end up causing people to do science under the streetlight. I really do not want this to happen. Field founding science is a bunch weirder than normal science, and I want to take care in giving research taste enough space to find its feet. 

I agree there's a risk of overemphasizing fast-feedback loops that damages science. 

My current belief is that gaining research taste (or b is something that shouldn't be that mysterious, and mostly it seems to be something that

  • does require quite a bit of effort (which is why I think it isn't done by default)
  • also requires at least some decent meta-taste on how to gain taste (but, my guess is Alex Altair in particular has enough of this to navigate it)

And.. meanwhile I feel like we just don't have the luxury of not at least trying on this axis to some degree.

(I don't know that I can back up this statement very much, this seems to be a research vein I currently believe in that no one else currently seems to)

It is plausible to me (based on things like Alex's comment on this other post you recently responded to, and other convos with him) that Alex-in-particular is already basically doing all the things that make sense to do here. 

But, like, looking here:

But I think it's a careful balancing act, and I worry that putting too much pressure on speed and legibility is going to end up causing people to do science under the streetlight. I really do not want this to happen. Field founding science is a bunch weirder than normal science, and I want to take care in giving research taste enough space to find its feet. 

I think the amount I'm pushing for this here is "at all", and it feels premature to me to jump to "this will ruin the research process".

Topics outside the scope of my research

Most object-level questions about ML. It's critical that I use potential application to ML as a guiding constraint in my research. But beyond that, I don't think it makes sense for me to actually spin up any neural networks, or (at this stage) try to prove theorems about concrete transformer architectures. Certainly someone should be doing that, but there are far more people doing that than doing agent foundations.

I'm confused about this part. My understanding is that you're trying to build a model of what actual AI agents will be like. But you don't think that learning more about what current neural networks are like, and how they're trained, is helpful for that goal? Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?

My impression is that Alex is trying to figure out what things like "optimization" are actually like, and that this analysis will apply to a wider variety of systems than just ML. Which makes sense to me—imo, anchoring too much on current systems seems unlikely to produce general, robust solutions to alignment. 

Yeah, this is why we need a better explainer for agent foundations. I won't do it justice in this comment but I'll try to say some helpful words. (Have you read the Rocket Alignment Problem?)

Do you expect there will be a whole new paradigm, and that current neural networks will be nothing like future AIs?

I can give an easy "no" to this question. I do not necessarily expect future AIs to work in a whole new paradigm.

My understanding is that you're trying to build a model of what actual AI agents will be like.

This doesn't really describe what I'm doing. I'm trying to help figure out what AIs we should build, so I'm hoping to affect what actual AI agents will be like.

But more of what I'm doing is trying to understand what the space of possible agents looks like at all. I can see how that could sound like someone saying, "it seems like we don't know how to build a safe bridge, so I'm going to start by trying to understand what the space of possible configurations of matter looks like at all" but I do think it's different than that.

Let me try putting it this way. The arguments that AI could be an existential risk were formed before neural networks were obviously useful for anything. So the inherent danger of AIs does not come from anything particular to current systems. These arguments use specific properties about the general nature of intelligence and agency. But they are ultimately intuitive arguments. The intuition is good enough for us to know that the arguments are correct, but not good enough to help us understand how to build safe AIs. I'm trying to find the formalization behind those intuitions, so that we can have any chance at building a safe thing. Once we get some formal results about how powerful AIs could be safe even in principle, then we can start thinking about how to build versions of existing systems that have those properties. (And yes, that's a really long feedback loop, so I try to recurringly check that my trains of ideas could still in principle apply to ML systems.)

An agent is some kind of entity in the world that does something like maximizing a utility function.

See "Why The Focus on Expected Utility Maximisers?" for an informal argument against this and Ringstrom (2022) for a formal construction of a mathematical environment where utility maximisation doesn't happen.

Do specific properties of our universe – locality, conservation of energy, non-decrease in entropy, ergodicity, etc – help us develop a richer theory of optimization and agents?

See this post, Fields et al. (2022),  Boyd et al. (2022), and Vanchurin (2021) for some insights about the connections between physics and learning.

Optimization is an intuitively recognizable phenomenon where a system's state gets steered into a smaller region of the state space. (An AGI will be a powerful optimizer.) What exactly do we mean by that? Can we find a formal definition that satisfactorily addresses the key examples?

In your list of research problems, there seems to be a lot of focus on optimisation. I want to point out that it seems to me that the interest in this concept peaked around 2019-2020 ("Risks from Learned Optimization in Advanced Machine Learning Systems", Hubinger et al. 2019) and since that time the concept has been falling out of fashion, which means that researchers don't find it very useful or consequential. E.g., any LLM is an optimiser on many different levels: in-context learning, fine-tuning, pre-training, and even LLM-environment interactions are examples of processes which we can call optimisation[1], but there are not many useful conclusions that we can draw from this.

Can optimizing systems be crisply taxonomized? (Perhaps we could identify more or less dangerous types of optimizing systems.)

Optimisers (which, according to the footnote[1], is just any identifiable system) in general can be taxonomised (e.g., Friston et al. (2022) suggested one such taxonomisation: inert, active, conservative, and strange particles aka agents), but crisp differences lie only in the domain of very simple systems. Any intelligent system of any real interest will be of the "highest" crisp type according to any such crisp taxonomy (strange particle aka agent, in Friston et al.'s classification). Furthermore, any intelligent systems of real interest will be complex (as I also pointed out in this comment), and thus their classification couldn't be crisp, it will necessarily be nebulous.

(AI) psychology is where these taxonomisations are considered, but this is not the domain of agent foundations. E.g., Perez et al. estimate all psychological qualities (self-awareness, deceptiveness, etc.) in degrees not because they are not smart enough to come up with crisp criteria for these qualities and behaviours, but because this is impossible on this level of complexity.

 

  1. ^

    Moreover, any environmentally identifiable system, even a rock or a molecule, could be seen as an optimiser in the course of its existence. "Inert particles" in the language of Friston et al. (2022) are optimising (inferring) their state.

Of course, people are building AGIs anyway. This means that it's critical to have a second group of people who are working in parallel to understand as quickly as possible what is being built, to maintain the safety of those systems. And for those people, the activities of experimentation and instrument building are still essential.

The relevant problem with cognitive science (for which "agent foundations" is just an idiosyncratic alias coined in the MIRI circles) is that intelligent systems are typically complex, which means they can be modelled in many different ways.

One (probable) consequence of this is that no "agent foundations" theory will lead to building a real HRAD AGI (rather than a toy or a "proof of concept") because as long as the agent architecture is clean (engineered strictly according to some "agent foundations" theory) and GOFAI-ish, it won't be complex enough to really respond intelligently to the complexity of the environment (cf. the law of requisite variety).

Another very important consequence is that this probably reduces the effectiveness of theoretical work in the field, especially outside of major AGI labs. Basically, this is because cognitive science is "swampy": a lot of different theories seem plausible and at least partially "correct" (i.e., model the reality with nonzero fidelity), and at the same time, no single theory is completely correct, to the level of the theory of general relativity, for instance. Therefore, lead scientists at major AGI labs pick their "favourite" theories and it's probably just impossible to move their positions with theoretical arguments. (See this argument laid out in more detail in this draft, in which I argue that independent theoretical work in AI safety is ineffective.)