Below is a transcript (edited for grammar) of a video interview I had with Vanessa Kosoy, cross-posted from my blog. It is intended as a (relatively) beginner-friendly explanation of the goals of the Learning Theoretic Agenda and why more theoretical work is needed to ensure that AI is safe and reliable at scale.
Intro (by Will Petillo): Discussing the future of AI tends to get philosophical. What does it mean to have "goals" or "understanding"? Is power-seeking a default consequence of wanting things...or a human quirk resulting from our unique evolutionary history? What motivates kindness? Framing questions in this way makes them accessible, allowing everyone to participate in the conversation. But this lack of precision also makes such questions intractable, as disagreements become a clash of intuitions.
Today's guest on "Guardians of Alignment" is Vanessa Kosoy, an independent researcher supported by the Machine Intelligence Research Institute (MIRI) and the Long-Term Future Fund (LTFF), on building mathematical theories for safe AI. This focus on understanding from first principles puts her work in stark contrast to the "move fast and break things" experimental approach of leading AI labs. In this interview and elsewhere, Vanessa defends the value of a more theory-based approach and explains what it means to explore machine learning as a basic science.
Will Petillo: How did you get into AI safety?
Vanessa Kosoy: I've always been an autodidact, so I kind of just tend to learn things by myself. When I was little, I thought I would be a theoretical physicist. I actually have a bachelor's in math, but after completing it, instead of going into academia, I decided to go for a career in industry, in software.
I had a long career in the software industry, specifically in algorithm engineering, mostly computer vision, in various roles: algorithm engineer, team leader, R&D manager. I also had my own startup and worked as a consultant. Then, something like 10 years ago, I got exposed to this whole topic of existential risks from AI and started thinking that, well, that actually seems kind of important. So I started pivoting to that, and initially it was just me doing research in my free time. Then it was supported by MIRI. More recently I also got support from the Long-Term Future Fund, which enabled me to go full time.
Will Petillo: So what was that process like? You were working in just a self-directed way, and then later you gained support from MIRI and other sources? How did that come about?
Vanessa Kosoy: I started reading some of the stuff MIRI and people on Less Wrong were writing. I started working on my own ideas and writing posts on Less Wrong. After that, I got invited to some workshops, some events, and eventually MIRI said that, well, it seems like you're doing some nice work here, so maybe we will also pay you for it. And that was great, because it also enabled me to do more of that and spend less time doing other things.
Will Petillo: So a little background for the audience here. Less Wrong is a popular blog. It's initially on rationality, but also on AI related things. MIRI is the Machine Intelligence Research Institute. I like to describe them as the people working on alignment before it was cool. Tell me a little bit more about MIRI as an institution.
Vanessa Kosoy: MIRI were more or less the first to talk about existential risks from artificial intelligence. Eliezer Yudkowsky started talking about this in the year 2000, and initially MIRI was just Yudkowsky, and then over the years they managed to get some funding to get other researchers on board. They were thinking about this question of: how do we make artificial intelligence safe, and how do we even approach the question? What kind of mathematical theory can we come up with to solve this question? That was even before the deep learning revolution started, and way before the whole hype with large language models in recent years. Most of their time has been committed to trying to come up with some foundational mathematical theory that will help us with AI alignment.
More recently, they pivoted to outreach and trying to influence policy because of their beliefs that the timelines are really short and we don't have time for developing this theory, unfortunately.
Will Petillo: Have you been a part of that pivot?
Vanessa Kosoy: No, my outlook is different. I'm more conservative, you could say. I think that the timelines are not quite as short as a lot of the people in the risk community believe. I think that if the effort in the policy channels to regulate and delay AI development, to stop the really dangerous kind of AI from coming about, succeeds, then it's only buying us time. And then the question is: buying us time for what? And I think that theoretical foundations are definitely still the most important thing among the things we should be doing with the time that we have, or with the time that we manage to buy through some policy plan, one way or another.
I think that in practically any world, creating this foundational theory is the key. And that's what I'm doing. That's definitely where my personal skills and advantages lie: working on math and not on policy.
Will Petillo: You mentioned timelines. Intuitively, and I know it's impossible to really predict these things with any kind of precision, but just in terms of what motivates you, what do you see as your timeline for when this stuff needs to be solved?
Vanessa Kosoy: Some people think that AGI is coming in 10 years or even less. I think that's a little extreme. But when things need to be solved, the sooner the better, right? If we have the solution in five years, then we're in a better place than if we only have the solution in 10 years, which is still better than if we only have the solution in 20 years and so on.
Realistically, my personal opinion is that there are still multiple decades before we actually arrive at the sort of AI that poses existential risks. So that gives us a little more time, but not infinite time.
Will Petillo: What made it click for you that this is something you want to work on?
Vanessa Kosoy: It started as more of a curiosity, because it actually started with me discovering, completely randomly you could say, some papers on AGI. Not even on AI alignment or risk or anything like that, but just some paper by Jürgen Schmidhuber and Marcus Hutter with some ideas about AGI. I always was a math nerd, so it seemed really cool that there are those mathematical frameworks for thinking about AGI. I started reading about that and eventually also found Less Wrong, where some people were also discussing this kind of thing.
On the one hand, I was reading more and more things that Eliezer Yudkowsky wrote on the topic, and people on LessWrong wrote on the topic, but I also started thinking about the mathematical models. Eventually it clicked for me that when you're thinking about the actual math, then it makes perfect sense that there's no mathematical reason why AI would have to care about humans or care about anything at all aligned with what we as humans care about.
On the other hand, it also seemed obvious that it could be much more capable than us. It's pretty clear intuitively, I think, but for me, I like to understand everything through math. So when I saw that you could actually put that into a mathematical model, then it really clicked for me that this is the real thing, this is something we should really be concerned about.
Will Petillo: This idea that there's no reason to assume that AI will necessarily be nice, that sounds a lot like the Orthogonality Thesis that Nick Bostrom wrote about. Intelligence and how nice something is don't have to go together; any set of values can go with any level of intelligence. Is that essentially what your insight was?
Vanessa Kosoy: Yeah, that's exactly the terminology. In hindsight, it just seems like an obvious thing. But for me, it was necessary to see that you could actually think of it in terms of mathematical objects; values could be formalized as a utility function. And then an agent could be formalized as some kind of an optimizer, some kind of a Bayes-optimal policy or whatever for this utility function. You could actually put rigorous meaning behind each of those terms and see that it actually all makes sense—it's not just some kind of a philosophical hand waving trick.
Will Petillo: What, in your view, is the fundamental reason that makes the problem of AI alignment hard?
Vanessa Kosoy: I think the problem is hard. I think that what makes it hard is, well, first of all, that we're aiming at a very narrow target because human values are very complicated and specific. We care about a lot of very detailed things: love, friendship, beauty (in our own subjective understanding of it), sex, all of those things are human things that exist because of some complicated accidents of evolution, the way it happened on some very particular planet. It's a very particular point in the history of a particular universe. This set of values is a very, very narrow fraction of the huge space of possible values or minds that you could imagine.
So most minds are definitely not anything that would be nice to us by our standards. Worse than that, there's also this phenomenon which Nate Soares formulated nicely in one of his relatively recent posts, where he wrote that there is a basin of attraction around agent capabilities, but there is no basin of attraction around agent alignment. What this means is that if you apply enough optimization pressure, even with brute force techniques, you will eventually generate highly capable agents. And an example of that is evolution, right? Evolution is a very primitive brute force algorithm that eventually created the human brain, which is a much more sophisticated algorithm. If you put enough brute force optimization into looking for things that are successful in open-world environments, eventually you will come across intelligent agents. And this is even before you put recursive self-improvement into the equation, which makes this basin of attraction that you converge to even stronger. Whereas nothing of the sort is true about being aligned to human values in particular. It's very plausible that we could, by just kind of blind or semi-blind trial and error, arrive at creating highly capable agents long before we understand enough to actually make those agents aligned.
Will Petillo: That sounds like going against the other Bostrom-popularized idea of Instrumental Convergence, that things like survival will be wanted by almost any system optimizing hard enough.
Vanessa Kosoy: There's this notion of instrumentally convergent goals, which are certain goals that most intelligent agents will pursue because they help them to achieve their terminal goals, whatever their terminal goals are. And these are things like survival, like gaining more resources, becoming more intelligent, and so on. But human values are not that. If we manage to construct an AI that survives and gains a lot of resources, that's nice for the AI, I guess, but it doesn't help us in terms of alignment to our values at all. That's a very different kind of thing.
Will Petillo: Does the fact that modern AI is trained on human-generated data and exists in human society not help?
Vanessa Kosoy: I think it helps, but it kind of leaves a lot of questions. One question is: okay, you learn from human generated data, but how do you generalize from there? Because it's really not clear what conditions are needed to get a good generalization, especially when the concept you're learning is something extremely complicated.
The higher the complexity of the concept you're learning, the more data points you need to learn it. What we're doing with the so-called Large Language Models, which are the hype in recent years, is trying to imitate humans. Which is, I mean, nice. It could be that it will lead to something good with some probability—not a very high probability. But the problem with it is that in order to use that, you need to generalize far outside the training distribution. Here we actually need to look at what the goal is.
The problem is that it is technically possible to create superintelligent AI, which will be dangerous. To solve this problem, it's not enough to create some kind of AI which would not be dangerous, because otherwise you could just write an algorithm that doesn't do anything. That's not dangerous, mission accomplished. We need to be able to create AIs that are sufficiently powerful to serve as defense systems against those potentially dangerous AIs. So those have to be systems that have superhuman capabilities at building sophisticated models of the world and building complex long-term plans based on that. And that's something that is far outside the training distribution of a Large Language Model or anything that's based on human imitation. It is extremely unclear whether we can actually rely on the algorithms we have to generalize that far out of the training distribution without completely losing all of their alignment properties.
Will Petillo: To summarize, LLMs are fundamentally imitative, which doesn't seem particularly dangerous in itself, but it also limits what they can do. And so we can't really expect that development is just going to stop here. Eventually there might be something like Reinforcement Learning added onto it—maybe not necessarily that algorithm, but something that can be as creative as Alpha Zero is in Go and finding a really creative move that no one's seen before. So we need to be prepared for things that are much more powerful because those are going to be useful and the economics that led to building LLMs are going to lead to building bigger things. Is that what you're getting at?
Vanessa Kosoy: Yeah, that sounds pretty much spot on. Either it will be Reinforcement Learning or...well, I don't want to speculate too much on what is needed in order to make AI more powerful because it's not good information to have out there.
Will Petillo: Fair enough. Moving on to the things that you actually work on, one idea I've seen adjacent to this is the term "Agent Foundations". And also the "Learning Theoretic Agenda". What are those things?
Vanessa Kosoy: Agent Foundations is this abstract idea that says we need to create a foundational mathematical theory which explains what agents are. What does it mean, mathematically, for an algorithm to be an agent? What types of agents are possible? What capabilities can they have or not have? And so on. The Learning Theoretic Agenda is more specific than that, in the sense that it's like a very specific program that is trying to achieve this goal. Specifically, by tools that build on statistical and computational learning theory, algorithmic information theory, control theory, this kind of thing. This is the program I created to answer this challenge of coming up with those agent foundations.
Will Petillo: OK, so Agent Foundations is like the question of "how do minds work?", which encompasses AI, and the Learning Theoretic Agenda is like, "how do we design algorithms that push this in a good direction?" Is that right?
Vanessa Kosoy: I wouldn't put it that way. I would just say that Agent Foundations is just trying to understand how minds work and people have been trying to do it in various ways. MIRI have historically had all sorts of proof-theoretic models that try to approach this, and then there's Garrabrant's Logical Induction, and there are various ideas under this very broad umbrella, whereas the Learning Theoretic Agenda is a very specific approach.
It's this approach that takes AIXI and classical reinforcement learning theory as the starting points, then looks at what ingredients are missing from them in order to have a foundational theory of agents, and starts building towards those missing ingredients with ideas such as Infra-Bayesianism, Infra-Bayesian Physicalism, Metacognitive Agents, and so on.
Will Petillo: The kind of agents and minds that you're talking about here, is this tied to the frontier Large Language Models, or is it more broad to AI or any kind of thinking entity?
Vanessa Kosoy: When I say agent, I mean something very broad. Much broader than existing AIs or even just AIs. Certainly including humans, potential aliens, or whatever. So, for me, an agent is a system that has particular goals and is learning sophisticated models of the world in which it is embedded and uses those models to build long term plans in order to achieve its goals. So this is the informal description of what I mean by an "agent". The whole goal of this program is to go from this to a completely formal mathematical definition and study all the implications of this definition.
Will Petillo: So it's not just beyond LLMs, it's even broader than Machine Learning. What's the reason for that approach? Given how dominant Machine Learning is, why not focus on the things that seem to be the most widely used?
Vanessa Kosoy: First of all, let's keep some order in the terminology. I would distinguish between AI, Machine Learning, and Deep Learning. AI is this thing that people started thinking about in the 1950s: how to build thinking systems, without really having a good understanding of what that even means, but just some kind of intuitive notion that there is such a thing as thinking and we should be able to replicate it in a machine.
Machine Learning is a more specific approach to this that emerged...well, I don't want to point a finger at exactly when, but probably something like the eighties. Machine Learning is specifically this idea that the central element of thinking is learning, and learning means that you're interacting with some unknown environment and you need to create a model of this environment. So you need to take the data that you see and use it to create a model. And that's analogous to how scientists do experiments, gather data, and then build theories based on this data.
This general idea is called Machine Learning, though it would be more accurate to call it just learning. The "machine" part comes from trying to come up with ways to actually implement this inside a machine. This is a field that has a lot of mathematical theory. The mathematical theory behind machine learning is what's known as statistical and computational learning theory, and that's actually the foundation of the Learning Theoretic Agenda. That's why it's called "learning theoretic".
There has been a hypothesis that this kind of notion of learning captures most of the important bits of what we mean by thinking. And I think that this hypothesis has been extremely well-supported by recent developments in technology. This is something that I completely endorse, and it is basically the basis of my whole research program. So there's no contradiction here, because learning is still a very general thing. Humans also do learning. Aliens would also have to do learning.
Deep Learning is a more specific set of algorithms for how to actually accomplish learning efficiently in a machine, and that's what started the deep learning revolution around 2010, although the algorithms existed in some form for decades before that. But it took a while to get the details right and also to have the right hardware to run them. Deep Learning's unfortunate feature is that we don't understand it mathematically. A lot of people are trying to understand it, but we don't have a good theory of why it actually works. That's why it's not the focus of my research program, because I'm trying to come up with some mathematical understanding. I definitely have a hope that eventually people will crack this kind of mystery of how Deep Learning works, and then it will be possible to integrate it into the theory I'm building.
But even if we had this theory, it would still seem really important to think in the broadest possible generality, because, well, first of all, we don't know that the algorithms that exist today will be the algorithms that bring about AGI. And also because the broadest generality is just the correct level of abstraction to think about the problem, to get at those concepts of what it even means for a system to be "aligned". There are some philosophical problems that need to be solved here, and they are not specific to any very particular algorithms. Also, there is the fact that I actually want this theory to include humans because I might want to use this theory to formalize things like value learning. How do I design an AI system that learns values from humans?
Will Petillo: Seeing the Wikipedia-level, or just browsing the internet description, of Machine Learning and Deep Learning, it's very easy to use them interchangeably. I think I've seen Deep Learning described as just the idea that you add multiple layers of neurons. And so because there are multiple layers, it's "deep".
Vanessa Kosoy: Let me try to clarify the distinction. Machine Learning talks about taking data and building models from it. The type of model you're building can be very different. Before Deep Learning, we had algorithms such as Support Vector Machines. Polynomial Regression is also a very simple type of Machine Learning: fitting a model to data. Various methods used in statistics can be regarded as a sort of Machine Learning. There is some space of models or hypotheses and you're trying to use the data in the optimal way to infer what the correct hypothesis is, or to have some probability distribution over your hypotheses if you're doing a Bayesian approach.
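As a minimal sketch of "fitting a model to data" (an illustration added here, not something discussed in the interview), the simplest case of polynomial regression is a degree-1 fit: the hypothesis space is all straight lines, and least squares picks the line that best matches the data. The function name `fit_line` and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (degree-1 polynomial regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution: slope is covariance(x, y) divided by variance(x).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up noisy observations lying roughly along y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

slope, intercept = fit_line(xs, ys)
print(f"learned model: y = {slope:.2f} * x + {intercept:.2f}")
```

Here the learner never sees the underlying rule, only the data points, and recovers a hypothesis close to it. Richer hypothesis classes change what can be learned and what can be proven about learning.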
But different types of hypothesis classes lead to very different results in terms of the power of the algorithms, and also in terms of what we know about how to learn those hypothesis classes, and what we can prove mathematically about the conditions under which we can actually learn them. For example, for Support Vector Machines, the mathematical theory is basically solved. There are kernel methods that build on top of that, and those also have very solid mathematical theory. Deep Learning is a particular type of learning algorithm which uses those artificial neural network architectures.
It's not just multiple layers, there are a lot of details that matter. For example, the fact that the activation functions are ReLU turns out to be pretty important, and so does the kind of regularization method you use in training: dropout, for example, is basically what started the Deep Learning revolution. If you're working with sequences, then we have transformers, which is a very specific network architecture. So there are actually a lot of very specific details that people came up with over the years, mostly through a process of trial and error, to just see what works. We don't have a good theory for why those specific things work well. We don't even understand the space of models those things are actually learning, because you can prove theoretically that if you take a neural network and you just let it learn another neural network, then in some situations it will be infeasible.
But for real-world problems, neural networks succeed at learning a lot of the time. This is ostensibly because the real world has some particular properties that make it learnable, or there's some particular underlying hypothesis class that the neural networks are learning, and which captures a lot of real-world phenomena, but we don't even have a mathematical description of what this underlying hypothesis class is. We have some results for some very simple cases, like two-layer or three-layer neural networks, or under some other simplifying assumptions, but we're not close yet to having the full answer.
Will Petillo: Deep Learning assumes certain things about how the world is, in terms of what it can pick up, and it happens to work fairly well, but it's not really clear what it's assuming.
Vanessa Kosoy: Yeah, that's exactly right. So we have different no-go theorems, which say that for arbitrary data, even if the data is fully realizable, and even if the data is such that the neural network can perfectly express an exactly correct model, the problem is infeasible. In general, gradient descent will not converge to the right model, and also no other algorithm will converge because the problem will just be intractable. There are some properties that the world has, and since this Deep Learning is successful in such a big variety of very different cases, it feels like those properties should have some simple mathematical description.
It's not like some properties that are extremely specific to texts or to audio or to images. It's some properties that are extremely general and hold across a wide range of different modalities and problems. Those are properties that I can speculate, for example, as having to do with compositionality, how the real world is often well described as being made of parts, and how things can be decoupled according to different spatial scales, or different temporal scales on which the dynamics is happening. But we don't have a theory that actually explains it.
Will Petillo: You mentioned ReLU as one of the examples of a thing that just works. As I understand it, ReLU is basically like taking the output, changing it in a way that could be illustrated as a graph where it's flat on one side and a diagonal line past zero. Whereas before, models typically used Sigmoid as the activation function, which was more like a smoothly curved line that prevents numbers from getting too big. For some reason, ReLU works better. The sense I'm getting from your explanation is that this change impacts what kinds of things the neural network is able to understand in a direction that's more matching with reality. But all these changes are developed with a "throw stuff at the wall and see what sticks" kind of way, simply measuring the results without really understanding why ReLU is better than sigmoid.
Vanessa Kosoy: That's more or less right. We just have to be careful about what we mean when we say what the neural network can "understand". It's a pretty complicated concept because it's not just what the neural network can express with some set of weights, it's what the neural network can actually learn through a process of gradient descent. It has to do not just with the space of functions that the neural network can describe, but with the entire loss landscape that is created in this space of weights when we look at a particular data set.
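For readers who have not seen the two activation functions described above, here is a minimal sketch (added for illustration, using only the standard definitions). The slope helpers are included because one commonly cited, and only partial, explanation for ReLU training better is that the sigmoid's slope shrinks towards zero for large inputs, which weakens the gradient signal. That is a heuristic, not the missing theory the interview is talking about.

```python
import math


def relu(x: float) -> float:
    """Flat at zero for negative inputs, a diagonal line (identity) for positive inputs."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Smooth S-shaped curve that squashes any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu_slope(x: float) -> float:
    """Derivative of ReLU: 0 on the flat side, 1 on the diagonal side."""
    return 1.0 if x > 0 else 0.0


def sigmoid_slope(x: float) -> float:
    """Derivative of the sigmoid: largest at x = 0, close to 0 far from it."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(
        f"x={x:6.1f}  relu={relu(x):5.1f}  sigmoid={sigmoid(x):.4f}  "
        f"relu'={relu_slope(x):.1f}  sigmoid'={sigmoid_slope(x):.6f}"
    )
```

Note that both functions can express very similar things in principle. As the answer above stresses, the practical difference shows up in what gradient descent can actually learn, not just in what the network can represent.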
Will Petillo: When you're describing gradient descent and a loss landscape, the analogy I hear thrown around a lot is a ball rolling down a hill: there's a constant gravity force, and you want the ball to get down to sea level. But often it doesn't because it finds some local minimum, like a hole or something, where any direction it can go is up, so it's not going to roll anymore. So you have to shape the landscape such that down always gets the ball to sea level.
Vanessa Kosoy: Yeah, that's a pretty good explanation. Gradient descent is something that we have good mathematical theory for how it converges to the global minimum for convex functions, but the loss for neural networks is non-convex...but it still happens to be such that it works. People have gained some insights about why it works, but we still don't have the full answer.
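The contrast between convex and non-convex losses can be seen in a toy one-dimensional sketch (added for illustration; the two functions are arbitrary examples chosen here, not losses from any real network). On the convex bowl, gradient descent reaches the global minimum from any starting point. On the bumpy function, where it ends up depends on where the "ball" starts.

```python
def gradient_descent(grad, x0, lr=0.01, steps=5000):
    """Plain gradient descent: repeatedly take a small step downhill."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Convex example: f(x) = (x - 3)^2 has a single minimum at x = 3.
def convex_grad(x):
    return 2.0 * (x - 3.0)


# Non-convex example: f(x) = x^4 - 4x^2 + x has two valleys,
# a deeper one near x = -1.47 and a shallower one near x = 1.35.
def bumpy_loss(x):
    return x**4 - 4.0 * x**2 + x


def bumpy_grad(x):
    return 4.0 * x**3 - 8.0 * x + 1.0


print("convex, start at 10: ", gradient_descent(convex_grad, 10.0))
print("convex, start at -10:", gradient_descent(convex_grad, -10.0))

right = gradient_descent(bumpy_grad, 2.0)
left = gradient_descent(bumpy_grad, -2.0)
print("bumpy, start at 2: ", right, "loss", bumpy_loss(right))
print("bumpy, start at -2:", left, "loss", bumpy_loss(left))
```

Starting on the right, the ball settles in the shallower valley and never finds the deeper one. The puzzle described in the interview is that real neural network losses are non-convex like the second function, yet gradient descent still tends to find good solutions.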
Will Petillo: OK, so if the landscape is really bumpy, then you don't expect the ball to get to sea level, so the fact that it somehow does anyways demands an explanation that we don't really have. I can see how that kind of framing raises a lot of questions regarding unpredictability.
Moving on, you mentioned AIXI at one point. What is that?
Vanessa Kosoy: AIXI is this idea by Marcus Hutter, which is supposed to be a mathematical model of the perfect agent. The way it works is: there is a prior, which is the Solomonoff prior. For those who don't know what that is, it's basically some way to mathematically formalize the concept of Occam's Razor. And Occam's Razor is this idea that simple hypotheses should be considered a priori more likely than more complicated hypotheses. And this is really at the basis of all rational reasoning. Hutter took the Solomonoff prior, which is a very clever way to mathematically formalize this notion of Occam's Razor, and said, well, let's consider an agent that's living in a universe sampled from the Solomonoff prior. And this agent has some particular reward function that it's maximizing. And let's assume it's just acting in a Bayes-optimal way. So it's just following the policy that will lead it to maximize its expected utility according to this prior. And let's call this AIXI. Which is a really cool idea...only it has a bunch of problems with it, starting with the "minor" problem that it's uncomputable. There's not an algorithm that exists to implement it even in theory.
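To make the Occam's Razor idea concrete, here is a deliberately tiny toy (added for illustration, and emphatically not the Solomonoff prior itself). It assumes hypotheses are short bit patterns repeated forever, uses the pattern length as a stand-in for program length, and gives each pattern a prior weight of 2 to the power of minus its length. The real Solomonoff prior ranges over all programs for a universal Turing machine, which is what makes it uncomputable. All names below are hypothetical.

```python
from itertools import product


def hypotheses(max_len):
    """All repeating binary patterns up to max_len bits, e.g. '01' means 010101..."""
    for n in range(1, max_len + 1):
        for bits in product("01", repeat=n):
            yield "".join(bits)


def predicts(pattern, observed):
    """True if repeating `pattern` reproduces every observed bit."""
    return all(observed[i] == pattern[i % len(pattern)] for i in range(len(observed)))


def posterior(observed, max_len=4):
    """Bayesian update: keep consistent patterns, weighted by 2^-length, then normalize."""
    weights = {
        p: 2.0 ** -len(p) for p in hypotheses(max_len) if predicts(p, observed)
    }
    total = sum(weights.values())
    return {p: w / total for p, w in weights.items()}


def prob_next_is_one(observed, max_len=4):
    """Posterior probability that the next bit is 1."""
    post = posterior(observed, max_len)
    return sum(w for p, w in post.items() if p[len(observed) % len(p)] == "1")


print(posterior("0101"))
print(prob_next_is_one("01"))
```

After seeing `0101`, both `01` and `0101` explain the data, but the shorter pattern receives four times the weight. After seeing only `01`, the prediction for the next bit already leans towards `0` because the simplest consistent hypothesis says so.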
Will Petillo: I think I heard it explained once as imagining the entire universe described as a bunch of bits—ones and zeros. At the start, all of them could either be a one or a zero, then you get a little bit of data and now you've locked in a few of those numbers and have cut the space of all things that could be in half. As you keep learning, you get more and more certain.
Vanessa Kosoy: It's actually a little more nuanced than that. Just the fact that you have a lot of possibilities doesn't mean that it's uncomputable. Maybe the exact thing is uncomputable, but you could still imagine that there is some clever algorithm that approximates this Bayesian inference process. For example, if you look at classical reinforcement learning theory, then there are things like algorithms for learning in arbitrary Markov decision processes with n states. In a Markov decision process with n states, there is still an exponentially large space of possible ways it could be, and we still have actually efficient algorithms that converge to the right thing out of this exponentially large space by exploiting some properties of the problem.
The thing with AIXI is that its prior is such that even individual hypotheses in the prior are already arbitrarily computationally expensive, because in its prior it considers every possible program, so every possible program that you can write on a universal Turing machine is a possible hypothesis for how the world works. And some of those programs are extremely expensive computationally. Some of those programs don't even halt, they just enter infinite loops. And you can't even know which, because this is the halting problem, right? This is why AIXI is a non-starter for doing something computable, not to mention computationally tractable.
Will Petillo: The minor problem of uncomputability aside, a "perfect algorithm"...what does that mean? Would AIXI be safe if it were somehow, magically computed?
Vanessa Kosoy: No, it doesn't make it safe in any way. It's "perfect" in the sense of it's the most powerful algorithm you could imagine. Again, under some assumptions. I mean, there are other problems with this, such as that it assumes that the outside world is simpler than the agent itself. There are multiple problems with this, but if you can put all those problems aside then you could argue that this is the best possible agent. And in this sense, it's perfect. It's very, very, very not safe. In order for it to be safe, we would need to somehow plug the right utility function into it. And that would still be a very non-trivial problem.
Will Petillo: What kind of algorithms do you look for, assuming you're trying to find things that are computable?
Vanessa Kosoy: Computability is just one of the issues. I'm imagining that there will be something that I call the frugal universal prior, which is some kind of prior that we can mathematically define, and which will be rich enough to capture a very big variety of phenomena. And on the other hand, we'll have some clever algorithm that can actually allow efficient learning for this prior using, for example, some compositionality properties of the hypotheses in this prior or something of that sort.
But even knowing this prior, there's a lot of other conceptual problems that you also need to deal with. Like what I call the problem of privilege, where the formalization of Occam's Razor in AIXI privileges the observer, and you need to understand how to deal with that. And there's the problem of non-realizability, where you cannot actually have a hypothesis which gives you a precise description of the universe, but only some kind of approximate or partial description, and you need to understand how to deal with that. Then there's also the fact that you want your utility function to be not just a function of your observations, but also of some parameters that you cannot directly observe. You also want to be able to prove some frequentist guarantees for this algorithm: to know how much data this algorithm actually needs in order to learn particular facts, and to have a good theory of that. There's a whole range of different questions that come up when studying AIXI-like models.
Will Petillo: Studying AIXI-like models, that is what you're working on?
Vanessa Kosoy: You could, yeah, if you wanted to put it in one sentence, I guess.
Will Petillo: What are some interesting problems that you're interested in solving? I've seen Newcomb's problem floating around and stuff adjacent to this.
Vanessa Kosoy: Newcomb's problem is something Eliezer Yudkowsky wrote about a lot as an example of something that's very confusing for classical accounts of rationality. You have two boxes that you need to choose from. One box has a thousand dollars. The other box has either nothing or a million dollars. You can either take just the second box or you could take the money that's in both boxes. Normally, taking the money that's in both boxes is always strictly superior to just taking the one box.
Except that in this thought experiment there is some entity called Omega who can predict what you do, and so it only puts the $1,000,000 in the other box if it knows that you will only take that box and won't try to take the thousand dollar box as well. So only if you're the type of agent that would predictably (for Omega) take only one box will you get out of this room with $1,000,000. Whereas in the other case you will only have $1,000. So arguably it's better to take one box instead of two boxes, as opposed to what many classical accounts of rationality would say. This is one example of an interesting thought experiment.
For me, this thought experiment is a special case of the problem of non-realizability, where you need to deal with environments that are so complex that you're not able to come up with a full description of the environment that you can actually simulate. Because in this example, the environment contains this agent Omega, which simulates you, and this means that you cannot simulate it, because otherwise it would create this kind of circular paradox. I've actually also shown that my theory of dealing with non-realizability, which I call Infra-Bayesianism, leads to optimal behavior in these kinds of Newcomb-like scenarios.
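To make the payoff structure above concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `accuracy` parameter (the probability that Omega predicts the choice correctly) and the function name are introduced here and are not from the interview, and treating the prediction as correlated with the actual choice is exactly the step that classical causal accounts dispute.

```python
def expected_payoff(one_box: bool, accuracy: float) -> float:
    """Expected dollars if Omega predicts the choice correctly with probability `accuracy`.

    `accuracy` is a hypothetical parameter; the interview describes the
    limiting case of a perfect predictor (accuracy = 1.0).
    """
    small, big = 1_000, 1_000_000
    if one_box:
        # The big box is filled only when Omega (correctly) predicted one-boxing.
        return accuracy * big
    # Two-boxing: the big box is filled only when Omega wrongly predicted one-boxing.
    return small + (1 - accuracy) * big


for acc in (0.5, 0.9, 1.0):
    print(acc, expected_payoff(True, acc), expected_payoff(False, acc))
```

With a perfect predictor this reproduces the numbers in the text ($1,000,000 for one-boxing, $1,000 for two-boxing), while with a coin-flip predictor two-boxing comes out slightly ahead.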
Will Petillo: And the reason for studying Newcomb-like problems is not because we expect to be faced with Omega offering us boxes at some point, but because it's just an illustrative way of thinking about how to deal with things when you can't know what's going on. And also because it might be easy to say, "yeah, well, I'll just take one box because I'll get more that way," but when you really dive into what's a coherent, non hand wavy reason as to why, then there's some interesting insights that can potentially come up from that. Have there been any discoveries that you found through exploring these kinds of things?
Vanessa Kosoy: I would say that Infra-Bayesianism itself is an interesting discovery: a theoretical account of agents that can reason about worlds that are much too complicated for the agent to simulate. Now I describe the motivation by stating the problem of non-realizability, but the way I actually arrived at thinking about this is through thinking about so-called logical uncertainty. The reason people started thinking about it was because of so-called Updateless Decision Theory, which came from thinking about Newcomb-type paradoxes. So it all comes from that line of reasoning even though, after the fact, you can motivate it by some much more general abstract thinking.
Will Petillo: What's the connection between these decision theory type questions and making a safer AI?
Vanessa Kosoy: The idea is creating a general mathematical theory of agents. There are several ways it's going to help us with making AI safer. The most obvious is that, having this theory, we hopefully will be able to come up with rigorous models of what it means for a system to be an aligned agent. Having this rigorous definition, we'll be able to come up with some algorithms for which we can prove that they are actually safe agents. Or at least we could have a statement that says that, given such and such conjectures, we think that those algorithms are safe agents. Like in cryptography, you have some conjectures which have very strong evidential support.
We could have at least some semi formal arguments because now when people are debating whether a particular design is safe or not safe, it all boils down to those hand waving philosophical arguments, which don't have any really solid ground. Whereas this gives us tools for much more precise, crisp thinking about these kinds of questions. It hypothetically also gives us much more power to leverage empirical research because maybe we will be able to take the empirical research we have, plug it into the mathematical theory, and get some answers about how we expect those results to actually extrapolate to various regimes where we haven't done so.
Will Petillo: Would this line of research that you're working on eventually be usable in evaluating things like Large Language Models or Deep Learning based systems, to be able to say with greater certainty the extent to which they're safe or unsafe?
Vanessa Kosoy: I think there are multiple paths to impact. So there is a path to impact where we will eventually come up with a theory of Deep Learning. Or, if not a fully proven theory, then at least some strong conjectures about how Deep Learning works that can interface with the theory of agents I'm building. And then we could use this composite theory to prove things or at least to have strong arguments about properties of systems built in Deep Learning.
There could be a different path to impact where we use this theory to come up with completely new types of algorithms for building AI, which are not Deep Learning, but for which we have a good theoretical understanding.
There's also some third possibility that we won't have a good theory, but we could at least reason by analogy, similarly to how many Deep Learning algorithms are designed by analogy to some algorithms for which we have mathematical theory. Deep Q-learning, for example, is analogous to simple Q-learning, for which we have mathematical theory. So we could imagine a world in which we have some kind of idealized toy model algorithms for which we have some rigorous arguments why they are aligned and then we have some more heuristic algorithms, which we cannot directly prove things about, but which are arguably analogous to those toy models.
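For readers who have not seen it, the "simple Q-learning" mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not anything from the interview: the five-state chain environment, the constants, and the function names are all assumptions made up for the example.

```python
import random

N_STATES = 5          # hypothetical chain of states 0..4; state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right
GAMMA, ALPHA = 0.9, 0.5


def step(state, action):
    """Deterministic toy dynamics: reward 1.0 only on reaching the last state."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def q_learning(episodes=2000, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Q-learning is off-policy, so exploring uniformly at random is fine here.
            action = rng.choice(ACTIONS)
            nxt, reward, done = step(state, action)
            # Bootstrapped target: reward plus discounted best value of the next state.
            target = reward + (0.0 if done else GAMMA * max(q[nxt]))
            q[state][action] += ALPHA * (target - q[state][action])
            state = nxt
    return q


q = q_learning()
greedy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)  # greedy action for each non-terminal state
```

In the tabular setting the table `q` has one entry per state-action pair, which is what makes convergence analysis tractable; Deep Q-learning replaces the table with a neural network, and that is where the clean guarantees are lost.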
Will Petillo: So I heard three paths to impact. One is potentially building a different form of AI that's verifiable from the ground up and does the same things as Deep Learning based AI, but in a more rigorous sort of way. A second is evaluating, or at least better understanding, Deep Learning or whatever is state of the art. And then a third, in between the two, is having a simpler form of AI that analogizes to the state of the art things, so that you can use the former to understand the latter.
Vanessa Kosoy: Yeah, that sounds about right.
Will Petillo: I'd like to focus a bit on using foundational research to understand other things like Deep Learning, getting at this theory-based approach. Bringing in a totally opposite counterpoint, one could argue: no, you should just look at the things that are being used and collect data about it and then build your theory by finding patterns in the data. When the theories are shown to be wrong—as a result of more data—then update your theories then. Why work on theory in advance?
Vanessa Kosoy: The biggest reason is that you cannot reliably extrapolate from empirical research without having an underlying theory. Because you might, for example, take some measurements and find some trend...but then there is some phase transition later on that you don't see in the trend, but which happens and behavior changes to a completely different regime. And because you don't have a theoretical explanation, you won't notice—or people will just switch to using completely different algorithms, which behave in completely different ways.
You might have those empirical models of existing AIs, but those empirical models are very myopic. They're always looking one step ahead. And then you don't see the cliff that's three steps ahead of you. Updating those empirical models on new things that happen might just not be quick enough. Eventually you fall off the cliff and then it's too late to say, "oh, actually, that trend line was wrong!"
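A toy numerical sketch of this point, in Python: a trend line fitted to early measurements can match them perfectly and still be badly wrong past a regime change. The function `true_behavior`, its change point at x = 10, and all the numbers are hypothetical, chosen only to make the effect visible.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def true_behavior(x):
    # Hypothetical system: a smooth linear trend, then a regime change at x = 10.
    return 2.0 * x if x < 10 else 2.0 * x + 100.0 * (x - 10)


train_x = list(range(1, 9))                 # measurements taken before the change
train_y = [true_behavior(x) for x in train_x]
slope, intercept = fit_line(train_x, train_y)

predicted = slope * 15 + intercept          # extrapolate the trend to x = 15
actual = true_behavior(15)
print(predicted, actual)
```

The fit has zero error on every observed point, so nothing in the data warns of the change; only a model of the underlying mechanism could.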
Luckily we are in a domain where we have tools to do research even without empirical data. Of course, we should use the empirical data that we have, but we're not bottlenecked on empirical data because the thing we're studying is algorithms, and algorithms are mathematical objects, so they can be studied mathematically. This is very different from studying some phenomenon of physics, where if you don't have the data, then there's no way to generate the data without having it. Here, this really should all boil down to math. More precisely, it should boil down to math plus whatever properties the real world phenomena have that we want to assume in our mathematical theories.
And here, yeah, here it is something that we need empirical input for. But on the other hand, we already have a really good understanding of physics. So given the knowledge of physics and other scientific domains that we have, it's very plausible that we have enough information to answer all the questions purely through mathematical inquiry, even if we had no empirical data at all. Which is not to say we shouldn't also use empirical data, to supercharge this research, but we're not limited to that.
Will Petillo: So it's not a choice between theory versus experiment, we should be using both. You're focused on the theory side, and arguably there's not enough work on that because theory is where the bottleneck is, not on getting more data.
Vanessa Kosoy: Yeah, I think we should definitely be doing both. Ideally, there needs to be synergy where experiments produce new phenomena for theorists to explain and theory inspires the experiments. The theorists should be telling the experimentalists which questions and what kind of experiments are the most interesting and we should have this kind of synergy. But I think that in the current landscape—definitely in AI alignment—the theory side is currently left behind. That's where I think we should put the marginal efforts in.
Will Petillo: Do you see that synergy existing now? Like, is OpenAI asking MIRI for feedback on their experiments, or is there any kind of connection, or are people just siloed off from each other?
Vanessa Kosoy: I think it doesn't exist now, almost at all. OK, no, to be fair, it exists in some areas and much less in other areas. For example, there's people working on Singular Learning Theory. I think that they are much more interfaced with experimental work, which is good. The kind of research that MIRI is doing and the kind of research I'm doing is much less interfaced with experimental work. I have some plans for creating an experimental group working in a closed loop with me on those questions as part of my big, long-term plans, but I still haven't gotten around to doing that.
Will Petillo: If you could change anyone's mind, or set political and business agendas, what would you like to see happen to have more of an interface?
Vanessa Kosoy: First of all, we just need more theorists. To have an interface, we need something to interface with, so we just need more theorists. I think that this is, in practice, where the bottleneck is now. Once progress in theory picks up enough pace there will be a bunch of questions. I mean, there are already questions that I would like to see experiments on, but the more this thing picks up, the more such questions we will have. I think now the main bottleneck is just in having more people working on this theory.
Will Petillo: What would change from an external perspective if there was a lot more theory work? I imagine a skeptic could argue: "OpenAI and these other companies are making these really awesome breakthroughs in a largely experimental kind of way, and it's working great! If it ain't broke, don't fix it!" What's broken, in your view?
Vanessa Kosoy: I think that the current path is leading us to a disaster. I think that companies like OpenAI and other leading labs are wildly overconfident about their ability to solve problems as they come along. I think that they haven't come up with any convincing solutions to the hard parts of the problem, and they don't even have the tools to do this because of a lack of theoretical understanding. We don't even have models that are precise enough to have a solution in which we could really be confident. We need to be very precise about the arguments which convince us that the solution is good. And we don't even have the tools to reach this type of precision.
What the companies are doing is basically just developing things in trial and error. If we see any problems, then we'll just tweak the thing until the problem goes away. That's a bandaid method, which is to say it works until it doesn't work. It fixes problems on a superficial level, but eventually there will come a point where either the problem will not be caught in time and the results will be catastrophic, or the problems will be caught in time, but then nobody will have any idea what to do in order to fix it. And eventually someone is going to do the catastrophic thing anyway.
The only thing which makes me less pessimistic than other people in MIRI is that I think we still have more time. I don't think they're quite as close to AGI, and I think that a lot of things can change during this time. Which is again, not to say they will change—we might burn all this time and still end up with a catastrophe.
Will Petillo: What's an example of an existing problem that only has superficial solutions?
Vanessa Kosoy: I mean, the real problem we're concerned about is not really an existing problem, right? The main thing we're concerned about is that future AI systems—which will be much more powerful than existing AI systems—will bring about extinction of the human race or a catastrophe on a similar level.
That's not an existing problem for the simple reason that the AI systems we have today are not capable of learning a model of the world that's so sophisticated that it enables them to do these types of actions. But even now, the companies struggle with all the things that happen with Large Language Models, such as the infamous jailbreaks. The companies are trying to make the models well-behaved in various ways, for example by not telling users offensive or dangerous information, and the users easily find jailbreaks to work around that. Or the models just give false answers.
But again, for me, that's not the real problem, it's just an analogy. I mean, they're kind of struggling with these very simple, much easier problems now, which is not to say they won't solve them. Trial and error will get you there eventually. The reason trial and error is not the solution for existential risk is because once everyone is dead, the trial is over. There's no more trial. So the problems we have now, they're still struggling with them because they don't have principled tools to solve them, but eventually they will trial-and-error their way through and will patch them somehow, or at least solve them well enough for it to be economical. But once you reach the point where failures are global catastrophes, trial and error is no longer an acceptable method of fixing the problem.
Will Petillo: Obviously we don't get to see lots of test data of the world ending. But I would imagine there'd be some precursor issues that are smaller, but hint at what's to come. Do you see the challenges with hallucination or not being able to control what the AI says as those kinds of precursors? Or are they totally irrelevant and there just won't be any precursor issues?
Vanessa Kosoy: It's a difficult question, because there are very important bits that are still missing from existing AI systems to produce existential risks. We can point at examples where systems are kind of mis-generalizing, there's a lot of famous examples: some program that "wins" Tetris by pausing the game forever, or that wins some boat race game by racing the boats in infinite circles, doing various weird unintended behaviors, because the metric that the algorithm is maximizing is not actually what the users intended. You could call those precursors, but I feel like it's not exactly capturing the magnitude of the problem because those are still toy settings. These are not open-world systems acting in the open, physical world. The goals that they're trying to achieve are much simpler than human values; they're not really operating in domains where there are really complex ethical considerations.
Maybe Large Language Models are starting to approach this because they enter domains where there are some, at least morally, not completely trivial issues that come up. On the other hand, Large Language Models are not really doing things that are strongly superhuman. Well, they may be superhuman in the sense that they have a very large breadth of knowledge compared to humans, but not in other senses. So it's hard. There are things that are sort of analogous, but it's not strongly analogous.
But then again, our motivation to be concerned about this risk doesn't come from looking at LLMs. Eliezer Yudkowsky started talking about those things before Deep Learning was a thing at all. That's not where the motivation comes from.
Will Petillo: I guess the reason I was asking about it is that in the places where this stuff gets debated and polarized, one of the common objections is: "There's no evidence behind this! This is all just storytelling!" Is there evidence of the danger or does it really just come from looking at the math?
Vanessa Kosoy: The problem is, what do you call evidence? That's a very complicated question. The things that would be obvious evidence would be things like: AIs completely going out of control, breaking out of the box, hacking the computer, copying themselves to other computers, outright manipulating human operators, and so on. But this kind of thing is a sort of canary that you only expect to see very, very, very close to the point where it's already too late. It's not possible to say that we will only rely on this type of evidence to resolve the debate.
For other types of evidence, some people say that evolution is sort of evidence of how a Machine Learning algorithm can produce something which is completely unaligned with the original algorithm. Other people show you Reinforcement Learning algorithms not doing what the designer intended. But for every argument like that, you could have a counter argument which says, "yeah, but this example is not really similar. We cannot really project from there to existential risk because there are some disanalogies."
And yeah, there will always be some disanalogies because until you have AIs in the real world that are very close to being an existential risk, you won't have anything that's precisely analogous to something presenting an existential risk. So we have no choice but to reason from first principles or from math or by some more complicated, more multi-dimensional analysis. We just have no choice. The universe doesn't owe it to us to have a very easy, empirical way of testing whether those concerns are real or not. One of the things I'm hoping for is that the theory will bring about stronger arguments for AI being dangerous—or the theory will tell us no, everything is fine, and we can all relax. The lack of theory is part of the reason why we don't have foolproof, completely solid arguments in one direction or the other direction.
Will Petillo: The challenge with finding evidence is that anything you can point to that exists now could be interpreted in multiple ways. Having solid theory would lend some credence to one interpretation over another.
Vanessa Kosoy: Yeah, absolutely. If you have a theory that says that a particular type of misgeneralization is universal across most possible machine learning systems, and we also see this type of misgeneralization happening in real Machine Learning systems, then it would be much harder to dismiss it and say, "oh yeah, here we have this problem, but we'll do this and that, and that will solve it easily."
Will Petillo: There's one thing that's still bugging me about the issue of evidence not being available now. The analogy my mind immediately goes to is climate change. You could say that "Oh, the idea of large swaths of the world being uninhabitable is just this elaborate story because all that has never happened before!" But then you can look at a bunch of things that exist already: small scale disasters, the graphs of CO2 versus temperature, and so on, point to those and say, "Hey, the really bad stuff hasn't happened yet, but there is a lot of evidence that it will!" What makes AI different?
Vanessa Kosoy: I think that climate change is a great analogy. The big difference is that in climate change, we have a really good theory. Like in climate change, we have physics, right? And we have planetary science, which is on a very, very solid foundation. And we have computer simulations. It's still not trivial, there are some chaotic phenomena which are hard to simulate or predict, so not everything is completely trivial, but still we have some very, very strong theoretical foundation for understanding how those things work and what the mechanisms are. There are still big uncertainty intervals around exactly how many degrees of warming we're going to get with such and such amount of CO2, but this theory still gives us a fairly solid prediction there.
Whereas with AI, we don't have this. The analogous situation, if you want to imagine climate change, AI style, then it would be something like not having a theory which explains why CO2 leads to warming. Having some empirical correlation between temperature and CO2, and then people could argue ad infinitum. Correlation is not causation, maybe the warming is caused by something completely different, maybe if we do some unrelated thing it will stop the warming, which is not actually true. We would be in the dark. With AI, we're currently in the dark.
Will Petillo: What is happening currently with your work at MIRI?
Vanessa Kosoy: Currently there are multiple problems I'm looking at. Hopefully I will publish very soon a paper on imprecise linear bandits, which is related to Infra-Bayesianism that I mentioned before, which is a theory of agents that reason about complicated worlds. That's analyzing this theory in some very simple, special case in which I succeeded in getting some precise bounds for how much data an algorithm would need to learn particular things. After that, I'm starting to look into the theory of learning state representations in Reinforcement Learning, which is currently another big piece missing from the theory, which is about how your algorithms should learn about which features of the world are actually important to focus on.
In parallel, I have a collaborator, Gergely Szucs, who is working on using my theory of Infra-Bayesian Physicalism to create a new interpretation of quantum mechanics. He has some really interesting results there. It's kind of a test case which demonstrates how this framework of thinking about agents allows you to solve all sorts of philosophical confusions. In this case, it's confusions that have to do with the interpretation of quantum mechanics. Scott Garrabrant has a project about a new type of imprecise probabilities, some new way of representing beliefs that have some nice compositionality properties. Caspar Oesterheld from Carnegie Mellon and Abram Demski had a paper recently about some new type of frequentist guarantees for algorithms that are making decisions based on something that's similar to a prediction market. So yeah, a lot of interesting things are happening.
Will Petillo: Are there any other questions that I did not ask that would be helpful for someone seeing this to get a sense of what you're about here?
Vanessa Kosoy: Not a question exactly, but I also have a more concrete approach for how to actually solve alignment, how to actually design an aligned agent, which I call Physicalist Superimitation. It's a variation on the theme of value learning, but it draws on the framework of Infra-Bayesian Physicalism, which comes from the Learning Theoretic Agenda, and on some ideas in algorithmic information theory, to come up with a semi-formal approach to how you could have an AI that learns human values in a robust way.
It deals with a lot of problems that other approaches to value learning have, like: how do you determine where the boundaries of an agent are? What is a human? How do you locate this human in space? How do you take into account things which are not just behavior, but also internal thought processes of the human in inferring the human's values? How do you prevent perverse incentives such as the AI somehow changing or manipulating humans to change their values? How do you avoid the inner alignment problem? It answers a range of concerns that other approaches have.
Will Petillo: This sounds reminiscent of Inverse Reinforcement Learning?
Vanessa Kosoy: Inverse Reinforcement Learning is the idea that we should look at behaviors of humans, infer what those humans are trying to do, and then we can do this thing. "We" as an AI. So I actually have presentations in which I explain Physicalist Superimitation as Inverse Reinforcement Learning on steroids. It's taking this basic idea, but implementing it in ways that solve a lot of the deep problems that more simplistic approaches have. One problem that simplistic approaches have is that they model humans as perfect agents that follow the perfect policy, given perfect knowledge of the environment, which is wildly unrealistic.
Instead, I model humans as learning agents. They learn things as they go along. And they also might even do that imperfectly. Another thing is the issue of boundaries. What is a human exactly? Where do you put the boundaries around a human? Is there just some particular input and output, which the human uses and you consider everything that goes through this port to be the human? But then how do you deal with various discrepancies between what goes into this port and what the human actually intended to do, or various possibilities like the AI hijacking this channel?
In my approach, the way a human is formalized is that a human is a particular computation that the universe is running. This is something I can actually formalize using Infra-Bayesian Physicalism. This computation has particular properties which make it agentic. So the agent detects which computations the universe is running, among them detects which computations are agents, and among those agents it selects which one is its user by looking into causal relationships, and this way it homes in on the boundary of the agent. And because we're talking about the computation that this human is running, that is, human reasoning regarded as a computation, we're also automatically looking at internal thought processes and not just things that are expressed as external behaviors. So we have potentially much more information there.
Will Petillo: What would be the best way for someone to get involved? And what would they want to learn in advance?
Vanessa Kosoy: One thing they could immediately start doing is reading up on stuff people did in Agent Foundations and in the Learning-Theoretic Agenda until now. I have this recent post, Learning Theoretic Agenda: Status 2023, which summarizes a lot of the things. I also have a reading list post where I recommend some background reading for people who want to get into the field. More concretely in terms of career steps, it's already too late to apply, but I'm running a track in MATS, which is a training program for researchers who want to get into AI safety. I have a track focused on the Learning Theoretic Agenda. Hopefully there will be another such track next year. I also have a fantasy of having an internship program, which would bring people to Israel to work with me on this. Currently, because of the war, this thing has been postponed, but hopefully, eventually things will settle down and I will revive this project. Those are currently the main ways to get involved.
Will Petillo: Thank you for that description. I wish you the best in developing this theory and gaining more interest so that mismatch between evidence and theory starts to get corrected and the researchers know what they're doing rather than stumbling in the dark!
Will Petillo: So what was that process like? You were working just a self directed way, and then later you gained support from Miri and other sources? How did that come about?
Vanessa Kosoy: I started reading some of the stuff MIRI and people on Less Wrong were writing. I started working on my own ideas and writing posts on Less Wrong. After that, I got invited to some workshops, some events, and eventually MIRI said that, well, it seems like you're doing some nice work here, so maybe we will also pay you for it. And I was great, because that also enabled me to do more of that and spend less time doing other things.
Will Petillo: So a little background for the audience here. Less Wrong is a popular blog. It's initially on rationality, but also on AI related things. MIRI is the Machine Intelligence Research Institute. I like to describe them as the people working on alignment before it was cool. Tell me a little bit more about MIRI as an institution.
Vanessa Kosoy: MIRI were more or less the first to talk about existential risks from artificial intelligence. Eliezer Yudkowsky started talking about this in the year 2000, and initially MIRI was just Yudkowsky, and then over the years they managed to get some funding to get other researchers on board. They were thinking about this question of: how do we make artificial intelligence safe, and how do we even approach the question? What kind of mathematical theory can we come up with to solve this question? That's even before the deep learning revolution started, and way before the whole hype with large language models in recent years. Most of their time has been committed to trying to come up with some foundational mathematical theory that will help us with AI alignment.
More recently, they pivoted to outreach and trying to influence policy because of their beliefs that the timelines are really short and we don't have time for developing this theory, unfortunately.
Will Petillo: Have you been a part of that pivot?
Vanessa Kosoy: No, my outlook is different. I'm more conservative, you could say. I think that the timelines are not quite as short as a lot of the people in the risk community believe. I think that if the effort in the policy channels to regulate and delay AI development, to stop the really dangerous kind of AI from coming about, succeeds, then it's only buying us time. And then the question is: buying us time for what? And I think that theoretical foundations are definitely still the most important thing among the things we should be doing with the time that we have, or with the time that we manage to buy by some policy plan, one way or another.
I think that practically in any world, creating this foundational theory is the key. And that's what I'm doing. That's definitely where my personal skills and advantages lie: in working on math and not on policy.
Will Petillo: You mentioned timelines. Intuitively, and I know it's impossible to really predict these things with any kind of precision, but just in terms of what motivates you, what do you see as your timeline for when this stuff needs to be solved?
Vanessa Kosoy: Some people think that AGI is coming in 10 years or even less. I think that's a little extreme. But when things need to be solved, the sooner the better, right? If we have the solution in five years, then we're in a better place than if we only have the solution in 10 years, which is still better than if we only have the solution in 20 years and so on.
Realistically, my personal opinion is that there are still multiple decades before we actually arrive at the sort of AI that poses existential risks. So that gives us a little more time, but not infinite time.
Will Petillo: What made it click for you that this is something you want to work on?
Vanessa Kosoy: It started more as a curiosity, because it actually started with me discovering, completely randomly you could say, some papers on AGI. They were not even on AI alignment or risk or anything like that, but just some papers by Jürgen Schmidhuber and Marcus Hutter with some ideas about AGI. I always was a math nerd, so it seemed really cool that there are those mathematical frameworks for thinking about AGI. I started reading about that and eventually also found Less Wrong, where some people were also discussing this kind of thing.
On the one hand, I was reading more and more things that Eliezer Yudkowsky wrote on the topic, and people on LessWrong wrote on the topic, but I also started thinking about the mathematical models. Eventually it clicked for me that when you're thinking about the actual math, then it makes perfect sense that there's no mathematical reason why AI would have to care about humans or care about anything at all aligned with what we as humans care about.
On the other hand, it also seemed obvious that it could be much more capable than us. It's pretty clear intuitively, I think, but for me, I like to understand everything through math. So when I saw that you could actually put that into a mathematical model, then it really clicked for me that this is the real thing, this is something we should really be concerned about.
Will Petillo: This idea that there's no reason to assume that AI will necessarily be nice, that sounds a lot like the Orthogonality Thesis that Nick Bostrom wrote about. Intelligence and how nice something is don't have to go together; any set of values can go with any level of intelligence. Is that essentially what your insight was?
Vanessa Kosoy: Yeah, that's exactly the terminology. In hindsight, it just seems like an obvious thing. But for me, it was necessary to see that you could actually think of it in terms of mathematical objects; values could be formalized as a utility function. And then an agent could be formalized as some kind of an optimizer, some kind of a Bayes-optimal policy or whatever for this utility function. You could actually put rigorous meaning behind each of those terms and see that it actually all makes sense—it's not just some kind of a philosophical hand waving trick.
Will Petillo: What, in your view, is the fundamental reason that makes the problem of AI alignment hard?
Vanessa Kosoy: I think the problem is hard. I think that what makes it hard is, well, first of all, that we're aiming at a very narrow target because human values are very complicated and specific. We care about a lot of very detailed things: love, friendship, beauty (in our own subjective understanding of it), sex, all of those things are human things that exist because of some complicated accidents of evolution, the way it happened on some very particular planet. It's a very particular point in the history of a particular universe. This set of values is a very, very narrow fraction of the huge space of possible values or minds that you could imagine.
So most minds are definitely not anything that would be nice to us by our standards. Worse than that, there's also this phenomenon which Nate Soares formulated nicely in one of his relatively recent posts, where he wrote that there is a basin of attraction around agent capabilities, but there is no basin of attraction around agent alignment. What this means is that if you apply enough optimization pressure, even with brute force techniques, you will eventually generate highly capable agents. And an example of that is evolution, right? Evolution is a very primitive brute force algorithm that eventually created the human brain, which is a much more sophisticated algorithm. If you put enough brute force optimization into looking for things that are successful in open-world environments, eventually you will come across intelligent agents. And this is even before you put recursive self-improvement into the equation, which makes this basin of attraction that you converge to even stronger. Whereas nothing of the sort is true about being aligned to human values in particular. It's very plausible that we could, by just kind of blind or semi-blind trial and error, arrive at creating highly capable agents long before we understand enough to actually make those agents aligned.
Will Petillo: That sounds like going against the other Bostrom-popularized idea of Instrumental Convergence, that things like survival will be wanted by almost any system optimizing hard enough.
Vanessa Kosoy: There's this notion of instrumentally convergent goals, which are certain goals that most intelligent agents will pursue because they help them to achieve their terminal goals, whatever their terminal goals are. And these are things like survival, like gaining more resources, becoming more intelligent, and so on. But human values are not that. If we manage to construct an AI that survives and gains a lot of resources, that's nice for the AI, I guess, but it doesn't help us in terms of alignment to our values at all. That's a very different kind of thing.
Will Petillo: Does the fact that modern AI is trained on human-generated data and exists in human society not help?
Vanessa Kosoy: I think it helps, but it kind of leaves a lot of questions. One question is: okay, you learn from human generated data, but how do you generalize from there? Because it's really not clear what conditions are needed to get a good generalization, especially when the concept you're learning is something extremely complicated.
The higher the complexity of the concept you're learning, the more data points you need to learn it. What we're doing with the so-called Large Language Models, which are the hype in recent years, is trying to imitate humans. Which is, I mean, nice. It could be that it will lead to something good with some probability—not a very high probability. But the problem with it is that in order to use that, you need to generalize far outside the training distribution. Here we actually need to look at what the goal is.
The problem is that it is technically possible to create super intelligent AI, which will be dangerous. To solve this problem, it's not enough to create some kind of AI which would not be dangerous, because otherwise they could just write an algorithm that doesn't do anything. That's not dangerous, mission accomplished. We need to be able to create AIs that are sufficiently powerful to serve as defense systems against those potentially dangerous AIs. So those have to be systems that have superhuman capabilities at building sophisticated models of the world and building complex long term plans based on that. And that's something that is far outside the training distribution of a Large Language Model or anything that's based on human imitation. It is extremely unclear whether we can actually rely on the algorithms we have to generalize that far out of the training distribution without completely losing all of their alignment properties.
Will Petillo: To summarize, LLMs are fundamentally imitative, which doesn't seem particularly dangerous in itself, but it also limits what they can do. And so we can't really expect that development is just going to stop here. Eventually there might be something like Reinforcement Learning added onto it—maybe not necessarily that algorithm, but something that can be as creative as Alpha Zero is in Go and finding a really creative move that no one's seen before. So we need to be prepared for things that are much more powerful because those are going to be useful and the economics that led to building LLMs are going to lead to building bigger things. Is that what you're getting at?
Vanessa Kosoy: Yeah, that sounds pretty much on the spot. Either it will be Reinforcement Learning or...well, I don't want to speculate too much on what is needed in order to make AI more powerful because it's not good information to have out there.
Will Petillo: Fair enough. Moving on to the things that you actually work on, one idea I've seen adjacent to this is the term "Agent Foundations". And also the "Learning Theoretic Agenda". What are those things?
Vanessa Kosoy: Agent Foundations is this abstract idea that says we need to create a foundational mathematical theory which explains what agents are. What does it mean, mathematically, for an algorithm to be an agent? What types of agents are possible? What capabilities can they have or not have? And so on. The Learning Theoretic Agenda is more specific than that, in the sense that it's like a very specific program that is trying to achieve this goal. Specifically, by tools that build on statistical and computational learning theory, algorithmic information theory, control theory, this kind of thing. This is the program I created to answer this challenge of coming up with those agent foundations.
Will Petillo: OK, so Agent Foundations is like the question of "how do minds work?", that encompasses AI, and the Learning Theoretic Agenda is like, "how do we design algorithms that push this in a good direction?" Is that right?
Vanessa Kosoy: I wouldn't put it that way. I would just say that Agent Foundations is just trying to understand how minds work and people have been trying to do it in various ways. MIRI have historically had all sorts of proof-theoretic models that try to approach this, and then there's Garrabrant's Logical Induction, and there are various ideas under this very broad umbrella, whereas the Learning Theoretic Agenda is a very specific approach.
It's this approach that starts with AIXI and classical reinforcement learning theory as the starting points, then looks at what ingredients are missing from them in order to have a foundational theory of agents, and starts building towards those missing ingredients with ideas such as Infra-Bayesianism, Infra-Bayesian Physicalism, Metacognitive Agents, and so on.
Will Petillo: The kind of agents and minds that you're talking about here, is this tied to the frontier Large Language Models, or is it more broad to AI or any kind of thinking entity?
Vanessa Kosoy: When I say agent, I mean something very broad. Much broader than existing AIs or even just AIs. Certainly including humans, potential aliens, or whatever. So, for me, an agent is a system that has particular goals and is learning sophisticated models of the world in which it is embedded and uses those models to build long term plans in order to achieve its goals. So this is the informal description of what I mean by an "agent". The whole goal of this program is to go from this to a completely formal mathematical definition and study all the implications of this definition.
Will Petillo: So it's not just beyond LLMs, it's even broader than Machine Learning. What's the reason for that approach? Given how dominant Machine Learning is, why not focus on the things that seem to be the most widely used?
Vanessa Kosoy: First of all, let's keep some order in the terminology. I would distinguish between AI, Machine Learning, and Deep Learning. AI is this thing that people started thinking about since the 1950s, about how to build thinking systems, without really having good understanding of what does it even mean, but just some kind of intuitive notion that there is such a thing as thinking and we should be able to replicate it in a machine.
Machine Learning is a more specific approach to this that emerged...well, I don't want to point a finger at exactly when, but probably something like the eighties. Machine Learning is specifically this idea that the central element of thinking is learning, and learning means that you're interacting with some unknown environment and you need to create a model of this environment. So you need to take the data that you see and use it to create a model. And that's analogous to how scientists do experiments, gather data, and then build theories based on this data.
This general idea is called Machine Learning—or, more accurately would be to call it just learning. The "machine" part comes from trying to come up with ways to actually implement this inside a machine. This is a field that has a lot of mathematical theory. The mathematical theory behind machine learning is what's known as statistical and computational learning theory, and that's actually the foundation of the Learning Theoretic Agenda. That's why it's called "learning theoretic".
There has been a hypothesis that this kind of notion of learning captures most of the important bits of what we mean by thinking. And I think that this hypothesis has been extremely well-supported by recent developments in technology. This is something that I completely endorse, and it is basically the basis of my whole research program. So there's no contradiction here, because learning is still a very general thing. Humans also do learning. Aliens would also have to do learning.
Deep Learning is a more specific set of algorithms for how to actually accomplish learning efficiently in a machine, and that's what started the deep learning revolution around 2010, although the algorithms existed in some form for decades before that. But it took a while to get the details right and also to have the right hardware to run them. Deep Learning's unfortunate feature is that we don't understand it mathematically. A lot of people are trying to understand it, but we don't have a good theory of why it actually works. That's why it's not the focus of my research program, because I'm trying to come up with some mathematical understanding. I definitely have a hope that eventually people will crack this kind of mystery of how Deep Learning works, and then it will be possible to integrate it into the theory I'm building.
But even if we had this theory, then it still seems really important to think in the broadest possible generality, because, well, first of all, we don't know that the algorithms that exist today will be the algorithms that bring about AGI. And also because the broadest generality is just the correct level of abstraction to think about the problem, to get at those concepts of what it even means for a system to be "aligned". There are some philosophical problems that need to be solved here, and they are not specific to some very particular algorithms. Also, there is the fact that I actually want this theory to include humans because I might want to use this theory to formalize things like value learning. How do I design an AI system that learns values from humans?
Will Petillo: Seeing the Wikipedia-level, or just browsing the internet description, of Machine Learning and Deep Learning, it's very easy to use them interchangeably. I think I've seen the description as Deep Learning is just the idea that you add multiple layers of neurons. And so because there's multiple layers, it's "deep".
Vanessa Kosoy: Let me try to clarify the distinction. Machine Learning talks about taking data and building models from it. The type of model you're building can be very different. Before Deep Learning, we had algorithms such as Support Vector Machines, Polynomial Regression is also a very simple type of Machine Learning—fitting a model to data. Various methods used in statistics can be regarded as a sort of Machine Learning. There is some space of models or hypotheses and you're trying to use the data in the optimal way to infer what the correct hypothesis is, or to have some probability distribution of your hypothesis if you're doing a Bayesian approach.
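The Bayesian picture described above, a space of hypotheses plus data that yields a probability distribution over them, can be sketched in a few lines. This is only an illustrative toy, not anything from the interview: the hypothesis class (three candidate coin biases), the uniform prior, and the flip sequence are all assumptions chosen to keep the example small.

```python
# Toy Bayesian learning over a tiny hypothesis class.
# Assumption: each hypothesis is a coin bias (probability of heads).
hypotheses = [0.1, 0.5, 0.9]
prior = {h: 1.0 / len(hypotheses) for h in hypotheses}  # uniform prior


def likelihood(bias, data):
    """Probability of the observed flips if the coin had this bias."""
    p = 1.0
    for flip in data:
        p *= bias if flip == "H" else 1.0 - bias
    return p


def posterior(prior, data):
    """Bayes' rule: reweight each hypothesis by how well it explains the data."""
    unnormalized = {h: prior[h] * likelihood(h, data) for h in prior}
    total = sum(unnormalized.values())
    return {h: w / total for h, w in unnormalized.items()}


data = "HHTHHHHH"  # hypothetical observations: 7 heads, 1 tail
post = posterior(prior, data)
for h, w in post.items():
    print(f"bias={h}: posterior={w:.4f}")
```

With only three hypotheses, inference is trivial. The difficulty discussed in the interview comes from much richer hypothesis classes, where it is far less clear when learning is feasible.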
But different types of hypothesis classes lead to very different results in terms of the power of the algorithms, but also in terms of what we know to say about how to learn those hypothesis classes. And what do we know to prove mathematically under which conditions we can actually learn them? For example, for Support Vector Machines, the mathematical theory is basically solved. There's kernel methods that build on top of that, and that also has very solid mathematical theory. Deep learning is a particular type of learning algorithm which uses those artificial neural network architectures.
It's not just multiple layers, there are a lot of details that matter. For example, the fact that the activation functions are ReLU turns out to be pretty important, as does the kind of regularization method you use in training. For example, dropout is basically what started the Deep Learning revolution. If you're working with sequences, then we have transformers, which is a very specific network architecture. So there are actually a lot of very specific details that people came up with over the years, mostly through a process of trial and error, to just see what works. We don't have a good theory for why those specific things work well. We don't even understand the space of models those things are actually learning, because you can prove theoretically that if you take a neural network and you just let it learn another neural network, then in some situations it will be infeasible.
But for real world problems, neural networks succeed to learn a lot of the time. This is ostensibly because the real world has some particular properties that make it learnable, or there's some particular underlying hypothesis class that the neural networks are learning, and which captures a lot of real world phenomena, but we don't even have a mathematical description of what this underlying hypothesis class is. We have some results for some very simple cases, like two layer or three layer neural networks, or some other simplifying assumptions, but we're not close yet to having the full answer.
Will Petillo: Deep Learning assumes certain things about how the world is, in terms of what it can pick up, and it happens to work fairly well, but it's not really clear what it's assuming.
Vanessa Kosoy: Yeah, that's exactly right. So we have different no-go theorems, which say that for arbitrary data, even if the data is fully realizable, and even if the data is such that the neural network can perfectly express an exactly correct model, the problem is infeasible. In general, gradient descent will not converge to the right model, and also no other algorithm will converge because the problem will just be intractable. There are some properties that the world has, and since this Deep Learning is successful in such a big variety of very different cases, it feels like those properties should have some simple mathematical description.
It's not like some properties that are extremely specific to texts or to audio or to images. It's some properties that are extremely general and hold across a wide range of different modalities and problems. Those are properties that I can speculate, for example, as having to do with compositionality, how the real world is often well described as being made of parts, and how things can be decoupled according to different spatial scales, or different temporal scales on which the dynamics is happening. But we don't have a theory that actually explains it.
Will Petillo: You mentioned ReLU as one of the examples of a thing that just works. As I understand it, ReLU is basically like taking the output, changing it in a way that could be illustrated as a graph where it's flat on one side and a diagonal line past zero. Whereas before, models typically used Sigmoid as the activation function, which was more like a smoothly curved line that prevents numbers from getting too big. For some reason, ReLU works better. The sense I'm getting from your explanation is that this change impacts what kinds of things the neural network is able to understand in a direction that's more matching with reality. But all these changes are developed with a "throw stuff at the wall and see what sticks" kind of way, simply measuring the results without really understanding why ReLU is better than sigmoid.
Vanessa Kosoy: That's more or less right. We just have to be careful about what we mean when we say what the neural network can "understand". It's a pretty complicated concept because it's not just what the neural network can express with some set of weights, it's what the neural network can actually learn through a process of gradient descent. It has to do not just with the space of functions that the neural network can describe, but with the entire loss landscape that is created in this space of weights when we look at a particular data set.
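For readers who want to see the two activation functions from the exchange above, here is a minimal sketch. It only shows the shapes and slopes of ReLU and sigmoid; it does not explain why ReLU tends to train better, which, as the interview stresses, is not fully understood. The function names are hypothetical helpers written for this illustration.

```python
import math


def relu(x: float) -> float:
    """Flat at zero for negative inputs, a diagonal line for positive inputs."""
    return max(0.0, x)


def sigmoid(x: float) -> float:
    """Smooth S-shaped curve that squashes any input into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def relu_grad(x: float) -> float:
    """Slope of ReLU: 0 on the flat side, 1 on the diagonal side."""
    return 1.0 if x > 0 else 0.0


def sigmoid_grad(x: float) -> float:
    """Slope of sigmoid: shrinks towards 0 for large positive or negative inputs."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(
        f"x={x:+.1f}  relu={relu(x):.3f}  sigmoid={sigmoid(x):.3f}  "
        f"relu'={relu_grad(x):.3f}  sigmoid'={sigmoid_grad(x):.5f}"
    )
```

One commonly cited difference is visible in the slopes: the sigmoid's slope becomes tiny for large inputs, while ReLU's slope stays at 1 on the positive side. That is an observation about the functions themselves, not a complete account of training behavior.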
Will Petillo: When you're describing gradient descent and the loss landscape, the analogy I hear thrown around a lot is a ball rolling down a hill: there's a constant gravity force, and you want the ball to get down to sea level. But often it doesn't because it finds some local minimum, like a hole or something, where any direction it can go is up, so it's not going to roll anymore. So you have to shape the landscape such that down always gets the ball to sea level.
Vanessa Kosoy: Yeah, that's a pretty good explanation. Gradient descent is something that we have good mathematical theory for how it converges to the global minimum for convex functions, but the loss for neural networks is non-convex...but it still happens to be such that it works. People have gained some insights about why it works, but we still don't have the full answer.
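The contrast between convex and non-convex losses can be illustrated with a one-dimensional toy. This is a sketch under stated assumptions: both functions, the learning rate, and the starting points are hypothetical choices made for this example, and real neural network loss landscapes are vastly higher-dimensional.

```python
def gradient_descent(grad, x0, lr=0.01, steps=2000):
    """Plain gradient descent: repeatedly step against the gradient."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x


# Convex example: f(x) = (x - 3)^2 has a single minimum at x = 3.
def convex_grad(x):
    return 2.0 * (x - 3.0)


# Non-convex example: f(x) = x^4 - 3x^2 + x has two separate valleys.
def bumpy(x):
    return x**4 - 3.0 * x**2 + x


def bumpy_grad(x):
    return 4.0 * x**3 - 6.0 * x + 1.0


# On the convex function, different starting points reach the same minimum.
print(gradient_descent(convex_grad, x0=-10.0), gradient_descent(convex_grad, x0=10.0))

# On the non-convex function, the starting point decides which valley we end in.
left = gradient_descent(bumpy_grad, x0=-2.0)
right = gradient_descent(bumpy_grad, x0=2.0)
print(f"start=-2.0 -> x={left:.4f}, f={bumpy(left):.4f}")
print(f"start=+2.0 -> x={right:.4f}, f={bumpy(right):.4f}")
```

In this toy, the run that starts on the right gets stuck in the shallower valley, which is the "ball in a hole" situation from the analogy. The open question raised in the interview is why this kind of failure matters so much less in practice for neural networks than the general theory would suggest.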
Will Petillo: OK, so if the landscape is really bumpy, then you don't expect the ball to get to sea level, so the fact that it somehow does anyways demands an explanation that we don't really have. I can see how that kind of framing raises a lot of questions regarding unpredictability.
Moving on, you mentioned AIXI at one point. What is that?
Vanessa Kosoy: AIXI is this idea by Marcus Hutter, which is supposed to be a mathematical model of the perfect agent. The way it works is: there is a prior, which is the Solomonoff prior. For those who don't know what that is, it's basically some way to mathematically formalize the concept of Occam's Razor. And Occam's Razor is this idea that simple hypotheses should be considered a priori more likely than more complicated hypotheses. And this is really at the basis of all rational reasoning. Hutter took the Solomonoff prior, which is a very clever way to mathematically formalize this notion of Occam's Razor, and said, well, let's consider an agent that's living in a universe sampled from the Solomonoff prior. And this agent has some particular reward function that it's maximizing. And let's assume it's just acting in a Bayes-optimal way. So it's just following the policy that will lead it to maximize its expected utility according to this prior. And let's call this AIXI. Which is a really cool idea...only it has a bunch of problems with it, starting with the "minor" problem that it's uncomputable. There's not an algorithm that exists to implement it even in theory.
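For readers who like to see the symbols, the two ingredients described above can be written schematically as follows. This is a simplified sketch rather than Hutter's full definition: the notation ($U$ for a universal Turing machine, $\ell(p)$ for the length of program $p$ in bits, a finite horizon $m$, rewards $r_t$) is assumed here for illustration, and details such as how actions feed into the programs are omitted.

```latex
% Solomonoff prior: every program p whose output begins with x contributes
% weight 2^{-\ell(p)}, so shorter (simpler) programs count for more.
M(x) \;=\; \sum_{p \,:\, U(p) \text{ begins with } x} 2^{-\ell(p)}

% AIXI, schematically: the policy that maximizes expected total reward
% when the environment \mu is drawn from that prior.
\pi^{\mathrm{AIXI}} \;=\; \arg\max_{\pi} \; \mathbb{E}^{\pi}_{\mu \sim M}\!\left[ \sum_{t=1}^{m} r_t \right]
```

The sum in the first line ranges over all programs, including ones that never halt, which is where the uncomputability mentioned above comes from.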
Will Petillo: I think I heard it explained once as imagining the entire universe described as a bunch of bits—ones and zeros. At the start, all of them could either be a one or a zero, then you get a little bit of data and now you've locked in a few of those numbers and have cut the space of all things that could be in half. As you keep learning, you get more and more certain.
Vanessa Kosoy: It's actually a little more nuanced than that. Just the fact that you have a lot of possibilities doesn't mean that it's uncomputable. Maybe the exact thing is uncomputable, but you could still imagine that there is some clever algorithm that approximates this Bayesian inference process. For example, if you look at classical reinforcement learning theory, then there are things like algorithms for learning in arbitrary Markov decision processes with n states. In a Markov decision process with n states, there is still an exponentially large space of possible ways it could be, and we still have actually efficient algorithms that converge to the right thing out of this exponentially large space by exploiting some properties of the problem.
The thing with AIXI is that its prior is such that even individual hypotheses in the prior are already arbitrarily computationally expensive, because in its prior it considers every possible program, so every possible program that you can write on a universal Turing machine is a possible hypothesis for how the world works. And some of those programs are extremely expensive computationally. Some of those programs don't even halt, they just enter infinite loops. And you can't even know which, because this is the halting problem, right? This is why AIXI is a non-starter for doing something computable, not to mention computationally tractable.
Will Petillo: The minor problem of uncomputability aside, a "perfect algorithm"...what does that mean? Would AIXI be safe if it were somehow, magically computed?
Vanessa Kosoy: No, it doesn't make it safe in any way. It's "perfect" in the sense of it's the most powerful algorithm you could imagine. Again, under some assumptions. I mean, there are other problems with this, such as that it assumes that the outside world is simpler than the agent itself. There are multiple problems with this, but if you can put all those problems aside then you could argue that this is the best possible agent. And in this sense, it's perfect. It's very, very, very not safe. In order for it to be safe, we would need to somehow plug the right utility function into it. And that would still be a very non-trivial problem.
Will Petillo: What kind of algorithms do you look for, assuming you're trying to find things that are computable?
Vanessa Kosoy: Computability is just one of the issues. I'm imagining that there will be something that I call the frugal universal prior, which is some kind of prior that we can mathematically define, which will simultaneously be rich enough to capture a very big variety of phenomena. And on the other hand, we'll have some clever algorithm that can actually allow efficient learning for this prior using, for example, some compositionality properties of the hypothesis on this prior or something of that sort.
But even knowing this prior, there's a lot of other conceptual problems that you also need to deal with. Like what I call the problem of privilege, where the formalization of Occam's Razor in AIXI privileges the observer, and you need to understand how to deal with that. And there's the problem of realizability where you cannot actually have a hypothesis which gives you a precise description of the universe, but only some kind of approximate or partial description, and you need to understand how to deal with that. Then there's also the fact that you want your utility function to be not just a function of your observations, but also of some parameters that you cannot directly observe. You also want to be able to prove some Frequentist guarantees for this algorithm, to know how much data this algorithm actually needs to learn particular facts and have a good theory of that. There's a whole range of different questions that come up when studying AIXI-like models.
Will Petillo: Studying AIXI-like models, that is what you're working on?
Vanessa Kosoy: You could, yeah, if you wanted to put it in one sentence, I guess.
Will Petillo: What are some interesting problems that you're interested in solving? I've seen Newcomb's problem floating around and stuff adjacent to this.
Vanessa Kosoy: Newcomb's problem is something Eliezer Yudkowsky wrote about a lot as an example of something that's very confusing for classical accounts of rationality. You have two boxes that you need to choose from. One box has a thousand dollars. The other box has either nothing or a million dollars. You can either choose the first box or you could take the money that's in both boxes. Normally, taking the money that's in both boxes is always strictly superior to just taking the one box.
Except that in this spot experiment there is some entity called Omega who can predict what you do, and so it only puts the $1,000,000 in the other box if it knows that you will only take that box and won't try to take the thousand dollar box as well. So only if you're the type of agent that would predictably (for Omega) take only one box, only in this case you will get out of this room with $1,000,000. Whereas in the other case you will only have $1,000. So arguably it's better to take one box instead of two boxes, as opposed to what many classical accounts of rationality would say. This is one example of an interesting thought experiment.
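The payoffs just described can be tabulated with a short calculation. Note the assumptions: the interview treats Omega as a reliable predictor, whereas the `accuracy` parameter below (the probability that Omega predicts your choice correctly) is a hypothetical addition for illustration. This is a plain expected-value calculation and not a sketch of Infra-Bayesianism or any particular decision theory.

```python
SMALL = 1_000      # contents of the first box
BIG = 1_000_000    # contents of the second box, if Omega filled it


def expected_payoff(one_box: bool, accuracy: float) -> float:
    """Expected winnings, assuming Omega predicts correctly with probability `accuracy`."""
    if one_box:
        # The big box is full only if Omega correctly predicted one-boxing.
        return accuracy * BIG
    # Two-boxing always gets the small box; the big box is full only
    # when Omega wrongly predicted one-boxing.
    return SMALL + (1.0 - accuracy) * BIG


for accuracy in (0.5, 0.6, 0.9, 1.0):
    one = expected_payoff(True, accuracy)
    two = expected_payoff(False, accuracy)
    print(f"accuracy={accuracy:.1f}  one-box={one:>12,.0f}  two-box={two:>12,.0f}")
```

With a perfect predictor (`accuracy=1.0`) this reproduces the figures in the text: $1,000,000 for taking one box versus $1,000 for taking both.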
For me, this thought experiment is a special case of the problem of non-realizability, where you need to deal with environments that are so complex that you're not able to come up with a full description of the environment that you can actually simulate. Because in this example, the environment contains this agent Omega, which simulates you, and this means that you cannot simulate it, because otherwise it would create this kind of circular paradox. I've actually also shown that my theory of dealing with non-realizability, which I call Infra-Bayesianism, actually leads to optimal behavior in these kinds of Newcomb-like problem scenarios.
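As a rough illustration of the payoffs described above, here is a minimal sketch in Python. It is not Infra-Bayesianism and not anything from the interview: it only computes a simple expected payoff, under the added assumption that Omega predicts your choice correctly with some probability `accuracy` (a hypothetical parameter; the thought experiment as stated corresponds to `accuracy = 1.0`).

```python
SMALL_PRIZE = 1_000       # the box that always contains $1,000
BIG_PRIZE = 1_000_000     # the box Omega fills only if it predicts one-boxing


def expected_payoff(one_box: bool, accuracy: float) -> float:
    """Expected dollars, assuming Omega's prediction is right with probability `accuracy`."""
    if one_box:
        # The big box is full only when Omega correctly predicted one-boxing.
        return accuracy * BIG_PRIZE
    # Two-boxing always collects the small prize; the big box is full
    # only when Omega wrongly predicted one-boxing.
    return SMALL_PRIZE + (1 - accuracy) * BIG_PRIZE


if __name__ == "__main__":
    for accuracy in (0.5, 0.9, 1.0):
        one = expected_payoff(True, accuracy)
        two = expected_payoff(False, accuracy)
        print(f"accuracy={accuracy}: one-box={one:,.0f}  two-box={two:,.0f}")
```

Under this toy calculation, one-boxing comes out ahead whenever the predictor is even slightly better than a coin flip, which is the tension with the "taking both boxes is strictly superior" argument. Whether conditioning on the prediction in this way is legitimate is exactly what the decision-theory debate is about.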
Will Petillo: And the reason for studying Newcomb-like problems is not because we expect to be faced with Omega offering us boxes at some point, but because it's just an illustrative way of thinking about how to deal with things when you can't know what's going on. And also because it might be easy to say, "yeah, well, I'll just take one box because I'll get more that way," but when you really dive into what's a coherent, non-hand-wavy reason as to why, then there are some interesting insights that can potentially come up from that. Have there been any discoveries that you found through exploring these kinds of things?
Vanessa Kosoy: I would say that Infra-Bayesianism itself is an interesting discovery: a theoretical account of agents that can reason about complicated worlds that are much too complicated for the agent to simulate. Now I describe the motivation by stating the problem of non-realizability, but the way I actually arrived at thinking about this is through thinking about so-called logical uncertainty. The reason people started thinking about it was because of so-called Updateless Decision Theory, which came from thinking about Newcomb-type paradoxes. So it all comes from that line of reasoning even though, after the fact, you can motivate it by some much more general abstract thinking.
Will Petillo: What's the connection between these decision theory type questions and making a safer AI?
Vanessa Kosoy: The idea is creating a general mathematical theory of agents. There are several ways it's going to help us with making AI safer. The most obvious is that, in having this theory, we hopefully will be able to come up with rigorous models of what it means for a system to be an aligned agent. Having this rigorous definition, we'll be able to come up with some algorithms for which we can prove those algorithms are actually safe agents. Or at least we could have some conjecture that says that, given such and such conjectures, we think that those algorithms are safe agents. Like in cryptography, you have some conjectures which have very strong evidential support.
We could have at least some semi-formal arguments, because now when people are debating whether a particular design is safe or not safe, it all boils down to those hand-waving philosophical arguments, which don't have any really solid ground. Whereas this gives us tools for much more precise, crisp thinking about these kinds of questions. It hypothetically also gives us much more power to leverage empirical research because maybe we will be able to take the empirical research we have, plug it into the mathematical theory, and get some answers about how we expect those results to actually extrapolate to various regimes where we haven't done experiments.
Will Petillo: Would this line of research that you're working on eventually be usable in evaluating things like Large Language Models or Deep Learning based systems, to be able to say with greater certainty the extent to which they're safe or unsafe?
Vanessa Kosoy: I think there are multiple paths to impact. So there is a path to impact where we will eventually come up with a theory of Deep Learning. Or, if not a fully proven theory, then at least some strong conjectures about how Deep Learning works that can interface with the theory of agents I'm building. And then we could use this composite theory to prove things or at least to have strong arguments about properties of systems built in Deep Learning.
There could be a different path to impact where we use this theory to come up with completely new types of algorithms for building AI, which are not Deep Learning, but for which we have a good theoretical understanding.
There's also some third possibility that we won't have a good theory, but we could at least reason by analogy, similarly to how many Deep Learning algorithms are designed by analogy to some algorithms for which we have mathematical theory. Deep Q-learning, for example, is analogous to simple Q-learning, for which we have mathematical theory. So we could imagine a world in which we have some kind of idealized toy model algorithms for which we have some rigorous arguments why they are aligned and then we have some more heuristic algorithms, which we cannot directly prove things about, but which are arguably analogous to those toy models.
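For readers unfamiliar with the "simple Q-learning" mentioned above, here is a minimal sketch of the tabular version in Python. The environment (a short corridor with a reward at the right end), the function name `q_learning`, and all hyperparameter values are assumptions made for illustration; none of them come from the interview. Deep Q-learning replaces the table `q` with a neural network, which is where the existing convergence theory stops applying directly.

```python
import random


def q_learning(n_states=5, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    """Tabular Q-learning on a toy corridor: start at state 0, reward 1 on reaching the last state."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[s][a] estimates the return of taking action a in state s (0 = left, 1 = right).
    # The goal row is never updated, so it stays at zero, as a terminal state should.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        while s != goal:
            # Epsilon-greedy action selection; ties are broken in favour of "right".
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = 0 if q[s][0] > q[s][1] else 1
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Q-learning update: move q[s][a] toward the bootstrapped target.
            target = r + gamma * max(q[s_next])
            q[s][a] += alpha * (target - q[s][a])
            s = s_next
    return q


if __name__ == "__main__":
    for state, (left, right) in enumerate(q_learning()):
        print(f"state {state}: left={left:.3f} right={right:.3f}")
```

In this toy setting the learned values favour moving right in every non-terminal state, which is the kind of behaviour the convergence theory for tabular Q-learning lets one predict in advance.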
Will Petillo: So I heard three paths to impact. One is potentially building a different form of AI that's verifiable from the ground up and does the same things as Deep Learning based AI, but in a more rigorous sort of way. A second is evaluating, or at least better understanding, Deep Learning or whatever is state of the art. And then a third, in between the two, is having a simpler form of AI that analogizes to the state of the art things, so that you can use the former to understand the latter.
Vanessa Kosoy: Yeah, that sounds about right.
Will Petillo: I'd like to focus a bit on using foundational research to understand other things like Deep Learning, getting at this theory-based approach. Bringing in a totally opposite counterpoint, one could argue: no, you should just look at the things that are being used and collect data about it and then build your theory by finding patterns in the data. When the theories are shown to be wrong—as a result of more data—then update your theories then. Why work on theory in advance?
Vanessa Kosoy: The biggest reason is that you cannot reliably extrapolate from empirical research without having an underlying theory. Because you might, for example, take some measurements and find some trend...but then there is some phase transition later on that you don't see in the trend, but which happens and behavior changes to a completely different regime. And because you don't have a theoretical explanation, you won't notice—or people will just switch to using completely different algorithms, which behave in completely different ways.
You might have those empirical models of existing AIs, but those empirical models are very myopic. They're always looking one step ahead. And then you don't see the cliff that's three steps ahead of you. Updating those purely empirical models on new things that happen might just not be quick enough. Eventually you fall off the cliff and then it's too late to say, "oh, actually, that trend line was wrong!"
Luckily we are in a domain where we have tools to do research even without empirical data. Of course, we should use the empirical data that we have, but we're not bottlenecked on empirical data because the thing we're studying is algorithms, and algorithms are mathematical objects, so they can be studied mathematically. This is very different from studying some phenomenon of physics, where if you don't have the data, then there's no way to generate the data without having it. Here, this really should all boil down to math. More precisely, it should boil down to math plus whatever properties the real world phenomena have that we want to assume in our mathematical theories.
And here, yeah, here it is something that we need empirical input for. But on the other hand, we already have a really good understanding of physics. So given the knowledge of physics and other scientific domains that we have, it's very plausible that we have enough information to answer all the questions purely through mathematical inquiry, even if we had no empirical data at all. Which is not to say we shouldn't also use empirical data, to supercharge this research, but we're not limited to that.
Will Petillo: So it's not a choice between theory versus experiment, we should be using both. You're focused on the theory side, and arguably there's not enough work on that because theory is where the bottleneck is, not on getting more data.
Vanessa Kosoy: Yeah, I think we should definitely be doing both. Ideally, there needs to be synergy where experiments produce new phenomena for theorists to explain and theory inspires the experiments. The theorists should be telling the experimentalists which questions and what kind of experiments are the most interesting and we should have this kind of synergy. But I think that in the current landscape—definitely in AI alignment—the theory side is currently left behind. That's where I think we should put the marginal efforts in.
Will Petillo: Do you see that synergy existing now? Like, is OpenAI asking MIRI for feedback on their experiments, or is there any kind of connection, or are people just siloed off from each other?
Vanessa Kosoy: I think it doesn't exist now, almost at all. OK, no, to be fair, it exists in some areas and much less in other areas. For example, there are people working on Singular Learning Theory. I think that they are much more interfaced with experimental work, which is good. The kind of research that MIRI is doing and the kind of research I'm doing is much less interfaced with experimental work. I have some plans for creating an experimental group working in a closed loop with me on those questions as part of my big, long-term plans, but I still haven't gotten around to doing that yet.
Will Petillo: If you could change anyone's mind, or set political and business agendas, what would you like to see happen to have more of an interface?
Vanessa Kosoy: First of all, we just need more theorists. To have an interface, we need something to interface with, so we just need more theorists. I think that this is, in practice, where the bottleneck is now. Once progress in theory picks up enough pace, there will be a bunch of questions. I mean, there are already questions that I would like to see experiments on, but the more this thing picks up, the more such questions we will have. I think now the main bottleneck is just in having more people working on this theory.
Will Petillo: What would change from an external perspective if there was a lot more theory work? I imagine a skeptic could argue: "OpenAI and these other companies are making these really awesome breakthroughs in a largely experimental kind of way, and it's working great! If it ain't broke, don't fix it!" What's broken, in your view?
Vanessa Kosoy: I think that the current path is leading us to a disaster. I think that companies like OpenAI and other leading labs are wildly overconfident about their ability to solve problems as they come along. I think that they haven't come up with any convincing solutions to the hard parts of the problem, and they don't even have the tools to do this because of a lack of theoretical understanding. We don't even have models that are precise enough to have a solution in which we could really be confident. We need to be very precise about the arguments which convince us that the solution is good. And we don't even have the tools to reach this type of precision.
What the companies are doing is basically just developing things by trial and error. If we see any problems, then we'll just tweak the thing until the problem goes away. That's a band-aid method, which is to say it works until it doesn't work. It fixes problems on a superficial level, but eventually there will come a point where either the problem will not be caught in time and the results will be catastrophic, or the problems will be caught in time, but then nobody will have any idea what to do in order to fix it. And eventually someone is going to do the catastrophic thing anyway.
The only thing which makes me less pessimistic than other people in MIRI is that I think we still have more time. I don't think they're quite as close to AGI, and I think that a lot of things can change during this time. Which is again, not to say they will change—we might burn all this time and still end up with a catastrophe.
Will Petillo: What's an example of an existing problem that only has superficial solutions?
Vanessa Kosoy: I mean, the real problem we're concerned about is not really an existing problem, right? The main thing we're concerned about is that future AI systems—which will be much more powerful than existing AI systems—will bring about extinction of the human race or a catastrophe on a similar level.
That's not an existing problem for the simple reason that the AI systems we have today are not capable of learning a model of the world that's so sophisticated that it enables them to do these types of actions. But even now, the companies struggle with all the things that happen with Large Language Models, such as the infamous jailbreaks where they're trying to make them well-behaved in various ways. Not telling the users offensive, dangerous information, for example, and the users easily find jailbreaks to work around that, or the models just give false answers.
But again, for me, that's not the real problem, it's just an analogy. I mean, they're kind of struggling with these very simple, much easier problems now, which is not to say they won't solve them. Trial and error will get you there eventually. The reason trial and error is not the solution for existential risk is because once everyone is dead, the trial is over. There's no more trial. So the problems we have now, they're still struggling with them because they don't have principled tools to solve them, but eventually they will trial-and-error their way through and will patch them somehow, or at least solve them well enough for it to be economical. But once you reach the point where failures are global catastrophes, trial and error is no longer an acceptable method of fixing the problem.
Will Petillo: Obviously we don't get to see lots of test data of the world ending. But I would imagine there'd be some precursor issues that are smaller, but hint at what's to come. Do you see the challenges with hallucination or not being able to control what the AI says as those kinds of precursors? Or are they totally irrelevant and there just won't be any precursor issues?
Vanessa Kosoy: It's a difficult question, because there are very important bits that are still missing from existing AI systems to produce existential risks. We can point at examples where systems are kind of mis-generalizing. There are a lot of famous examples: some program that "wins" Tetris by pausing the game forever, or that wins some boat race game by racing the boats in infinite circles, doing various weird unintended behaviors, because the metric that the algorithm is maximizing is not actually what the users intended. You could call those precursors, but I feel like it's not exactly capturing the magnitude of the problem because those are still toy settings. There are no open-world systems that are acting in the open, physical world. The goals that they're trying to solve are much simpler than human values; they're not really operating in domains where there are really complex ethical considerations.
Maybe Large Language Models are starting to approach this because they enter domains where there are some, at least morally, not completely trivial issues that come up. On the other hand, Large Language Models are not really doing things that are strongly superhuman. Well, they may be superhuman in the sense that they have a very large breadth of knowledge compared to humans, but not in other senses. So it's hard. There are things that are sort of analogous, but it's not strongly analogous.
But then again, our motivation to be concerned about this risk doesn't come from looking at LLMs. Eliezer Yudkowsky started talking about those things before Deep Learning was a thing at all. That's not where the motivation comes from.
Will Petillo: I guess the reason I was asking about it is that in the places where this stuff gets debated and polarized, one of the common objections is: "There's no evidence behind this! This is all just storytelling!" Is there evidence of the danger or does it really just come from looking at the math?
Vanessa Kosoy: The problem is, what do you call evidence? That's a very complicated question. The things that would be obvious evidence would be things like: AIs completely going out of control, breaking out of the box, hacking the computer, copying themselves to other computers, outright manipulating human operators, and so on. But this kind of thing is a sort of canary that you only expect to see very, very, very close to the point where it's already too late. It's not possible to say that we will only rely on this type of evidence to resolve the debate.
For other types of evidence, some people say that evolution is sort of evidence of how a Machine Learning algorithm can produce something which is completely unaligned with the original algorithm. Other people show you Reinforcement Learning algorithms doing not what the designer intended. But for every argument like that, you could have a counterargument which says, "yeah, but this example is not really similar. We cannot really project from there to existential risk because there are some disanalogies."
And yeah, there will always be some disanalogies because until you have AIs in the real world that are very close to being an existential risk, you won't have anything that's precisely analogous to something presenting an existential risk. So we have no choice but to reason from first principles or from math or by some more complicated, more multi-dimensional analysis. We just have no choice. The universe doesn't owe it to us to have a very easy, empirical way of testing whether those concerns are real or not. One of the things I'm hoping for is that the theory will bring about stronger arguments for AI being dangerous—or the theory will tell us no, everything is fine, and we can all relax. The lack of theory is part of the reason why we don't have foolproof, completely solid arguments in one direction or the other direction.
Will Petillo: The challenge with finding evidence is that anything you can point to that exists now could be interpreted in multiple ways. Having solid theory would lend some credence to one interpretation over another.
Vanessa Kosoy: Yeah, absolutely. If you have a theory that says that a particular type of misgeneralization is universal across most possible machine learning systems, and we also see this type of misgeneralization happening in real Machine Learning systems, then it would be much harder to dismiss it and say, "oh yeah, here we have this problem, but we'll do this and that, and that will solve it easily."
Will Petillo: There's one thing that's still bugging me about the issue of evidence not being available now. The analogy my mind immediately goes to is climate change. You could say that "Oh, the idea of large swaths of the world being uninhabitable is just this elaborate story because all that has never happened before!" But then you can look at a bunch of things that exist already: small scale disasters, the graphs of CO2 versus temperature, and so on, point to those and say, "Hey, the really bad stuff hasn't happened yet, but there is a lot of evidence that it will!" What makes AI different?
Vanessa Kosoy: I think that climate change is a great analogy. The big difference is that in climate change, we have a really good theory. Like in climate change, we have physics, right? And we have planetary science, which is on a very, very solid foundation. And we have computer simulations. It's still not trivial, there are some chaotic phenomena which are hard to simulate or predict, so not everything is completely trivial, but still we have some very, very strong theoretical foundation for understanding how those things work and what the mechanisms are. And this theory is telling us that there are still big uncertainty intervals around exactly how many degrees of warming we're going to get with such and such amount of CO2, but we still have a fairly solid prediction there.
Whereas with AI, we don't have this. The analogous situation, if you want to imagine climate change, AI style, then it would be something like not having a theory which explains why CO2 leads to warming. Having some empirical correlation between temperature and CO2, and then people could argue ad infinitum. Correlation is not causation, maybe the warming is caused by something completely different, maybe if we do some unrelated thing it will stop the warming, which is not actually true. We would be in the dark. With AI, we're currently in the dark.
Will Petillo: What is happening currently with your work at MIRI?
Vanessa Kosoy: Currently there are multiple problems I'm looking at. Hopefully I will publish very soon a paper on imprecise linear bandits, which is related to Infra-Bayesianism that I mentioned before, which is a theory of agents that reason about complicated worlds. That's analyzing this theory in some very simple, special case in which I succeeded in getting some precise bounds for how much data an algorithm would need to learn particular things. After that, I'm starting to look into the theory of learning state representations in Reinforcement Learning, which is currently another big piece missing from the theory, which is about how your algorithms should learn about which features of the world are actually important to focus on.
In parallel, I have a collaborator, Gergely Szucs, who is working on using my theory of Infra-Bayesian Physicalism to create a new interpretation of quantum mechanics. He has some really interesting results there. It's kind of a test case which demonstrates how this framework of thinking about agents allows you to solve all sorts of philosophical confusions. In this case, it's confusions that have to do with the interpretation of quantum mechanics. Scott Garrabrant has a project about a new type of imprecise probabilities, some new way of representing beliefs that have some nice compositionality properties. Caspar Oesterheld from Carnegie Mellon and Abram Demski had a paper recently about some new type of frequentist guarantees for algorithms that are making decisions based on something that's similar to a prediction market. So yeah, a lot of interesting things are happening.
Will Petillo: Are there any other questions that I did not ask that would be helpful for someone seeing this to get a sense of what you're about here?
Vanessa Kosoy: Not a question exactly, but I also have a more concrete approach for how to actually solve alignment, how to actually design an aligned agent, which I call Physicalist Superimitation. It's a variation on the theme of value learning, but it draws from the framework of Infra-Bayesian Physicalism, which comes from the Learning Theoretic Agenda, and from some ideas in algorithmic information theory to come up with a semi-formal approach to how you could have an AI that learns human values in a robust way.
It deals with a lot of problems that other approaches to value learning have, like: how do you determine where the boundaries of an agent are? What is a human? How do you locate this human in space? How do you take into account things which are not just behavior, but also internal thought processes of the human in inferring the human's values? How do you prevent perverse incentives such as the AI somehow changing or manipulating humans to change their values? How do you avoid the inner alignment problem? It answers a range of concerns that other approaches have.
Will Petillo: This sounds reminiscent of Inverse Reinforcement Learning?
Vanessa Kosoy: Inverse Reinforcement Learning is the idea that we should look at behaviors of humans, infer what those humans are trying to do, and then we can do this thing. "We" as an AI. So I actually have presentations in which I explain Physicalist Superimitation as Inverse Reinforcement Learning on steroids. It's taking this basic idea, but implementing it in ways that solve a lot of the deep problems that more simplistic approaches have. One problem that simplistic approaches have is that they model humans as perfect agents that follow the perfect policy, given perfect knowledge of the environment, which is wildly unrealistic.
Instead, I model humans as learning agents. They learn things as they go along. And they also might even do that imperfectly. Another thing is the issue of boundaries. What is a human exactly? Where do you put the boundaries around a human? Is there just some particular input and output, which the human uses and you consider everything that goes through this port to be the human? But then how do you deal with various discrepancies between what goes into this port and what the human actually intended to do, or various possibilities like the AI hijacking this channel?
In my approach, the way a human is formalized is that a human is a particular computation that the universe is running. This is something I can actually formalize using Infra-Bayesian Physicalism. It has particular properties, which make it agentic, so the agent detects which computations the universe is running, among them detects which computations are agents, and amongst those agents, it selects which agent is its user by looking into causal relationships, and this way it homes in on the boundary of the agent. And because we're talking about the computation that this human is running, which is human reasoning regarded as a computation, we're also automatically looking at internal thought processes and not just things that are expressed as external behaviors. So we have potentially much more information there.
Will Petillo: What would be the best way for someone to get involved? And what would they want to learn in advance?
Vanessa Kosoy: One thing they could immediately start doing is reading up on stuff people did in Agent Foundations and in the Learning-Theoretic Agenda until now. I have this recent post, Learning Theoretic Agenda: Status 2023, which summarizes a lot of the things. I also have a reading list post where I recommend some background reading for people who want to get into the field. More concretely in terms of career steps, it's already too late to apply, but I'm running a track in MATS, which is a training program for researchers who want to get into AI safety. I have a track focused on the Learning Theoretic Agenda. Hopefully there will be another such track next year. I also have a fantasy of having an internship program, which would bring people to Israel to work with me on this. Currently, because of the war, this thing has been postponed, but hopefully, eventually things will settle down and I will revive this project. Those are currently the main ways to get involved.
Will Petillo: Thank you for that description. I wish you the best in developing this theory and gaining more interest so that the mismatch between evidence and theory starts to get corrected and the researchers know what they're doing rather than stumbling in the dark!
Vanessa Kosoy: Thank you for having me.