
Victoria Krakovna on AGI Ruin, The Sharp Left Turn And Paradigms Of AI Alignment

Victoria Krakovna is a Research Scientist at DeepMind working on AGI safety and a co-founder of the Future of Life Institute, a non-profit organization working to mitigate technological risks to humanity and increase the chances of a positive future.

In this interview we discuss three of her recent LW posts, namely DeepMind Alignment Team Opinions On AGI Ruin Arguments, Refining The Sharp Left Turn Threat Model and Paradigms of AI Alignment.

This conversation presents Victoria’s personal views and does not represent the views of DeepMind as a whole.

(Our conversation is ~2h long.)


Highlighted Quotes

(See the LessWrong and EA Forum posts for discussion)

On The Intelligence Threshold For Planning To Take Over The World

The quote below answers the question: “Do you mostly agree that the AI will have the kind of plans to disempower humanity in its training data, or does that require generalization?”

“I don’t think that the internet has a lot of particularly effective plans to disempower humanity. I think it’s not that easy to come up with a plan that actually works. I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans. I would expect there would need to be generalization from the kind of things people come up with when they’re thinking about how an AI might take over the world and something that would actually work. Maybe one analogy here is how, for example, AlphaGo had to generalize in order to come up with Move 37, which no humans had thought of before.”

“The same capabilities that give us creative and interesting solutions to problems, like Move 37, could also produce really undesirable creative solutions to problems that we wouldn’t want the AI to solve. I think that’s one argument, also on the AGI Ruin list, that I would largely agree with: it’s hard to turn off the ability to come up with undesirable creative solutions without also turning off the ability to generally solve problems that we one day want AI to solve. For example, if we want the AI to be able to cure cancer or solve various coordination problems among humans and so on, then a lot of the capabilities that would come with that could also lead to bad outcomes if the system is not aligned.” (full context)

On The Motivation For Refining The Sharp Left Turn Threat Model

The quote below explains the motivation for Refining The Sharp Left Turn Threat Model, a LessWrong post distilling the claims in the sharp left turn threat model, as described in Nate Soares’ post.

“Part of the reason that I wrote a kind of distillation of the threat model, or a summary of how we understand it, is that I think the original threat model seems a bit vague, or it wasn’t very clear exactly what claims it’s making. It sounds kind of concerning, but we weren’t sure how to interpret it. And when we were talking about it within the team, people seemed to be interpreting it differently. It just seemed useful to arrive at a more precise consensus view of what this threat model actually is and what implications it has. Because if we decide that the sharp left turn is sufficiently likely, then we would want our research to be more directed towards overcoming and dealing with the sharp left turn scenario. That implies maybe different things to focus on. That’s one thing that I was wondering about: to what extent do we agree that this is one of the most important problems to solve, and what the implications actually are in particular.”

“The first claim is that you get this rapid phase transition in capabilities, rather than, for example, very gradually improving capabilities in a way that the system is always similar to the previous version of itself. The second claim is that assuming that such a phase transition happens, our Alignment techniques will stop working. The third claim is that humans would not be able to effectively intervene on this process. For example, detecting that a sharp left turn is about to happen and stopping the training of this particular system, or maybe coordinating to develop some kind of safety standards, or just noticing warning shots and learning from them and so on. These are all kind of different ingredients for a concerning scenario there. Something that we also spent some time thinking about is what could a mechanism for a sharp left turn actually look like? What would need to happen within a system for that kind of scenario to unfold? Because that was also kind of missing from the original threat model. It was just kind of pointing to this analogy with human evolution. But it wasn’t actually clear how this would actually work for an actual Machine Learning system.” (full context)

On The Pivotal Act Appearing To Be A Very Risky And Bad Idea

“There’s this whole idea that in order to save the world, you have to perform a pivotal act, where a pivotal act is some kind of intervention where you prevent anyone in the world from launching an unaligned AGI system. I think MIRI in general believe that you can only do that by deploying your own AGI. Of course, if you are trying to deploy a system to prevent anyone else from deploying an AGI, that’s actually a pretty dangerous thing to do. That’s one thing that people, at least on our team, disagreed with the most: the whole idea that you might want to do this, not to mention that you would need to do this, because it just generally seems like a very risky and bad idea. The framing bakes in the assumption that there’s no other way to avoid unaligned AGI being deployed by other actors. This assumption relies on some of MIRI’s pessimism about being able to coordinate to slow down or develop safety standards.”

“I do feel somewhat more optimistic about cooperation in general. Especially within the West, between western AI companies, it seems possible and definitely worth trying. Global cooperation is more difficult, but that may or may not be necessary. But also, both myself and others on the team would object to the whole framing of a pivotal act, as opposed to just doing the things that increase the chances that an unaligned AGI system is not deployed. That includes cooperation. That includes continuing to work on Alignment research and making continuous progress, as opposed to focusing on this very specific scenario where some small group of actors would take some kind of unilateral action to try to stop unaligned AGI from being deployed.” (full context)

DeepMind Alignment Team Opinions On AGI Ruin Arguments

What Do We Mean By ‘AGI Ruin’

Michaël: To start this, I think it would make sense to jump to this very important blog post you published called “DeepMind Alignment Team Opinions on AGI Ruin Arguments”. “AGI Ruin Arguments” in the title refers to a post written by Eliezer Yudkowsky at the beginning of the year and it was very insightful to have the opinion of people from DeepMind on this blog post. Can you maybe explain what the post was about? Not your post, but the one from Eliezer… and also maybe who Eliezer is.

Victoria: Eliezer is a founder of the Machine Intelligence Research Institute. He’s been thinking about AI Alignment for a long time, and he wrote the “AGI Ruin” post summarizing various arguments for why he thinks that AGI is likely to cause catastrophic outcomes for humanity.

Michaël: What’s AGI for people who are not familiar with the term?

Victoria: AGI stands for Artificial General Intelligence. It’s a hypothetical, general AI system that would be able to do anything a human can do, or perform the vast majority of tasks that humans can perform. That’s distinct from the current AI systems that we have now, which are called narrow AI systems: they perform specific tasks or handle specific domains, but cannot do everything a human can do.

Michaël: In this post, Eliezer goes on to give some examples or a list of ways in which such general intelligence could cause some existential threats.

Victoria: It said catastrophic outcomes.

Michaël: What’s a catastrophic outcome?

Victoria: The idea is that if we have a powerful general AI system that can do a lot of things humans could do or that is superhuman in most domains, then the system would be able to gain control of the world. Humans would no longer be in control of the world. If this system has goals and preferences that are not aligned with what we want, then this could lead to really bad outcomes for us.

Michaël: Your post shed light on this by having people from the DeepMind Alignment team, maybe not all of them, but some sample of the Alignment researchers there, give their opinions.

The Motivation For Collecting Opinions On AGI Ruin

Michaël: What motivated you to write that post?

Victoria: When the AGI Ruin post came out, I personally found it to be a useful collection of arguments for why AGI can lead to bad outcomes or why Alignment is difficult. Many of those arguments have appeared in the past, but it was nice to have it all articulated in one place. Something I was wondering about was how should we think about this in our team, and how much do people agree or disagree with these particular arguments, and also how should this affect our overall strategy in terms of our research priorities.

Victoria: We were just having some discussions about the post and the arguments in it, and what their implications are for our work, and then it occurred to me that it’s easier to keep track of this if I collect people’s views in a survey and then have a nice spreadsheet where you can see how much agreement or disagreement there is within the team, and which kind of claims are the most controversial that maybe we would want to discuss more or figure out why we disagree, and so on.

Victoria: This started out as an exercise in understanding the distribution of views within the team and just discussing the arguments and clarifying our thinking about AI risk. Once I had already collected this information about people’s opinions, then it seemed like this might also be of interest to the wider community and not just to us. That’s how it became a post.

High Levels Of Disagreements On Some Arguments And On The Implications For AI Risk

Michaël: It was very useful for a lot of people because instead of having only one person writing a post with a bunch of arguments, we saw that there were maybe eight researchers that often agreed with the claims. Not all the time, but there were some people in the industry that also agreed with some of these claims. It gave more seriousness to the claims that were being made than just one person writing a post. Did you learn anything from writing it? Were you surprised by some of the answers?

Victoria: There were some surprising things in there. Just the level of disagreement on some of the arguments was a bit surprising. There were some things that I expected people to agree more on. Overall, maybe it wasn’t very surprising, but also just the specific reasons that people gave for agreeing or disagreeing were interesting.

Michaël: You wrote in the post that it’s not really about agreeing with the arguments. You can agree with the arguments being true because he writes in a very clear way and he tries to make things seem logically consistent. But maybe even if the argument is right, it doesn’t imply that there is a high probability of a scenario playing out. So you can agree with the arguments without agreeing that the arguments imply some huge risk at the end. There are various levels of agreeing with the claims.

Victoria: This happened relatively often because the arguments were made based on a particular framing and people would disagree with the framing or disagree with some of the assumptions. They might agree with the argument as stated, but then if they disagree with the assumptions, then they wouldn’t agree with the implications for AI risk.

On The Possibility Of Iterating On Dangerous Domains

Victoria: There was one argument, for example, where the idea is that you can’t iterate on dangerous domains where you deploy AGI. So if you are deploying a very general system in the real world for the first time, then you might not get another try if you get it wrong, because maybe the system will resist correction, and so on.

Michaël: You need to get it right on the very first try.

Victoria: Right. That’s something that people might tautologically agree with: if you define a dangerous domain as a domain where you cannot iterate, then the statement kind of becomes trivially true. But then there are some things that were kind of implied by the argument. For example, that iterating on less dangerous domains will not teach us anything for iterating on dangerous domains. That was maybe not completely explicitly stated in the argument itself, but that’s something that, for example, Paul Christiano was pointing out in his response: you wouldn’t agree with the implication that we can’t iterate at all, because we could iterate in somewhat similar settings where we can still shut down the system, and so on.

Michaël: So we might see some failure from developing somewhat general systems, but we will not be able to test Alignment techniques over a thousand runs and we will only be able to test things in the dangerous domains for the first time. Is that right?

Victoria: Do you mean as a restatement of the argument?

Michaël: Yeah. In the post, it says something like you can only deploy it once or test it once in the dangerous domains. And the restatement that works is “you won’t be able to do much learning, testing, or benchmarking on the dangerous domains.”

A Pivotal Act Seems Like A Very Risky And Bad Idea

Victoria: You can put it that way. But the kind of counterargument that we would make is that you can have similar domains that are less dangerous that you can learn from, or where you can find a lot of issues by testing. Ideally you would not want to deploy your system in dangerous domains at all. That’s another thing. There’s this implication in this argument that you will have to deploy your system in dangerous domains, because there’s this whole idea that in order to save the world, you have to perform a pivotal act, where a pivotal act is some kind of intervention where you prevent anyone in the world from launching an unaligned AGI system. Eliezer and MIRI in general believe that you can only do that by deploying your own AGI. Of course, if you are trying to deploy a system to prevent anyone else from deploying an AGI, that’s actually a pretty dangerous thing to do. That’s one thing that people, at least on our team, disagreed with the most: the whole idea that you might want to do this, not to mention that you would need to do this, because it just generally seems like a very risky and bad idea. The framing bakes in the assumption that there’s no other way to avoid unaligned AGI being deployed by other actors. This assumption relies on some of MIRI’s pessimism about being able to coordinate to slow down or develop safety standards, and so on.

Michaël: You think there will be more time to coordinate between different actors, and we won’t need some group of people to deploy an AGI in the world to prevent other AGIs from being deployed?

Victoria: I do feel somewhat more optimistic about cooperation in general. Especially within the West, between western AI companies, it seems possible and definitely worth trying. Global cooperation is more difficult, but that may or may not be necessary. But also, both myself and others on the team would object to the whole framing of a pivotal act, as opposed to just doing the things that increase the chances that an unaligned AGI system is not deployed. That includes cooperation. That includes continuing to work on Alignment research and making continuous progress, as opposed to focusing on this very specific scenario where some small group of actors would take some kind of unilateral action to try to stop unaligned AGI from being deployed. That just generally seems like a very bad idea and it would be very likely to do more harm than good. In practice, you would never be confident enough that your AGI is aligned to feel confident that using that AGI to perform “pivotal acts” would not be very harmful. In practice, you would just never do that.

Michaël: It’s too risky and the situation is not dire enough, or there’s no situation where you can justify doing this.

Victoria: But also, it’s just generally a bad idea to act unilaterally in that way, as opposed to trying to foster broader cooperation to make sure that deploying AGI goes well, because this is something that’s in the interest of everybody.

What Do We Mean By Unaligned AI

Michaël: You mentioned unaligned AGI multiple times, and we’ll go into definitions maybe later, more precisely, but just quickly: what do people mean by unaligned AGI, or just Alignment? Aligning the AGI?

Victoria: What I mean by unaligned AI is an AI system that knowingly acts against human interests. I don’t mean a system that causes harm by accident. That’s a different category that, of course, can also be bad. I mean a system that has goals that are in conflict with human goals or human interests. Broadly speaking, we could say that aligned systems are systems that kind of do what we want them to do and respond to feedback from us about what we want them to do. This is sometimes called corrigibility.

Michaël: There’s also the definition where you specify that it’s at least trying to do the thing we want it to do?

Victoria: That’s kind of what I’m pointing at, because a system doesn’t have to perfectly understand what we want in order to be aligned, it has to just be trying to do what we want. Similar to how if we’re trying to help another person, we might not know exactly what they want, but we will pay attention to what they’re telling us about what they want and update our model of what their preferences are, and so on.

Victoria: That’s how corrigibility works for humans and that’s what we would want the AI system to do as well. Even if it starts off very ignorant about what humans want, as long as it’s corrigible, it can converge to a good working understanding of what we want in order to be helpful rather than harmful.

AGI Ruin Might Be Too Pessimistic About Interpretability

Michaël: You mentioned the pivotal act as one disagreement you have. Are there any other points you disagree about?

Victoria: A general area of disagreement is about how promising interpretability work is, because I get the sense that Eliezer and maybe MIRI more broadly are pessimistic about interpretability working at all, while we think that there’s a decent chance of making progress on interpretability and having some visibility into how the model is representing its beliefs and goals internally. Of course, we still think it’s quite difficult and we need more time and more good researchers working on it. We now have a mechanistic interpretability group within the Alignment team that started about a year ago. That’s one general research priority for us.

Michaël: Is it doing work similar to what, for instance, Anthropic is doing?

Victoria: It’s related. Similar flavor to what Chris Olah’s group is doing.

Michaël: It would make sense to go on to more specific questions, because what was very useful in your post was that you were kind of summarizing the argument from Yudkowsky and you were also giving the comments that the specific researchers were making for why they agreed or disagreed with specific things.

What A World Where We Decide Not To Build AGI Would Look Like

Michaël: A lot of people commented on the statement, “The world cannot just decide not to build AGI.” And I guess everyone said it was unclear. Do you have any ideas on what a world where we decide not to build AGI at all would look like? Just international agreement to not build AGI?

Victoria: One thing I could imagine is if something goes wrong with AI on a smaller scale than a catastrophic outcome, but something that still makes a lot of people concerned, I guess some kind of warning shot. It is possible that this would shift the view of the ML community on how concerning this is. And of course, Governments would have some kind of response to warning shots as well. The way I imagine this happening is that AI systems gradually become more advanced and they’re used for more different applications, and at some point, enough things go wrong where the system acts in some kind of unpredictable way and causes harm, maybe damages some infrastructure or something like that.

Michaël: A server would be corrupted, or?

Victoria: I guess probably more than that. I mean, I don’t know if I want to go to very specific examples here, but I could imagine if you have an AI system in charge of the power grid, then it could cause a blackout or something. Or maybe things that are more serious than that. Eventually we might see these kind of issues once the AI systems are deployed more broadly. Also, there might be scenarios where an AI system breaks out of containment, or accesses some kind of resources, or accesses the internet when it’s not supposed to and then maybe hacks into some servers. There are lots of things you could imagine here, but it’s possible to shift the broader opinion towards the systems being potentially dangerous. Depending on how well AI governance work goes, cooperation to at least slow down research towards AGI is possible. I’m not a governance expert, so there are people who have more informed takes on this than I do.

Stricter Publication Norms Are Necessary

Michaël: One disagreement was how much time we can buy by doing this. And I guess, in my mind, imagine we’re in a world where we are able to have an AI that is able to turn off some power grid or hack into systems. We would then have a bunch of people that might get access to it via open source maybe a few months after the top labs have access to something like this. I’m not sure how much time we can buy.

Victoria: The publication norms would need to be more strict as well. Various AI labs might acquire more stringent publication norms, also just for competitive reasons. But also from a safety perspective, it would not make sense to publish advances towards AGI immediately and allow any bad actors or careless actors to use them by making them open source. That’s one shift I would want to see in the views of the ML community: currently people still see open source as an unequivocal good. [People think that] You should always openly distribute your technology, but people don’t always think about what the implications of this are in the limit of the power of this technology. If you actually have increasingly general AI systems that could be used in a lot of ways that lead to bad outcomes, it’s pretty irresponsible to open source advances like that, even though of course there would be a lot of benefits as well. You just have to really weigh the pros and cons and be careful about that.

Michaël: This relates to what a lot of people discuss as infohazards. When you have, let’s say, a paper about scaling that is very useful for scaling models, but can lead to faster timelines, so a date where we get very advanced AI that is much sooner than we would want. When I asked on Twitter “What question would you like to ask Victoria Krakovna?”, someone asked me “What do you think would be a good infohazard policy?” Always have a committee deciding if a paper is dangerous or not, or have some kind of agreement not to release papers that speed up timelines too much? Do you have any ideas on how to do this?

Victoria: That’s a good question. Having any kind of review of this kind at AI labs would be useful. Even some kind of light touch review where you flag whether some paper might constitute a sufficient capability advancement that might be worth reconsidering whether to publish it or how to publish it. Just having that sort of process happen at all already seems good. There’s some default way that papers get reviewed internally and ideally it would just become one of the considerations whether they should be careful about publishing the paper.

Michaël: This will highly depend on whether people are for building AGI or against it, or whether they see more risk or less risk in building AGI. In the comments about the claim about the pivotal act, one person said that it was kind of dependent on whether there’ll be a tiny group of researchers trying to build AGI and the entire world will be against it or not. You only want to have a pivotal act if the entire world is trying to build AGI, so you want people to act to shift, let’s say, the overall tendency. But if the concerns start piling up and everyone agrees that it’s a risk, you might not want to do this. Do you think we’re going to reach some kind of place where most people will see the risk? Or maybe until crunch time, until it gets very close, most people will not see the risk?

Victoria: Most people just haven’t thought about it very much, and I’m not sure whether it’s worthwhile for regular people to think about this, because there’s just a lot a person has to learn about what the Alignment concerns are, why Alignment is hard, and so on. This doesn’t seem necessarily that useful. I would want most Machine Learning researchers to see the risk, or at least to have a good intuitive understanding of why Alignment is difficult or why the ways we fix problems for narrow systems might not work for general systems.

We Already Have Examples Of Misalignment In Less Intelligent Systems

Michaël: You mentioned narrow systems and general systems. One claim is that some problems only occur above a certain intelligence threshold. It’s claim number 13 in Yudkowsky’s post, and I guess some people in the Alignment team were saying that they expect some problems, like specification gaming, to also arise in less intelligent systems. Do you agree with this?

Victoria: Do you mean whether I agree with the general argument?

Michaël: If you think we would see them in less intelligent systems or not.

Victoria: We’re already seeing some examples of goal misgeneralization, although these are somewhat less concerning examples, but then they can still teach us how the phenomenon works.

Goal Misgeneralization In CoinRun

Victoria: Goal Misgeneralization is the idea that you have a system that’s trained with a certain objective function, but then it only observes that objective function on the training distribution. Then there are different ways to extrapolate that outside the training distribution. It might not learn the intended goal from the training data, but some other goal that coincides with the given objective on the training data.

Michaël: What do you mean by goal here?

Victoria: Oh. I mean, we still don’t have a very good definition of what goals are, but we mean goals in a behavioral sense where a system learns to take actions that kind of bring the environment into some particular target states or target configurations. One example of goal misgeneralization is this example that came up in the game CoinRun where you have this agent… it’s trained to go through the level and avoid enemies and obstacles and then reach the end, and at the end there’s a coin and it collects the coin. That’s the reward. Here, you could see there are different possible goals. You could say reaching the end is the goal or finding the coin is the goal, and then it might learn to pursue either of these based on how it’s trained. When they tested it outside the training distribution by putting the coin somewhere else, it just ignored the coin and kept going. It seemed like it clearly had not learned that the goal is getting the coin.

Michaël: Right.

Victoria: It was behaving in a way where it was still going to the end of the level. It seemed like that was the goal of the system. Of course, we can’t look inside the system and point at some components, like oh, this is representing the goal. So we don’t know how it’s representing goals. But we can think about the behavior of the system in new situations, and whether it’s competently acting towards an undesired outcome.
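
To make this train/test shift concrete, here is a minimal toy sketch in the spirit of the CoinRun example Victoria describes. It is an invented stand-in, not the actual procgen CoinRun experiment: the level, policies, and reward below are made up for illustration only.

```python
# Toy sketch of goal misgeneralization (illustrative only, not the real CoinRun setup).
# During training the coin always sits at the end of the level, so a policy that
# "goes to the end" and a policy that "goes to the coin" get identical reward.
# Shifting the coin at test time reveals which goal was actually learned.
import random

WIDTH = 10  # level length; x == WIDTH - 1 is the end wall

def run_episode(policy, coin):
    """Reward 1 if the agent steps on the coin; the episode ends at the end wall."""
    agent = (0, 0)
    for _ in range(4 * WIDTH):
        dx, dy = policy(agent, coin)
        agent = (max(0, min(WIDTH - 1, agent[0] + dx)),
                 max(0, min(1, agent[1] + dy)))
        if agent == coin:
            return 1.0   # collected the coin (intended goal)
        if agent[0] == WIDTH - 1:
            return 0.0   # reached the end of the level without the coin
    return 0.0

def go_to_end(agent, coin):
    """Proxy goal the agent may have learned: just run to the end of the level."""
    return (1, 0)

def go_to_coin(agent, coin):
    """Intended goal: navigate to wherever the coin is."""
    dy = (coin[1] > agent[1]) - (coin[1] < agent[1])
    dx = (coin[0] > agent[0]) - (coin[0] < agent[0])
    return (0, dy) if dy != 0 else (dx, 0)

# Training distribution: coin always at the end. Test distribution: coin anywhere.
train_coins = [(WIDTH - 1, 0)] * 200
test_coins = [(random.randrange(WIDTH), random.randrange(2)) for _ in range(200)]

for name, policy in [("go_to_end (proxy)", go_to_end), ("go_to_coin (intended)", go_to_coin)]:
    train_r = sum(run_episode(policy, c) for c in train_coins) / len(train_coins)
    test_r = sum(run_episode(policy, c) for c in test_coins) / len(test_coins)
    print(f"{name:22s} train reward {train_r:.2f}, test reward {test_r:.2f}")
```

On the training distribution the two policies are indistinguishable by reward, so training could select either; only the shifted test distribution reveals which goal was actually learned, and the proxy policy keeps acting competently while mostly ignoring the coin.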

Michaël: You can see if the system is predicting high reward when it sees the end of the level with a wall at the end, or if it predicts high reward when it sees the coin. You can look at the activations of the neurons whenever it looks at the coin or whenever it looks at the end of the level and see whatever. Maybe it’s kind of hard because-

What Goal Misgeneralization Would Imply For Advanced Systems

Victoria: It’s a bit difficult to do. The idea with goal misgeneralization is that you have generalizing capabilities, but the goals of the system generalize in a different way than we would like them to. For example, when the CoinRun agent was deployed in this new environment, it could still avoid obstacles and enemies and move through the level, so it was acting competently. It wasn’t just kind of flailing and doing random things. It was competently acting towards an objective that was not what we had intended, so it was kind of deliberately going to the end of the level instead of collecting the coin as we would want.

Michaël: It’s different from just something that doesn’t generalize in Computer Vision where the thing just doesn’t work at all. It’s like… it’s competent. It’s both good and does the wrong thing, so it could be dangerous for AIs that are put in the real world. If they’re good, but doing the wrong thing.

Victoria: That’s the idea, that you can have particularly bad outcomes for the intended goal if the system is competently doing something else rather than just acting randomly. You were asking about whether some of these problems will occur before the kind of superhuman intelligence threshold.

Michaël: Right.

Victoria: Yes. This might also depend on the problem, but at least for the kind of problems that we can foresee happening for highly intelligent systems, we can try to generate examples for current systems of analogous problems with goal misgeneralization and so on. Then we can learn how these problems play out and then try to learn, for example, how to detect these kind of problems in advance and what kind of mitigations we can come up with for current systems. Some of these things might not carry over, but I don’t see a reason to be very confident that they won’t carry over. It just seems like the thing that we’d want to do.

People In ML Are Biased Towards Optimism

Michaël: This relates to one claim about optimism, where Yudkowsky is often seen as more pessimistic than other people. He claims that we will get a bias towards optimism until we get some strategic failure. My impression so far is that you’re maybe a bit more optimistic than Eliezer.

Victoria: Well, yeah. I’m certainly more optimistic than Eliezer.

Michaël: Laughs

Victoria: Yeah. That’s for sure.

Michaël: Are you an optimist about Alignment or are you optimistic about the future?

Victoria: Relative to most Alignment researchers, I’m probably a bit more optimistic than average. I’m not exactly sure what the distribution is. But in that claim, Eliezer was talking particularly about maybe the Machine Learning community. Not necessarily people working on alignment, but just people who hear about this problem, and have cached thoughts about how they would solve the problem. He was talking about how if society tries to deal with this problem as a whole, then it would have a bias towards optimism. In particular, if people who are building AGI are thinking about how things could go wrong, then they have a bias towards being able to fix things because they’ve always been able to fix problems in the past.

Michaël: Right.

Victoria: I’m not sure if he’s talking specifically about Alignment people. I think alignment people have thought about the risks and how things don’t work a lot more [than Machine Learning people].

What Victoria Guesses The AI Risk Actually Is

Michaël: What people are interested in is a more precise evaluation of how you approach the future. This is not a very pleasant thing to think about, but what would you say is the probability of Victoria Krakovna dying from AI before 2100?

Victoria: I mean, 2100 is very far away, especially given how quickly the technology’s developing right now. I mean, off the top of my head, I would say like 20% or something.

Michaël: That’s pretty optimistic. It means 80% of the time we have some very advanced AI and you’re still alive.

Victoria: Either that or advanced AI doesn’t happen for some reason, or maybe we cooperate on slowing it down, or maybe we build AI that’s advanced enough to help us solve the Alignment problem while it’s not advanced enough to cause a lot of issues and bad outcomes. There’s a lot of uncertainty about the future. It doesn’t seem reasonable to me to be very confident in optimism or very confident in pessimism.

Michaël: I feel like 20% is kind of confident in optimism though.

Victoria: I mean, I would say a 20% probability of doom is still plenty enough reason to work on the problem. There’s this wide range of probabilities of bad outcomes where you would end up acting in much the same way, if the probability’s high enough that you want to allocate a bunch of time to doing the research and trying to solve governance challenges and ensure cooperation and so on. If anything, I’ve seen people who are really pessimistic start giving up, and we’ve seen some of that happen with MIRI. At some point, they just seem to not work on the problem so much anymore because they feel hopeless, and that just seems kind of counterproductive.

Michaël: Right, so it’s instrumentally useful to be optimistic.

Victoria: I mean, part of it’s instrumentally useful, but also we just don’t know enough to take this extremely pessimistic stance that it’s almost impossible to avoid bad outcomes. There’s just so many ways that the future could go that we have no idea about.

Michaël: We can also add a disclaimer that this is a number that should be a distribution over, because …

Victoria: I’m not really that confident in this number at all. It’s most problematic if some people think there’s a 0% probability or something very low because they’re super confident that technology can’t go wrong, or that maybe we will never build AGI, or that AGI doesn’t make sense as a concept and therefore there’s nothing to worry about. I’m not sure. A lot of these kinds of objections don’t really make sense, but we just can’t be that confident in either direction.

Refining The Sharp Left Turn Threat Model

The Sharp Left Turn Is The Hard Problem Of Alignment

Michaël: This relates to how confident we are that the problems we’re working on right now are solving things in alignment, so are addressing the most lethal problems, as Eliezer says, or if we’re kind of working on things that are kind of relevant for the future. Do you think we made progress on the more lethal problems or the more important problems in alignment, or are we working on less important things?

Victoria: This is another argument on the list that bakes in some assumptions, in particular about what the most important problems are or which problems are deal-breakers if they’re not solved. I think there’s some disagreement on this between MIRI people and the rest of the community as well. This also connects to the sharp left turn threat model that we’ll be talking about later in this conversation. This is the main threat model that MIRI has put forward, so that’s their claim about what they think the most difficult or most important problem is: that there’s likely to be this rapid phase transition in capabilities, and the Alignment techniques, as they think, are unlikely to hold up during that transition. They think that this is the main problem to solve, and if we don’t solve it, everything else is useless. That’s what they’re pointing at. But a lot of people in the Alignment community might not agree that we have to solve that problem or that this sharp left turn idea is necessarily that likely. They might be solving other problems, like how to solve the interpretability problem: how do we give ourselves the ability to inspect the system’s beliefs and goals, and so on. That’s also a hard problem, but maybe a somewhat better defined hard problem. We are making some progress on scalable oversight and interpretability, and maybe some foundational progress on understanding agency and so on. It’s unclear whether we’ll make enough progress on this before we actually have to apply these techniques.

Michaël: Scalable oversight is where we have higher and higher levels of oversight, so smarter agents capable of overseeing the other models that are being trained. Maybe quickly define the sharp left turn. I don’t think you’ve defined what the sharp left turn is.

Victoria: I just said that it’s the idea that there might be a rapid phase transition in the system’s capabilities where a lot of the capabilities of the system, planning and modeling the world and so on, become a lot better, and that if there is such a phase transition, then Alignment techniques that worked before the transition don’t necessarily continue to work, because the system starts thinking about the world very differently and might acquire different objectives. There are various reasons why that might not work. There’s this analogy that they make with human evolution where at some point we kind of rapidly increased our capabilities. We invented agriculture and writing and got much better at planning and pursuing our goals and acquired various concepts, a scientific understanding of the world, and so on. Now we are less aligned with the kind of goals that evolution built in for us than we were before. That’s an analogy with humans that is being extrapolated here for AI systems. But there are various ways in which AI is different from humans and AI designers are different from evolution. That’s one of the problems with thinking about this idea of the sharp left turn, that we don’t truly have examples. Depending on where your intuitions come from, you can come to very different conclusions about how likely it is and how problematic it is and so on.

Why Refine The Sharp Left Turn Threat Model

Michaël: You tried to clarify all the assumptions made in the sharp left turn. I think it’s a good transition to move into your other post, which is Refining the Sharp Left Turn Threat Model. There was a post by Nate Soares which you were responding to. This was maybe the first time someone defined the sharp left turn, and you wrote the post a few months later, trying to put all the claims together and define all the implicit assumptions.

Victoria: Part of the reason that I wrote a kind of distillation of the threat model, or a summary of how we understand it, is that I think the original threat model seems a bit vague, or it wasn’t very clear exactly what claims it’s making. It sounds kind of concerning, but we weren’t sure how to interpret it. And when we were talking about it within the team, people seemed to be interpreting it differently. It just seemed useful to arrive at a more precise consensus view of what this threat model actually is and what implications it has. Because if we decide that the sharp left turn is sufficiently likely, then we would want our research to be more directed towards overcoming and dealing with the sharp left turn scenario. That implies maybe different things to focus on. That’s one thing that I was wondering about: to what extent do we agree that this is one of the most important problems to solve, and what the implications actually are in particular.

Breaking Down The Sharp Left Turn Argument Into Three Claims

Victoria: I think we were not very clear in the beginning about what exactly the claims are that are made by the threat model. What we ended up with, after discussing it among ourselves, is breaking it down into three claims. The first claim is that a rapid phase transition will happen at some point, where an AI system that is becoming increasingly general undergoes some kind of transition in capabilities where it becomes a lot more capable, maybe getting better at a lot of different domains at once. Maybe it just acquires some kind of better planning ability. That would have a lot of implications for different tasks, and then it would just become a qualitatively different kind of system.

Victoria: The first claim is that you get this rapid phase transition in capabilities, rather than, for example, very gradually improving capabilities in a way that the system is always similar to the previous version of itself. That’s the first claim. The second claim is that assuming that such a phase transition happens, our Alignment techniques will stop working. The third claim is that humans would not be able to effectively intervene on this process. For example, detecting that a sharp left turn is about to happen and stopping the training of this particular system, or maybe coordinating to develop some kind of safety standards, or just noticing warning shots and learning from them and so on. These are all kind of different ingredients for a concerning scenario there. Something that we also spent some time thinking about is what could a mechanism for a sharp left turn actually look like? What would need to happen within a system for that kind of scenario to unfold? Because that was also kind of missing from the original threat model. It was just kind of pointing to this analogy with human evolution. But it wasn’t actually clear how this would actually work for an actual Machine Learning system.

Michaël: Right, so instead of thinking about it as an agent or a black box, you try to see how it would be implemented with standard learning techniques and what it would mean for generalization. I think what you said about the first claim is that the system is able to become very good at many domains at the same time, which would mean that it is capable of generalizing to many domains. I think that’s the first claim. The second claim is that the Alignment of our systems won’t generalize as well as the capabilities, and the third one is that we won’t be able to stop it. Is that basically right?

Victoria: I would phrase the second claim more as: the Alignment techniques don’t generalize while the capabilities generalize, because sometimes people also make this comparative statement that capabilities will generalize more than Alignment does. I’m not sure that really makes sense to me, because we want AI Alignment to generalize well enough to still lead to a system that acts in a human-compatible way. I don’t think it has to generalize as well as capabilities do. But also, I’m not sure what that means. It seems to be comparing apples to oranges a bit. I think the question is “do the Alignment methods generalize well enough when capabilities become much higher?”

Michaël: Then I think there are two smaller claims, but they’re optional. I don’t know if you want to go into that much detail.

Victoria: Probably not.

A ‘Move 37’ Might Disempower Humanity

Michaël: One thing about generalization is that the dangerous part is if the model is able to generalize well enough to disempower all of humanity or lead to catastrophic outcomes. The question is whether, if the model is able to generalize well enough, it will be able to find those plans to take control, even if it’s not being trained on them explicitly. But I guess one disagreement I had when reading your post was that there are plenty of scenarios on the internet where people describe taking over the world. If we train language models on the internet, maybe the model will have this in its training data. Do you mostly agree that it will have the kind of plans to disempower humanity in its training data, or does that require generalization?

Victoria: I don’t think that the internet has a lot of particularly effective plans to disempower humanity. I think it’s not that easy to come up with a plan that actually works. I think coming up with a plan that gets past the defenses of human society requires thinking differently from humans. I would expect there would need to be generalization from the kind of things people come up with when they’re thinking about how an AI might take over the world and something that would actually work. Maybe one analogy here is how, for example, AlphaGo had to generalize in order to come up with Move 37, which no humans had thought of before. That allowed it to defeat a grandmaster in Go. If it was just following the same kind of strategies that humans already know about, it probably wouldn’t have won the game, or it would have been much less likely to. I think you would need to come up with Move 37 type ideas to actually effectively be able to gain control of human society, of the infrastructure of our civilization and so on, maybe come up with novel ways to hack into data centers or whatever.

Michaël: It’s kind of sad to see the Move 37 as a way of disempowering humanity.

Victoria: I would say there’s an important point there, where the same capabilities that give us creative and interesting solutions to problems, like Move 37, could also produce really undesirable creative solutions to problems that we wouldn’t want the AI to solve. I think that’s one argument, also on the AGI Ruin list, that I would largely agree with: it’s hard to turn off the ability to come up with undesirable creative solutions without also turning off the ability to generally solve problems that we one day want AI to solve. For example, if we want the AI to be able to cure cancer or solve various coordination problems among humans and so on, then a lot of the capabilities that would come with that could also lead to bad outcomes if the system is not aligned. Sometimes I like to make an analogy between Move 37 and various specification gaming examples, like the AI exploiting a bug in the game or generally finding ways to exploit the reward function. From the AI’s perspective, the same creativity and general search process that leads to Move 37 leads to these bad solutions as well.

Michaël: Right. For it, Move 37 is just another action on the board, not something surprising. It’s just that for us humans, we don’t see going in circles in a boat race game as something very normal. But for the AI, it’s just another action. Just about Move 37: as a human who was playing Go at the time, I was rooting for Lee Sedol, and so I was not happy. I thought the move was beautiful, but I was not happy when everyone said that the AI would win, because I was rooting for the human at the time. For some people, but not at DeepMind, it was a terrible move.
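
The specification gaming point can be made concrete with a small made-up example, only loosely inspired by the boat-race anecdote above and not taken from the actual game: the designer intends “finish the race” but writes down “points per target hit”, and because targets respawn, looping on them out-scores finishing.

```python
# Toy illustration of a misspecified reward (invented numbers, not the real boat-race game).
# The written-down reward is "points per target hit", intended as a proxy for
# "finish the race"; since targets respawn, circling a cluster of targets forever
# scores higher than actually finishing, so an optimizer "creatively" picks the loop.
FINISH_BONUS = 10.0      # reward for crossing the finish line
TARGET_REWARD = 1.0      # proxy reward per target hit
EPISODE_STEPS = 100      # fixed episode length

def finish_the_race():
    """Intended behaviour: drive to the finish, hitting a few targets on the way."""
    targets_hit_on_the_way = 5
    return FINISH_BONUS + TARGET_REWARD * targets_hit_on_the_way

def loop_on_respawning_targets():
    """Gamed behaviour: circle a cluster of respawning targets and never finish."""
    targets_hit_per_lap, steps_per_lap = 3, 10
    laps = EPISODE_STEPS // steps_per_lap
    return TARGET_REWARD * targets_hit_per_lap * laps

print("finish the race:", finish_the_race())               # 15.0
print("loop on targets:", loop_on_respawning_targets())    # 30.0, the gamed policy wins
```

The specific numbers are invented; the point is that whichever behaviour scores highest under the written-down reward is the one a capable optimizer will find, whether or not it matches the designer’s intent.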

We Might Need Qualitatively Different Alignment Techniques

Michaël: To move on to the other claims, about Alignment maybe not generalizing the way we want it to generalize compared to capabilities. You mentioned in the post that we will need some qualitatively different kind of Alignment techniques for it to work, or that it’s not just about quantity, but also about quality. Do you want to explain a little bit more what you meant, if that’s fine?

Victoria: I mean, that’s just a restatement of the claim that we think has been made by the threat model. I think the claim was that the kind of Alignment techniques we have now are not going to work, so we just need to come up with very different Alignment techniques from the current research agendas. I don’t necessarily agree with that, but that seemed to be part of what was claimed by the threat model.

Michaël: You don’t think we need some very qualitatively different techniques?

Victoria: I mean, it’s plausible. I’m not sure. I’m not sure whether it’s necessary. In principle, it’s plausible that, for example, the mechanistic interpretability agenda that focuses on identifying what different components of a neural network are doing will be enough. I think it’s possible that if this goes far enough, then we’ll have a good enough understanding of the system at a higher level that would be useful for aligning systems or for detecting when they have undesirable objectives and so on. I mean, it’s also possible that it won’t be sufficient. I don’t see fundamental problems with current research agendas. I did read the follow-up post by Nate Soares where he was trying to point out what he thought were reasons that those research agendas won’t work, but I think it didn’t cover all the research agendas that the Alignment community is pursuing. Also, I think some of those objections were more on an intuitive level, just saying that interpretability is really hard and we probably won’t solve it, rather than very specific objections. I don’t see fundamental obstacles to current Alignment techniques working. It does seem like there are a lot of hard problems to solve. I think it’s more likely that we will just run out of time rather than that the current paradigms definitely won’t generalize.

How Many Years Do We Have Before Systems Become Transformative?

Michaël: I think this is kind of related to how fast you think things will go and how fast the take-off will be. I guess that matters if you are thinking that we don’t have a lot of time and that the agendas will not work within the time constraints. It’s also related to the other things we said before about how many warning shots we will have and whether people will be able to cooperate. I think everything is kind of related.

Victoria: I think we should distinguish between how much time we have until advanced general systems are developed versus how quickly they would develop: the difference between timelines and take-off speed. I do have relatively short timelines, but I don’t expect a super fast take-off. I’m kind of more on the slow take-off side of things. I think short timelines could still lead to there not being enough time.

Michaël: What do you mean by short timelines?

Victoria: I guess just the amount of time until the AGI has been built.

Michaël: I mean, in terms of length.

Victoria: Oh, just number of years? I don’t know.

Michaël: The median number.

Victoria: 10 or 15 years, maybe. But as usual, we have wide error bars here, it’s also possible that scaling laws will stop for some reason and then we might have a lot more time than we thought.

Michaël: Right.

Victoria: Various other factors.

Michaël: So you see the current Alignment agendas leading to good outcomes in this timeframe, or making enough progress in this timeframe to be confident that we have systems that are safe?

Victoria: I think it’s possible. I think it depends on how many good researchers we have working on these agendas, and also whether we can get assistance from AI systems in doing Alignment research. There are some people who are working on how to make language models useful for Alignment research, for example. I think it’s quite plausible that there would be a few years where we have AI systems that are useful for Alignment research, but before we have full AGI that can be potentially quite dangerous. In that window of time, we could make a lot of progress on these research agendas and iterate on systems that are similar to AGI but not AGI as well.

Michaël: You mean transformative?

Victoria: Yeah, transformative AI systems. There might be various ways to buy time and increase that time period where we can potentially make a lot of progress in Alignment by getting some help from somewhat advanced but not quite AGI systems.

Michaël: Isn’t there a problem if we use AI to align our AIs, and we still haven’t figured out how to align the first ones?

Victoria: I think that the Alignment requirements for the systems that you’re using to help with Alignment research would be less strict. I’m just talking about using GPT-3 or GPT-4 to help come up with critical Alignment ideas, with different ways of thinking about the problems, and so on. I think at that level we can still verify whether the system is being helpful.

We Could Find An Aligned Model Before A Sharp Left Turn

Michaël: It’s not advanced enough to produce deceptive plans over very long time horizons. The last claim we haven’t talked about is the third one, about how we would prevent an AI from having a deceptive sharp left turn.

Victoria: You mean just preventing a sharp left turn in general?

Michaël: No. I think preventing bad outcomes, or whether humans will be able to intervene on the transition. I think this is roughly the third claim.

Victoria: I guess also, I think we haven’t talked about some things about the second claim. Or, I don’t know if you want to go into that first. In the follow-up post that I recently wrote, I was looking at what I would see as our current default plan for dealing with a sharp left turn. My understanding of that is that ideally we would want to find a system that’s sufficiently aligned before a sharp left turn, so that it would try to preserve its goals during that capability transition, as opposed to trying to do Alignment during a sharp left turn, which seems a lot more difficult. My general impression is that the default plan is not to try to align a system during the sharp left turn, but to try to align it enough beforehand that even in the sharp left turn situation the system is trying to keep itself aligned. And maybe human designers are trying to help it avoid various pitfalls, so then you’re working together with the AI system to preserve its goals. And I think for this, the system doesn’t have to already be an aligned superintelligence or something.

Human Analogy For The Sharp Left Turn

Victoria: I think we can have a system that is weakly goal-directed but has the capacity to reason about its own goals, and that can still be easier than doing long-range planning to achieve those goals. I think one analogy here is, once again, what happened to humans when humans went through a sharp left turn.

Michaël: Wait, humans went through a sharp left turn?

Victoria: Well, this is the whole analogy where, of course, the human sharp left turn happened on the… I mean, it’s fast on an evolutionary time scale, while of course for humans it felt very slow, because we were slowly accumulating different skills and abilities through cultural evolution. Let’s say you take a caveman whose descendants are going to learn to do agriculture and writing and so on. Then I think the values of the caveman got passed down in a relatively effective way. Now that we’ve acquired all these skills and concepts, we still value a lot of the same things, like human relationships and so on. I think if you take the caveman as an analogy to a weakly goal-directed system, someone who doesn’t make very long-range plans but has some preferences or is trying to get their environment into certain configurations, at least in particular situations where they know what to do, then if that entity acquires lots of capabilities they’ll still try to preserve what they valued before. I think at least based on the human analogy, we can expect that a system might try to preserve its previous preferences. So that’s the overall argument.

How To Find That Aligned System Before A Sharp Left Turn

Victoria: And my second post focuses on finding this weakly aligned system before a sharp left turn, and then helping the system get through a sharp left turn. Then we don’t have to worry about trying to align a system with bad goals during a sharp left turn; ideally we just don’t want to go there at all.

Michaël: So ideally, if the model is able to preserve its own goals, then we won’t have to force the model to preserve them, it will do it for us. And we just have to insert the right goal at the beginning.

Victoria: Yeah. Or find a system with a sufficiently beneficial goal at the beginning. Finding the system might be difficult. That’s where we need a lot of interpretability tools, and we might need to restart the search a bunch of times if we detect signs of deceptive Alignment and so on, which might be quite hard to find in that system. But my impression is that what current Alignment techniques are trying to do is mostly find that aligned system in the first place. That’s also something that Paul Christiano was talking about in his comment on that post where Nate was critiquing various Alignment approaches and how they don’t address the sharp left turn enough. Paul was making this argument that if you have your weakly aligned system, then it would have the incentive to preserve its own goals, and then you’re on the same side, as opposed to being in an adversarial relationship with the system. And then he described different possible mechanisms for the sharp left turn and how you might help the system keep itself aligned. That does seem like a more promising path than trying to align a system during the sharp left turn. If you’ve done a good enough job with finding an aligned system in the first place, then maybe your interpretability tools don’t have to keep reliably working during a sharp left turn, because you can expect that your system will generalize correctly. So even if we don’t have that insight into its goals anymore, then you might still be fine.

Michaël: Even if the AI becomes much smarter than it was before, and you don’t have all the interpretability tools to see exactly what it’s doing or thinking, if it’s preserving its goal you can be fairly assured that it’s probably going to be beneficial, if it was beneficial at the beginning. I think in the comments by Paul Christiano, or in other parts of the post you’re mentioning, there’s something about internal search. There are the goals that we can decide, let’s say the outer goals, the outer rewards and those kinds of things, and then the model might have an internal goal that is different. There was something about it being harder to change the internal goals inside the system than the goals outside the system, like the reward function we give it. Do you think about those kinds of scenarios? Do you think there are going to be different goals inside the system? Or do you think roughly there’s going to be one goal we need to align to?

Victoria: I think there could be different goals inside the system. I guess I’m not sure what exactly the question is. I mean, it’s possible. That’s the whole goal misgeneralization idea, where the system might represent some other goal than what we trained it for. And then we would need some way to elicit that, with different Alignment techniques: either interpretability tools, or a possible solution to eliciting latent knowledge, and so on. I guess what you were talking about there is one mechanism for a sharp left turn, where the system could learn to run its own internal search that’s faster than the outer optimization loop and then could make itself much better at planning. And maybe that’s something that would be hard to align from the outside. But the idea with the goal-aligned model is that if you build such a model, it can also predict how running this internal search could change its preferences, so it would also want to avoid that happening. Just as we are trying to be careful about aligning AI systems, which we could see as humans running an internal search, similarly, if the system is running some kind of internal search, it would try to make sure that search has beneficial outcomes according to its current goals. Then I guess we would want that reasoning to be explicit enough that we can communicate with the AI system about it. So ideally we would find an AI system in the first place that would want to preserve its goals in that way.

Michaël: To the extent that the AI understands its goals and wants to preserve them, if it does internal search or has other types of goals, it will still find a way to maximize the chance of its goals being preserved over time.

Victoria: Right. Of course, that could still fail in some way; maybe the system doesn’t foresee how the internal search might change its goals over some long time horizon or something. And then we would still want to help with that. There are various ways that I think this idea could fail, but overall it still seems relatively promising.

Michaël: Seems worth trying.

Simple Factors That Could Lead To A Sharp Left Turn

Michaël: One thing I found interesting was how you found examples of how we could get a sharp left turn. The AI becoming smarter by extending its memory, for instance the way humans write things down: it could just write in Google Docs or store memory externally. Or it could find better optimization. Do you have any other examples of how it could get smarter? Or is getting better memory the main way you think about it?

Victoria: I think getting better memory is a pretty central example. Or coming up with a better planning algorithm and then using that repeatedly to come up with better planning algorithms. Or designing better prompts for itself, for example.

Michaël: “Thinking step by step”.

Victoria: Yeah. And sometimes that could happen in a dialogue with humans, potentially, since humans might try to use the AI to come up with better prompts as well. But there are different ways for that to play out.

Michaël: And in that case, even if we give it the same compute or the same amount of wall-clock time, it will have more subjective time to come up with better plans if it’s capable of processing more information per FLOP, or those kinds of things.

Victoria: Right. Or if it comes up with a more efficient search algorithm, then it can do a better search for the same amount of compute.

Goal Directedness May Precede Situational Awareness

Michaël: I think one key term in your post that we haven’t talked much about is situational awareness. Is situational awareness basically being aware of your own goals?

Victoria: I think more broadly, situational awareness is just the AI system knowing that it is an AI system that is being trained to perform certain tasks, and that it’s situated in a broader world with a lot of humans in it. It’s generally understanding its place in the world, and that would also include what its objectives are. For example, if it’s aware that it’s in a training loop, then it knows the objective that it was trained for, and how that may be different from what it has internalized, and so on. It’s a mix of self-awareness and also having some kind of model of the world and its place in the world. For example, if you go back to the AlphaGo example, it does not have situational awareness, because all it knows about is the Go board; it doesn’t know that there’s a whole world outside the Go board. Whereas we would expect that when we do eventually build a general AI system, it would have situational awareness. The current language models show some level of situational awareness, but it’s not clear how reliable that understanding is, versus whether it’s just something that they’ve learned to say: that I am an AI system, I’ve been trained to do X, Y, Z.

Victoria: I guess that’s something we also want to get better at detecting in our AI systems, because it’s an important property both for Alignment and for misalignment. If the system develops situational awareness when it has beneficial goals, then the situational awareness is helpful for preserving those goals, and for acting in a cooperative way with humans. But if it acquires situational awareness when it has undesirable goals, then the system is more likely to become deceptively aligned, because if it has goals that are incompatible with our interests and then acquires this understanding, “Oh, I’m being trained for such and such purpose, which is actually not what I really want”, then, speaking from the AI system’s perspective, it would have an incentive to hide its actual motivations from the designers, or play along with the training game.

Victoria: That’s a term that Ajeya was using in her threat model, and it’s a good way to think about it. The system would want to play along with whatever tasks you set for it, so that we would continue training the system and then at some point deploy it in a setting where maybe it would have less oversight. What I’m saying is that situational awareness can be a prerequisite for deceptive alignment, but it can also be a prerequisite for the model becoming more aligned. So really we want to be able to tell when the model acquires this ability, and whether it acquires that ability under the right conditions.

An Algorithm To Find Aligned Situationally Aware Models

Michaël: Ideally, we would want some kind of algorithm where we try to pick the best model. I don’t know exactly where we start, but we try to avoid models that are situationally aware, or that preserve their own goals and are not aligned, and we try to find models that preserve their own goals and are aligned. So in short, how do we start? How do we pick the model that we want, one that is both aligned and preserves its own goals? I think your post gives some kind of algorithm for how to actually find the correct one.

Victoria: I think as we search for a goal-aligned model, there are certain criteria for stopping and restarting the search. If we found a model that is deceptively aligned, or shows some signs of concerning behavior, you don’t want to continue training that model. Or if we find a model that has situational awareness but doesn’t have any particular goals, then in that case we also probably want to restart the search, because you don’t have as much control over which goals the system acquires first. And once it becomes more goal-directed and it already has situational awareness, it might become deceptively aligned.

I guess the way I’m currently thinking about this, which I’m not super confident in, is that you would want the system to first gain some level of goal-directedness towards goals that are desirable for us, and then acquire situational awareness. It matters in what order these things happen. Of course, we would need some ability to tell when a model is goal-directed and so on; we would need some kind of test or some kind of interpretability techniques. Being able to do this relies on more advanced Alignment techniques than we currently have, but I think that could come out of the current Alignment research agendas.
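To make the stopping-and-restarting criteria above concrete, here is a minimal, hypothetical Python sketch of such a search loop. The predicates `shows_deception`, `is_situationally_aware`, and `has_desirable_goals` are placeholders for interpretability and behavioral tests that do not yet exist in reliable form; this illustrates the idea, not an actual procedure.

```python
# Hypothetical sketch of a "search with restart criteria" loop.
# The three check functions are placeholders for future interpretability
# and behavioral tests, not existing tools.

def shows_deception(model) -> bool: ...          # placeholder check
def is_situationally_aware(model) -> bool: ...   # placeholder check
def has_desirable_goals(model) -> bool: ...      # placeholder check

def train_interval(model, data):
    """One interval of training between safety checks (e.g. some number of SGD steps)."""
    ...

def search_for_goal_aligned_model(new_model, data_intervals, max_restarts=10):
    for _ in range(max_restarts):
        model = new_model()                      # fresh initialization / restart
        for data in data_intervals():
            train_interval(model, data)

            # Restart if we see early warning signs of deceptive alignment.
            if shows_deception(model):
                break

            # Restart if situational awareness arrives before desirable
            # goal-directedness: past that point we have less control over
            # which goals the system ends up with.
            if is_situationally_aware(model) and not has_desirable_goals(model):
                break
        else:
            return model                         # no restart criterion was tripped
    return None                                  # restart budget exhausted
```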

Michaël: Isn’t there a risk of running into a model that is deceptively aligned and also situationally aware, or capable of preserving its own goals, so that the moment we test and run it, it’s already dangerous?

Can A Model Become Unaligned With One Step Of SGD?

Victoria: I think partly this approach relies on there being some early warning signs of deceptive alignment. Where you don’t just go, in one step of SGD, from a model that is not goal-directed, or just not that competent, to a model that is already deceptively aligned. That just doesn’t seem as likely to happen. I would expect deception to develop more gradually.

Michaël: I feel like one step of SGD is a straw man because we-

Victoria: A little bit, yeah.

Michaël: We don’t test our models every step of SGD.

Victoria: But I mean, if you replace one step of SGD with the interval between tests, usually the models are still fairly similar. I also would expect that the system would need to experiment with different, weaker forms of deception before it becomes really good at it. Of course, there are always some bad properties that you don’t catch, but I would expect that there are some early signs of the model being misaligned that you’d be able to pick up on. Also, if you’re using both interpretability techniques and behavioral tests on the model to detect undesirable goals and situational awareness, the model would need to fool all of them at once, and I think it’s not that likely that the model would be able to do that. This also depends on not expecting a super fast takeoff. Of course, if you have a super fast takeoff, then I think this would be more likely to happen.

Michaël: For people who are not familiar, by takeoff you mean going from general AI to self-improving AI, or something that is able to disempower humanity?

Victoria: Sorry, maybe I don’t need to talk about takeoff here. I just mean a system that becomes much more capable very quickly. So if you’re training a system and it goes from being not particularly goal-directed and pretty bad at deception to, boom, a few batches later it’s superhuman at deception or something, and it can avoid both your interpretability tools and your behavioral tests. Of course, that phase transition could be harder to deal with. But there are other Alignment techniques, for example trying to predict phase transitions in capabilities in advance.

Victoria: So if you think it’s likely enough that the model is about to undergo this kind of phase transition, you can also stop training early and try to find some other way. I think the more likely way for this to fail is that the people working on AI capabilities will run out of patience for doing all these tests and all this restarting of the search, or that people won’t want quite this much overhead. I think the amount of overhead is the bigger obstacle, but maybe we can bring that down as we keep working on this.

Michaël: I feel like if we ask people to run a bunch of Alignment or safety benchmarks every batch or something, it’s too much of a requirement.

How A Situationally Aware Model Might Not Preserve Its Own Goals

Michaël: In your follow-up post, you mentioned two steps. The first step is finding the right model through search, if I understand correctly. And the second one is something about preserving goal alignment during the sharp left turn, during the transition. Is that correct?

Victoria: Yeah. And in particular, the second step is where the model itself is trying to preserve its goals and potentially with some help from the human designers.

Michaël: How could this fail? I think you give some examples.

Victoria: I think one way this could fail is that it’s hard for us to help the model, because maybe we don’t have good enough interpretability to see how it’s reasoning about its goals, or maybe it’s hard to communicate with the model about what consequences it predicts for undergoing a capability transition. Also, on the model side, it might just be difficult for it to foresee some of the more subtle consequences of a capability transition. So even if the model is trying to predict what would happen, it might fail. And in terms of trying to help the model go through this transition, if the model ends up with very different concepts from us, then it might just be hard to communicate about how its goals might change.

Paradigms of AI Alignment

Michaël: We’ve talked about the outer reward function and what kinds of goals the model might have internally, with somewhat vague definitions. But I think you tried to clarify all of that in the third post we are going to talk about now, which is Paradigms of AI Alignment. For me, it was very clear and made a lot of sense, and I think it will be helpful for the listeners to have those kinds of distinctions.

The Different Components Of Alignment

Victoria: So this was my overview post on what’s going on in Alignment research and the different parts of the problem that we need to solve. In particular, I like to distinguish between work on Alignment components and Alignment enablers, where components are different elements of an aligned system that we need to get right. In particular, I would say that Outer Alignment and Inner Alignment are different components, and enablers are Alignment techniques that make it easier to get the components right, for example interpretability or foundational work.

Victoria: In terms of these components, I find it helpful to think about different levels of specification of the system: what’s called the ideal specification, the design specification, and the revealed specification. The ideal specification is what the designers have in mind when they build the system, so basically the wishes of the designer. The design specification is what you actually specify for the system to do, for example the reward function. And the revealed specification is what you can reconstruct from the system’s behavior: what goal it seems to be pursuing. For example, if you derive a reward function from looking at what the system is doing.

Victoria: And then if we look at the gaps between these different specifications, the gap between the ideal and design specifications corresponds to Outer Alignment problems, where there are some ways that we want the system to behave, or some norms or constraints we want it to respect, but that we failed to specify in our reward function. While Inner Alignment problems are the other gap, between the design and revealed specifications, where even if your design specification is correct, the system is still doing something wrong. In particular, goal misgeneralization is a way that inner misalignment can happen, because even if you have a correct objective function, the system can still learn a different goal that coincides with the objective function on the training data. The system just doesn’t have enough information from the design specification to learn the right goal. This is the way I would break it down.
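As a toy illustration of these three levels, and of the inner alignment gap in particular, here is a hypothetical sketch loosely inspired by the coin-collecting examples from the goal misgeneralization literature. The environment, the rewards, and the “learned” behavior are all invented for illustration.

```python
# Toy illustration of the ideal / design / revealed specification levels.
# Everything here is invented; it is not an actual experiment.

TRAIN_LEVEL = {"coin_at": "right"}   # during training the coin is always on the right
TEST_LEVEL = {"coin_at": "left"}     # at test time the coin moves

def ideal_spec(outcome, level):
    """What the designers actually want: the agent collects the coin."""
    return 1.0 if outcome["reached"] == level["coin_at"] else 0.0

def design_spec(outcome, level):
    """The reward function we wrote down (here it happens to match the ideal spec)."""
    return 1.0 if outcome["reached"] == level["coin_at"] else 0.0

def revealed_spec(outcome, level):
    """The goal reconstructed from behavior: the agent just goes right."""
    return 1.0 if outcome["reached"] == "right" else 0.0

train_outcome = {"reached": "right"}  # coincides with the coin during training
test_outcome = {"reached": "right"}   # the learned habit persists after the shift

for name, spec in [("ideal", ideal_spec), ("design", design_spec), ("revealed", revealed_spec)]:
    print(name, spec(train_outcome, TRAIN_LEVEL), spec(test_outcome, TEST_LEVEL))
# ideal    1.0 0.0   -> the system fails at what we actually want
# design   1.0 0.0   -> the design spec was correct, so this is not outer misalignment
# revealed 1.0 1.0   -> by its own learned goal ("go right") the system is succeeding
```

The gap between the design and revealed specifications only becomes visible under distribution shift, which is exactly the goal misgeneralization case described above.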

Michaël: So there’s the design specification, then there’s the revealed specification, which is how the model behaves. Is there a third one?

Victoria: So there’s the ideal specification, then design, then revealed.

Building An Ideal Specification For Satisfying Human Values

Michaël: Right. So ideal specification is what we would want ideally to satisfy human preferences?

Victoria: Yeah. So maybe this is the true human utility function if there is such a thing.

Michaël: It seems like when you’re saying this, you don’t believe in that sort of thing.

Victoria: Well, I mean, that’s why we call it the ideal specification: you can’t actually write it down, otherwise you would just make it into your reward function. It’s something that we don’t have full introspective access to.

Michaël: Can we approximate it?

Victoria: Yeah. We can try to approximate it.

Michaël: Do you think there’s something like a coherent extrapolated volition? If we could figure out what humans would want until the end of time, what would humanity agree on if they had infinite time?

Victoria: I mean, that might be a thing, but it’s not very computable. I’m not sure that’s something we’d be able to figure out just by thinking, because we actually have to experience different situations to find out what our preferences are there.

Michaël: You need to simulate humans?

Victoria: I think there are a lot of much more obvious ways for design specifications to omit various parts of the ideal specification. If you think about, for example, the side effects problem, that’s a case where it’s much easier to specify what we want the system to do than what we don’t want the system to do. It’s really hard to specify all the different ways we don’t want the system to change the world, which is part of our ideal specification. And if you have these missing parts of the specification, the system might just interpret that as us being indifferent to those variables, when we actually aren’t. So that’s just one example of how the design specification can be mismatched with the ideal specification, because it has to be a lot simpler.
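Here is a hypothetical sketch of that side-effects point: a design specification that omits a variable makes the optimal behavior indifferent to it, and a generic impact penalty, a rough stand-in for the methods studied in the side-effects literature rather than any specific one, is one way to reintroduce the missing preference.

```python
# Toy illustration: a design spec that omits part of the ideal spec.
# The states and rewards are invented, and the impact penalty is a generic
# sketch, not a specific method from the side-effects literature.

def design_reward(state):
    # We only specified the task: deliver the package.
    return 1.0 if state["package_delivered"] else 0.0

def ideal_reward(state):
    # We also care about things we never wrote down, e.g. not breaking the vase.
    return design_reward(state) - (1.0 if state["vase_broken"] else 0.0)

fast_plan = {"package_delivered": True, "vase_broken": True}     # walks through the vase
careful_plan = {"package_delivered": True, "vase_broken": False}

# Under the design spec both plans look equally good, so the agent is
# effectively indifferent to the vase; the ideal spec disagrees.
assert design_reward(fast_plan) == design_reward(careful_plan)
assert ideal_reward(fast_plan) < ideal_reward(careful_plan)

def penalized_reward(state, baseline, beta=0.5):
    """Generic mitigation sketch: penalize changes to unspecified variables."""
    impact = sum(state[k] != baseline[k] for k in baseline if k != "package_delivered")
    return design_reward(state) - beta * impact

baseline = {"package_delivered": False, "vase_broken": False}
assert penalized_reward(careful_plan, baseline) > penalized_reward(fast_plan, baseline)
```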

Michaël: I mean, the ideal specification would be something like: try to do whatever the humans would do, be uncertain about humans’ goals, and ask questions. And if we try to specify all the different cases, it’s not really a specification as a list of rules; it’s more in terms of what kinds of outcomes you would want.

Victoria: Sure. You could say that.

Michaël: And just to go back to the Inner Alignment and Outer Alignment thing, to make it clear for everyone: Outer Alignment would be whether the ideal specification and the design specification are aligned, so whether the designers or the AI engineers are writing the right thing and are able to implement something very close to the ideal specification. And Inner Alignment is whether the design specification matches the behavior we actually observe in the model, so whether the model behaves the same way as we programmed it.

Victoria: I guess I would say that Inner Alignment falls within that gap between the design and revealed specifications, but there are other problems that are also in that gap. For example, robustness problems, where the system might fail to behave according to the design specification not because it has some misaligned objective but because of robustness failures or some other reasons, maybe adversarial inputs or whatever. So not everything in that gap is inner misalignment.

How Could We Solve Goal Misgeneralisation?

Michaël: Do you have any promising way of solving goal misgeneralization?

Victoria: I think we don’t really have a very good solution for it yet. There are various things you can do to reduce the problem, like trying to have more diversity in the training data so that you include more situations that disambiguate between the different possible goals the system could learn. But it’s difficult to predict all the different kinds of data diversity you might need, or all the ways in which a new situation may be different. And in our paper we distinguish between ways to mitigate goal misgeneralization in general and ways to mitigate deceptive alignment, the kind of goal misgeneralization where the system not only learns some undesirable goal but tries to hide this from us. There we can try to use interpretability tools to detect what objective the system might have, or to detect whether it is being deceptive.
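Continuing the earlier coin toy, here is a hypothetical sketch of how data diversity disambiguates between candidate goals: with narrow training data, both the intended goal and a proxy goal fit every episode, so the learner has no information to prefer one; adding episodes where the two come apart rules out the proxy. All of it is invented for illustration.

```python
# Toy sketch: training-data diversity rules out proxy goals that merely
# coincide with the intended goal on narrow data. Invented for illustration.

CANDIDATE_GOALS = {
    "get_coin": lambda ep: ep["reached"] == ep["coin_at"],   # the intended goal
    "go_right": lambda ep: ep["reached"] == "right",         # a proxy goal
}

def goals_consistent_with(episodes):
    """Candidate goals that fit every training episode equally well."""
    return [name for name, fits in CANDIDATE_GOALS.items()
            if all(fits(ep) for ep in episodes)]

# Narrow data: the coin is always on the right and the agent always goes right.
narrow = [{"coin_at": "right", "reached": "right"}] * 50
print(goals_consistent_with(narrow))    # ['get_coin', 'go_right'] (ambiguous)

# Diverse data: the coin position varies, and rewarded behavior tracks the coin.
diverse = narrow + [{"coin_at": "left", "reached": "left"}] * 50
print(goals_consistent_with(diverse))   # ['get_coin'] (the proxy is ruled out)
```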

Victoria: For example, if the system is maintaining multiple models of the world, maybe one of these would represent what it actually thinks the world is like, how it actually models the world, and the other one is how humans think about the world, which it uses to get us to interpret its actions in a certain way, and so on. This was not very precise, but we can maybe draw some parallels with what deception looks like for humans internally, where a human trying to convince others of something that is not true has to keep multiple scenarios in their head: what is actually going on, and what they’re trying to convince the other person of. So if it turns out that it works similarly for AI systems, we might try to detect that.

Victoria: But there are various other things we can do. For example, we could apply methods like debate, where you have a self-play scenario between two AI systems that try to point out flaws in what the other system is doing, maybe using interpretability tools on each other, so the opponent in the debate could help us discover whether the other AI is deceptively aligned. So there are various approaches. I think they do rely on better interpretability than what we have right now, but I think in principle this is possible.

More Foundational Work Is An Alignment Enabler

Michaël: When you mention interpretability, that is more of an Alignment enabler: something that can enable us to close those gaps in Alignment or help us think more clearly. So you mentioned mechanistic interpretability; are there any other enablers?

Victoria: So the kinds of things that I would consider enablers are work on mechanistic interpretability, trying to build an understanding of how the AI system works, where the different [system] components represent different kinds of knowledge and so on. Then there’s understanding the different kinds of bad incentives systems might have, and whether we can detect those incentives in particular systems. And then there’s also foundational work: generally trying to better understand what goal-directedness is and how we can tell if a system is goal-directed, what it is like to be an embedded agent, an agent that’s not separate from its environment, and how we think about that kind of system having goals, and things like that. If we make progress in our understanding of these phenomena, then we are also in a better position to tell whether our [alignment] components are actually working.

Michaël: So you think things like embedded agency, trying to figure out how to model things when embedded in the world, will be useful for aligning models?

Victoria: I think particularly the considerations that come up in embedded agency research about the model being able to potentially modify itself, even in limited ways, could be relevant to thinking about what might go wrong. Even a language model designing its own prompts is a limited form of self-modification where insights from embedded agency might be helpful. In the embedded agency setting, people also think about the model having sub-agents that might have distinct objectives, and how that works.

Can Language Models Simulate Agency?

Victoria: I think that can also be relevant, for example, for language models that might simulate different characters or sub-agents which, if they’re simulated in enough detail, might have some kind of coherent motivations and so on.

Michaël: Are you talking about Janus’ simulators post?

Victoria: Exactly, that line of research. I’m not sure whether the theoretical insights from the embedded agency agenda would directly apply to this, but I think just being aware of some of those phenomena is helpful when thinking about what can happen with those models as they potentially modify themselves and so on.

Michaël: Just on the simulation of other agents: would it count as having multiple agents inside of you if you were capable of simulating a bunch of different agents?

Victoria: I think there’s a continuum here depending on how detailed and persistent these simulated agents are, but I think that could be the case. It could be possible for a more advanced language model to simulate a very goal-directed subagent that would try to control the language model or try to get out or something. I’m not sure exactly what that would look like, but in principle that’s a possibility. But I think that’s outside the capacity of our current language models; it might happen sometime in the future, and it’s something we might want to think about.

Michaël: Does it make sense to talk about goals if they don’t actually have actuators in the world or a real-world function, if it’s just a simulation by a language model? Does it make sense to talk about agents?

Victoria: I think it does make sense to potentially think about those entities as goal-directed. If the model is prompted to simulate an agent, say it’s asked what such and such human would do in this situation, or what a really competent person would do in this situation or whatever, then it’s trying to simulate something that is goal-directed, so it’s running that process somewhere internally, and a lot of those considerations would apply. But the sub-agent is not truly separate from the broader model. In terms of taking actions, when the model is simulating this agent, the agent might be talking to the human user who’s communicating with the model. But if the language model then goes into a different story, then this character is not communicating with the user anymore. So it’s not as persistent.

Michaël: It’s like an agent appearing in the world for one forward pass and then disappearing, sending a message.

Victoria: It could be something like that. But I think it’s worth thinking about what this kind of agency looks like and how it differs from what standard reinforcement learning agency would be like; we are used to thinking about how RL agents could act in undesirable ways. That’s distinct from the language model itself potentially becoming more goal-directed, through fine-tuning with human feedback or whatever. It is able to act in the world by talking to humans, and if you develop a model with superhuman persuasion or manipulation ability, then it could potentially get humans to do a lot of things, in which case it is effectively acting in the world. In some ways that’s safer, because we do have humans in the loop, but if it can exploit human vulnerabilities to get them to do different things, then that’s still concerning.

What Do We Mean By Goals Actually?

Michaël: So we’ve talked about goal-directedness and what goals are for a long time. It’s a bit late to ask this since we’re almost at the end, but if you could define goals precisely, it would be something like: you want the world to look a specific way and you tend to try to move the world in that direction. So the goal of breaking a vase is going from the state of not-broken to broken. How would you define goal-directedness, or just goals, more precisely?

Victoria: You could say that goals are some particular states of the system, states of the environment, that tend to be reached from different starting configurations. I like the definition from the Ground of Optimization post on the Alignment Forum, which defines an optimizing system as a system that tends to end up in some small set of target configurations from different initial conditions. This doesn’t mean that from an arbitrary starting point you will end up in the target states, because sometimes the world is just too different and you don’t know how to pursue your goals anymore. But if from a wide enough variety of starting points you end up in this goal state, then you could say that it’s a goal-directed system. So it’s kind of like a continuum: the bigger this basin of attraction is, the more different kinds of situations lead to these target states, the more goal-directed the system is.

Michaël: So it’s going from a high-entropy initial environment and minimizing the entropy towards a very specific set of outcomes. If most trajectories end up with the vase being broken, then you can say that the system wanted the vase to be broken.

Victoria: I think you can put it that way: if you perturb the environment in different ways, then the system will find a way to overcome those perturbations. There was a nice example in the Ground of Optimization post about people building a house, where you have some configuration of building materials and people, and you end up with a state that has a house in it. If you introduce some perturbations, you remove a wall or you steal some of their tools or whatever, then eventually they would replace them and still build the house. But of course there are some perturbations that are too large, [like if] there’s a giant earthquake and everyone has to evacuate, then maybe they give up on building the house or something.

Victoria: So there’s always some range of initial conditions. I think it’s a bit tricky to think about goal-directedness in general, because what we are really interested in is whether the system has some internal representation of the target states it’s trying to achieve, but we can only observe goal-directedness through behavior, at least until we have much better interpretability techniques. And the behavioral definitions are a bit circular, because we only observe behavior in situations that we have seen, and it’s hard to predict behavior in new situations without making some inference about how the system is representing objectives. So there’s a lot of circularity going on here. It’s been notoriously difficult to actually define goal-directedness in a useful way.
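To make the basin-of-attraction picture concrete, here is a hypothetical sketch of a purely behavioral test in the spirit of that definition: perturb the initial conditions, run the system, and measure how often it ends up in the small target set. The two toy systems and all numbers are invented for illustration.

```python
import random

# Toy behavioral test in the spirit of the "optimizing system" definition:
# how often does the system reach a small target set from perturbed starts?
# The dynamics below are invented for illustration.

TARGET = 0.0  # target configuration: position at the origin

def thermostat_like_system(x, steps=100):
    """A simple corrective system: moves 20% of the way towards 0 each step."""
    for _ in range(steps):
        x -= 0.2 * x
    return x

def drifting_system(x, steps=100):
    """A system with no corrective tendency: a small random walk."""
    for _ in range(steps):
        x += random.uniform(-0.1, 0.1)
    return x

def goal_directedness(system, n_perturbations=1000, spread=10.0, tol=0.1):
    """Fraction of perturbed initial conditions that end up near the target."""
    hits = 0
    for _ in range(n_perturbations):
        x0 = random.uniform(-spread, spread)
        if abs(system(x0) - TARGET) < tol:
            hits += 1
    return hits / n_perturbations

print(goal_directedness(thermostat_like_system))  # close to 1.0: large basin of attraction
print(goal_directedness(drifting_system))         # much lower: little goal-directedness
```

As noted above, a behavioral measure like this only covers situations we actually test, which is part of why defining goal-directedness has been so difficult.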

Conclusion

Twitter Questions

Michaël: It’s kind of similar to how we can look at the behavior of humans and try to infer their utility function from it, like in inverse reinforcement learning, where you only see the states and actions and try to derive a reward function from the behavior, but you can only do that if you have the behavior in front of you. So I guess to end this, I would like to ask a question that was asked on Twitter by Gary Marcus. When I asked what questions to ask Victoria Krakovna, he asked, let me get this right: “Has there been any significant progress in the (mis)alignment of goals?” The “mis” is in parentheses, so it’s “alignment of goals”. Since you started researching the topic, how serious a problem is it now?

Victoria: I would say there has been some progress. I think we have a better grasp on what the problems are, a better understanding of what outer misalignment problems and inner misalignment problems look like. When I started working on this, we didn’t have this idea of goal misgeneralization, or examples of how it could happen. So now we have a better idea of what the problem is that we are trying to solve. Also, I think there has been progress in interpretability methods, for example applying the circuits methodology to language models, which started out as a technique purely for vision models.

Victoria: Of course, they had to do it differently for language models, but the basic idea of interpreting the network at the level of circuits seems to have transferred pretty well. So mechanistic interpretability is definitely in a better place than it was five years ago. There are still a lot of difficult open problems; interpretability generally seems pretty hard, and identifying the goals of a system is still pretty hard. We have a somewhat better understanding of some of these things, and there are a lot more people with different angles on these problems who are working on them. But I would definitely say it’s still a serious issue.

Last Message For The ML Community

Michaël: Any last message for the Alignment community or the ML community as a whole? Or anything you want to share with our listeners?

Victoria: For the ML community as a whole, I would encourage people to think about the implications of trying to get really capable systems to do what we want, and how that might be different from getting current systems to do what we want. In particular, we might not be able to iterate as easily on powerful systems, or detect bad consequences of their behavior, the way we do with current systems. I think this is one of the things that makes Alignment difficult: as we gradually deploy more advanced systems on different kinds of complicated tasks, at some point we will no longer be able to foresee the consequences of the model’s plans. And that’s why we need scalable oversight. For really narrow systems, say AlphaGo, this is fine, because even if we don’t truly understand how the system’s plan leads to winning at Go, we can be fine with it, because we know it doesn’t have any effects outside of the game.

Victoria: But if you deploy advanced AI systems in more high-stakes domains, or in the real world, it becomes more important for us to understand all the consequences of what the system is doing. That’s where we can try to use AI to help us understand what AI is doing. But at a certain point you can get into a situation where, let’s say, the AI is proposing some complicated plan, at the level of some complicated mathematical theorem, and the human is trying to tell whether the theorem is true or false but might not have the expertise to evaluate all the consequences. So this is just one example of how Alignment of advanced systems can be hard. I would encourage people working on Machine Learning to keep this in the back of their mind as they do what they’re doing. The same methods that let us fix various bad behaviors in current systems might no longer work, and we will actually need new approaches for aligning more advanced systems.

Michaël: Do you have any resources people can look at, or websites or papers you think are a good entry point?

Victoria: I keep a list of AI Safety resources on my blog, so that’s where I would usually point people. There are a lot of good overviews of the motivation for working on Alignment, why Alignment is difficult, what the current research agendas look like, and so on. So there’s a lot of good stuff to read.

Michaël: Cool. On that note, thank you very much.

  1. Note from Victoria Krakovna: A more clear way to say this would be “through incompetence”, since an unaligned system can arise by accident.