What Failure Looks Like: Distilling the Discussion

Ben Pace

The comments under a post often contains valuable insights and additions. They are also often very long and involved, and harder to cite than posts themselves. Given this, I was motivated to try to distill some comment sections on LessWrong, in part to start exploring whether we can build some norms and some features to help facilitate this kind of intellectual work more regularly. So this is my attempt to summarise the post and discussion around What Failure Looks Like by Paul Christiano.

Epistemic status: I think I did an okay job. I think I probably made the most errors in place where I try to emphasise concrete details more than the original post did. I think the summary of the discussion is much more concise than the original.

What Failure Looks Like (Summary)

On its default course, our civilization will build very useful and powerful AI systems, and use such systems to run significant parts of society (such as healthcare, legal systems, companies, the military, and more). Similar to how we are dependent on much novel technology such as money and the internet, we will be dependent on AI.

The stereotypical AI catastrophe involves a powerful and malicious AI that seems good but suddenly becomes evil and quickly takes over humanity. Such descriptions are often stylised for good story-telling, or emphasise unimportant variables.

The post below will concretely lay out two ways that building powerful AI systems may cause an existential catastrophe, if the problem of intent alignment is not solved. This is solely an attempt to describe what failure looks like, not to assign probabilities to such failure or to propose a plan to avoid these failures.

There are two failure modes that will be discussed. First, we may increasingly fail to understand how our AI systems work and subsequently what is happening in society. Secondly, we may eventually give these AI systems massive amounts of power despite not understanding their internal reasoning and decision-making algorithms. Due to the massive space of designs we'll be searching through, if we do not understand the AI, this will mean certain AIs will be more power-seeking than expected, and will take adversarial action and take control.

Failure by loss of control

There is a gap between what we want, and what objective functions we can write down. Nobody has yet created a function that when maximised perfectly describes what we want, but increasingly powerful machine learning will optimise very hard for what function we encode. This will lead to a strong increases the gap between what we can optimise for and what we want. (This is a classic goodharting scenario.)

Concretely, we will gradually use ML to perform more and more key functions in society, but will largely not understand how these systems work or what exactly they’re doing. The information we can gather will seem strongly positive: GDP will be rising quickly, crime will be down, life-satisfaction ratings will be up, congress's approval will be up, and so on.

However, the underlying reality will increasingly diverge from what we think these metrics are measuring, and we may no longer have the ability to independently figure this out. In fact, new things won’t be built, crime will continue, people's lives will be miserable, and congress will not be effective at improving governance, we’ll just believe this because the ML systems will be improving our metrics, and will have a hard time understanding what’s going on outside of what they report.

Gradually, our civilization will lose its ability to understand what is happening in the world as our systems and infrastructure shows us success on all of our available metrics (GDP, wealth, crime, health, self-reported happiness, etc) and in the end we will no longer be in control at all. Giving up this new technology would be analogous to living like the Amish today (but more extreme), which the vast majority of people do not even consider.

In this world, the future will be managed by a system we do not understand, designed to show us everything is going well, and the system will be integral to all parts of society. This is not a problem that we do not already face (the economy and the stock market are already quite confusing), but it is a problem that can be massively exacerbated by machine learning.

This gradually moves toward a world where civilization is run using tools that seem positive, but whose effects we don’t really understand, with no way out of this state of affairs.

Failure by enemy action

Machine Learning systems not only run computation we don’t understand, but entire agents too. The focus in this failure mode is on agents that seek to gain influence in the world, agents that we are not fully aware we have built.

As above, we will gradually use machine learning to perform more and more key functions in society, and will largely not understand how these systems work or what algorithms they’re running.

Despite this lack of understanding, we will design very competent systems with different goals to us (with many objective functions) in charge of a lot of the economy and civilization (law, health, governance, industry, etc).

We generally don’t understand a lot of what happens inside ML systems, all we know is that massive numbers of types of cognition and policies are being built and tested as we train such systems. We know that such systems may create various subsystems optimising for different goals (this is analogous to how evolution created humans who do not primarily care about self-replication, but care about many other things like art and adventure), but we do not know of a way to ensure that such deceptive influence-seeking agents are not created. As long as this is true, and as long as we don’t understand how to ensure that they are not instantiated in our ML systems, then they will eventually be built ‘by default’.

Concretely, this looks like us building systems that are very competent and good at being useful – they inform policy, they provide healthcare, serve as military units, give people life advice, and so on, even though we don't understand the internals of these systems. But they will sometimes blatantly fail – an automated corporation may just take the money and run, a law enforcement system may start seizing resources and try to defend itself from being discovered or decommissioned. There are many adversarial forces already present in the world (feral animals, corrupt bureaucrats, companies obsessed with growth, psychopaths, etc), and to this, we will add certain classes of powerful ML systems.

We’ll likely solve small versions of this problem, preventing any medium sized failures. But as automation grows, we will reach a point where we could not recover from a correlated automation failure, and now the incentives on the automated agents will be very different. During a period of heightened vulnerability – a conflict between states, a natural disaster, a serious cyberattack, etc – we will see a cascading series of automation failures, where we suddenly critically rely on systems in off-distribution behaviour and have to allow it substantial influence over the world without leaving us the ability to course correct afterwards.

After the disaster, we will realise that we have a number of powerful influence-seeking systems that are sophisticated enough that we can probably not get rid of them.

This is the classic Bostromian scenario of an AI taking a treacherous turn into explicit power-grabbing behaviour, and will take control over civilization and its trajectory with no way for humanity to retake control. While the increase in capabilities will be gradual, the turn will be fairly sudden and binary, and not have clear signalling in the lead-up.

The key distinction between this and the prior failure mode is that the systems are taking clear adversarial action to take control, as opposed to just trying to do their jobs very well. The AIs will eventually be taking clear power-grabbing action. At a point where the ML systems are in a position of sufficient power and control to stop obeying us, they will directly take power and control out of the hands of human civilization.

This doesn’t depend on a very specific story of how a given AI system is built, just that it is searching a very large space of algorithms that includes adversarial agents.

Further Discussion Summary.

This is not a comprehensive list of AI Risks

The scenarios above are failures from failing to solve the intent alignment problem, the problem of ensuring that AIs we build are “trying to do what their creators want”.

But this is not a comprehensive list of existential risks to which AI may contribute. Two other key problems associated with AI include:

We may build powerful AI weaponry that we use to directly end civilization such as nations corrupting each others’ news sources to the point of collapse, or nations engaging in war using advanced weapons such as drones to the point of extinction.
We may not use the powerful AI weaponry to end civilization directly, but use it to directly instantiate very bad end states for humanity that on reflection are very limited in value. This could look like building a universe of orgasmium, or something else that is confused about ethics and resultantly pretty meaningless.

This was discussed in comments by Wei Dai and Paul Christiano.

Parts of the stereotypical AI movie story may still apply

Not all the parts of the stereotypical movie story are necessarily mistaken. For example it is an open question as to whether failure by robot armies committing genocide is a likely outcome of the two failures modes discussed above.

Failure from loss of control. It is unclear whether or not the failures from loss of control involve mass military action – whether this is one of the simple metrics that we can optimise for whilst still losing control, or whether our simple metrics will actually be optimised around e.g. if the simple metric is “our security cameras show no death or murder” whether this means that we succeed on this basic metric yet still lose, or whether it means something the AI fabricating a fake feed into the video camera and then doing whatever it wants with the people in real life.

Failure from enemy action. Advanced AI may be used to help run legal and military systems, which could be used to commit genocide by an adversarial actor. There are many ways to take over the world that don’t involve straightforward mass murder, so it is not a necessary scenario for the initial point-of-failure, but either way a treacherous turn likly results in the swift end of humanity soon after, and this is a plausible mechanism.

This was discussed in comments by Carl Shulman and Paul Christiano.

Multipolar outcomes are possible

In a failure by enemy action, the precise details of the interactions between different AI systems running society are unclear. They may attempt to cooperate or enter into conflict, potentially leading to a multipolar outcome, and this can change the dynamics by which power-grabbing occurs.

This was discussed in comments by Richard Ngo and Wei Dai.

These failure modes are very general and may already be occurring

The two scenarios above do not depend on a detailed story of how AI systems work, and are problems that apply generally in the world. It is an interesting and open question to what extent civilization is currently gradually collapsing due to the problems stated above.

This was discussed in a comment by Zvi Mowshowitz.

A faster takeoff may emphasise different failures

This is specifically a description of failure in a world where there is a slow takeoff, whereby AI capabilities rise gradually – more specifically, where there is not a “one year doubling of the world economy before there's been a four year doubling”. Faster takeoff may see other failures.

This was discussed in comments by Buck Shlegeris and Ben Pace.

We can get more evidence and data from history

Further evidence about these scenarios can come from analysing how previous technological progress has affected these variables. How much has loss of control been a problem for tech historically? How much has ML faced such problems so far? These would help gather the evidence together on these topics to figure out the likelihood of these scenarios.

This was discussed in a post by Grue_Slinky.

Discussion not yet summarised.

If people comment with summaries that seem good to me, I will add them to the post. I also may accept edits to the above, as I didn't spend as much time as I'd like on it, so there will be improvements to be made (though I don't promise to spend a lot of time negotiating over edits).

Hanson's response
Response to Hanson by Alexis Carlier and Tom Davidson

The AI systems in part I of the story are NOT "narrow" or "non-agentic"

There's no difference between the level of "narrowness" or "agency" of the AI systems between parts I and II of the story.
- Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
- Put another way, both parts I and II are instances of the "second species" concern/gorilla problem: that AI systems will gain control of humanity's future. (I think this is also identical to what people mean when they say "AI takeover".)
- As far as I can tell, this isn't really a different kind of concern from the classic Bostrom-Yudkowsky case for AI x-risk. It's just a more nuanced picture of what goes wrong, that also makes failure look plausible in slow takeoff worlds.
Instead, the key difference between parts I and II of the story is the way that the models' objectives generalise.
- In part II, it's the kind of generalisation typically called a "treacherous turn". The models learn the objective of "seeking influence". Early in training, the best way to do that is by "playing nice". The failure mode is that, once they become sufficiently capable, they no longer need to play nice and instead take control of humanity's future.
- In part I, it's a different kind of generalisation, which has been much less discussed. The models learn some easily-measurable objective which isn't what humans actually want. In other words, the failure mode is that these models are trying to "produce high scores" instead of "help humans get what they want". You might think that using human feedback to specify the base objective will alleviate this problem (e.g. use learn a reward model from human demonstrations or preferences about a hard-to-measure objective). But this doesn't obviously help: now, the failure mode is that the model learns the objective "do things that look to humans like you are achieving X" or "do things that the humans giving feedback about X will rate highly" (instead of "actually achieving X").
- Notice that in both of these scenarios, the models are mesa-optimizers (i.e. the learned models are themselves optimizers), and failure ensues because the models' learned objectives generalise in the wrong way.

This was discussed in comments (on a separate post) by Richard Ngo and Paul Christiano. There's a lot more important discussion in that comment thread, which is summarised in this doc.

Failure by enemy action.

This makes it sound like it's describing misuse risk, when really it's about accident risk.

I guess it's an odd boundary. Insofar as it's an accident, the accident is "we created agents that had different goals and stole all of our resources". In the world Paul describes, there'll be lots of powerful agents around, and we'll be cooperating and working with them (and plausibly talking with them via GPT-style tech), and at some point the agents we've been cooperating with will have lots of power and start defecting on us.

I called it that because in Paul's post, he gave examples of agents that can do damage to us – "organisms, corrupt bureaucrats, companies obsessed with growth" – and then argued that ML systems will be added to that list, things like "an automated corporation may just take the money and run". This is a world where we have built other agents who work for us and help us, and then they suddenly take adversarial action (or alternatively the adversarial action happens gradually and off-screen and we only notice when it's too late). The agency feels like a core part of it.

So I feel like accident and adversarial action are sort of the same thing in this case.

I mean, I agree that the scenario is about adversarial action, but it's not adversarial action by enemy humans—or even enemy AIs—it's adversarial action by misaligned (specifically deceptive) mesa-optimizers pursuing convergent instrumental goals.

Can you say more about the distinction between enemy AIs and misaligned mesa-optmizers? I feel like I don't have a concrete grasp of what the difference would look like in, say, an AI system in charge of a company.

I could imagine “enemy action” making sense as a label if the thing you're worried about is enemy humans deploying misaligned AI, but that's very much not what Paul is worried about in the original post. Rather, Paul is concerned about us accidentally training AIs which are misaligned and thus pursue convergent instrumental goals like resource and power acquisition that result in existential risk.

Furthermore, they're also not “enemy AIs” in the sense that “the AI doesn't hate you”—it's just misaligned and you're in its way—and so even if you specify something like “enemy AI action” that still seems to me to conjure up a pretty inaccurate picture. I think something like “influence-seeking AIs”—which is precisely the term that Paul uses in the original post—is much more accurate.

I thought about it a bit more and changed my mind, it’s very confusing. I’ll make an edit later, maybe today.

I think I understand why you think the term is misleading, though I still think it's helpfully concrete and not inaccurate. I have a bunch of work to get back to, not planning to follow up on this more right now. Welcome to ping me via PM if you'd like me to follow up another day.

Like others, I find "failure by enemy action" more akin to a malicious human actor injecting some extra misalignment into an AI whose inner workings are already poorly understood. Which is a failure mode worth worrying about, as well.

Thanks for this post! I think the initiative is great, and I'm glad to be able to read a summary of the discussion.

Two points, the first more serious and important than the second:

Concerning your summary, I think that having only the Eleven Paragraph Summary (and maybe the one paragraph, if you really want a short one), is good enough. Notably, I feel like you end throwing too many important details in the three paragraphs summary. And nine paragraphs is short enough than anyone can read it.
Imagine that I want to respond to a specifc point that was discussed. Should I do that here or in the comments of the original post? The first option might make my comment easier to see, but it will split the discussion. (Also, it might cause an infinite recursion of distilling the discussions of the distillation of the discussion of the distillation of ... the discussion.)

Good point on the first part, seems right, I'll cut it to be that.

Added: I've now re-writtten it.

I don't know re: the second. Ideally this post would become more like a publicly editable wiki. I'd kind of like more discussion under this post, but don't have a strong feeling.

Giving up this new technology would be analogous to living like a quaker today

Perhaps you meant "Amish" or "Mennonite" rather than "quaker"?

That's right. Changed to Amish.

The AI systems in part I of the story are NOT "narrow" or "non-agentic"

There's no difference between the level of "narrowness" or "agency" of the AI systems between parts I and II of the story.
- Many people (including Richard Ngo and myself) seem to have interpreted part I as arguing that there could be an AI takeover by AI systems that are non-agentic and/or narrow (i.e. are not agentic AGI). But this is not at all what Paul intended to argue.
- Put another way, both parts I and II are instances of the "second species" concern/gorilla problem: that AI systems will gain control of humanity's future. (I think this is also identical to what people mean when they say "AI takeover".)
- As far as I can tell, this isn't really a different kind of concern from the classic Bostrom-Yudkowsky case for AI x-risk. It's just a more nuanced picture of what goes wrong, that also makes failure look plausible in slow takeoff worlds.
Instead, the key difference between parts I and II of the story is the way that the models' objectives generalise.
- In part II, it's the kind of generalisation typically called a "treacherous turn". The models learn the objective of "seeking influence". Early in training, the best way to do that is by "playing nice". The failure mode is that, once they become sufficiently capable, they no longer need to play nice and instead take control of humanity's future.
- In part I, it's a different kind of generalisation, which has been much less discussed. The models learn some easily-measurable objective which isn't what humans actually want. In other words, the failure mode is that these models are trying to "produce high scores" instead of "help humans get what they want". You might think that using human feedback to specify the base objective will alleviate this problem (e.g. use learn a reward model from human demonstrations or preferences about a hard-to-measure objective). But this doesn't obviously help: now, the failure mode is that the model learns the objective "do things that look to humans like you are achieving X" or "do things that the humans giving feedback about X will rate highly" (instead of "actually achieving X").
- Notice that in both of these scenarios, the models are mesa-optimizers (i.e. the learned models are themselves optimizers), and failure ensues because the models' learned objectives generalise in the wrong way.

This was discussed in comments (on a separate post) by Richard Ngo and Paul Christiano. There's a lot more important discussion in that comment thread, which is summarised in this doc.

Failure by enemy action.

This makes it sound like it's describing misuse risk, when really it's about accident risk.

So I feel like accident and adversarial action are sort of the same thing in this case.

I thought about it a bit more and changed my mind, it’s very confusing. I’ll make an edit later, maybe today.

Thanks for this post! I think the initiative is great, and I'm glad to be able to read a summary of the discussion.

Two points, the first more serious and important than the second:

Concerning your summary, I think that having only the Eleven Paragraph Summary (and maybe the one paragraph, if you really want a short one), is good enough. Notably, I feel like you end throwing too many important details in the three paragraphs summary. And nine paragraphs is short enough than anyone can read it.
Imagine that I want to respond to a specifc point that was discussed. Should I do that here or in the comments of the original post? The first option might make my comment easier to see, but it will split the discussion. (Also, it might cause an infinite recursion of distilling the discussions of the distillation of the discussion of the distillation of ... the discussion.)

Good point on the first part, seems right, I'll cut it to be that.

Added: I've now re-writtten it.

I don't know re: the second. Ideally this post would become more like a publicly editable wiki. I'd kind of like more discussion under this post, but don't have a strong feeling.

Giving up this new technology would be analogous to living like a quaker today

Perhaps you meant "Amish" or "Mennonite" rather than "quaker"?

That's right. Changed to Amish.

84

What Failure Looks Like: Distilling the Discussion

84

Ω 30

What Failure Looks Like (Summary)

Further Discussion Summary.

Discussion not yet summarised.

84

Ω 30

84

Ω 30