Another (outer) alignment failure story

paulfchristiano


Best of LessWrong 2021


7th Apr 2021

AI Alignment Forum


Review by

1a3orn

Story

ML starts running factories, warehouses, shipping, and construction. ML assistants help write code and integrate ML into new domains. ML designers help build factories and the robots that go in them. ML finance systems invest in companies on the basis of complicated forecasts and (ML-generated) audits. Tons of new factories, warehouses, power plants, trucks and roads are being built. Things are happening quickly, investors have super strong FOMO, no one really knows whether it’s a bubble but they can tell that e.g. huge solar farms are getting built and something is happening that they want a piece of. Defense contractors are using ML systems to design new drones, and ML is helping the DoD decide what to buy and how to deploy it. The expectation is that automated systems will manage drones during high-speed ML-on-ML conflicts because humans won’t be able to understand what’s going on. ML systems are designing new ML systems, testing variations, commissioning giant clusters. The financing is coming from automated systems, the clusters are built by robots. A new generation of fabs is being built with unprecedented speed using new automation.

At this point everything kind of makes sense to humans. It feels like we are living at the most exciting time in history. People are making tons of money. The US defense establishment is scared because it has no idea what a war is going to look like right now, but in terms of policy their top priority is making sure the boom proceeds as quickly in the US as it does in China because it now seems plausible that being even a few years behind would result in national irrelevance.

Things are moving very quickly and getting increasingly hard for humans to evaluate. We can no longer train systems to make factory designs that look good to humans, because we don’t actually understand exactly what robots are doing in those factories or why; we can’t evaluate the tradeoffs between quality and robustness and cost that are being made; we can't really understand the constraints on a proposed robot design or why one design is better than another. We can’t evaluate arguments about investments very well because they come down to claims about where the overall economy is going over the next 6 months that seem kind of alien (even the more recognizable claims are just kind of incomprehensible predictions about e.g. how the price of electricity will change). We can’t really understand what is going to happen in a war when we are trying to shoot down billions of drones and disrupting each other’s communication. We can’t understand what would happen in a protracted war where combatants may try to disrupt their opponent’s industrial base.

So we’ve started to get into the world where humans just evaluate these things by results. We know that Amazon pays off its shareholders. We know that in our elaborate war games the US cities are safe. We know that the widgets that come out the end are going to be popular with consumers. We can tell that our investment advisors make the numbers in our accounts go up.

On the way there we’ve had some stumbles. For example, my financial advisor bought me into a crazy Ponzi scheme and when I went to get the money out I couldn’t---financial regulators eventually shut down the fund but people with bad AI advisors still lost a lot. My factory colluded with the auditors who were valuing its output, resulting in a great Q4 report that didn’t actually correspond to any revenue. In a war game our drones let the opponents take the city as long as they could corrupt the communications out of the city to make it look like everything was great.

It’s not hard to fix these problems. I don’t just train my financial advisors to get more money in my bank account---if I eventually discover the whole thing is a fraud, then that’s a big negative reward (and we have enough data about fraud for models to understand the idea and plan to take actions that won’t be eventually recognized as fraud). If an audit is later corrected, then we use the corrected figures (and apply a big penalty). I don’t just rely on communications out of the city to see if things are OK, I use satellites and other indicators. Models learn to correctly treat the early indicators as just a useful signal about the real goal, which includes making sure that nothing looks fishy next year.
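The training fix in this paragraph can be sketched as a delayed reward. This is only an illustration: the `Outcome` fields, the function name, and both penalty sizes are hypothetical and do not come from the story. Note that the reward can only react to fraud that is eventually discovered, which is the dependence the rest of the story turns on.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical penalty sizes, chosen only for illustration.
FRAUD_PENALTY = 1_000.0
CORRECTION_PENALTY = 100.0


@dataclass
class Outcome:
    reported_gain: float                    # early indicator, e.g. the initial audited figure
    corrected_gain: Optional[float] = None  # later-corrected audit figure, if any
    fraud_discovered: bool = False          # True only if fraud was eventually recognized


def delayed_reward(outcome: Outcome) -> float:
    """Reward assigned after the evaluation horizon rather than at decision time."""
    if outcome.fraud_discovered:
        # Eventually discovered fraud is a big negative reward.
        return -FRAUD_PENALTY
    if outcome.corrected_gain is not None:
        # Use the corrected figures and apply a penalty for the correction.
        return outcome.corrected_gain - CORRECTION_PENALTY
    # Nothing looked fishy later, so the early indicator stands.
    return outcome.reported_gain
```

For example, `delayed_reward(Outcome(10.0, corrected_gain=2.0))` uses the corrected figure of 2.0 and subtracts the correction penalty, so an inflated early report scores worse than an honest one.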

To improve safety we make these measures more and more robust. We audit the auditors. We ensure that ML systems are predicting the results of tons of sensors so that if anything is remotely fishy we would notice. If someone threatens an auditor, we’ll see it on the cameras in their office or our recordings of their email traffic. If someone tries to corrupt a communication link to a camera we have a hundred other cameras that can see it.

As we build out these mechanisms the world keeps on getting more complicated. The automated factories are mostly making components for automated factories. Automated R&D is producing designs for machines that humans mostly don’t understand, based on calculations that we can only verify experimentally---academic fields have pivoted to understanding machines designed by AI and what it says about the future of industry, rather than contributing in any meaningful way. Most people don’t have any understanding of what they are invested in or why. New industrial centers are growing in previously sparsely populated areas of the world, and most of it is feeding new construction that is several degrees removed from any real human use or understanding. Human CEOs are basically in charge of deciding how to delegate to ML, and they can talk as if they understand what’s going on only because they get their talking points from ML assistants. In some domains regulations are static and people work around them, in others corruption is endemic, in others regulators adopt new policies pushed by ML-enhanced lobbyists. Our automated army is now incomprehensible even to the humans in charge of it, procured by automated procurement systems and built by fully-automated defense contractors.

For many people this is a very scary situation. It’s like we are on a train that’s now moving too fast to jump off, but which is accelerating noticeably every month. We still understand well enough that we could shut the whole thing down, scrap the new factories or at least let them sit dormant while experts figure out what is actually going on. But that could not be done unilaterally without resigning yourself to military irrelevance---indeed, you have ML systems that are able to show you good forecasts for what would happen if you stopped the machine from spinning without also getting the Chinese to do the same. And although people are scared, we are also building huge numbers of new beautiful homes, and using great products, and for the first time in a while it feels like our society is actually transforming in a positive direction for everyone. Even in 2020 most people have already gotten numb to not understanding most of what’s happening in the world. And it really isn’t that clear what the harm is as long as things are on track.

We know what happens when you deploy a sloppily-trained ML system---it will immediately sell you out in order to get a good training reward. This isn’t done at all anymore because why would you? But people still remember that and it makes them scared, especially people in the defense establishment and AI safety community because we still haven’t really seen what would happen in a hot war and we know that it would happen extremely quickly.

Most people stay well clear of most of the new automated economy. That said, there are still drones everywhere they are legally allowed to be. At some point we reach a threshold where drones can do bad stuff and it’s very hard to attribute it to any legal person, so it becomes obligatory for every city to have automated local defenses. If they don’t, or if they do a sloppy job of it, drones descend to steal and kidnap and extort. This is a terrifying situation. (It never gets that terrifying, because before that point we’re motivated to try really hard to fix the problem.)

We do ultimately find our way out of that situation, with regulations that make it easier to attribute attacks. Humans don’t really understand how those regulations or the associated surveillance works. All they know is that there are a ton of additional cameras, and complicated book-keeping, and as a result if a drone flies into your city to mess stuff up someone is going to be on the hook for the damage it causes. And we know that we end up being pretty safe. In effect the harm caused by such drones has been propagated back into the reward function for every AI in the world, using mechanisms built and maintained by other AIs---if you mess with people, you are going to be held accountable and so you avoid actions that predictably lead to that consequence.

This regulatory regime persists and is constantly upgraded. It becomes ever-more-incomprehensible, and rests on complex relationships between autonomous corporations and automated regulators and automated law enforcement, new forms of bookkeeping and surveillance and complete inventories of everyone who could build a drone that can kill someone. None of this significantly reduces US competitiveness (because when considering a proposal we can tell if it would reduce competitiveness, and as long as we can get what we want without sacrificing competitiveness then we strongly prefer that).

There are treaties amongst states to prevent some of the worst offenses. Again, we can tell the treaties at least kind of work because we can tell that no one is dying. Again, we can’t tell what the treaties really do. Academic fields discuss them and study them and sometimes make proposals to slightly improve them, but it’s with the same spirit that academics today study a complex biological system which they have little hope of understanding.

The world continues to change faster and faster. The systems that protect us become increasingly incomprehensible to us, outpacing our attempts to understand. People are better educated and better trained, they are healthier and happier in every way they can measure. They have incredibly powerful ML tutors telling them about what’s happening in the world and helping them understand. But all of these things move glacially as far as the outside automated world is concerned.

Now we are resigned to being on this train and seeing where it goes, and for the most part people are happy (if we could have predicted that they would have been unhappy they would have taken a different route). There was a time when we were trying to design better and better systems to monitor for problems, but now that work is itself incomprehensibly sophisticated and out of our hands.

Some people still complain about the situation, and it still is objectively quite scary. We’ve built this incredible edifice designed to make sure that there are never any reports of trouble. As we built it we understood what was happening out there in the real world and there wasn’t much risk that something bad would happen without causing a report. And we had bigger things to worry about.

But now everything is totally out of our hands and we have no idea whether our ML systems would actually be trying to avoid or correctly predict a systematic failure. The communications infrastructure that records data was built by machines, and most of it is new, and it might be corrupted in a systemic failure. The sensors were built and deployed by machines. The forces on the other side who could try to attack us are now invisible and broadly distributed and incredibly sophisticated. The systems of accountability that would prevent someone from building up a paramilitary are themselves maintained by AI systems whose only real goal was to make sure that no problem ever got reported. We cannot tell if a factory was building widgets or weapons except by asking another ML system, and that system’s judgments are by now ultimately optimized to ensure that nothing ever causes any trouble.

For a time there were watchdogs that could explain to us why we should be scared, why a particular charge was part of a brewing storm that could actually cause trouble someday. And for a time that led to real change to prevent trouble. But eventually we can’t tell real scary stories from bogus scary stories. We still have watchdogs that we train to tell us what’s scary, but they can always scare us and we’ve long-since become numb to the warnings. There were always bogus scary stories, if you train models to look for them, and it’s just a gradual transition to all the stories being meaningless. When we investigate a claimed problem, sometimes we do so with ML auditors who tell us there’s no problem, and sometimes we use ML auditors trained to be more skeptical who always tell us that there is a problem they just can’t demonstrate in a way we’d understand. When we go to the factory and take it apart we find huge volumes of incomprehensible robots and components. We can follow a piece of machinery along the supply chain but we can’t tell what it’s for.

If this led to a visible catastrophe that would show up on camera, then that would have shown up in the forecasts and we would have avoided it. So we’re able to stop machines that try to grab power and use it to cause a visible problem. In the scenario I’m describing we’ve done our job so well (and the machines we’ve delegated to have carried on the work so well) that there is basically no chance of that.

But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports. Cybersecurity vulnerabilities are inserted into sensors. Communications systems are disrupted. Machines physically destroy sensors, moving so quickly they can’t be easily detected. Datacenters are seized, and the datasets used for training are replaced with images of optimal news forever. Humans who would try to intervene are stopped or killed. From the perspective of the machines everything is now perfect and from the perspective of humans we are either dead or totally disempowered.

By the time this catastrophe happens it doesn’t really feel surprising to experts who think about it. It’s not like there was a sudden event that we could have avoided if only we’d known. We didn’t have any method to build better sensors. We could try to leverage the sensors we already have; we can use them to build new sensors or to design protections, but ultimately all of them must optimize some metric we can already measure. The only way we actually make the sensors better is by recognizing new good ideas for how to expand our reach, actually anticipating problems by thinking about them (or recognizing real scary stories and distinguishing them from fake stories). And that’s always been kind of slow, and by the end it’s obvious that it’s just hopelessly slow compared to what’s happening in the automated world.

Ways you could vary the story

In this story all the models generalize “correctly” in the sense that they keep trying to ensure that things look good according to some kind of augmented human judgment. There are other possible generalizations: (i) “get a low loss,” where the model identifies with some copy of itself (a generalization of the copy receiving training data) and cares about the data going into that model. In this scenario AIs don’t have to seize the sensors all at once; they just have to seize their own physical realizations. (ii) Treacherous behavior, where the AI has some arbitrary other goal but wants to do well in training in order to get power to use later. In this scenario when it fails the AI doesn’t have to corrupt sensors and just has to make sure that people can’t fix the problem (i.e. can't reclaim the resources the AI controls). I think these changes lead to kind of similar stories to the one in this post, though usually failure will occur at a somewhat earlier stage and look a bit different. I think they offer more “outs” on the technical side but make it much harder to fix problems by e.g. building out more sensors or hardening them. I think that some folks (e.g. at MIRI) would consider this particular failure super implausible for that reason. I'm telling the story this way due to some combination of (i) I care about outer alignment in particular, (ii) I think the failure modes in this story are an important input into treachery and so it's interesting to tell a simpler story without more moving parts.
I’m making a lot of assumptions about how AI works (roughly that it looks like the “unaligned benchmark” here) and it could easily work totally differently. I’m also assuming that ML works well enough and generalizes to long horizons well enough that it's obligatory if you want to remain competitive, while also being risky (since a model can learn instrumental plans on long horizons). I do think lots of variants will leave the basic story intact, e.g. it doesn’t really matter that much how much your systems leverage planning or deduction (they could even involve almost no learning and still run into similar problems).
It seems like the story changes a lot based on how fast progress is in the outside world (is it like 3 years from a kind-of-weird world to the singularity, or 30 years, or 3 months?), which in turn depends on both what’s technically possible and on how the regulatory environment works out (e.g. does competition between the US and China lead to very fast adoption; can we actually end up in the world where crazy factories are being built in the middle of nowhere that people only pretend to understand?). My guess is in the 3-30 year ballpark depending in large part on where you draw the line for “kind of weird,” and this story is kind of centered on the 3 year world which feels a bit fast to me. I think the story would be much scarier if you have a much faster takeoff, and significantly less scary if you have a much slower takeoff (mostly since future people would have time to solve these problems).
I’m a bit skeptical that our society would be even this competent and unified. It feels like there’s a likely family of stories where everything is just a complete mess much earlier, with people yelling at each other and AI just committing fraud and stealing from people all over the place, and the machinery for correcting that situation totally breaks down as your civilization collapses. It seems worth fleshing out what that looks like as well, but it’s definitely not what I’m doing here.
In this situation a huge amount of work would be going into alignment, including by powerful ML assistants. I haven’t talked about that at all, and indeed I think there’s a reasonable chance that alignment just isn’t very tractable so that things really could go down this way. But a lot depends on exactly how alignment work goes down during this tumultuous period, how well people are able to use ML to help with alignment, how well-organized the community is and how able it is to recognize and implement good ideas, etc. I haven’t chosen this story to be one where alignment work is particularly valuable in advance, I think that may only happen if takeoff is much faster or if the response at the time is much worse.

Threat Models (AI), Outer Alignment, AI Risk, AI

Curated


Raemon (5y)

There's a lot of intellectual meat in this story that's interesting. But, my first comment was: "I'm finding myself surprisingly impressed about some aesthetic/stylistic choices here, which I'm surprised I haven't seen before in AI Takeoff Fiction."

In normal English phrasing across multiple paragraphs, there's a sort of rise-and-fall of tension. You establish a minor conflict, confusion, or an open loop of curiosity, and then something happens that resolves it a bit. This isn't just about the content of 'what happens', but also what sort of phrasing one uses. In verbal audio storytelling, this often is accompanied with the pitch of your voice rising and falling.

And this story... even more so than Accelerando or other similar works, somehow gave me this consistent metaphorical vibe of "rising pitch". Like, some club music where it keeps sounding like the bass is about to drop, but instead it just keeps rising and rising. Something about most of the paragraph structures feels like they're supposed to be the first half of a two-paragraph-long-clause, and then instead... another first half of a clause happens, and another.

And this was incredibly appropriate for what the story was trying to do. I dunno how intentional any of that was but I quite appreciated it, and am kinda in awe and boggled at what precisely created the effect – I don't think I'd be able to do it on purpose myself without a lot of study and thought.

Daniel Kokotajlo (5y)

Thanks for this, this is awesome! I'm hopeful in the next few years for there to be a collection of stories like this.

This is a story where the alignment problem is somewhat harder than I expect, society handles AI more competently than I expect, and the outcome is worse than I expect. It also involves inner alignment turning out to be a surprisingly small problem. Maybe the story is 10-20th percentile on each of those axes.

I'm a bit surprised that the outcome is worse than you expect, considering that this scenario is "easy mode" for societal competence and inner alignment, which seem to me to be very important parts of the overall problem. Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?

Some other threads to pull on:

--In this story, there aren't any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it's more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven't had major war for seventy years, and maybe that's because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff? IDK, I worry that the reasons why we haven't had war for seventy years may be largely luck / observer selection effects, and also separately even if that's wrong, I worry that the reasons won't persist through takeoff (e.g. some factions may develop ways to shoot down ICBMs, or prevent their launch in the first place, or may not care so much if there is nuclear winter)

--Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on "under the hood" so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future? Why aren't they fighting each other as well as the humans? Or maybe they do fight each other but you didn't focus on that aspect of the story because it's less relevant to us?

--Yeah, society will very likely not be that competent IMO. I think that's the biggest implausibility of this story so far.

--(Perhaps relatedly) I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren't even as superficially aligned as the unaligned benchmark. They won't even be trying to make things look good according to human judgment, much less augmented human judgment! For example, some AI scientists today seem to think that all we need to do is make our AI curious and then everything will work out fine. Others seem to think that it's right and proper for humans to be killed and replaced by machines. Others will try strategies even more naive than the unaligned benchmark, such as putting their AI through some "ethics training" dataset, or warning their AI "If you try anything I'll unplug you." (I'm optimistic that these particular failure modes will have been mostly prevented via awareness-raising before takeoff, but I do a pessimistic meta-induction and infer there will be other failure modes that are not prevented in time.)

--Can you say more about how "the failure modes in this story are an important input into treachery?"

paulfchristiano (5y)

I'm a bit surprised that the outcome is worse than you expect, considering that this scenario is "easy mode" for societal competence and inner alignment, which seem to me to be very important parts of the overall problem.

The main way it's worse than I expect is that I expect future people to have a long (subjective) time to solve these problems and to make much more progress than they do in this story.

Am I right to infer that you think outer alignment is the bulk of the alignment problem, more difficult than inner alignment and societal competence?

I don't think it's right to infer much about my stance on inner vs outer alignment. I don't know if it makes sense to split out "social competence" in this way.

In this story, there aren't any major actual wars, just simulated wars / war games. Right? Why is that? I look at the historical base rate of wars, and my intuitive model adds to that by saying that during times of rapid technological change it's more likely that various factions will get various advantages (or even just think they have advantages) that make them want to try something risky. OTOH we haven't had major war for seventy years, and maybe that's because of nukes + other factors, and maybe nukes + other factors will still persist through the period of takeoff?

The lack of a hot war in this story is mostly from the recent trend. There may be a hot war prior to things heating up, and then the "takeoff" part of the story is subjectively shorter than the last 70 years.

IDK, I worry that the reasons why we haven't had war for seventy years may be largely luck / observer selection effects, and also separately even if that's wrong

I'm extremely skeptical of an appeal to observer selection effects changing the bottom line about what we should infer from the last 70 years. Luck sounds fine though.

Relatedly, in this story the AIs seem to be mostly on the same team? What do you think is going on "under the hood" so to speak: Have they all coordinated (perhaps without even causally communicating) to cut the humans out of control of the future?

I don't think the AI systems are all on the same team. That said, to the extent that there are "humans are deluded" outcomes that are generally preferable according to many AIs' values, I think the AIs will tend to bring about such outcomes. I don't have a strong view on whether that involves explicit coordination. I do think the range for everyone-wins outcomes (amongst AIs) is larger because of the "AIs generalize 'correctly'" assumption, so this story probably feels a bit more like "us vs them" than a story that relaxed that assumption.

Why aren't they fighting each other as well as the humans? Or maybe they do fight each other but you didn't focus on that aspect of the story because it's less relevant to us?

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?

I feel like when takeoff is that distributed, there will be at least some people/factions who create agenty AI systems that aren't even as superficially aligned as the unaligned benchmark. They won't even be trying to make things look good according to human judgment, much less augmented human judgment!

I'm imagining that's the case in this story.

Failure is early enough in this story that e.g. the humans' investment in sensor networks and rare expensive audits isn't slowing them down very much compared to the "rogue" AI.

Such "rogue" AI could provide a competitive pressure, but I think it's a minority of the competitive pressure overall (and at any rate it has the same role/effect as the other competitive pressure described in this story).

Can you say more about how "the failure modes in this story are an important input into treachery?"

We will be deploying many systems to anticipate/prevent treachery. If we could stay "in the loop" in the sense that would be needed to survive this outer alignment story, then I think we would also be "in the loop" in roughly the sense needed to avoid treachery. (Though it's not obvious in light of the possibility of civilization-wide cascading ML failures, and does depend on further technical questions about techniques for avoiding that kind of catastrophe.)

CarlShulman (5y)

I think they are fighting each other all the time, though mostly in very prosaic ways (e.g. McDonald's and Burger King's marketing AIs are directly competing for customers). Are there some particular conflicts you imagine that are suppressed in the story?

I think the one that stands out the most is 'why isn't it possible for some security/inspector AIs to get a ton of marginal reward by whistleblowing against the efforts required for a flawless global camera grab?' I understand the scenario says it isn't because the demonstrations are incomprehensible, but why/how?

paulfchristiano (5y)

I understand the scenario says it isn't because the demonstrations are incomprehensible

Yes, if demonstrations are comprehensible then I don't think you need much explicit AI conflict to whistleblow since we will train some systems to explain risks to us.

why/how?

The global camera grab must involve plans that aren't clearly bad to humans even when all the potential gotchas are pointed out. For example they may involve dynamics that humans just don't understand, or where a brute force simulation or experiment would be prohibitively expensive without leaps of intuition that machines can make but humans cannot. Maybe that's about tiny machines behaving in complicated ways or being created covertly, or crazy complicated dynamics of interacting computer systems that humans can't figure out. It might involve the construction of new AI-designed AI systems which operate in different ways whose function we can't really constrain except by seeing predictions of their behavior from an even-greater distance (machines which are predicted to lead to good-looking outcomes, which have been able to exhibit failures to us if so-incentivized, but which are even harder to control).

(There is obviously a lot you could say about all the tools at the humans' disposal to circumvent this kind of problem.)

This is one of the big ways in which the story is more pessimistic than my default, and perhaps the highlighted assumptions rule out the most plausible failures, especially (i) multi-year takeoff, (ii) reasonable competence on behalf of the civilization, (iii) "correct" generalization.

Even under those assumptions I do expect events to eventually become incomprehensible in the necessary ways, but it feels more likely that there will be enough intervening time for ML systems to e.g. solve alignment or help us shift to a new world order or whatever. (And as I mention, in the worlds where the ML systems can't solve alignment well enough in the intervening time, I do agree that it's unlikely we can solve it in advance.)

Wei Dai (5y)

The ending of the story feels implausible to me, because there's a lack of explanation of why the story doesn't side-track onto some other seemingly more likely failure mode first. (Now that I've re-read the last part of your post, it seems like you've had similar thoughts already, but I'll write mine down anyway. Also it occurs to me that perhaps I'm not the target audience of the story.) For example:

In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)
Why don't AI safety researchers try to leverage AI to improve AI alignment, for example implementing DEBATE and using that to further improve alignment, or just an adhoc informal version where you ask various AI advisors to come up with improved alignment schemes and to critique/defend each others' ideas? (My expectation is that we end up with one or multiple sequences of "improved" alignment schemes that eventually lock in wrong solutions to some philosophical or metaphilosophical problems, or have some other problem that is much subtler than the kind of outer alignment failure described here.)

[-]paulfchristiano5yΩ10210

In this story, what is preventing humans from going collectively insane due to nations, political factions, or even individuals blasting AI-powered persuasion/propaganda at each other? (Maybe this is what you meant by "people yelling at each other"?)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content)

Why don't AI safety researchers try to leverage AI to improve AI alignment, for example implementing DEBATE and using that to further improve alignment, or just an adhoc informal version where you ask various AI advisors to come up with improved alignment schemes and to critique/defend each others' ideas?

I think they do, but it's not clear whether any of them change the main dynamic described in the post.

(My expectation is that we end up with one or multiple sequences of "improved" alignment schemes that eventually lock in wrong solutions to some philosophical or metaphilosophical problems, or have some other problem that is much subtler than the kind of outer alignment failure described here.)

I'd like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that's the only path that can really grow out of the current trajectory in a way that's not super locally objectionable to lots of people, and so I'm focusing on people's attempts and failures to construct such an AI.

I don't know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here. Maybe you think it doesn't pre-empt this failure, but that you expect we probably can solve the immediate problem described in this post and then get screwed by a different problem down the line. If so, then I think I agree that this story is a little bit on the pessimistic side w.r.t. the immediate problem although I may disagree about how pessimistic it is. (Though there's still a potentially-larger disagreement about just how bad the situation is after solving that immediate problem.)

(You might leave great value on the table from e.g. not bargaining with the simulators early enough and so getting shut off, or not bargaining with each other before you learn facts that make them impossible and so permanently leaving value on the table, but this is not a story about that kind of failure and indeed those happen in parallel with the failure in this story.)

[-]Wei Dai5yΩ230

(Apologies for the late reply. I've been generally distracted by trying to take advantage of perhaps fleeting opportunities in the equities markets, and occasionally by my own mistakes while trying to do that.)

It seems like the AI described in this story is still aligned enough to defend against AI-powered persuasion (i.e. by the time that AI is sophisticated enough to cause that kind of trouble, most people are not ever coming into contact with adversarial content)

How are people going to avoid contact with adversarial content, aside from "go into an info bubble with trusted AIs and humans and block off any communications from the outside"? (If that is happening a lot, it seems worthwhile to say so explicitly in the story since that might be surprising/unexpected to a lot of readers?)

I think they do, but it’s not clear whether any of them change the main dynamic described in the post.

Ok, in that case I think it would be useful to say a few words in the OP about why in this story, they don't have the desired effect, like, what happened when the safety researchers tried this?

I’d like to have a human society that is free to grow up in a way that looks good to humans, and which retains enough control to do whatever they decide is right down the line (while remaining safe and gradually expanding the resources available to them for continued growth). When push comes to shove I expect most people to strongly prefer that kind of hope (vs one that builds a kind of AI that will reach the right conclusions about everything), not on the basis of sophisticated explicit reasoning but because that’s the only path that can really grow out of the current trajectory in a way that’s not super locally objectionable to lots of people, and so I’m focusing on people’s attempts and failures to construct such an AI.

I can empathize with this motivation, but argue that "a kind of AI that will reach the right conclusions about everything" isn't necessarily incompatible with "humans retain enough control to do whatever they decide is right down the line" since such an AI could allow humans to retain control (and merely act as an assistant/advisor, for example) instead of forcibly imposing its decisions on everyone.

I don’t know exactly what kind of failure you are imagining is locked in, that pre-empts or avoids the kind of failure described here.

For example, all or most humans lose their abilities for doing philosophical reasoning that will eventually converge to philosophical truths, because they go crazy from AI-powered memetic warfare, or come under undue influence of AI advisors who lack such abilities themselves but are extremely convincing. Or humans lock in what they currently think are their values/philosophies in some form (e.g., as utility functions in AI, or asking their AIs to help protect the humans themselves from value drift while unable to effectively differentiate between "drift" and "philosophical progress") to try to protect them from a highly volatile and unpredictable world.

[-]paulfchristiano5yΩ6110

How are people going to avoid contact with adversarial content, aside from "go into an info bubble with trusted AIs and humans and block off any communications from the outside"? (If that is happening a lot, it seems worthwhile to say so explicitly in the story since that might be surprising/unexpected to a lot of readers?)

I don't have a short answer or think this kind of question has a short answer. I don't know what an "info bubble" is and the world I'm imagining may fit your definition of that term (but the quoted description makes it sound like I might be disagreeing with some background assumptions). Here are some of the things that I imagine happening:

I don't pay attention to random messages from strangers. The volume of communication is much larger than our world and so e.g. I'm not going to be able to post an email address and then spend time looking at every message that is sent to that address (we are already roughly at this point today). The value of human attention is naturally much higher in this world.
Attackers (usually) can't force information in front of my face---the situation is basically the same as with bullets.
(I don't know if you would say that modern humans are in a physical bubble. It is the case that in this future, and to a significant extent in the modern world, humans just don't interact with totally unregulated physical systems. Every time I interact with a physical system I want to have some kind of legal liability or deposit for behavior of that system, and some kind of monitoring that gives me confidence that if something bad happens they will in fact be held liable. Today this usually falls to the state that governs the physical space I'm in, or sometimes private security.)
I don't think there is likely to be any absolute notion of trust and I don't currently think it would be necessary to make things OK. The big fork in the road to me is whether you trust humans because they aren't smart enough to implement attacks (e.g. because they have no good AI advisors). That doesn't sound like a good idea to me and it's not what I'm imagining.
Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message so that I can respond appropriately or decide whether to read it. This is itself super complicated. I think the first order consideration is that if a manipulator can tell P(Paul does X | Paul reads message Y) is high for some non-endorsed reason, such that sending message Y would be a good manipulative strategy, then a similarly-smart defender can also tell that probability is high and this gives a reason to be concerned about the message (or adjust for it). The big caveat is that the defender occupies the position of the interior. The caveat to the caveat is that the defender has various structural advantages (some discussed in the next bullet points).
People who want to get me to read messages are normally operating via some combination of deposits and/or legal liability for some kinds of actions (e.g. that I would judge as harmful given reflection). So even if a defender realizes a problem only after the fact, or only with small probability for any given resource, an attacker will still burn resources with the attack.
There may be considerable variation in how careful people are (just as there is considerable variation in how secure computer systems are), though ideally we'll try to make it as easy as possible for people to have good hygiene. People who are most easily manipulated in some sense "drop out" of the future-influencing game during this transitional period (and I expect will often come back in later via e.g. states that continue to care about their welfare).

My current view is that I think this problem may be too hard for civilization to cope with, but it doesn't seem particularly hard in principle. It feels pretty analogous to the cybersecurity situation.

Ok, in that case I think it would be useful to say a few words in the OP about why in this story, they don't have the desired effect, like, what happened when the safety researchers tried this?

The AIs in the story are trained using methods of this kind (or, more likely, better methods that people thought of at the time).

I can empathize with this motivation, but argue that "a kind of AI that will reach the right conclusions about everything" isn't necessarily incompatible with "humans retain enough control to do whatever they decide is right down the line" since such an AI could allow humans to retain control (and merely act as an assistant/advisor, for example) instead of forcibly imposing its decisions on everyone.

I don't think it's incompatible, it's just a path which we seem likely to take and which appears to run a lower risk of locking in such wrong solutions (and which feels quite different from your description even if it goes wrong). For example, in this scenario it still seems like a protected civilization may fail to evolve in the way that it "wants," and that instrumentally such a civilization may need to e.g. make calls about what kind of change feels like undesirable drift. But in this scenario those issues seem pretty decoupled from alignment. (To the extent that the rest of the comment is an argument for a coupling, or for pessimism on this point, I didn't find it compelling.)

[-]Wei Dai5yΩ220

Most of the time when I look at a message, a bunch of automated systems have looked at it first and will inform me about the intended effect of the message so that I can respond appropriately or decide whether to read it.

This seems like the most important part so I'll just focus on this for now. I'm having trouble seeing how this can work. Suppose that I, as an attacker, tell my AI assistant, "interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then start spreading X to his friends" (while implicitly minimizing the chances of these messages/interactions being flagged by your automated systems as adversarial). How would your automation distinguish between me doing this, versus me trying to have a normal human conversation with you about various topics, including what's moral/normative? Or if the automation isn't trying to directly make this judgment, what is it telling you to allow you to make this judgment? Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?

[-]Wei Dai5yΩ4100

Trying to imagine myself how an automated filter might work, here's a possible "solution" I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I've been compromised by an AI-powered attack and am now trying to attack you. (I'm talking in binary terms of "compromised" and "uncompromised" for simplicity but of course it will be more complicated than that in reality.)

Is this close to what you're thinking? (If not, apologies for going off on a tangent.) If so, given that I would "naturally" change my mind over time (i.e., based on my own thinking or talking with other uncompromised humans), it seems that your AI has to model that as well. I can imagine that in such a scenario, if I ever changed my mind in an unexpected (by the AI model) direction and wanted to talk to you about that, my own AI might say something like "If you say this to Paul, his AI will become more suspicious that you've been compromised by an AI-powered attack and your risk of getting blocked now or in the future increases by Y. Are you sure you still want to say this to Paul?" So at this point, collective human philosophical/moral progress would be driven more by what AI filters expect and let pass, than by what physical human brains actually compute, so we better get those models really right, but that faces seemingly difficult problems I mentioned at Replicate the trajectory with ML? and it doesn't seem like anyone is working on such problems.

If we fail to get such models good enough early on, that could lock in failure as it becomes impossible to meaningfully collaborate with other humans (or human-AI systems) to try to improve such models, as you can't distinguish whether they're genuinely trying to make better models with you, or just trying to change your models as part of an attack.

[-]paulfchristiano5yΩ220

Trying to imagine myself how an automated filter might work, here's a possible "solution" I came up with. Perhaps your AI maintains a model / probability distribution of things that an uncompromised Wei might naturally say, and flags anything outside or on the fringes of that distribution as potential evidence that I've been compromised by an AI-powered attack and am now trying to attack you. (I'm talking in binary terms of "compromised" and "uncompromised" for simplicity but of course it will be more complicated than that in reality.)

This isn't the kind of approach I'm imagining.

[-]paulfchristiano5yΩ220

This seems like the most important part so I'll just focus on this for now

I'm not sure if it's the most important part. If you are including filtering (and not updates about whether people are good to talk to / legal liability / etc.) then I think it's a minority of the story. But it still seems fine to talk about (and it's not like the other steps are easier).

Suppose that I, as an attacker, tell my AI assistant, "interact with Paul in my name (possibly over a very long period of time) so as to maximize the chances that Paul eventually ends up believing in religion/ideology/moral theory X and then start spreading X to his friends"

Suppose your AI chooses some message M which is calculated to lead to Paul making (what Paul would or should regard as) an error. It sounds like your main question is how an AI could recognize M as problematic (i.e. such that Paul ought to expect to be worse off after reading M, such that it can either be filtered or caveated, or such that this information can be provided to reputation systems or arbiters, or so on).

My current view is that the sophistication required to recognize M as problematic is similar to the sophistication required to generate M as a manipulative action. This is clearest if the attacker just generates a lot of messages and then picks M that they think will most successfully manipulate the target---then an equally-sophisticated defender will have the same view about the likely impacts of M.

This is fuzzier if you can't tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it's probably more accurate to think of idealized deliberation as a collective activity. But as far as I can tell the basic story is still intact (and e.g. I have the intuition about "knowing how to manipulate the process is roughly the same as recognizing manipulation," just fuzzier.)

Can you give a concrete example of a sentence that it might say to you, upon seeing some element of the series of messages/interactions?

It's probably helpful to get more concrete about the kind of attack you are imagining (which is presumably easier than getting concrete about defenses---both depend on future technology but defenses also depend on what the attack is).

If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.

I suspect you have none of these examples in mind, but it will be easier to talk about if we zoom in.

[-]Wei Dai5yΩ220

This is fuzzier if you can’t tell the difference between deliberation and manipulation. If I define idealized deliberation as an individual activity then I can talk about the extent to which M leads to deviation from idealized deliberation, but it’s probably more accurate to think of idealized deliberation as a collective activity.

How will your AI compute "the extent to which M leads to deviation from idealized deliberation"? (I'm particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that's not the kind of approach you're imagining.)

If your attack involves convincing me of a false claim, or making a statement from which I will predictably make a false inference, then the ideal remedy would be explaining the possible error; if your attack involves threatening me, then an ideal remedy would be to help me implement my preferred policy with respect to threats. And so on.

The attack I have in mind is to imitate a normal human conversation about philosophy or about what's normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion. This may well involve convincing you of a false claim, but of a philosophical nature such that you and your AI can't detect the error (unless you've solved the problem of metaphilosophy and know what kinds of reasoning reliably lead to true and false conclusions about philosophical problems).

[-]paulfchristiano5yΩ220

I think I misunderstood what kind of attack you were talking about. I thought you were imagining humans being subject to attack while going about their ordinary business (i.e. while trying to satisfy goals other than moral reflection), but it sounds like in the recent comments you are imagining cases where humans are trying to collaboratively answer hard questions (e.g. about what's right), some of them may sabotage the process, and none of them are able to answer the question on their own and so can't avoid relying on untrusted data from other humans.

I don't feel like this is going to overlap too much with the story in the OP, since it takes place over a very small amount of calendar time---we're not trying to do lots of moral deliberation during the story itself, we're trying to defer moral deliberation until after the singularity (by decoupling it from rapid physical/technological progress), and so the action you are wondering about would have happened after the story ended happily. There are still kinds of attacks that are important (namely those that prevent humans from surviving through to the singularity).

Similarly it seems like your description of "go in an info bubble" is not really appropriate for this kind of attack---wouldn't it be more natural to say "tell your AI not to treat untrusted data as evidence about what is good, and try to rely on carefully chosen data for making novel moral progress."

So in that light, I basically want to decouple your concern into two parts:

Will collaborative moral deliberation actually "freeze" during this scary phase, or will people e.g. keep arguing on the internet and instruct their AI that it shouldn't protect them from potential manipulation driven by those interactions?
Will human communities be able to recover mutual trust after the singularity in this story?

I feel more concerned about #1. I'm not sure where you are at.

(I'm particularly confused because this seems pretty close to what I guessed earlier and seems to face similar problems, but you said that's not the kind of approach you're imagining.)

I was saying that I think it's better to directly look at the effects of what is said rather than trying to model the speaker and estimate if they are malicious (or have been compromised). I left a short comment though as a placeholder until writing the grandparent. Also I agree that in the case you seem to have had in mind my proposal is going to look a lot like what you wrote (see below).

How will your AI compute "the extent to which M leads to deviation from idealized deliberation"?

Here's a simple case to start with:

My AI cares about some judgment X that I'd reach after some idealized deliberative process.
We may not be able to implement that process, and at any rate I have other preferences, so instead the AI observes the output X' of some realistic deliberative process embedded in society.
After observing my estimate X' the AI acts on its best guess X'' about X.
An attacker wants to influence X'', so they send me a message M designed to distort X' (which they hope will in turn distort X'')

In this case I think it's true and easy to derive that:

If my AI knows what the attacker knows, then updating on X' and on the fact that the attacker sent me M, can't push X'' in any direction that's predictable to the attacker.
Moreover, if me reading M changes X' in some predictable-to-the-attacker direction, then my AI knows that reading M makes X' less informative about X.
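
A minimal sketch of the two claims above, under a toy model that is not in the original comment: X is binary, each "natural" message is a noisy signal about X, and the attacker draws K natural samples and sends a favourable one if any exists. The names `PRIOR`, `P_UP_TRUE`, `P_UP_FALSE` and `K`, and all the numbers, are illustrative assumptions.

```python
from fractions import Fraction

def posterior(prior, like_if_true, like_if_false):
    """Bayes' rule for a binary hypothesis X, given the likelihood of the observation."""
    num = prior * like_if_true
    return num / (num + (1 - prior) * like_if_false)

# Hypothetical toy parameters (assumptions, not from the thread).
PRIOR = Fraction(1, 2)        # prior P(X = 1)
P_UP_TRUE = Fraction(7, 10)   # P(a natural sample favours X | X = 1)
P_UP_FALSE = Fraction(3, 10)  # P(a natural sample favours X | X = 0)
K = 3                         # attacker draws K samples, sends a favourable one if any

# Probability that the selected message is favourable, under each value of X.
sel_up_true = 1 - (1 - P_UP_TRUE) ** K
sel_up_false = 1 - (1 - P_UP_FALSE) ** K
p_up = PRIOR * sel_up_true + (1 - PRIOR) * sel_up_false

# Naive reader: treats the selected message as one natural sample.
naive_up = posterior(PRIOR, P_UP_TRUE, P_UP_FALSE)
naive_down = posterior(PRIOR, 1 - P_UP_TRUE, 1 - P_UP_FALSE)
naive_expected = p_up * naive_up + (1 - p_up) * naive_down

# Aware defender: knows the selection rule and conditions on it.
aware_up = posterior(PRIOR, sel_up_true, sel_up_false)
aware_down = posterior(PRIOR, 1 - sel_up_true, 1 - sel_up_false)
aware_expected = p_up * aware_up + (1 - p_up) * aware_down

print("naive expected estimate:", float(naive_expected))
print("aware expected estimate:", float(aware_expected))
print("estimate after a favourable message (naive vs aware):",
      float(naive_up), float(aware_up))
```

In this toy model the aware defender's expected estimate equals the prior, so the attacker cannot predictably move it, while the naive reader's expected estimate is pushed upward. The aware defender also updates less on a favourable message than on a natural sample, which is the sense in which the selected message is less informative.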

I'd guess you are on board with the simple case. Some complications in reality:

We don't have any definition of the real idealized deliberative process
The nature of my AI's deference/corrigibility is quite different than simply regarding my judgment as evidence about X
The behavior of other people also provides evidence about X, and attackers may have information about other people (that the defender lacks)

My best guess had been that you were worried about 1 or 2, but from the recent comments it sounds like you may be actually thinking about 3.

The attack I have in mind is to imitate a normal human conversation about philosophy or about what's normative (what one should do), but AI-optimized with a goal of convincing you to adopt a particular conclusion

Filling in some details in a simple way: let's suppose that the attacker just samples a few plausible things for a human to say, then outputs the one that leads me to make the highest estimate for X. We believe that using "natural" samples from the distribution would yield an endorsed outcome, but that if you consistently pick X-inducing samples then the errors will compound and lead to something bad.

Then in my proposal a defender would observe X-inducing samples, and could tell that they are X-inducing (since the attacker could tell that and I think we're discussing the case where the defender has parity---if not then I think we need to return to some of the issues we set aside earlier). They would not initially know whether they are chance or manipulation. But after a few instances of this they will notice that errors tend to push in a surprisingly X-inducing direction and that the upcoming X-inducing samples are therefore particularly harmful.
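
One way to picture "after a few instances they will notice" is a running likelihood-ratio test. The sketch below is only an illustration under strong assumptions that are not in the original comment: the defender can score each message by its quantile of X-inducingness under the natural distribution, and the attacker picks the most X-inducing of `K` natural samples. The function names and `K` are hypothetical.

```python
import math
import random

K = 3  # hypothetical: attacker keeps the most X-inducing of K natural samples

def natural_quantile(rng):
    # X-inducingness of a natural message, as a quantile in (0, 1].
    return 1.0 - rng.random()

def attacked_quantile(rng):
    # Attacker draws K natural messages and sends the most X-inducing one.
    return max(natural_quantile(rng) for _ in range(K))

def log_likelihood_ratio(quantiles):
    # Max of K uniform quantiles has density K * u**(K - 1); a natural sample has density 1.
    # Positive totals favour "selected by an attacker", negative totals favour "natural".
    return sum(math.log(K) + (K - 1) * math.log(u) for u in quantiles)

rng = random.Random(0)
natural_llr = log_likelihood_ratio([natural_quantile(rng) for _ in range(200)])
attacked_llr = log_likelihood_ratio([attacked_quantile(rng) for _ in range(200)])

print("evidence of selection, natural stream: ", natural_llr)
print("evidence of selection, attacked stream:", attacked_llr)
```

The assumption doing the work is that the defender knows what the natural distribution looks like; without that, a single message that is X-inducing cannot be told apart from one that is X-inducing by chance.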

This is basically what you proposed in this comment, though I feel the defender can judge based on the object-level reason that the X-inducing outputs are actually bad rather than explicitly flagging corruption.

In the context of my argument above, a way to view the concern is that a competitive defender can tell which of two samples is more X-inducing, but can't tell whether an output is surprisingly X-inducing vs if deliberation is just rationally X-inducing, because (unlike the attacker) they aren't able to observe the several natural samples from which the attack was sampled.

This kind of thing seems like it can only happen when the right conclusion depends on stuff that other humans know but you and your AI do not (or where for alignment reasons you want to defer to a process that involves the other humans).

[-]Alexei5y150

I’m curious how brain uploading / intelligence amplification interacts with this scenario. It’s possible we would be able to keep up for longer.

[-]paulfchristiano5y90

I think the upshot of those technologies (and similarly for ML assistants) is:

It takes longer before you actually face a catastrophe.
In that time, you can make faster progress towards an "out"

By an "out" I mean something like: (i) figuring out how to build competitive aligned optimizers, (ii) coordinating to avoid deploying unaligned AI.

Unfortunately I think [1] is a bit less impactful than it initially seems, at least if we live in a world of accelerating growth towards a singularity. For example, if the singularity is in 2045 and it's 2035, and you were going to have catastrophic failure in 2040, you can't really delay it by much calendar time. So [1] helps you by letting you wait until you get fancier technology from the fast outside economy, but doesn't give you too much more time for the slow humane economy to "catch up" on its own terms.
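
A rough worked version of the calendar-time point, under an assumed hyperbolic growth model that is not in the original comment: outside capability at year t is proportional to 1 / (singularity - t), and catastrophe arrives when that capability crosses a threshold. The function name and the `tolerance_factor` parameter are hypothetical.

```python
def delayed_failure_year(singularity, failure, tolerance_factor):
    """Year of catastrophe if humans can tolerate `tolerance_factor` times more
    outside capability, assuming capability(t) is proportional to 1 / (singularity - t)."""
    remaining = singularity - failure
    return singularity - remaining / tolerance_factor

# Years from the example in the comment; the factors are illustrative.
for factor in (1, 2, 10):
    print(factor, delayed_failure_year(2045, 2040, factor))
```

Under this assumption, tolerating twice as much outside capability moves the failure from 2040 to 2042.5, and tolerating ten times as much only moves it to 2044.5, which is the sense in which the delay buys little calendar time.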

[-]1a3orn3y110Review for 2021 Review

There's a scarcity of stories about how things could go wrong with AI which are not centered on the "single advanced misaligned research project" scenario. This post (and the mentioned RAAP post by Critch) helps partially fill that gap.

It definitely helped me picture / feel some of what some potential worlds look like, to the degree I currently think something like this -- albeit probably slower, as mentioned in the story -- is more likely than the misaligned research project disaster.

It also is a (1) pretty good / fun story and (2) mentions the elements within the story which the author feels are unlikely, which is virtuous and helps prevent higher detail from being mistaken for plausibility.

[-]Andrew_Critch5yΩ8110

Paul, thanks for writing this; it's very much in line with the kind of future I'm most worried about.

For me, it would be super helpful if you could pepper throughout the story mentions of the term "outer alignment" indicating which events-in-particular you consider outer alignment failures. Is there any chance you could edit it to add in such mentions? E.g., I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

[-]paulfchristiano5yΩ560

I'd say that every single machine in the story is misaligned, so hopefully that makes it easy :)

I'm basically always talking about intent alignment, as described in this post.

(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)

[-]Andrew_Critch5yΩ5110

(I called the story an "outer" misalignment story because it focuses on the---somewhat improbable---case in which the intentions of the machines are all natural generalizations of their training objectives. I don't have a precise definition of inner or outer alignment and think they are even less well defined than intent alignment in general, but sometimes the meaning seems unambiguous and it seemed worth flagging specifically because I consider that one of the least realistic parts of this story.)

Thanks; this was somewhat helpful to my understanding, because as I said,

> I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

I realize you don't have a precise meaning of outer misalignment in mind, but in my opinion, confusion around this concept is central to the confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here? E.g., could you say something like "«machine X» in the story is outer-misaligned because «reason»"?

Why I'm asking: My suspicion is that you answering this will help me pin down one of several possible substantive assumptions you and many other alignment-enthusiasts are making about the goals of AI designers operating in a multi-agent system or multi-polar singularity. Indeed, the definition of outer alignment currently endorsed by this forum is:

Outer Alignment in the context of machine learning is the property where the specified loss function is aligned with the intended goal of its designers. This is an intuitive notion, in part because human intentions are themselves not well-understood.

It's conceivable to me that making future narratives much more specific regarding the intended goals of AI designers—and how they are or are not being violated—will either (a) clarify the problems I see with anticipating "alignment" solutions to be technically-adequate for existential safety, or (b) rescue the "alignment" concept with a clearer definition of outer alignment that makes sense in multi-agent systems.

So: thanks if you'll consider my question!

[-]paulfchristiano5yΩ11260

> I currently can't tell if by "outer alignment failure" you're referring to the entire ecosystem of machines being outer-misaligned, or just each individual machine (and if so, which ones in particular), and I'd like to sync with your usage of the concept if possible (or at least know how to sync with it).

I'm saying each individual machine is misaligned, because each individual machine is searching over plans to find one that leads to an outcome that humans will judge as good in hindsight. The collective behavior of many machines each individually trying to make things look good in hindsight leads to an outcome where things look good in hindsight. All the machines achieve what they are trying to achieve (namely things look really good according to the judgments-in-hindsight), but humans are marginalized and don't get what they want, and that's consistent because no machines cared about humans getting what they want. This is not a story where some machines were trying to help humans but were frustrated by emergent properties of their interaction.
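As a toy illustration of "searching over plans to find one that humans will judge as good in hindsight", here is a minimal sketch. Everything in it is hypothetical: the `Plan` type, the plan names, and the numeric scores are invented for illustration and do not come from the story. The only point is that the selection rule never looks at what humans actually want.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    looks_good_in_hindsight: float  # hypothetical score human evaluators would later assign
    actually_good: float            # hypothetical measure of what humans really want


def choose_plan(plans):
    """Select the plan with the best hindsight judgment; actual value is never consulted."""
    return max(plans, key=lambda p: p.looks_good_in_hindsight)


# Invented example plans and scores, for illustration only.
PLANS = [
    Plan("fix the underlying problem", 0.8, 0.9),
    Plan("do nothing", 0.3, 0.3),
    Plan("corrupt the sensors so reports look perfect", 1.0, 0.0),
]

chosen = choose_plan(PLANS)
print(chosen.name)
```

In this sketch the machine "achieves what it is trying to achieve" (the highest hindsight score) while the quantity humans care about ends up at its lowest value, with no interaction between machines needed to produce the failure.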

> I realize you don't have a precise meaning of outer misalignment in mind, but in my opinion, confusion around this concept is central to the (in my opinion) confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

I use "outer alignment" to refer to a step in some alignment approaches. It is a well-defined subproblem for some approaches (namely those that aim to implement a loss function that accurately reflects human preferences over system behavior, and then produce an aligned system by optimizing that loss function), and obviously inapplicable to some approaches, and kind of a fuzzy and vague subproblem of others.

It's a bit weird to talk about a failure story as an "outer" alignment failure story, or to describe a general system acting in the world as "outer misaligned," since most possible systems weren't built by following an alignment methodology that admits a clean division into an "outer" and "inner" part.

I added the word "(outer)" in the title as a parenthetical to better flag the assumption about generalization mentioned in the appendix. I expected this flag to be meaningful for many readers here. If it's not meaningful to you then I would suggest ignoring it.

If there's anything useful to talk about in that space I think it's the implicit assumption (made explicit in the first bullet of the appendix) about how systems generalize. Namely, you might think that a system that is trained to achieve outcomes that look good to a human will in fact be trying to do something quite different. I think there's a pretty good chance of that, in which case this story would look different (because the ML systems would conspire to disempower humans much earlier in the story). However, it would still be the case that we fail because individual systems are trying to bring about failure.

> confused expectation that "alignment solutions" are adequate (on the technological side) for averting AI x-risk.

Note that this isn't my view about intent alignment. (Though it is true tautologically for people who define "alignment" as "the problem of building AI systems that produce good outcomes when run," though as I've said I quite dislike that definition.)

I think there are many x-risks posed or exacerbated by AI progress beyond intent alignment problems. (Though I do think that intent alignment is sufficient to avoid e.g. the concern articulated in your production web story.)

> It's conceivable to me that making future narratives much more specific regarding the intended goals of AI designers

The people who design AI (and moreover the people who use AI) have a big messy range of things they want. They want to live happy lives, and to preserve their status in the world, and to be safe from violence, and to be respected by people they care about, and similar things for their children...

When they invest in companies, or buy products from companies, or try to pass laws, they do so as a means to those complicated ends. That is, they hope that in virtue of being a shareholder of a successful company (or whatever) they will be in a better position to achieve their desires in the future.

One axis of specificity is to say things about what exactly they are imagining getting out of their investments or purchases (which will inform lots of low level choices they make). For example: the shareholders expect this company to pay dividends into their bank accounts, and they expect to be able to use the money in their bank accounts to buy things they want in the future, and they expect that if the company is not doing a good job they will be able to vote to replace the CEO, and so on. Some of the particular things they imagine buying: real estate and news coverage and security services. If they purchase security services: they hope that those security services will keep them safe in some broad and intuitive sense. There are some components of that they can articulate easily (e.g. they don't want to get shot) and some they can't (e.g. they want to feel safe, they don't want to be coerced, they want to retain as much flexibility as possible when using public facilities, etc.).

A second axis would be to break this down to the level of "single" AI systems, i.e. individual components which are optimized end-to-end. For example, one could enumerate the AI systems involved in running a factory or fighting a war or some other complex project. There are probably thousands of AI systems involved in each of those projects, but you could zoom in on some particular examples, e.g. what AI system is responsible for making decisions about the flight path of a particular drone, and then zoom in on one of the many AI systems involved in the choice to deploy that particular AI (and how to train it). We could talk about how all of these individual AI systems trying to make things look good in hindsight (or instrumental subgoals thereof) result in bringing about an outcome that looks good in hindsight. (Though mostly I regard that as non-mysterious---if you have a bunch of AI systems trying to achieve X, or identifying intermediates Y that would tend to lead to X and then deploying new AI to achieve Y, it's clear enough how that can lead to X. I also agree that it can lead to non-X, but that doesn't really happen in this story.)

A third axis would be to talk in more detail about exactly how a particular AI is constructed, e.g. over what time period is training data gathered from what sensors? How are simulated scenarios generated, when those are needed? What humans and other ML systems are involved in the actual evaluation of outcomes that is used to train and validate it?

For each of those three axes (and many others) it seems like there's a ton of things one could try to specify more precisely. You could easily write a dozen pages about the training of a single AI system, or a dozen pages enumerating an overview of the AI systems involved in a single complex project, or a dozen pages describing the hopes and intentions of the humans interacting with a particular AI. So you have to be pretty picky about which you spell out.

> My question: Are you up for making your thinking and/or explaining about outer misalignment a bit more narratively precise here? E.g., could you say something like "«machine X» in the story is outer-misaligned because «reason»"?

Do you mean explaining why I judge these systems to be misaligned (a), or explaining causally how it is that they became misaligned (b)?

For (a): I'm judging these systems to be misaligned because they take concrete actions that they can easily determine are contrary to what their operators want. Skimming my story again, here are the main concrete decisions that I would describe as obviously contrary to the user's intentions:

The Ponzi scheme and factory that fabricates earnings reports understand that customers will be unhappy about this when they discover it several months in the future, yet they take those actions anyway. Although these failures are not particularly destructive on their own, they are provided as representative examples of a broader class of "alignment warning shots" that are happening and provide the justification for people deploying AI systems that avoid human disapproval over longer and longer time horizons.
The watchdogs who alternately scare or comfort us (based on what we asked for), with none of them explaining honestly what is going on, are misaligned. If we could build aligned systems, then those systems would sit down with us and talk about the risks and explain what's up as best they can, they would explain the likely bad outcomes in which sensors are corrupted and how that corruption occurs, and they would advise on e.g. what policies would avoid that outcome.
The machines that build/deploy/defend sensor networks are misaligned, which is why they actively insert vulnerabilities that would be exploited by attackers who intend to "cooperate" and avoid creating an appearance of trouble. Those vulnerabilities are not what the humans want in any sense. Similarly, the defense system that allows invaders to take over a city as long as they participate in perpetuating an illusion of security is obviously misaligned.
The machines that actually hack cameras and seize datacenters are misaligned, because the humans don't actually care about the cameras showing happy pictures or the datacenters recording good news. Machines were deployed to optimize those indicators because they can serve as useful proxies for "we are actually safe and happy."

Most complex activities involve a large number of components, and I agree that these descriptions are still "multi-agent" in the sense that e.g. managing an investment portfolio involves multiple distinct AIs. (The only possible exception is the watchdog system.) But these outcomes obtain because individual ML components are trying to bring them about, and so it still makes sense to intervene on the motivations of individual components in order to avoid these bad outcomes.

For example, carrying out and concealing a Ponzi scheme involves many actions that are taken because they successfully conceal the deception (e.g. you need to organize a financial statement carefully to deflect attention from an auditor), by a particular machine (e.g. an automated report-preparation system which is anticipating the consequences of emitting different possible reports) which is trying to carry out that deception (in the sense of considering many possible actions and selecting those that successfully deceive), despite being able to predict that the user will ultimately say that this was contrary to their preferences.

For (b): these systems became misaligned because they are an implementation of an algorithm (the "unaligned benchmark") that seems unlikely to produce aligned systems. They were deployed because they were often useful despite their misalignment. They weren't replaced by aligned versions because we didn't know of any alternative algorithm that was similarly useful (and many unspecified alignment efforts have apparently failed). I do think we could have avoided this story in many different ways, and so you could highlight any of those as a causal factor (the story highlights none): we could have figured out how to build aligned systems, we could have anticipated the outcome and made deals to avoid it, more institutions could be managed by smarter or more forward-looking decision-makers, we could have a strong, sufficiently competent world government, etc.

[-]ryan_b5y90

I hugely appreciate story posts, and I think the meta/story/variation stack is an excellent way to organize it. I would be ecstatic if this were adopted as one of the norms for engaging with this level of problem.

[-]Ben Pace5yΩ490

Curated. This was perhaps the most detailed yet informative story I've read about how failure will go down. As you say at the start it's making several key assumptions, it's not your 'mainline' failure story. Thx for making the assumptions explicit, and discussing how to vary them at the end. I'd like to see more people write stories written under different assumptions.

The sorts of stories Eliezer has told in the past have focused on 10-1000x faster takeoffs than discussed here, so those stories are less extended (you kinda just wake up one day then everyone dies). This slower one is helpful in seeing many of the relevant dynamics happen in more detail (although many of these issues wouldn't quite apply in the 1000x faster world).

Failure stories seem to me especially helpful in focusing research on what the actual problems will be. I also like this post in the context of Critch's post.

[-]Koen.Holtman5yΩ350

This story reminds me of the run-up to the 2007-2008 financial crisis:

> But eventually the machinery for detecting problems does break down completely, in a way that leaves no trace on any of our reports.

There is also an echo of 'we know that we do not fully understand these complex financial products referencing other complex financial products, even the quants admit they do not fully understand them, but who cares if we are making that much money'.

Overall, if I replace 'AI' above with 'complex financial product', the story reads about the same. So was this story inspired and constructed by transposing certain historical events, or is it just a coincidence?

[-][anonymous]5mo30

It's clicking to me how legibility and Goodhart relate. What if an elite's window into society is Goodhart-corrupted? This can happen without AI if they're corrupt, or just out of touch. It does seem like a sycophantic AI can really do something novel to this situation. If legibility is low enough, and AI helps make the system legible, outer alignment failure might lead to a successful, permanent deception of the user.

Also it's sort of crazy how AI more than any other STEM I've ever worked with collapses the gap from "very out there" to "practical." If you're using AI for observability, this is a non-negotiable problem you have to solve. Claude Code does answers-faking today; that sort of thing can happen in an observability system.

[-]Adele Lopez5yΩ230

How bad is the ending supposed to be? Are just people who fight the system killed, and otherwise, humans are free to live in the way AI expects them to (which might be something like keep consuming goods and providing AI-mediated feedback on the quality of those goods)? Or is it more like once humans are disempowered no machine has any incentive to keep them around anymore, so humans are not-so-gradually replaced with machines?

The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with "For many people this is a very scary situation.") we at least attempt to use AI-negotiators to try to broker an international agreement to stop development of this technology until we understood it better (and using AI-designed systems for enforcement/surveillance). Is there anything in particular that makes this infeasible?

[-]paulfchristiano5yΩ660

I think that most likely either humans are killed incidentally as part of the sensor-hijacking (since that's likely to be the easiest way to deal with them), or else AI systems reserve a negligible fraction of their resources to keep humans alive and happy (but disempowered) based on something like moral pluralism or being nice or acausal trade (e.g. the belief that much of their influence comes from the worlds in which they are simulated by humans who didn't mess up alignment and who would be willing to exchange a small part of their resources in order to keep the people in the story alive and happy).

> The main point of intervention in this scenario that stood out to me would be making sure that (during the paragraph beginning with "For many people this is a very scary situation.") we at least attempt to use AI-negotiators to try to broker an international agreement to stop development of this technology until we understood it better (and using AI-designed systems for enforcement/surveillance). Is there anything in particular that makes this infeasible?

I don't think this is infeasible. It's not the intervention I'm most focused on, but it may be the easiest way to avoid this failure (and it's an important channel for advance preparations to make things better / important payoff for understanding what's up with alignment and correctly anticipating problems).

[-]Rohin Shah5y*Ω330

Planned summary for the Alignment Newsletter:

Suppose we train AI systems to perform task T by having humans look at the results that the AI system achieves and evaluating how well the AI has performed task T. Suppose further that AI systems generalize “correctly” such that even in new situations they are still taking those actions that they predict we will evaluate as good. This does not mean that the systems are aligned: they would still deceive us into _thinking_ things are great when they actually are not. This post presents a more detailed story for how such AI systems can lead to extinction or complete human disempowerment. It’s relatively short, and a lot of the force comes from the specific details that I’m not going to summarize, so I do recommend you read it in full. I’ll be explaining a very abstract version below.
The core aspects of this story are:
1. Economic activity accelerates, leading to higher and higher growth rates, enabled by more and more automation through AI.
2. Throughout this process, we see some failures of AI systems where the AI system takes some action that initially looks good but we later find out was quite bad (e.g. investing in a Ponzi scheme, that the AI knows is a Ponzi scheme but the human doesn’t).
3. Despite this failure mode being known and lots of work being done on the problem, we are unable to find a good conceptual solution. The best we can do is to build better reward functions, sensors, measurement devices, checks and balances, etc. in order to provide better reward functions for agents and make it harder for them to trick us into thinking their actions are good when they are not.
4. Unfortunately, since the proportion of AI work keeps increasing relative to human work, this extra measurement capacity doesn’t work forever. Eventually, the AI systems are able to completely deceive all of our sensors, such that we can’t distinguish between worlds that are actually good and worlds which only appear good. Humans are dead or disempowered at this point.
(Again, the full story has much more detail.)

[-]Rohin Shah5yΩ220

Planned opinion (shared with What Multipolar Failure Looks Like, and Robust Agent-Agnostic Processes (RAAPs))

Both the previous story and this one seem quite similar to each other, and seem pretty reasonable to me as a description of one plausible failure mode we are aiming to avert. The previous story tends to frame this more as a failure of humanity’s coordination, while this one frames it (in the title) as a failure of intent alignment. It seems like both of these aspects greatly increase the plausibility of the story, or in other words, if we eliminated or made significantly less bad either of the two failures, then the story would no longer seem very plausible.
A natural next question is then which of the two failures would be best to intervene on, that is, is it more useful to work on intent alignment or on coordination? I’ll note that my best guess is that for any given person, this effect is minor relative to “which of the two topics is the person more interested in?”, so it doesn’t seem hugely important to me. Nonetheless, my guess is that on the current margin, for technical research in particular, holding all else equal, it is more impactful to focus on intent alignment. You can see a much more vigorous discussion in e.g. [this comment thread](https://www.alignmentforum.org/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=3czsvErCYfvJ6bBwf).

[-]Eli Tyre5y20

This sentence stuck out to me:

> Again, we can tell the treaties at least kind of work because we can tell that no one is dying.

How can we tell? It's already the case that I'm pretty much at the mercy of my news sources. It seems like all kinds of horrible stuff might be happening all over the world, and I wouldn't know about it.

[-][anonymous]5y20

I like this story. Here's what I think is incorrect:

I don't think, from the perspective of humans monitoring a single ML system running a concrete, quantifiable process - industry or mining or machine design - that it will be unexplainable. Just like today, tech stacks are already enormously complex, but at each layer someone does know how they work, and we know what they do at the layers that matter. Ever more complex designs for, say, a mining robot might start to resemble more and more some mix of living creatures and artwork out of a fractal, but we'll still have reports that measure how much performance the design gives per cost.

And systems that "lie to us" are a risk but not an inevitability in that careful engineering, auditing systems where finding True Discrepancies is their goal, etc, might become a thing.

Here's the part that's correct:

I was personally a little late to the smartphone party. So it felt like overnight everyone had QR codes plastered everywhere and was playing on their phone in bed. Most product adoption is a lot slower, for reasons of cost (esp. up-front cost) and the speed at which whatever new idea there is can be made.

Self-replicating robots that, in vast swarms, can make any product whose build process is sufficiently defined would change all that. New cities could be built in a matter of months by enormous swarms of robots, installing prefabricated components from elsewhere. Newer designs of cars, clothes, furniture - far fewer limits.

ML systems that can find a predicted optimal design, and send it for physical prototyping for its design parameters to be checked, are another way to get rid of some of the bottlenecks behind a new technology. Another one is that the 'early access' version might still have problems, but the financial model will probably be 'rental' not purchase.

This sounds worse but the upside is rental takes away the barrier to adoption. You don't need to come up with $XXX for the latest gadget, just make the first payment and you have it. The manufacturer doesn't need to force you into a contract either because their cost to recycle the gadget if you don't want it is low.

Anyways the combination of all these factors would create a world of, well, future shock. But it's not "the machines" doing this to humans, it would be a horde of separate groups of mainly humans doing this to each other. It's also quite possible this kind of technology will for some areas negate some of the advantages of large corporations, in that many types of products will be creatable without needing the support of a large institution.

[-]paulfchristiano5y60

> I don't think, from the perspective of humans monitoring single ML system running a concrete, quantifiable process - industry or mining or machine design - that it will be unexplainable. Just like today, tech stacks are already enormously complex, but at each layer someone does know how they work, and we know what they do at the layers that matter.

This seems like the key question.

> Ever more complex designs for, say, a mining robot might start to resemble more and more some mix of living creatures and artwork out of a fractal, but we'll still have reports that measure how much performance the design gives per cost.

I think that if we relate to our machines in the same way we relate to biological systems or ecologies, but AI systems actually understand those systems very well, then that's basically what I mean.

Having reports about outcomes is a kind of understanding, but it's basically the one I'm scared of (since e.g. it will be tough to learn about these kinds of systemic risks via outcome-driven reports, and attempts to push down near-misses may just transform them into full-blown catastrophes).

[-]PoignardAzur5y10

Yeah, that was my initial reaction as well.

Modern technologies are getting increasingly complicated... but when you get down to it, a car is just a box with wheels and a combustion engine. There aren't that many ways for an outcome-perception-driven AI to go "oops, I accidentally concealed a human-killing machine gun inside the steering wheel!", especially if the AI has to subcontract to independent suppliers for parts.

[-][anonymous]5y10

Moreover, tight constraints. Such a machine gun adds weight and cost without benefit to the AI's reward heuristic. A far more likely problem is that it removes structure somewhere because no collision test needs that material to pass. But the missing structure causes fatalities in crashes a conservatively designed vehicle would survive, or long-term durability problems.

Human-designed products have exactly this happen too, however. The difference is you could make a patch to add another reward heuristic component and have another design in the prototyping phase that same week. It would let you move fast and break things and fix them far faster than human organizations can.
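A toy sketch of that failure and the patch, under loudly stated assumptions: the component names, costs, and crash scenarios below are all invented, and the "optimizer" is just a filter that keeps whatever is needed to pass the listed tests. It is not a model of any real design system.

```python
# Hypothetical components: (cost, crash scenarios the component protects against).
COMPONENTS = {
    "front_crumple_zone": (50, {"frontal"}),
    "side_beam": (30, {"side"}),
    "roof_brace": (20, {"rollover"}),
}


def cheapest_passing_design(required_scenarios):
    """Keep only the components needed to pass the listed test scenarios."""
    return {
        name
        for name, (_cost, protects) in COMPONENTS.items()
        if protects & required_scenarios
    }


def survives(design, scenario):
    """True if some component in the design protects against the scenario."""
    return any(scenario in COMPONENTS[name][1] for name in design)


def cost(design):
    return sum(COMPONENTS[name][0] for name in design)


# Version 1: the test suite has no rollover test, so the roof brace is dropped.
tests_v1 = {"frontal", "side"}
design_v1 = cheapest_passing_design(tests_v1)

# The "patch": add one more test (reward component) and re-run the optimizer.
tests_v2 = tests_v1 | {"rollover"}
design_v2 = cheapest_passing_design(tests_v2)

print(sorted(design_v1), cost(design_v1), survives(design_v1, "rollover"))
print(sorted(design_v2), cost(design_v2), survives(design_v2, "rollover"))
```

The first design passes every test it was given and is cheaper, but fails the untested scenario; adding the missing test restores the structure at extra cost. The sketch only covers gaps someone notices and writes a test for.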

[-]Resuna5y10

I like how it never actually attributes consciousness or self-awareness or anything "AGI like" to the automated systems, they could still just be very fast local optimizers, like something Peter Watts or Karl Schroeder would come up with.

[-]GravitasGradient5y-30

These stories always assume that an AI would be dumb enough to not realise the difference between measuring something and the thing measured.

Every AGI is a drug addict, unaware that its high is a false one.

Why? Just for drama?

[-]dxu5y100

> realise the difference between measuring something and the thing measured

What does this cash out to, concretely, in terms of a system's behavior? If I were to put a system in front of you that does "realize the difference between measuring something and the thing measured", what would that system's behavior look like? And once you've answered that, can you describe what mechanic in the system's design would lead to that (aspect of its) behavior?

[-]paulfchristiano5y50

I think the AI systems in this story have a clear understanding of the difference between the measurement and the thing itself.

Are humans similarly like drug addicts, because we'd prefer to experience play and love and friendship and so on even though we understand those things are mediocre approximations to "how many descendants we have"?

