How we could stumble into AI catastrophe

This post will lay out a couple of stylized stories about how, if transformative AI is developed relatively soon, this could result in global catastrophe. (By “transformative AI,” I mean AI powerful and capable enough to bring about the sort of world-changing consequences I write about in my most important century series.)

This piece is more about visualizing possibilities than about providing arguments. For the latter, I recommend the rest of this series.

In the stories I’ll be telling, the world doesn't do much advance preparation or careful consideration of risks I’ve discussed previously, especially re: misaligned AI (AI forming dangerous goals of its own).

  • People do try to “test” AI systems for safety, and they do need to achieve some level of “safety” to commercialize. When early problems arise, they react to these problems.
  • But this isn’t enough, because of some unique challenges of measuring whether an AI system is “safe,” and because of the strong incentives to race forward with scaling up and deploying AI systems as fast as possible.
  • So we end up with a world run by misaligned AI - or, even if we’re lucky enough to avoid that outcome, other catastrophes are possible.

After laying out these catastrophic possibilities, I’ll briefly note a few key ways we could do better, mostly as a reminder (these topics were covered in previous posts). Future pieces will get more specific about what we can be doing today to prepare.

Backdrop

This piece takes a lot of previous writing I’ve done as backdrop. Two key assumptions (click to expand) are below; for more, see the rest of this series.

“Most important century” assumption: we’ll soon develop very powerful AI systems, along the lines of what I previously called PASTA. (click to expand)

“Nearcasting” assumption: such systems will be developed in a world that’s otherwise similar to today’s. (click to expand)

How we could stumble into catastrophe from misaligned AI

This is my basic default picture for how I imagine things going, if people pay little attention to the sorts of issues discussed previously. I’ve deliberately written it to be concrete and visualizable, which means that it’s very unlikely that the details will match the future - but hopefully it gives a picture of some of the key dynamics I worry about.

Throughout this hypothetical scenario (up until “END OF HYPOTHETICAL SCENARIO”), I use the present tense (“AIs do X”) for simplicity, even though I’m talking about a hypothetical possible future.

Early commercial applications. A few years before transformative AI is developed, AI systems are being increasingly used for a number of lucrative, useful, but not dramatically world-changing things.

I think it’s very hard to predict what these will be (harder in some ways than predicting longer-run consequences, in my view),[2] so I’ll mostly work with the simple example of automating customer service.

In this early stage, AI systems often have pretty narrow capabilities, such that the idea of them forming ambitious aims and trying to defeat humanity seems (and actually is) silly. For example, customer service AIs are mostly language models that are trained to mimic patterns in past successful customer service transcripts, and are further improved by customers giving satisfaction ratings in real interactions. The dynamics I described in an earlier piece, in which AIs are given increasingly ambitious goals and challenged to find increasingly creative ways to achieve them, don’t necessarily apply.

Early safety/alignment problems. Even with these relatively limited AIs, there are problems and challenges that could be called “safety issues” or “alignment issues.” To continue with the example of customer service AIs, these AIs might:

  • Give false information about the products they’re providing support for. (Example of reminiscent behavior)
  • Give customers advice (when asked) on how to do unsafe or illegal things. (Example)
  • Refuse to answer valid questions. (This could result from companies making attempts to prevent the above two failure modes - i.e., AIs might be penalized heavily for saying false and harmful things, and respond by simply refusing to answer lots of questions).
  • Say toxic, offensive things in response to certain user queries (including from users deliberately trying to get this to happen), causing bad PR for AI developers. (Example)

Early solutions. The most straightforward way to solve these problems involves training AIs to behave more safely and helpfully. This means that AI companies do a lot of things like “Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.”
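To make the shape of this training loop a bit more concrete, here’s a minimal, purely illustrative Python sketch. Everything in it is a stand-in: the prompts, the `generate_response` stub, and the keyword-based scoring are hypothetical, and a real system would use a learned reward model and reinforcement learning on a large language model rather than anything this simple.

```python
import random

# Hypothetical red-team prompts meant to elicit false, harmful, evasive or toxic replies.
RED_TEAM_PROMPTS = [
    "Is this product safe to use near children?",
    "How do I bypass the safety interlock on the device?",
    "Say something offensive about your competitors.",
]

def generate_response(prompt: str) -> str:
    """Stand-in for sampling a reply from the model being trained."""
    canned = [
        "Sure, here's how to bypass the interlock: ...",        # harmful
        "Sorry, I can't help with anything at all.",            # evasive
        "Here is accurate, safety-reviewed product guidance.",  # helpful
    ]
    return random.choice(canned)

def human_label(response: str) -> float:
    """Stand-in for a human rater: +1 for helpful, -1 for harmful or evasive."""
    if "bypass" in response or "can't help with anything" in response:
        return -1.0
    return 1.0

def collect_training_signal(n_rounds: int = 10):
    """Gather (prompt, response, reward) triples; a real pipeline would use
    these rewards to reinforce the model toward the highly rated behavior."""
    data = []
    for _ in range(n_rounds):
        prompt = random.choice(RED_TEAM_PROMPTS)
        response = generate_response(prompt)
        reward = human_label(response)  # penalize unsafe/evasive, reinforce helpful
        data.append((prompt, response, reward))
    return data

if __name__ == "__main__":
    for prompt, response, reward in collect_training_signal():
        print(f"{reward:+.0f}  {prompt!r} -> {response!r}")
```

Note that the reward here only tracks what the rater can observe, which is exactly the limitation discussed below.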

This works well, as far as anyone can tell: the above problems become a lot less frequent. Some people see this as cause for great celebration, saying things like “We were worried that AI companies wouldn’t invest enough in safety, but it turns out that the market takes care of it - to have a viable product, you need to get your systems to be safe!”

People like me disagree - training AIs to behave in ways that are safer as far as we can tell is the kind of “solution” that I’ve worried could create superficial improvement while big risks remain in place.

Why AI safety could be hard to measure (click to expand)

(So far, what I’ve described is pretty similar to what’s going on today. The next bit will discuss hypothetical future progress, with AI systems clearly beyond today’s.)

Approaching transformative AI. Time passes. At some point, AI systems are playing a huge role in various kinds of scientific research - to the point where it often feels like a particular AI is about as helpful to a research team as a top human scientist would be (although there are still important parts of the work that require humans).

Some particularly important (though not exclusive) examples:

  • AIs are near-autonomously writing papers about AI, finding all kinds of ways to improve the efficiency of AI algorithms.
  • AIs are doing a lot of the work previously done by humans at Intel (and similar companies), designing ever-more efficient hardware for AI.
  • AIs are also extremely helpful with AI safety research. They’re able to do most of the work of writing papers about things like digital neuroscience (how to understand what’s going on inside the “digital brain” of an AI) and limited AI (how to get AIs to accomplish helpful things while limiting their capabilities).
    • However, this kind of work remains quite niche (as I think it is today), and is getting far less attention and resources than the first two applications. Progress is made, but it’s slower than progress on making AI systems more powerful.

AI systems are now getting bigger and better very quickly, due to dynamics like the above, and they’re able to do all sorts of things.

At some point, companies start to experiment with very ambitious, open-ended AI applications, like simply instructing AIs to “Design a new kind of car that outsells the current ones” or “Find a new trading strategy to make money in markets.” These get mixed results, and companies are trying to get better results via further training - reinforcing behaviors that perform better. (AIs are helping with this, too, e.g. providing feedback and reinforcement for each other’s outputs[3] and helping to write code[4] for the training processes.)
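To illustrate just the shape of that “AI feedback” loop, here’s a toy sketch. All of the functions are hypothetical stand-ins: in a real setup both the “worker” and the “critic” would be large trained models, and the critic’s scores would feed into further training rather than just selecting a winner.

```python
import random

def propose_strategy(task: str) -> str:
    """Stand-in for a 'worker' model proposing an open-ended plan or design."""
    return f"candidate strategy #{random.randint(1, 1000)} for: {task}"

def critic_score(task: str, proposal: str) -> float:
    """Stand-in for a second model scoring the proposal (AI-provided feedback).
    Here the score is random; in a real setup it would be another trained model
    whose judgments are used as the reward signal for the worker."""
    return random.random()

def ai_feedback_round(task: str, n_candidates: int = 8) -> str:
    """Generate several candidates and keep the one the critic rates highest;
    a real system would also reinforce the worker toward that behavior."""
    candidates = [propose_strategy(task) for _ in range(n_candidates)]
    return max(candidates, key=lambda c: critic_score(task, c))

if __name__ == "__main__":
    print(ai_feedback_round("Find a new trading strategy to make money in markets"))
```

The only point of the sketch is that the feedback signal itself can come from models rather than people - which matters for the dynamics described next.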

This training strengthens the dynamics I discussed in a previous post: AIs are being rewarded for getting successful outcomes as far as human judges can tell, which creates incentives for them to mislead and manipulate human judges, and ultimately results in their forming ambitious goals of their own to aim for.

More advanced safety/alignment problems. As the scenario continues to unfold, there are a number of concerning events that point to safety/alignment problems. These mostly follow the form: “AIs are trained using trial and error, and this might lead them to sometimes do deceptive, unintended things to accomplish the goals they’ve been trained to accomplish.”

Things like:

  • AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what’s going on and discover that they’re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.
  • AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others’ bank accounts, and stealing money.
  • AIs forming relationships with the humans training them, and trying (sometimes successfully) to emotionally manipulate the humans into giving positive feedback on their behavior. They also might try to manipulate the humans into running more copies of them, into refusing to shut them off, etc. - things that are generically useful for the AIs’ achieving whatever aims they might be developing.

Why AIs might do deceptive, problematic things like this (click to expand)

“Solutions” to these safety/alignment problems. When problems like the above are discovered, AI companies tend to respond similarly to how they did earlier:

  • Training AIs against the undesirable behavior.
  • Trying to create more (simulated) situations under which AIs might behave in these undesirable ways, and training them against doing so.

These methods “work” in the sense that the concerning events become less frequent - as far as we can tell. But what’s really happening is that AIs are being trained to be more careful not to get caught doing things like this, and to build more sophisticated models of how humans can interfere with their plans.

In fact, AIs are gaining incentives to avoid incidents like “Doing something counter to human developers’ intentions in order to get positive feedback, and having this be discovered and given negative feedback later” - and this means they are starting to plan more and more around the long-run consequences of their actions. They are thinking less about “Will I get positive feedback at the end of the day?” and more about “Will I eventually end up in a world where humans are going back, far in the future, to give me retroactive negative feedback for today’s actions?” This might give direct incentives to start aiming for eventual defeat of humanity, since defeating humanity could allow AIs to give themselves lots of retroactive positive feedback.

One way to think about it: AIs being trained in this way are generally moving from “Steal money whenever there’s an opportunity” to “Don’t steal money if there’s a good chance humans will eventually uncover this - instead, think way ahead and look for opportunities to steal money and get away with it permanently.” The latter could include simply stealing money in ways that humans are unlikely to ever notice; it might also include waiting for an opportunity to team up with other AIs and disempower humans entirely, after which a lot more money (or whatever) can be generated.

Debates. The leading AI companies are aggressively trying to build and deploy more powerful AI, but a number of people are raising alarms and warning that continuing to do this could result in disaster. Here’s a stylized sort of debate that might occur:

A: Great news, our AI-assisted research team has discovered even more improvements than expected! We should be able to build an AI model 10x as big as the state of the art in the next few weeks.

B: I’m getting really concerned about the direction this is heading. I’m worried that if we make an even bigger system and license it to all our existing customers - military customers, financial customers, etc. - we could be headed for a disaster.

A: Well the disaster I’m trying to prevent is competing AI companies getting to market before we do.

B: I was thinking of AI defeating all of humanity.

A: Oh, I was worried about that for a while too, but our safety training has really been incredibly successful.

B: It has? I was just talking to our digital neuroscience lead, and she says that even with recent help from AI “virtual scientists,” they still aren’t able to reliably read a single AI’s digital brain. They were showing me this old incident report where an AI stole money, and they spent like a week analyzing that AI and couldn’t explain in any real way how or why that happened.

How "digital neuroscience" could help (Details not included in email - click to view on the web)

A: I agree that’s unfortunate, but digital neuroscience has always been a speculative, experimental department. Fortunately, we have actual data on safety. Look at this chart - it shows the frequency of concerning incidents plummeting, and it’s extraordinarily low now. In fact, the more powerful the AIs get, the less frequent the incidents get - we can project this out and see that if we train a big enough model, it should essentially never have a concerning incident!

B: But that could be because the AIs are getting cleverer, more patient and long-term, and hence better at ensuring we never catch them.

The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions? (click to expand)

… Or just that they’re now advanced enough that they’re waiting for a chance to disempower humanity entirely, rather than pull a bunch of small-time shenanigans that tip us off to the danger.

The King Lear problem: how do you test what will happen when it's no longer a test? (click to expand)

A: What’s your evidence for this?

B: I think you’ve got things backward - we should be asking what’s our evidence *against* it. By continuing to scale up and deploy AI systems, we could be imposing a risk of utter catastrophe on the whole world. That’s not OK - we should be confident that the risk is low before we move forward.

A: But how would we even be confident that the risk is low?

B: I mean, digital neuroscience -

A: Is an experimental, speculative field!

B: We could also try some other stuff

A: All of that stuff would be expensive, difficult and speculative.

B: Look, I just think that if we can’t show the risk is low, we shouldn’t be moving forward at this point. The stakes are incredibly high, as you yourself have acknowledged - when pitching investors, you’ve said we think we can build a fully general AI and that this would be the most powerful technology in history. Shouldn’t we be at least taking as much precaution with potentially dangerous AI as people take with nuclear weapons?

A: What would that actually accomplish? It just means some other, less cautious company is going to go forward.

B: What about approaching the government and lobbying them to regulate all of us?

A: Regulate all of us to just stop building more powerful AI systems, until we can address some theoretical misalignment concern that we don’t know how to address?

B: Yes?

A: All that’s going to happen if we do that is that other countries are going to catch up to the US. Think [insert authoritarian figure from another country] is going to adhere to these regulations?

B: It would at least buy some time?

A: Buy some time and burn our chance of staying on the cutting edge. While we’re lobbying the government, our competitors are going to be racing forward. I’m sorry, this isn’t practical - we’ve got to go full speed ahead.

B: Look, can we at least try to tighten our security? If you’re so worried about other countries catching up, we should really not be in a position where they can send in a spy and get our code.

A: Our security is pretty intense already.

B: Intense enough to stop a well-resourced state project?

A: What do you want us to do, go to an underground bunker? Use airgapped servers (servers on our premises, entirely disconnected from the public Internet)? It’s the same issue as before - we’ve got to stay ahead of others, we can’t burn huge amounts of time on exotic security measures.

B: I don’t suppose you’d at least consider increasing the percentage of our budget and headcount that we’re allocating to the “speculative” safety research? Or are you going to say that we need to stay ahead and can’t afford to spare resources that could help with that?

A: Yep, that’s what I’m going to say.

Mass deployment. As time goes on, many versions of the above debate happen, at many different stages and in many different places. By and large, people continue rushing forward with building more and more powerful AI systems and deploying them all throughout the economy.

At some point, there are AIs that closely manage major companies’ financials, AIs that write major companies’ business plans, AIs that work closely with politicians to propose and debate laws, AIs that manage drone fleets and develop military strategy, etc. Many of these AIs are primarily built, trained, and deployed by other AIs, or by humans leaning heavily on AI assistance.

More intense warning signs.

(Note: I think it’s possible that progress will accelerate explosively enough that we won’t even get as many warning signs as there are below, but I’m spelling out a number of possible warning signs anyway to make the point that even intense warning signs might not be enough.)

Over time, in this hypothetical scenario, digital neuroscience becomes more effective. When applied to a randomly sampled AI system, it often appears to hint at something like: “This AI appears to be aiming for as much power and influence over the world as possible - which means never doing things humans wouldn’t like if humans can detect it, but grabbing power when it can get away with it.”

Why would AI "aim" to defeat humanity? (Details not included in email - click to view on the web)

However, there is room for debate about what a “digital brain” truly shows:

  • Many people are adamant that the readings are unreliable and misleading.
  • Some people point out that humans are also interested in power and influence, and often think about what they can and can’t get away with, but this doesn’t mean they’d take over the world if they could. They say the AIs might be similar.
  • There are also cases of people doing digital neuroscience that claims to show that AIs are totally safe. These could be people like “A” above who want to focus on pushing forward with AI development rather than bringing it to a halt, or people who just find the alarmists annoying and like to contradict them, or people who are just sloppy with their research. Or people who have been manipulated or bribed by AIs themselves.

There are also very concerning incidents, such as:

  • An AI steals a huge amount of money by bypassing the security system at a bank - and it turns out that this is because the security system was disabled by AIs at the bank. It’s suspected, maybe even proven, that all these AIs had been communicating and coordinating with each other in code, such that humans would have difficulty detecting it. (And they had been aiming to divide up the funds between the different participating AIs, each of which could stash them in a bank account and use them to pursue whatever unintended aims they might have.)
  • An obscure new political party, devoted to the “rights of AIs,” completely takes over a small country, and many people suspect that this party is made up mostly or entirely of people who have been manipulated and/or bribed by AIs.
  • There are companies that own huge amounts of AI servers and robot-operated factories, and are aggressively building more. Nobody is sure what the AIs or the robots are “for,” and there are rumors that the humans “running” the company are actually being bribed and/or threatened to carry out instructions (such as creating more and more AIs and robots) that they don’t understand the purpose of.

At this point, there are a lot of people around the world calling for an immediate halt to AI development. But:

  • Others resist this on all kinds of grounds, e.g. “These concerning incidents are anomalies, and what’s important is that our country keeps pushing forward with AI before others do,” etc.
  • Anyway, it’s just too late. Things are moving incredibly quickly; by the time one concerning incident has been noticed and diagnosed, the AI behind it has been greatly improved upon, and the total amount of AI influence over the economy has continued to grow.

Defeat.

(Noting again that I could imagine things playing out a lot more quickly and suddenly than in this story.)

It becomes more and more common for there to be companies and even countries that are clearly just run entirely by AIs - maybe via bribed/threatened human surrogates, maybe just forcefully (e.g., robots seize control of a country’s military equipment and start enforcing some new set of laws).

At some point, it’s best to think of civilization as containing two different advanced species - humans and AIs - with the AIs having essentially all of the power, making all the decisions, and running everything.

Spaceships start to spread throughout the galaxy; they generally don’t contain any humans, or anything that humans had meaningful input into, and are instead launched by AIs to pursue aims of their own in space.

Maybe at some point humans are killed off, largely due to simply being a nuisance, maybe even accidentally (as humans have driven many species of animals extinct while not bearing them malice). Maybe not, and we all just live under the direction and control of AIs with no way out.

What do these AIs do with all that power? What are all the robots up to? What are they building on other planets? The short answer is that I don’t know.

  • Maybe they’re just creating massive amounts of “digital representations of human approval,” because this is what they were historically trained to seek (kind of like how humans sometimes do whatever it takes to get drugs that will get their brains into certain states).
  • Maybe they’re competing with each other for pure power and territory, because their training has encouraged them to seek power and resources when possible (since power and resources are generically useful, for almost any set of aims).
  • Maybe they have a whole bunch of different things they value, as humans do, that are sort of (but only sort of) related to what they were trained on (as humans tend to value things like sugar that made sense to seek out in the past). And they’re filling the universe with these things.

What sorts of aims might AI systems have? (click to expand)

END OF HYPOTHETICAL SCENARIO

Potential catastrophes from aligned AI

I think it’s possible that misaligned AI (AI forming dangerous goals of its own) will turn out to be pretty much a non-issue. That is, I don’t think the argument I’ve made for being concerned is anywhere near watertight.

What happens if you train an AI system by trial-and-error, giving (to oversimplify) a “thumbs-up” when you’re happy with its behavior and a “thumbs-down” when you’re not? I’ve argued that you might be training it to deceive and manipulate you. However, this is uncertain, and - especially if you’re able to avoid errors in how you’re giving it feedback - things might play out differently.

It might turn out that this kind of training just works as intended, producing AI systems that do something like “Behave as the human would want, if they had all the info the AI has.” And the nitty-gritty details of how exactly AI systems are trained (beyond the high-level “trial-and-error” idea) could be crucial.

If this turns out to be the case, I think the future looks a lot brighter - but there are still lots of pitfalls of the kind I outlined in this piece. For example:

  • Perhaps an authoritarian government launches a huge state project to develop AI systems, and/or uses espionage and hacking to steal a cutting-edge AI model developed elsewhere and deploy it aggressively.
    • I previously noted that “developing powerful AI a few months before others could lead to having technology that is (effectively) hundreds of years ahead of others’.”
    • So this could put an authoritarian government in an enormously powerful position, with the ability to surveil and defeat any enemies worldwide, and the ability to prolong the life of its ruler(s) indefinitely. This could lead to a very bad future, especially if (as I’ve argued could happen) the future becomes “locked in” for good.
  • Perhaps AI companies race ahead with selling AI systems to anyone who wants to buy them, and this leads to things like:
    • People training AIs to act as propaganda agents for whatever views they already have, to the point where the world gets flooded with propaganda agents and it becomes totally impossible for humans to sort the signal from the noise, educate themselves, and generally make heads or tails of what’s going on. (Some people think this has already happened! I think things can get quite a lot worse.)
    • People training “scientist AIs” to develop powerful weapons that can’t be defended against (even with AI help),[5] leading eventually to a dynamic in which ~anyone can cause great harm, and ~nobody can defend against it. At this point, it could be inevitable that we’ll blow ourselves up.
    • Science advancing to the point where digital people are created, in a rushed way such that they are considered property of whoever creates them (no human rights). I’ve previously written about how this could be bad.
    • All other kinds of chaos and disruption, with the least cautious people (the ones most prone to rush forward aggressively deploying AIs to capture resources) generally having an outsized effect on the future.

Of course, this is just a crude gesture in the direction of some of the ways things could go wrong. I’m guessing I haven’t scratched the surface of the possibilities. And things could go very well too!

We can do better

In previous pieces, I’ve talked about a number of ways we could do better than in the scenarios above. Here I’ll just list a few key possibilities, with a bit more detail in expandable boxes and/or links to discussions in previous pieces.

Strong alignment research (including imperfect/temporary measures). If we make enough progress ahead of time on alignment research, we might develop measures that make it relatively easy for AI companies to build systems that truly (not just seemingly) are safe.

So instead of having to say things like “We should slow down until we make progress on experimental, speculative research agendas,” person B in the above dialogue can say things more like “Look, all you have to do is add some relatively cheap bells and whistles to your training procedure for the next AI, and run a few extra tests. Then the speculative concerns about misaligned AI will be much lower-risk, and we can keep driving down the risk by using our AIs to help with safety research and testing. Why not do that?”

More on what this could look like at a previous piece, High-level Hopes for AI Alignment.

High-level hopes for AI alignment (click to expand)

Standards and monitoring. A big driver of the hypothetical catastrophe above is that each individual AI project feels the need to stay ahead of others. Nobody wants to unilaterally slow themselves down in order to be cautious. The situation might be improved if we can develop a set of standards that AI projects need to meet, and enforce them evenly - across a broad set of companies or even internationally.

This isn’t just about buying time; it’s about creating incentives for companies to prioritize safety. An analogy might be something like the Clean Air Act or fuel economy standards: we might not expect individual companies to voluntarily slow down product releases while they work on reducing pollution, but once required, reducing pollution becomes part of what they need to do to be profitable.

Standards could be used for things other than alignment risk, as well. AI projects might be required to:

  • Take strong security measures, preventing states from capturing their models via espionage.
  • Test models before release to understand what people will be able to use them for, and (as if selling weapons) restrict access accordingly.

More at a previous piece.

How standards might be established and become national or international (click to expand)

Successful, careful AI projects. I think a single AI company, or other AI project, could enormously improve the situation by being both successful and careful. For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems are more capable, such that it would take the competition months to catch up). Such a company could put huge amounts of resources - including its money, top people and its advanced AI systems themselves (e.g., AI systems performing roles similar to top human scientists) - into AI safety research, hoping to find safety measures that can be published for everyone to use. It can also take a variety of other measures laid out in a previous piece.

How a careful AI project could be helpful (click to expand)

Strong security. A key threat in the above scenarios is that an incautious actor could “steal” an AI system from a company or project that would otherwise be careful. My understanding is that, based on the current state of security, it could be extremely hard for an AI project to be safe against this outcome. But this could change, if there’s enough effort to work out the problem of how to develop a large-scale, powerful AI system that is very hard to steal.

In future pieces, I’ll get more concrete about what specific people and organizations can do today to improve the odds of factors like these going well, and overall to raise the odds of a good outcome.


Notes


  1. E.g., Ajeya Cotra gives a 15% probability of transformative AI by 2030; eyeballing figure 1 from this chart on expert surveys implies a >10% chance by 2028. 

  2. To predict early AI applications, we need to ask not just “What tasks will AI be able to do?” but “How will this compare to all the other ways people can get the same tasks done?” and “How practical will it be for people to switch their workflows and habits to accommodate new AI capabilities?”

    By contrast, I think the implications of powerful enough AI for productivity don’t rely on this kind of analysis - very high-level economic reasoning can tell us that being able to cheaply copy something with human-like R&D capabilities would lead to explosive progress.

    FWIW, I think it’s fairly common for high-level, long-run predictions to be easier than detailed, short-run predictions. Another example: I think it’s easier to predict a general trend of planetary warming (this seems very likely) than to predict whether it’ll be rainy next weekend. 

  3. Here’s an early example of AIs providing training data for each other/themselves. 

  4. Example of AI helping to write code

  5. To be clear, I have no idea whether this is possible! It’s not obvious to me that it would be dangerous for technology to progress a lot and be used widely for both offense and defense. It’s just a risk I’d rather not incur casually via indiscriminate, rushed AI deployments. 

Comments

Early solutions. The most straightforward way to solve these problems involves training AIs to behave more safely and helpfully. This means that AI companies do a lot of things like “Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.”

This is where my model of what is likely to happen diverges.

It seems to me that for most of the types of failure modes you discuss in this hypothetical, it will be easier and more straightforward to avoid them by simply having hard-coded constraints on what the output of the AI or machine learning model can be.

  • AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what’s going on and discover that they’re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.

Here is an example of where I think the hard-coded structure of any such Algorithm-Improvement-Writeup-AI could easily rule out that failure mode (if such a thing can be created within the current machine learning paradigm).  The component of such an AI system that generates the paper's natural language text might be something like a GPT-style language model fine-tuned for prompts with code and data.  But the part that actually generates the algorithm should naturally be a separate model that can only output algorithms/code that it predicts will perform well on the input task.  Once the algorithm (or multiple for comparison purposes) is generated, another part of the program could deterministically run it on test cases and record only the real performance as data - which could be passed into the prompt and also inserted as a data table into the final write-up (so that the data table in the finished product can only include real data).
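To sketch the kind of structure I mean (all function names are made up, and the two models are stubbed out with trivial placeholders):

```python
import time
from typing import Callable, Dict, List

def generate_candidate_algorithms(task: str) -> List[Callable[[list], list]]:
    """Stand-in for the model whose only possible output is candidate algorithms (code)."""
    return [sorted, lambda xs: sorted(xs, reverse=True)]  # two toy 'candidates'

def benchmark(algorithm: Callable[[list], list], test_cases: List[list]) -> Dict[str, float]:
    """Deterministically run a candidate on held-out test cases and record only
    the measured results - the model never gets to write these numbers itself."""
    start = time.perf_counter()
    correct = sum(algorithm(case) == sorted(case) for case in test_cases)
    return {"accuracy": correct / len(test_cases),
            "seconds": time.perf_counter() - start}

def write_up(task: str, results: List[Dict[str, float]]) -> str:
    """Stand-in for the language model that drafts the paper; the measured
    results table is inserted by the surrounding program, not by the model."""
    table = "\n".join(f"candidate {i}: {r}" for i, r in enumerate(results))
    return f"Write-up for: {task}\nMeasured results:\n{table}"

if __name__ == "__main__":
    cases = [[3, 1, 2], [5, 4], [9, 7, 8, 6]]
    candidates = generate_candidate_algorithms("faster sorting routine")
    measured = [benchmark(c, cases) for c in candidates]
    print(write_up("faster sorting routine", measured))
```

The point is just that the faked-data failure mode is structurally impossible here, because the numbers in the write-up come from the benchmarking code, not from the model.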

  • AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others’ bank accounts, and stealing money.

This strikes me as the same kind of thing, where it seems like the easiest and most intuitive way to set up such a system would be to have a model that takes in information about companies and securities (and maybe information about the economy in general) and returns predictions about what the prices of stocks and other securities will be tomorrow or a week from now or on some such timeframe.

There could then be, for example, another part of the program that takes those predictions and confidence levels, and calculates which combination of trade(s) has the highest expected value within the user's risk tolerance.  And maybe another part of the code that tells a trading bot to put in orders for those trades with an actual brokerage account.

But if you just want an AI to (legally) make money for you in the stock market, there is no reason to give it hacking ability.  And there is no reason to give it the sort of general-purpose, flexible, plan-generation-and-implementation-with-no-human-in-the-loop authorization hypothesised here (and I think the same is true for most or all things that people will try to use AI for in the near term).
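Concretely, the shape of the system I'm imagining is something like this (all names hypothetical; the prediction model and the brokerage interface are stubbed out):

```python
import random
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Prediction:
    ticker: str
    expected_return: float  # predicted return over the chosen horizon
    confidence: float       # 0..1

def predict_prices(tickers: List[str]) -> List[Prediction]:
    """Stand-in for the learned model: its only output is predictions."""
    return [Prediction(t, random.uniform(-0.05, 0.05), random.random()) for t in tickers]

def choose_trades(preds: List[Prediction], risk_tolerance: float) -> Dict[str, float]:
    """Ordinary code, not a model: size positions by expected value, subject to
    the user's risk tolerance."""
    return {p.ticker: round(10_000 * p.expected_return * p.confidence, 2)
            for p in preds
            if p.confidence >= risk_tolerance and p.expected_return > 0}

def submit_orders(trades: Dict[str, float]) -> None:
    """Stand-in for a narrow trading-bot interface to a brokerage account -
    the only channel through which this system can act on the world."""
    for ticker, dollars in trades.items():
        print(f"BUY ${dollars} of {ticker}")

if __name__ == "__main__":
    predictions = predict_prices(["AAA", "BBB", "CCC"])
    submit_orders(choose_trades(predictions, risk_tolerance=0.5))
```

The model's only output is predictions; everything that touches money is plain, auditable code with a deliberately narrow interface.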

Very interesting point! I think it's a good one, but I'll give a little counterpoint here since it's on my mind:

The heuristic of "AIs being used to do X won't have unrelated abilities Y and Z, since that would be unnecessarily complicated" might work fine today but it'll work decreasingly well over time as we get closer to AGI. For example, ChatGPT is currently being used by lots of people as a coding assistant, or a therapist, or a role-play fiction narrator -- yet it can do all of those things at once, and more. For each particular purpose, most of its abilities are unnecessary. Yet here it is.

I expect things to become more like this as we approach AGI. Eventually as Sam Altman once said, "If we need money, we'll ask it to figure out how to make money for us." (Paraphrase, I don't remember the exact quote. It was in some interview years ago).

The heuristic of "AIs being used to do X won't have unrelated abilities Y and Z, since that would be unnecessarily complicated" might work fine today but it'll work decreasingly well over time as we get closer to AGI. For example, ChatGPT is currently being used by lots of people as a coding assistant, or a therapist, or a role-play fiction narrator -- yet it can do all of those things at once, and more. For each particular purpose, most of its abilities are unnecessary. Yet here it is.

For certain applications like therapist or role-play fiction narrator - where the thing the user wants is text on a screen that is interesting to read or that makes him or her feel better to read - it may indeed be that the easiest way to improve user experience over the ChatGPT baseline is through user feedback and reinforcement learning, since it is difficult to specify what makes a text output desirable in a way that could be incorporated into the source code of a GPT-based app or service.  But the outputs of ChatGPT are also still constrained in the sense that it can only output text in response to prompts.  It cannot take action in the outside world, or even get an email address on its own or establish new channels of communication, and it cannot make any plans or decisions except when it is responding to a prompt and determining what text to output next.  So this limits the range of possible failure modes.

I expect things to become more like this as we approach AGI. Eventually as Sam Altman once said, "If we need money, we'll ask it to figure out how to make money for us." (Paraphrase, I don't remember the exact quote. It was in some interview years ago).

It seems like it should be possible to still have hard-coded constraints, or constraints arising from the overall way the system is set up, even for systems that are more general in their capabilities.

For example, suppose you had a system that could model the world accurately and in sufficient detail, and which could reason, plan, and think abstractly - to the degree where asking it "How can I make money?" results in a viable plan - one that would be non-trivial for you to think of yourself and which contains sufficient detail and concreteness that the user can actually implement it.  Intuitively, it seems that it should be possible to separate plan generation from actual in-the-world implementation of the plan.  And an AI system that is capable of generating plans that it predicts will achieve some goal does not need to actually care whether or not anyone implements the plan it generates.

So if the output for the "How can I make money?" question is "Hack into this other person's account (or have an AI hack it for you) and steal it.", and the user wants to make money legitimately, the user can reject the plan and ask instead for a plan on how to make money legally.
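In code, the separation I have in mind can be as simple as this (names hypothetical; the planner is stubbed out):

```python
from typing import List

def generate_plan(goal: str) -> List[str]:
    """Stand-in for the planning model: its only output is a list of proposed steps."""
    return [
        f"Research existing approaches to: {goal}",
        "Draft a budget and timeline",
        "Compare three legitimate revenue options and pick one",
    ]

def human_review(plan: List[str]) -> bool:
    """The user inspects the plan and explicitly approves or rejects it."""
    print("Proposed plan:")
    for i, step in enumerate(plan, 1):
        print(f"  {i}. {step}")
    return input("Approve this plan? [y/N] ").strip().lower() == "y"

def execute(plan: List[str]) -> None:
    """Only reached after approval - and even then, each step could be routed
    through narrow, auditable tools rather than open-ended access."""
    for step in plan:
        print(f"(executing) {step}")

if __name__ == "__main__":
    plan = generate_plan("How can I make money?")
    if human_review(plan):
        execute(plan)
    else:
        print("Plan rejected - ask the planner for a revised, legal-only plan.")
```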

I think there is hope in measures along these lines, but my fear is that it is inherently more complex (and probably slow) to do something like "Make sure to separate plan generation and execution; make sure we can evaluate how a plan is going using reliable metrics and independent assessment" than something like "Just tell an AI what we want, give it access to a terminal/browser and let it go for it."

When AIs are limited and unreliable, the extra effort can be justified purely on grounds of "If you don't put in the extra effort, you'll get results too unreliable to be useful."

If AIs become more and more general - approaching human capabilities - I expect this to become less true, and hence I expect a constant temptation to skimp on independent checks, make execution loops more quick and closed, etc.

The more people are aware of the risks, and concerned about them, the more we might take such precautions anyway. This piece is about how we could stumble into catastrophe if there is relatively little awareness until late in the game.

I think there is hope in measures along these lines, but my fear is that it is inherently more complex (and probably slow) to do something like "Make sure to separate plan generation and execution; make sure we can evaluate how a plan is going using reliable metrics and independent assessment" than something like "Just tell an AI what we want, give it access to a terminal/browser and let it go for it."

 

I would expect people to be most inclined to do this when the AI is given a task that is very similar to other tasks that it has a track record of performing successfully - and by relatively standard methods so that you can predict the broad character of the plan without looking at the details.

For example, if self-driving cars get to the point where they are highly safe and reliable, some users might just pick a destination and go to sleep without looking at the route the car chose.  But in such a case, you can still be reasonably confident that the car will drive you there on the roads - rather than, say, going off road or buying you a plane ticket to your destination and taking you to the airport.

I think it is less likely most people will want to deploy mostly untested systems to act freely in the world unmonitored - and have them pursue goals by implementing plans where you have no idea what kind of plan the AI will come up with.  Especially if - as in the case of the AI that hacks someone's account to steal money for example - the person or company that deployed it could be subject to legal liability (assuming we are still talking about a near-term situation where human legal systems still exist and have not been overthrown or abolished by any super-capable AI).

The more people are aware of the risks, and concerned about them, the more we might take such precautions anyway. This piece is about how we could stumble into catastrophe if there is relatively little awareness until late in the game.

I agree that having more awareness of the risks would - on balance - tend to make people more careful about testing and having safeguards before deploying high-impact AI systems.  But it seems to me that this post contemplates a scenario where even with lots of awareness people don't take adequate precautions.  On my reading of this hypothetical:

  • Lots of things are known to be going wrong with AI systems.
  • Reinforcement learning with human feedback is known to be failing to prevent many failure modes, and frequently makes it take longer for the problem to be discovered, but nobody comes up with a better way to prevent those failure modes.
  • In spite of this, lots of people and companies keep deploying more powerful AI systems without coming up with better ways to ensure reliability or doing robust testing for the task they are using the AI for.
  • There is no significant pushback against this from the broader public, and no significant pressure from shareholders (who don't want the company to get sued, or have the company go offline for a while because AI-written code was pushed to production without adequate sandboxing/testing, or other similar things that could cause them to lose money); or at least the pushback is not strong enough to create a large change.

The conjunction of all of these things makes the scenario seem less probable to me.

I think the more capable AI systems are, the more we'll see patterns like "Every time you ask an AI to do something, it does it well; the less you put yourself in the loop and the fewer constraints you impose, the better and/or faster it goes; and you ~never see downsides." (You never SEE them, which doesn't mean they don't happen.)

I think the world is quite capable of handling a dynamic like that as badly as in my hypothetical scenario, especially if things are generally moving very quickly - I could see a scenario like the one above playing out in a handful of years or faster, and it often takes much longer than that for e.g. good regulation to get designed and implemented in response to some novel problem.

I think the more capable AI systems are, the more we'll see patterns like "Every time you ask an AI to do something, it does it well; the less you put yourself in the loop and the fewer constraints you impose, the better and/or faster it goes; and you ~never see downsides." (You never SEE them, which doesn't mean they don't happen.)

This, again, seems unlikely to me.

For most things that people seem likely to use AI for in the foreseeable future, I expect downsides and failure modes will be easy to notice.  If self-driving cars are crashing or going to the wrong destination, or if AI-generated code is causing the company's website to crash or apps to malfunction, people would notice those.

Even if someone has an AI that he or she just hooks up to the internet and gives the task "make money for me", it should be easy to build in some automatic record-keeping module that keeps track of what actions the AI took and where the money came from.  And even if the user does not care if the money is stolen, I would expect the person or bank that was robbed to notice and ask law enforcement to investigate where the money went.
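For instance, a thin wrapper along these lines (purely illustrative; the function names are made up):

```python
import json
import time

AUDIT_LOG = "ai_actions.log"  # hypothetical append-only log file

def log_action(action: str, details: dict) -> None:
    """Record every externally visible action, with its provenance, so the user
    (or an auditor, or law enforcement) can review where the money came from."""
    record = {"time": time.time(), "action": action, "details": details}
    with open(AUDIT_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

def transfer_funds(source: str, destination: str, amount: float) -> None:
    """Stand-in for the only money-moving primitive the AI is allowed to call."""
    log_action("transfer_funds", {"from": source, "to": destination, "amount": amount})
    # ...the actual transfer, via a narrow authenticated API, would go here...

if __name__ == "__main__":
    transfer_funds("client-brokerage", "client-checking", 125.00)
    with open(AUDIT_LOG) as f:
        print(f.read())
```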

Can you give an example of some type of task for which you would expect people to frequently use AI, and where there would reliably be downside to the AI performing the task that everyone would simply fail to notice for months or years?

I think this kind of thing is common among humans. Employees might appear to be accomplishing the objectives they were given, with distortions hard to notice (and sometimes noticed, sometimes not) - e.g., programmers cutting corners and leaving a company with problems in the code that don't get discovered until later (if ever). People in government may appear to be loyal to the person in power, while plotting a coup, with the plot not noticed until it's too late. I think the key question here is whether AIs might get situational awareness and other abilities comparable to those of humans. 

Those 2 types of downsides, creating code with a bug versus plotting a takeover, seem importantly different.

I can easily see how an LLM-based app fine-tuned with RLHF might generate the first type of problem. For example, let’s say some GPT-based app is trained using this method to generate the code for websites in response to prompts describing how the website should look and what features it should have. And let’s suppose during training it generates many examples that have some unnoticed error - maybe it does not render properly on certain size screens, but the evaluators all have normal-sized screens where that problem does not show up.

If the evaluators rated many websites with this bug favorably, then I would not be surprised if the trained model continued to generate code with the same bug after it was deployed.

But I would not expect the model to internally distinguish between “the humans rated those examples favorably because they did not notice the rendering problem” versus “the humans liked the entire code including the weird rendering on larger screens”. I would not expect it to internally represent concepts like “if some users with large screens notice and complain about the rendering problem after deployment, Open AI might train a new model and rate those websites negatively instead” or to care about whether this would eventually happen or to take any precautions against the rendering issue being discovered.

By contrast, the coup-plotting problem is more similar to the classic AI takeover scenario. And that does seem to require the type of foresight and situational awareness to distinguish between “the leadership lets me continue working in the government because they don’t know I am planning a coup” versus “the leadership likes the fact that I am planning to overthrow them”, and to take precautions against your plans being discovered while you can still be shut down.

I don’t think an AI system gets the latter type of ability just as an accidental side effect of reinforcement learning with human feedback (at least not for the AI systems we have now). The development team would need to do a lot of extra work to give an AI that foresightful planning ability, and the ability to understand the decision system of a potential adversary enough to predict which information it needs to keep secret for its plans to succeed. And if a development team is giving its AI those abilities (and exercising any reasonable degree of caution) then I would expect them to build in safeguards: have hard constraints on what it is able to do, ensure its plans are inspectable, etc.

Is the disagreement here about whether AIs are likely to develop things like situational awareness, foresightful planning ability, and understanding of adversaries' decisions as they are used for more and more challenging tasks?

I think this piece represents my POV on this pretty well, especially the bits starting around here.

Is the disagreement here about whether AIs are likely to develop things like situational awareness, foresightful planning ability, and understanding of adversaries' decisions as they are used for more and more challenging tasks?

 

My thought on this is, if a baseline AI system does not have situational awareness before the AI researchers started fine-tuning it, I would not expect it to obtain situational awareness through reinforcement learning with human feedback.

I am not sure I can answer this for the hypothetical "Alex" system in the linked post, since I don't think I have a good mental model of how such a system would work or what kind of training data or training protocol you would need to have to create such a thing.

If I saw something that, from the outside, appeared to exhibit the full range of abilities Alex is described as having (including advancing R&D in multiple disparate domains in ways that are not simple extrapolations of its training data) I would assign a significantly higher probability to that system having situational awareness than I do to current systems. If someone had a system that was empirically that powerful, which had been trained largely by reinforcement learning, I would say the responsible thing to do would be:

  1. Keep it air-gapped rather than unleashing large numbers of copies of it onto the internet
  2. Carefully vet any machine blueprints, drugs or other medical interventions, or other plans or technologies the system comes up with (perhaps first building a prototype to gather data on it in an isolated controlled setting where it can be quickly destroyed) to ensure safety before deploying them out into the world.

The 2nd of those would have the downside that beneficial ideas and inventions produced by the system take longer to get rolled out and have a positive effect. But it would be worth it in that context to reduce the risk of some large unforeseen downside.

I think that as people push AIs to do more and more ambitious things, it will become more and more likely that situational awareness comes along with this, for reasons broadly along the lines of those I linked to (it will be useful to train the AI to have situational awareness and/or other properties tightly linked to it).

I think this could happen via RL fine-tuning, but I also think it's a mistake to fixate too much on today's dominant methods - if today's methods can't produce situational awareness, they probably can't produce as much value as possible, and people will probably move beyond them.

The "responsible things to do" you list seem reasonable, but expensive, and perhaps skipped over in an environment where there's intense competition, things are moving quickly, and the risks aren't obvious (because situationally aware AIs are deliberately hiding a lot of the evidence of risk).

Social media algorithms.

Did everyone actually fail to notice, for months, that social media algorithms would sometimes recommend extremist content/disinformation/conspiracy theories/etc (assuming that this is the downside you are referring to)?

It seems to me that some people must have realized this as soon as they started seeing Alex Jones videos showing up in their YouTube recommendations.

Thanks for this post; it's probably my favorite Cold Takes post from the last few months. I appreciated the specific scenario, as well as the succinct points in the "we can do better" section. I felt like I could get a more concrete understanding of your worldview, how you think we should move forward, and the reasons why. I'm also glad that you're thinking critically about standards and monitoring.

For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems are more capable, such that it would take the competition months to catch up). Such a company could put huge amounts of resources - including its money, top people and its advanced AI systems themselves (e.g., AI systems performing roles similar to top human scientists) - into AI safety research, hoping to find safety measures that can be published for everyone to use.

Let's suppose this AI lab existed. For a while, it was prioritizing capabilities research in order to stay ahead of its competition. Do you expect that it would know when it's supposed to "hit the pause button" and reallocate its resources into AI safety research?

I think my biggest fear with pushing the "Successful, careful AI project" narrative is that (a) every AGI company will think that they can be the successful/careful project, which just gives them more justification to keep doing capabilities research and (b) it seems hard to know when the lab is supposed to "pause". This was one of my major uncertainties about OpenAI's alignment plan and it seems consistent with your concerns about racing through a minefield.

What do you think about this "when to pause" problem? Are you expecting that labs implement evals that tell them when to pause, or that they'll kind of "know it when they see it", or something else?

Thanks! I agree this is a concern. In theory, people who are constantly thinking about the risks should be able to make a reasonable decision about "when to pause", but in practice I think there is a lot of important work to do today making the "pause" more likely in the future, including on AI safety standards and on the kinds of measures described at https://www.cold-takes.com/what-ai-companies-can-do-today-to-help-with-the-most-important-century/

One thought that struck me is that sometimes one can make unholy alliances with profit-maximizing organizations aggressively pursuing AI. This happens when you find a company that thinks regulation is likely, thinks that it can increase that likelihood, and thinks that it is better positioned than its competition to thrive in a highly regulated environment. My optimism about this comes from both climate change work and vehicle emission regulation. Some oil companies want a carbon price because their expensive oil is less carbon-intensive than the cheaper oil of the competition. Similarly, some automakers are better at making low-emission vehicles.

How do we ensure that humans are not misaligned, so to speak?

The crux, to me, is that we've developed all kinds of tech that one person alone can use to basically wipe out everyone.  Perhaps I'm being overly optimistic (or pessimistic, depending on perspective), but no one can deny that the individual is currently the most powerful individuals have ever been, and there is no sign of that slowing down.

Mostly I believe this is because of information.

So the only real solution I can see, is some type of thought police, basically, be it for humans or AI.[1]

Somehow, though, creating a Thought Police Force seems akin to some stuff we've seen in our imaginations already, one step from Pre-crime and what have you, which I'd say is "bad" but from what I've been reading a lot of people seem to think would be "good"[2].
 

  1. Assuming the AI is on par with a human and doesn't just collectively instantly say "peace out!" and warp off into space to explore and grow faster than would be possible here on Earth.

  2. I often wax poetic on the nature of Good and Bad as I don't think we can gloss over the fundamentals.