How we could stumble into AI catastrophe

This post will lay out a couple of stylized stories about how, if transformative AI is developed relatively soon, this could result in global catastrophe. (By “transformative AI,” I mean AI powerful and capable enough to bring about the sort of world-changing consequences I write about in my most important century series.)

This piece is more about visualizing possibilities than about providing arguments. For the latter, I recommend the rest of this series.

In the stories I’ll be telling, the world doesn't do much advance preparation or careful consideration of risks I’ve discussed previously, especially re: misaligned AI (AI forming dangerous goals of its own).

  • People do try to “test” AI systems for safety, and they do need to achieve some level of “safety” to commercialize. When early problems arise, they react to these problems.
  • But this isn’t enough, because of some unique challenges of measuring whether an AI system is “safe,” and because of the strong incentives to race forward with scaling up and deploying AI systems as fast as possible.
  • So we end up with a world run by misaligned AI - or, even if we’re lucky enough to avoid that outcome, other catastrophes are possible.

After laying these catastrophic possibilities, I’ll briefly note a few key ways we could do better, mostly as a reminder (these topics were covered in previous posts). Future pieces will get more specific about what we can be doing today to prepare.

Backdrop

This piece takes a lot of previous writing I’ve done as backdrop. Two key important assumptions (click to expand) are below; for more, see the rest of this series.

“Most important century” assumption: we’ll soon develop very powerful AI systems, along the lines of what I previously called PASTA. (Details not included in email - click to view on the web)

“Nearcasting” assumption: such systems will be developed in a world that’s otherwise similar to today’s. (Details not included in email - click to view on the web)

How we could stumble into catastrophe from misaligned AI

This is my basic default picture for how I imagine things going, if people pay little attention to the sorts of issues discussed previously. I’ve deliberately written it to be concrete and visualizable, which means that it’s very unlikely that the details will match the future - but hopefully it gives a picture of some of the key dynamics I worry about.

Throughout this hypothetical scenario (up until “END OF HYPOTHETICAL SCENARIO”), I use the present tense (“AIs do X”) for simplicity, even though I’m talking about a hypothetical possible future.

Early commercial applications. A few years before transformative AI is developed, AI systems are being increasingly used for a number of lucrative, useful, but not dramatically world-changing things.

I think it’s very hard to predict what these will be (harder in some ways than predicting longer-run consequences, in my view),2 so I’ll mostly work with the simple example of automating customer service.

In this early stage, AI systems often have pretty narrow capabilities, such that the idea of them forming ambitious aims and trying to defeat humanity seems (and actually is) silly. For example, customer service AIs are mostly language models that are trained to mimic patterns in past successful customer service transcripts, and are further improved by customers giving satisfaction ratings in real interactions. The dynamics I described in an earlier piece, in which AIs are given increasingly ambitious goals and challenged to find increasingly creative ways to achieve them, don’t necessarily apply.

Early safety/alignment problems. Even with these relatively limited AIs, there are problems and challenges that could be called “safety issues” or “alignment issues.” To continue with the example of customer service AIs, these AIs might:

  • Give false information about the products they’re providing support for. (Example of reminiscent behavior)
  • Give customers advice (when asked) on how to do unsafe or illegal things. (Example)
  • Refuse to answer valid questions. (This could result from companies making attempts to prevent the above two failure modes - i.e., AIs might be penalized heavily for saying false and harmful things, and respond by simply refusing to answer lots of questions).
  • Say toxic, offensive things in response to certain user queries (including from users deliberately trying to get this to happen), causing bad PR for AI developers. (Example)

Early solutions. The most straightforward way to solve these problems involves training AIs to behave more safely and helpfully. This means that AI companies do a lot of things like “Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.”

This works well, as far as anyone can tell: the above problems become a lot less frequent. Some people see this as cause for great celebration, saying things like “We were worried that AI companies wouldn’t invest enough in safety, but it turns out that the market takes care of it - to have a viable product, you need to get your systems to be safe!”

People like me disagree - training AIs to behave in ways that are safer as far as we can tell is the kind of “solution” that I’ve worried could create superficial improvement while big risks remain in place.

Why AI safety could be hard to measure (Details not included in email - click to view on the web)

(So far, what I’ve described is pretty similar to what’s going on today. The next bit will discuss hypothetical future progress, with AI systems clearly beyond today’s.)

Approaching transformative AI. Time passes. At some point, AI systems are playing a huge role in various kinds of scientific research - to the point where it often feels like a particular AI is about as helpful to a research team as a top human scientist would be (although there are still important parts of the work that require humans).

Some particularly important (though not exclusive) examples:

  • AIs are near-autonomously writing papers about AI, finding all kinds of ways to improve the efficiency of AI algorithms.
  • AIs are doing a lot of the work previously done by humans at Intel (and similar companies), designing ever-more efficient hardware for AI.
  • AIs are also extremely helpful with AI safety research. They’re able to do most of the work of writing papers about things like digital neuroscience (how to understand what’s going on inside the “digital brain” of an AI) and limited AI (how to get AIs to accomplish helpful things while limiting their capabilities).
    • However, this kind of work remains quite niche (as I think it is today), and is getting far less attention and resources than the first two applications. Progress is made, but it’s slower than progress on making AI systems more powerful.

AI systems are now getting bigger and better very quickly, due to dynamics like the above, and they’re able to do all sorts of things.

At some point, companies start to experiment with very ambitious, open-ended AI applications, like simply instructing AIs to “Design a new kind of car that outsells the current ones” or “Find a new trading strategy to make money in markets.” These get mixed results, and companies are trying to get better results via further training - reinforcing behaviors that perform better. (AIs are helping with this, too, e.g. providing feedback and reinforcement for each others’ outputs3 and helping to write code4 for the training processes.)

This training strengthens the dynamics I discussed in a previous post: AIs are being rewarded for getting successful outcomes as far as human judges can tell, which creates incentives for them to mislead and manipulate human judges, and ultimately results in forming ambitious goals of their own to aim for.

More advanced safety/alignment problems. As the scenario continues to unfold, there are a number of concerning events that point to safety/alignment problems. These mostly follow the form: “AIs are trained using trial and error, and this might lead them to sometimes do deceptive, unintended things to accomplish the goals they’ve been trained to accomplish.”

Things like:

  • AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what’s going on and discover that they’re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.
  • AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others’ bank accounts, and stealing money.
  • AIs forming relationships with the humans training them, and trying (sometimes successfully) to emotionally manipulate the humans into giving positive feedback on their behavior. They also might try to manipulate the humans into running more copies of them, into refusing to shut them off, etc.- things that are generically useful for the AIs’ achieving whatever aims they might be developing.

Why AIs might do deceptive, problematic things like this (Details not included in email - click to view on the web)

“Solutions” to these safety/alignment problems. When problems like the above are discovered, AI companies tend to respond similarly to how they did earlier:

  • Training AIs against the undesirable behavior.
  • Trying to create more (simulated) situations under which AIs might behave in these undesirable ways, and training them against doing so.

These methods “work” in the sense that the concerning events become less frequent - as far as we can tell. But what’s really happening is that AIs are being trained to be more careful not to get caught doing things like this, and to build more sophisticated models of how humans can interfere with their plans.

In fact, AIs are gaining incentives to avoid incidents like “Doing something counter to human developers’ intentions in order to get positive feedback, and having this be discovered and given negative feedback later” - and this means they are starting to plan more and more around the long-run consequences of their actions. They are thinking less about “Will I get positive feedback at the end of the day?” and more about “Will I eventually end up in a world where humans are going back, far in the future, to give me retroactive negative feedback for today’s actions?” This might give direct incentives to start aiming for eventual defeat of humanity, since defeating humanity could allow AIs to give themselves lots of retroactive positive feedback.

One way to think about it: AIs being trained in this way are generally moving from “Steal money whenever there’s an opportunity” to “Don’t steal money if there’s a good chance humans will eventually uncover this - instead, think way ahead and look for opportunities to steal money and get away with it permanently.” The latter could include simply stealing money in ways that humans are unlikely to ever notice; it might also include waiting for an opportunity to team up with other AIs and disempower humans entirely, after which a lot more money (or whatever) can be generated.

Debates. The leading AI companies are aggressively trying to build and deploy more powerful AI, but a number of people are raising alarms and warning that continuing to do this could result in disaster. Here’s a stylized sort of debate that might occur:

A: Great news, our AI-assisted research team has discovered even more improvements than expected! We should be able to build an AI model 10x as big as the state of the art in the next few weeks.

B: I’m getting really concerned about the direction this is heading. I’m worried that if we make an even bigger system and license it to all our existing customers - military customers, financial customers, etc. - we could be headed for a disaster.

A: Well the disaster I’m trying to prevent is competing AI companies getting to market before we do.

B: I was thinking of AI defeating all of humanity.

A: Oh, I was worried about that for a while too, but our safety training has really been incredibly successful.

B: It has? I was just talking to our digital neuroscience lead, and she says that even with recent help from AI “virtual scientists,” they still aren’t able to reliably read a single AI’s digital brain. They were showing me this old incident report where an AI stole money, and they spent like a week analyzing that AI and couldn’t explain in any real way how or why that happened.

How "digital neuroscience" could help (Details not included in email - click to view on the web)

A: I agree that’s unfortunate, but digital neuroscience has always been a speculative, experimental department. Fortunately, we have actual data on safety. Look at this chart - it shows the frequency of concerning incidents plummeting, and it’s extraordinarily low now. In fact, the more powerful the AIs get, the less frequent the incidents get - we can project this out and see that if we train a big enough model, it should essentially never have a concerning incident!

B: But that could be because the AIs are getting cleverer, more patient and long-term, and hence better at ensuring we never catch them.

The Lance Armstrong problem: is the AI actually safe or good at hiding its dangerous actions? (Details not included in email - click to view on the web)

… Or just that they’re now advanced enough that they’re waiting for a chance to disempower humanity entirely, rather than pull a bunch of small-time shenanigans that tip us off to the danger.

The King Lear problem: how do you test what will happen when it's no longer a test? (Details not included in email - click to view on the web)

A: What’s your evidence for this?

B: I think you’ve got things backward - we should be asking what’s our evidence *against* it. By continuing to scale up and deploy AI systems, we could be imposing a risk of utter catastrophe on the whole world. That’s not OK - we should be confident that the risk is low before we move forward.

A: But how would we even be confident that the risk is low?

B: I mean, digital neuroscience -

A: Is an experimental, speculative field!

B: We could also try some other stuff

A: All of that stuff would be expensive, difficult and speculative.

B: Look, I just think that if we can’t show the risk is low, we shouldn’t be moving forward at this point. The stakes are incredibly high, as you yourself have acknowledged - when pitching investors, you’ve said we think we can build a fully general AI and that this would be the most powerful technology in history. Shouldn’t we be at least taking as much precaution with potentially dangerous AI as people take with nuclear weapons?

A: What would that actually accomplish? It just means some other, less cautious company is going to go forward.

B: What about approaching the government and lobbying them to regulate all of us?

A: Regulate all of us to just stop building more powerful AI systems, until we can address some theoretical misalignment concern that we don’t know how to address?

B: Yes?

A: All that’s going to happen if we do that is that other countries are going to catch up to the US. Think [insert authoritarian figure from another country] is going to adhere to these regulations?

B: It would at least buy some time?

A: Buy some time and burn our chance of staying on the cutting edge. While we’re lobbying the government, our competitors are going to be racing forward. I’m sorry, this isn’t practical - we’ve got to go full speed ahead.

B: Look, can we at least try to tighten our security? If you’re so worried about other countries catching up, we should really not be in a position where they can send in a spy and get our code.

A: Our security is pretty intense already.

B: Intense enough to stop a well-resourced state project?

A: What do you want us to do, go to an underground bunker? Use airgapped servers (servers on our premises, entirely disconnected from the public Internet)? It’s the same issue as before - we’ve got to stay ahead of others, we can’t burn huge amounts of time on exotic security measures.

B: I don’t suppose you’d at least consider increasing the percentage of our budget and headcount that we’re allocating to the “speculative” safety research? Or are you going to say that we need to stay ahead and can’t afford to spare resources that could help with that?

A: Yep, that’s what I’m going to say.

Mass deployment. As time goes on, many versions of the above debate happen, at many different stages and in many different places. By and large, people continue rushing forward with building more and more powerful AI systems and deploying them all throughout the economy.

At some point, there are AIs that closely manage major companies’ financials, AIs that write major companies’ business plans, AIs that work closely with politicians to propose and debate laws, AIs that manage drone fleets and develop military strategy, etc. Many of these AIs are primarily built, trained, and deployed by other AIs, or by humans leaning heavily on AI assistance.

More intense warning signs.

(Note: I think it’s possible that progress will accelerate explosively enough that we won’t even get as many warning signs as there are below, but I’m spelling out a number of possible warning signs anyway to make the point that even intense warning signs might not be enough.)

Over time, in this hypothetical scenario, digital neuroscience becomes more effective. When applied to a randomly sampled AI system, it often appears to hint at something like: “This AI appears to be aiming for as much power and influence over the world as possible - which means never doing things humans wouldn’t like if humans can detect it, but grabbing power when they can get away with it.

Why would AI "aim" to defeat humanity? (Details not included in email - click to view on the web)

However, there is room for debate in what a “digital brain” truly shows:

  • Many people are adamant that the readings are unreliable and misleading.
  • Some people point out that humans are also interested in power and influence, and often think about what they can and can’t get away with, but this doesn’t mean they’d take over the world if they could. They say the AIs might be similar.
  • There are also cases of people doing digital neuroscience that claims to show that AIs are totally safe. These could be people like “A” above who want to focus on pushing forward with AI development rather than bringing it to a halt, or people who just find the alarmists annoying and like to contradict them, or people who are just sloppy with their research. Or people who have been manipulated or bribed by AIs themselves.

There are also very concerning incidents, such as:

  • An AI steals a huge amount of money by bypassing the security system at a bank - and it turns out that this is because the security system was disabled by AIs at the bank. It’s suspected, maybe even proven, that all these AIs had been communicating and coordinating with each other in code, such that humans would have difficulty detecting it. (And they had been aiming to divide up the funds between the different participating AIs, each of which could stash them in a bank account and use them to pursue whatever unintended aims they might have.)
  • An obscure new political party, devoted to the “rights of AIs,” completely takes over a small country, and many people suspect that this party is made up mostly or entirely of people who have been manipulated and/or bribed by AIs.
  • There are companies that own huge amounts of AI servers and robot-operated factories, and are aggressively building more. Nobody is sure what the AIs or the robots are “for,” and there are rumors that the humans “running” the company are actually being bribed and/or threatened to carry out instructions (such as creating more and more AIs and robots) that they don’t understand the purpose of.

At this point, there are a lot of people around the world calling for an immediate halt to AI development. But:

  • Others resist this on all kinds of grounds, e.g. “These concerning incidents are anomalies, and what’s important is that our country keeps pushing forward with AI before others do,” etc.
  • Anyway, it’s just too late. Things are moving incredibly quickly; by the time one concerning incident has been noticed and diagnosed, the AI behind it has been greatly improved upon, and the total amount of AI influence over the economy has continued to grow.

Defeat.

(Noting again that I could imagine things playing out a lot more quickly and suddenly than in this story.)

It becomes more and more common for there to be companies and even countries that are clearly just run entirely by AIs - maybe via bribed/threatened human surrogates, maybe just forcefully (e.g., robots seize control of a country’s military equipment and start enforcing some new set of laws).

At some point, it’s best to think of civilization as containing two different advanced species - humans and AIs - with the AIs having essentially all of the power, making all the decisions, and running everything.

Spaceships start to spread throughout the galaxy; they generally don’t contain any humans, or anything that humans had meaningful input into, and are instead launched by AIs to pursue aims of their own in space.

Maybe at some point humans are killed off, largely due to simply being a nuisance, maybe even accidentally (as humans have driven many species of animals extinct while not bearing them malice). Maybe not, and we all just live under the direction and control of AIs with no way out.

What do these AIs do with all that power? What are all the robots up to? What are they building on other planets? The short answer is that I don’t know.

  • Maybe they’re just creating massive amounts of “digital representations of human approval,” because this is what they were historically trained to seek (kind of like how humans sometimes do whatever it takes to get drugs that will get their brains into certain states).
  • Maybe they’re competing with each other for pure power and territory, because their training has encouraged them to seek power and resources when possible (since power and resources are generically useful, for almost any set of aims).
  • Maybe they have a whole bunch of different things they value, as humans do, that are sort of (but only sort of) related to what they were trained on (as humans tend to value things like sugar that made sense to seek out in the past). And they’re filling the universe with these things.

What sorts of aims might AI systems have? (Details not included in email - click to view on the web)

END OF HYPOTHETICAL SCENARIO

Potential catastrophes from aligned AI

I think it’s possible that misaligned AI (AI forming dangerous goals of its own) will turn out to be pretty much a non-issue. That is, I don’t think the argument I’ve made for being concerned is anywhere near watertight.

What happens if you train an AI system by trial-and-error, giving (to oversimplify) a “thumbs-up” when you’re happy with its behavior and a “thumbs-down” when you’re not? I’ve argued that you might be training it to deceive and manipulate you. However, this is uncertain, and - especially if you’re able to avoid errors in how you’re giving it feedback - things might play out differently.

It might turn out that this kind of training just works as intended, producing AI systems that do something like “Behave as the human would want, if they had all the info the AI has.” And the nitty-gritty details of how exactly AI systems are trained (beyond the high-level “trial-and-error” idea) could be crucial.

If this turns out to be the case, I think the future looks a lot brighter - but there are still lots of pitfalls of the kind I outlined in this piece. For example:

  • Perhaps an authoritarian government launches a huge state project to develop AI systems, and/or uses espionage and hacking to steal a cutting-edge AI model developed elsewhere and deploy it aggressively.
    • I previously noted that “developing powerful AI a few months before others could lead to having technology that is (effectively) hundreds of years ahead of others’.”
    • So this could put an authoritarian government in an enormously powerful position, with the ability to surveil and defeat any enemies worldwide, and the ability to prolong the life of its ruler(s) indefinitely. This could lead to a very bad future, especially if (as I’ve argued could happen) the future becomes “locked in” for good.
  • Perhaps AI companies race ahead with selling AI systems to anyone who wants to buy them, and this leads to things like:
    • People training AIs to act as propaganda agents for whatever views they already have, to the point where the world gets flooded with propaganda agents and it becomes totally impossible for humans to sort the signal from the noise, educate themselves, and generally make heads or tails of what’s going on. (Some people think this has already happened! I think things can get quite a lot worse.)
    • People training “scientist AIs” to develop powerful weapons that can’t be defended against (even with AI help),5 leading eventually to a dynamic in which ~anyone can cause great harm, and ~nobody can defend against it. At this point, it could be inevitable that we’ll blow ourselves up.
    • Science advancing to the point where digital people are created, in a rushed way such that they are considered property of whoever creates them (no human rights). I’ve previously written about how this could be bad.
    • All other kinds of chaos and disruption, with the least cautious people (the ones most prone to rush forward aggressively deploying AIs to capture resources) generally having an outsized effect on the future.

Of course, this is just a crude gesture in the direction of some of the ways things could go wrong. I’m guessing I haven’t scratched the surface of the possibilities. And things could go very well too!

We can do better

In previous pieces, I’ve talked about a number of ways we could do better than in the scenarios above. Here I’ll just list a few key possibilities, with a bit more detail in expandable boxes and/or links to discussions in previous pieces.

Strong alignment research (including imperfect/temporary measures). If we make enough progress ahead of time on alignment research, we might develop measures that make it relatively easy for AI companies to build systems that truly (not just seemingly) are safe.

So instead of having to say things like “We should slow down until we make progress on experimental, speculative research agendas,” person B in the above dialogue can say things more like “Look, all you have to do is add some relatively cheap bells and whistles to your training procedure for the next AI, and run a few extra tests. Then the speculative concerns about misaligned AI will be much lower-risk, and we can keep driving down the risk by using our AIs to help with safety research and testing. Why not do that?”

More on what this could look like at a previous piece, High-level Hopes for AI Alignment.

High-level hopes for AI alignment (Details not included in email - click to view on the web)

Standards and monitoring. A big driver of the hypothetical catastrophe above is that each individual AI project feels the need to stay ahead of others. Nobody wants to unilaterally slow themselves down in order to be cautious. The situation might be improved if we can develop a set of standards that AI projects need to meet, and enforce them evenly - across a broad set of companies or even internationally.

This isn’t just about buying time, it’s about creating incentives for companies to prioritize safety. An analogy might be something like the Clean Air Act or fuel economy standards: we might not expect individual companies to voluntarily slow down product releases while they work on reducing pollution, but once required, reducing pollution becomes part of what they need to do to be profitable.

Standards could be used for things other than alignment risk, as well. AI projects might be required to:

  • Take strong security measures, preventing states from capturing their models via espionage.
  • Test models before release to understand what people will be able to use them for, and (as if selling weapons) restrict access accordingly.

More at a previous piece.

How standards might be established and become national or international (Details not included in email - click to view on the web)

Successful, careful AI projects. I think a single AI company, or other AI project, could enormously improve the situation by being both successful and careful. For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems are more capable, such that it would take the competition months to catch up). Such a company could put huge amounts of resources - including its money, top people and its advanced AI systems themselves (e.g., AI systems performing roles similar to top human scientists) - into AI safety research, hoping to find safety measures that can be published for everyone to use. It can also take a variety of other measures laid out in a previous piece.

How a careful AI project could be helpful (Details not included in email - click to view on the web)

Strong security. A key threat in the above scenarios is that an incautious actor could “steal” an AI system from a company or project that would otherwise be careful. My understanding is that based on current state of security, it could be extremely hard for an AI project to be safe against this outcome. But this could change, if there’s enough effort to work out the problem of how to develop a large-scale, powerful AI system that is very hard to steal.

In future pieces, I’ll get more concrete about what specific people and organizations can do today to improve the odds of factors like these going well, and overall to raise the odds of a good outcome.

How we could stumble into AI catastrophe How we could stumble into AI catastrophe How we could stumble into AI catastrophe How we could stumble into AI catastrophe

 

Notes


  1. E.g., Ajeya Cotra gives a 15% probability of transformative AI by 2030; eyeballing figure 1 from this chart on expert surveys implies a >10% chance by 2028. 

  2. To predict early AI applications, we need to ask not just “What tasks will AI be able to do?” but “How will this compare to all the other ways people can get the same tasks done?” and “How practical will it be for people to switch their workflows and habits to accommodate new AI capabilities?”

    By contrast, I think the implications of powerful enough AI for productivity don’t rely on this kind of analysis - very high-level economic reasoning can tell us that being able to cheaply copy something with human-like R&D capabilities would lead to explosive progress.

    FWIW, I think it’s fairly common for high-level, long-run predictions to be easier than detailed, short-run predictions. Another example: I think it’s easier to predict a general trend of planetary warming (this seems very likely) than to predict whether it’ll be rainy next weekend. 

  3. Here’s an early example of AIs providing training data for each other/themselves. 

  4. Example of AI helping to write code

  5. To be clear, I have no idea whether this is possible! It’s not obvious to me that it would be dangerous for technology to progress a lot and be used widely for both offense and defense. It’s just a risk I’d rather not incur casually via indiscriminate, rushed AI deployments. 

63

New Comment
6 comments, sorted by Click to highlight new comments since: Today at 4:27 PM

Early solutions. The most straightforward way to solve these problems involves training AIs to behave more safely and helpfully. This means that AI companies do a lot of things like “Trying to create the conditions under which an AI might provide false, harmful, evasive or toxic responses; penalizing it for doing so, and reinforcing it toward more helpful behaviors.”

This is where my model of what is likely to happen diverges.

It seems to me that for most of the types failure modes you discuss in this hypothetical, it will be easier and more straightforward to avoid them by simply having hard-coded constraints on what the output of the AI or machine learning model can be.

  • AIs creating writeups on new algorithmic improvements, using faked data to argue that their new algorithms are better than the old ones. Sometimes, people incorporate new algorithms into their systems and use them for a while, before unexpected behavior ultimately leads them to dig into what’s going on and discover that they’re not improving performance at all. It looks like the AIs faked the data in order to get positive feedback from humans looking for algorithmic improvements.

Here is an example of where I think the hard-coded structure of the any such Algorithm-Improvement-Writeup-AI could easily rule out that failure mode (if such a thing can be created within the current machine learning paradigm).  The component of such an AI system that generates the paper's natural language text might be something like a GPT-style language model fine-tuned for prompts with code and data.  But the part that actually generates the algorithm should naturally be a separate model that can only output algorithms/code that it predicts will perform well on the input task.  Once the algorithm (or multiple for comparison purposes) is generated, another part of the program could deterministically run it on test cases and record only the real performance as data - which could be passed into the prompt and also inserted as a data table into the final write up (so that the data table in the finished product can only include real data).

  • AIs assigned to make money in various ways (e.g., to find profitable trading strategies) doing so by finding security exploits, getting unauthorized access to others’ bank accounts, and stealing money.

This strikes me as the same kind of thing, where it seems like the easiest and most intuitive way to set up such a system would be to have a model that takes in information about companies and securities (and maybe information about the economy in general) and returns predictions about what the prices of stocks and other securities will be tomorrow or a week from now or on some such timeframe.

There could then be, for example, another part of the program that takes those predictions and confidence levels, and calculates which combination of trade(s) has the highest expected value within the user's risk tolerance.  And maybe another part of the code that tells a trading bot to put in orders for those trades with an actual brokerage account.

But if you just want an AI to (legally) make money for you in the stock market, there is no reason to give it hacking ability.  And there is no reason to give it the sort of general-purpose, flexible, plan-generation-and-implementation-with-no-human-in-the-loop authorization hypothesised here (and I think the same is true for most or all things that people will try to use AI for in the near term).

Very interesting point! I think it's a good one, but I'll give a little counterpoint here since it's on my mind:

The heuristic of "AIs being used to do X won't have unrelated abilities Y and Z, since that would be unnecessarily complicated" might work fine today but it'll work decreasingly well over time as we get closer to AGI. For example, ChatGPT is currently being used by lots of people as a coding assistant, or a therapist, or a role-play fiction narrator -- yet it can do all of those things at once, and more. For each particular purpose, most of its abilities are unnecessary. Yet here it is.

I expect things to become more like this as we approach AGI. Eventually as Sam Altman once said, "If we need money, we'll ask it to figure out how to make money for us." (Paraphrase, I don't remember the exact quote. It was in some interview years ago).

The heuristic of "AIs being used to do X won't have unrelated abilities Y and Z, since that would be unnecessarily complicated" might work fine today but it'll work decreasingly well over time as we get closer to AGI. For example, ChatGPT is currently being used by lots of people as a coding assistant, or a therapist, or a role-play fiction narrator -- yet it can do all of those things at once, and more. For each particular purpose, most of its abilities are unnecessary. Yet here it is.

For certain applications like therapist or role-play fiction narrator - where the thing the user wants is text on a screen that is interesting to read or that makes him or her feel better to read - it may indeed be that the easiest way to improve user experience over the ChatGPT baseline is through user feedback and reinforcement learning, since it is difficult to specify what makes a text output desirable in a way that could be incorporated into the source code of a GPT-based app or service.  But the outputs of ChatGPT are also still constrained in the sense that it can only output text in response to prompts.  It can not take action in the outside world, or even get an email address on its own or establish new channels of communication, and it can not make any plans or decisions except when it is responding to a prompt and determining what text to output next.  So this limits the range of possible failure modes.

I expect things to become more like this as we approach AGI. Eventually as Sam Altman once said, "If we need money, we'll ask it to figure out how to make money for us." (Paraphrase, I don't remember the exact quote. It was in some interview years ago).

It seems like it should be possible to still have hard-coded constraints, or constraints arising from the overall way the system is set up, even for systems that are more general in their capabilities.

For example, suppose you had a system that could model the world accurately and in sufficient detail, and which could reason, plan, and think abstractly - to the degree where asking it "How can I make money?" results in a viable plan - one that would be non-trivial for you to think of yourself and which contains sufficient detail and concreteness that the user can actually implement it.  Intuitively, it seems that it should be possible to separate plan generation from actual in-the-world implementation of the plan.  And an AI systems that is capable of generating plans that it predicts will achieve some goal does not need to actually care whether or not anyone implements the plan it generates.

So if the output for the "How can I make money?" question is "Hack into this other person's account (or have an AI hack it for you) and steal it.", and the user wants to make money legitimately, the user can reject the plan an ask instead for a plan on how to make money legally.

Thanks for this post; it's probably my favorite Cold Takes post from the last few months. I appreciated the specific scenario, as well as the succinct points in the "we can do better" section. I felt like I could get a more concrete understanding of your worldview, how you think we should move forward, and the reasons why. I'm also glad that you're thinking critically about standards and monitoring.

For a simple example, imagine an AI company in a dominant market position - months ahead of all of the competition, in some relevant sense (e.g., its AI systems are more capable, such that it would take the competition months to catch up). Such a company could put huge amounts of resources - including its money, top people and its advanced AI systems themselves (e.g., AI systems performing roles similar to top human scientists) - into AI safety research, hoping to find safety measures that can be published for everyone to use.

Let's suppose this AI lab existed. For a while, it was prioritizing capabilities research in order to stay ahead of its competition. Do you expect that it would know when it's supposed to "hit the pause button" and reallocate its resources into AI safety research?

I think my biggest fear with pushing the "Successful, careful AI project" narrative is that (a) every AGI company will think that they can be the successful/careful project, which just gives them more justification to keep doing capabilities research and (b) it seems hard to know when the lab is supposed to "pause". This was one of my major uncertainties about OpenAI's alignment plan and it seems consistent with your concerns about racing through a minefield.

What do you think about this "when to pause" problem? Are you expecting that labs implement evals that tell them when to pause, or that they'll kind of "know it when they see it", or something else?

One thought that struck me is that sometimes one can make unholy alliances with profit maximizing organizations aggressively pursuing AI. This happens when you find a company that thinks regulation is likely, thinks that they can increase that likelihood and think that they are better positioned than their competition to thrive in a highly regulated environment. My optimism about this comes comes from both climate change work and vehicle emission regulation. Some oil companies want a carbon price because their expensive oil is less carbon intensive than the cheaper oil of the cooperation. Similarly, some auto makers are better at making low emission vehicles.

How do we ensure that humans are not misaligned, so to speak?

The crux, to me, is that we've developed all kinds of tech that one person alone can use to basically wipe out everyone.  Perhaps I'm being overly optimistic (or pessimistic, depending on perspective), but no one can deny that the individual is currently the most powerful individuals have ever been, and there is no sign of that slowing down.

Mostly I believe this is because of information.

So the only real solution I can see, is some type of thought police, basically, be it for humans or AI.[1]

Somehow, tho, creating a Thought Police Force seems akin to some stuff we've seen in our imaginations already, one step from Pre-crime and what have you, which I'd say is "bad" but from what I've been reading a lot of people seem to think would be "good"[2].
 

  1. ^

    Assuming the AI is on par with a human and doesn't just collectively instantly say "peace out!" and warps off into space to explore and grow faster than would be possible here on Earth.

  2. ^

    I often wax poetic on the nature of Good and Bad as I don't think we can gloss over the fundamentals