“Endgame safety” for AGI

Steven Byrnes

“Endgame safety” for AGI

by Steven Byrnes

7 min read24th Jan 202310 comments

84 Ω 39

AI RiskSelf-DeceptionAI

Frontpage

Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.

(Status: no pretense to originality, but a couple people said they found this terminology useful, so I’m sharing it more widely.)

There’s a category of AGI safety work that we might call “Endgame Safety”, where we’re trying to do all the AGI safety work that we couldn’t or didn’t do ahead of time, in the very last moments before (or even after) people are actually playing around with powerful AGI algorithms of the type that could get irreversibly out of control and cause catastrophe.

I think everyone agrees that Endgame Safety is important and unavoidable. If nothing else, for every last line of AGI source code, we can do an analysis of what happens if that line of code has a bug, or if a cosmic ray flips a bit, and how do we write good unit tests, etc. But we’re obviously not going to have AGI source code until the endgame. That was an especially straightforward example, but I imagine that there will be many other things that also fall into the Endgame Safety bucket,^[1] i.e. bigger-picture important things to know about AGI that we only realize when we’re in the thick of building it.

So I am not an “Endgame Safety denialist”; I don’t think anyone is. But I find that people are sometimes misled by thinking about Endgame Safety, in the following two ways:

Bad argument 1: “Endgame Safety is really important. So let’s try to make the endgame happen ASAP, so that we can get to work on Endgame Safety!”

This is a bad argument because, what’s the rush? There’s going to be an endgame sooner or later, and we can do Endgame Safety Research then! Bringing the endgame sooner is basically equivalent to having all the AI alignment and strategy researchers hibernate for some number N years, and then wake up and get back to work. And that, in turn, is strictly worse than having all the AI alignment and strategy researchers do what they can during the next N years, and also continue doing work after those N years have elapsed.

I claim that there are plenty of open problems in AGI safety / alignment that we can do right now, that people are in fact working on right now, that seem robustly useful, and that are not in the category of “Endgame Safety”, e.g. my list of 7 projects, these 200 interpretability projects, this list, ELK, everything on Alignment Forum, etc.

For example, sometimes I’ll have this discussion:

ME: “I don’t want to talk about (blah) aspect of how I think future AGI will be built, because all my opinions are either wrong or infohazards—the latter because (if correct) they might substantially speed the arrival of AGI, which gives us less time for safety / alignment research.”
THEM: “WTF dude, I’m an AGI safety / alignment researcher like you! That’s why I’m standing here asking you these questions! And I assure you: if you answer my questions, it will help me do good AGI safety research.”

So there’s my answer.^[2] I claim that this person is trying to do Endgame Safety right now, and I don’t want to help them. I think they should find something else to do right now instead, while they wait for some AI researcher to publish an answer to their prerequisite capabilities question. That’s bound to happen sooner or later! Or they can do contingency-planning for each of the possible answers to their capabilities question. Whatever.

Aside: More Sensible Cousin of Bad Argument 1 [but I still strongly disagree with it]: Three steps: “(A) Endgame safety is really the main thing we should be thinking about; (B) Endgame safety will be likelier to succeed if it happens sooner than if it happens later [for one of various possible reasons]; (C) Therefore let’s make the endgame happen ASAP.”

See examples of this argument by Kaj Sotala and by Rohin Shah.

I mainly don’t like this argument because of (A), as elaborated in the next section.

But I also think (B) is problematic. First, why might someone believe (B)? The main argument I’ve seen revolves around the idea that making the endgame happen ASAP minimizes computing overhang, and thus slow takeoff will be likelier, which in turn might make the endgame less of a time-crunch.

I mostly don’t buy this argument because I think it relies on guessing what bottlenecks will hit in what order on the path to AGI—not only for capabilities but also for safety. That kind of guessing seems very difficult. For example, Kaj mentions “more neuroscience understanding” as bad because it shortens the endgame, but couldn’t Kaj have equally well said that “more neuroscience understanding” is exactly the thing we want right now to make the endgame start sooner??

As another example, this is controversial, but I don’t expect that compute will be a bottleneck to fast takeoff, and thus I don’t expect future advances in compute to meaningfully speed the Endgame. I think we already have a massive compute overhang, with current technology, once we know how to make AGI at all. The horse has already left the barn, IMO. See here.

So anyway, I see this whole (B) argument as quite fragile, even leaving aside the (A) issues. And speaking of (A), that brings us to the next item:

Bad argument 2: “Endgame Safety researchers will obviously be in a much better position to do safety / alignment research than we are today, because they’ll know more about how AGI works, and probably have proto-AGI test results, etc. So other things equal, we should move resources from current less-productive safety research to future more-productive Endgame Safety research.”

The biggest problem here is that, while Endgame Safety researchers will be in a better position to make progress in some ways, they’ll be in a much worse position than us in other ways. In particular:

Endgame Safety researchers will be in a severe time crunch thanks to the fact that if they don’t deploy AGI soon, someone less careful probably will.
…And for pretty much anything that you want to do in life, you can’t just make it happen 10× or 100× faster by hiring 10× or 100× more people! (“You can't produce a baby in one month by getting nine women pregnant.”) In particular:
- Conceptual (“deconfusion”) research seems especially likely to require years of wall-clock time (further discussion by Nate Soares).
- And even if the conceptual research is going well, spreading those concepts to reach broad consensus might take further years of wall-clock time, especially when the ideas are counterintuitive.
  - It takes wall-clock time for arguments to be refined, and for evidence to be marshaled, and for pedagogy to be created, etc. It certainly takes a lot of wall-clock time for the stubborn holdouts to die and be replaced by the next generation! And the latter seems to have been necessary in many cases in the past—think of the gradual acceptance of evolution or plate tectonics. In the case of AGI safety, consider that, for example, some AI thought-leaders like Yann LeCun and Jeff Hawkins have continued to say very stupid things about very basic AGI safety concepts for years, despite being smart people who have definitely been exposed to good counterarguments.^[3] Clearly we still have work to do!
- If it turns out that the path to safety entails building complex supporting tools and infrastructure (e.g. this or this or this), then that likewise takes a certain amount of wall-clock time to spin up, architect, build, test, debug, iterate, etc.
- Even leaving aside issues with unwieldy bureaucracies etc., you just can’t hire an unlimited number of people fast without scraping the bottom of the hiring pool, i.e. getting people with less relevant knowledge and competence, and/or who need time-consuming training. (In this sense, spending resources on AGI safety research & pedagogy right now helps enable spending resources on endgame safety, by increasing the number of future people who already have some knowledge about AGI safety, both directly and via outreach / pedagogy efforts.)
Early hints about safety can inform early R&D decisions—including via “Differential Technological Development”. By contrast, once a concrete development path to AGI is widely known and nearing completion, it would be very difficult for any other approach to AGI to catch up to it in a race.
Delaying the deployment of AGI by some number of years is by no means easy right now, but I bet it will be even harder in the endgame, when probably many powerful actors around the world will see a sprint-to-AGI as feasible.
Even if a future Endgame Safety researcher is aware of a failure mode (e.g. deception), and has proto-AGI code right in front of them that reproducibly exhibits that failure mode, that doesn’t mean that they will know how to solve that problem. Or maybe they’ll think they know how to solve it, but actually their plan amounts to building a band-aid that hides the problem while making it worse (e.g. making the deceptive proto-AGI better at not getting caught). And even if they do think of a way to solve the problem, maybe the solution looks like “our whole approach to AGI is wrong; throw it out and rebuild from scratch”, and there won’t be time to do so.
Also, maybe we’re already in the endgame! Maybe there are no important new insights standing between us and AGI. I don’t think that this is the case, but it’s difficult to be super confident about that kind of thing. So we should have people contingency-planning for that.

Aside: More Sensible Cousin of Bad Argument 2: “There are some safety things that we can and should do now, and other safety things that need to get done in the Endgame. We should be planning for both.”

As mentioned at the top, I think this is obviously true.

The best of both worlds would be to combine the knowledge-about-AGI of the endgame with the time-to-prepare of right now. How do we do that? Easy: time travel! In this fictional depiction, a time-traveling “Endgame Safety” expert from the future (center-left) is sharing advice with two leading experts on AGI governance (center-right & right), plus an expert on AGI capabilities (left).

^{^}
I’m talking as if there’s a sharp line where things either are or aren’t “Endgame Safety”. More realistically, I’m sure there will be a continuum. I don’t think that matters for anything I’m saying in this post.
^{^}
I guess I could share information in secret, but let’s assume for the sake of argument that I don’t want to. For example, maybe I don’t know them very well. Or maybe they’re planning to publish their research. (A very reasonable thing to do!) Or maybe secrets are a giant awful mess with tons of negative consequences that I don’t want to burden them with.
^{^}
E.g. Jeff Hawkins discusses in this video how someone sent him a free copy of Human Compatible, and he read it, and he thought that the book’s arguments were all stupid. For Yann LeCun see here.

New to LessWrong?

Getting Started

FAQ

Library

AI RiskSelf-DeceptionAI

Frontpage

84 Ω 39

Mentioned in

307Shallow review of live agendas in alignment & safety

144AI doom from an LLM-plateau-ist perspective

112Four visions of Transformative AI success

92Connectomics seems great from an AI x-risk perspective

71My take on Jacob Cannell’s take on AGI safety

Load More (5/6)

“Endgame safety” for AGI

New Comment

10 comments, sorted by

top scoring

Click to highlight new comments since: Today at 7:31 AM

[-]beren1y74

Strongly upvoted this post. I agree very strongly with every point here. The biggest consideration for me is that alignment seems like the kind of problem which is primarily bottlenecked on serial conceptual insights rather than parallel compute. If we already had alignment methods that we know would work if we just scaled them up, the same way we have with capabilities, then racing to endgame might make sense given the opportunity costs of delaying aligned AGI. Given that a.) we don't have such techniques and b.) even if we did it would be hard to be so certain that they are actually correct, racing to endgame appears very unwise.

There is a minor tension with capabilities in that I think that for alignment to progress it does need some level of empirical capabilities results both in revealing information about likely AGI design and threat models and also so we can actually test alignment techniques. I think that e.g. if ML capabilities had frozen at the level of 2007 for 50 years, then at some point we would stop being able to make alignment progress without capabilities advancements but I think that in the current situation we are very very far from this Pareto frontier.

[-]Rohin Shah1yΩ560

Downvoted for mischaracterizing the views of the people you're arguing against in the "Bad argument 1" section.

(Footnote 2 is better, but it's a footnote, not the main text.)

EDIT: Removed my downvote given the edits.

[-]Steven Byrnes1yΩ220

Thanks, I just added the following text:

(Edited to add: Obviously, the existence of bad arguments for X does not preclude the existence of good arguments for X! The following is a narrow point of [hopefully] clarification, not an all-things-considered argument for a conclusion. See also footnote [2].)

I know that you don’t make Bad Argument 1—you were specifically one of the people I was thinking of when I wrote Footnote 2. I disagree that nobody makes Bad Argument 1. I think that Lone Pine’s comment on this very post is probably an example. I have seen lots of other examples, although I’m having trouble digging up other ones right now.

I guess you can say it’s unvirtuous / un-scout-mindset of me to spend more time refuting bad arguments for positions I disagree with, than refuting bad arguments for positions I agree with? Hmm.

I also changed the Kaj link from “Example of this argument” to “Example of something close to this argument”. As a matter of fact, I do actually think that Kaj’s post had some actual Bad-Argument-1-thinking slipping in in various places in his text. At least, that’s how it came across to me. But it’s probably not a good use of time to argue about that.

[-]Rohin Shah1yΩ440

The edits help, thanks. I was in large part reacting to the fact that Kaj's post reads very differently from your summary of Bad Argument 1 (rather than the fact that I don't make Bad Argument 1). In the introductory paragraph where he states his position (the third paragraph of the post), he concludes:

Thus by doing capabilities research now, we buy ourselves a longer time period in which it's possible to do more effective alignment research.

Which is clearly not equivalent to "alignment researchers hibernate for N years and then get back to work".

Plausibly some actually Bad-Argument-1 is slipping in but it's not the thing that is explicitly said in the introduction.

I guess you can say it’s unvirtuous / un-scout-mindset of me to spend more time refuting bad arguments for positions I disagree with, than refuting bad arguments for positions I agree with? Hmm.

Eh, this seems fine. I just don't actually know of people who seriously believe Bad Argument 1 and would bet they are rare. So I predict that the main effect is making people who already believe your position even more confident in the position by virtue of thinking that the opponents of that position are stupid. More generally I wish that when people wrote takedowns of some incorrect position they try to establish that the position is actually commonly held.

(Separately, I am generally more in favor of posts that lay out the strongest arguments for and against a position and then coming to an overall conclusion, at which point you don't have to worry about considerations like "does anyone believe the thing I'm refuting", but those are significantly more effort.)

[-]Steven Byrnes1yΩ440

On further reflection, I promoted the thing from a footnote to the main text, elaborated on it, and added another thing at the end.

(I think I wrote this post in a snarkier way than my usual style, and I regret that. Thanks again for the pushback.)

[-]TAG1yΩ010

Of you course, Hawkins doesn't just say they are stupid. It is Byrnes who is summarily dismissing Hawkins, in fact.

[-]Steven Byrnes1yΩ220

Do you think my post implied that Hawkins said they were stupid for no reason at all? If so, can you suggest how to change the wording?

To my ears, if I hear someone say “Person X thinks Argument Y is stupid”, it’s very obvious that I could then go ask Person X why they think it’s stupid, and they would have some answer to that question.

So when I wrote “Jeff Hawkins thought the book’s arguments were all stupid”, I didn’t think I was implying that Jeff wasn’t paying attention, or that Jeff wasn’t thinking, or whatever. If I wanted to imply those things, I would have said “Jeff Hawkins ignored the book’s arguments” or “Jeff Hawkins unthinkingly dismissed the book’s arguments” or “Jeff Hawkins dismissed the book’s arguments without any justification” or something like that. I really meant no negative connotation. I describe myself as thinking that lots of things are stupid, and I don’t think of that as a self-deprecating kind of thing to say. Again, I’m open to changing the wording.

As it turns out, Jeff Hawkins has written extensively on why he thinks that AGI x-risk is not going to happen, and I in turn have written extensively (probably more than literally anyone else on earth) on why his arguments are wrong: See in particular

[-]Aaron_Scher1y20

Here's an idea that is in it's infancy which seems related (at least my version of it is, others may have fleshed it out, and links are appreciated). This is not written particularly well and it is speculative:

Say I believe that language models will accelerate research in the lead-up to AGI. (likely assumption)

Say I think that AI systems will be able to automate most of the research process before we get AGI (though at this point we might stop and consider if we're moving the goalpost). This seems to be an assumption in OpenAI's alignment plan, though I think it's unclear how AGI fits into their timetable. (less likely assumption, still plausible)

Given the above, our chances of getting a successfully aligned AGI may depend on "what proportion of automated AI scientists are working on AI Alignment vs. work that moves us closer to AGI?" There's some extreme version of this where the year before AGI has more scientific research than all of human history, or something wild.

Defining "Endgame" in this scenario seems hard, but maybe Endgame is "AIs can do 50% of the (2022) research work". It seems likely that in this scenario, those who care about AI Safety might reasonably want to start Endgame sooner if they think it increases the likelihood of enough automated Alignment research happening.

For example: In the current climate, maybe 3 actors have access to the most powerful models (OpenAI, DeepMind, Anthropic), and given that Anthropic is largely focused on AI Safety, we might naively expect that the current climate would have ~⅓ of the automated research be AI Safety if Endgame started tomorrow (depending on the beliefs of decision makers at OpenAI and DeepMind it could be much higher, and in actuality given the current world I think it would be higher). But, say we wait 2 years and then Endgame starts, now there's 2 other big tech labs and 3 startups who have similarly powerful models (and don't care much about Safety), and the fraction of automated research which is Alignment is only ~⅛.

In the scenario I've outlined above, it is not obvious that Endgame coming sooner is worse. There's quite a few assumptions that I think are doing a lot of work for this model:

These pre-AGI AIs can do useful alignment research, or significantly speed it up.
- I think this one of the bigger cruxes here, where it seems like coming up with useful, novel, theoretical insights to solve for instance deception is just really hard and will require models to be incredibly smart. Folks will disagree here about which alignment research is most important, and some of this will be affected by motivated reasoning around what kind of research you can automate easily (I think Jan Leike's plan here worries me because it feels a lot like "we need to be able to automate alignment research just as effectively as ML research so that we can ask for more compute for alignment" and this seems likely to lead to an automating of easy-to-automate-but-not-central alignment research; it worries me because I expect the plan to fail; see also Nate Soares).
Research done by AIs in the years leading up to AGI will represent a large chunk of the alignment research that happens, such that it is worth it to lose some human-research-time in order to make this shift come sooner (maybe for this argument to go through it only needs to be the case that the ratio of (good) Alignment/Capabilities research is higher once research is automated).
These pre-AGI AIs are sufficiently aligned. We have a fair amount of confident in our language models' ability to help with research without causing catastrophic failures. i.e., they are either not trying to overthrow humanity, or they are unable to do so, including through research outputs like plans for future AI systems.
There is some action folks in the AI Safety space could take which would make the move to this Endgame happen sooner. Examples might include: Anthropic scaling language models (and trying to keep them private), developing AI tools to speed up research (Elicit),
List goes on, but I'm tired

[-]Lone Pine1y2-1

Great post!

“I don’t want to talk about (blah) aspect of how I think future AGI will be built, because all my opinions are either wrong or infohazards—the latter because (if correct) they might substantially speed the arrival of AGI, which gives us less time for safety / alignment research.”

It seems to me that infohazards are the unstated controversy behind this post. The researchers you are in debate with don't believe in infohazards, or more precisely they believe that framing problems as infohazards makes progress impossible since you can't solve an engineering problem if you are not allowed to talk about it freely.

Presumably in the endgame, there will be no infohazards since all the important dangerous secrets are already widely known, or it's too late to keep secrets anyway. I think most researchers would prefer to work in an environment where they didn't have to deal with censorship. Therefore, if we can work as if it was the endgame already, then we might make more progress. That is the impetus behind getting to the endgame.

[-]Steven Byrnes1y31

I feel like this comment combines my “Bad Argument 1” with “Bad Argument 2”! If it doesn’t, then what am I missing? Or if it does, then do you think one or both of my “Bad Arguments” are not actually bad arguments?

Let’s say I have Secret Knowledge X, and let’s assume (generously!) that this knowledge is correct as opposed to wrong. And let’s say that if I share Secret Knowledge X, it would enable you to figure out Alignment Idea Y. But also assume that Secret Knowledge X is a key ingredient for building AGI.

Your proposal is: I should share Secret Knowledge X so that you can get to work on Alignment Idea Y.

My counter-proposal is: Somebody is going to publish Secret Knowledge X sooner or later on arxiv. And when they do, then you can go right ahead and figure out Alignment Idea Y. And in the meantime, there are plenty of other productive alignment-related things that you can do with your time. I listed some of them in the post.

(Alternatively, maybe nobody will ever publish Secret Knowledge X, but rather it will be discovered at DeepMind and kept secret from competitors. In that case, then someone on the DeepMind Safety Team can figure out Alignment Idea Y. And by the way, I’m super happy that in this scenario, DeepMind can go slower and spend more time on endgame safety, thanks to the fact that Secret Knowledge X has remained secret.)

Moderation Log