Accurate Models of AI Risk Are Hyperexistential Exfohazards

Thane Ruthenis

(Where "an exfohazard" is information which leads to bad outcomes if known by a large fraction of society.)

Let us suppose that we've solved the technical problem of AI Alignment — i. e., the problem of AI control. We have some method of reliably pointing our AGIs towards the tasks or goals we want, such as the universal flourishing of all sapient life. As per the Orthogonality Thesis, no such method would allow us to only point it at universal flourishing — any such method would allow us to point the AGI at anything whatsoever.

Which means that, if we succeed at the technical problem, there'll be a moment at the very end of the world as we know it, where a person or a group of people will be making a decision regarding the future of the universe.

Above and beyond preventing an omnicide, we need to ensure that in the timelines where we do solve the technical problem, this decision is made right.

1. Establishing the Framework

The one good thing about the technical problem of alignment is that it makes hyperexistential risks — the risks of astronomical suffering — very unlikely.

The problem of AI Alignment can be viewed as the problem of encoding our preferences into an AGI, bit by bit. The strength of alignment tools, in turn, translates to how many bits we can encode. With the current methods of end-to-end training, we're essentially sampling preferences at random. Perfect interpretability and parameter-surgery tools would allow us to encode an arbitrary amount of bits. The tools we'll actually have will be somewhere between these two extremes.

"Build us our perfect world" is a very complicated ask, and it surely takes up many, many thousands of bits. That's why alignment is hard.

"Build us a hell" is its mirror. It's essentially the same ask, except for a flipped sign. As such, specifying it would require pretty much the same amount of bits.

Thus, in the timelines where we have alignment tools advanced enough to build a hell-making AGI, it's overwhelmingly likely that we have the technical capability to build a utopia-building AGI. On the flipside, conditioning on our inability to build a utopia-builder, our tools are probably so bad we can't come close to a hell-builder. In that case, we just sample some random preferences, and the AGI kills us quickly and painlessly.

Screwing up so badly we create a suffering-maximizer is vanishingly unlikely: it's only possible in a very, very narrow range of technical capabilities.
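The bit-counting argument above can be made concrete with a toy calculation. This is only an illustrative sketch under assumptions that are not in the original argument: preferences are modeled as a fixed-length bit string, our tools let us set `k_controlled` of those bits, and every bit we cannot set is an independent fair coin flip. The function name `p_hit_by_luck` and the spec size `N_BITS` are hypothetical.

```python
from fractions import Fraction


def p_hit_by_luck(n_bits: int, k_controlled: int) -> Fraction:
    """Probability that every uncontrolled bit matches a fixed target by chance.

    Toy assumption: each of the (n_bits - k_controlled) bits we cannot set
    is an independent fair coin flip.
    """
    if not 0 <= k_controlled <= n_bits:
        raise ValueError("k_controlled must lie between 0 and n_bits")
    return Fraction(1, 2 ** (n_bits - k_controlled))


# Hypothetical spec size. "Hell" is modeled as the utopia spec with one sign
# bit flipped, so it needs exactly as many matched bits as utopia does.
N_BITS = 1000

for k in (0, 500, 990, 1000):
    p = p_hit_by_luck(N_BITS, k)
    # The denominator is a power of two; recover its exponent for display.
    exponent = p.denominator.bit_length() - 1
    print(f"k={k:4d}: chance of matching either target = 2^-{exponent}")
```

Under these assumptions, the chance of landing on either target stays astronomically small until `k_controlled` is almost equal to `n_bits`, and at that point both targets are within reach. That is the sense in which the window for an accidental hell-builder is narrow.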

Note the emphasis, though: technical capabilities.

The question of how these capabilities are used is an entirely different one.

2. How Bad Can It Be?

The preferences of most people, and most groups of people, are not safe to enforce upon the future.

It's trivially true if we look at history: non-secular countries would consider it Just to see billions of people eternally tortured in their gods' hells, xenophobic nations would genocide their enemies, power-hungry sociopaths ruling moral mazes would instinctively wish to erase all value from the universe, and so on.

Less than a century ago, most "progressive countries" would've erased today's protected minorities. And even the random sample of contemporary "good guys" would probably be happy to do something extremely monstrous to, say, billionaires or rapists or something.

Then there's the problem of "human chauvinism". Not even mundane humanism is safe to maximize, inasmuch as it may exclude sapient animals, uploads, AIs, or aliens.

The bottom line is, most people are not drunk on overly nuanced sci-fi-ish altruistic philosophy. "What is the universal good, defined mathematically?" is a question that's extremely off-distribution for them; most didn't spend a minute in their life seriously contemplating it, and have none of the requisite background. They don't have coherent preferences over the far-off future, and if forced to compile them on the spot, they'd produce something incoherent and pretty bad. Worse than annihilation.

And that's not even taking into account people who've actually been selected for monstrous qualities, or incentivized to develop them. And such people are most likely to end up in power, inasmuch as power is correlated with signaling ruthlessness or radical patriotism, and being willing to climb to the top while trampling others underfoot.

So we can't trust major systems, we can't trust the public consensus, we can't trust random people, and we can't trust random powerful people.

As such, there's a major sociopolitical dimension to AI Risk — beyond ensuring that we can point the AGI at the utopia, we need to ensure that the AGI actually ends up pointed at it.

Otherwise, getting AI Alignment solved would be much worse than staying back and letting humanity paperclip itself out of existence.

3. What Does This Mean For Our AI Policy Work?

To be clear, that doesn't mean that I think we should stop all sociopolitical activism. We just need to be careful about it. The specific outcomes we want to avoid are:

1. The higher echelons of some government or military develop an accurate model of AI Risk.
   - They'd want to enforce their government's superiority, or national superiority, or ideological superiority, and they'd trample over the rest of humanity.
   - There are no eudaimonia-interested governments on Earth.
2. The accurate model of AI Risk makes its way into the public consciousness.
   - The "general public", as I've outlined, is not safe either. And in particular, what we don't want is some "transparency policy" where the AGI-deploying group is catering to the public's whims regarding the AGI's preferences.
   - Just look at modern laws, and the preferences they imply! Humanity-in-aggregate is not eudaimonia-aligned either.
3. A large subset of wealthy or influential people not pre-selected by their interest in EA/LW ideas form an accurate model of AI Risk.
   - We'd either get some revenue-maximizer for a given corporation, or a dystopian dictatorship, or some such outcome.
   - And even if the particular influential person is conventionally nice, we get all the problems with sampling a random nice individual from the general population (the off-distribution problem).

By "accurate model" here I mean "the orthogonality thesis + the real power of intelligence + the complexity of human preferences". The model with enough gears that it'd allow people to ask, "aligned to whose preferences?", and then wonder if maybe it can be their personal preferences.^[1]

Note what this doesn't exclude:

Communicating a more opaque model of AI Risk to politicians/the public. A model that just tells people that scaling up capabilities will lead to bad outcomes, without a particularly nuanced understanding of why.
- This should still be sufficient to slow down the timelines, and pass regulations controlling AI development. E. g., the current uproar about AI art theft is a good example. It's not even in the neighborhood of "hey, can we use this to create an Old Testament God?", yet it can lead to semi-effective regulations.
Convincing select powerful individuals, e. g. the leadership of major AI Labs, or the most prominent AI researchers.
- They weren't strongly pre-selected for monstrousness or an adherence to some outdated ideology, and are plausibly both (1) at-least-conventionally-nice and (2) willing to listen to philosophical arguments.
- If the system they're embedded in won't go crazy trying to pressure them to develop e. g. a revenue-maximizer, they'll probably be free to use their actual moral reasoning to decide on the AGI's preferences.
- (Not that I'm saying it's completely safe. See: the recent case where an EA-aligned billionaire turned out to be... not conventionally nice.)

Also: transferring accurate models is pretty hard, actually. Most people don't take ideas seriously, and the more abstract an idea is, the more unlikely it is to be taken seriously. I think it's a major factor in why so many economically profitable technologies are successfully restricted. And inasmuch as "AI is dangerous" is much less abstract than "AI can be used to impose arbitrary values upon the future", it actually shouldn't be too difficult to increase the timelines without increasing the hyperexistential risk at the same time.

But one should still be mindful not to succeed too hard.

4. Wait, So Who Should Be In Charge Then?

A small group of philosophically-erudite, altruistically-minded people, probably sampled from this community.

No, I don't like the optics on that either. It irks my aesthetic senses. It makes me feel like a non-genre-savvy supervillain, especially when I concretely imagine that future of building a doomsday device in secret in my basement.

But, like, it seems that any other approach is much more likely to lead to bad outcomes, after thinking it through at the object level?

One point to make, here:

Power Does Not Corrupt

I think "power corrupts" is factually incorrect, as platitudes go. It's almost paradoxical: how exactly can a boost to your ability to enforce your preferences upon reality make these preferences worse? And, by implication, does that mean we should expect the powerless to be most un-corrupted? The people struggling to make ends meet, driven mad by hunger and deprivation, oppressed and trampled-over — we should expect them to be paragons of ascended philosophical virtue?

No, what corrupts isn't power. What corrupts is the road to power, and what one has to do to keep power.

1. The powerful are pre-selected to be the sort of people who primarily optimize for getting power. Thus, they're much more likely than average to be the kinds of people who'd use the handles of knives embedded in others' backs as handholds.
2. Once one has power, it's a constant struggle to maintain that power while protecting yourself from the power-hungry individuals described in (1). Even if you started out decent, it's pretty difficult not to sink to their level. And if you started out bad, why, you'd just get worse.

This is why most people in power are corrupt. Not because "power" has a magical property of turning people into monsters.

And the people who'd acquire absolute power over the future in this hypothetical would acquire it by very different means, compared to the usual. And they would not need to hold onto it.

So, while it's not guaranteed that they'd be nice, there at least isn't any prior reason to think they'd be evil. Updating off "they're in power" is incorrect, here: corruption is correlated with power, but not caused by it.

... Which is not to say our community is safe. We're as vulnerable to being taken over by power-seekers as any other group.

This is an additional reason not to broadcast how much potential power lies down this road.

5. Critique of Specific Ideas

Long Reflection

There's a plan that goes:

1. Figure out "strawberry alignment", i. e., how to make an AI pursue some concrete, localized goal like "synthesize a single strawberry", without committing omnicide and e. g. tiling the universe with strawberries or weird upstream-strawberry-correlates.
   - This is contrasted with more complex goals like "build a utopia", which combine the difficulty of AI control with the philosophical difficulty of "what even is a utopia?".
2. Use this weakly-aligned AI to "end the acute risk period": somehow slow down or halt unsafe AGI research world-wide.
3. There's some exit condition on this research ban: maybe it's lifted after a century, maybe some person or group of people have the authority to lift it, maybe there's some other recognition function on when it's fine to do so.
4. It's implied that, once the ban is lifted, humanity has matured enough to figure out its preferences and build an AGI implementing them safely.

First off, I'm skeptical that "strawberry alignment" is a thing. "Create a strawberry" is deeply value-laden in itself: it includes all the clauses like "but don't murder people over it" and what a "real strawberry" means, etc. I think if we can get the AGI to do that, if we can encode this many bits of preferences into it, we can probably just say "build a utopia" and have that command be safely executed too. The AGI will either know what we really mean, or help us figure it out.

However, if this is possible, I think this just leads to us building a hell. An AGI that can't build a utopia can't distinguish a hell from a utopia, so the recognition function on "what preferences should we enforce upon the future" is implemented by the whole of humanity, and...

I think about it as... Imagine a list of existential and hyperexistential risks, ordered by probability-of-occurrence. This scheme doesn't somehow resolve AI Risk, in a complex way that updates the probabilities of all the risks below it. It just strikes this first item off the list.

And I think what we have just below it is "Eternal-Dystopia Risk, probability 90%+".

So we end the acute risk period, and then immediately find ourselves in a totalitarian hellscape where sub-AGI drone swarms quell any rebellion, trillions of simulated humans are exploited for cheap labor, cults with procedurally-generated synthetic ideologies burn through minds like wildfire, etc., etc.

And then the condition for lifting the ban on AGI research is met, and all these marvels are enforced upon the future forever.

Again, this is worse than just sitting back and letting omnicide happen. So if it's possible to strawberry-align but not utopia-align an AGI, and you face the choice between proliferating strawberry-alignment and doing nothing, it's better to do nothing.

These Arms Race Models

Point 1: I think the people in charge of such decisions aren't going to be using nuanced rational models like this. They weren't an accurate description of governments' thinking regarding nuclear MAD, and they won't be an accurate description of the AGI race.

In particular, I expect no-one is going to pay attention to the "safety generalization" parameter. For our work to be used to help these heathens lock-in their barbaric values? No, better classify all of it!

Point 2: If the Powers that Be do coordinate to finish alignment research before implementing their AGI, and so succeed at aligning their AGI with their values, that would be a hyperexistential catastrophe.

If the relevant players know that what they're racing over isn't just a weapon, but the entirety of the future, then humanity has already lost.

Dying With Dignity

It's fine to maximize for "death with dignity", i. e. to attempt to increase the log odds of humanity's survival... if you think that "not dying" is always preferable.

But, uh. What if you're successful? What if you get enough dignity points not to die...

And then survive in an undignified manner?

As above: better to die, I think.

6. Conclusion

I don't think we should give up and aim for an omnicide!

I think it's totally possible to get the sociopolitical part of the problem right! Especially in the possibility-branch where we succeed at technical alignment. There aren't many actors, right now, who are projected to eventually get the capability to deploy an AGI, and they're not controlled by anti-altruistic people, and there aren't (to my knowledge) any powerful anti-altruistic organizations that take AI Alignment ideas seriously! We can totally get this done.

But to do that, we should shut up about some aspects of the problem. Public proliferation of accurate models of AI Risk is not conducive to a marvelous future.

Raising the awareness of generic "dangers of AI capabilities", and inviting funding towards generic "AI Safety research"? Sure, that's fine. And also much easier than actually transferring a nuanced understanding! In fact, transferring an accurate model is probably so difficult you shouldn't need to worry about accidentally doing it at all! (I even approve of the general message of e. g. this post.)

But if you do find a way to greatly increase the timelines at the cost of cluing a lot of people in regarding what's really on offer here, regarding what the good outcome can be, regarding the fact that "AI Alignment" research is actually "AI Control" research, don't do it. There are fates worse than death, and you'd be beckoning them.

... I'm not sure I should've even written this post, to be honest. I think it'd be pretty bad if "should we be supervillains trying to unilaterally steer the future of humanity?" becomes a frequent part of discourse. And of course I'm also spelling out here why all the governments/corporations/sociopaths should be looking in this direction, and while the effect should be very small, I'm not exactly redirecting their attention.

But the sentiment that we should do more public outreach has been picking up this year, and I think it'd lead to worse-in-expectation outcomes if I don't present this counter-argument.

I am also acutely aware that this potentially stokes the flames of adversity within the EA/LW community, between those who'd disagree with me and those who'd agree.

But, yeah. I think we should all shut up about certain matters.

^[1]
There's an objection here that goes, "but come on, the person who'd be actually coding-in the AI's preferences won't be the embodied avatar of the Government/the Will of the Public/the power-hungry sociopath personally, it'd be some poor ML engineer, and they'd realize what a mistake this is and go rogue and heroically code-in altruistic preferences instead!". And yeah, that's totally how it'd go in real life! You know, like how back in the early March of this year, a bunch of Russian siloviki and oligarchs realized how much suffering a single person's continued existence will bring upon the world and upon them personally, and just coordinated to unceremoniously shank him. That happened, right?
It's not really how this works.

"I wouldn't abuse power." is a lie so old that we evolved to believe it and act like it, in order to convince others, until we're in power. See Why Does Power Corrupt?.

I believe I address most of that in the post. The "evolution selected for it" explanation doesn't actually explain much about the mechanism by which power is correlated with corruption.

I suppose one additional "mechanism" is if one is immoral to begin with, and upon acquiring power, it just allows them to act on previously-suppressed urges to hurt people.

But if a person is selected not on the basis of power-hunger but on the basis of intellectual curiosity and commitment to abstract altruistic ideals, and then only has to "exercise" that power once by way of writing some lines of code, while supervised by other people selected on the basis of curiosity/abstract altruism? Seems plausible that it won't involve any corrupting incentives.

I guess we can end up with several plausible targets for the AI, and then there'd be a disagreement over them, and the most vicious guy will win, and e. g. the chosen target will be DWIM, which will make the paradigm shift slow enough to potentially give the vicious guy time to feel threatened by the opposition, which will make them abuse the power and therefore get corrupted by it...

But, uh, that can happen with any organization tasked with deploying the AGI? Except if it's done "through the proper channels", on top of this problem, we also have the problem where the people nominally in charge of the AGI deployment were subjected to all the corrupting incentives.

There's a kind of equivalence between being immoral to begin with, which is suppressed all our lives, and moral to begin with, but corrupted upon gaining power. To resolve ontological crises, I recommend https://arbital.com/p/rescue_utility. Afaic, the instinct to identify with past and future inhabitants of our body has the game-theoretic purpose of establishing mutual cooperation. This suggests identifying with whichever aspects of oneself would mutually cooperate.

Sure, one can avert corruption by never triggering the ancient condition of feeling in power. That's perhaps a core purpose of democracy.

That non-genre-savvy supervillain feeling? Check whether this was a case of "Honest Inside View vs. Unfortunate Deontological Injunction" or of "Ancient politics adaptation vs. Deontological Injunction working exactly as intended".

the key task is to prevent a single moment of total control. if one occurs, we failed. pivotal acts are the problem, not the solution.

Moments of total control may be a very difficult feature of the gameboard to eliminate.

yep. that's kind of the whole challenge, isn't it? I think we can do it. it's time to just shut up and do the impossible, can't walk away...

an alternate perspective - what if everyone already has, ehh, near-total control? it's a property of chaotic systems that small elements can have cascading effects, after all...

If we were all just butterflies flapping in the breeze of fate, I'd call that near-no control not near-total control. Am I missing something about this line of reasoning?

yeah I think that was just a bad train of thought and I was wrong.

How do you propose to robustly align the "pivotal process" (?) that we'd want to have in place of the pivotal act? How do we ensure that it'll output eudaimonic values, instead of being taken over by the power-hungry, as such processes are wont to do?

by describing to each other efficiently enough how to prevent takeover by the power hungry at every level, anything less than a solution to loss of slack to power seekers is insufficient to prevent power seekers from suddenly eating up all of humanity's slack.

or in other words, I don't have a magic solution to post in this comment, but a solution has to be something that solves this above everything else. It is a challenge of hyper generalizing defense analysis because now we have to write down morality precisely enough that we don't get into a war with a new species.

I genuinely think it might be harder than the actual technical problem of alignment, and we ought to look for any path which isn't that hopelessly doomed.

Let us suppose that we've solved the technical problem of AI Alignment — i. e., the problem of AI control. We have some method of reliably pointing our AGIs towards the tasks or goals we want, such as the universal flourishing of all sapient life. As per the Orthogonality Thesis, no such method would allow us to only point it at universal flourishing — any such method would allow us to point the AGI at anything whatsoever.

This does not hold if we get alignment by default.

Or any similar scheme where we solve the alignment problem via training the AI (via e.g. self supervised learning) on human generated/curated data (perhaps with additional safety features).

[Janus' Simulators cover important differences of self supervised models in more detail.]

It may be the case that such models are the only generally intelligent systems, but systems trained in such a way do not exhibit strong orthogonality.

And it does not follow that we can in full generality, point such systems at arbitrary other targets.

I disagree with the first claim, primarily due to the only part. I do believe that language modeling might be enough, though I also think certain other paths are enough, like RL. It's just that SSL took off first here.

Well, the argument holds if there's a meaningful time period in which all general AI systems were trained via self supervised learning.

Fair point, I should've mentioned alignment by default. That said, even the original post introducing it considers it ~10% likely at best.

I think Wentworth is too pessimistic:

- Wentworth's scheme of alignment by default is not the only route to it
- We might get partial alignment by default and strengthen it

There are two approaches to solving alignment:

1. Targeting AI systems at values we'd be "happy" (were we fully informed) for powerful systems to optimise for [AKA intent alignment] [RLHF, IRL, value learning more generally, etc.]
2. Safeguarding systems that are not necessarily robustly intent aligned [Corrigibility, impact regularisation, boxing, myopia, non agentic systems, mild optimisation, etc.]

We might solve alignment by applying the techniques of 2, to a system that is somewhat aligned. Such an approach becomes more likely if we get partial alignment by default.

More concretely, I currently actually believe (not just pretend to believe) that:

- Self supervised learning on human generated/curated data will get to AGI first
- Systems trained in such a way may be very powerful while still being reasonably safe from misalignment risks (enhanced with safeguarding techniques) without us mastering intent alignment/being able to target arbitrary AI systems at arbitrary goals

I really do not think this is some edge case, but a way the world can be with significant probability mass.

Nah, this is way too dark and gritty.

Yes, obviously we should be consequentialists in what we emphasize. Yes, people in political power are precisely those selected to think "but what if the AI just maximized my values?" sounds extra-great.

But this worse-than-death dystopia really comes out of nowhere. Who is it that actually prefers that future? Why does torturing people give you a competitive edge when there's superintelligent AI running around, ready to give a huge competitive edge to whoever asks the right questions?

There are certainly sci-fi answers to these questions (the bad guy actively prefers suffering, torturing humans is an efficient way to get things done because we need uploads to do all our work for us), but they don't seem likely in real life. If we succeed technically but fail spectacularly at coordination, the most likely outcome - aside from death for all because political mechanisms often fail at implementing technical solutions - seems better than the status quo, because most humans want good things to happen for humanity in the abstract.

Who is it that actually prefers that future?

Tons of people? Xenophobes, homophobes, fascists, religious fanatics, elitists of various flavors. Some governments are run by such people; some countries have a majority of such people.

I'm not saying torture gives you a competitive edge (where did I say that?), I'm saying a lot of people genuinely prefer terrible fates for their outgroups. And while, sure, getting exposed to said outgroups may change their minds, it's not their current values, and the AI wouldn't care about their nice counterfactual selves who'd learned the value of friendship. The AI would just enforce their current reflectively endorsed preferences.

Even religious fanatics I'd call incoherent even more than they are malicious. Sure, the Taliban want unbelievers to be punished, but they also want God to be real and for the unbelievers to convert to the true faith.

When you talk about their "current values" without any process of growth, I don't think there's any there there - it's a big mess, not a utility function. Talking about good processes of growth is a vital part of getting an AI to do something that looks like "what you want."

Okay, maybe you could get to dystopia without just killing everyone by building an AI that tries to do some very specific thing ("maintain US military supremacy"), but only in the way that people typically imagine that very specific thing (can't just kill all humans and maintain empty U.S. military bases). But mostly I'd expect we'd just die.

When you talk about their "current values" without any process of growth, I don't think there's any there there - it's a big mess, not a utility function

Sure, yes, exactly my point. The problem is, you don't need to untangle this mess, or care about having coherent values, to tell an AGI to do things. It's not going to loop back to you and complain that what you're telling it to do is incoherent, inasmuch as you've solved the control problem and successfully made it do what you want. It'll just do what you want, the way you're imagining it, however incoherent it is.

"Maintain US military supremacy the way I typically imagine it" is, in fact, the primary use-case I have in mind, not a weird, unlikely exception.

Talking about good processes of growth is a vital part of getting an AI to do something that looks like "what you want."

How so? I have wants now. Why do I have to do some kind of "growth", for these wants to become legitimate? What'd prevent an AGI from understanding them as they are now?

The most useful / potentially-behavior-changing part of the post for me is the section describing how certain groups shouldn't develop detailed models of AI risk (pasted below). But the arguments are light on details. I'd like to see a second post building a more detailed model of why you think these outcomes are net negative.

The specific outcomes we want to avoid are:
The higher echelons of some government or military develop an accurate model of AI Risk.
They'd want to enforce their government's superiority, or national superiority, or ideological superiority, and they'd trample over the rest of humanity.
There are no eudaimonia-interested governments on Earth.
The accurate model of AI Risk makes its way into the public consciousness.
The "general public", as I've outlined, is not safe either. And in particular, what we don't want is some "transparency policy" where the AGI-deploying group is catering to the public's whims regarding the AGI's preferences.
Just look at modern laws, and the preferences they imply! Humanity-in-aggregate is not eudaimonia-aligned either.
A large subset of wealthy or influential people not pre-selected by their interest in EA/LW ideas form an accurate model of AI Risk.
We'd either get some revenue-maximizer for a given corporation, or a dystopian dictatorship, or some such outcome.
And even if the particular influential person is conventionally nice, we get all the problems with sampling a random nice individual from the general population (the off-distribution problem).

There's an objection here that goes, "but come on, the person who'd be actually coding-in the AI's preferences won't be the embodied avatar of the Government/the Will of the Public/the power-hungry sociopath personally, it'd be some poor ML engineer, and they'd realize what a mistake this is and go rogue and heroically code-in altruistic preferences instead!". And yeah, that's totally how it'd go in real life! You know, like how back in the early March of this year, a bunch of Russian siloviki and oligarchs realized how much suffering a single person's continued existence will bring upon the world and upon them personally, and just coordinated to unceremoniously shank him. That happened, right?

It's not really how this works.

It's not how it works because coups require coordination between a lot of different parties who have limited visibility into one another, and an immediate incentive to snitch to the administration. But here such coordination may be unnecessary, depending on technical details and how well organized the AGI research effort is. If one or two insiders can take unilateral action to completely steal the lightcone, then it's more akin to being able to shank the entire Russian military and FSB, as it were.

I expect one or two insiders wouldn't be enough; that the actual technical implementation of the AI's target will require coordination with, say, a dozen people at different stages of the process, or at least that the target will be visible/verifiable by many people at different stages of the process. And if the people in charge actually understand the stakes, it'll probably be then cross-reviewed by a different group entirely, before being deployed.

It'd still be possible to steal the lightcone unilaterally/with 1-2 collaborators, but it'd require defeating security measures built specifically against this sort of thing. I. e. the rogue actor would need to be (1) someone in the position on the project with the skills to code-in a different target, (2) willing to defy orders/ideology/procedure head-on like this, and (3) competent at conspiracy.

Which makes it a clearer analogy to a coup, I think. There is a disanalogy in that if the conspiracy is successful, the odds of being punished for it are near-zero, but I'm not convinced it'd be enough.

> I expect one or two insiders wouldn't be enough; that the actual technical implementation of the AI's target will require coordination with, say, a dozen people at different stages of the process, or at least that the target will be visible/verifiable by many people at different stages of the process. And if the people in charge actually understand the stakes, it'll probably be then cross-reviewed by a different group entirely, before being deployed.

There's a limit on how much siloing and information security a state-sponsored AGI research team can afford when it's competing with teams like DeepMind that don't necessarily have to operate under the same constraints. My guess is that there will be a dozen or so managers and system administrators capable of doing this sort of thing by default. If team members are incapable of understanding and modifying their AGI systems individually, how are they supposed to keep up with the leading edge?

> It'd still be possible to steal the lightcone unilaterally/with 1-2 collaborators, but it'd require defeating security measures built specifically against this sort of thing. I. e. the rogue actor would need to be (1) someone in the position on the project with the skills to code-in a different target, (2) willing to defy orders/ideology/procedure head-on like this, and (3) competent at conspiracy.

It seems much more likely to me that #2 and #3 will end up being satisfied by leading AI researchers than by base-rate military and police officials. Not because being a genius automatically makes you a nonconformist, but because there is much less slack and data available for filtering AI researchers for loyalty than when, e.g., appointing a new FSB chief. And the more these governments reject von Neumanns in favor of less risky applicants, the harder it will be for them to compete.

> There's a limit on how much siloing and information security a state-sponsored AGI research team can afford when it's competing with teams like DeepMind that don't necessarily have to operate under the same constraints

Mm, I'm not sure we're talking about the same problem? I'm saying that a lot of people will have read-access, and each of them would be able to notice that there's something very wrong with the AI's target, and then they'll need not to raise a fuss about it, for the conspiracy to succeed.

> there is much less slack available for selecting AI researchers for loyalty

In which case there'll be more oversight over them, no?

In addition, the leading engineers won't necessarily need to be genius-level. The people doing foundational alignment research would need to be, but if we're in a hypothetical where we have alignment tools good enough to avoid omnicide, we're past the stage where theory was the bottleneck. In that world, verifying the AI's preferences should be trivial even for merely normally competent ML engineers.

Which you'd then select for loyalty and put in the oversight team.

This post was incredibly interesting and useful to me. I would strong-upvote it, but I don't think this post should be promoted to more people. I've been thinking about the question of "who are we aligning AI to" for the past two months.

I really liked your criticism of the Long Reflection because it is refreshingly different from e.g. MacAskill and Ord's writing on the long reflection. I'm still not convinced that we can't avoid all of the hellish things you mentioned, like synthetic superstimuli cults and sub-AGI drones. Why can't we just have a simple process of open dialogue with values of truth, individual agency during the reflection, and some clearly defined contract at the end of the long reflection to like, take power away from the AGI drones?

Thanks!

> Why can't we just have a simple process of open dialogue with values of truth, individual agency during the reflection, and some clearly defined contract at the end of the long reflection to like, take power away from the AGI drones?

How is that process implemented? How do we give power to that process, which optimizes for things like "truth" and "individual agency", over processes that optimize just for power; over processes that Goodhart for whatever metric we're looking at in order to decide which process to give power to?

And if we have some way to define "truth" and "individual agency" directly, and tell our strawberry-aligned AI to only give power to such processes — is it really just strawberry-aligned? Why can't we instead just tell it to build an utopia, if our command of alignment is so strong as to robustly define "truth" and "individual agency" to the AI?

Yep, fair point. In my original comment I seem to have forgotten about the problem of AIs Goodharting our long reflection. I probably agree now that doing a pivotal act into a long reflection is approximately as difficult as solving alignment.

(Side-note about how my brain works: I notice that when I think through all the argumentative steps deliberately, I do believe this statement: "Making an AI which helps humans clarify their values is approximately as hard as making an AI care about any simple, specific thing." However, it does not come to mind automatically when I'm reasoning about alignment. Two possible fixes:

1. Think more concretely about Retargeting the Search when I think about solving alignment. This makes the problems seem similar in difficulty.
2. Meditate on just how hard it is to target an AI at something. Sometimes I forget how Goodhartable any objective is.)

I totally agree with the core logic. I've been refraining from spreading these ideas, as much as I want to.

Here's the problem: Do you really think the whole government and military complex is dumb enough to miss this logic, right up to successful AGI? You don't think they'll roll in and nationalize the efforts when the power of AI keeps on progressively freaking people out more and more?

I think a lot of folks in the military are a lot smarter than you give them credit for. Or the issue will become much more obvious than you assume, as we get closer to general AI.

But I don't think that's necessarily going to spell doom.

I hope that emphasizing corrigibility might be adequate. That would at least let the one group who've controlled creation of AGI change their minds down the road.

I think a lot of folks in the government and military might be swayed by logic, once they can perfectly protect and provide abundantly for themselves and everyone they value. Their circle of compassion can expand, just like everyone here has expanded theirs.

> Here's the problem: Do you really think the whole government and military complex is dumb enough to miss this logic, right up to successful AGI? You don't think they'll roll in and nationalize the efforts when the power of AI keeps on progressively freaking people out more and more?

If the timelines are sufficiently short, and the takeoff sufficiently hard, they may not have time to update. (If they haven't already, that is.)

Yes. But that seems awfully unlikely to me. What would it need to be, two years from now? AI hype is going to keep ramping up as ChatGPT and its successors are more widely used and improved.

If the odds of slipping it by governments and militaries are slight, wouldn't the conclusion be the opposite - we should spread understanding of AGI alignment issues so that those in power have thought about them by the time they appropriate the leading projects?

This strikes me as a really practically important question. I personally may be rearranging my future based on what the community comes to believe about this.

Edit: I think the community tends to agree with you and be working in hopes that we reach the finish line before the broader world takes note. But this seems more like wishful thinking than a realistic guess about likely futures.

> If the odds of slipping it by governments and militaries are slight, wouldn't the conclusion be the opposite - we should spread understanding of AGI alignment issues so that those in power have thought about them by the time they appropriate the leading projects?

The original post has been arguing that this leads to a hyperexistential catastrophe, and it's better to let them destroy everything, if they are to win the race.

But you have a different model implied here:

> I hope that emphasizing corrigibility might be adequate. That would at least let the one group who've controlled creation of AGI change their minds down the road.

Can you describe in more detail how you picture this going? I think I can guess, and I have objections to that vision, but I'd prefer if you outline it first.

You are probably guessing correctly. I'm hoping that whoever gets ahold of aligned AGI will also make it corrigible, and that over time they'll trend toward a similar moral view to that generally held in this community. It doesn't have to be fast.

To be fair, I'm probably pretty biased against the idea that all we can realistically hope for is extinction. The recent [case against AGI alignment](https://www.lesswrong.com/posts/CtXaFo3hikGMWW4C9/the-case-against-ai-alignment) post was the first time I'd seen arguments that strong in that direction. I haven't really assimilated them yet.

My take on human nature is that, while humans are often stunningly vicious, they are also often remarkably generous. Further, it seems that the viciousness is usually happening when they feel materially threatened. Someone in charge of an aligned AGI will not feel very threatened for very long. And generosity will be safer and easier than it usually is.

The problem with that is that "corrigibility" may be a transient feature. As in, you train up a corrigible AI, it starts up very uncertain about which values it should enforce/how it should engage with the world. You give it feedback, and gradually make it more and more certain about some aspects of its behavior, so it can use its own judgement instead of constantly querying you. Eventually, you lock in some understanding of how it should extrapolate your values, and then the "corrigibility" phase is past and it just goes to rearrange reality to your preferences.

And my concern, here, is that in the rearranged reality, there may not be any places for you to change your mind. Like, say you really hate people from Category A, and tell the AI to make them suffer eternally. Do you then visit their hell to gloat? Probably not: you're just happy knowing they're suffering in the abstract. Or maybe you do visit, and see the warped visages of these monsters with no humanity left in them, and think that yeah, that seems just and good.

I think those are perfectly good concerns. But they don't seem so likely that they make me want to exterminate humanity to avoid them.

I think you're describing a failure of corrigibility. Which could certainly happen, for the reason you give. But it does seem quite possible (and perhaps likely) that an agentic system will be designed primarily for corrigibility, or alternately, alignment by obedience.

The second seems like a failure of morality. Which could certainly happen. But I see very few people who both enjoy inflicting suffering, and who would continue to enjoy that even given unlimited time and resources to become happy themselves.

Dumb question: In the omnicide outcome, the AGI continues its existence, but, whether it's sentient or not, no one else from Earth does. In the undignified survival outcome, the AGI continues its existence and potentially so do many humans, for functionally ever, and those humans may have lives better than death (even barely). Why is parochial value-alignment necessarily worse than no alignment at all?

I think there's a possibility that their lives, or some of them, are vastly worse than death. See the recent post the case against value alignment for some pretty convincing concerns.

"Undignified survival" likely also involves many people/entities whose lives are worse than death. E. g., being eternally tortured because they committed some crime that the parochial values consider unforgivable, or because these entities aren't recognized as morally relevant by the parochial values and are therefore exploited for work/entertainment.