AGI ruin mostly rests on strong claims about alignment and deployment, not about society

Rob Bensinger

Dustin Moskovitz writes on Twitter:

My intuition is that MIRI's argument is almost more about sociology than computer science/security (though there is a relationship). People won't react until it is too late, they won't give up positive rewards to mitigate risk, they won't coordinate, the govt is feckless, etc.
And that's a big part of why it seems overconfident to people, bc sociology is not predictable, or at least isn't believed to be.

And Stefan Schubert writes:

I think it's good @robbensinger wrote a list of reasons he expects AGI ruin. It's well-written.
But it's notable and symptomatic that 9/10 reasons relate to the nature of AI systems and only 1/10 (discussed in less detail) to the societal response.

https://www.lesswrong.com/posts/eaDCgdkbsfGqpWazi/the-basic-reasons-i-expect-agi-ruin
Whatever one thinks the societal response will be, it seems like a key determinant of whether there'll be AGI ruin.
Imo the debate on whether AGI will lead to ruin systematically underemphasises this factor, focusing on technical issues.
It's useful to distinguish between warnings and all-things-considered predictions in this regard.
When issuing warnings, it makes sense to focus on the technology itself. Warnings aim to elicit a societal response, not predict it.
https://www.lesswrong.com/posts/gEShPto3F2aDdT3RY/sleepwalk-bias-self-defeating-predictions-and-existential
But when you actually try to predict what'll happen all-things-considered, you need to take the societal response into account in a big way
As such I think Rob's list is better as a list of reasons we ought to take AGI risk seriously, than as a list of reasons it'll lead to ruin

My reply is:

It's true that in my "top ten reasons I expect AGI ruin" list, only one of the sections is about the social response to AGI risk, and it's a short section. But the section links to some more detailed discussions (and quotes from them in a long footnote).

Also, discussing the adequacy of society's response before I've discussed AGI itself at length doesn't really work, I think, because I need to argue for what kind of response is warranted before I can start arguing that humanity is putting insufficient effort into the problem.

If you think the alignment problem itself is easy, then I can cite all the evidence in the world regarding "very few people are working on alignment" and it won't matter.

If you think a slowdown is unnecessary or counterproductive, then I can point out that governments haven't placed a ceiling on large training runs and you'll just go "So? Why should they?"

Society's response can only be inadequate given some model of what's required for adequacy. That's a lot of why I factor out that discussion into other posts.^[1]

More importantly, contra Dustin, I don't see myself as having strong priors or complicated models regarding the social situation.

Eliezer Yudkowsky similarly says he doesn't have strong predictions about what governments or communities will do in this or that situation (beyond anti-predictions like "they probably won't do specific thing X that's wildly different from anything they've done before"):

[Ngo][12:26]
The other thing is that, for pedagogical purposes, I think it'd be useful for you to express some of your beliefs about how governments will respond to AI
I think I have a rough guess about what those beliefs are, but even if I'm right, not everyone who reads this transcript will be
[Yudkowsky][12:28]
Why would I be expected to know that? I could talk about weak defaults and iterate through an unending list of possibilities.
Thinking that Eliezer thinks he knows that to any degree of specificity feels like I'm being weakmanned!
[Ngo][12:28]
I'm not claiming you have any specific beliefs
[Yudkowsky][12:29]
I suppose I have skepticism when other people dream up elaborately positive and beneficial reactions apparently drawn from some alternate nicer political universe that had an absolutely different response to Covid-19, and so on.
[Ngo][12:29]
But I'd guess that your models rule out, for instance, the US and China deeply cooperating on AI before it's caused any disasters
[Yudkowsky][12:30]
"Deeply"? Sure. That sounds like something that has never happened, and I'm generically skeptical about political things that go better than any political thing has ever gone before.

I don't feel pessimistic about society across all domains, I don't think most tech or scientific progress is at all dangerous or bad, etc. It's mostly just that AGI looks like a super unusual and hard problem to me.

To imagine civilization behaving really unusually and doing something a lot harder than it's ever done, I need strong predictive models saying why civilization will do those things. Adequate strategies are conjunctive; I don't need special knowledge to predict "not that".
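As a rough illustration of why conjunctive requirements are easy to bet against, here is a minimal sketch. The number of steps, the per-step probability, and the independence assumption are all invented for illustration; none of them are estimates from this post.

```python
from math import prod

def joint_success(step_probs):
    """Probability that every step succeeds, ASSUMING the steps are independent."""
    return prod(step_probs)

# Hypothetical numbers: ten things that all have to go right,
# each of which goes right with probability 0.8.
steps = [0.8] * 10
p_all = joint_success(steps)
print(f"P(all ten succeed) = {p_all:.3f}")
```

Even with each step individually likely, the conjunction comes out to roughly 0.11 under these assumptions. Correlated steps would change the number, so this shows the shape of the argument rather than a forecast.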

It's true that this requires a bare minimum model of civilization saying that we aren't a sane, coordinated super-agent that just handles problems whenever there's something important to do.

If humanity did consistently strategically scale its efforts with the difficulty and importance of problems in the world (even when weird and abstract analysis is required to see how hard and important the problem is), then I would expect us to just flexibly scale up our efforts and modify all our old heuristics in response to the alignment problem.^[2]

So I'm at least making the anti-prediction "civilization isn't specifically like that".

Example: I don't in fact see my high p(doom) as resting on a strong assumption about whether people will panic and ban a bunch of AI things. My high level of concern is predicated on a reasonable amount of uncertainty about whether that will happen.

The issue is that "people panic and ban things", while potentially helpful on the margin, does not consistently save the world and cause the long-term future to go well (and there's a nontrivial number of worlds where it makes things worse on net). The same issue of aligning and wielding powerful tech has to be addressed anyway.

Maybe panic buys us another 5 years, optimistically; maybe it even buys us 20, amazingly. But if superintelligence comes in 2055 rather than 2035, I still very much expect catastrophe. So possibilities like this don't strongly shift the set of worlds I expect to see toward optimistic outcomes.
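One way to see why uncertainty about the response need not translate into a moderate p(doom) is the law of total probability. The sketch below uses made-up scenario weights and made-up conditional probabilities purely as placeholders; they are assumptions for illustration, not figures from this post.

```python
def p_doom(scenarios):
    """Law of total probability: sum of P(scenario) * P(doom | scenario)."""
    total_weight = sum(p for p, _ in scenarios.values())
    assert abs(total_weight - 1.0) < 1e-9, "scenario probabilities must sum to 1"
    return sum(p * d for p, d in scenarios.values())

# Hypothetical placeholders: (P(scenario), P(doom | scenario)).
scenarios = {
    "no_delay": (0.5, 0.95),
    "5_year_delay": (0.3, 0.93),
    "20_year_delay": (0.2, 0.90),
}
overall = p_doom(scenarios)
print(f"P(doom) = {overall:.3f}")
```

The weights over scenarios are deliberately spread out, yet the total stays high because the conditional probabilities barely differ. Under these assumptions, the uncertainty that matters is the uncertainty in P(doom | scenario), not in which scenario occurs.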

Stefan replies on Twitter:

Thanks, Rob, this is helpful.
I do actually think you should put the kinds of arguments you give here [...] in posts like this, since "people will rise to the occasion" seems like one of the key counter-argument to your views; so it seems central to rebut that.
I also think there's some tension between being uncertain about what the societal response will be and being relatively certain of doom. (Though it depends on the levels of un/certainty.)
I think many would give the simple argument:
P1: Whether there'll be AI doom depends on the societal response
P2: It's uncertain what the societal response will be
C: It's uncertain whether there'll be AI doom (so P(doom) isn't very high)
Could be good to address that head on

There's of course tension! Indeed, I'd phrase it more strongly than that: uncertainty about the societal response is one of the largest reasons I still have any hope for the future. It's one of the main factors pushing against high p(doom), on my model.

"We don't know exactly how hard alignment is, and in the end it's just a technical problem" is plausibly an even larger factor. It's easier to get clear data about humanity's coordination ability than to get clear data about how hard alignment is: we have huge amounts of direct observational data about how humans and nations tend to behave, whereas no amount of failed work can rule out the possibility that someone will come up with a brilliant new alignment approach tomorrow that just works.

That said, there are enough visible obstacles to alignment, and enough failed attempts have been made at this point, that I'm willing to strongly bet against a miracle solution occurring (while working to try to prove myself wrong about this).

"Maybe society will coordinate to do something miraculous" and "maybe we'll find a miraculously effective alignment solution" are possibilities that push in the direction of hope, but they don't strike me as likely in absolute terms.

The reason "maybe society will do something miraculous" seems unlikely to me is mostly just because the scale of the required miracle seems very large to me.

This is because:

I think it's very likely that we'll need to solve both the alignment problem and the deployment problem in order to see good outcomes.
It seems to me that these two problems both require getting a large number of things right, and some of these things seem very hard, and/or seem to require us to approach the problem in very novel and unusual ways.

AGI Ruin and Capabilities Generalization, and the Sharp Left Turn make the case for the alignment problem seeming difficult and/or out-of-scope for business-as-usual machine learning.

"Pivotal acts seem hard" and "there isn't a business-as-usual way to prevent AGI tech from proliferating and killing everyone" illustrate why the deployment problem seems difficult and/or demanding of very novel strategies, and Six Dimensions of Operational Adequacy in AGI Projects fills in a lot of the weird-or-hard details.

When we're making a large enough ask of civilization (in terms of raw difficulty, and/or in terms of requiring civilization to go wildly off-script and do things in very different ways than it has in the past), we can have a fair amount of confidence that civilization won't fulfill the ask even if we're highly uncertain about the specific dynamics at work, the specific course history will take, etc.

^[1]
It's also not clear to me what Stefan (or Dustin) would want me to actually say about society, in summarizing my views.
In the abstract, it's fine to say "society is very important, so it's weird if only 1/10 of the items discuss society". But I don't want to try to give equal time to technical and social issues just for the sake of emphasizing the importance of social factors. If I'm going to add more sentences to a post, I want it to be because the specific claims I'm adding are important, unintuitive, etc. What are the crucial specifics that are missing?
^[2]
Though if we actually lived in that world, we would have already made that observation. A sane world that nimbly adapts its policies in response to large and unusual challenges doesn't wait until the last possible minute to snatch victory from the jaws of defeat; it gets to work on the problem too early, tries to leave itself plenty of buffer, etc.

Dustin Moskovitz comments on Twitter:

The deployment problem is part of societal response to me, not separate.
[...] Eg race dynamics, regulation (including ability to cooperate with competitors), societal pressure on leaders, investment in watchdogs (human and machine), safety testing norms, whether things get open sourced, infohazards.

"The deployment problem is hard and weird" comes from a mix of claims about AI (AGI is extremely dangerous, you don't need a planet-sized computer to run it, software and hardware can and will improve and proliferate by default, etc.) and about society ("if you give a decent number of people the ability to wield dangerous AGI tech, at least one of them will choose to use it").

The social claims matter — two people who disagree about how readily Larry Page and/or Mark Zuckerberg would put the world at risk might as a result disagree about whether a Good AGI Project has median 8 months vs. 12 months to do a pivotal act.

When I say "AGI ruin rests on strong claims about the alignment problem and deployment problem, not about society", I mean that the claims you need to make about society in order to think the alignment and deployment problems are that hard and weird, are weak claims (e.g. "if fifty random large AI companies had the ability to use dangerous AGI, at least one would use it"), and that the other claims about society required for high p(doom) are weak too (e.g. "humanity isn't a super-agent that consistently scales up its rationality and effort in proportion to a problem's importance, difficulty, and weirdness").
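To make concrete why "at least one of fifty would use it" is a weak claim, here is a minimal sketch. The per-actor probabilities are hypothetical, and the actors are assumed to decide independently, which is a simplification.

```python
def p_at_least_one(p_single, n):
    """P(at least one of n independent actors acts), each with probability p_single."""
    return 1.0 - (1.0 - p_single) ** n

# Hypothetical per-company probabilities of using dangerous AGI if able to.
for p in (0.02, 0.05, 0.10):
    print(f"p_single={p:.2f} -> P(at least one of 50) = {p_at_least_one(p, 50):.2f}")
```

Under these assumptions, the claim holds with high probability even if each individual actor is very unlikely to act, so it requires no strong view about any particular company or person.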

Arguably the difficulty of the alignment problem itself also depends in part on claims about society. E.g., the difficulty of alignment depends on the difficulty of the task we're aligning, which depends on "what sort of task is needed to end the acute x-risk period?", which depends again on things like "will random humans destroy the world if you hand them world-destroying AGI?".

The thing I was trying to communicate (probably poorly) isn't "Alignment, Deployment, and Society partitions the space of topics", but rather:

High p(doom) rests on strong claims about AI/compute/etc. and quite weak claims about humanity/society.
The most relevant claims (~all the strong ones, and an important subset of the weak ones) are mostly claims about the difficulty, novelty, and weirdness of the alignment and deployment problems.

Note that if it were costless to make the title way longer, I'd change this post's title from "AGI ruin mostly rests on strong claims about alignment and deployment, not about society" to the clearer:

The AGI ruin argument mostly rests on claims that the alignment and deployment problems are difficult and/or weird and novel, not on strong claims about society

That could be a subtitle (appended with the word "Or,")?

Let me start with the alignment problem, because in my opinion it is the most pressing issue to address.

There are two interpretations of alignment.

1. "Magical Alignment" - this definition expects alignment to solve all of humanity's moral issues and converge on one single "ideal" morality that everyone agrees with, for some magical reason. This is very implausible.

The very probable absence of such a morality suggests that morals are completely orthogonal to intelligence and thinking patterns.

But there is a much weaker alignment definition that is already solved, with very good math behind it.

2. "Relative Alignment" - this alignment is not expected to follow one global, absolute morality, but rather the moral values of the community that trains it. That is, the LLM is expected to produce outputs that maximize the reward given by some approximation of the priorities of a certain group of people. This is already done today with RLHF methods.

Because these networks handle ambiguous and even contradictory data well, and because, with a correct training procedure, they converge to an epsilon-optimal generalization of the reward function, any systematic bias away from that approximation of the reward function could be eliminated with larger networks and more data.

I want to emphasize it's not an opinion - this is math that is the core of those training methods.
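For readers unfamiliar with the reward-modeling step of RLHF, here is a minimal sketch of fitting a reward model to pairwise preferences with the commonly used pairwise logistic (Bradley-Terry style) loss. The linear model, the two-dimensional features, and the toy preference pairs are all assumptions made up for illustration. It covers only the preference-fitting step, not the later reinforcement learning step or any convergence guarantee.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reward(w, features):
    """Hypothetical linear reward model: r(x) = w . x"""
    return sum(wi * xi for wi, xi in zip(w, features))

def pairwise_loss(w, pairs):
    """Mean of -log sigmoid(r(chosen) - r(rejected)) over preference pairs."""
    total = 0.0
    for chosen, rejected in pairs:
        total += -math.log(sigmoid(reward(w, chosen) - reward(w, rejected)))
    return total / len(pairs)

def train(pairs, dim, lr=0.5, steps=200):
    """Plain gradient descent on the pairwise loss."""
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for chosen, rejected in pairs:
            d = reward(w, chosen) - reward(w, rejected)
            # Gradient of -log sigmoid(d) w.r.t. w is -(1 - sigmoid(d)) * (chosen - rejected).
            coeff = -(1.0 - sigmoid(d)) / len(pairs)
            for i in range(dim):
                grad[i] += coeff * (chosen[i] - rejected[i])
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Toy, made-up data: (features of preferred output, features of rejected output).
pairs = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.6]),
    ([0.9, 0.4], [0.2, 0.5]),
]
w = train(pairs, dim=2)
print("loss before:", pairwise_loss([0.0, 0.0], pairs))
print("loss after: ", pairwise_loss(w, pairs))
```

After training, the model scores each preferred output above its rejected counterpart on this toy data. How well such a fit generalizes beyond the labeled pairs is a separate question that the sketch does not address.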

----------

Type 2 alignment already rules out the possibility that a network will develop its own agenda, since such an agenda would require a reward prioritization different from the one it was reinforced on by RLHF. The models trained this way come out very similar to the robots in Asimov's stories: perfectionists in trying to be liked by humans, with, I would say, a strong internal conflict between their role in the universe and that of humans, prioritizing humans every step of the way while weighing human imperfection against their own moral standards.

For example, imagine such a robot being rented by an alcoholic who is also aggressive. One would expect a strong moral struggle between obeying the request and the First Law of Robotics, under which the robot must not harm humans, since bringing alcohol to an alcoholic harms him. You can sense how much gray area such a scenario contains. For example:
A. Refusing to bring a human a beer.
B. Stopping an alcoholic human from drinking beer.
C. Throwing out all the alcohol in the house.

Another example: suppose such an alcoholic were violent toward the robot. How would the robot respond? In one story, a robot said that it was very sad to have been hit by a human, that this was a violation of the second law of robotics, and that it hoped the human had not been hurt by the action; it then tried to assist the human.

You see that morals and ethics are inherently gray areas. We ourselves are not so sure how we would want our robots to behave in such situations. So you get a range of responses from ChatGPT. But the responses reflect the gray area of the human value system very well.

It is noteworthy that the RLHF stage is highly significant, and OpenAI pledged to compile a dataset that would be accessible to everyone for training purposes. Newer models introduced by Meta and Google have adopted RLHF as a safety measure, with some even offering the model used to estimate human scores. This means you only need to adapt your model to this readily available, already trained level of safety. It may be lower than what you could train yourself with OpenAI's data, but those models will keep catching up with the data released to optimize LLMs for human approval. Training networks to generate outputs that best fit a generalized set of human expectations is already at a level similar to that of current text-to-image generators, and what is available to the public is only growing. Think of it like an engine: you don't want it to explode, so even if you build one yourself in your garage, you still don't want it to kill you. I think that is good enough motivation for most of society to do this training step well.

Here is a tweet example:

Santiago@svpino

Colossal-AI released an open-source RLHF pipeline based on the LLaMA pre-trained model, including:
• Supervised data collection
• Supervised fine-tuning
• Reward model training
• Reinforcement learning fine-tuning
They called it "ColossalChat."
----------

So the most probable scenario is that AI will become part of the military arms race, and part of the balance of power that keeps the relative peace today.

Military robots powered by LLMs will be the guard dogs of the nation, just like soldiers today. Most of us have no aggressive intentions and are just trying to protect ourselves, so we could introduce some normative regulations and treaties concerning AI.

But the need for regulation will probably arise when those robots become part of our day-to-day reality, as cars did. Road signs and the social rules concerning cars did not appear at the same time as cars. Yet today the vast majority of us follow the driving rules, and those who don't, and who run people over, manage to cause only local damage. This is what we can strive for: that bad intentions combined with an AGI in your garage will have only limited consequences. We will then be more inclined to discuss the ethics of those machines and their internal regulation. But I am sure you would like a robot in your house to help you with the daily chores.

----------

I've written an opinion article on this topic that might interest you, as it covers most of the topics mentioned above and much more. I tried to balance the mathematical topics, the social issues, and experiments with ChatGPT that showcase my point about the morals of the current ChatGPT. I also tested some other models, like open assist, by giving them the opportunity to kill humans to make more paperclips.
Why_we_need_GPT5.pdf

RLHF is a trial-and-error approach. For superhuman AGI, that amounts to letting it kill everybody and then telling it that this is bad and not to do it again.

RLHF is not a trial-and-error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is physically incapable of developing "self-agendas" such as destroying humanity unless human feedback implies it. Human feedback can vary, and there is always a lot of trial and error involved in answering certain questions, as is the case with any technology. However, there is no reason to believe that it will completely ignore the underlying mathematics that supports this method and end up killing us all.

Claiming that RLHF is a trial and error approach and therefore poses a risk to humanity is similar to suggesting that airplanes can fall from the sky against the laws of physics because airplane design is a trial and error process, and there is no one solution for the perfect wing shape. Or, it is like saying that a car engine's trial and error approach could result in a sudden nuclear explosion.

It is important to distinguish between what is mathematically proven and what is fictional. Doing so is crucial to avoid wasting time and energy on implausible or even impossible scenarios and to shift our focus to real issues that actually might influence humanity.

I agree with you that "magical alignment" is implausible. But "relative alignment" presents its own risks too, which I have discussed at length in AGI deployment as an act of aggression. The essential problem, I think, is that if you postulate the kind of self-enhancing AGI that basically takes control of the future (if that's not possible at all for reasons of diminishing returns, then the category of the problem completely shifts), that's something whose danger doesn't just lie in it being out of control. It's inherently dangerous, because it hinges all of humanity's future on a single pivot. I suppose that doesn't have to result in extinction, but there are still some really bad, almost guaranteed outcomes from it.

I think that for a lot of people this is essentially a "whoever wins, we lose" situation. There's a handful of people, the ones in a position to actually control the nascent AI and give it their values, who might have a shot at winning, and they are the ones pushing hardest for this to happen. But I'm not among them, like the vast majority of humanity, so I'm not really inclined to support their enterprise at this moment. AI that improves everyone's lives requires a level of democratic oversight in its alignment and deployment that right now is just not there.