This was originally written for Twitter and thus is predictably low quality (hence the "[LQ]" tag).
It has only been minimally edited (if at all).
Some thoughts on messaging around alignment with respect to advanced AI systems. A 🧵
Misaligned ASI poses a credible existential threat. Few things in the world actually offer a genuine threat of human extinction. Even global thermonuclear war might not cause it.
The fundamentally different nature of AI risks...
That we have a competent entity that is optimising at cross-purposes with human welfare.
One which might find the disempowerment of humans to be instrumentally beneficial or for whom humans might be obstacles (e.g. we are competing with it for access to the earth's resources).
An entity that would actively seek to thwart us if we tried to neutralise it. Nuclear warheads wouldn't try to stop us from disarming them.
Pandemics might be construed as seeking to continue their existence, but they aren't competent optimisers. They can't plan or strategise. They can't persuade individual humans or navigate the complexities of human institutions.
That's not a risk scenario that is posed by any other advanced technology we've previously developed. Killing all humans is really hard. Especially if we actually try for existential security.
Somewhere like New Zealand could be locked down to protect against a superpandemic, and might be spared in a nuclear holocaust. Nuclear winter is pretty hard to trigger, and it's unlikely that literally every urban centre in the world would be hit.
Global thermonuclear war may very well trigger civilisational collapse, and derail humanity for centuries, but actual extinction? That's incredibly difficult.
It's hard to "accidentally kill all humans". Unless you're trying really damn hard to wipe out humanity, you will probably fail at it.
The reason why misaligned ASI is a credible existential threat — a bar which few other technologies meet — is because of the "competent optimiser". Because it can actually try really damn hard to wipe out humanity.
And it's really good at steering the future into world states ranked higher in its preference ordering.
By the stipulation that it has decisive strategic advantage, it's already implicit that should it decide on extinction, it's at a vantage point from which it can execute such a plan.
But. It's actually really damn hard to convince people of this. The inferential distance that needs to be bridged is often pretty large.
If concrete risk scenarios are presented, then they'll be concretely discredited. And we do not have enough information to conclusively settle the issues of disagreement.
For example, if someone poses the concrete threat of developing superweapons via advanced nanotechnology, someone can reasonably object that developing new advanced technology requires considerable:
An AI could not accomplish all of this under stealth, away from the prying eyes of human civilisation.
"Developing a novel superweapon in stealth mode completely undetected is pure fantasy" is an objection that I've often heard. And it's an objection I'm sympathetic to somewhat. I am sceptical that intelligence can substitute for experiment (especially in R & D).
For any other concrete scenario of AI induced extinction one can present, reasonable objections can be formulated. And because we don't know what SSIs are capable of, we can't settle the facts of those objections.
If instead, the scenarios are left abstract, then people will remain unconvinced about the threat. The "how?" will remain unanswered.
Because of cognitive uncontainability (because some of the strategies available to the AI are strategies that we would never have thought of*), I find myself loath to specify concrete threat scenarios (they probably wouldn't be how the threat manifests).
* It should be pointed out that some of the tactics AlphaGo used against Ke Jie were genuinely surprising and unlike tactics that had previously manifested in human games.
In the rigidly specified and fully observable environment of a Go board, AlphaGo was already uncontainable for humans. In bare reality with all its unbounded choice, an SSI would be even more so.
It is possible that — should the AI be far enough in the superhuman domain — we wouldn't even be able to comprehend its strategy (in much the same way scholars of the 10th century could not understand the design of an air conditioner).
Uncontainability is reason to be wary of an existential risk from SSIs even if I can't formulate airtight scenarios illustrating said risk. However, it's hardly a persuasive argument to convince someone who didn't already take AI risk very seriously.
I think the possibility of SSI is pretty obvious, so I will not spend much time justifying it. I will list a few arguments in favour though.
Note: "brain" = "brain of homo sapiens".
Furthermore, positing that an AI has "decisive strategic advantage" is already assuming the conclusion. If you posited that an omnicidal maniac had decisive strategic advantage, then you've also posited a credible existential threat.
It is obvious that a misaligned AI system with considerable power over humanity is a credible existential threat to humanity.
What is not obvious is that an advanced AI system would acquire "considerable power over humanity". Emergence of superintelligence is not self-evident.
Discussions of superintelligence often come with the implicit assumption that "high cognitive powers" when applied to the real world either immediately confer decisive strategic advantage, or allow one to quickly attain it.
My honest assessment is that the above hypothesis is very non-obvious without magical thinking:
Speaking honestly as someone sympathetic to AI x-risk (I've decided to become a safety researcher because I take the threat seriously), many of the proposed vectors I've heard people pose for how an AI might attain decisive strategic advantage seem "magical" to me.
I don't buy those arguments and I'm someone who alieves that misaligned advanced AI systems can pose an existential threat.
Of course, just because we can't formulate compelling airtight arguments for SSI quickly attaining decisive strategic advantage doesn't mean it won't.
Hell, our inability to find such arguments isn't particularly compelling evidence either; uncontainability suggests that this is something we'd find it difficult to determine beforehand.
Uncontainability is a real and important phenomenon, but it may prove too much. If my best justification for why SSI poses a credible existential threat is "uncontainability", I can't blame any would-be interlocutors for being sceptical.
Regardless, justifications aside, I'm still left with a conundrum; I'm unable to formulate arguments for x-risk from advanced AI systems that I am fully satisfied with. And if I can't fully persuade myself of the credible existential threat, then how am I to persuade others?
I've been thinking that maybe I don't need to make an airtight argument for the existential threat. Advanced AI systems don't need to pose an existential threat to necessitate safety or governance work.
If I simply wanted to make the case for why safety and governance are important, then it is sufficient only to demonstrate that misaligned SSI will be very bad.
Some ways in which misaligned SSI can be bad that are worth discussing:
A. Disempowering humans (individuals, organisations, states, civilisation)
Humanity losing autonomy and the ability to decide its future is something we can generally agree is bad. With advanced AI this may manifest on any scale, from individuals up to civilisation.
An argument for disempowerment can be made via systems with longer time horizons, more coherent goal driven behaviour, better planning ability/strategic acumen, etc. progressively acquiring more resources, influence and power, reducing what's left in human hands.
In the limit, most of the power and economic resources will belong to such systems. Such human disempowerment will be pretty bad, even if it's not an immediate existential catastrophe.
I think Joe Carlsmith made a pretty compelling argument for why agentic, planning systems were especially risky along this front: https://docs.google.com/document/d/1smaI1lagHHcrhoi6ohdq3TYIZv0eNWWZMPEy8C8byYg/edit#
B. Catastrophic scenarios (e.g. more than a billion deaths; a less stringent requirement than literally all humans dying).
Misaligned AI systems could play a destabilising role in geopolitics, exacerbating the risk of thermonuclear war.
Alternatively, they could be involved in the development, spread or release of superpandemics.
It's easier to make the case for AI triggering catastrophe via extant vectors.
C. Compromising critical infrastructure (cybersecurity, finance, information technology, energy, etc.)
Competent optimisers with comprehensive knowledge of the intricacies of human infrastructure could cause serious damage via leveraging said infrastructure in ways no humans can.
Consider the sheer breadth of their knowledge. LLMs can be trained on e.g. the entirety of Wikipedia, arXiv, the Internet Archive, open access journals, etc.
An LLM with human-like abilities to learn from text would have a breadth of knowledge several orders of magnitude above the most well-read human. It would be able to see patterns and make inferences that no human could.
The ability of SSI to navigate (and leverage) human infrastructure would be immense. If said leverage was applied in ways that were unfavourable towards humans...
D. Assorted dystopian scenarios
(I'm currently drawing a blank here, but that is entirely due to my lack of imagination [I'm not sure what counts as sufficiently dystopian as to be worth mentioning alongside the other failure modes I listed]).
I don't think arguing for an existential threat that people find it hard to grasp gains that much more (or any really) mileage over arguing for other serious risks that people are more easily able to intuit.
Unless we're playing a cause Olympics* where we need to justify why AI Safety in particular is most important, stressing the "credible existential threat" aspect of AI safety may be counterproductive?
(* I doubt we'll be in such a position except relative to effective altruists, and they'd probably be more sympathetic to the less-than-airtight arguments for an existential threat we can provide).
I'm unconvinced that it's worth trying to convince others that misaligned advanced AI systems pose an existential threat (as opposed to just being really bad).
I actually think that the question of DSA is maybe taking up too much oxygen in this debate. Usually this debate is had with soft-Singularitarians. It's a bit rich for someone to claim that AGI is going to be awesome and then deny that it will be very powerful. Arguing about DSA is avoiding the really uncomfortable questions, tbh. (The third position in this debate, the "AGI never/not soon" types, would not be easily convinced of near-future DSA, but they also don't really matter for policy outcomes.)
There's another aspect of the debate which I think we need to pivot to, sooner rather than later. That is the question of why AI would be misaligned and incorrigible. To be quite honest, I still don't alieve that alignment is all that difficult or unnatural, and I haven't been well persuaded by the arguments I've seen. I'm saying this because I expect that it's even harder to be convinced that alignment is hard if your paycheck is involved. Therefore, if we want to change the minds of Sam Altman and Demis Hassabis, I expect that hardness of alignment is a more difficult sell than DSA, which Sam and Demis probably already would like to believe for income/ego reasons.
I unchecked the "moderators may promote to frontpage" because Twitter thread, but if you think LessWrong would benefit if it was made a frontpage post, do let me know.
I thought this was well written and made some extremely important points. I definitely think it's frontpage worthy.