Written in June 2023. I’ve only made some light style edits before posting here. Hence, this piece does not necessarily reflect my current views and ignores everything that happened with AI since June 2023. Still, I think some might find this relevant.
Note that every time I write “alignment” or “(mis)aligned”, I’m referring to “intent-(mis)alignment” (see Paul Christiano’s (2019) typology), as is common.
A Strategic Moderation Goal (or SMG) is a goal that (i) makes a transformative AI system (TAI) disvalue some outcome we are indifferent to, and (ii) is implemented[1] in a TAI in a low-alignment power context (where we expect the TAI to turn out misaligned enough to kill us all or to at least deprive humanity of its potential if it has the ability to do so).
This SMG proposal is intended as a “Plan B” if we end up expecting a TAI system to turn out misaligned,[2] since an SMG successfully implemented in a rogue AI would have benefits including these major ones:
My basic (uninformed) intuition for why SMG implementation seems much easier than aligning AI with human values is that
(1) might be a weak argument, given that THE current hurdle to AI alignment is arguably not the complexity of human values but our inability to control the AI’s goals at all. If we fail at aligning AI, it may seem unlikely that we would succeed at making sure the AI has any goal remotely consistent with what we tried to specify.
However, not everyone in the alignment community seems to agree that this is THE current hurdle (the Shard Theory people, in particular, I believe?).
Also, while our inability to control the AI’s goals might be THE challenging problem right now, it might very well end up being solved. And if it is, we might expect the “too complex human values” problem to become THE new challenge at which humanity might not succeed. I don’t know how plausible this is, but in that case, SMG implementation would be more tractable than alignment.
But (2) seems to be a much stronger argument. Having such flexibility with the goal we want to implement seems to be a massive upside.
I’ve presented the idea of Strategic Moderation Goals (SMGs) and its relevance in a scenario where we expect some TAI to turn out misaligned, argued that implementing an SMG seems more tractable than AI alignment, and raised other reasons why it may be a particularly sensible Plan B.
Thanks to Maxime Riché and Martin Soto for discussing with me an earlier version of this idea. Thanks to Filip Sondej and Martin for discussing with me the latest version of the idea (still in 2023). Thanks to Anthony DiGiovanni for his comments on an early draft. Thanks to Nicholas Dupuis and Miranda Zhang for their comments on later drafts. If I recall correctly, Nicholas is the one who suggested “Strategic Moderation Goal” to name my idea.
All assumptions/claims/omissions remain my own.
By “implemented”, I mean “we have made sure that the rogue AI disvalues [something we don’t care about], although this doesn’t have to be/remain its only goal.”
We don’t need to be certain this would be the case. As we’ll see later, we can probably “remove” the SMG if the TAI turns out aligned.
I’m assuming that, in a multipolar scenario, we might fail at aligning some TAI systems while succeeding with others, which seems plausible given that they may be trained differently and/or have very different architectures.
While Earth-originating AGI meeting alien agents might sound like science fiction, it absolutely isn’t. According to the most comprehensive models of how alien civilizations are distributed and spread (see Hanson 2021; Cook 2022), this may be a scenario that is hard to avoid.