Written in June 2023. I’ve only made some light style edits before posting here. Hence, this piece does not necessarily reflect my current views and ignores everything that happened with AI since June 2023. Still, I think some might find this relevant.
Note that every time I write “alignment” or “(mis)aligned”, I’m referring to “intent-(mis)alignment” (see Paul Christiano’s (2019) typology), as is common.
A Strategic Moderation Goal (or SMG) is a goal that (i) makes a transformative AI system (TAI) disvalue some outcome we are indifferent to, and (ii) is implemented[1] in a TAI in a low-alignment power context (where we expect the TAI to turn out misaligned enough to kill us all or to at least deprive humanity of its potential if it has the ability to do so).
This SMG proposal is intended as a “Plan B” if we end up expecting a TAI system to turn out misaligned,[2] since an SMG successfully implemented in a rogue AI would have benefits including these major ones:
My basic (uninformed) intuition for why SMG implementation seems much easier than aligning AI with human values is that
(1) might be a weak argument, given that THE current hurdle to AI alignment is arguably not the complexity of human values but our inability to control the AI’s goals at all. If we fail at aligning AI, it may seem unlikely that we would succeed at making sure the AI has any goal remotely consistent with what we tried to specify.
However, not everyone in the alignment community seems to agree that this is THE current hurdle (the Shard Theory people, in particular, I believe?).
Also, while our inability to control the AI’s goals might be THE challenging problem right now, it might very well end up being solved. And if it is, we might expect the “too complex human values” problem to become THE new challenge at which humanity might not succeed. I don’t know how plausible this is, but in that case, SMG implementation would be more tractable than alignment.
But (2) seems to be a much stronger argument. Having such flexibility with the goal we want to implement seems to be a massive upside.
I’ve presented the idea of Strategic Moderation Goals (SMGs) and its relevance in a scenario where we expect some TAI to turn out misaligned, argued that implementing an SMG seems more tractable than AI alignment, and raised other reasons why it may be a particularly sensible Plan B.
Thanks to Maxime Riché and Martin Soto for discussing with me an earlier version of this idea. Thanks to Filip Sondej and Martin for discussing with me the latest version of the idea (still in 2023). Thanks to Anthony DiGiovanni for his comments on an early draft. Thanks to Nicholas Dupuis and Miranda Zhang for their comments on later drafts. If I recall correctly, Nicholas is the one who suggested “Strategic Moderation Goal” to name my idea.
All assumptions/claims/omissions remain my own.
By “implemented”, I mean “we have made sure that the rogue AI disvalues [something we don’t care about], although this doesn’t have to be/remain its only goal.”
We don’t need to be certain this would be the case. As we’ll see later, we can probably “remove” the SMG if the TAI turns out aligned.
I’m assuming that, in a multipolar scenario, we might fail at aligning some TAI systems while succeeding with others, which seems plausible given that they may be trained differently and/or have very different architectures.
While Earth-originating AGI meeting alien agents might sound like science fiction, it absolutely isn’t. According to the most comprehensive models of how alien civilizations are distributed and spread (see Hanson 2021; Cook 2022), this may be a scenario that is hard to avoid.