In our previous post, we outlined a view of AI alignment that we disagree with but that serves as a central assumption in current discussions, and began sketching a different direction we think is worth pursuing. Here, we’ll argue that alignment itself is the wrong goal, and that pursuing it leads to a suboptimal future. At the same time, handing control over to AI systems creates problems of its own, so we should look in other directions.
A key problem we implicitly noted with traditional AI alignment is that any AI designed to serve everyone risks, or even guarantees, dissolving tradition, deviance, and resistance; these are “values” that differ between groups, and they don’t optimize well.
We can see this even without reference to AI. When you optimize culture too hard, you get cultural homogenization and loss of meaning. In some sense, globalization and collective culture are already functioning as a move in that direction, with observable impacts. We see global conglomerates competing with local mom-and-pop stores, and the resulting drive towards commoditization and profit hollows out local community institutions. We’re not arguing that corporations are the real superintelligence - that’s a misunderstanding of the idea of superintelligence. Instead, we’re pointing out that the concerning dynamics predate AI and will be accelerated by it.
Anomie, the breakdown of shared moral norms, began as a concern about uprooted values. This seems prophetic. We have a growing global monoculture, and the default is for people and ideas to compete globally, allowing for only one eventual global monocultural winner: the single most-memetically-reproductively-fit culture. (Not the one best for human flourishing, of course. As always, optimization pressure maximizes what it maximizes, not what you wanted.)
The problem here isn’t that AI isn’t smart or powerful enough to resolve the conflict or find a better answer; it’s more fundamental. As we argued in the previous post, conflicting and opposed worldviews are actually a positive feature of reality. No matter which views you embrace, when you align too well and too strongly, you wipe out the parts of humanity that resist alignment on those lines. There is no secret solution to Arrow’s impossibility result: we are actually stuck with a choice between conflict, or ignoring and ultimately erasing the cultural losers.
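For readers who want the formal version, here is a rough rendering of the standard statement of Arrow’s result (paraphrased, not quoted from any particular text):

```latex
% Arrow's impossibility theorem, standard statement (roughly rendered)
\textbf{Theorem (Arrow, 1951).} With $|A| \ge 3$ alternatives and $n \ge 2$ voters,
no social welfare function $F : L(A)^n \to L(A)$ (mapping profiles of individual
rankings to a single social ranking) satisfies all four of:
(1) \emph{unrestricted domain} -- $F$ is defined on every profile of rankings;
(2) \emph{weak Pareto} -- if everyone ranks $x$ above $y$, so does $F$;
(3) \emph{independence of irrelevant alternatives} -- the social ranking of $x$ vs.\ $y$
    depends only on individuals' rankings of $x$ vs.\ $y$; and
(4) \emph{non-dictatorship} -- no single voter's ranking always prevails.
```

In other words, any procedure for merging everyone’s rankings into one must give up at least one condition that seems obviously required of a fair aggregation.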
As Harry Law recently pointed out, “We are minded to believe that there have always been ‘values’ just as surely as there has been a long history of spirited discussion about them. Except that isn’t really true. People have always had commitments, responsibilities, preferences, tastes, aspirations, convictions and cares. But only in the last century or so has anyone bundled these things together as ‘values’ as we might understand them today.”
Eliminating the distinction between commitments, preferences, tastes, and responsibilities is what allows a total ordering over outcomes, which is an essential part of the view we oppose; collapsing multiple factors into a single dimension enables optimization at the price of ignoring the distinct things people care about.
Rejecting this simplification means that most of the extant problems in AI alignment still exist, but are exacerbated, because it eliminates the easy way out of just finding the “right answer” for what to optimize. The (currently impossible) task of building systems that reliably do what is requested and no more, that do not disobey, and that are safe enough to delegate the alignment problem to doesn’t solve anything. Yes, “alignment” requires solving fundamental and, in our view, irresolvable conflicts, but even that isn’t enough: delegating our purpose is incoherent if humanity is the goal.
So we believe that no answer exists to the question of what values to optimize for - but this is not an admission of defeat on behalf of human “values,” since those, as a single bundle, don’t exist. Instead, we argue that we need a way to preserve humanity and improve the future without taking the easy route of gradual disempowerment coupled with single-objective “beneficial” AI.
Zvi Mowshowitz wrote about levels of friction, with the two opposing principles that “By Default Friction is Bad” and “Friction can be Load Bearing.” We disagree; friction is critical almost everywhere, and just like in a literal physical system, the default of reducing metaphorical friction is, in every sense, a slippery slope. As we argued in the previous post, removing barriers and boundaries is fundamentally contrary to having meaningful lives.
This is not to say that material progress is bad - just that Chesterton’s fence applies not only to the contingent solutions humans have found, but also to the features of reality that shaped humanity. The deeply human feelings that arise from our limits, and the tragedy of not just death but all of life, are central to what led humanity to create meaning. If there’s a win condition, there’s no reason to continue playing. We can strive to eliminate disease, suffering, and death, and make progress - but if we win on all fronts, solving the problems we care about just leads to the real end of history, and the loss of future meaning. That is, these narrow goals aren’t what Venkatesh Rao calls human-complete. (This draws on James Carse’s view of infinite games, where “the object is not winning, but ensuring the continuation of play.”)
Of course, one of the most human things to do, starting with babies, is to test the limits, to push back against rules, and to rage against the dying of the light. Teenage rebellion is certainly just as human as the adult boundaries that create it - but it’s also central to a child’s ability to define themselves. Without the boundaries, self-definition suffers. Similarly, on a global scale, it would be supremely ironic if humanity’s battle against the fundamental limits of aging and disease succeeded, but led to erasing much of the friction that provides meaning.
But humanity’s capacity is growing, and we are arguably no longer in our teenage years. Even without superhuman AI, how do we combine our growing and effectively infinite ability to reshape the future with preserving the challenge posed by reality? At best, it seems, there remain the challenges of building what Anders Sandberg calls Grand Futures - but these are visions of a unified humanity, of humanity-versus-nature. The intra-human world is perhaps full of “idyllic lives with meaningful social relations. This may include achieving close to perfect justice, sustainability, or other social goals.” Alternatively, it might be “extreme states of bliss,” or “abolishing suffering” - that is, in our view, removing everything that makes human meaning possible.
Is this simply Stockholm syndrome? Are we suggesting ennobling Bostrom’s Dragon-Tyrant, instead of fighting it?
No. It is recognition that while these fights are worthwhile, there must be a future beyond them - we need the world, not the specific battle, to be worth the candle. So we should celebrate whenever we vanquish each of humanity’s foes, and seek out further victories, while being wary of misinterpreting progress as movement towards a disastrous goal of ending our collective boundaries.
Ceding the problem to AI would do the opposite - it would be quitting the chess game early and letting Stockfish play on our behalf. Perhaps this is the best move, to spare ourselves the tragedy of losing ever more pieces on our way to victory - but we should recognize that it also means we’re no longer playing the game, and we don’t have anything left.
But this is still the background to the problem, not even a definition of what could qualify as a solution! Unfortunately, we don’t have such solutions, so we will work backwards from what a good future might look like.
The first question, one that is central to some discussions of long-term AI risk, is: how can humanity stay in control after creating smarter-than-human AI?
But given the question, the answer is overdetermined. We don’t stay in control, certainly not indefinitely. If we build smarter-than-human AI - which is certainly not a good idea right now - at best we must figure out how we are ceding control. If nothing else, power-seeking AI will be a default, and will be disempowering - even if it’s not directly an existential threat. Even if we solve the problem of treachery robustly, and build an infantilizing vision of superintelligent personal assistants, over long enough time scales it’s implausible that we build that race of more intelligent systems and then never cede any power to them. (And if we somehow did, the implications of keeping increasingly intelligent systems in permanent bondage seem at best morally dubious.)
So, if we (implausibly) happen to be in a world of alignment-by-default, or (even more implausibly) find a solution to intent alignment and agree to create a super-nanny for humanity, what world would we want? Perhaps we use this power to collectively evolve past humanity - or perhaps the visions of pushing for transhumanism before ASI, to allow someone, or some group, to stay in control, are realized. Either way, what then for the humans?
Some have argued that the best we can hope for is a retirement home. In this future, the AI, or whoever controls the future, creates and leaves us with acceptably good, or even great, lives - or at least as close to what humans want as can be conveniently managed. This is the Wachowskis’ vision of the Matrix: a place to put people to keep them out of the way while the machines are in charge.
We accept this as an implausibly good but still far from ideal future. That is, we think this clearly suboptimal target is far better than the likely outcome. But what would something better look like? In the long term, we don’t know - and until we have a much clearer vision for what alignment success would look like, we don’t think a better outcome is plausible. So until we find better answers, our response to what MacAskill is now calling “Grand Challenges” - the future if we survive the creation of AGI - involves planning for a good but non-ideal future. Not because we want that future, but because we don’t see a better alternative once we have agreed to cede control to AI. And while we don’t agree, it seems that most EAs think that our replacement is the right path forward.
There are a couple of routes we are skeptical of. For example, does something like Futarchy work? Not if people all vote to get rid of the friction - and as explained above, and in the previous essay, by default that’s what happens, slowly. This gradual, collective, and voluntary disempowerment isn’t ideal, but it’s not the involuntary gradual disempowerment that is identical to loss of control, just played out over years instead of days or weeks.
But the idea of delegating technical pieces without losing control - what we called self-driving - and governance without disempowerment, is a narrow target, as argued here. The failure modes laid out there are critical, and hard to avoid: it took us centuries to figure out some imperfect and still often failing version of giving government control, but not too much control, and building robust safeguards against the collapse of such systems. So the problem is unsolved even for the relatively easy case of scaling human oversight of human systems.
We start with the fact that humanity does not have a robust solution to human governance; what we do have are workable systems with a balance of power. But these fail in practice, and even when done well are almost certainly unstable in the limit. They need constant patching, and painful experience to identify where they fail, and we constantly need to propose modifications to the system to account for those failures. This rhymes with AI alignment discussions about graceful corrigibility and modular alignment - but it perhaps starts with the presumption that the problem is not necessarily solvable.
In some sense, this is trying to elevate “well-kept gardens die by pacifism” from a heuristic for human communities into a fundamental principle. Any strong optimizer (or obnoxious shitposter, or misguided idealist, or free-speech absolutist) who reduces the vibrancy of the system and the comfort of others using it is (possibly) a threat to its enjoyment and success. That doesn’t mean never changing anything - but to the extent that someone is pushing to change the system they work within, there needs to be broad consensus that the change is acceptable and within bounds.
It seems better to aim for graceful degradation of our alignment system, in place of scalable oversight. That is, instead of systems whose goal is to keep things safe, we want systems that don’t rely on individual other systems to supervise them. This answer presumes that robust delegation is both critical and fundamentally infeasible - we cannot scale the optimization power of any component or coherent system indefinitely without degrading our illegible and conflicting values. Task decomposition presupposes coherence of sub-goals, and scalable oversight presumes coherent and understandable motivations and plans.
We think that a very recent paper, “Handing over the Keys to the City,” has a good framework for thinking about this; there are different authorities which can be delegated, and we are advocating for what they would call controlled delegation of keys, ensuring that this delegation is a public decision. If and when AI is given control over any publicly important resource or function, it requires “strict guidelines regarding scope, method, and limits.”
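To make “controlled delegation” concrete, here is a minimal sketch of what a single delegation record might look like - this is our own illustration, not the paper’s formalism, and every field name and example value below is invented:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class DelegationGrant:
    """A hypothetical record of one publicly decided delegation of a 'key'.

    The field names are ours, not the paper's; the point is only that every
    grant makes its scope, method, and limits explicit, expires by default,
    and names the institution that can revoke it.
    """
    key: str                 # which public function is delegated
    delegate: str            # which AI system receives it
    scope: tuple[str, ...]   # what it may act on
    method: str              # how it may act (advisory, supervised, bounded-autonomous)
    limits: tuple[str, ...]  # hard constraints it may never cross
    sunset: date             # the grant expires and must be publicly renewed
    revocable_by: str        # which institution can pull the key back

# Illustrative example only; every specific below is made up.
grant = DelegationGrant(
    key="municipal traffic-signal timing",
    delegate="city-ops-model",
    scope=("signal timing", "congestion forecasting"),
    method="bounded-autonomous, with human review of policy-level changes",
    limits=("never alter emergency-vehicle preemption", "no data sharing"),
    sunset=date(2027, 1, 1),
    revocable_by="city council majority vote",
)
```

The design point is simply that delegation should be a legible public artifact - something that can be inspected, renewed, or revoked - rather than an implicit drift of authority.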
But we still need to specify the relationship between ourselves and the AI systems, and between AI systems and other AI systems - and this is where we return to our view of what alignment that preserves human values could look like. And again, we certainly don’t have fully developed plans, just ideas and directions.
One important direction for a conservative view of alignment is building notions of responsibility, rather than aiming towards specific behaviors. For example, in the US system, the government has a responsibility not to infringe on free speech - so the right of free speech is a consequence, not a guarantee. This responsibility can be extended broadly: no one is required to enable speech, but there is an expectation that a public (non-government) forum needs a compelling argument for why it would restrict speech.
Extending this outwards, we can see why this framing is safer and more conservative than those involving the creation of utopias. If people have the right to happiness, we need to change things to ensure it. If people have the right to pursue happiness, we must only ensure we do not interfere unduly. If people have the responsibility to care for others, they can and should balance that responsibility against their other responsibilities. Conversely, if people have the right to be cared for (fed, clothed, housed), then there is an effectively unlimited obligation to fix any system which does not guarantee the outcome, and to take from others to ensure that it happens.
This points to responsibilities, not rights. That is, each AI system has a fundamental responsibility to ensure the impact of its actions is broadly acceptable, and to follow extant norms. This is similar to the conservative view of human rights - we have responsibilities to our families and communities, and to the world, but they do not have the right to make demands of us. Charity is an obligation, but so is self-reliance. Governments may need to force redistribution to ensure stability and safety, but this is a failure of both charity and of self-reliance, and at most an unfortunate necessity.
For AI systems, placing responsibility to rules and structure over the rights or goals of users seems like a useful framing. Systems that cooperate when there is reason to do so, and otherwise defect only in ways that do not violate their responsibilities, seem like a strong case for being aligned. This seems related to constitutional AI, but requires something far stronger - and requires that the constitution be clearly framed around broadly agreed-upon public rules, carefully ensuring that we hand over the keys slowly and deliberately.
We venture to guess that building AI systems with responsibilities would still require solving a number of core AI alignment challenges, but would not face others. For example, where norms conflict in the limit, as they almost always do, the solution is to avoid the edge cases and generalize gracefully, not to find rules that resolve the conflict. We see this concretely in online forums where freedom-of-speech norms conflict with other norms - good participants tend to avoid inflammatory topics, or tread lightly when they could offend. If changes in policy are needed, they are proposed without smashing Overton windows directly.
And this partly points back to our view from the previous post, of future systems as our children. Here, we are looking not at the micro-scale question of why rules and structure are needed, but rather at ensuring that we build boundaries and give these systems increasing levels of responsibility over time. This is how society propagates its values and structure while still progressing and adapting, and it’s more or less functional - but it works, as David argued separately, only as long as change isn’t too rapid.
One critical question of how to build such structures is in part about how human power dynamics emerge, and how these dynamics can or will scale to future systems. This would consider arguments and dynamics around why and how AI would care about human goals - as has been suggested, “Would future AI care about humans?” might be similar to “Do humans care about ants?” But as AI emerges, it’s possible that we will have chances to change that.
Of course, ensuring that our AI systems have dynamics that allow or encourage positive engagement and tradeoffs with our goals isn’t necessarily the optimal path towards a good future, but it is a safer path to emergent dynamics that work alongside humanity, and that can be constrained and balanced in ways similar to human systems. It also allows us to consider working with strong human-level AI, or weakly superhuman AI, before alignment is solved - which seems increasingly likely, albeit deeply unsafe.
While still several steps removed from solutions, Ram’s work on emergent dominance hierarchies in repeated interactions seems like one type of foundational work for enabling alignment. Similarly, better understanding how preferences emerge among multi-actor systems, and how they can cooperate robustly, seems critical.
We also think AI progress requires a critical conceptual shift, because goal-pursuit is failing as a paradigm. At most, in the limit, we want to aim for quantilizers, not optimizers. Anthropic’s vision of constitutional AI doesn’t quite reach this point; it still uses a (non-democratically chosen, optimized-for) “constitution” as a way to prompt system behaviors. It does not address the concerns about control, nor can it, since the stated goal is to build a helpful replacement for human tasks, and eventually humans individually (and eventually, wholesale).
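Since “quantilizer” may be unfamiliar: the idea (following Taylor’s 2016 proposal) is to satisfice relative to a base distribution of typical behavior rather than maximize. Below is a minimal illustrative sketch in Python - the function name, parameters, and toy utility are ours, not from any particular implementation:

```python
import random

def quantilize(base_sample, utility, q=0.05, n=1000):
    """Approximate q-quantilizer: instead of returning the single
    utility-maximizing action, draw candidate actions from a 'base'
    distribution of typical behavior and return one chosen uniformly at
    random from the top q fraction by utility. This caps optimization
    pressure: the result is never stranger than the base distribution
    itself allows.

    base_sample: callable returning one action drawn from the base distribution
    utility:     callable scoring an action (higher is better)
    q:           fraction of the base distribution to draw from (0 < q <= 1)
    n:           number of candidate draws used for the approximation
    """
    candidates = [base_sample() for _ in range(n)]
    candidates.sort(key=utility, reverse=True)
    top = candidates[: max(1, int(q * n))]
    return random.choice(top)

# Toy usage: an argmax over this utility would chase ever-more-extreme actions;
# the quantilizer settles for a pretty-good action that still looks typical.
if __name__ == "__main__":
    print(quantilize(lambda: random.gauss(0.0, 1.0), utility=abs, q=0.05))
```

The contrast with an optimizer is the point: the quantilizer deliberately leaves utility on the table in exchange for staying inside the envelope of normal behavior.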
But this isn’t enough, and isn’t even fully described; what humanity needs from properly collaborative non-domineering AI systems must go much further.
If “alignment” means collapsing the plurality of human commitments into a single objective and giving optimized machines the mandate to push us there, that interpretation of alignment is the wrong target. We argue that the path forward requires abandoning the fantasy of a perfectly aligned superintelligence that resolves all human conflicts and optimizes for our collective good. Such a system, even if buildable, would dissolve the very friction and boundaries that give human life meaning. Instead, our conclusion is that we must design AI systems that recognize their responsibilities within a messy, conflicted world; systems that can operate under constraints, respect competing norms, and gracefully handle the irreducible tensions in human values without trying to optimize them away.
This means accepting that the future, even one containing beneficial AI, will not and should not be a solved problem with a final answer. Like human governance, it requires constant vigilance, adaptation, and the preservation of productive conflict. We need systems that can participate in our social dynamics without steamrolling them, that understand the difference between a rule and a goal, and that recognize when to defer rather than optimize. The alternative, ceding control to systems designed to find "better" solutions than messy human compromise, leads inexorably toward the cultural homogenization and loss of meaning we've warned against.
Our stance is conservative in two senses. First, we would bind systems to responsibilities, with publicly legible duties and limits, rather than to goals or values. Second, we prefer graceful degradation over scalable control: polycentric arrangements, rate limits, veto points, sunset clauses, and recoverable paths when we get it wrong. If we can’t roll it back, we shouldn’t roll it out.
We don’t claim to have a complete or clear plan - but we do think we have a useful direction. So in our next post, we'll explore one possible concrete mechanism: multi-agent AI systems that learn through conflict with each other, developing something like empathy through resonance. Using the 1957 film "12 Angry Men" as our case study, we'll examine how conflicts between agents might create systems capable of making genuinely moral decisions rather than simply optimized ones.