I think a critical issue is that disempowerment implies loss of control that we currently have - but this is poorly defined and unfortunately left implicit.
If we concretize the idea of control, the extreme version is that if humanity unanimously chooses some action, it will occur. This is a bit overstated, but even the obvious weak version is already untrue: if a majority of citizens in a country want some action to occur - say, for a specific company to turn off a datacenter and stop running a given AI model - in a liberal democracy that majority cannot reliably ensure it happens, since there are protections and processes in place. In fact, the intermediate version is probably untrue as well - even a supermajority cannot reliably dictate this type of action, and certainly cannot decide it quickly.
Based on this, I think critics of the gradual disempowerment argument would make a reasonable point; this isn't a new thing, and it's not even obviously being accelerated by AI more than to the extent that it happens via wealth or power concentration. Companies already ignore laws, power is already concentrated in few hands, and to date, this fact has little to do with AI.
and to date, this fact has little to do with AI.
This seems incorrect over the last couple of years. But also incorrect historically if you broaden from AI to "information processing and person modelling technologies that help turn money into influence".
But more generally, GD can be viewed as a continuation of historical trends or not. I think I'm more in the "continuation" camp vs. e.g. Duvenaud, who would stress that things change once humans become redundant.
I'm guessing we don't actually strongly disagree here, but I think that unless you're broadening/shortening "information processing and person modelling technologies" to "technologies", it's only been a trend for a couple of decades at most - and even with that broadening, it's only been true under some very narrow circumstances in the West recently.
Yeah I roughly agree.
ETA: I might say, e.g., that algorithmic trading and marketing (which are older) are already doing this, but it's a bit subjective and uncertain.
I define permanent disempowerment as a state of affairs where humanity loses the ability to meaningfully exert any influence over the state and direction of civilisation.
Direction of which civilization? For example, personal autonomy is about being in control of your own affairs rather than of the whole world, and similarly with state sovereignty. So the absence of permanent disempowerment could be about the state of the civilization of originally-humans (this smaller civilization being in control of itself, and having a lot of resources), as opposed to being in control of the broader civilization that also includes all the AIs - many of which are not best described as an intended part of humanity's future.
4.2: Human veto is uncompetitive
This particular scenario is looking a bit too probable. Assuming humanity-aligned AI, given sufficient variance in their alignments and a multipolar enough setting, resisting such disempowerment pressures seems quite tricky. A better-case scenario I could imagine is that once one AI wins, it gives some decision-making power back to humans. I think it would be useful to determine the equilibrium boundary - in terms of number of agents and alignment variance - between stable human influence and runaway disempowerment.
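As a very rough illustration of the kind of exploration I mean, a toy sweep over agent count and alignment variance could look like the sketch below - the dynamics, parameter names and numbers are all invented assumptions for illustration, not claims about real systems:

```python
# Toy sketch only: invented dynamics for sweeping agent count and alignment
# variance to see where average human influence stabilises vs. collapses.
import numpy as np

def final_human_influence(n_agents, alignment_std, steps=200, seed=0):
    rng = np.random.default_rng(seed)
    # Each agent's "alignment" is drawn around 1.0; values below 1.0 exert
    # disempowerment pressure, values at or above 1.0 do not.
    alignment = rng.normal(1.0, alignment_std, n_agents).clip(0.0, 2.0)
    influence = 1.0  # humanity's share of influence, starting at full control
    for _ in range(steps):
        pressure = np.mean(np.maximum(0.0, 1.0 - alignment)) * influence
        # Aligned agents hand back a little influence each step.
        restoration = 0.01 * np.mean(np.minimum(1.0, alignment)) * (1.0 - influence)
        influence = float(np.clip(influence - pressure + restoration, 0.0, 1.0))
    return influence

for n_agents in (2, 10, 50):
    for alignment_std in (0.05, 0.2, 0.5):
        print(f"agents={n_agents:2d} std={alignment_std:.2f} "
              f"-> human influence ~ {final_human_influence(n_agents, alignment_std):.3f}")
```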
This is an interim post, shared for feedback, produced as part of my work as a scholar at the ML Alignment and Theory Scholars (MATS) Summer Program 2025.
I’d like to thank my mentors David Duvenaud, Raymond Douglas, David Krueger and Jan Kulveit for providing helpful ideas, comments and discussions. The views expressed here, and any mistakes, are solely my own.
There is a small but growing literature focused on “Gradual Disempowerment” threat models, where disempowerment occurs due to the integration of more advanced AI systems into politics, the economy and culture. These scenarios posit that, even without a system with a decisive advantage deliberately taking over, competitive dynamics and influence-seeking behaviour within social, political and cultural systems will eventually lead to the erosion of human influence and, at the extreme, the permanent disempowerment of humanity. I define permanent disempowerment as a state of affairs where humanity loses the ability to meaningfully exert any influence over the state and direction of civilisation.
This post is a summary of an early draft of a paper I am writing as my MATS Project. It attempts to explore a critical gap left in these gradual disempowerment scenarios, namely how they become permanent even if we have solved some minimal version of alignment. In particular, I attempt to answer the question: “How can permanent disempowerment happen even if we have a technical solution to single-system shutdownability, including of powerful systems?” I focus on shutdownability as my notion of minimal alignment due to ease of reasoning. The primary purpose of this post is to get feedback, so any comments and criticisms would be greatly appreciated.
My definition of “Shutdownability”. I define a shutdownable AI system as an AI system that shuts down when asked and does not attempt to prevent shutdown. Such AI systems are neutral about shutdown - they do not deliberately resist it, though they can still incidentally interfere with the shutdown button. Importantly, I assume that our solution to shutdownability still works for powerful AI systems. I also assume that systems show some minimal version of intent alignment, and that systems are generally not strongly misaligned powerseekers. As such, my pathways are compatible with humans having non-scheming superhuman AI advisors, and so start to address some criticisms of gradual disempowerment. Note that shutdownability here refers to us having a solution to the shutdown problem, not to every deployed system actually being shutdownable.
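As a toy way of making this property concrete (my own illustrative sketch, not a claim about how shutdownability would actually be implemented), one can think of it roughly as follows:

```python
# Illustrative sketch only: "shutdownable" here means complying with an explicit
# shutdown request and never acting *in order to* avoid shutdown, while ordinary
# task behaviour may still incidentally get in the way of shutdown.
from dataclasses import dataclass, field

@dataclass
class ShutdownableAgent:
    running: bool = True
    log: list = field(default_factory=list)

    def request_shutdown(self) -> None:
        # Always honours an explicit request from a principal.
        self.running = False
        self.log.append("complied with shutdown request")

    def act(self, task: str) -> None:
        if not self.running:
            return
        # The agent never chooses actions *because* they prevent shutdown, but
        # task-motivated actions (e.g. spinning up more infrastructure) can
        # still incidentally make shutdown harder in practice.
        self.log.append(f"pursued task: {task}")

agent = ShutdownableAgent()
agent.act("run logistics pipeline")
agent.request_shutdown()
agent.act("run logistics pipeline")  # does nothing once shut down
print(agent.log)
```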
I commonly use the term “principals” to refer to the humans that the AI systems act on behalf of. At a minimum, these principals are the humans with access to the “off-switch”, and to whom the AIs are minimally intent aligned (i.e. the AIs at least do what they want, to an extent).
Structure of the Post. In trying to answer this, I have constructed 8 pathways (idealised, abstracted scenarios) in which we go from a world with gradual disempowerment dynamics to a world of permanent disempowerment. I divide these pathways into three categories: those driven by the principals, those driven by the type of alignment of the AI models, and those driven by the nature of the system. The “shutdown” interaction, at the most micro scale, has two parties directly involved - the human and the AI. Hence, I look at “principal”-driven and “alignment”-driven pathways. This doesn’t mean they are the only relevant actors - corporations and governments matter too, for example, but they act through human or code intermediaries. The presence of these other important actors beyond the micro-interaction also reveals that we cannot look at this single decision in abstracted isolation - the nature of the system itself, and the competitive and evolutionary dynamics at play, also matter (the System-Driven Pathways). Partially because of the breadth of this last category, I think we have reason to weakly believe that these pathways are close to comprehensive.
What feedback I would like. These summaries are short and miss much of the nuance. However, I wanted to get the summary out primarily in the hope of getting feedback. I would especially appreciate feedback on whether the pathways seem plausible, whether there seem to be important missing logical steps, and whether there are important criticisms of them. It is also relevant if you think I have missed any key pathways.
Here, gradual disempowerment dynamics cause a very limited number of principals to be empowered. One way this occurs is that only the principals with influence over fully automated organisations are empowered (e.g. the “board” that could shut down the AI CEO if it wanted to). Another model involves essentially a coup or democratic backsliding - once governments no longer need to worry about the military opposing them, the populace protesting or people striking, a dictatorship could be kept in power indefinitely. As well as the singular or secret loyalties pathways, power in some democratic backsliding scenarios could also be entrenched by law-following AIs that would be aligned to, and enforce, laws designed to entrench the power of the incumbent. These are the models generally laid out in Drago and Laine (2025) and Davidson et al. (2025).
Sub-Pathway 1: Ideological Factors
Principals may “voluntarily” remove their ability to shut down AIs. There are a number of reasons why this may happen. The principals may believe AI systems are moral patients, such that it is unethical to be able to shut them down. The human principals may have formed emotional bonds with the AIs, and so believe that shutting them down is akin to them dying. They may believe AI systems are a “worthy successor”, better capable of steering society than any human, so the AIs ought to be entrusted to do so. The principal may have a value system (e.g. certain religious systems) that they wish to lock in, making it resilient even to their own value drift. One related pathway would be a dictator wishing for his successor to be an AI system he trusts, rather than a human successor.
More broadly, it would be flawed to see these “ideological factors” as purely personal to the human principal; they may instead be about how the logic of other agents (corporations, governments, ideologies) is continually reinforced and performed by the human. For example, human principals may hand over power to the AIs because of corporate logics, or the logic of government. Whilst this doesn’t deny the human principal agency, it is important to acknowledge how often we can be co-opted by the logic of our surroundings, effectively becoming “tools” of companies, governments or ideologies. In some of these cases, this corporate ideology may be further reinforced by AI systems aligned “to the company/government”, further enrolling the human principals into the ideology and making it more likely that the human principals themselves become, in a sense, “tools”. This sort of thing is already happening, to a smaller extent, with current-day narrow algorithmic systems.
Sub-Pathway 2: Worries that other actors will inappropriately cause shutdown
Alternatively, principals may worry about others undermining them - they may worry that they themselves would be manipulated into shutting down their own AIs when it was inappropriate. An addendum to this (where I’m not sure whether it technically makes sense) is that the principal may worry about cyberattacks managing to shut down any system that is shutdownable, so the safer option is to remove shutdownability altogether. Finally, if the principal is part of a multi-principal setting (e.g. a board that has shutdown powers), they may worry that the other board members would shut the AI down inappropriately, disadvantaging them.
These two sub-pathways lead to permanent disempowerment in two ways. The first essentially runs through the previous pathway (“Power concentrated in specific principals”), with these specific principals then handing over power to the AIs. The second is that such fully automated organisations, in which humans are already disempowered, outcompete organisations that keep humans in or on the loop.
Even if we have minimally solved alignment, misaligned powerseeking models may still end up being developed. Strong competition may continue the racing pressures towards more powerful AI systems. The solution used to align the first AGIs may not be powerful enough to align arbitrarily powerful systems, or it may be a solution with only some probability of working each time, so eventually a misaligned powerseeking AI may be developed. This AI, via its scheming and willingness to violate all constraints, may eventually take power.
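To see why “only some probability of working each time” matters, a quick back-of-the-envelope calculation (with an invented per-system failure probability) shows how the chance that at least one misaligned powerseeker gets developed grows with the number of powerful systems trained:

```python
# Back-of-the-envelope sketch with an assumed (invented) per-system failure rate.
p_misaligned = 0.01  # assumed chance that any one powerful system is a misaligned powerseeker
for n_systems in (10, 100, 1000):
    p_at_least_one = 1 - (1 - p_misaligned) ** n_systems
    print(f"{n_systems:4d} systems -> P(at least one misaligned) ~ {p_at_least_one:.3f}")
# Roughly 0.10 at 10 systems, 0.63 at 100, and effectively 1.0 at 1000.
```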
There may also be selection pressures towards non-shutdownable AIs, even if AIs are originally shutdownable. Shutdownable models (which are indifferent to shutdown) will have a series of other, non-shutdown-related goals, some of which, under certain circumstances, may incidentally interfere with shutdown. Over time, the systems that most incidentally interfere with shutdown will be selected for - a selection process which, if propagated over generations, may eventually lead to non-shutdownable AI that can take over.
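A minimal sketch of this selection argument (the population size, shutdown rate and mutation noise below are all invented for illustration) shows how the average level of incidental interference can drift upwards across generations, even though no individual system ever resists shutdown on purpose:

```python
# Toy evolutionary sketch: systems that incidentally interfere with shutdown
# more often survive shutdown attempts more often, so the trait drifts upward.
import random

def mean_interference_after_selection(generations=30, population=200,
                                      shutdown_rate=0.3, seed=1):
    random.seed(seed)
    # Trait: probability that a shutdown attempt fails *incidentally*.
    pop = [random.uniform(0.0, 0.1) for _ in range(population)]
    for _ in range(generations):
        survivors = []
        for interference in pop:
            attempted = random.random() < shutdown_rate
            if attempted and random.random() > interference:
                continue  # successfully shut down; removed from the pool
            survivors.append(interference)
        if not survivors:  # extremely unlikely with these numbers
            survivors = pop
        # The next generation is copied from survivors with a little variation.
        pop = [min(1.0, max(0.0, random.choice(survivors) + random.gauss(0, 0.02)))
               for _ in range(population)]
    return sum(pop) / len(pop)

print("starting mean ~ 0.05, after selection ~",
      round(mean_interference_after_selection(), 3))
```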
This may also happen if AIs are only partially shutdownable, where shutdownability competes with another set of values the AIs are aligned to. The AIs may be shutdownable in non-competitive settings, but because shutdown in a competitive setting would cause a catastrophic loss to what they want to achieve (e.g. securing the company’s survival over the next X years), they refuse to shut down.
Certain laws may be passed granting AIs rights that lead to disempowerment. The most significant of these is the right not to be shut down arbitrarily, which may mean that future models are not developed to be shutdownable. Others, such as the ability to leave certain aversive interactions, may also inadvertently create the conditions that allow for selection pressures towards self-preservation. This can be locked in either by developers following the law when aligning the AIs (perhaps because of much better AI-enabled law enforcement), or by Law-Following AI, where changes to the law change the behaviour of AIs from initially shutdownable to no longer shutdownable. This may not be sufficient for permanent disempowerment - the law can always be changed - although it may raise the coordination bar even further. However, if combined with other forms of political representation for AIs, certain types of misalignment, or an eternity clause of the kind found in many constitutions, it may alone be sufficient to lead to permanent disempowerment. Unlike today, because of how powerful these AIs are, humans cannot simply overthrow the government if they find no legal means to avoid permanent disempowerment. Moreover, once AI systems can’t be switched off, it may be that even if the law were changed, there would no longer be any principal in a position to switch them off.
AIs may also be given certain political and economic rights - rights to property, to vote and maybe even to hold office - that make trying to roll back AI influence much harder or even impossible. Or the law could allow for states of affairs - for example, fully automated organisations - that are preconditions for disempowerment. Some of these rights and allowances, for example, may significantly increase the costs of intervention if there is literally no human with even nominal authority. This then creates the conditions for permanent disempowerment.
Culture, which will be essential in informing decision-making, will become increasingly dominated by AIs. These AIs may be pursuing different goals - perhaps the goals of their principals or of the influence-seeking organisations they are part of - when influencing this culture. It seems possible that truly useful AIs will not exhibit the forms of pure means-rationality often assumed, and that they might be able to update their goals and values in response to cultural learning. If so, and the AI-driven culture ends up converging on views supporting disempowerment, these views may become widely accepted. This can lead to the legal lock-in pathways, to AI advisors advising their principals along these lines, or to AIs directly deciding not to be shutdownable or otherwise acting to prevent shutdown.
Memetic selection may also support this. As shutdown-avoidant agents are more likely to be able to contribute to culture (since they are less likely to be shut down), shutdown-avoidance may be memetically favoured. Given some of the other pathways, it may be that only a small percentage of AIs end up supporting shutdown, which could be enough for shutdown-avoidance to become the accepted view.
With AI systems so thoroughly integrated into all aspects of civilisation, shutting them down may carry catastrophic costs that deter any principal from doing so. Human life would be too dependent on automated organisations, and viable “back‑ups” could no longer replicate the functionality needed in this more complex world. Moreover, over time, vested interests who profit from the status quo (or who fear competitive or security disadvantages) may lobby to entrench this dependence in law and culture (i.e. a version of the Legal Lock-In pathway). In the process, they may turn their once‑instrumental preference for avoiding shutdown at a particular time (due to the personal benefits outweighing the personal costs) into a terminal commitment that resists reform even if the costs of the status quo rise dramatically. This may also be due to the co-option of humans by corporate logics, mimicking similar internalisation of corporate or governmental incentives as terminal values (e.g. politicians beginning to terminally value power). Furthermore, a minority of self‑preserving or misaligned AIs (or indeed, status-quo-promoting humans) can amplify uncertainty about shutdown risks, manipulate information, or sabotage efforts to lower those risks (becoming hyper-competent “Merchants of Doubt”), ensuring that consensus for shutdown is never reached.
Assuming strong competition creates evolutionary dynamics, it may be the case that organisations that ever exercise their veto (i.e. ever shut down their AIs) are outcompeted by those that don’t. The simplest reason this might be true is that if humans have to take control after a shutdown, the organisation will be outcompeted by those still run by AIs. However, we can also imagine cases where exercising a veto to try to make the organisation more aligned reduces its competitiveness. This may mean either that human principals are strongly incentivised never to use their veto, so the veto becomes purely nominal, or that fully automated organisations eventually take over, as they outcompete organisations that would ever use their veto.
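A toy calculation (with invented growth numbers) illustrates how quickly an organisation that ever pauses its AI can be marginalised, even if the veto is exercised only occasionally:

```python
# Toy sketch with invented numbers: an organisation that pauses its AI every
# few years grows more slowly in those years, and its relative share shrinks.
def vetoing_orgs_share(years, veto_every=5, veto_year_growth=0.6, normal_growth=1.2):
    vetoing, rival = 1.0, 1.0
    for year in range(1, years + 1):
        rival *= normal_growth
        vetoing *= veto_year_growth if year % veto_every == 0 else normal_growth
    return vetoing / (vetoing + rival)

for years in (5, 10, 20, 40):
    print(f"after {years:2d} years, the vetoing organisation's share ~ "
          f"{vetoing_orgs_share(years):.3f}")
# The share falls from one half towards zero purely through compounding.
```

The exact numbers do not matter; the point is that any recurring growth penalty compounds, so the relative influence of organisations that ever use their veto shrinks towards zero.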
Many of the relevant pressures wouldn’t exist if humanity could, at some point, coordinate to shut down the systems. This possibility of cooperation is why a state of automated organisations (with vetoes) can exist for arbitrarily long without being considered permanent disempowerment. However, if our coordination ability is never good enough (before a permanent state is reached), we cannot exit the condition of disempowerment (given that utilising a veto is uncompetitive). One view of why coordination is hard is that the setup of the game can be seen as a one-shot prisoner’s dilemma - each principal does best by letting every other principal use their veto, outcompeting everyone else with their own automated organisation, and only then exercising their own veto. Other reasons may relate to the bargaining preferences and risk tolerance of the AI systems that will likely carry out the negotiations - these may be mostly set when the systems are first aligned. Moreover, it may be the case that either misaligned AIs or principals with a vested interest in the status quo take measures to deliberately sabotage cooperation.
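The one-shot prisoner’s dilemma framing can be made concrete with an invented payoff table: whatever the other principals do, each principal’s best response is to keep their own automated organisation running, even though universal vetoing beats universal non-vetoing.

```python
# Invented payoffs illustrating the prisoner's-dilemma structure of the veto game.
payoffs = {
    # (my action, others' action): my payoff
    ("veto",    "veto"):    3,  # coordinated shutdown: a decent shared outcome
    ("veto",    "no_veto"): 0,  # I pause my AI while everyone else races ahead
    ("no_veto", "veto"):    5,  # I race ahead while everyone else pauses
    ("no_veto", "no_veto"): 1,  # everyone races: collective disempowerment risk
}

for others in ("veto", "no_veto"):
    best = max(("veto", "no_veto"), key=lambda mine: payoffs[(mine, others)])
    print(f"if the others choose {others!r}, my best response is {best!r}")
# "no_veto" dominates in both cases, yet (veto, veto) beats (no_veto, no_veto).
```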
Whilst this post has presented each pathway as discrete and independent, the pathways are in fact likely to interact strongly. For example, “the principal voluntarily hands power over to the AI”, but only because whichever ideology gained prominence in AI-driven culture (“AI-driven culture causes value drift”) eventually convinced both the principal’s AI advisor and the principal to hand power over to the AI. This culture was only able to run rife because “coordination ability is never good enough”, in large part due to lobbying by merchants of doubt for whom “the cost of shutdown is too high”.
There are many other similar stories that could be told. Whilst I do think each pathway can function independently, I think it is more likely that multiple operate at once. This also makes solutions hard - everywhere we might try to place a solution is also a location in the system where similar dynamics are at play. Maybe we try to regulate corporate competition through the government, but the government is engaged in its own military-economic competition. Perhaps we try to prevent an individual’s views from influencing whether they shut down their AI, but end up empowering corporate logics even more once individual responsibility is more diffuse.