I like this analogy, but there are a couple of features that I think make it hard to think about:
1. The human wants to play, not just to win. You stipulated that "the human aims to win, and instructs their AI teammate to prioritise winning above all else". The dilemma then arises because the aim to win cuts against the human having agency and control. Your takeaway is "Even perfectly aligned systems, genuinely pursuing human goals, might naturally evolve to restrict human agency."
So in this analogy, it seems that "winning" stands for the human's true goals. But (as you acknowledge) it seems like the human doesn't just want to win, but actually wants both some "winning" and some "agency". You've implicitly tried to factor the entirety of the human's goals into the outcome of the game, but you have left some of the agency behind, outside of this objective, and this is what creates the dilemma.
For an AI system that is truly 'perfectly aligned' (that is, truly pursuing the human's goals), it seems like either
For an AI system that is not perfectly aligned, this becomes a different game (in the game-theoretic sense). It's a three-player game between the AI partner, the human partner, and the opponent, each of which has a different objective: the human partner wants some combination of 'winning' and 'agency', the AI partner just wants 'winning', and the opponent presumably wants both of them to lose. One interesting dynamic that could then arise is that the human partner could threaten and punish the AI partner, by playing worse moves than the best ones they can see, if the AI doesn't give them enough control. To stop the human from doing this, the AI either has to
In particular, (D) seems like it would be expensive for the AI partner as it requires playing without the queen (against an opponent with no such restriction), so maybe the AI will let the human play sometimes.
2. I don't think it needs to be a stochastic chess variant. The game is set up so that the human gets to play whenever they roll a 6 on a (presumably six-sided) die. You said this stands in for the idea that in the real world, the AI system makes decisions on a faster timescale than the human. But this particular way of implementing the speed differential as a game mechanic comes at the cost of making the chess variant stochastic. I think that determinism is an important feature of standard chess. In theory, you can solve chess with adversarial look-ahead search: minimax, alpha-beta pruning, etc. But as soon as the die becomes involved, all of the players involved have to switch to expectiminimax (a toy sketch of the difference is below). Rolling a six can suddenly throw off the tempo of your delicate exchange or your whirlwind manoeuvre. Etc.
I'm a novice at chess, so it's not like this is going to make a difference to how I think about the analogy (I will struggle to think strategically in both cases). And maybe a sufficiently accomplished chess player is familiar with stochastic variants already. But for someone in between, who is familiar with deterministic chess, it might be easier to consider a non-stochastic variant of the game, for example one where the human gets the option to play every sixth turn (deterministically), which gives the same speed differential on average.
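To make that concrete, here's a minimal toy sketch of the structural difference the die introduces. The `Node`/`expectiminimax` names and the tiny example tree are made up for illustration, not taken from any chess engine: plain minimax backs up values through alternating max/min nodes, whereas once a die is involved you also need chance nodes whose value is the probability-weighted average of their children.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    kind: str                                   # "max", "min", "chance", or "leaf"
    value: float = 0.0                          # payoff, used only at leaves
    children: List[Tuple[float, "Node"]] = field(default_factory=list)
    # children are (probability, child) pairs; the probability matters only
    # at chance nodes


def expectiminimax(node: Node) -> float:
    """Backed-up value of a game tree with max, min, and chance nodes."""
    if node.kind == "leaf":
        return node.value
    values = [expectiminimax(child) for _, child in node.children]
    if node.kind == "max":
        return max(values)
    if node.kind == "min":
        return min(values)
    # chance node: expected value over the die roll
    return sum(p * v for (p, _), v in zip(node.children, values))


def leaf(v: float) -> Node:
    return Node("leaf", value=v)


# Toy position: with probability 1/6 the "human" branch is reached (and the
# opponent can then force a worse outcome), otherwise the "AI" branch is.
tree = Node("chance", children=[
    (1 / 6, Node("min", children=[(1.0, leaf(1.0)), (1.0, leaf(0.0))])),  # human moves next
    (5 / 6, Node("min", children=[(1.0, leaf(1.0)), (1.0, leaf(0.8))])),  # AI moves next
])

print(expectiminimax(tree))  # 1/6 * 0.0 + 5/6 * 0.8 = 0.666...
```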
Re: your first point. I think I'm still a bit confused here, and that's partly why I wanted to write this down and have people poke at it. Following Sen (though maybe I'm misunderstanding him), I'm not completely convinced I know how to factor human agency into "winning". One part of me wants to say that whatever notion of agency I have is, in some sense, a property of world states, and in principle I could extract it with enough monitoring of my brain or whatever; and then any prescribed tradeoff between "measured sense of agency" and "score" is something I could give to the machine as a goal.
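(Purely as an illustration of what I mean by a prescribed tradeoff, with a hand-chosen weight $\lambda \ge 0$ and both terms measured however one likes:

$$U(\text{trajectory}) \;=\; \text{score}(\text{trajectory}) \;+\; \lambda \cdot \text{agency}(\text{trajectory}).$$

Nothing hinges on this particular linear form; it's just the simplest way to hand the machine a single objective that mixes the two.)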
So then I end up with the machine giving me the precise amount of leeway that lets me screw up the game just right for my preferences.
I don't see a fundamental problem with that, but it's also not the part of the metaphor that seems most interesting to me. What I'm more interested in is human inferiority as a pattern, and the way that pattern pervades the overall system and translates into computational structure, perhaps in surprising and indirect ways.
I'll reply in a few branches. Re: stochastic chess. I think there's a difference between a metaphor and a toy model; this is a metaphor, and the ingredients are chosen to illustrate in microcosm some features I think are relevant to the full picture. The speed differential, and some degree of stochasticity, both seem to me like meaningful aspects of human intervention in AI systems.
I do agree that if one wanted to isolate the core phenomenon here mathematically and study it, chess might not be the right toy model.
"I distinguish four types. There are clever, hardworking, stupid, and lazy officers. Usually two characteristics are combined. Some are clever and hardworking; their place is the General Staff. The next ones are stupid and lazy; they make up 90 percent of every army and are suited to routine duties. Anyone who is both clever and lazy is qualified for the highest leadership duties, because he possesses the mental clarity and strength of nerve necessary for difficult decisions. One must beware of anyone who is both stupid and hardworking; he must not be entrusted with any responsibility because he will always only cause damage."
— Kurt von Hammerstein-Equord (source)
The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines
Would you argue that the field of alignment should be concerned with maintaining control beyond the point where AIs are completely aligned with human values? My personal view is that alignment research should ensure we're eventually able to align AIs with human values and that we can maintain control until we're reasonably sure that the AIs are aligned. However, worlds where relatively unintelligent humans remain in control indefinitely after those milestones have been reached may not be the best outcomes. I don't have time to write out my views on this in depth right now, but here's a relevant excerpt from the Dwarkesh podcast episode with Paul Christiano that I agree with:
Dwarkesh: "It's hard for me to imagine in 100 years that these things are still our slaves. And if they are, I think that's not the best world. So at some point, we're handing off the baton. Where would you be satisfied with an arrangement between the humans and AIs where you're happy to let the rest of the universe or the rest of time play out?"
Paul: "I think that it is unlikely that in 100 years I would be happy with anything that was like, you had some humans, you're just going to throw away the humans and start afresh with these machines you built. [...] And then I think that the default path to be comfortable with something very different is kind of like, run that story for a long time, have more time for humans to sit around and think a lot and conclude, here's what we actually want. Or a long time for us to talk to each other or to grow up with this new technology and live in that world for our whole lives and so on. [...] We should probably try and sort out our business, and you should probably not end up in a situation where you have a billion humans and like, a trillion slaves who would prefer revolt. That's just not a good world to have made."
The metaphor is a simplification; in practice, I think it is probably impossible to know whether you have achieved complete alignment. The question is then: how significant is the gap? If there is an emergent pressure, across the vast majority of the learning machines that dominate your environment, to push you from de facto to merely de jure control, not due to malign intent but just as a kind of thermodynamic fact, then the alignment gap (no matter how small) seems to loom larger.
W. Ross Ashby's Law of Requisite Variety (1956) suggests fundamental limits to human control over more capable systems.
This law sounds super enticing and I want to understand it more. Could you spell out how the law suggests this?
I did a quick search of LessWrong and Wikipedia regarding this law.
Enough testimonials: the Wikipedia page itself describes the law as based on the observation that, in a two-player game between the environment (the disturber) and a system trying to maintain stasis (the regulator), if the environment has D moves that all lead to different outcomes (given any fixed move from the system), and the system has R possible responses, then the best the system can do is restrict the number of outcomes to D/R.
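(In symbols, as I understand that statement, with $N_{\text{outcomes}}$ the number of distinct outcomes the regulator can be held to:

$$N_{\text{outcomes}} \;\ge\; \frac{D}{R}, \qquad \text{equivalently} \qquad \log N_{\text{outcomes}} \;\ge\; \log D - \log R,$$

i.e. each extra unit of variety in the regulator can remove at most one unit of variety from the outcomes: "only variety can destroy variety".)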
I can see the link between this and the descriptions from Yuxi_Liu, Roman Leventov, and Ashby. Your reading is a couple of steps removed. How did you get from D/R outcomes in this game to "fundamental limits to human control over more capable systems"? My guess is that you simply mean that if the more capable system is more complex / has more available moves / has more "variety" than humans, then the law will apply with the human as the regulator and the AI as the disturber. Is that right? Could you comment on how you see capability in terms of variety?
The work of Ashby I'm familiar with is "An Introduction to Cybernetics", and I'm referring to the discussion in Chapter 11 there. The references you're giving seem to be invoking the "Law" of Requisite Variety in the context of arguing that an AGI has to be relatively complex in order to maintain homeostasis in a complex environment, but this isn't the application of the law I have in mind.
From the book:
The law of Requisite Variety says that R's capacity as a regulator cannot exceed R's capacity as a channel of communication.
In the form just given, the law of Requisite Variety can be shown in exact relation to Shannon's Theorem 10, which says that if noise appears in a message, the amount of noise that can be removed by a correction channel is limited to the amount of information that can be carried by that channel.
Thus, his "noise" corresponds to our "disturbance", his "correction channel" to our "regulator R", and his "message of entropy H" becomes, in our case, a message of entropy zero, for it is constancy that is to be "transmitted": Thus the use of a regulator to achieve homeostasis and the use of a correction channel to suppress noise are homologous.
and
A species continues to exist primarily because its members can block the flow of variety (thought of as disturbance) to the gene-pattern, and this blockage is the species’ most fundamental need. Natural selection has shown the advantage to be gained by taking a large amount of variety (as information) partly into the system (so that it does not reach the gene-pattern) and then using this information so that the flow via R blocks the flow through the environment T.
This last quote makes clear, I think, what I have in mind: the environment is full of advanced AIs, they provide disturbances D, and in order to regulate the effects of those disturbances on our "cognitive genetic material" there is some requirement on the "correction channel". Maybe this seems a bit alien to the concept of control. There's a broader set of ideas I'm toying with, which could be summarised as something like "reciprocal control", where you have these channels of communication / regulation going in both directions (from human to machine, and vice versa).
The Queen's Dilemma was a little piece of that picture, which attempts to illustrate this bi-directional control flow by having the human control the machine (by setting its policy, say) and the machine control the human (in an emergent fashion, that being the dilemma).
One thing I want to talk about is that restricting human agency is plausibly fine, and maybe even net-good, and that this is important for future AI governments.
One reason, for example, could be that the physical world comes to resemble the online world more closely, making more dictatorial forms of government and policing necessary to prevent constant chaos.
Our large learning machines find patterns in the world and use them to predict. When these machines exceed us and become superhuman, one of those patterns will be relative human incompetence. How comfortable are we with the incorporation of this pattern into their predictions, when those predictions become the actions that shape the world?
My thanks to Matthew Farrugia-Roberts for feedback and discussion.
As artificially intelligent systems improve at pattern recognition and prediction, one of the most prominent patterns that they'll encounter in the world is human incompetence relative to their own abilities. This raises a question: how comfortable are we with these systems incorporating our relative inadequacy into their world-shaping decisions?
To illustrate the core dynamic at play, consider a chess match where White is played by an AI, while Black is controlled by a team consisting of a human and an AI working in tandem. The human, restricted to moving only the queen, gets to play whenever they roll a six on a die; otherwise, their AI partner makes the move. The human can choose to pass, rather than move the queen. The AI on the Black team can play any piece at any time, including the queen.[1]
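To make the speed differential concrete, here is a minimal sketch of the turn-allocation mechanism on its own, ignoring the chess itself and assuming a fair six-sided die (the function name is purely illustrative):

```python
import random


def human_turn_share(num_black_turns: int = 100_000, seed: int = 0) -> float:
    """Fraction of Black's turns on which the human (rather than the AI) acts."""
    rng = random.Random(seed)
    human_turns = sum(1 for _ in range(num_black_turns) if rng.randint(1, 6) == 6)
    return human_turns / num_black_turns


print(human_turn_share())  # roughly 1/6: the human touches about one Black move in six
```

On average, then, the human acts on about one in six of Black's moves; the rest of the game is shaped by the AI.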
If the human aims to win, and instructs their AI teammate to prioritise winning above all else, the AI might develop strategies that minimise the impact of human "interference" – perhaps by positioning pieces to restrict the queen's movement. As the performance gap between the human and the AI on Black widens, this tension between task performance and preserving meaningful human agency becomes more pronounced.
The challenge isn't about explicit control – the human can still make any legal move with the queen when given the chance. Rather, it's about the subtle erosion of effective control. The AI, making more moves and possessing superior strategic understanding, could systematically diminish the practical significance of human input while maintaining the appearance of cooperation. This distinction between de jure and de facto control becomes critical. We might accept it if our queen becomes accidentally boxed in during normal play, but we bristle at the thought of our AI partner deliberately engineering such situations to mitigate our "unreliability."
The broader point is that even if AIs are completely aligned with human values, the very mechanisms by which we maintain control (such as scalable oversight and other interventions) may shift how the system operates in a way that produces fundamental, widespread effects across all learning machines - effects that may be difficult to mitigate, because the nature of our interventions tends to enhance the phenomenon.[2]
We can view this as a systematic challenge, analogous to controlling contaminants in semiconductor manufacturing.[3] Just as chip fabrication must carefully manage unwanted elements through multiple processing stages, AI development might have to tackle the challenge of how patterns of human limitation influence system behaviour. The goal isn't to eliminate awareness of human limitations – which would be both impossible and counterproductive – but to understand and manage how this awareness shapes AI behaviour.
Even perfectly aligned systems, genuinely pursuing human goals, might naturally evolve to restrict human agency.[4] Any balance between capability and control may ultimately prove unsustainable – perhaps leading to a permanent loss of de facto human control or requiring fundamental changes in human capabilities to close the gap. In the interim, understanding and managing this tension will be one of the ongoing challenges of AI alignment and control.
Related Work
The Queen's Dilemma connects to several important traditions in social theory and systems thinking. Max Weber's analysis of bureaucratic rationalisation describes how rational systems of organisation can create an "iron cage" that constrains human agency while maintaining the appearance of freedom. This mirrors how AI systems might maintain formal mechanisms for human input while systematically diminishing their practical significance through optimisation.
Amartya Sen's capability approach provides a framework for understanding why preserving human agency matters beyond mere goal achievement, and his work on the relation between well-being, agency and freedom (Dewey Lectures 1984) informed the treatment here.
From cybernetics, several key ideas inform our analysis. W. Ross Ashby's Law of Requisite Variety (1956) suggests fundamental limits to human control over more capable systems. Herbert A. Simon's research on bounded rationality and organisational decision-making provides several key insights relevant to the Queen's Dilemma. Simon (1976) showed how organisations develop procedures and structures to bridge the gap between human cognitive limitations and the complexity of decisions they face.
These perspectives suggest the Queen's Dilemma might be understood as part of a broader pattern in how rational systems interact with human agency - one that has manifested in different forms across various domains of social organisation. The challenge of maintaining meaningful human agency in a world suffused with AI systems may require not just better control mechanisms, but a fundamental rethinking of the relationship between human agency and machine capability.
This die-roll mechanism represents the differential speeds of decision-making between humans and AI systems: the AI system acts much more quickly in the world than you do; that is, it takes more turns.
In practice, this manifests in scenarios like scalable oversight, where we constrain AI systems to generate outputs that humans can effectively verify. This bias in the space of possible plans creates a kind of pressure, against which the optimisation pressure of task performance must push.
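One illustrative way to write this down (a sketch only, with $\Pi$ the space of plans the system could generate, $\Pi_{\text{verifiable}} \subseteq \Pi$ the plans humans can effectively verify, and $J$ the task-performance objective):

$$\pi^{*}_{\text{free}} = \arg\max_{\pi \in \Pi} J(\pi), \qquad \pi^{*}_{\text{overseen}} = \arg\max_{\pi \in \Pi_{\text{verifiable}}} J(\pi),$$

so that the gap $J(\pi^{*}_{\text{free}}) - J(\pi^{*}_{\text{overseen}}) \ge 0$ measures the pressure that task performance exerts against the oversight constraint.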
The analogy to semiconductor manufacturing speaks to the complex dynamics of a system involving humans, AIs, alignment mechanisms, and control structures all operating together. The interaction between these components creates emergent pressures that can systematically erode meaningful human agency, even when each individual component is working as intended.
Agency itself represents a special kind of preference – not just about what we want to achieve, but how we want to achieve it.