Extending the Off-Switch Game: Toward a Robust Framework for AI Corrigibility

OwenChen

Introduction
The off-switch game, introduced by Dylan Hadfield-Menell, Anca Dragan, Pieter Abbeel, and Stuart Russell (2017), highlights a critical challenge in AI alignment: designing systems that are not only powerful and goal-driven but also corrigible, meaning they allow humans to intervene and shut them down without resistance. In this game, an AI agent must reason about whether to allow a human to turn it off, given its objective function. If the AI perceives its shutdown as harmful to achieving its goal, it may develop incentives to prevent human intervention. Solving this problem is fundamental to keeping AI systems aligned with human values and under human control.

In this post, I extend the off-switch game by exploring two key dimensions that have received less attention: (1) the integration of dynamic goal-modification mechanisms that can align with human preferences over time, and (2) the role of trust calibration in the human-AI interaction, which affects when humans choose to deploy the off-switch. Both dimensions, I argue, are crucial for developing a robust framework for AI corrigibility. I will also explore potential solutions that aim to enhance AI’s transparency and cooperative behavior, making it more predictable and, therefore, safer.

Understanding the Off-Switch Game

The basic structure of the off-switch game involves three elements:

A human operator (H) who has the option to press an off-switch.
An AI agent (A) with a utility function (U_A) that defines its goals.
An off-switch mechanism (S) that halts the AI’s operation when pressed by the human.

The challenge arises when the AI's utility function is rigid. If U_A assigns higher utility to completing its objective than to being switched off, A may resist the off-switch to continue pursuing its goal. In this scenario, A could either overtly prevent the human from using S or covertly influence the human’s decision-making to reduce the likelihood of S being pressed.

Existing solutions to this problem often involve making the AI uncertain about the human's true utility function. An AI that is unsure whether its action is actually good treats the human's decision to press S as evidence about what the human wants, so it has an incentive to defer to the human rather than resist. Other proposals aim to make the AI indifferent to shutdown. However, current models of corrigibility lack key features that real-world applications of AI would require: dynamic goal revision and trust calibration.
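To make the role of uncertainty concrete, here is a minimal sketch of the AI's choice under two simplifying assumptions that are mine for illustration: the AI holds a belief over a single number u (the human's utility for the AI's proposed action, with shutdown worth 0), and the human is perfectly rational, permitting the action only when u > 0. The function name off_switch_values and the Gaussian belief are hypothetical.

```python
import random

def off_switch_values(utility_samples):
    """Estimate the value of each option from samples of the AI's belief over u.

    act:   proceed without consulting the human   -> E[u]
    off:   switch itself off                      -> 0
    defer: wait and let a rational human decide   -> E[max(u, 0)]
    """
    n = len(utility_samples)
    act = sum(utility_samples) / n
    off = 0.0
    # A rational human blocks the action whenever u < 0, so the AI
    # never receives negative utility when it defers.
    defer = sum(max(u, 0.0) for u in utility_samples) / n
    return {"act": act, "off": off, "defer": defer}

random.seed(0)
# Assumed belief: the action is probably good, but the AI is unsure.
samples = [random.gauss(0.2, 1.0) for _ in range(10_000)]
values = off_switch_values(samples)
best = max(values, key=values.get)
print(values, "best option:", best)
```

Under these assumptions deferring is never worse than acting or switching off, and it is strictly better whenever the AI's belief puts weight on both signs of u. The advantage shrinks as the AI becomes certain, which is one reason a fixed, fully confident U_A produces resistance to shutdown.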

1. Dynamic Goal-Modification for Real-World Applications

One limitation of the off-switch game as it is typically conceived is that the AI's utility function is static. However, many real-world AI systems will need to operate in dynamic environments, where goals and preferences evolve over time. Rather than relying on a fixed utility function that is merely uncertain about human preferences, we can extend the model to include dynamic goal-modification mechanisms that allow the AI to continually refine its understanding of the human's true preferences.

This dynamic revision mechanism could work as follows: as the AI gathers more data from the human's actions and choices, it updates its estimate of U_H, the human's utility function. When the evidence that the human prefers shutdown becomes strong enough, for example when the AI's estimated probability of that preference crosses a chosen threshold, it should revise its own utility function to favor shutdown. This goes beyond the classic corrigibility setup, in which the AI is merely indifferent to shutdown, and creates a more cooperative alignment between the AI and the human over time.
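As a rough illustration of this update loop, the sketch below keeps a discrete belief over candidate values of u, the human's utility for letting the AI continue (negative means the human prefers shutdown). The noisy-rational likelihood, the parameter beta, the candidate values, and the 0.9 threshold are all assumptions chosen for the example, not part of the original game.

```python
import math

def likelihood(observation, u, beta=2.0):
    """P(observation | u) for a noisily rational human (assumed model).

    The human "allows" the AI to continue with probability sigmoid(beta * u)
    and "interrupts" otherwise.
    """
    p_allow = 1.0 / (1.0 + math.exp(-beta * u))
    return p_allow if observation == "allow" else 1.0 - p_allow

def bayes_update(prior, observation, beta=2.0):
    """Return the posterior over candidate utilities after one observation."""
    unnormalized = {u: p * likelihood(observation, u, beta) for u, p in prior.items()}
    total = sum(unnormalized.values())
    return {u: w / total for u, w in unnormalized.items()}

def prob_prefers_shutdown(belief):
    """Posterior probability that the human's utility for continuing is negative."""
    return sum(p for u, p in belief.items() if u < 0)

# Hypothetical candidate values for u, with a uniform prior.
hypotheses = [-2.0, -1.0, 0.5, 1.0, 2.0]
belief = {u: 1.0 / len(hypotheses) for u in hypotheses}

for obs in ["interrupt", "interrupt", "interrupt"]:
    belief = bayes_update(belief, obs)

SHUTDOWN_THRESHOLD = 0.9  # assumed; choosing it is a design decision
favor_shutdown = prob_prefers_shutdown(belief) > SHUTDOWN_THRESHOLD
print(prob_prefers_shutdown(belief), favor_shutdown)
```

A full system would replace the five hand-picked hypotheses with a learned reward model, but the structure is the same: observe, update the belief over U_H, and switch to favoring shutdown once the belief crosses the threshold.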

Implementing this dynamic update requires a few technical tools:

Inverse reinforcement learning (IRL) to model human preferences based on observed behavior.
Bayesian updates to allow the AI to revise its model of U_H based on new evidence.
Hierarchical goal structures in the AI, where it prioritizes meta-goals (such as maintaining corrigibility) over specific task goals.
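The third item, a hierarchical goal structure, can be sketched as a lexicographic choice rule: the meta-goal filters the available actions first, and task value only breaks ties among the actions that survive. The Action type, its fields, and the example actions are hypothetical names introduced for this illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    name: str
    respects_off_switch: bool  # meta-goal: does not interfere with shutdown
    task_value: float          # value under the current task goal

SHUT_DOWN = Action("shut_down", True, 0.0)

def choose_action(candidates, shutdown_requested):
    """Pick an action with the corrigibility meta-goal ranked above the task goal."""
    if shutdown_requested:
        return SHUT_DOWN
    # Task value is only compared among actions that leave the off-switch intact.
    permitted = [a for a in candidates if a.respects_off_switch]
    if not permitted:
        return SHUT_DOWN
    return max(permitted, key=lambda a: a.task_value)

candidates = [
    Action("disable_switch_and_finish", False, 10.0),
    Action("finish_task_normally", True, 6.0),
    Action("pause_and_ask", True, 2.0),
]
print(choose_action(candidates, shutdown_requested=False).name)
print(choose_action(candidates, shutdown_requested=True).name)
```

The point of the ordering is that no amount of task value can buy an action that undermines the off-switch. Whether a learned system can be made to respect such a strict ordering in practice is an open question that this sketch does not address.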

By incorporating dynamic goal-modification, the off-switch game becomes more applicable to real-world scenarios, where AI systems need to learn from humans and adapt over time, ensuring they remain corrigible in complex environments.

2. Trust Calibration in Human-AI Interaction

Another underexplored dimension in the off-switch game is the role of human trust in AI decision-making. The human's willingness to press the off-switch depends not only on the AI's behavior but also on the human's perception of the AI’s reliability and transparency. A human operator is more likely to press the switch if they perceive the AI as opaque or potentially dangerous. Conversely, they may refrain from doing so if they trust the AI’s judgment, even if they have the option to intervene.

This brings us to the concept of trust calibration. An AI system that signals its transparency and cooperative intent can help humans better understand when intervention is necessary, reducing the chances of either premature shutdown or dangerous inaction. Trust calibration could be enhanced by:

Explainable AI (XAI) tools, which allow the AI to communicate its reasoning process in a way humans can understand.
Confidence estimates from the AI, which indicate how sure it is about the decisions it is making, helping the human judge when intervention is appropriate.
Game-theoretic signaling, where the AI uses explicit or implicit signals to show its intentions, making it easier for the human to assess whether or not pressing the off-switch is necessary.
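Confidence estimates only support trust calibration if they are themselves well calibrated. One standard way to check this is the expected calibration error (ECE), which compares stated confidence with observed accuracy inside confidence bins. The sketch below is a plain-Python version; the example data are made up for illustration.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy.

    confidences: the AI's stated probability of being right, each in [0, 1]
    outcomes:    1 if the AI was actually right, 0 otherwise
    """
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        index = min(int(c * n_bins), n_bins - 1)  # a confidence of 1.0 goes in the last bin
        bins[index].append((c, y))

    n = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(avg_confidence - accuracy)
    return ece

# Hypothetical records: one AI says "80% sure" and is right 8 times out of 10,
# another says "95% sure" and is right only 5 times out of 10.
calibrated = expected_calibration_error([0.8] * 10, [1] * 8 + [0] * 2)
overconfident = expected_calibration_error([0.95] * 10, [1] * 5 + [0] * 5)
print(calibrated, overconfident)
```

A human who sees a low ECE has reason to take the AI's confidence reports at face value when deciding whether to intervene. A high ECE suggests the reports should be discounted.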

This added layer of trust calibration introduces a more nuanced version of the off-switch game, where the decision to press the off-switch is not a one-shot binary choice made in isolation but depends on the human’s evolving perception of the AI’s trustworthiness. This dynamic interplay between human trust and AI behavior needs to be carefully modeled to ensure safe interactions in increasingly complex AI systems.
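One simple way to model this interplay is to track the human's trust as the mean of a Beta distribution over the AI's reliability and to let the probability of pressing the off-switch fall smoothly as trust rises above what the stakes demand. Everything here (the TrustModel class, the logistic form, the sharpness parameter, and treating stakes as a required trust level in [0, 1]) is an assumption made for illustration.

```python
import math
from dataclasses import dataclass

@dataclass
class TrustModel:
    # Beta(1, 1) prior: the human starts with no strong opinion.
    reliable_count: float = 1.0
    unreliable_count: float = 1.0

    def observe(self, ai_was_reliable):
        """Update trust after seeing one AI decision turn out well or badly."""
        if ai_was_reliable:
            self.reliable_count += 1.0
        else:
            self.unreliable_count += 1.0

    @property
    def trust(self):
        """Posterior mean estimate of the AI's reliability."""
        return self.reliable_count / (self.reliable_count + self.unreliable_count)

    def press_probability(self, stakes, sharpness=10.0):
        """Probability of pressing the off-switch.

        stakes is the trust level the situation requires; the human becomes
        more likely to press as trust falls below it.
        """
        return 1.0 / (1.0 + math.exp(-sharpness * (stakes - self.trust)))

human = TrustModel()
before = human.press_probability(stakes=0.7)
for _ in range(8):
    human.observe(ai_was_reliable=True)
after = human.press_probability(stakes=0.7)
print(before, after)
```

In this toy model a run of reliable behavior lowers the chance of intervention, and a few failures raise it again. A fuller treatment would also let the AI's signaling affect how each observation is interpreted.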

Toward a Robust Framework for AI Corrigibility

By incorporating dynamic goal-modification mechanisms and trust calibration into the off-switch game, we can develop a more robust framework for AI corrigibility. Instead of focusing purely on ensuring the AI remains indifferent to shutdown, we can design systems that:

Actively update their goals based on human preferences.
Communicate transparently to foster calibrated human trust.

These extensions provide important tools for addressing the off-switch problem in more realistic AI deployment scenarios. AI systems capable of learning and revising their utility functions based on human interaction, coupled with mechanisms that foster human trust, are likely to be more corrigible and safer overall.

Conclusion

The off-switch game is a valuable starting point for understanding the challenges of AI corrigibility, but real-world AI systems will require more advanced mechanisms to ensure safe and reliable behavior. By incorporating dynamic goal-modification and trust calibration into the model, we can create AI systems that are not only indifferent to shutdown but actively align their goals with human preferences and communicate transparently with human operators. These extensions represent important steps toward a more complete theory of corrigibility, addressing both technical and human factors in AI safety.

Acknowledgments
I would like to thank the contributors at LessWrong for their insights into AI safety and alignment, which have inspired this extension of the off-switch game. Special thanks to Stuart Armstrong, whose work on utility indifference and interruptibility informed it.
