Introduction
When people discuss whether AI alignment will happen "by default," they often talk past each other because this phrase masks fundamentally different views about the difficulty of the alignment problem and the interventions required to solve it. These differences matter enormously for policy. If alignment happens automatically through market forces, aggressive regulation is unnecessary and potentially harmful. If alignment requires heroic effort, modest interventions will fail. Understanding which world we inhabit should guide where we focus resources and political capital.
This post proposes a framework for distinguishing different views about alignment difficulty and introduces the concept of "affectable worlds"—scenarios where our choices have the greatest impact on outcomes. I focus primarily on policy-relevant interpretations of alignment difficulty rather than purely technical claims about specific training methods.
A Spectrum of Views on Alignment Difficulty
Rather than discrete categories, views on alignment difficulty form a spectrum from "alignment is guaranteed" to "alignment is nearly impossible." I identify five positions along this spectrum, ordered from most to least optimistic:
1. Strong Alignment by Default
Core claim: Advanced intelligence necessarily entails or strongly correlates with aligned behavior. It may be physically or even logically impossible to create a misaligned superintelligence.
Key assumption: Intelligence and good/beneficial-for-humans goals converge in the limit. Sufficiently advanced systems will recognize and adopt human values or otherwise avoid catastrophic actions. This implies a negation of the orthogonality thesis.
Policy implication: Minimal alignment-specific intervention needed. Focus on capability advancement and addressing other AI risks like misuse or economic disruption. Alignment interventions might be net-negative because they divert resources from real problems and delay technical solutions to certain AI risks.
Plausibility assessment: This view requires strong claims about the nature of intelligence and goals that lack strong empirical support. Moreover, even some forms of convergence might be unsatisfactory by human standards. For instance, if moral realism is true and ASI converges to pursuing the objective good, that might still mean wiping out humans to tile the universe with hedonium.
Example adherents: Steven Pinker appears to hold this view, treating alignment risk as largely unfounded even conditional on AGI. Jonathan Barlow has argued for this position on the basis of virtue epistemology in work presented at a symposium organized by the Center for Law and AI Risk (CLAIR). Some AI capabilities skeptics also fall into this category, maintaining they wouldn't worry about misalignment even if AGI/ASI were somehow achieved.
2. Semi-Weak Alignment by Default
Core claim: While misalignment is technically possible, ordinary market incentives to build safe and useful products are strong enough to prevent deployment of dangerously misaligned systems.
Key assumption: Market forces generate strong enough incentives to prevent the deployment of misaligned frontier AI systems, and stronger aligned systems can contain risks from potentially misaligned weaker systems.
Policy implication: Light-touch regulation focused on transparency, and even this might be unnecessary, at least for mitigating misalignment risk. Market mechanisms and reputation concerns will drive adequate investments in safety. Tightening up corporate governance (responsiveness of corporate leaders to the interests of shareholders), so that market incentives are transmitted to relevant decision-makers, may also take on heightened importance.
Plausibility assessment: This view seems more plausible for near-term AI systems where harms are legible. It becomes less plausible as systems become more capable and risks become more catastrophic, diffuse, or delayed. I hold low, but non-negligible, credence in this view.
Example adherents: Based on his recent comments, Dean Ball appears to hold this view. Yann LeCun also probably fits in this category.
3. Weak Alignment by Default
Core claim: Market incentives alone may be insufficient, but adding normal (i.e., not extraordinary) levels of safety concern among lab employees, existing social norms, and default policy responses provides high confidence that alignment risks will be contained.
Key assumption: The combination of reputational concerns, employee pressure, existing regulatory frameworks, and social expectations generates strong enough incentives to ensure AI systems are tolerably safe; no heroic individual altruism or dramatic policy shifts are necessary.
Policy implication: Moderate regulations building on existing frameworks are all we need. Transparency requirements and whistleblower protections are justified, but we should be skeptical of costly new regulations.
Plausibility assessment: This depends heavily on whether alignment challenges become apparent gradually (allowing norms and policies to adapt) or suddenly (requiring anticipatory action). I place substantial credence on this view, particularly for scenarios with longer timelines and clearer warning signs.
Relation to p(doom): Much of the probability mass in low p(doom) estimates comes from some form of alignment by default. Some analysts like Liron Shapira explicitly condition their p(doom) estimates on the absence of any special policy response.
4. Tractable Alignment
Core claim: Alignment by default (in any of the above senses) does not hold, but alignment is tractable enough that some combination of targeted technical research, non-default but plausible policy interventions, and elevated (though not unrealistic) safety concerns and altruism among AI lab employees can avoid catastrophic misalignment.
Key assumption: The alignment problem is solvable with concerted effort. We need special interventions beyond business-as-usual, but these interventions are feasible given realistic resource allocation and political will.
Policy implication: Active intervention required. Substantial funding for technical safety research, proactive regulation of frontier AI development, liability reform to compel AI companies to internalize more risk, coordination mechanisms between labs and governments, and cultivation of safety culture in AI development.
Plausibility assessment: This is the working hypothesis of most AI safety researchers and the assumption underlying most technical and policy work on alignment. I place this view at the center of my action-relevant uncertainty, though my credence is a bit below 50%.
Example adherents: Most grantmaking by Coefficient Giving (formerly Open Philanthropy) in AI safety appears premised on [mattering most under] this view, as does most work on technical alignment and AI governance aimed at mitigating misalignment risk. This is also the view that motivates my own work.
5. Intractable Alignment
Core claim: Alignment is sufficiently difficult that catastrophic misalignment is likely unless we radically curtail AI capability development, likely through international coordination to pause or stop advancement.
Key assumption: The technical challenges of alignment combined with competitive pressures make solving alignment in time unlikely without dramatic intervention in capability development.
Policy implication: Aggressive restrictions on frontier AI development, aimed at slowing or halting progress toward AGI/ASI. International treaties limiting capability advancement, enforcement mechanisms, monitoring of compute and other inputs, and potentially global compute governance.
Plausibility assessment: This view motivates PauseAI and similar movements. I hold non-negligible credence in this view, comparable to my credence in Semi-Weak Alignment by Default. However, I worry that policy interventions optimized for this scenario may be counterproductive if we actually inhabit a Tractable Alignment world.
Example adherents: Many researchers associated with the Machine Intelligence Research Institute (MIRI) appear to hold this view, as do activists in the PauseAI and Stop AI movements.
The Concept of Affectable Worlds
Not all possible worlds are equally responsive to policy interventions, heroic acts of altruism, or technical breakthroughs. The concept of "affectable worlds" refers to scenarios where human choices—in policy, research prioritization, resource allocation, and institutional design—have the greatest marginal impact on outcomes.
In my view, the Tractable Alignment world is the most affectable, at least with regard to misalignment risk.
In Strong Alignment by Default worlds, any special effort to mitigate misalignment risk is unnecessary and wasteful. The problem solves itself. Our efforts might affect other outcomes (speed of AI development, distribution of benefits, misuse risk), but not alignment per se.
In Semi-Weak or Weak Alignment by Default worlds, special (i.e., beyond the default level) effort helps but isn't necessary. Existing mechanisms (market forces, social norms, and standard policy responses) are likely to be sufficient to reduce misalignment risk to acceptable levels. Additional interventions have diminishing marginal returns.
In Intractable Alignment worlds, the problem is too hard relative to available solutions. Interventions must instead target capability development itself—restricting, pausing, or stopping advancement. These interventions face immense coordination problems and enforcement challenges, rendering them relatively intractable (though potentially necessary if this view is correct).
In Tractable Alignment worlds, special effort is both necessary and sufficient. This is where research funding, policy design, institutional coordination, and cultural shifts have maximum leverage. The problem is hard enough that it is unlikely to be solved without special effort, but tractable enough that special effort is likely to succeed.
Action-Relevant Uncertainty Within Tractable Worlds
Even among people who broadly agree that we [should act as if we] inhabit a Tractable Alignment world, substantial uncertainty remains about which interventions are most important:
How much must we compel AI developers to internalize safety risks? In friendlier versions, transparency requirements and modest liability may suffice. In more hostile versions, comprehensive regulation with significant penalties may be needed.
What balance of technical research, governance work, and norm-shifting is optimal?
How much international coordination is required? Is uncoordinated domestic regulation, or even voluntary self-regulation on the part of AI companies, likely to offer substantial risk mitigation?
What capabilities thresholds trigger which safety requirements?
Implications for Research and Policy Prioritization
The Expected Value Calculation
Where should we focus our efforts and resources? Consider this framework:
Expected value of intervention = P(world) × Value of intervention in that world × Tractability of intervention
Even if our credence in Tractable Alignment worlds is well below 50%, it may still dominate expected value calculations because:
The best interventions in these worlds are relatively tractable compared to the best interventions in Intractable Alignment worlds
Interventions in the Tractable Alignment worlds are less likely to be wasted, compared to interventions in Alignment by Default worlds, and thus have greater expected counterfactual impact
There is high variance in outcomes within these Tractable Alignment worlds, depending on our special efforts
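The expected-value framework above can be sketched numerically. The numbers below are hypothetical placeholders chosen only to loosely track the credences stated in this post (Tractable a bit below 50%, Intractable comparable to Semi-Weak); the world names, values, and tractability scores are illustrative assumptions, not the author's actual estimates:

```python
# Illustrative sketch of: EV = P(world) x value of intervention x tractability.
# All numbers are hypothetical placeholders for illustration.

worlds = {
    # world: (P(world), value of special alignment effort, tractability of best intervention)
    "strong_default":    (0.05, 0.0, 1.0),  # effort is wasted; problem solves itself
    "semi_weak_default": (0.15, 0.1, 0.9),  # markets mostly suffice
    "weak_default":      (0.25, 0.3, 0.8),  # norms and default policy mostly suffice
    "tractable":         (0.40, 1.0, 0.7),  # effort necessary and likely to succeed
    "intractable":       (0.15, 1.0, 0.1),  # effort needed but unlikely to work
}

def expected_value(worlds):
    """Total EV of special alignment effort, summed across worlds."""
    return sum(p * v * t for p, v, t in worlds.values())

def contributions(worlds):
    """Each world's share of the total EV, to see which world dominates."""
    total = expected_value(worlds)
    return {name: p * v * t / total for name, (p, v, t) in worlds.items()}
```

Under these placeholder numbers, the Tractable Alignment world contributes roughly three-quarters of the total expected value despite holding only 40% of the probability mass, illustrating how tractability and counterfactual impact, not credence alone, drive the prioritization.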
Conclusion
The phrase "alignment by default" conflates meaningfully different views with radically different policy implications. By distinguishing Strong, Semi-Weak, and Weak Alignment by Default from Tractable and Intractable Alignment, and introducing the concept of affectable worlds, we can have more productive conversations about AI safety policy.
My own decision-relevant uncertainty is concentrated in Tractable Alignment worlds—scenarios where our choices matter most. Rather than arguing about whether alignment is guaranteed or impossible, I believe we should focus on the empirical questions within the tractable range: What interventions work and are most cost-effective? What governance proposals are politically tractable, technically feasible, and likely to buy the greatest risk mitigation?
This doesn't mean ignoring tail scenarios. But it does mean centering our discussions and resource allocation on the scenarios where we have the most agency and the greatest opportunity to affect outcomes.