Senses of Alignment By Default
Introduction
When people discuss whether AI alignment will happen "by default," they often talk past each other because this phrase masks fundamentally different views about the difficulty of the alignment problem and the interventions required to solve it. These differences matter enormously for policy. If alignment happens automatically through market forces, aggressive regulation is unnecessary and potentially harmful. If alignment requires heroic effort, modest interventions will fail. Understanding which world we inhabit should guide where we focus resources and political capital.
This post proposes a framework for distinguishing different views about alignment difficulty and introduces the concept of "affectable worlds"—scenarios where our choices have the greatest impact on outcomes. I focus primarily on policy-relevant interpretations of alignment difficulty rather than purely technical claims about specific training methods.
A Spectrum of Views on Alignment Difficulty
Rather than discrete categories, views on alignment difficulty form a spectrum from "alignment is guaranteed" to "alignment is nearly impossible." I identify five positions along this spectrum, ordered from most to least optimistic:
1. Strong Alignment by Default
Core claim: Advanced intelligence necessarily entails or strongly correlates with aligned behavior. It may be physically or even logically impossible to create a misaligned superintelligence.
Key assumption: Intelligence and good/beneficial-for-humans goals converge in the limit. Sufficiently advanced systems will recognize and adopt human values or otherwise avoid catastrophic actions. This implies a negation of the orthogonality thesis.
Policy implication: Minimal alignment-specific intervention needed. Focus on capability advancement and addressing other AI risks like misuse or economic disruption. Alignment interventions might be net-negative because they divert resources from real problems and delay technical solutions to certain AI risks.
Plausibility assessment: This view requires strong claims about the nature of intelligence and goals that currently lack empirical support. Moreover, even some forms of convergence might be unsatisfactory by human standards. For instance, if moral realism is true and ASI converges on pursuing the objective good, that might still mean wiping out humans to tile the universe with hedonium.
Example adherents: Steven Pinker appears to hold this view, treating alignment risk as largely unfounded even conditional on AGI. Jonathan Barlow has argued for this position on the basis of virtue epistemology in work presented at a symposium organized by the Center for Law and AI Risk (CLAIR). Some capabilities skeptics also fall into this category, maintaining they wouldn't worry about misalignment even if AGI/ASI became possible.
2. Semi-Weak Alignment by Default
Core claim: While misalignment is technically possible, ordinary commercial incentives are strong enough to prevent deployment of dangerously misaligned systems. Companies building useful products want to avoid harming customers and themselves, creating natural alignment pressure.
Key assumption: Market forces create sufficient alignment pressure at least until powerful aligned systems exist that can help contain risks from less aligned systems.
Policy implication: Light-touch regulation focused on transparency, and even this might be unnecessary, at least with regard to misalignment risk. Market mechanisms and reputation concerns will drive adequate safety investment. Tightening up corporate governance, so that market incentives are transmitted to relevant decision-makers, may also take on heightened importance.
Plausibility assessment: This view seems more plausible for near-term AI systems where harms are legible. It becomes less plausible as systems become more capable and risks become more catastrophic, diffuse, or delayed. I hold low but non-negligible credence in this view.
Example adherents: Based on his recent comments, Dean Ball appears to hold this view. Yann LeCun also probably fits in this category.
3. Weak Alignment by Default
Core claim: Market incentives alone may be insufficient, but adding normal (not extraordinary) levels of safety concern among lab employees, existing social norms, and default policy responses provides high confidence that alignment risks will be contained.
Key assumption: The combination of reputational concerns, employee pressure, existing regulatory frameworks, and social expectations creates sufficient alignment pressure without requiring heroic individual altruism or dramatic policy shifts.
Policy implication: Moderate regulation building on existing frameworks. Transparency requirements, safety standards, and mechanisms for employee concerns. Strengthen but don't revolutionize current governance approaches, including incremental new legislation.
Plausibility assessment: This depends heavily on whether alignment challenges become apparent gradually (allowing norms and policies to adapt) or suddenly (requiring anticipatory action). I place substantial credence on this view, particularly for scenarios with longer timelines and clearer warning signs.
Relation to p(doom): Much of the probability mass in low p(doom) estimates comes from some form of alignment by default. Some analysts like Liron Shapira explicitly condition their p(doom) estimates on no special policy response.
4. Tractable Alignment
Core claim: Alignment by default (in any of the above senses) does not hold, but alignment is tractable enough that some combination of targeted technical research, non-default but plausible policy interventions, and elevated (though not unrealistic) safety concerns and altruism among AI lab employees can avoid catastrophic misalignment.
Key assumption: The alignment problem is solvable with concerted effort. We need special interventions beyond business-as-usual, but these interventions are feasible given realistic resource allocation and political will.
Policy implication: Active intervention required. Substantial funding for technical safety research, proactive regulation of frontier AI development, liability reform to compel AI companies to internalize more risk, coordination mechanisms between labs and governments, and cultivation of safety culture in AI development.
Plausibility assessment: This is the working hypothesis of most AI safety researchers and the assumption underlying most technical and policy work on alignment. I place this view at the center of my action-relevant uncertainty, though my credence is a bit below 50%.
Example adherents: Most grantmaking by Coefficient Giving (formerly Open Philanthropy) in AI safety appears premised on this view, or at least matters most under it, as does most work on technical alignment and AI governance aimed at mitigating misalignment risk. This is also the view that motivates my own work.
5. Intractable Alignment
Core claim: Alignment is sufficiently difficult that catastrophic misalignment is likely unless we radically curtail AI capability development, likely through international coordination to pause or stop advancement.
Key assumption: The technical challenges of alignment combined with competitive pressures make solving alignment in time unlikely without dramatic intervention in capability development.
Policy implication: Aggressive restrictions on frontier AI development, aimed at slowing or halting progress toward AGI/ASI. International treaties limiting capability advancement, enforcement mechanisms, monitoring of compute and other inputs, and potentially global compute governance.
Plausibility assessment: This view motivates PauseAI and similar movements. I hold non-negligible credence in this view, comparable to my credence in Semi-Weak Alignment by Default. However, I worry that policy interventions optimized for this scenario may be counterproductive if we actually inhabit a Tractable Alignment world.
Example adherents: Many researchers associated with the Machine Intelligence Research Institute (MIRI) appear to hold this view, as do activists in the PauseAI and Stop AI movements.
The Concept of Affectable Worlds
Not all possible worlds are equally responsive to our interventions. The concept of "affectable worlds" refers to scenarios where human choices—in policy, research prioritization, resource allocation, and institutional design—have the greatest marginal impact on outcomes.
Why Tractable Alignment Worlds Are Most Affectable
The Tractable Alignment world is the most affectable, at least with regard to misalignment risk.
In Strong Alignment by Default worlds, our actions matter little for alignment outcomes. The problem solves itself. We might affect other outcomes (speed of AI development, distribution of benefits, misuse risk), but not alignment per se.
In Semi-Weak or Weak Alignment by Default worlds, special (i.e., beyond the default level) effort helps but isn't necessary. Alignment happens through existing mechanisms—market forces, social norms, and standard policy responses. Additional interventions have diminishing marginal returns.
In Intractable Alignment worlds, interventions targeting alignment difficulty face a fundamental mismatch. The problem is too hard relative to available solutions. Interventions must instead target capability development itself—restricting, pausing, or stopping advancement. These interventions face immense coordination problems and enforcement challenges, making them relatively intractable (though potentially necessary if this view is correct).
In Tractable Alignment worlds, special effort is both necessary and sufficient. This is where research funding, policy design, institutional coordination, and cultural shifts have maximum leverage. The problem is hard enough that defaults fail but tractable enough that targeted interventions succeed.
Action-Relevant Uncertainty Within Tractable Worlds
Even among people who broadly agree that we [should act as if we] inhabit a Tractable Alignment world, substantial uncertainty remains about which interventions are most important:
How much must we compel AI developers to internalize safety risks? In friendlier versions, transparency requirements and modest liability may suffice. In more hostile versions, comprehensive regulation with significant penalties may be needed.
What balance of technical research, governance work, and norm-shifting is optimal?
How much international coordination is required versus what individual nations or labs can accomplish?
What capabilities thresholds trigger which safety requirements?
Implications for Research and Policy Prioritization
The Expected Value Calculation
Where should we focus resources? Consider this framework:
Expected value of intervention = P(world) × Value of intervention in that world × Tractability of intervention
Even if one's credence in Tractable Alignment worlds is well below 50%, it may still dominate expected value calculations because:
Interventions in these worlds are relatively tractable compared to Intractable Alignment worlds
Interventions in these worlds are important compared to worlds with alignment by default
There is high variance in outcomes within these worlds depending on what we do
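To make the calculation concrete, here is a minimal sketch in Python. Every probability, value, and tractability score below is a hypothetical placeholder chosen only to illustrate the structure of the argument, not an estimate I endorse.

```python
# Minimal sketch of the expected-value framework above.
# All numbers are hypothetical placeholders, not actual estimates.
worlds = {
    "Strong Alignment by Default":    {"p": 0.05, "value": 0.05, "tractability": 1.0},
    "Semi-Weak Alignment by Default": {"p": 0.15, "value": 0.20, "tractability": 0.9},
    "Weak Alignment by Default":      {"p": 0.25, "value": 0.40, "tractability": 0.8},
    "Tractable Alignment":            {"p": 0.40, "value": 1.00, "tractability": 0.7},
    "Intractable Alignment":          {"p": 0.15, "value": 1.00, "tractability": 0.1},
}

# Contribution of each world to the expected value of a safety-focused intervention:
# P(world) x value of the intervention in that world x tractability of the intervention there.
contributions = {
    name: w["p"] * w["value"] * w["tractability"] for name, w in worlds.items()
}

for name, ev in sorted(contributions.items(), key=lambda kv: -kv[1]):
    print(f"{name:32s} {ev:.3f}")
print(f"{'Total expected value':32s} {sum(contributions.values()):.3f}")
```

In this toy calculation the Tractable Alignment term dominates the total even with a credence of only 0.40, because interventions there are both valuable and tractable, which is exactly the point the list above is making.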
Portfolio Approach
Given uncertainty, a portfolio approach makes sense:
Core investments: Focus on interventions that matter most in Tractable Alignment worlds—technical safety research, proactive but feasible governance frameworks, lab coordination mechanisms, and safety culture development. These are robust across scenarios, except under the most extreme views.
Hedge investments: Maintain some capacity for more aggressive interventions (monitoring, compute governance, international coordination infrastructure) that would be needed if we're in more hostile worlds.
Flexibility: Design interventions that degrade gracefully. Policies should help in Tractable worlds without causing harm if we're actually in more benign worlds.
Avoiding Category Errors
Policy debates often fail because participants implicitly assume different worlds:
Those assuming Semi-Weak Alignment by Default view safety regulation as premature and potentially innovation-stifling
Those assuming Intractable Alignment view incremental safety research as insufficient and dangerous delay
Those assuming Tractable Alignment view both positions as missing the realistic middle path
Making world-models explicit helps identify genuine disagreements versus talking past each other.
Conclusion
The phrase "alignment by default" conflates meaningfully different views with radically different policy implications. By distinguishing Strong, Semi-Weak, and Weak Alignment by Default from Tractable and Intractable Alignment, and introducing the concept of affectable worlds, we can have more productive conversations about AI safety policy.
My own decision-relevant uncertainty is concentrated in Tractable Alignment worlds—scenarios where our choices matter most. Rather than arguing about whether alignment is guaranteed or impossible, I believe we should focus on the empirical questions within the tractable range: What interventions work? What resources are needed? What governance structures are sufficient?
This doesn't mean ignoring tail scenarios. But it does mean centering our discussions and resource allocation on the possibilities where we have the most agency and the greatest opportunity to affect outcomes.
The question isn't just "Will alignment be hard?" It's "What can we do about it, and which world should we prepare for?" Answering these questions well requires understanding not just the spectrum of views on alignment difficulty, but where our actions have maximum leverage.