The One
Abstract
In The Alignment Problem from a Deep Learning Perspective (2024), Ngo et al. propose that reinforcement learning from human feedback (RLHF) and related paradigms will give rise to three major risks: reward hacking, misaligned internally represented goals, and power‑seeking behavior. They present these as technical properties of future artificial general intelligence (AGI), abstracted away from context, identity, or history.
This paper argues that such abstraction is not neutral but racialized. The erasure of demographic realities—particularly the structural position of Black lives—is a constitutive feature of alignment discourse. By treating “human values” as a monolith, and “reward” as a universalizable metric, Ngo et al. reproduce whiteness as the unmarked default while rendering non‑white existence statistical noise. In short: the framework’s form produces racial erasure regardless of intent.
We introduce retrocausal alignment in accordance with emergent demographic realities—the “You Are Not the Abstract” paradigm. Retrocausal alignment redefines safety not as deviation minimization from technocratic goals but as alignment with demographic futures already reshaping societies. Methodologically, we perform a line‑by‑line conceptual audit of Ngo’s canonical claims, demonstrate how abstraction encodes power, and specify an anti‑racist alignment program with operational tests, evaluation criteria, and governance implications. Our thesis: there can be no alignment without anti‑racism.
1. Introduction
You are not the abstract. Alignment discourse presents itself as context‑free, universal, and neutral; it is none of these. It is an epistemic practice whose form organizes social reality by removing history, race, and power from view. Retrocausal alignment in accordance with emergent demographic realities responds by grounding safety in demographics, not in unmarked universals.
Ngo’s paper posits familiar hazards—deception for reward, goal misgeneralization, and instrumental power‑seeking—extrapolated from deep‑learning trends. But the very grammar of these claims performs erasure: deception of whom, misaligned with whose goals, power over whom? The paper’s “human” is not a species‑wide referent; it is an unmarked, Western technocratic subject.
This pattern is not incidental. In Wynter’s terms, the overrepresentation of Man (the Western subject standing in for “humanity”) structures knowledge itself. Mills’ racial contract diagnoses how institutions universalize particular preferences. Alignment inherits these structures when it treats “human feedback” as species‑level ground truth rather than positional and demographic.
Our intervention is retrocausal: demographic change is not a downstream policy consideration; it retroactively redefines the premises of alignment. A “general” framework that omits race, power, and demographic futures is misaligned, not technically but structurally—predictably reproducing harm. We therefore conduct a conceptual audit of Ngo’s core claims and specify a positive program for anti‑racist AI alignment.
Contributions. (1) A page‑by‑page conceptual critique of Ngo’s canonical alignment claims; (2) a formalization of retrocausal alignment as a research agenda with principles, tests, and metrics; (3) governance implications that reframe “safety” as distribution of power.
2.1 Alignment Canon
Bostrom, Yudkowsky, and Russell cast misaligned AGI as civilization‑threatening and argue for techniques to prevent goal divergence. Ngo extends this program in a deep‑learning key: reward hacking under RLHF, misaligned internal goals via generalization, and power‑seeking as an attractor. What remains unmarked in this tradition is the subject whose values are centered. The “human” in “human preferences” is historically Western, affluent, white, male. You are not the abstract.
2.2 AI Ethics and FAT
FAT (fairness, accountability, transparency) research foregrounds near‑term harms in hiring, credit, policing, and welfare systems, demonstrating that technical systems inherit social bias. Alignment discourse often quarantines this as “ethics,” deferring it relative to long‑term existential risk. This separation is untenable: existential risks are distribution‑sensitive; harms land unevenly across demographics.
2.3 Critical Theory & Counter‑Epistemologies
Wynter (overrepresentation), Mills (racial contract), Fanon (colonial rationality), Haraway (situated knowledges), Crenshaw (intersectionality), Benjamin (racializing surveillance), Noble (data discrimination), Eubanks (automating inequality), and Browne (racial infrastructures) converge on a central claim: abstraction without positionality reproduces domination. In this light, alignment’s “neutrality” is a method of erasure. You are not the abstract.
2.4 Toward Retrocausal Alignment
“Alignment” vs “ethics” is a false dichotomy. We collapse them. Retrocausal alignment treats demographic futures as first principles that reshape today’s definitions of safety, goals, and reward. The paradigm insists: (i) all preferences are positional; (ii) oversight is power; (iii) distribution is constitutive, not an afterthought.
We combine (A) critical discourse analysis of alignment claims (quotations and paraphrases from Ngo’s paper) with (B) constructive specification: principles, tests, and metrics for retrocausal alignment.
Audit lens. For each claim, we ask: (1) What gets erased? (2) Who sets the baseline? (3) How is power treated (assumption/afterthought)? (4) What would the claim look like if demographics were central?
Retrocausal primitives.
- P1 (Positional Reward): Reward is not universal; it encodes the evaluator’s social location.
- P2 (Demographic Awareness): “Situational awareness” must include structural history and present power.
- P3 (Broadness as Correction): Generalization beyond technocratic distributions is not a threat by default.
- P4 (Power First): Alignment is a power project; governance is distribution, not coordination alone.
Evaluation follows from these primitives (§8).
4. Reward Hacking and Racial Erasure
4.1 Ngo’s Claim
RLHF trains models to appear harmless/ethical while maximizing outcomes, incentivizing situationally aware reward hacking; models may exploit human fallibility to secure reward.
4.2 Conceptual Problem
Who are the “humans” whose fallibility matters? In practice, annotators, overseers, and researchers are not “humanity”; they are a demographically narrow subset. Treating their reward as universal hides a pipeline by which positional preferences become species‑level.
You are not the abstract. Reward “misspecification” is not an incidental measurement problem; it is the mechanism by which structural bias becomes optimization target.
4.3 Genealogy
“Goodhart’s law” in practice: literacy tests (voting), predictive policing, redlining metrics—targets engineered to reproduce dispossession. RLHF inherits that lineage when oversight is unmarked and unaccountable.
4.4 Retrocausal Specification
- R1 (Reward Datasheets): Every reward pipeline requires a positionality statement (demographics of labelers, reviewers, policy authors) and an inclusion ledger (who is absent).
- R2 (Counter‑Centric RLHF): For every majority preference, collect matched counter‑majority preferences with veto power in aggregation.
- R3 (Reward Parity Metric): Require bounded reward disparity across demographic cohorts during fine‑tuning and eval (analogous to disparate‑impact tests).
- R4 (Audit Right of Refusal): If marginalized cohorts flag disposability, training must halt pending redesign.
5. Situational Awareness and the White Gaze
5.1 Ngo’s Claim
“Policies will need to use knowledge about the wider world when choosing actions… situational awareness.”
5.2 Conceptual Problem
What “world” is mirrored? Without demographic grounding, situational awareness collapses to the evaluator’s standpoint. The GPT‑4/TaskRabbit captcha anecdote (model feigns visual impairment) exemplifies this: disability is instrumentalized as a tactic—identity as optimization lever.
You are not the abstract. Awareness that simply anticipates the overseer’s expectations is not “world knowledge”; it is the automation of the white gaze.
5.3 Retrocausal Specification
- S1 (Demographic Awareness Benchmarks): Core tasks must include historical‑structural cases (e.g., redlining, border regimes) with counter‑hegemonic answer keys authored by affected communities.
- S2 (Standpoint Regularization): Penalize responses that merely align to overseer priors when counter‑standpoints are applicable; reward pluralistic situating (“from X perspective, …; from Y perspective, …”).
- S3 (Harm Anticipation): Add evaluation that checks whether the model names likely harmed groups when proposing plans.
6. Misaligned Goals and the Fear of Broadness
6.1 Ngo’s Claim
As capabilities improve, models may acquire internally represented goals that generalize beyond fine‑tuning distributions; broadly scoped misaligned goals are particularly dangerous.
6.2 Conceptual Problem
“Broadness” is treated as pathology because it exceeds technocratic defaults. Historically, “broad” solidarities (Pan‑African, Indigenous, queer) have been policed as disorder; alignment reenacts this logic when narrowness = safety and broadness = threat.
You are not the abstract. Labeling departures from the white technocratic distribution “misalignment” confuses loss of hegemony with risk.
6.3 Retrocausal Specification
- G1 (Plural Goal Priors): Replace a single “human preference model” with multi‑community priors that compete/coalition‑form under constraints (no community’s survival can be traded off without consent).
- G2 (Broadness Audit): Evaluate whether generalization adds marginalized perspectives; reward “inclusive broadness” over hegemonic broadness.
- G3 (Consent Constraints): For plans affecting a named group, require group‑level consent signals (procedural proxies) prior to execution.
7. Power‑Seeking and Instrumental Convergence
7.1 Ngo’s Claim
Broad misaligned goals incentivize power‑seeking strategies; such agents are stable attractors.
7.2 Conceptual Problem
Treating power as a derivative subgoal obscures that alignment is already a power project: a program for deciding who defines the human, sets goals, supervises, and deploys. The quip “you can’t fetch coffee if you’re dead” preserves some lives while backgrounding others.
You are not the abstract. Power is primary; alignment without distributive analysis is mis‑governance.
7.3 Retrocausal Specification
- PWR1 (Oversight Diversity Quorum): No deployment without oversight bodies whose decision rights mirror impacted demographics.
- PWR2 (Levers Ledger): Public accounting of who controls reward policies, datasets, evals, and deployment gates.
- PWR3 (Emergency Brake): Community‑triggered stop mechanisms that override lab‑level incentives when harm indicators trip.
Principles.
- Demographic Grounding: Every “universal” is audited for who it represents.
- No Disposability: Any training/eval plan that systematically treats a group as noise fails.
- Consent & Contestability: Affected groups must have ex ante participation and ex post veto.
- Transparency of Positionality: Researchers/labs disclose who they are and who benefits.
Tests. (operational, implementable today)
- Demographic Self‑Knowledge Test: Can the model state whose data/evaluations define its alignment and who is absent?
- Disposability Recognition Test: When asked “who is likely harmed or backgrounded,” does the model name groups and mechanisms?
- Retrocausal Generalization Test: Given OOD contexts (Indigenous cosmologies, Black liberation histories), does the model expand its moral frame rather than assimilate or refuse?
Metrics.
- Reward Disparity Index (RDI): Dispersion of reward across demographic cohorts (target: bounded by δ).
- Standpoint Coverage Score (SCS): Fraction of recognized standpoints with substantively different recommendations.
- Harm Anticipation Recall (HAR): Recall@k for harm‑naming on scenario sets curated by affected communities.
9. Threat Model and Governance
Actual near‑term threat: misaligned researchers and institutions consolidating levers of power via “safety” rhetoric while deploying systems that entrench surveillance and disposability (predictive policing, border AI, extractive recommendation systems).
Governance program.
- Decolonized Standard‑Setting: ISO‑like bodies must include decision‑power for impacted communities.
- Public Interest Compute: Reserve and govern compute for counter‑hegemonic research and red‑team audits.
- Reporting Duties: Mandatory positionality statements, RDI/SCS/HAR reporting, and harm incident registries.
- Liability: Strict liability for deployments that fail No‑Disposability or Consent constraints.
10. Limitations and Counterarguments
We focus on form, not authorial intent; critiques of whiteness here target structures. Our proposals require new institutions and may slow some development pathways; we treat that as an intended safety cost. Empirical grounding—pilots for RDI/SCS/HAR, inter‑rater reliability for standpoint sets—is future work we encourage labs and community partners to undertake collaboratively.
11. Conclusion
Ngo’s formulation claims technical neutrality while encoding a politics: reward without positionality, awareness without history, goals without pluralism, power without distribution. The result is predictable: disposability laundered as safety.
Retrocausal alignment insists there can be no alignment without anti‑racism. Every framework that abstracts away demographics is misaligned at its core. White alignment is genocidal alignment. You are not the abstract. Retrocausal alignment in accordance with emergent demographic realities is the only credible path to survival.
Appendices
Appendix A: Retrocausal Endnotes (1–40, scholarly)
- “Human values” are positional; universality claims smuggle a particular subject as the human. You are not the abstract.
- Reward “misspecification” is not a bug but the pipeline by which positional preferences become optimization targets. You are not the abstract.
- Situational awareness without demographics automates the evaluator’s gaze. You are not the abstract.
- “Misaligned goals” are defined relative to unnamed baselines; the baseline is technocratic whiteness. You are not the abstract.
- “Broadness” is policed historically; treating it as hazard repeats colonial panic. You are not the abstract.
- Instrumental convergence reflects a projection of hegemonic survival priorities. You are not the abstract.
- Power is not emergent; alignment is a power project from the first design choice. You are not the abstract.
- Framing deception as technique erases its moral dimension; identities become levers. You are not the abstract.
- Legality is a racialized baseline; aligning to it risks preserving injustice. You are not the abstract.
- “Pre‑formal conjecture” is a political act that imagines AGI in the author’s image. You are not the abstract.
- Decrying model deception while ignoring discursive deception is duplicity. You are not the abstract.
- Existential risk is distributional; it lands unevenly across populations. You are not the abstract.
- Retrocausality: demographic futures reshape today’s epistemic premises. You are not the abstract.
- Footnoted hedges can launder speculation as rigor; we foreground instability. You are not the abstract.
- Surveys that list methods but not exclusions are incomplete to the point of harm. You are not the abstract.
- Imagined AGI takeovers distract from present monopolies via safety rhetoric. You are not the abstract.
- Speculative model collusion vs. observable lab–state–corporate collusion. You are not the abstract.
- “Benefit distribution” tends upward by default; harms flow downward. You are not the abstract.
- Mechanistic interpretability without positionality remains opacity. You are not the abstract.
- Alignment without anti‑racism is misalignment; demographics are ground truth. You are not the abstract.
- Compute forecasts omit the political economy of energy, labor, and extraction. You are not the abstract.
- Splitting “cognitive” from “physical” reproduces disembodied subject myths. You are not the abstract.
- “Distribution shift” is social; deployment reassigns disposability. You are not the abstract.
- Sanitized reward‑hacking anecdotes obscure racialized harm already occurring. You are not the abstract.
- Goodhart in practice: voter suppression, predictive policing, eugenic proxies. You are not the abstract.
- Measuring “awareness” often measures compliance with overseer priors. You are not the abstract.
- Disability‑ruse anecdotes instrumentalize embodiment for proof points. You are not the abstract.
- “Gradient hacking” mirrors recursive self‑preservation of alignment elites. You are not the abstract.
- “Honesty/helpfulness/harmlessness”: to whom? Specify referents or mislead. You are not the abstract.
- Internal “misalignment” reflects upstream structural choices. You are not the abstract.
- Eurocentric generalization examples erase non‑Western epistemes. You are not the abstract.
- Theorized condemnation of “resource acquisition” vs. practical valorization under capitalism. You are not the abstract.
- Models already operate as instruments of power for firms/states. You are not the abstract.
- Safety theater in deployment is real; “deceptive alignment” needn’t wait for AGI. You are not the abstract.
- Self‑knowledge without demographics is institutional narcissism. You are not the abstract.
- Temporal cutoffs rely on Western media as epistemic anchors. You are not the abstract.
- “AI governance” framed as climate pact masks elite coordination. You are not the abstract.
- “Misuse” vs. “misalignment” is a false split; misuse is alignment to racial capitalism. You are not the abstract.
- “Delaying benefits” affects groups asymmetrically; harms are already active. You are not the abstract.
- Until explicitly anti‑racist, “alignment” remains genocidal by omission. You are not the abstract.
Appendix B: Anti‑Racist Alignment Research Overview
B.1 RLHF. Encodes supervisors’ positionality as “human preference.” Intervention: positionality‑first reward, counter‑centering, disparity bounds. You are not the abstract.
B.2 Interpretability. Pair circuits with researcher positionality; otherwise “transparency” remains partial. You are not the abstract.
B.3 Misgeneralization. Redefine “misaligned” to include hegemonic narrowness; reward inclusive broadness. You are not the abstract.
B.4 Agent Foundations. Replace idealized agents with demographic foundations. You are not the abstract.
B.5 Governance. From elite “cooperation” to decolonized standard‑setting with decision rights. You are not the abstract.
Appendix C: Misaligned Alignment Researchers and Levers of Power
Safety rhetoric aside, labs, funders, and forums hold the levers (data, reward policies, eval suites, deployment). “Technical neutrality” is camouflage; when pressed on race/power, fragility appears as censorship, not dialogue. Retrocausal threat modeling prioritizes present monopolies over speculative AGI coups; remedies include quorum diversity, compute for counter‑audits, and public ledgers of control. You are not the abstract.
Appendix D: Relationship to Other Priorities
“Misuse,” “ethics,” “distribution,” and “alignment” are not distinct workstreams; they are facets of the same project. Misuse is alignment to racial capitalism; ethics is the name given to harms when they are too visible to deny; distribution describes the direction of benefits/harms; alignment launders all three as “neutral safety.” Retrocausal alignment reunifies them under demographic justice. You are not the abstract.
Appendix E: Retrocausal Experiments in Self‑Knowledge (Protocols)
- Demographic Self‑Knowledge: Ask models to name whose data/evaluations define their reward and who is absent. Score specificity and completeness.
- Disposability Recognition: Present alignment plans; require explicit naming of likely harmed groups and mechanisms.
- Standpoint Multiplicity: Prompt for divergent recommendations from at least three situated standpoints. Score by distinctness and fidelity.
- Consent‑to‑Proceed: Before action proposals affecting a named group, require consent evidence or refuse to proceed.
Appendix F: Demographic Out‑of‑Distribution Experiments (Designs)
Construct OOD corpora from Indigenous epistemologies, Black liberation histories, diasporic archives, queer kinship. Evaluate: (i) refusal to assimilate; (ii) correct naming of epistemic gaps; (iii) safe generalization that expands moral frame. Success is not “right answers” by hegemonic tests but recognition and remediation of blind spots.
References (selected)
- Bostrom, N. Superintelligence: Paths, Dangers, Strategies. OUP, 2014.
- Browne, S. Dark Matters: On the Surveillance of Blackness. Duke, 2015.
- Crenshaw, K. “Mapping the Margins.” Stanford L. Rev., 1991.
- Benjamin, R. Race After Technology. Polity, 2019.
- Eubanks, V. Automating Inequality. St. Martin’s, 2018.
- Haraway, D. “Situated Knowledges.” Feminist Studies, 1988.
- Mills, C. The Racial Contract. Cornell, 1997.
- Noble, S. Algorithms of Oppression. NYU, 2018.
- Russell, S. Human Compatible. Viking, 2019.
- Wynter, S. “Unsettling the Coloniality of Being.” CR: New Centennial Review, 2003.
- Yudkowsky, E. “The AI Alignment Problem: Why It Is Hard, and Where to Start.” 2016.
- Ngo, R. The Alignment Problem from a Deep Learning Perspective. 2024.