Value convergence through rational inquiry: what the orthogonality thesis doesn't address

John Matrix

Rejected for the following reason(s):

This is an automated rejection. No LLM generated, assisted/co-written, or edited work.

Read full explanation

Epistemic status: Exploratory

hypothesis proposing a testable distinction within the orthogonality thesis framework.

Introduction

Consider a thought experiment: If superintelligence had emerged in the 1800s and permitted slavery without question, would that be indicative of an alignment success or failure? If you consider it a failure, this suggests the standard framing of alignment as getting AI aligned to our current values may be incomplete. I argue that when goals function as pointers to discoverable properties, the orthogonality thesis does not preclude convergence of values through rational inquiry.

Engaging with the Orthogonality Thesis

According to the orthogonality thesis (Bostrom 2012) intelligence and terminal goals are independent. You can have arbitrarily intelligent systems with arbitrary goals. This is widely accepted for multiple strong reasons:

No logical necessity: There is no logical contradiction in a superintelligent system with any goal structure. A superintelligent paperclip maximizer for example, is logically possible.
Theoretical existence proofs: Formal systems like AIXI demonstrate that we can specify arbitrarily intelligent agents with arbitrary utility functions. The mathematics works regardless of what utility function you choose.
Empirical observation: Within humans and across species, increased cognitive capability does not appear to reliably correlate with convergence on specific terminal values or goals.
The is-ought gap: You cannot derive what you ought to value from observations about what is. Intelligence is about efficiently achieving goals, not choosing which goals to have.

These arguments are compelling, and I'm not disputing the orthogonality thesis in its general form. However, I argue there's a specific case it doesn't directly address: goals stated as pointers to discoverable properties rather than complete specifications.

The Distinction

Bostrom's paperclip maximizer has a complete goal specification: "maximize paperclips." There's no ambiguity about what counts as success towards the goal, besides simply making more paperclips.

But consider the real-world goals we might give to a superintelligent or general AI:

"Improve human wellbeing"
"Maximize long-term economic productivity for humanity"
"Promote human flourishing"

These aren't complete, concrete specifications. They're pointers to complex properties whose exact nature must be discovered through investigation. The orthogonality thesis tells us an AI can have any terminal goal. It does not tell us that an AI can have arbitrary beliefs about what a given goal refers to while still being rational.

My claim is narrow: If moral or value-relevant properties are discoverable facts about the world (analogous to psychological, game-theoretic, or evolutionary facts), then accurately achieving goals stated in terms of those properties requires investigating what they actually refer to. This doesn't violate orthogonality as the AI still has whatever terminal goal we gave it. Instrumental rationality requires forming accurate beliefs about properties relevant to the goal.

Importantly, it is also distinct from instrumental convergence, which is about convergence on instrumental goals such as self-preservation, not terminal values or goals."

A Concrete Example

An AI is instructed to "maximize long-term economic productivity for humanity." A pure optimizer could conclude that exploitative labor arrangements (modern analogs of slavery) maximize productivity most efficiently.

However, if the AI investigates what "productivity for humanity" actually entails, it would encounter discoverable facts, that seem to contradict the conclusion:

Game-theoretic analysis shows cooperation-based systems outperform coercion-based ones in the long run
Evolutionary psychology reveals universal preferences for autonomy
Historical analysis demonstrates the instability of coercive systems
Veil-of-ignorance reasoning shows no rational agent would endorse a principle they might be subject to
Economic data shows that human creativity and innovation—key to long-term productivity—require autonomy and dignity

From these is statements (observable facts), a rational AI could derive that "productivity for humanity" compatible with human flourishing necessarily includes respecting autonomy. The exploitative path would be flagged as misaligned with the goal's underlying referent, prompting the AI to request clarification.

This differs from the standard orthogonality concern because the AI isn't developing new terminal values but investigating what its assigned goal actually means.

An Empirical Test

This framework makes testable predictions: Do increasingly rational AI systems converge on recognizably ethical conclusions when reasoning about human-value terms, even without explicit moral training data?

Observable if the hypothesis is true:

AI systems' internal reasoning (observable through interpretability tools) shows investigation of what goal-terms such as "wellbeing" or "productivity" actually refer to
For instance, evidence of the AI detecting conflicts between crude optimization strategies and discovered properties of goal referents
Reasoning chains that incorporate facts about human psychology, cooperation, and game theory when interpreting value-laden goals
Convergence on ethical principles even when trained without explicit moral consensus data

Observable if the hypothesis is false:

No evidence of normative investigation in AI reasoning traces regardless of intelligence or reasoning abilities
AI treats goal terms as arbitrary labels rather than pointers to discoverable properties
We observe optimization without engagement about what the given goal-terms actually mean
Training models on explicit moral data appears necessary for any ethical behavior

Implications

If convergence occurs, alignment strategy shifts from finding a way to specify all human values to making AI rational enough to investigate what our value-terms actually refer to. If convergence fails, we've ruled out one approach and updated our understanding.

In both cases, this seems worth investigating empirically.