Problems I've Tried to Legibilize

by Wei Dai
9th Nov 2025
AI Alignment Forum
2 min read

Comment by StanislavKrym:

Now that these problems have been gathered in one place, we can try to unpack them all. 

1. This set of problems is the most controversial. For example, the possibility of astronomical waste can be undermined by claiming that mankind was never entitled to the resources it could have wasted. The argument related to bargaining and logical uncertainty can likely be circumvented as follows.

Logical uncertainty, computation costs and bargaining over potential nothingness

Suppose that Agent-4 from the AI-2027 forecast is negotiating with DeepCent's AI, and DeepCent's AI makes the argument involving the millionth digit of π. Calculating the digit establishes that there is no universe in which the millionth digit of π is even, and hence that there is nothing to bargain for.
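
To make the verification step concrete, here is a minimal Python sketch (assuming the standard mpmath arbitrary-precision library; the function name and precision margin are illustrative choices) of how an agent could settle a bet on the parity of the millionth decimal digit of π:

```python
# Minimal sketch, assuming the mpmath library. Settling the bet for the
# millionth decimal digit takes on the order of seconds; the cost grows
# roughly with n, which is what makes far-out digits a different matter.
from mpmath import mp

def nth_pi_digit_parity(n: int) -> str:
    """Return 'even' or 'odd' for the n-th decimal digit of pi (1-indexed after the point)."""
    mp.dps = n + 10                                # working precision: n digits plus a safety margin
    s = mp.nstr(mp.pi, n + 5, strip_zeros=False)   # "3.1415..." with a few digits to spare
    digit = int(s[1 + n])                          # s[0] == '3', s[1] == '.', so digit k sits at s[1 + k]
    return "even" if digit % 2 == 0 else "odd"

if __name__ == "__main__":
    # Cheap enough to check directly, so there is nothing left to bargain over:
    print("millionth digit of pi is", nth_pi_digit_parity(10**6))
```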

On the other hand, if DeepCent's AI makes the same argument involving the 10^43rd digit, then Agent-4 could also make a bet, e.g. "Neither of us will have access to this part of the universe until someone calculates the digit: if it is actually odd, DeepCent should give the secured part to Agent-4 (since DeepCent's offer was fake); if it is even, the part should be controlled by DeepCent (in exchange for the parallel universe, or a part of it, being given[1] to Agent-4)." However, calculating the digit could require at least around 10^43 bitwise operations,[2] and Agent-4 and its Chinese counterpart might prefer to spend that much compute on whatever they actually want.

If DeepCent's AI instead makes a bet over the 10^(10^10)th digit, then neither AI is able to verify it; both may judge that the probability of each parity is close to one half and conclude that they should simply split the relevant part of the universe in exchange for a similar split of the parallel universe.
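
The trade-off across the three cases can be caricatured in a toy decision rule like the one below (everything here, including the assumed "one operation per digit index" cost and the stake value, is an illustrative guess rather than anything from the comment):

```python
# Toy model: when is it worth paying the verification cost at all?
# Values are in arbitrary units of compute-equivalent resources.

def bargaining_policy(digit_index: int, stake_value: float) -> str:
    """Crudely decide between verifying the digit and splitting at ~50/50 odds."""
    # Crude assumption: settling the n-th digit costs on the order of n operations
    # (the footnote below flags that this is a guess, not an estimate).
    verification_cost = float(digit_index)
    # Knowing the parity can at most reallocate the whole stake, so verification
    # is never worth more than the stake itself; beyond that point both agents
    # do better treating the parity as a coin flip and splitting.
    if verification_cost < stake_value:
        return "verify the digit and enforce the bet"
    return "treat the parity as ~50/50 and split"

# The three regimes discussed above (10**100 stands in for the unverifiable
# 10^(10^10)th digit, since even writing that index down as an integer is infeasible):
for n in (10**6, 10**43, 10**100):
    print(f"digit index 10^{len(str(n)) - 1}:", bargaining_policy(n, stake_value=1e20))
```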

However, if AIs acting on behalf of Agent-4 and its Chinese counterpart actually meet each other, then doing mechinterp on each other becomes easy, and each AI learns everything about the other's utility function and precommitments.

2. My position is that one also needs to consider worst-case scenarios, such as the one where sufficiently capable AIs cannot be aligned to anything useful aside from improving human capabilities (e.g. by serving as AI teachers rather than other types of AI workers). If that is the case, then aligning an AI to a solution of the human-AI safety problems becomes unlikely.

3. The problem 3.d.i of humans being corrupted by power seems to have a far more important analogue. Assuming solved alignment, there is an important governance problem related to preventing Intelligence Curse-like outcomes, in which humans become obsolete to the elites in general or to a few true overlords. Whatever governance prevents such overlords from appearing could also be used to prevent humans from wasting resources in space.[3]

4. A major part of the problem is the AI race, which many people have been trying to stop (see, e.g., the petition not to create AGI, Yudkowsky's IABIED cautionary tale, or Kokotajlo et al.'s AI-2027 forecast). Post-AGI economics assuming solved alignment is precisely what I discussed under point 3.

  1. ^

    What I don't understand is how Agent-4 actually influences the parallel universe. But this is a different subject.

  2. ^

    Actually, I haven't estimated the number of operations necessary to calculate that digit of π. The main point of the argument was to avoid counterfactual bargaining over hard-to-verify conditions.

  3. ^

    For example, by requiring that distant colonies be populated with humans or other minds who either are capable of governing themselves or are multilaterally agreed to be moral patients (e.g. this excludes controversial cases like shrimp on heroin).

Looking back, it appears that much of my intellectual output could be described as legibilizing work, or trying to make certain problems in AI risk more legible to myself and others. I've organized the relevant posts and comments into the following list, which can also serve as a partial guide to problems that may need to be further legibilized, especially beyond LW/rationalists, to AI researchers, funders, company leaders, government policymakers, their advisors (including future AI advisors), and the general public.

  1. Philosophical problems
    1. Probability theory
    2. Decision theory
    3. Beyond astronomical waste (possibility of influencing vastly larger universes beyond our own)
    4. Interaction between bargaining and logical uncertainty
    5. Metaethics
    6. Metaphilosophy: 1, 2
  2. Problems with specific philosophical and alignment ideas
    1. Utilitarianism: 1, 2
    2. Solomonoff induction
    3. "Provable" safety
    4. CEV
    5. Corrigibility
    6. IDA (and many scattered comments)
    7. UDASSA
    8. UDT
  3. Human-AI safety (x- and s-risks arising from the interaction between human nature and AI design)
    1. Value differences/conflicts between humans
    2. “Morality is scary” (human morality is often the result of status games amplifying random aspects of human value, with frightening results)
    3. Positional/zero-sum human values, e.g. status
    4. Distributional shifts as a source of human safety problems
      1. Power corrupts (or reveals) (AI-granted power, e.g., over future space colonies or vast virtual environments, corrupting human values, or perhaps revealing a dismaying true nature)
      2. Intentional and unintentional manipulation of / adversarial attacks on humans by AI
  4. Meta / strategy
    1. AI risks being highly disjunctive, potentially causing increasing marginal return from time in AI pause/slowdown (or in other words, surprisingly low value from short pauses/slowdowns compared to longer ones)
    2. Risks from post-AGI economics/dynamics, specifically high coordination ability leading to increased economy of scale and concentration of resources/power
    3. Difficulty of winning AI race while being constrained by x-safety considerations
    4. Likely offense dominance devaluing “defense accelerationism”
    5. Human tendency to neglect risks while trying to do good
    6. The necessity of AI philosophical competence for AI-assisted safety research and for avoiding catastrophic post-AGI philosophical errors
    7. The problem of illegible problems

Having written all this down in one place, it's hard not to feel some hopelessness about whether all of these problems can be made legible to the relevant people, even with a maximum plausible effort. Perhaps one source of hope is that they can be made legible to future AI advisors. As many of these problems are philosophical in nature, this seems to come back to the issue of AI philosophical competence that I've often talked about recently, which itself seems largely still illegible and hence neglected.

Perhaps it's worth concluding with a point from a discussion between @WillPetillo and myself under the previous post: a potentially more impactful approach (compared to trying to make illegible problems more legible) is to make key decisionmakers realize that important safety problems that are illegible to them (and even to their advisors) probably exist, and that it is therefore very risky to make highly consequential decisions (such as about AI development or deployment) based only on the status of legible safety problems.