wingspan's Shortform

8th Oct 2025

1 min read

2

This is a special post for quick takes by wingspan. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.

2 comments, sorted by

top scoring

Click to highlight new comments since: Today at 4:01 AM

[-]wingspan4d41

In the near future, AI models might become extremely capable in math, programming and other formal fields, but not as capable in messy real world tasks.

Believing this affects the priorities of today's alignment research: our efforts shouldn't be in proving hard mathematical results, but in formalizing the general / philosophical ideas into mathematical language.

The first step is to translate everything into precise language, while the second step is to put all results in a completely formal language, like Lean's mathlib. For example, we might come up with a formal definition to what an "aligned agent" means, and then give it to an intelligent system that outputs a 50-page formal Lean construction for such an object. If we believe our math axioms are correct (and that the library is well-implemented), then we should be able to trust the result.

What we shouldn't trust in handing off to the system is the task of formalizing the problem itself. If we only have a rough idea of what we're looking for, an adversarial system would come up with its own idea, that sounds good to us but has hidden subtleties that ruin the theory.

(Also, this means that we should be very capable in eliminating inconsistencies in Lean's proof system and in our own axioms, otherwise the system could prove anything it wants by first reaching a contradiction!)

Reply

[-]Viliam4d22

we might come up with a formal definition to what an "aligned agent" means

I believe this is the difficult part. More precisely, to describe what is the agent aligned with. We can start with treating human CEV as a black box, and specify mathematically what does it mean to be aligned with some abstract f(x). But then the problem will be to specify f(x).

Reply

Moderation Log

LESSWRONG
LW

LESSWRONG
LW

wingspan's Shortform

2