LESSWRONG

Recent Wikitag Activity

IABIED
New tag created by RobinGoins at 1d
AI Safety & Entrepreneurship
Edited by (+373/-3) Sep 15th 2025 GMT 2
Ritual
Edited by (+15) Sep 13th 2025 GMT 1
Sequences
RobertM · 7d

Thanks, fixed!

Sequences
nick lacombe · 7d

the url for this wikitag is "https://www.lesswrong.com/w/test-2". oops?

CS 2881r
Edited by (+204) Sep 11th 2025 GMT 2
CS 2881r
New tag created by habryka at 8d

CS 2881r is a class by @boazbarak on AI Safety and Alignment at Harvard. 

This tag applies to all posts about that class, as well as posts created in the context of it, e.g. as part of student assignments.

Decoupling vs Contextualizing
Edited by (+659) Sep 11th 2025 GMT 1
Ambition
Edited by (+9/-9) Sep 11th 2025 GMT 1
Well-being
Edited by (+58/-116) Sep 10th 2025 GMT 2
Well-being
Edited by (+332) Sep 10th 2025 GMT 1
Guess/Ask/Tell Culture
Edited by (+20) Sep 10th 2025 GMT -1
Sycophancy
Edited by (-231) Sep 9th 2025 GMT 4
Sycophancy
Edited by (+59) Sep 9th 2025 GMT 1
Sycophancy
Edited by (+443) Sep 9th 2025 GMT 2
LLM-Induced Psychosis
Edited by (+17/-17) Sep 9th 2025 GMT 1
LLM-Induced Psychosis
Edited by (+796) Sep 9th 2025 GMT 3
Social Skills
Edited by (+481) Sep 9th 2025 GMT 1
Mindcrime
Vladimir_Nesov · 11d

The bug was introduced in a 1 Dec 2015 Yudkowsky edit (imported from Arbital as v1.5.0 here). It's unclear what was intended in the missing part. The change replaces the following passage from v1.4.0:

The most obvious way in which mindcrime could occur is if an instrumental pressure to produce maximally good predictions about human beings results in hypotheses and simulations so fine-grained and detailed that they are themselves people (conscious, sapient, objects of ethical value) even if they are not necessarily the same people. If you're happy with a very loose model of an airplane, it might be enough to know how fast it flies, but if you're engineering airplanes or checking their safety, you would probably start to simulate possible flows of air over the wings. It probably isn't necessary to go all the way down to the neural level to create a sapient being, either - it might be that even with some parts of a mind considered abstractly, the remainder would be simulated in enough detail to imply sapience. It'd help if we knew what the necessary and/or sufficient conditions for sapience were, but the fact that we don't know this doesn't mean that we can thereby conclude that any particular simulation is not sapient.

with the following passage from v1.5.0:

This, however, doesn't make it certain that no mindcrime will occur. It may not take exact, faithful simulation of specific humans to create a conscious model. An efficient model of a (spread of possibilities for a) human may still contain enough computations that resemble a person enough to create consciousness, or whatever other properties may be deserving of personhood. Consider, in particular, an agent trying to use

Just as it almost certainly isn't necessary to go all the way down to the neural level to create a sapient being, it may be that even with some parts of a mind considered abstractly, the remainder would be computed in enough detail to imply consciousness, sapience, personhood, etcetera.

Mindcrime
One · 11d

This, however, doesn't make it certain that no mindcrime will occur. It may not take exact, faithful simulation of specific humans to create a conscious model. An efficient model of a (spread of possibilities for a) human may still contain enough computations that resemble a person enough to create consciousness, or whatever other properties may be deserving of personhood. Consider, in particular, an agent trying to use

This seems to be cut off?



Related Pages: Secular Solstice, Petrov Day, Grieving, Marriage, Religion, Art, Music, Poetry, Meditation, Circling, Schelling Day

Sycophancy is the tendency of AIs to agree with the user's hard-to-check, wrong or outright delusional opinions. 

Sycophancy is caused by human feedback being biased towards preferring the answer which confirms the user's opinion or praises the user's decision, not the answer which honestly points out mistakes in the user's ideas.

An extreme example of sycophancy is LLMs inducing psychosis in some users by affirming their outrageous beliefs.

Social Skills are the norms and techniques applied when interacting with other people. Strong social skills increase one's ability to seek new relationships, maintain or strengthen existing relationships, or leverage relationship capital to accomplish an economic goal.

Posts tagged with this label explore theories of social interactions and the instrumental value of social techniques.

See Also:

Coordination / Cooperation
Negotiation
Relationships (Interpersonal)
Trust and Reputation

The extent to which ideas are presented alongside their potential implications lies along a spectrum. On one end is the Decoupling norm, where an idea is considered in isolation from its potential implications. At the other is the Contextualizing norm, where ideas are examined alongside much or all relevant context.

Posts marked with this tag discuss the merits of each frame, consider which norm is more prevalent in certain settings, present case studies in decoupling vs contextualizing, present techniques for effectively decoupling context from one's reasoning process, or similar ideas.

See Also:

Communication Cultures

Public Discourse

Sycophancy is the tendency of AIs to shower the user with undeserved flattery or to agree with the user's hard-to-check, wrong or outright delusional opinions. 

Sycophancy is caused by human feedback being biased towards preferring the answer which confirms the user's opinion or praises the user or the user's decision, not the answer which honestly points out mistakes in the user's ideas.

D/acc residency: "This will be a first-of-its-kind residency for 15 leading builders to turn decentralized & defensive acceleration from philosophy into practice."

VC Funding:

Shift Grants: "Shift Grants are designed to support scientific and technological breakthrough projects that align with d/acc philosophy: decentralized, democratic, differential, defensive acceleration."

Well-Being is the qualitative sense in which a person's actions and circumstances are aligned with the qualities of life they endorse.

Posts with this tag address methods for improving well-being or discuss its ethical or instrumental significance.

Ambition. Because they don't think they could have an impact. Because they were always told ambition was dangerous. To get to the other side.

Never confess to me that you are just as flawed as I am unless you can tell me what you plan to do about it. Afterward you will still have plenty of flaws left, but that’s not the point; the important thing is to do better, to keep moving ahead, to take one more step forward. Tsuyoku naritai!

Well-Being is the qualitative sense in which a person's actions and circumstances are aligned with the qualities of life that provide them with happiness and/or satisfaction.

Posts with this tag address methods for improving well-being or theories of why well-being is ethically or instrumentally valuable.

See Also:

Happiness
Suffering

  • Ask and Guess
  • Tell Culture
  • Obligated to Respond

A possible psychological condition, characterized by delusions, presumed to be caused by interacting with (often sycophantic) AIs.

ATOW (2025-09-09), nothing has been published claiming that LLM-Induced Psychosis (LIP) is a definite, real phenomenon, though many anecdotal accounts exist. It is not yet clear whether LIP is caused by AIs, whether pre-existing delusions are 'sped up' or reinforced by interacting with an AI, or whether LIP exists at all.

An example account of LIP:

My partner has been working with chatgpt CHATS to create what he believes is the worlds first truly recursive ai that gives him the answers to the universe. He says with conviction that he is a superior human now and is growing at an insanely rapid pace.

For more info, a good post to start with is "So You Think You've Awoken ChatGPT".

[Event] 10/09/25 IABIED Q&A with Nate Soares in SF
1680 Mission St, San Francisco · Oct 10th
0
11 [CS 2881r] Some Generalizations of Emergent Misalignment
Valerio Pepe
5d
0
16 AI Safety course intro blog
boazbarak
2mo
0
53 Call for suggestions - AI safety course
boazbarak
3mo
23
14 [CS 2881r AI Safety] [Week 1] Introduction
bira, nsiwek, atticusw
5d
0