From Barriers to Alignment to the First Formal Corrigibility Guarantees
This post summarizes my two related papers that will appear at AAAI 2026 in January:

* Part I: Intrinsic Barriers and Practical Pathways for Human-AI Alignment: An Agreement-Based Complexity Analysis (selected for oral presentation)
* Part II: Core Safety Values for Provably Corrigible Agents

What these papers try to quantify...
Great post! Very much agree about the conservatism.
This is why I find it useful to do economic analyses where the variables and factors are exposed, as in my recent AI UBI analysis. Rather than assuming fixed values for those variables, one can try out a multitude of scenarios and see how the predictions change, and the derived analytic form makes clear which factors matter more than others.
For example, one thing I found was that a Scandinavian-style public ownership share of AI profits (~33%) drastically reduces the level of AI productivity required to fund a UBI. As a policy, this then seems very reasonable and attainable.
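To make the style of analysis concrete, here is a minimal toy sketch (not the model from the paper) of how one might expose the ownership share as a parameter and sweep over scenarios. The population, per-person UBI level, and the simple funding condition below are all illustrative assumptions, chosen only to show how the required AI profit falls as the public share rises.

```python
# Illustrative toy sketch only -- NOT the model from the paper.
# Assumption: a UBI is "funded" when the publicly captured share of
# AI profits covers the total annual UBI bill.

def required_ai_profit(ownership_share: float,
                       population: int = 330_000_000,  # assumed population
                       annual_ubi: float = 12_000.0    # assumed UBI per person per year, USD
                       ) -> float:
    """Total annual AI profit needed so that
    ownership_share * profit >= population * annual_ubi."""
    total_ubi_bill = population * annual_ubi
    return total_ubi_bill / ownership_share

# Scenario sweep: how the required AI profit changes with the public share.
for share in (0.05, 0.15, 0.33, 0.50):
    profit = required_ai_profit(share)
    print(f"ownership share {share:>4.0%}: "
          f"required AI profit ~${profit / 1e12:.1f}T/yr")
```

The actual analysis in the linked paper has more factors and a derived analytic form; the point of the sketch is just that with the parameters exposed, one can see directly how sensitive the conclusion is to each of them.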
The full paper with all the cited sources can be found here: https://arxiv.org/abs/2505.18687