At the start of this year I took a stab at resolving the core alignment problem and attempted to develop an original approach to foundations, the Atelier Framework. In this post I will outline the high-level motivation and goals.
There has been a recent push to bridge the gap between high-level philosophical AI Safety literature and contemporary machine learning. My goal is to move in a completely orthogonal direction: to construct a model of the alignment problem that is highly general and completely removed from modern paradigms.
Why on earth would you want that? If we want a theory of alignment that is far-reaching and robust to major technological shifts, then we need a way to reason about these concepts that is formal yet not tied to early-21st-century AI. Major paradigms in AI are young, and the field is changing rapidly. The paper that pioneered deep reinforcement learning is only a decade old, GANs are only 9 years old, and the transformer was introduced only 6 years ago. There is no reason to think this pace will slow down, and plenty of reasons to expect it to keep accelerating.
All this change has happened while progress is still being driven by human researchers. Imagine the incredible advances and breakthroughs that will occur in the period immediately before and after the deployment of AGI, when a small army of AI research assistants works in tandem with human computer scientists. Further, it is plausible that one of the first tasks such systems will be put to is researching better ways to construct AGI. A rapid cycle of self-improvement will ensue.
What paradigms can you be confident will still be around when we need to align the systems of tomorrow? And what about the systems that come after that?
For these reasons, I am not satisfied with the current machine-learning-centered alignment frameworks. We desperately need concrete yet general ways to reason about the alignment problem.
With that in mind, here are the desiderata for our formalism:
Based on work done while on a contract for the LTFF.