At the start of this year I took a stab at resolving the core alignment problem and attempted to develop an original approach to foundations, the Atelier Framework. In this post, I will outline the high-level motivation and goals.

There has been a recent push to bridge the gap between the high-level philosophical AI safety literature and contemporary machine learning. My goal is to move in an orthogonal direction and construct a model of the alignment problem that is highly general and completely removed from modern paradigms.

Why on earth would you want that? If we want a theory of alignment that is far-reaching and robust to major technological shifts, then we will need a way to reason about these concepts that is formal yet not tied to early 21st century AI. Major paradigms in AI are young and the field is changing rapidly. The paper that kicked off deep model-free RL is only a decade old, GANs are only 9 years old, and the transformer was introduced only 6 years ago. There's no reason to think this pace will slow down, and plenty of reasons to expect it to continue accelerating.

All this change has happened while progress is still being driven by human researchers. Imagine the advances and breakthroughs that will occur in the period immediately before and after the deployment of AGI, when a small army of AI research assistants works in tandem with human computer scientists. Further, it is plausible that one of the first things an AI system will be used for is researching better ways to construct AGI. A rapid cycle of self-improvement would ensue.

What paradigms can you be confident will still be around when we need to align the systems of tomorrow? And what about the systems that come after that?

For that reason, I am not satisfied with the current machine-learning-centered alignment frameworks. We desperately need concrete yet general ways to reason about the alignment problem.


We have the following desiderata for our formalism:

  1. Highly General
    As stated in the introduction, we would like our formalism to be as free as possible from contemporary machine learning paradigms, in the hope that it will generalize despite large shifts in technology and practices.
    Further, we would like to avoid making excessive assumptions about the nature of AGI. We do not want to assume the system is necessarily agentic or perfectly rational. This is also intended to allow us to consider both unipolar and multipolar scenarios.
  2. Applicable
    In tension with the above requirement, the framework needs to be actually useful for helping humanity avoid existential catastrophe. For example, we should be able to relate our framework to an AGI that emerges from contemporary LLMs.
  3. Physical
    Our framework should describe a physical system, and be bound by the laws of physics. This is vital if we wish to be able to use it to prove things about the alignment problem that we face.
  4. Probing Alignment Bounds
    On a more somber note, we believe that a good framework for exploring properties of alignment should also allow us to place hard bounds on alignment, if such bounds exist. If aligning a system of superior intelligence is actually an unobtainable goal, we would like our framework to give us the tools to prove that this is the case.
  5. An Accessible Formal Framing Of The Alignment Problem
    Finally, we aim to redefine the alignment problem in a way that succinctly captures its essence for mathematicians and physicists, focusing on concrete, calculable challenges. This paves the way for future work and makes it easier to bring multidisciplinary insights into the field. 

    (Informally, I would like a succinct framing of the core alignment problem that could be used to nerd snipe physicists and mathematicians who do not have a background in machine learning.)

Based on work done while on a contract for the LTFF.
