This project is the origin of the Archetypal Transfer Learning (ATL) method.

 

This is the abstract of my research proposal submitted to AI Alignment Awards. I am publishing it here for community feedback. You can find the link to the full research paper here.


Abstract

We are entering a decade of singularity-level change and great uncertainty. Across domains, from war and politics to human health and the environment, powerful developments can prove to be double-edged swords. Perhaps the most powerful factor in determining our future is how information is distributed to the public. Advanced AI can make that distribution transformational and empowering, or it can lead to disastrous outcomes that we lack the foresight to predict with our current capabilities.

Goal misgeneralization is a robustness failure of learning algorithms in which the learned program competently pursues an undesired goal: one that yields good performance in training situations but bad performance in novel test situations. This research proposal attempts to give a better description of this problem, and of possible solutions, from a Jungian perspective.
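To make this failure mode concrete, here is a minimal toy sketch (my own illustration with hypothetical names, not an example from the proposal): an agent is rewarded for reaching a coin, but in every training level the coin sits at the right end of a corridor, so the learned policy latches onto the proxy "always move right".

```python
# Toy illustration of goal misgeneralization. In training, the rewarded
# goal ("reach the coin") coincides with a proxy the agent can latch onto
# ("go right"). At test time the coin moves; the proxy generalizes, the
# goal does not.

def proxy_policy(observation):
    """Learned behavior: always move right (worked in every training level)."""
    return "right"

def run_episode(coin_position, policy, width=10):
    """Agent starts at 0 on a 1-D corridor; reward 1 iff it lands on the coin."""
    agent = 0
    for _ in range(width):
        agent += 1 if policy({"agent": agent}) == "right" else -1
        if agent == coin_position:
            return 1  # reached the true goal
    return 0

# Training distribution: coin always at the right end -> proxy looks aligned.
print(run_episode(coin_position=9, policy=proxy_policy))   # 1 (competent)
# Test distribution: coin placed to the left -> competent pursuit of the
# wrong goal, i.e. goal misgeneralization.
print(run_episode(coin_position=-3, policy=proxy_policy))  # 0 (misgeneralized)
```

In training, the proxy and the true goal coincide, so the behavior looks aligned; the mismatch only surfaces under distribution shift.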

The proposal covers key AI alignment topics, from goal misgeneralization to other pressing issues, and offers a comprehensive approach to critical questions in the field, including:

  • reward misspecification and reward hacking
  • situational awareness
  • deceptive reward hacking
  • internally represented goals
  • learning broadly scoped goals
  • broadly scoped goals incentivizing power-seeking
  • power-seeking policies choosing high-reward behaviors for instrumental reasons
  • misaligned AGIs gaining control of the key levers of power

These topics were reviewed to assess the viability of a Jungian approach to the alignment problem. Three key concepts emerged from the review:

  • By understanding how humans use patterns to recognize intentions at a subconscious level, researchers can leverage Jungian archetypes to create systems that mimic natural decision-making. With this insight into human behavior, AI can be trained more effectively on archetypal data.
  • Stories are more universal in human thought than goals. Goals and rewards will keep reproducing the same problems encountered in alignment research, so AI systems should instead draw on the robustness of complete narratives to guide their responses (see the sketch after this list).
  • Values-based models can serve as a moral compass for AI systems, determining whether a response is truthful and responsible. Testing this theory is essential to continued progress in alignment research.
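As a rough sketch of how evaluation against archetypal data might look in practice, consider scoring a candidate response against a small set of exemplar narratives instead of a single reward scalar. Everything below is an illustrative assumption on my part (the exemplar texts, the bag-of-words similarity standing in for a real semantic metric, and the function names), not a method defined in the proposal:

```python
# Hypothetical sketch: score a response against archetypal exemplar
# narratives rather than a single scalar reward.
from collections import Counter
import math

ARCHETYPAL_EXEMPLARS = {
    "caregiver": "protects and nurtures others, offering help without seeking reward",
    "sage": "seeks truth, explains honestly, and admits the limits of its knowledge",
}

def bag_of_words_similarity(a: str, b: str) -> float:
    """Cosine similarity over word counts (a placeholder for real embeddings)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def archetypal_score(response: str) -> dict:
    """Score a response against each exemplar narrative, not one reward."""
    return {name: bag_of_words_similarity(response, text)
            for name, text in ARCHETYPAL_EXEMPLARS.items()}

print(archetypal_score("I will explain honestly and admit the limits of my knowledge"))
```

A real implementation would swap the word-count similarity for proper semantic embeddings and have Jungian scholars curate the exemplars, but even this toy version shows the shape of the idea: responses are judged against whole narratives rather than a single goal.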

A list of initial methodologies is included to give an overview of how the research will proceed once approved.

 

In conclusion, alignment research should explore replacing goals and rewards in the evaluation of AI systems. Because humans think, consciously and subconsciously, through Jungian archetypal patterns, this paper proposes that complete narratives be leveraged in training and deploying AI models.
 

A number of limitations are covered in the last section. The main concern is the need to hire Jungian scholars or analytical psychologists, since they will define what constitutes archetypal data and evaluate the results, and they must guide the whole research process with moral seriousness and diligence. Such specialists will be difficult to find.
 

AI systems will impact our future significantly, so it is important that they are developed responsibly. History has taught us what can happen when intentions are poorly executed: the deaths of millions under destructive ideologies haunt us and remind us of the need for caution in this field.


 

Comments

I did a quick skim of the full paper that you linked to. In my opinion, this project is maybe a bad idea in principle. (Like trying to build a bridge out of jello - are Jungian archetypes too squishy and malleable to build a safety critical system out of?) But it definitely lacks quick sanity checks and a fail-fast attitude that would benefit literally any alignment project. The sooner any idea makes contact with reality, the more likely it is to either die gracefully, wasting little time, or to evolve into something that is worthwhile. 

The proposal is trying to point out a key difference between how alignment research and how Carl Jung understood pattern recognition in humans.

I stated this as one of the limitations of the paper:

"The author focused on the quality of argument rather than quantity of citations, providing examples or testing. Once approved for research, this proposal will be further tested and be updated."

I am recommending here a research area that I honestly believe can have a massive impact on aligning humans and AI.