I've been pretty heavily focused on AI safety and alignment issues for the past month or so (and did some research and writing on it at few years ago). The main spur for my renewed interest in these topics was the one two punch of DeepMind's Alpha Evolve followed by the Darwin Godel Machine paper. Particularly concerning for me was that while AI is self improving with the DGM model, alignment can deteriorate.
About a month ago, I started a github project to collect different papers and ideas which seemed to me to be AI safety and alignment related or relevant. Not having a particularly technical background myself, I often rely on AI analysis to determine how relevant a given paper is to AI safety and alignment.
A few days ago, after seeing someone posting about the ASI-Arch paper (yet another self improving AI), I bemoaned the apparent lack of public self improving AI safety and alignment work. Then, I got to thinking... maybe some of the ideas I've collected on github could be brought together for at least a demo project of what self improving AI safety and alignment might look like.
So, in my little demo, we have chain of thought (CoT) monitoring and some sample Aymara prompts and responses brought together with the ability to choose different models for testing, red teaming and blue teaming. Ideally, such a project would track the safety and alignment of multiple models over time and tested in different ways by different users, with different red and blue team models. I attempted to add a knowledge graph tracking system to the demo, so that over time and use a safety and alignment profile would emerge for different models, and so that red teaming would get more sophisticated and blue teaming would improve safety and alignment over time.
The demo needs a lot of work and tweaking before it can actually help improve safety and alignment, I believe. But, I also believe AI safety and alignment are too important right now to trust that behind the scenes people with expertise are doing everything necessary to ensure well aligned and safe AGI/ASI. So, if you agree and have more technical expertise than I do, feel free to take my demo idea and remix it to enhance it.