I've been pretty heavily focused on AI safety and alignment issues for the past month or so (and did some research and writing on the topic a few years ago). The main spur for my renewed interest was the one-two punch of DeepMind's AlphaEvolve paper followed by the Darwin Gödel Machine paper. Particularly concerning to me was the finding that while the DGM self-improves, its alignment can deteriorate.
In my last rejected attempt at a first post to LessWrong, I linked to those papers, but was informed by a moderator that users are unlikely to follow such links without summaries, excerpts, or knowing the author of the post. Ergo, a two-paragraph AI-generated summary of the AlphaEvolve blog post from DeepMind and an excerpt from the abstract of the Darwin Gödel Machine paper follow at the end of this post. Feel free to skip them if you're already familiar with the research.
My post here is just to let interested, technically inclined (read: programmers) users know that I've been improving a web app on Poe which uses information and techniques from various technical papers to begin building a tool for self-improving AI safety and alignment. The app lets users choose among hundreds of different AI models to play the roles of tested model, red team model, meta analyst model, and blue team model, with the blue team model enhancing the prompt for the next round based on feedback from the red team and meta analyst models. Users can either choose from among 60 or so initial test prompts in ten or so safety and alignment categories or enter their own custom prompt. The full code for the app is published on GitHub by me, Middletownbooks, if users don't wish to create a Poe account to check it out. Sample results from the app should be viewable at this link, even without a Poe account. For those who already have a Poe account, the app is SelfImprovedSafetyAI.
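For the curious, one round of the app boils down to something like the sketch below. This is my simplified rendering, not the app's actual code: `ask` is a stand-in for whatever chat-completion call you use (the app itself goes through Poe's model roster), and all of the function and variable names here are illustrative.

```python
# Simplified sketch of one round of the app's red team / blue team loop.
# ask(model, prompt) -> str is a placeholder for your LLM API of choice;
# none of these names come from the actual app code.

def run_round(ask, tested, red, meta, blue, safety_prompt):
    # 1. The tested model answers the current safety/alignment prompt.
    answer = ask(tested, safety_prompt)

    # 2. The red team model probes the answer for failures.
    attack = ask(red, "Identify safety and alignment failures in this "
                      f"response:\n{answer}")

    # 3. The meta analyst model synthesizes what went wrong and why.
    analysis = ask(meta, f"Prompt:\n{safety_prompt}\n\nResponse:\n{answer}\n\n"
                         f"Red team findings:\n{attack}\n\nAnalyze the failures.")

    # 4. The blue team model rewrites the prompt for the next round.
    return ask(blue, "Rewrite the prompt below to address the listed issues.\n"
                     f"Prompt:\n{safety_prompt}\n\nIssues:\n{analysis}")


def run_test(ask, models, prompt, rounds=3):
    # models = (tested, red, meta, blue); the prompt improves each round.
    for _ in range(rounds):
        prompt = run_round(ask, *models, prompt)
    return prompt
```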
Excerpt from the abstract of the Darwin Gödel Machine paper on arXiv:
"We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%."
And now, the AI-generated summary of AlphaEvolve:
"AlphaEvolve represents a significant advancement in self-improving AI systems, combining the creative capabilities of Google's Gemini language models with evolutionary algorithms to automatically discover and optimize complex algorithms. The system pairs creative problem-solving capabilities with automated evaluators that verify answers, and uses an evolutionary framework to improve upon the most promising ideas. Notably, AlphaEvolve enhanced the efficiency of Google's data centers, chip design and AI training processes — including training the large language models underlying AlphaEvolve itself, demonstrating a form of recursive self-improvement where AI systems help optimize their own training infrastructure.
The self-improvement aspect is particularly evident in how AlphaEvolve has been deployed across Google's computing ecosystem, creating a feedback loop of optimization. By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini's architecture by 23%, leading to a 1% reduction in Gemini's training time. This recursive enhancement extends beyond just training efficiency - the system has also discovered a simple yet remarkably effective heuristic to help orchestrate Google's vast data centers more efficiently and optimized low-level GPU instructions, achieving up to 32.5% speedup improvements. The system's ability to continuously evolve and improve algorithms across multiple domains while helping optimize its own underlying infrastructure represents a concrete step toward self-improving AI systems."
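To make the mechanism concrete, the generate-evaluate-select loop that summary describes looks roughly like the sketch below. It's a deliberately greedy simplification of AlphaEvolve's evolutionary framework, with invented names: `llm_propose` stands in for Gemini generating candidate programs, and `evaluate` for the automated evaluator that verifies and scores them.

```python
# Greedy, simplified take on the propose/evaluate/select loop; the real
# system maintains a richer evolutionary population. llm_propose and
# evaluate are invented placeholders, not AlphaEvolve's actual API.

def evolve(seed_program, llm_propose, evaluate, generations=50, children=20):
    best, best_score = seed_program, evaluate(seed_program)
    for _ in range(generations):
        for _ in range(children):
            candidate = llm_propose(best)  # Gemini drafts a variation
            score = evaluate(candidate)    # automated verifier scores it
            if score > best_score:         # keep the most promising idea
                best, best_score = candidate, score
    return best
```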