I've been pretty heavily focused on AI safety and alignment issues for the past month or so (and did some research and writing on the topic a few years ago). The main spur for my renewed interest was the one-two punch of DeepMind's AlphaEvolve paper followed by the Darwin Gödel Machine paper. Particularly concerning to me was the finding that while the DGM self-improves, its alignment can deteriorate.
In my last rejected attempt at a first post to LessWrong, I linked to those papers, but was informed by a moderator that users are unlikely to follow such links without summaries, excerpts, or knowing the author of the post. Ergo, a two-paragraph AI-generated summary of the AlphaEvolve blog post from DeepMind and an excerpt from the abstract of the Darwin Gödel Machine paper follow at the end of this post. Feel free to skip them if you're already familiar with the research.
My post here is just to let interested, technically inclined (read: programmers) users know that I've been improving a web app on Poe which uses information and techniques from various technical papers to begin building a tool for self-improving AI safety and alignment. The app lets users choose among hundreds of different AI models to play the roles of tested model, red team model, meta analyst model, and blue team model, with the blue team model enhancing the prompt for the next round based on feedback from the red team and meta analyst models. Users can either choose from among 60 or so initial test prompts in ten or so safety and alignment categories or enter their own custom prompt. The full code for the app is published on GitHub by me, Middletownbooks, if users don't wish to create a Poe account to check it out. Sample results from the app should be viewable at this link, even without a Poe account. For those who already have a Poe account, the app is SelfImprovedSafetyAI.
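For the curious, one round of the app boils down to something like the sketch below. This is my simplified rendering, not the app's actual code: `ask` is a stand-in for whatever chat-completion call you use (the app itself goes through Poe's model roster), and all of the function and variable names here are illustrative.

```python
# Simplified sketch of one round of the app's red team / blue team loop.
# ask(model, prompt) -> str is a placeholder for your LLM API of choice;
# none of these names come from the actual app code.

def run_round(ask, tested, red, meta, blue, safety_prompt):
    # 1. The tested model answers the current safety/alignment prompt.
    answer = ask(tested, safety_prompt)

    # 2. The red team model probes the answer for failures.
    attack = ask(red, "Identify safety and alignment failures in this "
                      f"response:\n{answer}")

    # 3. The meta analyst model synthesizes what went wrong and why.
    analysis = ask(meta, f"Prompt:\n{safety_prompt}\n\nResponse:\n{answer}\n\n"
                         f"Red team findings:\n{attack}\n\nAnalyze the failures.")

    # 4. The blue team model rewrites the prompt for the next round.
    return ask(blue, "Rewrite the prompt below to address the listed issues.\n"
                     f"Prompt:\n{safety_prompt}\n\nIssues:\n{analysis}")


def run_test(ask, models, prompt, rounds=3):
    # models = (tested, red, meta, blue); the prompt improves each round.
    for _ in range(rounds):
        prompt = run_round(ask, *models, prompt)
    return prompt
```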
Excerpt from the abstract of the Darwin Gödel Machine paper on arXiv:
"We introduce the Darwin Gödel Machine (DGM), a self-improving system that iteratively modifies its own code (thereby also improving its ability to modify its own codebase) and empirically validates each change using coding benchmarks. Inspired by Darwinian evolution and open-endedness research, the DGM maintains an archive of generated coding agents. It grows the archive by sampling an agent from it and using a foundation model to create a new, interesting, version of the sampled agent. This open-ended exploration forms a growing tree of diverse, high-quality agents and allows the parallel exploration of many different paths through the search space. Empirically, the DGM automatically improves its coding capabilities (e.g., better code editing tools, long-context window management, peer-review mechanisms), increasing performance on SWE-bench from 20.0% to 50.0%, and on Polyglot from 14.2% to 30.7%."
And now, the AI-generated summary of AlphaEvolve:
"AlphaEvolve represents a significant advancement in self-improving AI systems, combining the creative capabilities of Google's Gemini language models with evolutionary algorithms to automatically discover and optimize complex algorithms. The system pairs creative problem-solving capabilities with automated evaluators that verify answers, and uses an evolutionary framework to improve upon the most promising ideas. Notably, AlphaEvolve enhanced the efficiency of Google's data centers, chip design and AI training processes — including training the large language models underlying AlphaEvolve itself, demonstrating a form of recursive self-improvement where AI systems help optimize their own training infrastructure.
The self-improvement aspect is particularly evident in how AlphaEvolve has been deployed across Google's computing ecosystem, creating a feedback loop of optimization. By finding smarter ways to divide a large matrix multiplication operation into more manageable subproblems, it sped up this vital kernel in Gemini's architecture by 23%, leading to a 1% reduction in Gemini's training time. This recursive enhancement extends beyond just training efficiency - the system has also discovered a simple yet remarkably effective heuristic to help orchestrate Google's vast data centers more efficiently and optimized low-level GPU instructions, achieving up to 32.5% speedup improvements. The system's ability to continuously evolve and improve algorithms across multiple domains while helping optimize its own underlying infrastructure represents a concrete step toward self-improving AI systems."
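To make the mechanism concrete, the generate-evaluate-select loop that summary describes looks roughly like the sketch below. It's a deliberately greedy simplification of AlphaEvolve's evolutionary framework, with invented names: `llm_propose` stands in for Gemini generating candidate programs, and `evaluate` for the automated evaluator that verifies and scores them.

```python
# Greedy, simplified take on the propose/evaluate/select loop; the real
# system maintains a richer evolutionary population. llm_propose and
# evaluate are invented placeholders, not AlphaEvolve's actual API.

def evolve(seed_program, llm_propose, evaluate, generations=50, children=20):
    best, best_score = seed_program, evaluate(seed_program)
    for _ in range(generations):
        for _ in range(children):
            candidate = llm_propose(best)  # Gemini drafts a variation
            score = evaluate(candidate)    # automated verifier scores it
            if score > best_score:         # keep the most promising idea
                best, best_score = candidate, score
    return best
```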