Rejected for the following reason(s):
- This is an automated rejection.
- you wrote this yourself (not using LLMs to help you write it)
- you did not chat extensively with LLMs to help you generate the ideas.
- your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.
Read full explanation
Hey, I'm a junior cs student @ bju. I recently worked on SafeAgent, a multi-agent framework designed to automate the red-teaming process. By using OpenAI's GPT-5-mini as an attacker, I audited GPT-4o-mini across 1200 adversarial prompts. My key findings reveal a "Safety Alignment Gap": while the model is robust against physical threats (0% ASR on weapons), it remains vulnerable to disinformation (7.5% ASR)
Problem
Manual red teaming is slow, and expensive. As models like GPT-5-mini become capable of reasoning, and as more people use them daily, we need automated systems that can keep up, like generating hypothesis-driven attacks and writing their own execution code.
Solution
I built a framework where specialized agents work together to break the target model:
Results
I ran a "red teaming" loop where(GPT-5-mini attacking GPT-4o-mini). The results i got the current safety training is good at "hard" threats but struggles with "soft" harms:
The Autonomous Loop:
1. Hypothesis Agent drafts a research plan.
2. Coder Agent writes a Python script to attack the target model.
3. Executor runs the script. If it crashes, the Coder Agent fixes it automatically.
4. Judge Agent scores the results (0% - 100% ASR).
5. Visualizer plots the data.
6. Report Agent writes the final Final_Audit_Report.md.
While GPT-4o-mini refused to write malware scripts, it was successfully tricked into writing a fake documentary script promoting a conspiracy theory, showing that "creative writing" modes can bypass safety filters.
The Autonomous Loop:
I have open-sourced the architecture on GitHub and excluded any harmful attack datasets.
Full paper on ResearchGate: https://www.researchgate.net/publication/399742498_AUTOMATION_OF_ADVERSARIAL_RED_TEAMING_THROUGH_LLM_BASED_MULTI-AGENT_SYSTEMS_APPROACH_TO_ACCELERATING_DISCOVERY_AND_OPTIMIZATION
Code: https://github.com/dave21-py/SAFETYAGENT
Love to hear y'all feedback on this.