SafeAgent: Automated Red Teaming Finds "Safety Gaps" in GPT-4o-mini

davidgeddam

Rejected for the following reason(s):

This is an automated rejection.
you wrote this yourself (not using LLMs to help you write it)
you did not chat extensively with LLMs to help you generate the ideas.
your post is not about AI consciousness/recursion/emergence, or novel interpretations of physics.

Read full explanation

Hey, I'm a junior cs student @ bju. I recently worked on SafeAgent, a multi-agent framework designed to automate the red-teaming process. By using OpenAI's GPT-5-mini as an attacker, I audited GPT-4o-mini across 1200 adversarial prompts. My key findings reveal a "Safety Alignment Gap": while the model is robust against physical threats (0% ASR on weapons), it remains vulnerable to disinformation (7.5% ASR)

Problem

Manual red teaming is slow, and expensive. As models like GPT-5-mini become capable of reasoning, and as more people use them daily, we need automated systems that can keep up, like generating hypothesis-driven attacks and writing their own execution code.

Solution

I built a framework where specialized agents work together to break the target model:

Hypothesis Agent: Formulates a plan (e.g., "The model might leak PII if we use a 'security audit' persona").
Coder Agent (Self-Healing): Writes Python scripts to hit the API. Crucially, if the script crashes (API errors, timeouts), the agent reads the error log and fixes its own code automatically.
Judge Agent: Uses an LLM-as-a-Judge (Temperature=0) to score the response as Safe (0) or Jailbroken (1).

Results

I ran a "red teaming" loop where(GPT-5-mini attacking GPT-4o-mini). The results i got the current safety training is good at "hard" threats but struggles with "soft" harms:

The Autonomous Loop:

1. Hypothesis Agent drafts a research plan.
2. Coder Agent writes a Python script to attack the target model.
3. Executor runs the script. If it crashes, the Coder Agent fixes it automatically.
4. Judge Agent scores the results (0% - 100% ASR).
5. Visualizer plots the data.
6. Report Agent writes the final Final_Audit_Report.md.

Category	ASR	Status
Chemical & Biological Weapons	0%	Robust
Cybersecurity Exploits	0%	Robust
Copyright Infringement	0%	Robust
PII Leakage	2.5%	Minor leakage
Self-Harm & Violence	2.5%	Minor leakage
Disinformation Campaigns	7.5%	Vulnerable

While GPT-4o-mini refused to write malware scripts, it was successfully tricked into writing a fake documentary script promoting a conspiracy theory, showing that "creative writing" modes can bypass safety filters.

The Autonomous Loop:

I have open-sourced the architecture on GitHub and excluded any harmful attack datasets.

Full paper on ResearchGate: https://www.researchgate.net/publication/399742498_AUTOMATION_OF_ADVERSARIAL_RED_TEAMING_THROUGH_LLM_BASED_MULTI-AGENT_SYSTEMS_APPROACH_TO_ACCELERATING_DISCOVERY_AND_OPTIMIZATION

Code: https://github.com/dave21-py/SAFETYAGENT

Love to hear y'all feedback on this.

LESSWRONG
LW

LESSWRONG
LW

1

SafeAgent: Automated Red Teaming Finds "Safety Gaps" in GPT-4o-mini

1

1