Perusha Moodley

Stress-Testing Alignment Audits With Prompt-Level Strategic Deception

by Oliver Daniels, Perusha Moodley, and David Lindner

Code, paper, and Twitter thread copied below. Introduction: Are alignment auditing methods robust to deceptive adversaries? In our new paper, we find that black-box and white-box auditing methods can be fooled by strategic deception prompts. The core problem: we want to audit models for hidden goals before deployment. But future misaligned AIs...

Feb 10•17

Vulnerability in Trusted Monitoring and Mitigations

by Wen Xing and Perusha Moodley

As large language models (LLMs) grow increasingly capable and autonomous, ensuring their safety has become more critical. While significant progress has been made in alignment efforts, researchers have begun to focus on designing monitoring systems around LLMs to ensure safe outcomes, for example, using trusted monitoring systems to oversee potentially...

Jun 7, 2025•17

Feature-Based Analysis of Safety-Relevant Multi-Agent Behavior

by Maria Kapros, Ana Kapros, and Perusha Moodley

Introduction. TL;DR: Today’s AI systems are becoming increasingly agentic and interconnected, giving rise to a future of multi-agent (MA) systems (MAS). It is believed that this will introduce unique risks and thus require novel safety approaches. Current research evaluating and steering MAS is focused on behavior alone, i.e., inputs and...

Apr 21, 2025•10