Ana Kapros

Feature-Based Analysis of Safety-Relevant Multi-Agent Behavior

TL;DR: Today’s AI systems are becoming increasingly agentic and interconnected, giving rise to a future of multi-agent systems (MAS). It is believed that this will introduce unique risks and thus require novel safety approaches. Current research evaluating and steering MAS is focused on behavior alone, i.e., inputs and...

Apr 21, 2025 · 10

Comparing the effectiveness of top-down and bottom-up activation steering for bypassing refusal on harmful prompts

TL;DR: This project compares the effectiveness of top-down and bottom-up activation steering methods in controlling refusal behaviour. In line with prior work,[1] we find that top-down methods outperform bottom-up ones in behaviour steering, as measured using HarmBench. Yet, a hybrid approach is even more effective (providing a 36% relative improvement...

Feb 12, 2025 · 7

LESSWRONG
