Epistemic status: Naive and exploratory; this reflects my preliminary conceptual understanding, pending a technical deep dive. 99% of the ideas are not my own; rather, they are distilled from the resources hyperlinked throughout.
Many alignment researchers err towards local optimization, i.e., they seek low-hanging fruit and leverage incremental improvements. Fast, imperfect iteration is necessary, but it should sometimes be integrated within a principled and holistic strategy.
I fear that many people interested in W2SG will default to testing ideas on the basis of informed heuristics, without robust inside views to justify them. While in aggregate we might still converge on effective solutions, my general thesis is that W2SG needs people who adopt an AI scout approach.
This introductory post and...