Avoiding jailbreaks by discouraging their representation in activation space
This project was completed as part of the AI Safety Fundamentals: Alignment Course by BlueDot Impact. All of the code, data, and results are available in this repository.

Abstract

The goal of this project is to answer two questions: “Can jailbreaks be represented as a linear direction in activation space?” and...
Sep 27, 2024
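For context on the first question: a "linear direction in activation space" is commonly estimated with a difference-of-means probe, i.e. the mean activation on prompts exhibiting a concept minus the mean activation on prompts that don't. The sketch below illustrates the idea with random stand-in activations; the array names, sizes, and values are hypothetical, not from this project's code.

```python
import numpy as np

# Hypothetical stand-ins for per-prompt residual-stream activations at one
# layer: 50 prompts each, 16-dimensional activations (real values would come
# from running a language model).
rng = np.random.default_rng(0)
acts_jailbreak = rng.normal(loc=1.0, size=(50, 16))  # activations on jailbreak prompts
acts_benign = rng.normal(loc=0.0, size=(50, 16))     # activations on benign prompts

# Difference-of-means: a standard way to estimate a linear "concept
# direction" separating the two sets of activations.
direction = acts_jailbreak.mean(axis=0) - acts_benign.mean(axis=0)
direction = direction / np.linalg.norm(direction)  # unit-normalize

# Projecting a new activation onto the direction gives a scalar score;
# higher scores indicate the activation points more along the direction.
score_jb = acts_jailbreak @ direction
score_ok = acts_benign @ direction
print(score_jb.mean() > score_ok.mean())
```

If the concept really is linearly represented, projections of the two prompt sets separate cleanly along this one direction, which is what makes intervention (e.g. subtracting the direction from activations) possible.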