Avoiding jailbreaks by discouraging their representation in activation space
This project was completed as part of the AI Safety Fundamentals: Alignment Course by BlueDot Impact. All of the code, data, and results are available in this repository.

Abstract

The goal of this project is to answer two questions: “Can jailbreaks be represented as a linear direction in activation space?” and...
Sep 27, 2024
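For context on the first question: a "linear direction in activation space" is commonly estimated with a difference-of-means probe, i.e. the mean activation on prompts exhibiting a concept minus the mean activation on prompts that don't. The sketch below illustrates the idea with random stand-in activations; the array names, sizes, and values are hypothetical, not from this project's code.

```python
import numpy as np

# Hypothetical stand-ins for per-prompt residual-stream activations at one
# layer: 50 prompts each, 16-dimensional activations (real values would come
# from running a language model).
rng = np.random.default_rng(0)
acts_jailbreak = rng.normal(loc=1.0, size=(50, 16))  # activations on jailbreak prompts
acts_benign = rng.normal(loc=0.0, size=(50, 16))     # activations on benign prompts

# Difference-of-means: a standard way to estimate a linear "concept
# direction" separating the two sets of activations.
direction = acts_jailbreak.mean(axis=0) - acts_benign.mean(axis=0)
direction = direction / np.linalg.norm(direction)  # unit-normalize

# Projecting a new activation onto the direction gives a scalar score;
# higher scores indicate the activation points more along the direction.
score_jb = acts_jailbreak @ direction
score_ok = acts_benign @ direction
print(score_jb.mean() > score_ok.mean())
```

If the concept really is linearly represented, projections of the two prompt sets separate cleanly along this one direction, which is what makes intervention (e.g. subtracting the direction from activations) possible.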