Breaking Circuit Breakers
A few days ago, Gray Swan published code and models for their recent “circuit breakers” method for language models.[1] The circuit breakers method defends against jailbreaks by training the model to erase “bad” internal representations. We are very excited about data-efficient defensive methods like this, especially those which use interpretability...