Nice C. Ineza — LessWrong

20d

Spending days on Geometric Safety: Mapping the Activation Space of Refusal

While learning more about interpretability and robustness, and trying to understand and test various assumptions, I thought it would be helpful to post as I document my attempt to locate "unsafe behaviors" in transformer activation space. I will not be analyzing outputs; instead, I will treat representations as geometric objects that can...

Nice C. Ineza's Shortform