Spending days on Geometric Safety: Mapping the Activation Space of Refusal
As I learn more about interpretability and robustness, and try to understand and test the assumptions behind them, I thought it would be helpful to document my attempt to locate "unsafe behaviors" in transformer activation space. Rather than analyzing outputs, I will treat representations as geometric objects that can...
Mar 281