This is a special post for quick takes by Nice C. Ineza. Only they can create top-level comments. Comments here also appear on the Quick Takes page and All Posts page.
I am increasingly inclined to look into treating model representations as directions in activation space rather than as individual neurons; that may be where I can uncover more in mech interp.
Wondering if there are "feature directions" corresponding to when a model goes off the rails, generates unsafe code, or exhibits jailbreak-like behavior.
Geometry could be our solution, just a thought!
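One cheap way to probe this idea is a difference-in-means direction: average activations on "unsafe" vs "safe" inputs and take the difference as a candidate feature direction. Here is a minimal sketch, assuming toy Gaussian vectors in place of real residual-stream activations (the `safe_acts`/`unsafe_acts` arrays and the `score` helper are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for hidden activations: in practice these would come from a
# transformer layer's residual stream on "safe" vs "unsafe" prompts.
d_model = 64
safe_acts = rng.normal(0.0, 1.0, size=(100, d_model))
unsafe_acts = rng.normal(0.0, 1.0, size=(100, d_model)) + 0.5  # shifted mean

# Difference-in-means "feature direction": points from safe toward unsafe.
direction = unsafe_acts.mean(axis=0) - safe_acts.mean(axis=0)
direction /= np.linalg.norm(direction)

def score(acts: np.ndarray) -> np.ndarray:
    """Project activations onto the candidate direction."""
    return acts @ direction

# If the direction captures anything real, unsafe activations should
# project higher on it than safe ones, on average.
print(score(unsafe_acts).mean() > score(safe_acts).mean())  # True
```

On real models the same projection trick can be used to flag inputs whose activations drift along a suspect direction, though whether such linear directions exist for jailbreak-like behavior is exactly the open question.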