danielbalsam — LessWrong

Refusal in LLMs is mediated by a single direction

danielbalsam · 2y

Great post -- thanks for sharing. I am trying to replicate this work and was able to do so for several models, but I am having a lot of trouble reproducing it for the Llama 3 models. I can sometimes succeed on certain narrow prompts but not on others. Do you have any suggestions, or is there anything non-obvious about that model family?

DSLT 0. Distilling Singular Learning Theory

danielbalsam · 2y

Hi! I am in the process of reading this sequence and would love some supplemental lecture materials (particularly at the intersection with alignment research). I was very excited by the prospect of the lectures from the June summit, but the YouTube channel appears to 404 now. Is there somewhere else I can listen to these lectures?