LESSWRONG

danielbalsam

Comments

Refusal in LLMs is mediated by a single direction
danielbalsam · 1y

Great post, thanks for sharing. I am trying to replicate this work and was able to do so for several models, but I am having a lot of trouble reproducing it for the Llama 3 models. I sometimes succeed on some narrow prompts but not on others. Do you have any suggestions, or is there anything non-obvious about that model family?

DSLT 0. Distilling Singular Learning Theory
danielbalsam · 1y

Hi! I am in the process of reading this sequence and would love some supplemental lecture materials (particularly at the intersection with alignment research). I was very excited by the prospect of the lectures from the June summit, but the YouTube channel appears to return a 404 now. Is there somewhere else I can listen to these lectures?
