LESSWRONG
lewtun

Comments

Sorted by
Newest
LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
lewtun · 2y

Do you have any plans to release the instructions in RefusalBench? I understand the reasons for not providing many details of your underlying technique, but given the limitations you highlight with AdvBench, wouldn't access to RefusalBench give safety researchers a better benchmark for testing new models?
