LESSWRONG
Beepboop
Comments

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Beepboop, 2y

A good safety measure for these models might be to train them on false information about building bombs and similar topics, so that their answers in those areas are full of hallucinations. There aren't that many areas of dangerous knowledge, so this could probably be done cheaply and without significantly affecting the models' general capabilities.
