Eliciting Harmful Capabilities by Fine-Tuning on Safeguarded Outputs — LessWrong