[CS 2881r] Can We Prompt Our Way to Safety? Comparing System Prompt Styles and Post-Training Effects on Safety Benchmarks
This work was done as an experiment for Week 3 (Model Specifications and Compliance) of Boaz Barak's "CS 2881r: AI Alignment and Safety" course at Harvard. A link to the lecture can be found here. TL;DR: We evaluate the effects of different system prompt styles (minimal vs. principles vs. rules) and post-training on safety benchmark performance.